Keyword Search Modules

DAS Keyword Search modules

Module author: Vidmantas Zemleris <vidmantas.zemleris@gmail.com>

Overview

[Figure: keyword search architecture (kws_architecture.png)]

The basic keyword search workflow:

  • The query is tokenized (including shallow parsing of key=value patterns and quoted phrases)
  • A number of entity matchers then generate the entry points: matches of each keyword (or a few nearby keywords) to schema or value terms that could form part of a structured query
  • The various combinations of entry points are explored, and the resulting candidate query suggestions are evaluated and ranked
  • The top results are presented to the user (each as a valid DAS query together with a readable query description); a minimal sketch of the whole pipeline follows below
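A minimal sketch of the pipeline in Python, where match_entities, rank_combinations and present are hypothetical stand-ins for the entity matcher, ranker and presentation modules documented below:

# Hedged sketch only: the helpers passed in are hypothetical; the real
# implementations live in the modules documented on this page.
from DAS.keywordsearch.tokenizer import cleanup_query, tokenize

def keyword_search_sketch(query, schema_adapter, match_entities,
                          rank_combinations, present, limit=10):
    tokens = tokenize(cleanup_query(query))                 # tokenizing/parsing
    entry_points = match_entities(tokens, schema_adapter)   # entity matchers
    ranked = rank_combinations(entry_points)                # ranker
    return [present(result) for result in ranked[:limit]]   # presentation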

Adapters to access Metadata

These modules contain metadata-related functions for:

  • accessing the integration schema: fields, values, constraints on inputs/queries
  • tracking the available fields
  • tracking known (input-field) values

The schema adapter

Provides a layer of abstraction between the Keyword Search and the Data Integration System.

DAS.keywordsearch.metadata.schema_adapter2.ApiDef

alias of ApiInputParamsEntry

class DAS.keywordsearch.metadata.schema_adapter2.DasSchemaAdapter

provides an adapter between keyword search and the Data Integration System.

classmethod are_wildcards_allowed(entity, wildcards, params)

Whether wildcards are allowed for the given inputs

currently only these simple query patterns are allowed:

  • site=W*
  • dataset=W*
  • dataset site=T1_CH_*
  • dataset site=T1_CH_* dataset=/A/B/C
  • file dataset=/DoubleMu/Run2012A-Zmmg-13Jul2012-v1/RAW-RECO site=T1_*
  • file block=/A/B/C#D
  • file file=W* dataset=FULL
  • file file=W* block=FULL

These are supported (probably) because of the DAS wrappers:

  • file dataset=*DoubleMuParked25ns*
  • file dataset=*DoubleMuParked25ns* site=T2_RU_JINR
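For illustration, a call might look as follows; the argument semantics (lookup entity, the set of wildcarded input fields, the full params dict) are assumptions based on the signature, not documented behavior:

# Hypothetical call; verify the argument semantics against the source.
allowed = DasSchemaAdapter.are_wildcards_allowed(
    entity='dataset',
    wildcards={'site'},
    params={'site': 'T1_CH_*'},
)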

check_result_field_match(fieldname)

checks for a complete match to a result field

entities_for_inputs(params)

lists entities that could be retrieved with given input params

get_api_param_definitions()

returns a list of API input requirements

get_result_field_title(result_entity, field, technical=False, html=True)

returns name (and optionally title) of output field

init(dascore=None)

initialization or re-initialization

list_result_fields(entity=None, inputs=None)

lists attributes available in all service outputs (aggregated)

validate_input_params(params, entity=None, final_step=False, wildcards=None)

checks if the DIS can answer a query with the given params.

validate_input_params_lookupbased(params, entity=None, final_step=False, wildcards=None)

checks if the DIS can answer a query with the given params.
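Taken together, a keyword-search component might interact with the adapter roughly as follows (a hedged sketch: the instantiation details and return structures are assumptions, so verify them against the source):

from DAS.keywordsearch.metadata.schema_adapter2 import DasSchemaAdapter

adapter = DasSchemaAdapter()
adapter.init(dascore=None)  # in production, init() receives a DAS core instance

# which entities can be looked up given these input params?
entities = adapter.entities_for_inputs({'dataset': '/A/B/C'})

# which output fields could a 'dataset' result contain?
fields = adapter.list_result_fields(entity='dataset')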

Gathers the list of fields available in service outputs

DAS.keywordsearch.metadata.das_output_fields_adapter.flatten(list_of_lists)

Flatten one level of nesting
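Its behavior is the classic one-level flatten, conceptually equivalent to the itertools recipe (whether the DAS helper returns a list or a lazy iterator is an implementation detail):

from itertools import chain

def flatten(list_of_lists):
    """Flatten one level of nesting."""
    return chain.from_iterable(list_of_lists)

assert list(flatten([[1, 2], [3], [4, 5]])) == [1, 2, 3, 4, 5]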

DAS.keywordsearch.metadata.das_output_fields_adapter.get_outputs_field_list(dascore)

makes a list of output fields available in each DAS entity; this is taken from the keylearning collection.

DAS.keywordsearch.metadata.das_output_fields_adapter.get_titles_by_field(dascore)

returns a dict of titles taken from presentation cache

DAS.keywordsearch.metadata.das_output_fields_adapter.is_reserved_field(field, result_type)

returns whether the field is reserved, e.g. *.error, *.reason, qhash
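The examples above suggest simple glob-style patterns; a minimal sketch of such a check (the actual reserved-field list lives in the module, and the real function also considers result_type):

import fnmatch

# illustrative patterns taken from the docstring above; the real list may differ
RESERVED_PATTERNS = ['*.error', '*.reason', 'qhash']

def is_reserved_sketch(field):
    return any(fnmatch.fnmatch(field, pat) for pat in RESERVED_PATTERNS)

assert is_reserved_sketch('dataset.error')
assert not is_reserved_sketch('dataset.name')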

DAS.keywordsearch.metadata.das_output_fields_adapter.print_debug(dascore, fields_by_entity, results_by_entity)

verbose output for get_outputs_field_list

DAS.keywordsearch.metadata.das_output_fields_adapter.result_contained_errors(rec)

decides whether a keylearning record contains errors (i.e. whether the responses from the services contained errors) and whether the record shall be excluded

DAS Query Language definitions

defines DASQL and keyword search features, e.g. what shall be considered as:

  • a word
  • simple operators
  • aggregation operators (not implemented)
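For orientation, a sketch of the kind of definitions this module centralizes (illustrative only; the module's actual regexes and operator lists may differ):

import re

RE_WORD = re.compile(r'[a-zA-Z0-9_*/.@#\-]+')        # what may count as a "word"
OPERATORS = ['<=', '>=', '=', '<', '>']              # simple comparison operators
AGGREGATORS = ['min', 'max', 'sum', 'count', 'avg']  # DAS aggregators; not implemented in KWS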

DAS.keywordsearch.metadata.das_ql.flatten(list_of_lists)

Flatten one level of nesting

DAS.keywordsearch.metadata.das_ql.get_operator_synonyms()

return synonyms for das aggregation operators (not used yet)

Tokenizing and parsing the query

Module description:
  • first cleans up the input keyword query (removes extra spaces, standardizes the notation)

  • then tokenizes the query into:
    • individual query terms
    • compound query terms in quotes (e.g. “number of events”)
    • phrases of the form “term operator value” (e.g. nevents > 1, “number of events”=100)
DAS.keywordsearch.tokenizer.cleanup_query(query)

Returns the cleaned query, obtained by applying a number of transformation patterns that remove extra spaces and simplify the conditions

>>> cleanup_query('number of events = 33')
'number of events=33'
>>> cleanup_query('number of events >    33')
'number of events>33'
>>> cleanup_query('more than 33 events')
'>33 events'
>>> cleanup_query('X more than 33 events')
'X>33 events'
>>> cleanup_query('find datasets where X more than 33 events')
'datasets where X>33 events'
>>> cleanup_query('=2012-02-01')
'= 20120201'
>>> cleanup_query('>= 2012-02-01')
'>= 20120201'
DAS.keywordsearch.tokenizer.get_keyword_without_operator(keyword)

splits the keyword on its operator and returns only the keyword part

>>> get_keyword_without_operator('number of events >= 10')
'number of events'
>>> get_keyword_without_operator('dataset')
'dataset'
>>> get_keyword_without_operator('dataset=Zmm')
'dataset'
DAS.keywordsearch.tokenizer.get_operator_and_param(keyword)

splits the keyword on its operator and returns the operator and its parameter (or None when the keyword contains no operator)

>>> get_operator_and_param('number of events >= 10')
{'type': 'filter', 'param': '10', 'op': '>='}
>>> get_operator_and_param('dataset')
>>> get_operator_and_param('dataset=Zmm')
{'type': 'filter', 'param': 'Zmm', 'op': '='}
DAS.keywordsearch.tokenizer.test_operator_containment(keyword)

returns whether a keyword token contains an operator (this is useful when processing a list of tokens, as only the last token may have an operator)

>>> test_operator_containment('number of events >= 10')
True
>>> test_operator_containment('number')
False
DAS.keywordsearch.tokenizer.tokenize(query)

tokenizes the query, keeping quoted phrases together; it also tries to group “term operator value” sequences together, such as

"number of events">10 or dataset=/Zmm/*/raw-reco

so that they can be used for further processing.

special characters currently allowed in data values include: _*/-

For example:

>>> tokenize('file dataset=/Zmm*/*/raw-reco lumi=20853 nevents>10'
...          '"number of events">10 /Zmm*/*/raw-reco')
['file', 'dataset=/Zmm*/*/raw-reco', 'lumi=20853', 'nevents>10', 'number of events>10', '/Zmm*/*/raw-reco']

>>> tokenize('file dataset=/Zmm*/*/raw-reco lumi=20853 dataset.nevents>10'
...          '"number of events">10 /Zmm*/*/raw-reco')
['file', 'dataset=/Zmm*/*/raw-reco', 'lumi=20853', 'dataset.nevents>10', 'number of events>10', '/Zmm*/*/raw-reco']

>>> tokenize("file dataset=/Zmm*/*/raw-reco lumi=20853 dataset.nevents>10"
...          "'number of events'>10 /Zmm*/*/raw-reco")
['file', 'dataset=/Zmm*/*/raw-reco', 'lumi=20853', 'dataset.nevents>10', 'number of events>10', '/Zmm*/*/raw-reco']


>>> tokenize('user=vidmasze@cern.ch')
['user=vidmasze@cern.ch']

Entity matchers

These modules contain entity-matching functions:

  • Name matching / a custom string distance
  • Chunk matching (matching multi-word terms to names of service output fields)
  • Value matching
  • CMS-specific dataset matching

Value matching

this module provides a custom Levenshtein distance function

DAS.keywordsearch.entity_matchers.string_dist_levenstein.levenshtein(string1, string2, subcost=3, modification_middle_cost=2)

string-edit distance returning the minimum cost of the edits needed

DAS.keywordsearch.entity_matchers.string_dist_levenstein.levenshtein_normalized(string1, string2, subcost=2, maxcost=3)

returns a Levenshtein distance normalized to [0, 1]
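A plain dynamic-programming edit distance with a configurable substitution cost conveys the idea; the module's actual cost model (with a separate cost for modifications in the middle of a word) is more refined:

def levenshtein_sketch(s1, s2, subcost=3, inscost=1, delcost=1):
    """Sketch of edit distance with a configurable substitution cost."""
    prev = [j * inscost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        cur = [i * delcost]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + delcost,                            # delete c1
                           cur[j - 1] + inscost,                         # insert c2
                           prev[j - 1] + (0 if c1 == c2 else subcost)))  # substitute
        prev = cur
    return prev[-1]

assert levenshtein_sketch('dataset', 'dataset') == 0
assert levenshtein_sketch('datset', 'dataset') == 1  # one insertion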

Name matching: multi-term chunks representing field names

Modules for matching chunks of keywords to attributes of service outputs; this is done using information-retrieval techniques.
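As an illustration of the information-retrieval flavor of this matching, a bag-of-words cosine similarity between a keyword chunk and field names could look like this (a sketch only; the real modules use their own tokenization and weighting):

from collections import Counter
from math import sqrt

def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of words."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * \
           sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def match_chunk(chunk, field_names):
    """Rank service-output field names against a multi-word chunk."""
    chunk_terms = chunk.lower().split()
    scored = []
    for fname in field_names:
        field_terms = fname.lower().replace('.', ' ').replace('_', ' ').split()
        scored.append((cosine(chunk_terms, field_terms), fname))
    return sorted(scored, reverse=True)

# match_chunk('creation time', ['dataset.creation_time', 'dataset.name'])
# ranks 'dataset.creation_time' first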

Generating and Ranking the Query suggestion

A ranker implemented in Cython and built into a C extension

The ranker combines the scores of the individual keywords to make up the final score. It evaluates the possible combinations and provides a ranked list of results (i.e. query suggestions).

The source code is in DAS.keywordsearch.rankers.fast_recursive_ranker, which is compiled into DAS.extensions.fast_recursive_ranker (with the help of DAS/keywordsearch/rankers/build_cython.py).
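The combination scheme can be pictured as follows (a pure-Python sketch of exhaustive score combination; the compiled ranker explores the space recursively with pruning, which is why it is built as a C extension):

from itertools import product

def rank_suggestions_sketch(entry_points):
    """entry_points maps each keyword to (interpretation, score) pairs;
    a candidate query picks one interpretation per keyword and is scored
    by combining the keyword scores (multiplication is assumed here)."""
    keywords = list(entry_points)
    candidates = []
    for combo in product(*(entry_points[kw] for kw in keywords)):
        score = 1.0
        for _interp, kw_score in combo:
            score *= kw_score
        candidates.append((score, [interp for interp, _ in combo]))
    return sorted(candidates, reverse=True)

suggestions = rank_suggestions_sketch({
    'dataset': [('entity:dataset', 0.9), ('field:dataset.name', 0.4)],
    'Zmm':     [('value:dataset=Zmm*', 0.7)],
})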

Presenting the Results to the user

Presentation of query suggestions

The module contains functions for presenting the results as DASQL and for formatting/coloring them in HTML.

DAS.keywordsearch.presentation.result_presentation.dasql_to_nl(dasql_tuple)

returns a natural-language representation of a generated DAS query, so as to explain to users what it means
DAS.keywordsearch.presentation.result_presentation.fescape(value)

escapes a value to be included in HTML

DAS.keywordsearch.presentation.result_presentation.result_to_dasql(result, frmt='text', shorten_html=True, max_value_len=26)

returns the proposed query as DASQL in these formats:

  • text: standard DASQL

  • html: colorified DASQL with long values shortened (if shorten_html is specified)

DAS.keywordsearch.presentation.result_presentation.shorten_value(value, max_value_len)

provides a shorter version of a very long value for display in (HTML) results. Examples include long dataset or block names.
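A head-and-tail truncation is the natural approach for long dataset and block names; a sketch (the module's exact output format may differ):

def shorten_value_sketch(value, max_value_len=26):
    """Keep the head and tail of an over-long value, eliding the middle."""
    if len(value) <= max_value_len:
        return value
    keep = max(max_value_len - 3, 2)  # room for the '...'
    head = (keep + 1) // 2
    tail = keep // 2
    return value[:head] + '...' + value[len(value) - tail:]

# shorten_value_sketch('/DoubleMu/Run2012A-Zmmg-13Jul2012-v1/RAW-RECO')
# -> '/DoubleMu/Ru...v1/RAW-RECO'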