Search and filtering

1 Filtering
2 Boolean filtering
3 Arithmetic filtering
4 Search
- 4.1 Free text search
- 4.2 Autocomplete search

Search and filtering¶

Search operations in CVA are intended to support a range of use cases, from computational analysis to front end features. Computational analysis needs comprehensive filtering, including some normalisation of text fields, arithmetic operations (e.g. count_samples > 3) and boolean operations (e.g. this eq 1 or that eq 2, this eq 1 and that eq 2). To support a front end we also require text search with some natural language processing (e.g. metabolic matches metabolism and connector words are ignored), regex queries and ad hoc summaries to support autocomplete features.

The principal entities available for search are cases and report events. There are also secondary entities: variants, genes and phenotypes.

Search entities

%run initialise_pyark.py

POST https://bio-test-cva.gel.zone/cva/api/0/authentication?
Response time : 21 ms

pyark version 4.0.4

Filtering¶

Each main entity has a main GET endpoint that exposes a comprehensive set of filters. Some fields are normalised in the database, and every query on those fields is normalised in the same way to achieve case insensitive matching.

The normalised fields are:

Panel names
Disorders
Interpretation service

Filters are combined with an AND operator¶

# get the first page of cases
cases = next(cases_client.get_cases(
    program=Program.rare_disease, 
    assembly=Assembly.GRCh38, 
    specificDiseases='intellectual disability', 
    as_data_frame=True))

GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&specificDiseases=intellectual disability&include=__all
Response time : 230 ms

# same filters can be applied to count operations
cases_client.count(
    program=Program.rare_disease, 
    assembly=Assembly.GRCh37, 
    specificDiseases='intellectual disability')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh37&specificDiseases=intellectual disability&count=True
Response time : 55 ms

987
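The requests logged above show each keyword argument becoming one query parameter, with all parameters applied together. As a rough illustration of that mapping (a sketch, not pyark's actual implementation), the standard library can build an equivalent query string. The helper name build_query is hypothetical.

```python
from urllib.parse import quote, urlencode


def build_query(**filters):
    """Hypothetical helper: encode keyword filters as a URL query string.

    Each filter becomes one query parameter; the server combines all of
    them with AND. List values are repeated as separate parameters.
    """
    # quote (instead of the default quote_plus) encodes spaces as %20;
    # ':' is kept literal so identifiers such as HP:0003701 stay readable
    return urlencode(filters, doseq=True, quote_via=quote, safe=":")


query = build_query(
    program="rare_disease",
    assembly="GRCh37",
    specificDiseases="intellectual disability",
    count=True,
)
print(query)
```

The log lines in this notebook print the space unencoded; an HTTP client would normally percent-encode it as shown by this sketch.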

Normalisation allows us to make filters case insensitive¶

The filters on which normalisation is applied are:

Panel names
Disease names, groups, subgroups, etc.
Interpretation service names

cases_client.count(specificDiseases='intellectual disability')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?specificDiseases=intellectual disability&count=True
Response time : 218 ms

5602

cases_client.count(specificDiseases='INTELLECTUAL DISABILITY')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?specificDiseases=INTELLECTUAL DISABILITY&count=True
Response time : 220 ms

5602
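The exact normalisation rules used by CVA are not described here. A minimal sketch, assuming normalisation amounts to trimming whitespace and folding case (the function normalise is hypothetical), shows why both queries above return the same count:

```python
def normalise(value: str) -> str:
    """Toy normalisation: trim, collapse inner whitespace and fold case.

    This is an assumption for illustration; the real rules may differ.
    """
    return " ".join(value.split()).casefold()


queries = [
    "intellectual disability",
    "INTELLECTUAL DISABILITY",
    "  Intellectual   Disability ",
]

# All spellings collapse to a single normalised value, so they would
# match the same normalised field in the database
normalised = {normalise(q) for q in queries}
print(normalised)
```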

Cross references¶

Phenotypes and genes are stored as identifiers in CVA (i.e. Ensembl identifiers for genes, HPO term identifiers for rare disease phenotypes and SNOMED/ICD10 for cancer phenotypes), but they can be matched by cross references to other databases.

We can search genes by Ensembl identifier, HGNC gene symbol, gene synonym, UniProt, CCDS and LRG.

http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000130164;r=19:11089362-11133816

NOTE: these queries call CellBase and are therefore slower

# match by gene Ensembl identifier
cases_client.count(genes='ENSG00000130164')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?genes=ENSG00000130164&count=True
Response time : 10 ms

690

# match by HGNC gene symbol
cases_client.count(genesXrefs='LDLR')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=LDLR&count=True
Response time : 55 ms

690

# match by synonym
cases_client.count(genesXrefs='LDLCQ2')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=LDLCQ2&count=True
Response time : 65 ms

690

# match by Uniprot identifier
cases_client.count(genesXrefs='P01130')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=P01130&count=True
Response time : 49 ms

690

# match by Human CCDS
cases_client.count(genesXrefs='CCDS12254.1')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=CCDS12254.1&count=True
Response time : 59 ms

690

# match by LRG region
cases_client.count(genesXrefs='LRG_274')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=LRG_274&count=True
Response time : 72 ms

690

We can search HPO terms by identifier, alternative HPO id, SNOMED-CT (US version), UMLS and MSH.

https://hpo.jax.org/app/browse/term/HP:0003701

NOTE: alternative HPO terms are replaced automatically by the most current HPO term when queried

cases_client.count(phenotypes='HP:0003701')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0003701&count=True
Response time : 12 ms

56

# search by UMLS
cases_client.count(phenotypesXrefs='UMLS:C0221629')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=UMLS:C0221629&count=True
Response time : 69 ms

56

# search by SNOMED CT
cases_client.count(phenotypesXrefs='SNOMEDCT_US:249939004')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=SNOMEDCT_US:249939004&count=True
Response time : 38 ms

56

# search by alternative HPO id
cases_client.count(phenotypesXrefs='HP:0008961')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=HP:0008961&count=True
Response time : 28 ms

56

Boolean filtering¶

By default, results must match all filters rather than any one of them. There are two ways around this behaviour:

The filter parameter
The text search (see section below)

The filter parameter follows a basic subset of the OData specification for $filter: https://www.odata.org/documentation/odata-version-3-0/url-conventions/

Following this specification, equality and inequality are expressed with the operators eq and ne, and multiple expressions can be combined with the boolean operators "or" and "and".

There are known limitations to this filtering approach:

no normalisation is applied to query values
the field names have to follow the schema naming in the database, which is obscure to the user
we are not able to combine different boolean operators (i.e. (this and that) or those)
we do not support substring operations
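Given these limitations, filter strings are easy to get wrong when written by hand. The following sketch assembles one from (field, operator, value) triples and refuses to mix boolean operators. The helper build_filter is hypothetical and not part of pyark; it also rejects values containing a single quote, since it is not stated here whether escaping is supported.

```python
COMPARISONS = {"eq", "ne", "lt", "le", "gt", "ge"}


def literal(value):
    """Render a Python value as a filter literal (strings are quoted)."""
    if isinstance(value, str):
        if "'" in value:
            # escaping support is unknown, so fail loudly instead of guessing
            raise ValueError("single quotes in values are not handled")
        return f"'{value}'"
    return str(value)


def build_filter(conditions, combine="and"):
    """Join conditions with ONE boolean operator, as mixing is unsupported."""
    if combine not in ("and", "or"):
        raise ValueError(f"unsupported boolean operator: {combine}")
    parts = []
    for field, op, value in conditions:
        if op not in COMPARISONS:
            raise ValueError(f"unsupported comparison operator: {op}")
        parts.append(f"{field} {op} {literal(value)}")
    return f" {combine} ".join(parts)


expression = build_filter(
    [
        ("pedigreeAnalysisPanels.specificDisease", "eq", "intellectual disability"),
        ("probandDisorders.specificDisease", "eq", "intellectual disability"),
    ],
    combine="and",
)
print(expression)
```

The resulting string could then be passed as the filter argument. Values are not normalised, so they must match the stored case exactly.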

# AND operator
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and " + 
    "probandDisorders.specificDisease eq 'intellectual disability'")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and probandDisorders.specificDisease eq 'intellectual disability'&count=True
Response time : 259 ms

5450

# OR operator
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' or " + 
    "probandDisorders.specificDisease eq 'intellectual disability'")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' or probandDisorders.specificDisease eq 'intellectual disability'&count=True
Response time : 257 ms

5623

# AND operator and negation
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and " + 
    "probandDisorders.specificDisease ne 'intellectual disability'")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and probandDisorders.specificDisease ne 'intellectual disability'&count=True
Response time : 247 ms

21

# normalisation is not applied
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'INTELLECTUAL DISABILITY' and " + 
    "probandDisorders.specificDisease ne 'intellectual disability'")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'INTELLECTUAL DISABILITY' and probandDisorders.specificDisease ne 'intellectual disability'&count=True
Response time : 246 ms

0

Also, when a list of values is provided, the default behaviour is that results match any element in the list. Requiring all elements instead is only possible in one specific use case: phenotypes.

# default behaviour for a list of phenotypes
cases_client.count(phenotypes=["HP:0005338", "HP:0002000"])

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0005338&phenotypes=HP:0002000&count=True
Response time : 11 ms

30

# explicitly matching any phenotype
cases_client.count(phenotypes=["HP:0005338", "HP:0002000"], anyOrAllPhenotypes='ANY')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0005338&phenotypes=HP:0002000&anyOrAllPhenotypes=ANY&count=True
Response time : 12 ms

30

# matching all phenotypes
cases_client.count(phenotypes=["HP:0005338", "HP:0002000"], anyOrAllPhenotypes='ALL')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0005338&phenotypes=HP:0002000&anyOrAllPhenotypes=ALL&count=True
Response time : 16 ms

6

# also works for cross references
cases_client.count(phenotypesXrefs=["HP:0005338", "UMLS:C1857479"], anyOrAllPhenotypes='ALL')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=HP:0005338&phenotypesXrefs=UMLS:C1857479&anyOrAllPhenotypes=ALL&count=True
Response time : 28 ms

6
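The difference between the two modes is the difference between a non-empty intersection and a subset test. A small local sketch with made-up cases (the case identifiers and their phenotype lists are invented for illustration) makes this concrete:

```python
def matches(case_phenotypes, query, mode="ANY"):
    """Return True if a case satisfies the phenotype query in the given mode."""
    case_set, query_set = set(case_phenotypes), set(query)
    if mode == "ANY":
        # at least one queried phenotype is present in the case
        return not case_set.isdisjoint(query_set)
    if mode == "ALL":
        # every queried phenotype is present in the case
        return query_set <= case_set
    raise ValueError(f"unknown mode: {mode}")


# Invented example data, not real cases
cases = {
    "case-A": ["HP:0005338", "HP:0002000"],
    "case-B": ["HP:0005338"],
    "case-C": ["HP:0003701"],
}
query = ["HP:0005338", "HP:0002000"]

any_hits = sorted(c for c, p in cases.items() if matches(p, query, "ANY"))
all_hits = sorted(c for c, p in cases.items() if matches(p, query, "ALL"))
print(any_hits, all_hits)
```

Every case matched in ALL mode is also matched in ANY mode, which is consistent with the counts above (6 versus 30).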

Arithmetic filtering¶

The filter parameter also supports arithmetic comparisons on numeric values: lt, le, gt and ge.

The filter parameter is the only way to perform arithmetic comparisons.

# search for cases with at least 3 samples and a proband born after 2016
cases_client.count(filter="countSamples ge 3 and probandYearOfBirth gt 2016")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=countSamples ge 3 and probandYearOfBirth gt 2016&count=True
Response time : 161 ms

274

cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and " + 
    "countTiers.TIER1 eq 0 and countTiers.TIER2 eq 0")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and countTiers.TIER1 eq 0 and countTiers.TIER2 eq 0&count=True
Response time : 205 ms

1520
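The comparison operators map directly onto Python's operator module, which gives a simple way to check locally what a filter is expected to select. This is only a sketch of the semantics on an invented record; it says nothing about how the server evaluates the filter, and the helper names are hypothetical.

```python
import operator

# Filter operator names and the Python functions with the same meaning
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
}


def get_field(record, dotted_path):
    """Resolve a dotted field name such as 'countTiers.TIER1' in nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def satisfies(record, conditions):
    """True if the record meets every (field, operator, value) condition."""
    return all(
        OPERATORS[op](get_field(record, field), value)
        for field, op, value in conditions
    )


# Invented record for illustration only
record = {"countSamples": 3, "probandYearOfBirth": 2017, "countTiers": {"TIER1": 0}}

ok = satisfies(record, [("countSamples", "ge", 3), ("probandYearOfBirth", "gt", 2016)])
print(ok)
```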

Search¶

The previous filtering approaches fit most use cases, especially computational analysis, but to support human friendly search operations from a front end we need some natural language processing to do partial matching, tokenisation, stemming and ranking of matches. The results of a text search are fuzzier than the exact results provided by filtering.

There are two main use cases:

Free text search. A word or a set of words returns a number of documents
Autocomplete. A partial word returns a number of possible matches for that word across all existing documents

Free text search¶

Search is supported for three entities: cases, variants and phenotypes. It is available through /cases?search=something, /variants?search=something and /hpos/search?search=something respectively.

The search of cases supports the following:

Identifier based search
- Case identifier (either unversioned "12345" or versioned "12345-2")
- Family identifier
- Participant identifier (ie: any participant in the family, not only the proband)
- HPO identifier (ie: HPOs present in the proband)
- Gene id + gene cross references (ie: genes having a variant in the case)
- Variant cross references (ie: variants in the case)

Semantic search
- Clinical indication
- Panel name
- HPO name
- HPO synonyms

Other regex based search
- Variant coordinates (ie: cases with this variant)
- Genomic region (ie: cases with any variant within this region)

The search of variants supports the following:

Identifier based search
- Variant cross references (ie: ClinVar, dbSNP, COSMIC)
- Gene cross references (ie: Ensembl gene and transcript, Uniprot accession, name and variant id, HGNC gene symbol)
- Other items related to the gene (i.e. PubMed id, gene drug interactions (e.g. DGIdb))

Other regex based search
- Variant coordinates
- Genomic region

The search of phenotypes supports the following:

Semantic search
- Name
- Synonyms

# exact match of metabolic against disorder or panel name
cases_client.count(search="metabolic")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolic&count=True
Response time : 70 ms

5226

# match of stem of the word against disorder or panel name
cases_client.count(search="metabolism")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolism&count=True
Response time : 73 ms

5226

# stop words are ignored
cases_client.count(search="metabolism of this and that")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolism of this and that&count=True
Response time : 56 ms

5226
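The three identical counts above follow from stemming and stop word removal. The toy tokeniser below reproduces the idea with a hand-written stop word list and a crude suffix stripper; both are assumptions made for illustration and are much simpler than whatever text analysis the server applies.

```python
STOP_WORDS = {"of", "this", "and", "that", "the", "a", "an", "or"}
SUFFIXES = ("ism", "ic", "s")  # crude, illustrative only


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Lowercase, split on whitespace, drop stop words and stem the rest."""
    return {stem(word) for word in text.lower().split() if word not in STOP_WORDS}


def matches(query, document):
    """Unquoted words are OR-ed: one shared stem is enough to match."""
    return bool(tokens(query) & tokens(document))


print(tokens("metabolic"))
print(tokens("metabolism of this and that"))
print(matches("metabolism", "undiagnosed metabolic disorders"))
```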

# match "metabolic" or "undiagnosed"
cases_client.count(search="metabolic undiagnosed")

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolic undiagnosed&count=True
Response time : 69 ms

5512

# match exactly "metabolic" and "undiagnosed" by quoting each word independently
cases_client.count(search='"metabolic" "undiagnosed"')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search="metabolic" "undiagnosed"&count=True
Response time : 268 ms

4966

# match an exact ordered set of words by quoting all words
cases_client.count(search='"undiagnosed metabolic disorders"')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search="undiagnosed metabolic disorders"&count=True
Response time : 376 ms

4965

# numeric values will match any numeric field containing it
cases_client.count(search='10000-1')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=10000-1&count=True
Response time : 8 ms

1

# search by variant
cases_client.count(search='8:8377111:A:T')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=8:8377111:A:T&count=True
Response time : 22 ms

14

# search by Ensembl gene identifier
cases_client.count(search='ENSG00000213516')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=ENSG00000213516&count=True
Response time : 15 ms

434

# search by gene symbol
cases_client.count(search='BRCA1')

GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=BRCA1&count=True
Response time : 1262 ms

729

HPO terms can be searched over the canonical name, the definition and the synonyms.

hpos = next(entities_client.get_hpos(search='"color blind"', as_data_frame=True))

GET https://bio-test-cva.gel.zone/cva/api/0/hpos/search?max=None&search="color blind"
Response time : 4 ms

hpos[["identifier", "name", "synonyms"]]

Autocomplete search¶

Cases have a dedicated endpoint, /cases/search, that supports autocomplete functionality based on panels and disorders. The response contains scored matches in different categories, which makes it possible to suggest a sorted set of named entities to the user.

cases_client.search('metabol')

GET https://bio-test-cva.gel.zone/cva/api/0/cases/search/metabol?
Response time : 749 ms

[{'diseases': [{'specificDisease': 'undiagnosed metabolic disorders',
    'matchScore': 1.500496,
    'countCases': 168},
   {'specificDisease': 'rhabdomyolysis and metabolic muscle disorders',
    'matchScore': 1.3498728,
    'countCases': 131}],
  'panels': [{'panelName': 'rhabdomyolysis and metabolic muscle disorders',
    'matchScore': 1.2922647,
    'countCases': 199},
   {'panelName': 'undiagnosed metabolic disorders',
    'matchScore': 0.75966716,
    'countCases': 4963}]}]
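A front end would typically flatten this response into a single ranked list. The sketch below does that with the response copied from above; the function suggestions and the tuple layout are illustrative choices, not part of pyark.

```python
# Response copied from the autocomplete call above
response = [{
    "diseases": [
        {"specificDisease": "undiagnosed metabolic disorders",
         "matchScore": 1.500496, "countCases": 168},
        {"specificDisease": "rhabdomyolysis and metabolic muscle disorders",
         "matchScore": 1.3498728, "countCases": 131},
    ],
    "panels": [
        {"panelName": "rhabdomyolysis and metabolic muscle disorders",
         "matchScore": 1.2922647, "countCases": 199},
        {"panelName": "undiagnosed metabolic disorders",
         "matchScore": 0.75966716, "countCases": 4963},
    ],
}]

# Category in the response -> (label, key holding the display name)
CATEGORIES = {"diseases": ("disease", "specificDisease"), "panels": ("panel", "panelName")}


def suggestions(response):
    """Flatten all categories into (label, name, score, cases), best score first."""
    items = []
    for block in response:
        for category, (label, name_key) in CATEGORIES.items():
            for entry in block.get(category, []):
                items.append((label, entry[name_key], entry["matchScore"], entry["countCases"]))
    return sorted(items, key=lambda item: item[2], reverse=True)


ranked = suggestions(response)
for label, name, score, count in ranked:
    print(f"{score:.2f}  {label:8s} {name} ({count} cases)")
```

Sorting by matchScore alone is one option; a front end might also weight by countCases, since the lowest scored entry here has by far the most cases.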

Genes and their cross references can be matched by prefix.

# match cross reference
entities_client.get_genes(xrefRegex="^BRCA", as_data_frame=True).geneSymbol

GET https://bio-test-cva.gel.zone/cva/api/0/genes/search?xrefRegex=^BRCA
Response time : 30 ms

0     BRCA1
1     BRIP1
2     BRCA2
3     BCCIP
4    ARID4B
Name: geneSymbol, dtype: object

# match gene symbols
entities_client.get_genes(geneSymbolRegex="^BRCA", as_data_frame=True).geneSymbol

GET https://bio-test-cva.gel.zone/cva/api/0/genes/search?geneSymbolRegex=^BRCA
Response time : 7 ms

0    BRCA1
1    BRCA2
Name: geneSymbol, dtype: object
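Only two of the five symbols returned by the cross reference query start with BRCA themselves; the others presumably matched through one of their cross references. Assuming the server applies ordinary regex semantics, the effect of the ^ anchor on the symbols alone can be checked locally with Python's re module:

```python
import re

# Gene symbols returned by the xrefRegex query above
symbols = ["BRCA1", "BRIP1", "BRCA2", "BCCIP", "ARID4B"]

# '^' anchors the pattern at the start of the string, i.e. a prefix match
prefix = re.compile(r"^BRCA")
by_symbol = [symbol for symbol in symbols if prefix.search(symbol)]
print(by_symbol)

# Without an anchor the pattern may match anywhere in the string
unanchored = re.compile(r"intellect")
print(bool(unanchored.search("severe intellectual disability")))
```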

Panels can be matched by regex.

entities_client.get_panels_by_regex(regex="intellect", as_data_frame=True).head()

GET https://bio-test-cva.gel.zone/cva/api/0/panels/search?regex=intellect
Response time : 555 ms

Disorders can be matched by regex.

entities_client.get_disorders_by_regex(regex="intellect", as_data_frame=True).head()

GET https://bio-test-cva.gel.zone/cva/api/0/disorders/search?regex=intellect
Response time : 442 ms

HPO terms and their cross references can be matched by regex.

next(entities_client.get_hpos(xrefRegex="HP:000001", as_data_frame=True)).identifier

GET https://bio-test-cva.gel.zone/cva/api/0/hpos/search?max=None&xrefRegex=HP:000001
Response time : 26 ms

_index
0    HP:0000013
1    HP:0000015
2    HP:0002839
3    HP:0000012
4    HP:0000011
5    HP:0000019
6    HP:0000014
7    HP:0000016
8    HP:0000010
9    HP:0000017
Name: identifier, dtype: object

next(entities_client.get_hpos(xrefRegex="SNOMED.*123", as_data_frame=True)).identifier

GET https://bio-test-cva.gel.zone/cva/api/0/hpos/search?max=None&xrefRegex=SNOMED.*123
Response time : 31 ms

_index
0     HP:0000979
1     HP:0003235
2     HP:0100615
3     HP:0000826
4     HP:0005616
5     HP:0100817
6     HP:0000823
7     HP:0000631
8     HP:0007435
9     HP:0030159
10    HP:0002961
11    HP:0004315
12    HP:0005972
13    HP:0002750
14    HP:0000678
15    HP:0100029
16    HP:0012415
Name: identifier, dtype: object

Output of hpos[["identifier", "name", "synonyms"]] for the free text search on "color blind":

_index	identifier	name	synonyms
0	HP:0000552	Tritanomaly	[Blue yellow color blindness, Blue-yellow dysc...
1	HP:0000642	Red-green dyschromatopsia	[Dyschromatopsia with red-green confusion, Red...
2	HP:0007641	Dyschromatopsia	[Color blindness]

Output of entities_client.get_panels_by_regex(regex="intellect", as_data_frame=True).head():

	panelIdentifier	panelName	panelVersion	source
0	558aa423bb5a16630e15b63c	intellectual disability	2.744	panelApp
1	558aa423bb5a16630e15b63c	intellectual disability	2.684	panelapp
2	558aa423bb5a16630e15b63c	intellectual disability	2.611	panelApp
3	558aa423bb5a16630e15b63c	intellectual disability	2.608	panelApp
4	558aa423bb5a16630e15b63c	intellectual disability	2.200	PanelApp

Output of entities_client.get_disorders_by_regex(regex="intellect", as_data_frame=True).head():

	ageOfOnset	diseaseGroup	diseaseSubGroup	specificDisease
0	0.000	endocrine disorders	rare subtypes of diabetes	intellectual disability
1	0.000	gastroenterological disorders	liver disease	intellectual disability
2	0.000	gastroenterological disorders	gastrointestinal disorders	intellectual disability
3	1.600	neurology and neurodevelopmental disorders	neurodevelopmental disorders	intellectual disability
4	1.200	neurology and neurodevelopmental disorders	neurodevelopmental disorders	intellectual disability