Search and filtering

Search operations in CVA are intended to support a range of use cases, from computational analysis to front end features. Computational analysis needs comprehensive filtering, including some normalisation of text fields, arithmetic operations (e.g.: count_samples > 3) and boolean operations (e.g.: this eq 1 or that eq 2, this eq 1 and that eq 2). A front end additionally requires text search with some natural language processing (e.g.: metabolic matches metabolism and connector words are ignored), regex queries and ad hoc summaries to support autocomplete features.

The principal entities available for search are cases and report events. There are also secondary entities: variants, genes and phenotypes.

Search entities

In [1]:
%run initialise_pyark.py
POST https://bio-test-cva.gel.zone/cva/api/0/authentication?
Response time : 21 ms
pyark version 4.0.4

Filtering

Each main entity has a main GET endpoint that exposes a comprehensive set of filters. Some fields are normalised in the database, and every query is normalised in the same way so that matching is case insensitive.

The normalised fields are:

  • Panel names
  • Disorders
  • Interpretation service

Filters are combined with an AND operator.

In [2]:
# get the first page of cases
cases = next(cases_client.get_cases(
    program=Program.rare_disease, 
    assembly=Assembly.GRCh38, 
    specificDiseases='intellectual disability', 
    as_data_frame=True))
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&specificDiseases=intellectual disability&include=__all
Response time : 230 ms
In [3]:
# same filters can be applied to count operations
cases_client.count(
    program=Program.rare_disease, 
    assembly=Assembly.GRCh37, 
    specificDiseases='intellectual disability')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh37&specificDiseases=intellectual disability&count=True
Response time : 55 ms
Out[3]:
987

Normalisation allows us to make filters case insensitive.

The filters on which normalisation is applied are:

  • Panel names
  • Disease names, groups, subgroups, etc.
  • Interpretation service names
In [4]:
cases_client.count(specificDiseases='intellectual disability')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?specificDiseases=intellectual disability&count=True
Response time : 218 ms
Out[4]:
5602
In [5]:
cases_client.count(specificDiseases='INTELLECTUAL DISABILITY')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?specificDiseases=INTELLECTUAL DISABILITY&count=True
Response time : 220 ms
Out[5]:
5602
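
The behaviour shown above, where filters are combined with AND and compared case insensitively, can be mimicked locally. The following is a minimal sketch under assumptions: the `normalise` rule (Unicode folding, case folding and whitespace collapsing) and the in-memory records are hypothetical and are not taken from the CVA implementation.

```python
import unicodedata


def normalise(value):
    """Fold a label to a canonical form for comparison (assumed rule, not CVA's)."""
    # NFKC unifies equivalent Unicode forms; casefold() is a stronger lower()
    folded = unicodedata.normalize("NFKC", value).casefold()
    # collapse runs of whitespace so "a  b" and "a b" compare equal
    return " ".join(folded.split())


def count_matching(cases, **filters):
    """Count the cases that match every filter (AND), comparing normalised values."""
    wanted = {field: normalise(value) for field, value in filters.items()}
    return sum(
        all(normalise(case.get(field, "")) == value for field, value in wanted.items())
        for case in cases
    )


# hypothetical in-memory records, not real CVA data
cases = [
    {"specificDiseases": "Intellectual disability", "program": "rare_disease"},
    {"specificDiseases": "intellectual  disability", "program": "rare_disease"},
    {"specificDiseases": "Cystic kidney disease", "program": "rare_disease"},
]

lower = count_matching(cases, specificDiseases="intellectual disability")
upper = count_matching(cases, specificDiseases="INTELLECTUAL DISABILITY")
both = count_matching(cases, specificDiseases="INTELLECTUAL DISABILITY", program="Rare_Disease")
```

Because both the stored value and the query value go through the same function, the upper case and lower case queries select the same records.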

Cross references

Phenotypes and genes are stored as identifiers in CVA (ie: Ensembl identifiers for genes, HPO term identifiers for rare disease phenotypes and SNOMED/ICD10 for cancer phenotypes), but they can also be matched by cross references to other databases.

We can search genes by Ensembl identifier, HGNC gene symbol, gene synonym, UniProt accession, CCDS and LRG.

http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000130164;r=19:11089362-11133816

NOTE: these queries call CellBase and are therefore slower than queries by Ensembl identifier

In [6]:
# match by gene Ensembl identifier
cases_client.count(genes='ENSG00000130164')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?genes=ENSG00000130164&count=True
Response time : 10 ms
Out[6]:
690
In [7]:
# match by HGNC gene symbol
cases_client.count(genesXrefs='LDLR')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=LDLR&count=True
Response time : 55 ms
Out[7]:
690
In [8]:
# match by synonym
cases_client.count(genesXrefs='LDLCQ2')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=LDLCQ2&count=True
Response time : 65 ms
Out[8]:
690
In [9]:
# match by Uniprot identifier
cases_client.count(genesXrefs='P01130')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=P01130&count=True
Response time : 49 ms
Out[9]:
690
In [10]:
# match by Human CCDS
cases_client.count(genesXrefs='CCDS12254.1')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=CCDS12254.1&count=True
Response time : 59 ms
Out[10]:
690
In [11]:
# match by LRG region
cases_client.count(genesXrefs='LRG_274')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?genesXrefs=LRG_274&count=True
Response time : 72 ms
Out[11]:
690
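
Conceptually, every cross reference above resolves to the same Ensembl gene identifier before the cases are matched, which is why all the counts agree. A minimal local sketch of that resolution step follows. The lookup table only holds the identifiers used in the cells above, and `resolve_gene` is a hypothetical helper; in CVA the resolution is delegated to CellBase.

```python
# Cross references for one gene, copied from the queries above.
# This dict is only a local stand-in for the CellBase lookup.
XREFS_BY_GENE = {
    "ENSG00000130164": {"LDLR", "LDLCQ2", "P01130", "CCDS12254.1", "LRG_274"},
}

# invert the table to xref -> Ensembl id, including the Ensembl id itself
GENE_BY_XREF = {}
for gene_id, xrefs in XREFS_BY_GENE.items():
    GENE_BY_XREF[gene_id] = gene_id
    for xref in xrefs:
        GENE_BY_XREF[xref] = gene_id


def resolve_gene(xref):
    """Return the Ensembl gene id for a cross reference, or None if it is unknown."""
    return GENE_BY_XREF.get(xref)


resolved = {xref: resolve_gene(xref) for xref in ["LDLR", "P01130", "LRG_274", "UNKNOWN"]}
```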

We can search HPO terms by identifier, alternative HPO id, SNOMED-CT (US version), UMLS and MSH.

https://hpo.jax.org/app/browse/term/HP:0003701

NOTE: alternative HPO terms are replaced automatically by the most current HPO term when queried

In [12]:
cases_client.count(phenotypes='HP:0003701')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0003701&count=True
Response time : 12 ms
Out[12]:
56
In [13]:
# search by UMLS
cases_client.count(phenotypesXrefs='UMLS:C0221629')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=UMLS:C0221629&count=True
Response time : 69 ms
Out[13]:
56
In [14]:
# search by SNOMED CT
cases_client.count(phenotypesXrefs='SNOMEDCT_US:249939004')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=SNOMEDCT_US:249939004&count=True
Response time : 38 ms
Out[14]:
56
In [15]:
# search by alternative HPO id
cases_client.count(phenotypesXrefs='HP:0008961')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=HP:0008961&count=True
Response time : 28 ms
Out[15]:
56

Boolean filtering

By default, results must match all the filters provided, as opposed to matching any of them. There are two workarounds for this behaviour:

  • The filter parameter
  • The text search (see section below)

The filter parameter follows a basic subset of the OData specification for $filter: https://www.odata.org/documentation/odata-version-3-0/url-conventions/

Following this specification, equality and inequality are expressed with the operators eq and ne, and multiple expressions can be combined with the operators or and and.

There are known limitations to this filtering approach:

  • there is no normalisation happening in query values
  • the field names have to follow the schema naming in the database which is very obscure to the user
  • we are not able to combine different boolean operators (ie: (this and that) or those)
  • we do not support substring operations
In [16]:
# AND operator
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and " + 
    "probandDisorders.specificDisease eq 'intellectual disability'")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and probandDisorders.specificDisease eq 'intellectual disability'&count=True
Response time : 259 ms
Out[16]:
5450
In [17]:
# OR operator
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' or " + 
    "probandDisorders.specificDisease eq 'intellectual disability'")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' or probandDisorders.specificDisease eq 'intellectual disability'&count=True
Response time : 257 ms
Out[17]:
5623
In [18]:
# AND operator and negation
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and " + 
    "probandDisorders.specificDisease ne 'intellectual disability'")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and probandDisorders.specificDisease ne 'intellectual disability'&count=True
Response time : 247 ms
Out[18]:
21
In [19]:
# normalisation is not applied
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'INTELLECTUAL DISABILITY' and " + 
    "probandDisorders.specificDisease ne 'intellectual disability'")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'INTELLECTUAL DISABILITY' and probandDisorders.specificDisease ne 'intellectual disability'&count=True
Response time : 246 ms
Out[19]:
0
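
Writing these filter strings by hand is error prone, so a small builder can help. The sketch below is an assumption-laden illustration: `literal`, `comparison` and `combine` are hypothetical helpers that are not part of pyark, and they only cover the subset described above (eq, ne and a single boolean operator).

```python
def literal(value):
    """Render a Python value as a filter literal: strings are single-quoted."""
    if isinstance(value, bool):
        raise TypeError("boolean literals are not covered by this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    if "'" in value:
        raise ValueError("quote escaping is not covered by this sketch")
    return "'{}'".format(value)


def comparison(field, operator, value):
    """Build a single '<field> <operator> <literal>' expression."""
    if operator not in ("eq", "ne"):
        raise ValueError("unsupported operator: " + operator)
    return "{} {} {}".format(field, operator, literal(value))


def combine(boolean_operator, *expressions):
    """Join expressions with one boolean operator; 'and' and 'or' cannot be mixed."""
    if boolean_operator not in ("and", "or"):
        raise ValueError("unsupported boolean operator: " + boolean_operator)
    return " {} ".format(boolean_operator).join(expressions)


panel = comparison("pedigreeAnalysisPanels.specificDisease", "eq", "intellectual disability")
disorder = comparison("probandDisorders.specificDisease", "ne", "intellectual disability")
filter_value = combine("and", panel, disorder)
```

The resulting string can be passed as the filter argument, as in cases_client.count(filter=filter_value). Since no normalisation is applied to filter values, the text must be given exactly as it is stored.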

In addition, when a filter is given a list of values, the default behaviour is that results match any element in the list. For one specific use case, phenotypes, this behaviour can be changed to match all elements.

In [20]:
# default behaviour for a list of phenotypes
cases_client.count(phenotypes=["HP:0005338", "HP:0002000"])
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0005338&phenotypes=HP:0002000&count=True
Response time : 11 ms
Out[20]:
30
In [21]:
# explicitly matching any phenotype
cases_client.count(phenotypes=["HP:0005338", "HP:0002000"], anyOrAllPhenotypes='ANY')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0005338&phenotypes=HP:0002000&anyOrAllPhenotypes=ANY&count=True
Response time : 12 ms
Out[21]:
30
In [22]:
# matching all phenotypes
cases_client.count(phenotypes=["HP:0005338", "HP:0002000"], anyOrAllPhenotypes='ALL')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypes=HP:0005338&phenotypes=HP:0002000&anyOrAllPhenotypes=ALL&count=True
Response time : 16 ms
Out[22]:
6
In [23]:
# also works for cross references
cases_client.count(phenotypesXrefs=["HP:0005338", "UMLS:C1857479"], anyOrAllPhenotypes='ALL')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?phenotypesXrefs=HP:0005338&phenotypesXrefs=UMLS:C1857479&anyOrAllPhenotypes=ALL&count=True
Response time : 28 ms
Out[23]:
6
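
The difference between the two modes comes down to set intersection versus set inclusion. A minimal sketch over hypothetical in-memory cases (the `matches` helper and the case records are illustrative, not CVA code or data):

```python
def matches(case_phenotypes, queried, mode="ANY"):
    """Return True if the case satisfies the phenotype query under the given mode."""
    present = set(case_phenotypes)
    wanted = set(queried)
    if mode == "ANY":
        return bool(present & wanted)  # at least one queried term is present
    if mode == "ALL":
        return wanted <= present       # every queried term is present
    raise ValueError("mode must be 'ANY' or 'ALL'")


# hypothetical cases keyed by an invented identifier
cases = {
    "case-1": ["HP:0005338", "HP:0002000"],
    "case-2": ["HP:0005338"],
    "case-3": ["HP:0003701"],
}
query = ["HP:0005338", "HP:0002000"]

any_hits = sorted(name for name, terms in cases.items() if matches(terms, query, "ANY"))
all_hits = sorted(name for name, terms in cases.items() if matches(terms, query, "ALL"))
```

Every case selected by ALL is also selected by ANY, which is consistent with the counts above (6 against 30).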

Arithmetic filtering

The filter parameter also supports arithmetic comparisons on numeric values: lt, le, gt and ge.

The filter parameter is the only way to perform arithmetic comparisons.

In [24]:
# search for cases with at least 3 samples and the proband being born after 2016
cases_client.count(filter="countSamples ge 3 and probandYearOfBirth gt 2016")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=countSamples ge 3 and probandYearOfBirth gt 2016&count=True
Response time : 161 ms
Out[24]:
274
In [25]:
cases_client.count(
    filter="pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and " + 
    "countTiers.TIER1 eq 0 and countTiers.TIER2 eq 0")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?filter=pedigreeAnalysisPanels.specificDisease eq 'intellectual disability' and countTiers.TIER1 eq 0 and countTiers.TIER2 eq 0&count=True
Response time : 205 ms
Out[25]:
1520
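
To make the semantics of the comparison operators concrete, the sketch below evaluates a numeric filter expression against local records. It is an illustration under assumptions: the parser is hypothetical, only handles numeric literals and a single boolean operator, and looks fields up as flat dictionary keys. It does not reflect how CVA evaluates filters on the server.

```python
import operator
import re

OPERATORS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}
COMPARISON = re.compile(
    r"^(?P<field>[\w.]+) (?P<op>eq|ne|lt|le|gt|ge) (?P<value>-?\d+(?:\.\d+)?)$")


def parse(filter_value):
    """Split a filter into (boolean operator, [(field, comparison function, number)])."""
    has_and = " and " in filter_value
    has_or = " or " in filter_value
    if has_and and has_or:
        raise ValueError("mixing 'and' with 'or' is not supported")
    boolean = "or" if has_or else "and"
    comparisons = []
    for part in filter_value.split(" {} ".format(boolean)):
        match = COMPARISON.match(part.strip())
        if match is None:
            raise ValueError("cannot parse: " + part)
        comparisons.append(
            (match["field"], OPERATORS[match["op"]], float(match["value"])))
    return boolean, comparisons


def evaluate(filter_value, record):
    """Return True if the record satisfies the filter; missing fields never match."""
    boolean, comparisons = parse(filter_value)
    results = [field in record and compare(record[field], value)
               for field, compare, value in comparisons]
    return any(results) if boolean == "or" else all(results)


# hypothetical records
records = [
    {"countSamples": 3, "probandYearOfBirth": 2017},
    {"countSamples": 2, "probandYearOfBirth": 2018},
    {"countSamples": 4, "probandYearOfBirth": 2016},
]
expression = "countSamples ge 3 and probandYearOfBirth gt 2016"
selected = [record for record in records if evaluate(expression, record)]
```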

Text search

The previous filtering approaches fit most use cases, especially computational analysis, but to support human friendly search operations from a front end we need some natural language processing to do partial matching, tokenisation, stemming and ranking of matches. The results of a text search are fuzzier and less exact than the results provided by filtering.

There are two main use cases:

  • Free text search. A word or a set of words returns the documents that match
  • Autocomplete. A partial word returns the possible matches for that word across all existing documents

There are three entities where search is supported: cases, variants and phenotypes. It is available through /cases?search=something, /variants?search=something and /hpos/search?search=something respectively.

The search of cases supports the following:

  1. Identifier based search
    • Case identifier (either unversioned "12345" or versioned "12345-2")
    • Family identifier
    • Participant identifier (ie: any participant in the family, not only the proband)
    • HPO identifier (ie: HPOs present in the proband)
    • Gene id + gene cross references (ie: genes having a variant in the case)
    • Variant cross references (ie: variants in the case)
  2. Semantic search
    • Clinical indication
    • Panel name
    • HPO name
    • HPO synonyms
  3. Other regex based search
    • Variant coordinates (ie: cases with this variant)
    • Genomic region (ie: cases with any variant within this region)

The search of variants supports the following:

  1. Identifier based search
    • Variant cross references (ie: ClinVar, dbSNP, COSMIC)
    • Gene cross references (ie: Ensembl gene and transcript, Uniprot accession, name and variant id, HGNC gene symbol)
    • Other items related to the gene (ie: PubMed id, gene drug interactions (e.g.: DGIdb))
  2. Other regex based search
    • Variant coordinates
    • Genomic region

The search of phenotypes supports the following:

  1. Semantic search
    • Name
    • Synonyms
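
The lists above imply that a search term is first recognised as an identifier, a variant, a genomic region or free text. The sketch below shows one way such a classification could be done with regular expressions. All patterns and category names are assumptions made for illustration; the patterns actually used by CVA are not documented here.

```python
import re

# ordered from most to least specific; every pattern is an assumption
PATTERNS = [
    ("variant", re.compile(r"(?:chr)?(?:[0-9]{1,2}|X|Y|MT?):[0-9]+:[ACGT]+:[ACGT]+")),
    ("region", re.compile(r"(?:chr)?(?:[0-9]{1,2}|X|Y|MT?):[0-9]+-[0-9]+")),
    ("hpo", re.compile(r"HP:[0-9]{7}")),
    ("ensembl_gene", re.compile(r"ENSG[0-9]{11}")),
    # case ("12345" or "12345-2"), family and participant ids cannot be told apart here
    ("numeric_identifier", re.compile(r"[0-9]+(?:-[0-9]+)?")),
]


def classify(term):
    """Return the first category whose pattern matches the whole term."""
    term = term.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(term):
            return name
    # gene symbols, panel names and disorders all fall through to free text
    return "free_text"


examples = ["8:8377111:A:T", "19:11089362-11133816", "HP:0003701",
            "ENSG00000213516", "12345-2", "metabolic"]
categories = [classify(term) for term in examples]
```

Terms such as BRCA1 cannot be recognised by shape alone, so in this sketch they are treated as free text and would need a lookup against gene cross references.
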
In [26]:
# exact match of metabolic against disorder or panel name
cases_client.count(search="metabolic")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolic&count=True
Response time : 70 ms
Out[26]:
5226
In [27]:
# match of stem of the word against disorder or panel name
cases_client.count(search="metabolism")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolism&count=True
Response time : 73 ms
Out[27]:
5226
In [28]:
# stop words are ignored
cases_client.count(search="metabolism of this and that")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolism of this and that&count=True
Response time : 56 ms
Out[28]:
5226
In [29]:
# match "metabolic" or "undiagnosed"
cases_client.count(search="metabolic undiagnosed")
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=metabolic undiagnosed&count=True
Response time : 69 ms
Out[29]:
5512
In [30]:
# match exactly "metabolic" and "undiagnosed" by quoting each word independently
cases_client.count(search='"metabolic" "undiagnosed"')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search="metabolic" "undiagnosed"&count=True
Response time : 268 ms
Out[30]:
4966
In [31]:
# match an exact ordered set of words by quoting all words
cases_client.count(search='"undiagnosed metabolic disorders"')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search="undiagnosed metabolic disorders"&count=True
Response time : 376 ms
Out[31]:
4965
In [32]:
# numeric values will match any numeric field containing it
cases_client.count(search='10000-1')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=10000-1&count=True
Response time : 8 ms
Out[32]:
1
In [33]:
# search by variant
cases_client.count(search='8:8377111:A:T')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=8:8377111:A:T&count=True
Response time : 22 ms
Out[33]:
14
In [34]:
# search by gene Ensembl identifier
cases_client.count(search='ENSG00000213516')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=ENSG00000213516&count=True
Response time : 15 ms
Out[34]:
434
In [35]:
# search by gene symbol
cases_client.count(search='BRCA1')
GET https://bio-test-cva.gel.zone/cva/api/0/cases?search=BRCA1&count=True
Response time : 1262 ms
Out[35]:
729
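
The stemming and stop word behaviour seen in the first examples above can be illustrated with a toy text pipeline. This is only a sketch: the stop word list and the suffix stripping rule are deliberately crude assumptions, not the analyser used by CVA, and quoted phrases are not covered.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "or", "that", "the", "this"}  # tiny illustrative list
SUFFIXES = ("isms", "ism", "ics", "ic", "es", "s")                   # crude, longest first


def stem(word):
    """Strip one known suffix, keeping at least four characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lower-case, tokenise, drop stop words and stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(token) for token in tokens if token not in STOP_WORDS}


def matches(query, document):
    """An unquoted query matches when any of its terms occurs in the document."""
    return bool(terms(query) & terms(document))


document = "undiagnosed metabolic disorders"
same_terms = terms("metabolic") == terms("metabolism of this and that")
hit = matches("metabolism", document)
miss = matches("metabolism", "intellectual disability")
```

Both "metabolic" and "metabolism" reduce to the same stem in this sketch, which is the reason the two queries select the same documents.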

HPO terms can be searched over the canonical name, the definition and the synonyms.

In [36]:
hpos = next(entities_client.get_hpos(search='"color blind"', as_data_frame=True))
GET https://bio-test-cva.gel.zone/cva/api/0/hpos/search?max=None&search="color blind"
Response time : 4 ms
In [37]:
hpos[["identifier", "name", "synonyms"]]
Out[37]:
        identifier  name                       synonyms
_index
0       HP:0000552  Tritanomaly                [Blue yellow color blindness, Blue-yellow dysc...
1       HP:0000642  Red-green dyschromatopsia  [Dyschromatopsia with red-green confusion, Red...
2       HP:0007641  Dyschromatopsia            [Color blindness]

Cases have a dedicated endpoint, /cases/search, that supports autocomplete functionality based on panels and disorders. The response contains scored matches grouped by category, which allows a sorted set of named suggestions to be offered to the user.

In [38]:
cases_client.search('metabol')
GET https://bio-test-cva.gel.zone/cva/api/0/cases/search/metabol?
Response time : 749 ms
Out[38]:
[{'diseases': [{'specificDisease': 'undiagnosed metabolic disorders',
    'matchScore': 1.500496,
    'countCases': 168},
   {'specificDisease': 'rhabdomyolysis and metabolic muscle disorders',
    'matchScore': 1.3498728,
    'countCases': 131}],
  'panels': [{'panelName': 'rhabdomyolysis and metabolic muscle disorders',
    'matchScore': 1.2922647,
    'countCases': 199},
   {'panelName': 'undiagnosed metabolic disorders',
    'matchScore': 0.75966716,
    'countCases': 4963}]}]
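
A front end would typically flatten this response into a single list of suggestions ordered by score. A minimal sketch, using the response shown above as input; the `suggestions` helper and the tuple layout are hypothetical choices, not part of pyark.

```python
# response copied from the output above
response = [{
    "diseases": [
        {"specificDisease": "undiagnosed metabolic disorders",
         "matchScore": 1.500496, "countCases": 168},
        {"specificDisease": "rhabdomyolysis and metabolic muscle disorders",
         "matchScore": 1.3498728, "countCases": 131},
    ],
    "panels": [
        {"panelName": "rhabdomyolysis and metabolic muscle disorders",
         "matchScore": 1.2922647, "countCases": 199},
        {"panelName": "undiagnosed metabolic disorders",
         "matchScore": 0.75966716, "countCases": 4963},
    ],
}]


def suggestions(response):
    """Flatten the categories into (category, name, score, cases), best score first."""
    flat = []
    for result in response:
        for disease in result.get("diseases", []):
            flat.append(("disease", disease["specificDisease"],
                         disease["matchScore"], disease["countCases"]))
        for panel in result.get("panels", []):
            flat.append(("panel", panel["panelName"],
                         panel["matchScore"], panel["countCases"]))
    return sorted(flat, key=lambda item: item[2], reverse=True)


ranked = suggestions(response)
labels = [(category, name) for category, name, _, _ in ranked]
```

Keeping the category alongside the name lets the user interface distinguish a disorder from a panel that shares the same name.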

Genes and their cross references can be matched by regex, for instance by prefix.

In [39]:
# match cross reference
entities_client.get_genes(xrefRegex="^BRCA", as_data_frame=True).geneSymbol
GET https://bio-test-cva.gel.zone/cva/api/0/genes/search?xrefRegex=^BRCA
Response time : 30 ms
Out[39]:
0     BRCA1
1     BRIP1
2     BRCA2
3     BCCIP
4    ARID4B
Name: geneSymbol, dtype: object
In [40]:
# match gene symbols
entities_client.get_genes(geneSymbolRegex="^BRCA", as_data_frame=True).geneSymbol
GET https://bio-test-cva.gel.zone/cva/api/0/genes/search?geneSymbolRegex=^BRCA
Response time : 7 ms
Out[40]:
0    BRCA1
1    BRCA2
Name: geneSymbol, dtype: object
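
The two outputs differ because a cross reference regex is tested against every cross reference of a gene, while a gene symbol regex is tested against the symbol only. The sketch below reproduces that difference locally. The records and their cross reference lists are illustrative assumptions, not data taken from CVA, and `search` is a hypothetical helper.

```python
import re

# illustrative records: the xref lists are assumptions made for this sketch
genes = [
    {"geneSymbol": "BRCA1", "xrefs": ["BRCA1"]},
    {"geneSymbol": "BRCA2", "xrefs": ["BRCA2"]},
    {"geneSymbol": "BRIP1", "xrefs": ["BRIP1", "BRCA1 interacting protein"]},
    {"geneSymbol": "LDLR", "xrefs": ["LDLR", "LDLCQ2"]},
]


def search(genes, symbol_regex=None, xref_regex=None):
    """Return the symbols of the genes that satisfy every regex that is given."""
    hits = []
    for gene in genes:
        if symbol_regex is not None and not re.search(symbol_regex, gene["geneSymbol"]):
            continue
        if xref_regex is not None and not any(
                re.search(xref_regex, xref) for xref in gene["xrefs"]):
            continue
        hits.append(gene["geneSymbol"])
    return hits


by_symbol = search(genes, symbol_regex="^BRCA")
by_xref = search(genes, xref_regex="^BRCA")
```

The leading ^ anchors the pattern at the start of each string, which is what turns a general regex into a prefix match.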

Panels can be matched by regex.

In [41]:
entities_client.get_panels_by_regex(regex="intellect", as_data_frame=True).head()
GET https://bio-test-cva.gel.zone/cva/api/0/panels/search?regex=intellect
Response time : 555 ms
Out[41]:
   panelIdentifier           panelName                panelVersion  source
0  558aa423bb5a16630e15b63c  intellectual disability  2.744         panelApp
1  558aa423bb5a16630e15b63c  intellectual disability  2.684         panelapp
2  558aa423bb5a16630e15b63c  intellectual disability  2.611         panelApp
3  558aa423bb5a16630e15b63c  intellectual disability  2.608         panelApp
4  558aa423bb5a16630e15b63c  intellectual disability  2.200         PanelApp

Disorders can be matched by regex.

In [42]:
entities_client.get_disorders_by_regex(regex="intellect", as_data_frame=True).head()
GET https://bio-test-cva.gel.zone/cva/api/0/disorders/search?regex=intellect
Response time : 442 ms
Out[42]:
   ageOfOnset  diseaseGroup                                diseaseSubGroup               specificDisease
0  0.000       endocrine disorders                         rare subtypes of diabetes     intellectual disability
1  0.000       gastroenterological disorders               liver disease                 intellectual disability
2  0.000       gastroenterological disorders               gastrointestinal disorders    intellectual disability
3  1.600       neurology and neurodevelopmental disorders  neurodevelopmental disorders  intellectual disability
4  1.200       neurology and neurodevelopmental disorders  neurodevelopmental disorders  intellectual disability

HPO terms and their cross references can be matched by regex.

In [43]:
next(entities_client.get_hpos(xrefRegex="HP:000001", as_data_frame=True)).identifier
GET https://bio-test-cva.gel.zone/cva/api/0/hpos/search?max=None&xrefRegex=HP:000001
Response time : 26 ms
Out[43]:
_index
0    HP:0000013
1    HP:0000015
2    HP:0002839
3    HP:0000012
4    HP:0000011
5    HP:0000019
6    HP:0000014
7    HP:0000016
8    HP:0000010
9    HP:0000017
Name: identifier, dtype: object
In [44]:
next(entities_client.get_hpos(xrefRegex="SNOMED.*123", as_data_frame=True)).identifier
GET https://bio-test-cva.gel.zone/cva/api/0/hpos/search?max=None&xrefRegex=SNOMED.*123
Response time : 31 ms
Out[44]:
_index
0     HP:0000979
1     HP:0003235
2     HP:0100615
3     HP:0000826
4     HP:0005616
5     HP:0100817
6     HP:0000823
7     HP:0000631
8     HP:0007435
9     HP:0030159
10    HP:0002961
11    HP:0004315
12    HP:0005972
13    HP:0002750
14    HP:0000678
15    HP:0100029
16    HP:0012415
Name: identifier, dtype: object