%run initialise_pyark.py
# fetch a case
case = next(cases_client.get_cases(program=Program.rare_disease, max_results=1))
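The call above wraps `get_cases` in `next`, which suggests the cases are yielded lazily. A bare `next` raises `StopIteration` when a query matches nothing, so a guarded form can be useful. The sketch below shows the pattern with a stand-in generator; `fake_get_cases` and `first_or_none` are hypothetical helpers, not part of pyark, and the records are illustrative only.

```python
def fake_get_cases(cases, max_results=None):
    """Hypothetical stand-in for a lazy case iterator; not the pyark API."""
    for index, case in enumerate(cases):
        if max_results is not None and index >= max_results:
            return
        yield case


def first_or_none(iterator):
    """Return the first item of an iterator, or None if it is empty."""
    return next(iterator, None)


# Illustrative records only; real cases carry many more fields.
sample = [
    {"identifier": "case-1", "version": 1},
    {"identifier": "case-2", "version": 3},
]

first = first_or_none(fake_get_cases(sample, max_results=1))
missing = first_or_none(fake_get_cases([], max_results=1))
```

With a real client the same idea would presumably read `next(cases_client.get_cases(...), None)`, followed by a check for `None` before calling `case.get(...)`.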
The pedigrees for rare disease cases can be fetched using the case identifier and version. All relevant information in the pedigree is reshaped into the case entity.
ped = cases_client.get_pedigree(
    identifier=case.get('identifier'), version=case.get('version'), as_data_frame=True)
The clinical reports for any case can be fetched using the case identifier and version. All relevant information in the clinical report is reshaped into the case entity and the report events.
cr = cases_client.get_clinical_report(
    identifier=case.get('identifier'), version=case.get('version'), as_data_frame=True)
The rare disease exit questionnaires for any case can be fetched using the case identifier and version. All relevant information in the exit questionnaire is reshaped into the case entity and the report events.
eq = cases_client.get_rd_exit_questionnaire(
    identifier=case.get('identifier'), version=case.get('version'), as_data_frame=True)
There are secondary entities that are usually used to filter the main entities. CVA provides endpoints to support autocomplete features and to give a basic summary of the distribution of these entities across different cohorts of cases.
Some of these are:
Our panels data source is PanelApp. However, CVA does not query PanelApp or receive data directly from it; all the information it holds about panels is aggregated from the case data it receives.
# fetch the list of unique panel names
all_panels = entities_client.get_all_panels()
all_panels[0:5]
We can fetch a summary of panels which gives the count of cases where each panel was applied.
# fetch a summary of panels across cases
entities_client.get_panels_summary(as_data_frame=True).head()
We can fetch the summary of panels on a given cohort of cases. All filters for case cohort selection apply here.
entities_client.get_panels_summary(as_data_frame=True, hasPositiveDx=True).head()
We can disaggregate this information by panel version using the parameter considerVersions=True.
entities_client.get_panels_summary(as_data_frame=True, hasPositiveDx=True, considerVersions=True).head()
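As a rough illustration of what these summaries represent, the sketch below counts panels over a handful of made-up cases, first by panel name and then by name and version. The case records, the field names and the `summarise_panels` helper are assumptions for illustration; the real aggregation happens in CVA and may differ in detail.

```python
from collections import Counter


def summarise_panels(cases, consider_versions=False):
    """Count the cases in which each panel (optionally each panel version) appears."""
    counts = Counter()
    for case in cases:
        # A set ensures a panel is counted once per case, even if listed twice.
        keys = {
            (panel["name"], panel["version"]) if consider_versions else panel["name"]
            for panel in case["panels"]
        }
        counts.update(keys)
    return counts


# Made-up cases; panel names and versions are placeholders.
cases = [
    {"panels": [{"name": "Panel A", "version": "1.0"}]},
    {"panels": [{"name": "Panel A", "version": "1.1"},
                {"name": "Panel B", "version": "2.0"}]},
    {"panels": [{"name": "Panel A", "version": "1.1"}]},
]

by_name = summarise_panels(cases)
by_version = summarise_panels(cases, consider_versions=True)
```

Grouping by version splits the single "Panel A" count into one count per version, which is the kind of disaggregation the considerVersions parameter is described as providing.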
We can search panels by regex.
entities_client.get_panels_by_regex(regex="dystrophy", as_data_frame=True).head()
The clinical indications are defined in a hierarchy of three levels: disease group, disease subgroup and specific disease. CVA holds the clinical indications provided with the clinical data of each case. Cancer is slightly different because its hierarchy has only two levels, but it is stored in the same data structure.
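One way to picture a single structure holding both hierarchies is a record whose most specific level is optional. The sketch below is only a mental model: the class, its field names and the choice to leave the third level empty for two-level hierarchies are assumptions, not the CVA schema, and the values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClinicalIndication:
    """Hypothetical record; field names are illustrative, not the CVA schema."""
    disease_group: str
    disease_subgroup: str
    specific_disease: Optional[str] = None  # assumed empty when only two levels exist

    def levels(self):
        """Return the populated levels, from most general to most specific."""
        candidates = (self.disease_group, self.disease_subgroup, self.specific_disease)
        return [value for value in candidates if value is not None]


# Placeholder values, not real clinical indications.
three_level = ClinicalIndication("group X", "subgroup Y", "disease Z")
two_level = ClinicalIndication("group P", "subgroup Q")
```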
We can fetch all disease groups.
entities_client.get_all_disease_groups()[0:5]
We can get the above over any selected cohort of cases.
entities_client.get_all_disease_groups(filter="countParticipants gt 3")[0:5]
We can also fetch disease subgroups.
entities_client.get_all_disease_subgroups(filter="countParticipants gt 3")[0:5]
And specific diseases.
entities_client.get_all_specific_diseases(filter="countParticipants gt 3")[0:5]
We can also fetch summaries of disorders across a given cohort of cases.
entities_client.get_disorders_summary(hasPositiveDx=True, as_data_frame=True).head()
To support autocomplete, disorders can also be searched by regular expression.
entities_client.get_disorders_by_regex(regex="intellect", as_data_frame=True).head()
The organisations are the owners of cases. In the 100K Genomes Project these are the different Genomic Medicine Centers (GMCs).
We can obtain a summary of organisations across a cohort of cases.
next(entities_client.get_organisations(hasPositiveDx=True, as_data_frame=True)).head()
We store the Cellbase annotations for every gene reported in any case in CVA. The Cellbase annotations include the Ensembl gene identifier, the HGNC gene symbol and a number of cross references from several resources (e.g. UniProt, InterPro, Gene Ontology).
We can fetch the distribution of genes affected by a potential LoF variant across cases in any given cohort.
entities_client.get_genes_summary(hasPositiveDx=True, as_data_frame=True).head()
We can search genes by gene symbol, by cross reference, or by a regular expression over either. This endpoint may be useful for autocomplete purposes.
entities_client.get_genes(geneSymbols="BRCA2", as_data_frame=True).head()
entities_client.get_genes(geneSymbolRegex="^BRC", as_data_frame=True).head()
entities_client.get_genes(xrefs=["GO:0006351", "GO:0003700"], as_data_frame=True).head()
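The `^` in `"^BRC"` anchors the pattern to the start of the symbol, which is usually what an autocomplete box wants. The sketch below reproduces that behaviour locally with Python's `re` module; `filter_by_regex` is a hypothetical helper, and whether CVA matches case-sensitively or uses exactly these regex semantics is not stated in this notebook, so treat the comparison as an approximation.

```python
import re


def filter_by_regex(values, pattern):
    """Keep values where the pattern matches anywhere in the string (re.search semantics)."""
    compiled = re.compile(pattern)
    return [value for value in values if compiled.search(value)]


# Illustrative symbols; "XBRC9" is a made-up string included to show the effect of the anchor.
symbols = ["BRCA1", "BRCA2", "BRCC3", "TP53", "XBRC9"]

anchored = filter_by_regex(symbols, "^BRC")    # prefix matches only
unanchored = filter_by_regex(symbols, "BRC")   # matches anywhere in the symbol
```

Without the anchor the made-up symbol is returned as well, so a prefix-style autocomplete would generally prefer the anchored form.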
Phenotypes for rare disease cases are represented as Human Phenotype Ontology (HPO) terms. The whole HPO dataset is integrated into CVA, which normalises terms from older versions and allows terms to be searched by several different criteria. HPO terms are enriched with their information content, computed from the disease annotations provided by HPO.
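Information content is commonly defined as the negative logarithm of how often a term is used: a term annotated to nearly every disease carries little information, while a rarely used term carries a lot. The sketch below uses that common definition with base-2 logarithms. The exact formula, the log base and the handling of descendant terms in CVA are not stated here, so this is an assumption, and the counts are made up.

```python
import math


def information_content(annotated_diseases, total_diseases):
    """IC = -log2(p), where p is the fraction of diseases annotated with the term."""
    if not 0 < annotated_diseases <= total_diseases:
        raise ValueError("annotated_diseases must be in the range 1..total_diseases")
    return -math.log2(annotated_diseases / total_diseases)


# Made-up counts for illustration only.
ic_common = information_content(8, 8)  # term annotated to every disease
ic_rare = information_content(1, 8)    # term annotated to one disease in eight
```

Under this definition the common term scores zero and the rare term scores three bits, which is why information content is often used to weight specific phenotypes above general ones.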
We can fetch a single phenotype by identifier.
entities_client.get_hpo(identifier="HP:0012345", as_data_frame=True)
We can do a semantic search over phenotype names and synonyms.
next(entities_client.get_hpos(search="color blind", as_data_frame=True)).head()
Finally, we can also search for phenotypes by cross references.
next(entities_client.get_hpos(xrefs="SNOMEDCT_US:51886007", as_data_frame=True)).head()