There are three primary entities in CVA:
%run initialise_pyark.py
The case entity holds a summary on all the information about a given case. For instance it contains the list of variant identifiers that were tiered, although it does not contain the detail of why those specific variants were chosen. It also contains a summary of the clinical information coming from the pedigree and cancer participant models. It also contains some general flags about the main characteristics of the case: does the case have a positive diagnosis? does the pedigre include mother-father-child or some other family structure?
The objective of the case entity is to provide an entry point to the dataset held within CVA where we can easily select cohorts of cases according to a different number of criteria. In order the deep dive into the details of a case, it will be better to fetch all the report events for the case and probably the variant annotations for those variants.
We can obtain an iterator over cases given a certain criteria to select a cohort of cases. The iterator will take care of making as many queries as necessary to fetch all desired cases. The iterator is lazy, it fetches data from CVA only when necessary.
cases_iterator = cases_client.get_cases(
program=Program.rare_disease, assembly=Assembly.GRCh38,
caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True)
We can use the iterator to fetch case by case. Cases are returned in a Python dictionary.
case_1 = next(cases_iterator)
print(type(case_1))
print("Case identifier {}-{}".format(case_1['identifier'], case_1['version']))
Otherwise we can iterate through all cases in our cohort.
i = 0
for c in cases_iterator:
print("Case identifier {}-{}".format(c['identifier'], c['version']))
i += 1
if i >= 3:
break
Pagination can we customised by using the parameter limit
and max_results
. limit
defines the size of tha page, meaning that the iterator will fetch this number of elements every time that it needs more data and store them in memory. max_results
fixes the maximum number of elements that the iterator will ever return, this may be useful, for instance, to fetch the first ten cases.
cases_iterator = cases_client.get_cases(
program=Program.rare_disease, assembly=Assembly.GRCh38,
caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True, limit=2, max_results=5)
five_cases = []
for c in cases_iterator:
print("Case identifier {}-{}".format(c['identifier'], c['version']))
five_cases.append("{}-{}".format(c['identifier'], c['version']))
Pagination happens automatically using pyark, but if making raw REST queries the required coordinates for the next page are in the header of the response in the fields:
X-Pagination-Limit
X-Pagination-Marker
To fetch the next page, these attributes must be passed in the query parameter limit
and marker
respectively.
A case can be fetched by identifier and version using the method get_case
.
the_case = cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"])
There is also a method to fetch a list of cases by identifier get_cases_by_identifiers
, although it requires identifier and version to be merged as in {identifier}-{version}
.
the_five_cases = cases_client.get_case_by_identifiers(identifiers=five_cases)
There is a default minimal representation of a case that will be obtained when using explicitly include_all=False
. Long lists of variants and genes will be excluded. We can include or exclude ad hoc some fields from a case in order to make the data lighter using the parameters include_all=False
and include=["some", "fields"]
or exclude=["this", "that"]
. include
and exclude
cannot be used simultaneously.
print("Full case size: {}".format(get_size(
cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"]))))
print("Minimal default size: {}".format(get_size(
cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"], include_all=False))))
print("Only id and version size: {}".format(get_size(
cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"],
include_all=False, include=["identifier", "version"]))))
There is a predefined method get_cases_ids
to obtain only case identifiers.
list(cases_client.get_cases_ids(
program=Program.rare_disease, assembly=Assembly.GRCh38,
caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True, limit=2, max_results=5))
It is out of scope here to go through all fields available in the case entity, but just to give you a feeling of all data available:
list(case_1.keys())
The cases are returned by default in a native Python dictionary, but they can returned in a Pandas data frame. Cases are represented in a flattened model easily transformed in a tabular format. The Pandas library provides an implementation of a table called data frame, the Pandas library provides data analysis features on the data, for more information see https://pandas.pydata.org/.
In order to fetch cases in data frames we should use the parameter as_data_frame=True
.
cases_iterator = cases_client.get_cases(
program=Program.rare_disease, assembly=Assembly.GRCh38,
caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True, limit=10, as_data_frame=True)
cases_data_frame = next(cases_iterator)
Select some columns of interest as follows.
cases_data_frame[["identifier", "version", "creationDate", "owner.gmc",
"countTiered", "countByClassification.pathogenic_variant"]]
Perform some fancy computation very similar to R data frames.
cases_data_frame[cases_data_frame['countByClassification.pathogenic_variant'] > 0].countTiered.mean()
Iteration over cases is different when using data frames as the generators return a whole page in a single data frame. In order to merge multiple pages in a single data frame we would need to append as we iterate over the pages.
cases_iterator = cases_client.get_cases(
program=Program.rare_disease, assembly=Assembly.GRCh38, caseStatuses="ARCHIVED_POSITIVE",
hasCanonicalTrio=True, limit=10, max_results=50, as_data_frame=True)
fifty_cases = pd.DataFrame()
for c in cases_iterator:
fifty_cases = fifty_cases.append(c)
len(fifty_cases)
Easy access to visual data analysis is one of the main advantages of working with Pandas data frames.
%matplotlib inline
fifty_cases[[
'countTiers.TIER1',
'countTiers.TIER2',
'countByClassification.pathogenic_variant',
'countByClassification.likely_pathogenic_variant',
'countPedigreeAnalysisPanels',
'countProbandPresentPhenotypes'
]].plot()
The report events in CVA store data following the model defined here https://gelreportmodels.genomicsengland.co.uk/html_schemas/org.gel.models.cva.avro/1.3.0/ReportEvent.html#/schema/org.gel.models.cva.avro.ReportEventEntry. The objective of a report event is to describe every detail of why a given variant has been highlighted in a given case: the segregation of the variant within the family, the context where the variant was analysed (eg: panel, clinical indication, etc.), the annotations that were used at analysis time (not to mistake with the annotations held in CVA which may be of a more recent version). The same variant in the same case can have multiple report events, for instance when multiple panels were applied to the case. But also, when the variant is subsequently reported and classified in the exit questionnaire we will also have report events describing those.
Also they support the functionality previously described for inclusion and exclusion of fields, pagination, counting and integration with Pandas data frames. The typical use of report events is to fetch all those report events in a given case, in a given gene or genomic region.
# fetch all the report events for a given case
report_events_iterator = report_events_client.get_report_events(
caseId=case_1['identifier'], caseVersion=case_1['version'])
report_event_1 = next(report_events_iterator)
type(report_event_1)
Report events are returned by default in an object of type ReportEventWrapper which facilitates object oriented programming in the highly nested structure of the report event. Nevertheless, this object can be transformed in a Python native data structure.
type(report_event_1.toJsonDict())
Also provides some functionality to navigate the data structure more easily.
variant_1 = report_event_1.get_variant()
type(variant_1)
The default variant representation is in assembly GRCh38
variant_1.get_default_variant_representation().smallVariantCoordinates.toJsonDict()
variant_1.get_variant_representation_by_assembly(Assembly.GRCh37).smallVariantCoordinates.toJsonDict()
The variant object inside the report event does not contain the variant annotations unless specified.
variant_1.get_default_variant_representation().annotation is None
But if we add the parameter fullPopulate=True
we will obtain the Cellbase annotations for all variants. The performance will be degraded when using this option. The annotations follow the OpenCB model described here https://gelreportmodels.genomicsengland.co.uk/html_schemas/org.opencb.biodata.models.variant.avro/1.3.0/variantAnnotation.html#/schema/org.opencb.biodata.models.variant.avro.VariantAnnotation
report_events_iterator = report_events_client.get_report_events(
caseId=case_1['identifier'], caseVersion=case_1['version'], fullPopulate=True)
report_event_2 = next(report_events_iterator)
variant_ann_2 = report_event_2.get_variant().get_default_variant_annotation()
variant_ann_2.consequenceTypes[0].toJsonDict()
# overlapping transcripts
[ct.ensemblTranscriptId for ct in variant_ann_2.consequenceTypes]
# there are some simple features provided by the wrappers
variant_ann_2.get_max_allele_frequency()
# fetch report events on a given region
list(report_events_client.get_report_events(genomicCoordinates="1:12345-100000", max_results=5))
# fetch report events on a given genomic entity
list(report_events_client.get_report_events(gene="ENSG00000139618", max_results=5))
The third primary entity is the variant. The variants follow the model https://gelreportmodels.genomicsengland.co.uk/html_schemas/org.gel.models.cva.avro/1.3.0/CvaVariant.html#/schema/org.gel.models.cva.avro.Variant. This is where we store the biological annotations for the variant irrespective on the observations of the variant on different samples. These are the annotations coming from Cellbase. If there is a valid lift over we store the annotations in both coordinates GRCh37 and GRCh38. Variants can be fetched following certain criteria based on their coordinates and their annotations. Also they support the functionality previously described for inclusion and exclusion of fields, pagination, counting and integration with Pandas data frames.
# fetch the first five variants in gene PKD1
variants = list(variants_client.get_variants(geneSymbols='PKD1', max_results=5))
variant = variants[0]
type(variant)
Variants can also be fetched by identifier.
variants_client.get_variant_by_id(variant.id)
And by lists of ids making parallel requests.
variants_client.get_variants_by_id(identifiers=[v.id for v in variants])
CVA holds variants of different types, small variants and structural variants. We can fetch variants of only a given category.
small_variants = list(variants_client.get_variants(geneSymbols='PKD1', variantCategory='small', max_results=5))
small_variant = small_variants[0]
type(small_variant)
Small variants have the simpler variant coordinates based on chromosome, position, reference and alternate.
print(small_variant.get_default_variant_representation().smallVariantCoordinates.toJsonDict())
While structural variants have a more complex set of coordinates.
structural_variants = []
for v in variants_client.get_variants(geneSymbols='PKD1', variantCategory='structural', max_results=5):
structural_variants.append(v)
structural_variant = structural_variants[0]
type(structural_variant)
print(structural_variant.get_default_variant_representation().structuralVariantCoordinates.toJsonDict())
print(structural_variant.get_default_variant_representation().variantType)
print(structural_variant.id)
Identifiers are unique within the CVA database. They hide a complex logic to identify a variant univocally. They cannot be built from any variant coordinates without those coordinates being first normalised. Also, some long variants (ie: indels greater than 50 bp) are hashed to avoid arbitrarily long identifiers. The variant identifiers have been designed to be human readable when possible. Also, the same identifier refers to a variant that can be represented in different assemblies, when building the identifier coordinates in assembly GRCh38 have preference over GRCh37, unless there are no GRCh38 coordinates. The main conclusion of the previous facts is that different variant coordinates may refer to the same variant, and hence to the same identifier. There is a convenience endpoint from where variant identifiers can be built from a set of variant coordinates once the coordinates have been processed by the normalisation engine, irrespectively of those variants being in the database or not.
NOTE: currently this endpoint only supports small variants
# different coordinates can refer to the same variant id
variant_coordinates_37 = small_variant.get_variant_representation_by_assembly(
assembly=Assembly.GRCh37).smallVariantCoordinates.toJsonDict()
variant_coordinates_38 = small_variant.get_variant_representation_by_assembly(
assembly=Assembly.GRCh38).smallVariantCoordinates.toJsonDict()
print(variants_client.variant_coordinates_to_ids(variant_coordinates=[variant_coordinates_37]))
print(variants_client.variant_coordinates_to_ids(variant_coordinates=[variant_coordinates_38]))
variant_coordinates_37['chromosome'] = "chr" + variant_coordinates_37['chromosome']
print(variants_client.variant_coordinates_to_ids(variant_coordinates=[variant_coordinates_37]))