There are three primary entities in CVA:

  • Cases - a summary of everything we know about a given case, from either rare disease or cancer. It contains lists of variants together with clinical information, in a format that facilitates cohort selection and data analysis.
  • Report events - the observation of a variant in a particular case with its zygosity state and segregation within the family. The same variant may have multiple report events in the same case.
  • Variants - the variant as an abstract concept with its biological annotations.
In [1]:
%run initialise_pyark.py
POST https://bio-test-cva.gel.zone/cva/api/0/authentication?
Response time : 10 ms
pyark version 4.0.4

Fetching cases

The case entity holds a summary of all the information about a given case. For instance, it contains the list of variant identifiers that were tiered, although not the detail of why those specific variants were chosen. It also contains a summary of the clinical information coming from the pedigree and cancer participant models, together with some general flags about the main characteristics of the case: does the case have a positive diagnosis? Does the pedigree include mother-father-child or some other family structure?

The objective of the case entity is to provide an entry point to the dataset held within CVA, where we can easily select cohorts of cases according to a number of different criteria. To deep dive into the details of a case, it is better to fetch all the report events for the case and probably the variant annotations for those variants.

We can obtain an iterator over the cases that match given criteria, which is how a cohort is selected. The iterator takes care of making as many queries as necessary to fetch all desired cases. The iterator is lazy: it fetches data from CVA only when necessary.

In [2]:
cases_iterator = cases_client.get_cases(
    program=Program.rare_disease, assembly=Assembly.GRCh38, 
    caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True)

We can use the iterator to fetch case by case. Cases are returned in a Python dictionary.

In [3]:
case_1 = next(cases_iterator)
print(type(case_1))
print("Case identifier {}-{}".format(case_1['identifier'], case_1['version']))
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&include=__all
Response time : 4780 ms
<class 'dict'>
Case identifier 11019-1

Alternatively, we can iterate through the cases in our cohort with a for loop; the example below stops after three cases.

In [4]:
i = 0
for c in cases_iterator:
    print("Case identifier {}-{}".format(c['identifier'], c['version']))
    i += 1
    if i >= 3:
        break
Case identifier 26401-1
Case identifier 22476-1
Case identifier 27348-1
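
The counter-and-break loop above can also be written with itertools.islice from the standard library, which stops pulling from an iterator after a fixed number of items. The sketch below uses a stand-in generator (fake_cases, a hypothetical name) instead of a real pyark iterator, so it runs without access to CVA.

```python
from itertools import islice


def fake_cases():
    # Hypothetical stand-in for the iterator returned by cases_client.get_cases
    for identifier in ("26401", "22476", "27348", "33569"):
        yield {"identifier": identifier, "version": 1}


# Take at most three cases; the fourth is never requested from the generator
first_three = [
    "{}-{}".format(c["identifier"], c["version"])
    for c in islice(fake_cases(), 3)
]
print(first_three)
```

Because the iterator is lazy, limiting it this way should also avoid fetching pages that are never consumed.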

Pagination over cases

Pagination can be customised using the parameters limit and max_results. limit defines the size of the page: the iterator fetches this number of elements every time it needs more data and stores them in memory. max_results fixes the maximum number of elements that the iterator will ever return, which may be useful, for instance, to fetch only the first ten cases.

In [5]:
cases_iterator = cases_client.get_cases(
    program=Program.rare_disease, assembly=Assembly.GRCh38, 
    caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True, limit=2, max_results=5)
five_cases = []
for c in cases_iterator:
    print("Case identifier {}-{}".format(c['identifier'], c['version']))
    five_cases.append("{}-{}".format(c['identifier'], c['version']))
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=2&include=__all
Response time : 1466 ms
Case identifier 11019-1
Case identifier 26401-1
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=2&include=__all&marker=00a0949d-4ab7-41f1-ad99-1fcc9f960625
Response time : 692 ms
Case identifier 22476-1
Case identifier 27348-1
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=2&include=__all&marker=03da35d8-2fbc-4d7a-a537-2fdebdcdcacd
Response time : 510 ms
Case identifier 33569-1

Pagination happens automatically when using pyark. When making raw REST queries, the coordinates required for the next page are in the response headers, in the fields:

  • X-Pagination-Limit
  • X-Pagination-Marker

To fetch the next page, these values must be passed in the query parameters limit and marker respectively.
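
As a rough illustration of this header-driven protocol, the sketch below follows X-Pagination-Limit and X-Pagination-Marker from one response to the next request. It is not pyark's implementation: send is a hypothetical callable standing in for an HTTP GET, fake_send simulates the server, the marker is faked as a numeric offset (the real marker is an opaque string), and treating a missing marker as the last page is an assumption.

```python
def iterate_pages(send, params, limit=2, max_results=None):
    """Yield items one by one, following the pagination headers.

    `send` is a hypothetical callable that takes a dict of query parameters
    and returns (items, headers); in practice it would wrap an HTTP GET.
    """
    query = dict(params, limit=limit)
    returned = 0
    while max_results is None or returned < max_results:
        items, headers = send(query)
        for item in items:
            yield item
            returned += 1
            if max_results is not None and returned >= max_results:
                return
        marker = headers.get("X-Pagination-Marker")
        if not items or marker is None:
            # Assumption: no marker in the response means there are no more pages
            return
        # Pass the header values back as the `limit` and `marker` parameters
        query = dict(params,
                     limit=headers.get("X-Pagination-Limit", limit),
                     marker=marker)


CASE_IDS = ["11019-1", "26401-1", "22476-1", "27348-1", "33569-1"]
requests_seen = []


def fake_send(query):
    # Simulated server: the marker here is simply the next offset as a string
    requests_seen.append(dict(query))
    start = int(query.get("marker", 0))
    page_size = int(query["limit"])
    page = CASE_IDS[start:start + page_size]
    headers = {"X-Pagination-Limit": str(page_size)}
    if start + page_size < len(CASE_IDS):
        headers["X-Pagination-Marker"] = str(start + page_size)
    return page, headers


result = list(iterate_pages(fake_send, {"program": "rare_disease"},
                            limit=2, max_results=5))
print(result)
print(len(requests_seen), "requests")
```

With limit=2 and max_results=5 the simulated client issues three requests, which mirrors the three GET calls logged by pyark in the pagination example.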

Fetch a specific case by identifier

A case can be fetched by identifier and version using the method get_case.

In [6]:
the_case = cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"])
GET https://bio-test-cva.gel.zone/cva/api/0/cases/11019/1?include=__all
Response time : 5 ms

There is also a method to fetch a list of cases by identifier, get_case_by_identifiers, although it requires identifier and version to be merged as in {identifier}-{version}.

In [7]:
the_five_cases = cases_client.get_case_by_identifiers(identifiers=five_cases)
GET https://bio-test-cva.gel.zone/cva/api/0/cases/11019-1,26401-1,22476-1,27348-1,33569-1?include=__all
Response time : 19 ms

Inclusion/exclusion of fields

There is a default minimal representation of a case, obtained by explicitly passing include_all=False, in which long lists of variants and genes are excluded. We can also include or exclude specific fields ad hoc to make the data lighter, using include_all=False together with either include=["some", "fields"] or exclude=["this", "that"]. include and exclude cannot be used simultaneously.
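
The request URLs logged by pyark suggest how these options map onto query parameters: include=__all for the full case, no parameter for the minimal default, and one repeated include parameter per requested field. The sketch below reproduces that mapping with the standard library only. build_field_params is a hypothetical helper, not part of pyark, and the name of the exclude query parameter is an assumption since no request using it is shown here.

```python
from urllib.parse import urlencode


def build_field_params(include_all=True, include=None, exclude=None):
    """Return the query string selecting which case fields are returned."""
    if include and exclude:
        raise ValueError("include and exclude cannot be used simultaneously")
    if include:
        params = {"include": list(include)}
    elif exclude:
        # Assumption: the REST parameter is called "exclude"
        params = {"exclude": list(exclude)}
    elif include_all:
        params = {"include": ["__all"]}
    else:
        params = {}
    # doseq=True expands a list into one repeated parameter per element
    return urlencode(params, doseq=True)


full_query = build_field_params()
minimal_query = build_field_params(include_all=False)
ids_query = build_field_params(include_all=False, include=["identifier", "version"])
print(repr(full_query), repr(minimal_query), repr(ids_query))
```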

In [8]:
print("Full case size: {}".format(get_size(
    cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"]))))
print("Minimal default size: {}".format(get_size(
    cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"], include_all=False))))
print("Only id and version size: {}".format(get_size(
    cases_client.get_case(identifier=case_1["identifier"], version=case_1["version"], 
                          include_all=False, include=["identifier", "version"]))))
GET https://bio-test-cva.gel.zone/cva/api/0/cases/11019/1?include=__all
Response time : 4 ms
Full case size: 94629
GET https://bio-test-cva.gel.zone/cva/api/0/cases/11019/1?
Response time : 3 ms
Minimal default size: 49963
GET https://bio-test-cva.gel.zone/cva/api/0/cases/11019/1?include=identifier&include=version
Response time : 2 ms
Only id and version size: 12928

There is a predefined method get_cases_ids to obtain only case identifiers.

In [9]:
list(cases_client.get_cases_ids(
    program=Program.rare_disease, assembly=Assembly.GRCh38, 
    caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True, limit=2, max_results=5))
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=2&include=identifier&include=version
Response time : 473 ms
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=2&include=identifier&include=version&marker=00a0949d-4ab7-41f1-ad99-1fcc9f960625
Response time : 351 ms
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=2&include=identifier&include=version&marker=03da35d8-2fbc-4d7a-a537-2fdebdcdcacd
Response time : 149 ms
Out[9]:
['11019-1', '26401-1', '22476-1', '27348-1', '33569-1']

It is out of scope here to go through all the fields available in the case entity, but the following list gives a feel for the data available:

In [10]:
list(case_1.keys())
Out[10]:
['caseIdentifier',
 'identifier',
 'version',
 'latest',
 'program',
 'assembly',
 'pedigreeAnalysisPanels',
 'reportEventsAnalysisPanels',
 'countPedigreeAnalysisPanels',
 'countReportEventsAnalysisPanels',
 'probandDisorders',
 'reportedPhenotypes',
 'owner',
 'assignee',
 'creationDate',
 'tieringAnalysisDate',
 'exomiserAnalysisDate',
 'clinicalReportDate',
 'questionnaireDate',
 'lastModifiedDate',
 'hasClinicalData',
 'hasClinicalReport',
 'hasExitQuestionnaire',
 'tieringVersion',
 'clinicalReportAuthor',
 'questionnaireAuthor',
 'interpretationServices',
 'countAllVariants',
 'countTiers',
 'countTiersGenes',
 'countTiersByPanel',
 'countTiersByPanelGenes',
 'countInterpretationServices',
 'countInterpretationServicesGenes',
 'countTiered',
 'countTieredByVariantType',
 'countReported',
 'countReportedByVariantType',
 'countReportedGenes',
 'countByClassification',
 'countGenesByClassification',
 'countTieredAndClassified',
 'countCandidateAndClassified',
 'countTieringSegregationPatterns',
 'countRejectedVariants',
 'countConfirmedVariants',
 'countInconsistentGroupClassificationVariants',
 'groupId',
 'groupIdInPedigree',
 'cohortId',
 'sampleIds',
 'countSamples',
 'participantIds',
 'countParticipants',
 'participantIdsInPedigree',
 'countParticipantsInPedigree',
 'hasMotherSequenced',
 'hasFatherSequenced',
 'hasCanonicalTrio',
 'probandYearOfBirth',
 'probandSex',
 'probandKaryotipicSex',
 'probandParticipantId',
 'probandConsentStatus',
 'probandEstimatedAgeAtAnalysis',
 'probandAgeOfOnset',
 'probandNumberDisorders',
 'allVariants',
 'reportedVariants',
 'classifiedVariants',
 'inconsistentGroupClassificationVariants',
 'rejectedVariants',
 'confirmedVariants',
 'tieredVariants',
 'tieredVariantsByPanel',
 'candidateVariants',
 'genes',
 'reportedGenes',
 'classifiedGenes',
 'reportedGenesByModeOfInheritance',
 'tieredGenes',
 'tieredGenesByPanel',
 'candidateGenes',
 'tieredAndClassifiedVariants',
 'candidateAndClassifiedVariants',
 'hasPositiveDx',
 'hasNegativeDx',
 'caseSolvedFamily',
 'segregationQuestion',
 'questionnaireAdditionalComments',
 'hasActionable',
 'hasConfirmationDecision',
 'hasPositiveConfirmationOutcome',
 'probandPresentPhenotypes',
 'probandNotPresentPhenotypes',
 'allPresentPhenotypes',
 'allNotPresentPhenotypes',
 'nonProbandPresentPhenotypes',
 'nonProbandNotPresentPhenotypes',
 'explainedPhenotypes',
 'countProbandPresentPhenotypes',
 'countProbandNotPresentPhenotypes',
 'countAllPresentPhenotypes',
 'countAllNotPresentPhenotypes',
 'countNonProbandPresentPhenotypes',
 'countNonProbandNotPresentPhenotypes',
 'countExplainedPhenotypes',
 'probandPresentPhenotypesInformationContent',
 'probandNotPresentPhenotypesInformationContent',
 'interpretation']

Pandas data frames

By default cases are returned as native Python dictionaries, but they can also be returned in a Pandas data frame. Cases are represented in a flattened model that is easily transformed into a tabular format. The Pandas library provides an implementation of a table called a data frame, along with data analysis features over it; for more information see https://pandas.pydata.org/.

To fetch cases in data frames, use the parameter as_data_frame=True.

In [11]:
cases_iterator = cases_client.get_cases(
    program=Program.rare_disease, assembly=Assembly.GRCh38, 
    caseStatuses="ARCHIVED_POSITIVE", hasCanonicalTrio=True, limit=10, as_data_frame=True)
cases_data_frame = next(cases_iterator)
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=10&include=__all
Response time : 334 ms

Select some columns of interest as follows.

In [12]:
cases_data_frame[["identifier", "version", "creationDate", "owner.gmc",
                  "countTiered", "countByClassification.pathogenic_variant"]]
Out[12]:
_index  identifier  version  creationDate                      owner.gmc             countTiered  countByClassification.pathogenic_variant
0       11019       1        2018-04-05 18:51:46.071272+00:00  Yorkshire and Humber  7            1
1       26401       1        2018-09-26 09:34:34.655098+00:00  Wessex                19           0
2       22476       1        2018-07-24 20:17:07.839512+00:00  Wales GMC             259          1
3       27348       1        2018-10-05 13:12:04.287304+00:00  West of England       34           1
4       33569       1        2018-12-04 15:23:20.782954+00:00  South London          12           0
5       33166       1        2018-11-29 20:19:58.178925+00:00  East of England       255          0
6       6010        1        2017-12-14 06:30:11.212112+00:00  Southwest Peninsula   22           0
7       8421        1        2018-02-02 05:24:54.262460+00:00  West of England       22           1
8       7026        1        2017-12-24 20:43:57.048102+00:00  Northwest Coast       26           0
9       36614       1        2019-01-23 22:24:50.611484+00:00  Wales GMC             290          2
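
Dotted column names such as owner.gmc and countByClassification.pathogenic_variant are what a nested dictionary looks like once flattened. The sketch below shows one simple way such a flattening could work; it is not pyark's own code, the flatten helper is hypothetical, and the nested layout of the example case is assumed from the column names.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Illustrative case fragment; the nesting is assumed from the column names
case = {
    "identifier": "11019",
    "version": 1,
    "owner": {"gmc": "Yorkshire and Humber"},
    "countByClassification": {"pathogenic_variant": 1},
}
flat_case = flatten(case)
print(flat_case)
```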

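Data frames support computations in a style very similar to R data frames, for instance the mean number of tiered variants among the cases with at least one pathogenic variant.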

In [13]:
cases_data_frame[cases_data_frame['countByClassification.pathogenic_variant'] > 0].countTiered.mean()
Out[13]:
122.4

Iteration over cases is different when using data frames, as the iterator returns a whole page in a single data frame. To merge multiple pages into a single data frame we need to combine the pages as we iterate over them.

In [14]:
cases_iterator = cases_client.get_cases(
    program=Program.rare_disease, assembly=Assembly.GRCh38, caseStatuses="ARCHIVED_POSITIVE", 
    hasCanonicalTrio=True, limit=10, max_results=50, as_data_frame=True)

# DataFrame.append was removed in pandas 2.0; concatenate the pages instead
fifty_cases = pd.concat(list(cases_iterator))
len(fifty_cases)
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=10&include=__all
Response time : 324 ms
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=10&include=__all&marker=0be3b55c-b4d2-494d-a98d-616ed0f34328
Response time : 316 ms
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=10&include=__all&marker=150af47d-4793-4f01-aa94-9f6fa33a53a1
Response time : 220 ms
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=10&include=__all&marker=21b40874-1da3-40c1-979c-eac3bf6d2faf
Response time : 298 ms
GET https://bio-test-cva.gel.zone/cva/api/0/cases?program=rare_disease&assembly=GRCh38&caseStatuses=ARCHIVED_POSITIVE&hasCanonicalTrio=True&limit=10&include=__all&marker=28486da6-868c-4d5d-a426-e4247e4118d5
Response time : 294 ms
Out[14]:
50

Easy access to visual data analysis is one of the main advantages of working with Pandas data frames.

In [15]:
%matplotlib inline
fifty_cases[[
 'countTiers.TIER1',
 'countTiers.TIER2',
 'countByClassification.pathogenic_variant',
 'countByClassification.likely_pathogenic_variant',
 'countPedigreeAnalysisPanels',
 'countProbandPresentPhenotypes'
 ]].plot()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbd37dfa898>

Fetching report events

The report events in CVA store data following the model defined here https://gelreportmodels.genomicsengland.co.uk/html_schemas/org.gel.models.cva.avro/1.3.0/ReportEvent.html#/schema/org.gel.models.cva.avro.ReportEventEntry. The objective of a report event is to describe in detail why a given variant has been highlighted in a given case: the segregation of the variant within the family, the context in which the variant was analysed (e.g. panel, clinical indication), and the annotations that were used at analysis time (not to be confused with the annotations held in CVA, which may be of a more recent version). The same variant in the same case can have multiple report events, for instance when multiple panels were applied to the case. When the variant is subsequently reported and classified in the exit questionnaire, there will also be report events describing those steps.

Report events also support the functionality previously described for inclusion and exclusion of fields, pagination, counting and integration with Pandas data frames. The typical use is to fetch all the report events in a given case, gene or genomic region.
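
Since one variant can carry several report events within a case, a common first step is to group events by variant. The sketch below does this over deliberately simplified, made-up records: the keys variantId and context and the panel names are hypothetical and do not reflect the real, deeply nested ReportEventEntry model.

```python
from collections import defaultdict

# Hypothetical, simplified report events; real entries are deeply nested
report_events = [
    {"variantId": "GRCh38:16:2111052:G:A", "context": "panel A"},
    {"variantId": "GRCh38:16:2111052:G:A", "context": "panel B"},
    {"variantId": "GRCh38:16:2097403:G:C", "context": "panel A"},
]

# Collect the analysis contexts in which each variant was highlighted
events_by_variant = defaultdict(list)
for event in report_events:
    events_by_variant[event["variantId"]].append(event["context"])

for variant_id, contexts in events_by_variant.items():
    print(variant_id, len(contexts), contexts)
```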

In [16]:
# fetch all the report events for a given case
report_events_iterator = report_events_client.get_report_events(
    caseId=case_1['identifier'], caseVersion=case_1['version'])
report_event_1 = next(report_events_iterator)
type(report_event_1)
GET https://bio-test-cva.gel.zone/cva/api/0/report-events?caseId=11019&caseVersion=1&include=__all
Response time : 41 ms
Out[16]:
pyark.models.wrappers.ReportEventEntryWrapper

Report events are returned by default as objects of type ReportEventEntryWrapper, which facilitates object-oriented programming over the highly nested structure of the report event. Nevertheless, this object can be transformed into a native Python data structure.

In [17]:
type(report_event_1.toJsonDict())
Out[17]:
dict

The wrapper also provides some functionality to navigate the data structure more easily.

In [18]:
variant_1 = report_event_1.get_variant()
type(variant_1)
Out[18]:
pyark.models.wrappers.VariantWrapper

The default variant representation is in assembly GRCh38, but the representation in another assembly can be requested explicitly.

In [19]:
variant_1.get_default_variant_representation().smallVariantCoordinates.toJsonDict()
Out[19]:
{'chromosome': '2',
 'position': 227695943,
 'reference': 'T',
 'alternate': 'C',
 'assembly': 'GRCh38'}
In [20]:
variant_1.get_variant_representation_by_assembly(Assembly.GRCh37).smallVariantCoordinates.toJsonDict()
Out[20]:
{'chromosome': '2',
 'position': 228560659,
 'reference': 'T',
 'alternate': 'C',
 'assembly': 'GRCh37'}

The variant object inside the report event does not contain the variant annotations unless specified.

In [21]:
variant_1.get_default_variant_representation().annotation is None
Out[21]:
True

If we add the parameter fullPopulate=True, we obtain the Cellbase annotations for all variants, at the cost of degraded performance. The annotations follow the OpenCB model described here https://gelreportmodels.genomicsengland.co.uk/html_schemas/org.opencb.biodata.models.variant.avro/1.3.0/variantAnnotation.html#/schema/org.opencb.biodata.models.variant.avro.VariantAnnotation

In [22]:
report_events_iterator = report_events_client.get_report_events(
    caseId=case_1['identifier'], caseVersion=case_1['version'], fullPopulate=True)
report_event_2 = next(report_events_iterator)
GET https://bio-test-cva.gel.zone/cva/api/0/report-events?caseId=11019&caseVersion=1&fullPopulate=True&include=__all
Response time : 226 ms
In [23]:
variant_ann_2 = report_event_2.get_variant().get_default_variant_annotation()
variant_ann_2.consequenceTypes[0].toJsonDict()
Out[23]:
{'geneName': 'SLC19A3',
 'ensemblGeneId': 'ENSG00000135917',
 'ensemblTranscriptId': 'ENST00000425817',
 'strand': '-',
 'biotype': 'nonsense_mediated_decay',
 'exonOverlap': [{'number': '4/7', 'percentage': 0.5181347}],
 'transcriptAnnotationFlags': ['CCDS'],
 'cdnaPosition': 1188,
 'cdsPosition': 1118,
 'codon': 'tAt/tGt',
 'proteinVariantAnnotation': {'uniprotAccession': 'Q9BZV2',
  'uniprotName': None,
  'position': 373,
  'reference': 'TYR',
  'alternate': 'CYS',
  'uniprotVariantId': None,
  'functionalDescription': None,
  'substitutionScores': [],
  'keywords': ['Complete proteome',
   'Disease mutation',
   'Glycoprotein',
   'Membrane',
   'Polymorphism',
   'Reference proteome',
   'Transmembrane',
   'Transmembrane helix',
   'Transport'],
  'features': [{'id': 'IPR002666',
    'start': 11,
    'end': 489,
    'type': None,
    'description': 'Reduced folate carrier'},
   {'id': 'IPR020846',
    'start': 260,
    'end': 455,
    'type': None,
    'description': 'Major facilitator superfamily domain'},
   {'id': 'IPR002666',
    'start': 1,
    'end': 492,
    'type': None,
    'description': 'Reduced folate carrier'},
   {'id': None,
    'start': 364,
    'end': 375,
    'type': 'topological domain',
    'description': 'Extracellular'},
   {'id': 'IPR028337',
    'start': 1,
    'end': 496,
    'type': 'chain',
    'description': 'Thiamine transporter 2'},
   {'id': 'IPR002666',
    'start': 11,
    'end': 441,
    'type': None,
    'description': 'Reduced folate carrier'},
   {'id': 'IPR002666',
    'start': 1,
    'end': 491,
    'type': None,
    'description': 'Reduced folate carrier'}]},
 'sequenceOntologyTerms': [{'accession': 'SO:0001583',
   'name': 'missense_variant'},
  {'accession': 'SO:0001621', 'name': 'NMD_transcript_variant'}]}
In [24]:
# overlapping transcripts
[ct.ensemblTranscriptId for ct in variant_ann_2.consequenceTypes]
Out[24]:
['ENST00000425817',
 'ENST00000409287',
 'ENST00000258403',
 'ENST00000456524',
 'ENST00000431622',
 'ENST00000419059',
 'ENST00000409456']
In [25]:
# there are some simple features provided by the wrappers
variant_ann_2.get_max_allele_frequency()
Out[25]:
0.0005
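
The pyark source is not shown here, so the following is only a guess at what a helper like get_max_allele_frequency might compute: the highest alternate allele frequency across the population frequency entries of an annotation. The field names populationFrequencies, study, population and altAlleleFreq are taken from the OpenCB variant annotation model; the study names and frequency values are made up for illustration.

```python
def max_allele_frequency(annotation):
    """Return the highest alternate allele frequency, or None if there is none."""
    frequencies = [
        pf["altAlleleFreq"]
        for pf in annotation.get("populationFrequencies") or []
        if pf.get("altAlleleFreq") is not None
    ]
    return max(frequencies) if frequencies else None


# Made-up annotation fragment with two population frequency entries
annotation = {
    "populationFrequencies": [
        {"study": "STUDY_A", "population": "ALL", "altAlleleFreq": 0.0001},
        {"study": "STUDY_B", "population": "ALL", "altAlleleFreq": 0.0005},
    ]
}
max_frequency = max_allele_frequency(annotation)
print(max_frequency)
```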
In [26]:
# fetch report events on a given region
list(report_events_client.get_report_events(genomicCoordinates="1:12345-100000", max_results=5))
GET https://bio-test-cva.gel.zone/cva/api/0/report-events?genomicCoordinates=1:12345-100000&include=__all
Response time : 18886 ms
Out[26]:
[<pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd35bf2330>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd357eb520>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd357eb618>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd357eb710>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd357eb808>]
In [27]:
# fetch report events on a given genomic entity
list(report_events_client.get_report_events(gene="ENSG00000139618", max_results=5))
GET https://bio-test-cva.gel.zone/cva/api/0/report-events?gene=ENSG00000139618&include=__all
Response time : 2805 ms
Out[27]:
[<pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd35423428>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd35423520>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd35423618>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd35423710>,
 <pyark.models.wrappers.ReportEventEntryWrapper at 0x7fbd35423808>]

Fetching variants

The third primary entity is the variant. The variants follow the model https://gelreportmodels.genomicsengland.co.uk/html_schemas/org.gel.models.cva.avro/1.3.0/CvaVariant.html#/schema/org.gel.models.cva.avro.Variant. This is where we store the biological annotations for the variant, irrespective of the observations of the variant in different samples. These are the annotations coming from Cellbase. If there is a valid lift over, we store the annotations in both GRCh37 and GRCh38 coordinates. Variants can be fetched following certain criteria based on their coordinates and their annotations. They also support the functionality previously described for inclusion and exclusion of fields, pagination, counting and integration with Pandas data frames.

In [28]:
# fetch the first five variants in gene PKD1
variants = list(variants_client.get_variants(geneSymbols='PKD1', max_results=5))
variant = variants[0]
type(variant)
GET https://bio-test-cva.gel.zone/cva/api/0/variants?geneSymbols=PKD1&include=__all
Response time : 307 ms
Out[28]:
pyark.models.wrappers.VariantWrapper

Variants can also be fetched by identifier.

In [29]:
variants_client.get_variant_by_id(variant.id)
GET https://bio-test-cva.gel.zone/cva/api/0/variants/GRCh38:16:2111052:G:A?include=__all
Response time : 7 ms
Out[29]:
<pyark.models.wrappers.VariantWrapper at 0x7fbd3c66cf78>

Variants can also be fetched by a list of identifiers, in which case the requests are made in parallel.

In [30]:
variants_client.get_variants_by_id(identifiers=[v.id for v in variants])
GET https://bio-test-cva.gel.zone/cva/api/0/variants/GRCh38:16:2097403:G:C?include=__all
GET https://bio-test-cva.gel.zone/cva/api/0/variants/GRCh38:deletion:16:2136124:2174248?include=__all
Response time : 5 ms
Response time : 5 ms
GET https://bio-test-cva.gel.zone/cva/api/0/variants/GRCh38:16:2111218:C:T?include=__all
GET https://bio-test-cva.gel.zone/cva/api/0/variants/GRCh38:16:2091130:G:C?include=__all
Response time : 6 ms
GET https://bio-test-cva.gel.zone/cva/api/0/variants/GRCh38:16:2111052:G:A?include=__all
Response time : 9 ms
Response time : 21 ms
Out[30]:
[<pyark.models.wrappers.VariantWrapper at 0x7fbd37d170d8>,
 <pyark.models.wrappers.VariantWrapper at 0x7fbd35cc4798>,
 <pyark.models.wrappers.VariantWrapper at 0x7fbd359aab88>,
 <pyark.models.wrappers.VariantWrapper at 0x7fbd3594c630>,
 <pyark.models.wrappers.VariantWrapper at 0x7fbd35cc2d80>]
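
The interleaved GET and response-time lines in the log above are typical of concurrent requests. How pyark parallelises them is not shown here; the sketch below illustrates one common approach with a thread pool from the standard library. fetch_variant is a hypothetical stand-in for a single request and returns a placeholder dictionary instead of contacting CVA.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_variant(variant_id):
    # Hypothetical stand-in for one GET /variants/{id} request
    return {"id": variant_id}


def fetch_variants(identifiers, max_workers=4):
    """Fetch several variants concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the order of the inputs,
        # even when the underlying calls complete out of order
        return list(executor.map(fetch_variant, identifiers))


ids = ["GRCh38:16:2111052:G:A", "GRCh38:16:2097403:G:C",
       "GRCh38:deletion:16:2136124:2174248"]
fetched = fetch_variants(ids)
print([v["id"] for v in fetched])
```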

CVA holds variants of different types: small variants and structural variants. We can fetch only the variants of a given category.

In [31]:
small_variants = list(variants_client.get_variants(geneSymbols='PKD1', variantCategory='small', max_results=5))
small_variant = small_variants[0]
type(small_variant)
GET https://bio-test-cva.gel.zone/cva/api/0/variants?geneSymbols=PKD1&variantCategory=small&include=__all
Response time : 117 ms
Out[31]:
pyark.models.wrappers.VariantWrapper

Small variants have the simpler variant coordinates based on chromosome, position, reference and alternate.

In [32]:
print(small_variant.get_default_variant_representation().smallVariantCoordinates.toJsonDict())
{'chromosome': '16', 'position': 2111052, 'reference': 'G', 'alternate': 'A', 'assembly': 'GRCh38'}

Structural variants, in contrast, have a more complex set of coordinates.

In [33]:
structural_variants = []
for v in variants_client.get_variants(geneSymbols='PKD1', variantCategory='structural', max_results=5):
    structural_variants.append(v)
structural_variant = structural_variants[0]
type(structural_variant)
GET https://bio-test-cva.gel.zone/cva/api/0/variants?geneSymbols=PKD1&variantCategory=structural&include=__all
Response time : 111 ms
Out[33]:
pyark.models.wrappers.VariantWrapper
In [34]:
print(structural_variant.get_default_variant_representation().structuralVariantCoordinates.toJsonDict())
print(structural_variant.get_default_variant_representation().variantType)
{'assembly': 'GRCh38', 'chromosome': '16', 'start': 2136124, 'end': 2174248, 'ciStart': None, 'ciEnd': None}
deletion
In [35]:
print(structural_variant.id)
GRCh38:deletion:16:2136124:2174248

Identifiers are unique within the CVA database. They hide complex logic that identifies a variant unambiguously, and they cannot be built from variant coordinates unless those coordinates have first been normalised. The identifiers have been designed to be human readable when possible, but some long variants (i.e. indels greater than 50 bp) are hashed to avoid arbitrarily long identifiers. The same identifier refers to a variant that can be represented in different assemblies; when building the identifier, coordinates in assembly GRCh38 take preference over GRCh37, unless there are no GRCh38 coordinates. The main consequence is that different variant coordinates may refer to the same variant, and hence to the same identifier. There is a convenience endpoint that builds variant identifiers from a set of variant coordinates once the coordinates have been processed by the normalisation engine, irrespective of whether those variants are in the database.

NOTE: this endpoint currently supports only small variants.

In [36]:
# different coordinates can refer to the same variant id
variant_coordinates_37 = small_variant.get_variant_representation_by_assembly(
    assembly=Assembly.GRCh37).smallVariantCoordinates.toJsonDict()
variant_coordinates_38 = small_variant.get_variant_representation_by_assembly(
    assembly=Assembly.GRCh38).smallVariantCoordinates.toJsonDict()
print(variants_client.variant_coordinates_to_ids(variant_coordinates=[variant_coordinates_37]))
print(variants_client.variant_coordinates_to_ids(variant_coordinates=[variant_coordinates_38]))
variant_coordinates_37['chromosome'] = "chr" + variant_coordinates_37['chromosome']
print(variants_client.variant_coordinates_to_ids(variant_coordinates=[variant_coordinates_37]))
POST https://bio-test-cva.gel.zone/cva/api/0/variants/identifiers-from-small-variant-coordinates?
Response time : 3 ms
['GRCh38:16:2111052:G:A']
POST https://bio-test-cva.gel.zone/cva/api/0/variants/identifiers-from-small-variant-coordinates?
Response time : 2 ms
['GRCh38:16:2111052:G:A']
POST https://bio-test-cva.gel.zone/cva/api/0/variants/identifiers-from-small-variant-coordinates?
Response time : 1 ms
['GRCh38:16:2111052:G:A']
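
For display or logging purposes it can be handy to split a human-readable identifier back into its parts. The sketch below handles only the two shapes printed in this section, assembly:chromosome:position:reference:alternate for small variants and assembly:type:chromosome:start:end for structural variants; these layouts are inferred from the examples and are not a documented format. parse_variant_id is a hypothetical helper that returns None for anything else, such as hashed identifiers. It does not replace the endpoint above: going from coordinates to identifiers still requires normalisation.

```python
def parse_variant_id(variant_id):
    """Split a readable identifier into its parts; return None if unrecognised."""
    parts = variant_id.split(":")
    if len(parts) != 5:
        return None
    # Structural shape first: its last two fields are both numeric
    if parts[3].isdigit() and parts[4].isdigit():
        assembly, variant_type, chromosome, start, end = parts
        return {"category": "structural", "assembly": assembly,
                "variantType": variant_type, "chromosome": chromosome,
                "start": int(start), "end": int(end)}
    # Small variant shape: the third field is the numeric position
    if parts[2].isdigit():
        assembly, chromosome, position, reference, alternate = parts
        return {"category": "small", "assembly": assembly,
                "chromosome": chromosome, "position": int(position),
                "reference": reference, "alternate": alternate}
    return None


small = parse_variant_id("GRCh38:16:2111052:G:A")
structural = parse_variant_id("GRCh38:deletion:16:2136124:2174248")
print(small)
print(structural)
```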