%run initialise_pyark.py
We store the counts of report events with each segregation pattern in a case in the field `countTieringSegregationPatterns`. We can then use the `filter` parameter to apply arithmetic filters to those values.
The main difficulty with the `filter` parameter is that you need to know the underlying database name of the field you want to filter by. The other filters exposed by the REST API, by contrast, are listed in Swagger, which is the documentation of the REST API.
cases_client.count(program="rare_disease", filter="countTieringSegregationPatterns.deNovo gt 0")
cases_client.count(program="rare_disease", filter="countTieringSegregationPatterns.MitochondrialGenome gt 0")
cases_client.count(program="rare_disease", filter="countTieringSegregationPatterns.UniparentalIsodisomy gt 0")
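The filter expressions above all share a `<field> <operator> <value>` shape. A small helper can make them less error-prone to assemble. The following is only a sketch: `build_filter` and `ALLOWED_OPERATORS` are hypothetical names, not part of pyark, and the operator set is assumed from the operators that appear in this notebook (`gt`, `ge`, `lt`, `le`, `eq`).

```python
# Hypothetical helper, not part of pyark: assembles "<field> <operator> <value>" strings.
# The operator set is an assumption taken from the examples in this notebook.
ALLOWED_OPERATORS = {"gt", "ge", "lt", "le", "eq"}


def build_filter(field, operator, value):
    """Return a filter expression such as 'countTiered gt 0'."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError("unsupported operator: {}".format(operator))
    # String values are wrapped in single quotes, as in the gene filter example
    if isinstance(value, str):
        value = "'{}'".format(value)
    return "{} {} {}".format(field, operator, value)


denovo_filter = build_filter("countTieringSegregationPatterns.deNovo", "gt", 0)
print(denovo_filter)
```

The resulting string can be passed as the `filter` argument of `cases_client.count`.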
Once we have the query that selects our cohort, we can fetch those cases. To avoid fetching them all at once, use the `limit` parameter to set the number of cases fetched in a single call. The maximum value of `limit` is 200; any greater value is treated as 200. With `as_data_frame=True` the cases are returned as a Pandas data frame; otherwise they are returned as a native Python list of dictionaries.
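To illustrate the batching behaviour described above without contacting the server, here is a minimal sketch. `fake_case_batches` is a hypothetical stand-in for the client call, not pyark code; it only mimics the capped `limit` and the list-of-dictionaries return type.

```python
from itertools import chain

MAX_LIMIT = 200  # maximum batch size stated in the text above


def fake_case_batches(total_cases, limit):
    """Hypothetical stand-in for a paginated call: yields lists of case dictionaries."""
    limit = min(limit, MAX_LIMIT)  # values above the maximum are treated as the maximum
    for start in range(0, total_cases, limit):
        stop = min(start + limit, total_cases)
        yield [{"identifier": str(i), "version": 1} for i in range(start, stop)]


batches = fake_case_batches(total_cases=450, limit=1000)
first_batch = next(batches)                     # one call returns at most MAX_LIMIT cases
all_cases = list(chain(first_batch, *batches))  # consume the remaining batches
print(len(first_batch), len(all_cases))
```

Inspecting the first batch before consuming the rest is a cheap way to check the query before downloading the whole cohort.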
For this example we add some further filtering to drill down into a smaller set of cases and keep the data size manageable.
cases_client.count(program="rare_disease", filter="countTieringSegregationPatterns.deNovo gt 150")
# fetch cases in batches of 100
denovo_cases_generator = cases_client.get_cases(program="rare_disease", hasPositiveDx=True, filter="countTieringSegregationPatterns.deNovo gt 150", limit=100, as_data_frame=True)
# fetch the first batch
denovo_cases = next(denovo_cases_generator)
denovo_cases[["identifier", "version", "countTiered", "tieringVersion", "hasExitQuestionnaire", "hasPositiveDx"]].head()
# get them all (DataFrame.append was removed in pandas 2.0, so use pandas.concat)
import pandas as pd
for c in denovo_cases_generator:
    denovo_cases = pd.concat([denovo_cases, c])
denovo_cases.shape
Now extract relevant information like case ids or count of tiered variants.
denovo_cases[["identifier", "version", "countTiered", "tieringVersion", "hasExitQuestionnaire", "hasPositiveDx"]].head()
The count of de novo variants in tiers 1 and 2 is not available in the case entity, so we need to query the report events of each case to infer it.
# this makes one query per case
denovo_cases["countDeNovoTier12"] = denovo_cases.apply(lambda c: report_events_client.count(caseId=c.identifier, caseVersion=c.version, segregationPattern="deNovo", tiers=["TIER1", "TIER2"]), axis=1)
denovo_cases[denovo_cases["countDeNovoTier12"] > 0].shape
denovo_cases[denovo_cases["countDeNovoTier12"] > 0][["identifier", "version", "countTiered", "countDeNovoTier12", "hasExitQuestionnaire", "hasPositiveDx"]].head()
We store the number of tiered variants by variant type as follows:
"countTieredByVariantType" : {
    "MNV" : 0,
    "DELETION" : 11,
    "INSERTION" : 6,
    "INDEL" : 17,
    "SNV" : 247
},
We can fetch cases with an unusual number of variants of a certain type by querying this data structure.
NOTE: for cases with multiple panels applied, the same variant may be reported multiple times, but the counts above are unique: the same variant is not counted more than once.
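The idea of unique counts can be sketched as follows. The report events and their keys (`variant`, `type`, `panel`) are invented for illustration and do not reflect the real report event schema; the point is only that a variant reported under several panels contributes once to the count.

```python
from collections import Counter

# Hypothetical report events: the same SNV is reported under two panels
report_events = [
    {"variant": "chr1:12345:A:T", "type": "SNV", "panel": "panel_a"},
    {"variant": "chr1:12345:A:T", "type": "SNV", "panel": "panel_b"},
    {"variant": "chr2:500:AT:A", "type": "DELETION", "panel": "panel_a"},
]

# Deduplicate on (variant, type) so each variant is counted once regardless of panel
unique_variants = {(e["variant"], e["type"]) for e in report_events}
count_by_type = Counter(variant_type for _, variant_type in unique_variants)
print(dict(count_by_type))
```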
cases_client.count(program="rare_disease", filter="countTieredByVariantType.INSERTION gt 50")
cases_client.count(program="rare_disease", filter="countTieredByVariantType.DELETION gt 50")
We can look at the distribution of these values in a previously downloaded cohort.
# NOTE: we are missing the count of panels as precomputed value in cases, we will add it soon...
denovo_cases["countPanels"] = denovo_cases["reportEventsAnalysisPanels"].apply(lambda p: len(p))
denovo_cases[["countPanels", "countTieredByVariantType.DELETION", "countTieredByVariantType.INSERTION", "countTieredByVariantType.SNV"]]
denovo_cases["countTieredByVariantType.DELETION"].plot.kde()
denovo_cases["countTieredByVariantType.INSERTION"].plot.kde()
There are a number of fields related to the clinical data and the case status: `hasCanonicalTrio`, `hasFatherSequenced`, `hasMotherSequenced`, `probandYearOfBirth`, `probandSex`, `probandKaryotipicSex`, `probandEstimatedAgeAtAnalysis`, `probandAgeOfOnset`, `probandNumberDisorders`, `hasPositiveDx`, `hasNegativeDx`, `caseSolvedFamily`, `segregationQuestion`, `hasActionable`, `hasConfirmationDecision`, `hasPositiveConfirmationOutcome`, `countProbandPresentPhenotypes`, `hasClinicalReport`, `hasExitQuestionnaire`, `countParticipants`, `countSamples`.
cases_client.count(program="rare_disease", hasCanonicalTrio=True, hasPositiveDx=True)
cases_client.count(program="rare_disease", hasCanonicalTrio=True, hasPositiveDx=False)
# get summary of cases to compare solved versus unsolved cases with a clinical trio
cases_client.get_summary(params_list=[
    {'program':'rare_disease', 'hasCanonicalTrio':True, 'hasPositiveDx':True},
    {'program':'rare_disease', 'hasCanonicalTrio':True, 'hasPositiveDx':False}],
    as_data_frame=True)[['diagnosticRate', 'countCases']]
# get summary of cases with different number of participants in the family
cases_client.get_summary(params_list=[
    {'program':'rare_disease', 'filter':'countParticipants eq 1'},
    {'program':'rare_disease', 'filter':'countParticipants eq 2'},
    {'program':'rare_disease', 'filter':'countParticipants eq 3'},
    {'program':'rare_disease', 'filter':'countParticipants eq 4'},
    {'program':'rare_disease', 'filter':'countParticipants eq 5'},
    {'program':'rare_disease', 'filter':'countParticipants eq 6'}
], as_data_frame=True)[['diagnosticRate', 'countCases']]
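When the queries differ only in one number, the parameter list can be generated rather than written out. This sketch builds the same six dictionaries used in the call above; only the variable name `params_list` is borrowed from the keyword argument, nothing else is assumed.

```python
# Same six queries as in the call above, generated instead of written out by hand
params_list = [
    {"program": "rare_disease", "filter": "countParticipants eq {}".format(n)}
    for n in range(1, 7)
]
print(params_list[0])
```

The list can then be passed as `params_list=params_list` to `cases_client.get_summary`.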
# get summary of cases with different proband age ranges
cases_client.get_summary(params_list=[
    {'program':'rare_disease', 'filter':'probandEstimatedAgeAtAnalysis lt 2'},
    {'program':'rare_disease',
     'filter':'probandEstimatedAgeAtAnalysis lt 10 and probandEstimatedAgeAtAnalysis ge 2'},
    {'program':'rare_disease',
     'filter':'probandEstimatedAgeAtAnalysis lt 18 and probandEstimatedAgeAtAnalysis ge 10'},
    {'program':'rare_disease', 'filter':'probandEstimatedAgeAtAnalysis ge 18'}
], as_data_frame=True)[['diagnosticRate', 'countCases']]
# get summary of cases with different count of phenotypes
cases_client.get_summary(params_list=[
    {'program':'rare_disease', 'filter':'countProbandPresentPhenotypes eq 1'},
    {'program':'rare_disease', 'filter':'countProbandPresentPhenotypes eq 2'},
    {'program':'rare_disease', 'filter':'countProbandPresentPhenotypes eq 3'},
    {'program':'rare_disease', 'filter':'countProbandPresentPhenotypes eq 4'},
    {'program':'rare_disease', 'filter':'countProbandPresentPhenotypes eq 5'},
    {'program':'rare_disease', 'filter':'countProbandPresentPhenotypes eq 6'}
], as_data_frame=True)[['diagnosticRate', 'countCases']]
# get summary of cases with different values of information content
cases_client.get_summary(params_list=[
    {'program':'rare_disease', 'filter':'probandPresentPhenotypesInformationContent le 1'},
    {'program':'rare_disease',
     'filter':'probandPresentPhenotypesInformationContent le 2 and probandPresentPhenotypesInformationContent gt 1'},
    {'program':'rare_disease',
     'filter':'probandPresentPhenotypesInformationContent le 3 and probandPresentPhenotypesInformationContent gt 2'},
    {'program':'rare_disease',
     'filter':'probandPresentPhenotypesInformationContent le 4 and probandPresentPhenotypesInformationContent gt 3'},
    {'program':'rare_disease',
     'filter':'probandPresentPhenotypesInformationContent le 5 and probandPresentPhenotypesInformationContent gt 4'},
    {'program':'rare_disease',
     'filter':'probandPresentPhenotypesInformationContent le 6 and probandPresentPhenotypesInformationContent gt 5'},
    {'program':'rare_disease',
     'filter':'probandPresentPhenotypesInformationContent le 7 and probandPresentPhenotypesInformationContent gt 6'},
    {'program':'rare_disease',
     'filter':'probandPresentPhenotypesInformationContent gt 7'}
], as_data_frame=True)[['diagnosticRate', 'countCases']]
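Range filters like the ones above follow a regular pattern, so they can be derived from a list of bin edges. The helper below is a hypothetical sketch (`bin_filters` is not part of pyark) that reproduces the eight information content filters of the previous call.

```python
def bin_filters(field, edges):
    """Build filters for the bins (-inf, e0], (e0, e1], ..., (en, +inf)."""
    filters = ["{} le {}".format(field, edges[0])]
    for lower, upper in zip(edges, edges[1:]):
        filters.append("{f} le {u} and {f} gt {l}".format(f=field, u=upper, l=lower))
    filters.append("{} gt {}".format(field, edges[-1]))
    return filters


field = "probandPresentPhenotypesInformationContent"
ic_filters = bin_filters(field, list(range(1, 8)))
params_list = [{"program": "rare_disease", "filter": f} for f in ic_filters]
print(len(params_list))
```

Because every upper bound uses `le` and every lower bound uses `gt`, the bins do not overlap and each value falls in exactly one of them.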
## Select cases with a positive Dx by diagnostic gene
cases_client.count(
    hasPositiveDx=True,
    filter="classifiedGenes.pathogenic_variant eq 'ENSG00000008710' or classifiedGenes.likely_pathogenic_variant eq 'ENSG00000008710'")
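To run the same query for other genes, the filter string can be built from the Ensembl gene identifier. `gene_filter` is a hypothetical helper, not a pyark function; it only reproduces the string used in the query above.

```python
def gene_filter(ensembl_gene_id):
    """Filter for cases with a pathogenic or likely pathogenic variant in the gene."""
    fields = (
        "classifiedGenes.pathogenic_variant",
        "classifiedGenes.likely_pathogenic_variant",
    )
    # One "eq" clause per classification, combined with "or"
    return " or ".join("{} eq '{}'".format(f, ensembl_gene_id) for f in fields)


print(gene_filter("ENSG00000008710"))
```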
## Select cases with a positive Dx from a tier 3
cases_client.count(
    hasPositiveDx=True,
    filter="countTieredAndClassified.TIER3.pathogenic_variant gt 0 and " +
           "countTieredAndClassified.TIER1.pathogenic_variant eq 0 and " +
           "countTieredAndClassified.TIER1.likely_pathogenic_variant eq 0 and " +
           "countTieredAndClassified.TIER2.pathogenic_variant eq 0 and " +
           "countTieredAndClassified.TIER2.likely_pathogenic_variant eq 0")
cases_client.count(
    hasPositiveDx=True,
    filter="countTieredAndClassified.TIER3.likely_pathogenic_variant gt 0 and " +
           "countTieredAndClassified.TIER1.pathogenic_variant eq 0 and " +
           "countTieredAndClassified.TIER1.likely_pathogenic_variant eq 0 and " +
           "countTieredAndClassified.TIER2.pathogenic_variant eq 0 and " +
           "countTieredAndClassified.TIER2.likely_pathogenic_variant eq 0")