FHIR-Aggregator TCGA-BRCA Cohort Survival Analysis¶
This notebook demonstrates how to extract breast cancer patient data from TCGA and compare the Kaplan-Meier survival curves for two groups: white and African American patients aged 50 or younger.
Install necessary packages¶
In [ ]:
!pip install lifelines -q
In [ ]:
pip install git+https://github.com/FHIR-Aggregator/fhir-query.git
Resolved https://github.com/FHIR-Aggregator/fhir-query.git to commit 7a62f40d3f04f0fd8bc85e2696a92667088ac293
Successfully built fhir_query halo
Successfully installed click_default_group-1.2.4 colorama-0.4.6 dotty-dict-1.3.1 fhir-core-1.0.0 fhir.resources-8.0.0b4 fhir_query-0.1.0 halo-0.0.31 inflection-0.5.1 jedi-0.19.2 log_symbols-0.0.14 pyvis-0.3.2 spinners-0.0.24
Use FHIR-Aggregator to retrieve the necessary data¶
Retrieve a pre-defined set of queries, a GraphDefinition¶
In [ ]:
!wget https://raw.githubusercontent.com/FHIR-Aggregator/fhir-query/refs/heads/main/graph-definitions/R5/ResearchStudyGraph.yaml
--2025-02-27 20:13:09-- https://raw.githubusercontent.com/FHIR-Aggregator/fhir-query/refs/heads/main/graph-definitions/R5/ResearchStudyGraph.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1840 (1.8K) [text/plain]
Saving to: ‘ResearchStudyGraph.yaml’
2025-02-27 20:13:09 (28.5 MB/s) - ‘ResearchStudyGraph.yaml’ saved [1840/1840]
Export TCGA-BRCA data to a local database¶
In [ ]:
%env FHIR_BASE=https://google-fhir.test-fhir-aggregator.org
# export a study using a set of stored queries
!fq --fhir-base-url $FHIR_BASE --graph-definition-file-path ResearchStudyGraph.yaml --path '/ResearchStudy?identifier=TCGA-BRCA'
env: FHIR_BASE=https://google-fhir.test-fhir-aggregator.org
research-study-graph is valid FHIR R5 GraphDefinition
✔ Fetching https://google-fhir.test-fhir-aggregator.org/ResearchStudy?identifier=TCGA-BRCA
✔ Processing link: ResearchSubject/study={path} with 1 ResearchStudy(s)
✔ Processing link: Group/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: Patient/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: Specimen/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: Observation/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: Procedure/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: DocumentReference/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: ServiceRequest/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: ImagingStudy/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✔ Processing link: Condition/part-of-study={path}&_count=1000&_total=accurate with 1 ResearchStudy(s)
✖ Could not find any resources for MedicationAdministration->Medication link: {'params': '_id={path}&_count=1000&_total=accurate', 'path': 'MedicationAdministration.medication.reference.reference', 'sourceId': 'MedicationAdministration', 'targetId': 'Medication'}
Aggregated Results: {'Condition': 1191, 'DocumentReference': 68962, 'Group': 19, 'ImagingStudy': 3171, 'MedicationAdministration': 5879, 'Observation': 56136, 'Patient': 1098, 'Procedure': 3397, 'ResearchStudy': 1, 'ResearchSubject': 1098, 'ServiceRequest': 64655, 'Specimen': 31984}
database available at: /tmp/fhir-graph.sqlite
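Before building the dataframe, it can be useful to check what the export wrote to disk. The sketch below uses only Python's standard `sqlite3` module and makes no assumption about the schema that `fq` creates: it simply lists whatever tables exist and counts their rows. The helper name `count_rows_per_table` is hypothetical and not part of fhir-query.

```python
import sqlite3
from contextlib import closing


def count_rows_per_table(db_path):
    """Return {table_name: row_count} for every table in a SQLite file.

    Hypothetical helper: it discovers the tables at run time rather than
    assuming any particular layout of the exported database.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            # Table names come from sqlite_master itself, so quoting them is enough.
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

In this notebook the call would be `count_rows_per_table('/tmp/fhir-graph.sqlite')`, using the path reported in the output above.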
Create a TSV file from the extracted data¶
In [5]:
!fq dataframe
Saved /tmp/fhir-graph.tsv
Survival analysis¶
After retrieving the data, we use the Python library lifelines to plot Kaplan-Meier curves for two groups of breast cancer patients (white and African American) who are 50 years old or younger.
In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
# read the data into a dataframe
df = pd.read_csv('/tmp/fhir-graph.tsv')
# convert days to death from strings ending in ' days' to floats
df['days_to_death'] = (
    df['patient_observation_days_between_diagnosis_and_death']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)
# convert age at diagnosis the same way (the value stays in days, not years)
df['age_at_diagnosis'] = (
    df['patient_observation_days_between_birth_and_diagnosis']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)
# keep one row per patient
df_unique = df.drop_duplicates(subset=['patient_id'])
<ipython-input-6-a97b424089a7>:7: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False. df = pd.read_csv('/tmp/fhir-graph.tsv')
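The two conversions above follow the same pattern, so they could be factored into one helper. The sketch below is one possible version, assuming the raw values look like `'1234 days'` (inferred from the `' days'` suffix stripped in the cell above) and that blanks or missing entries should become NaN. The function name `days_to_float` and the sample values are made up for illustration.

```python
import pandas as pd


def days_to_float(series):
    """Convert strings such as '1234 days' to floats; anything unparseable becomes NaN."""
    cleaned = (
        series.astype(str)
        .str.replace(" days", "", regex=False)
        .str.strip()
    )
    # errors="coerce" maps empty strings and other non-numeric text to NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Toy values, not taken from TCGA-BRCA
example = pd.Series(["-18250 days", "", None, "365 days"])
parsed = days_to_float(example)
print(parsed.tolist())
```

Unlike the chained `.astype(float)` in the notebook, `pd.to_numeric(..., errors="coerce")` does not raise on unexpected text, which may or may not be what you want when auditing the data.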
Select breast cancer patients who are white or African American, not Hispanic or Latino, and 50 years old or younger at diagnosis.
In [7]:
# ages are in days; the negative threshold assumes they are stored as negative day counts
df_cohort = df_unique[
    (df_unique['age_at_diagnosis'] >= -50 * 365)
    & (df_unique['patient_us_core_race'].isin(['black or african american', 'white']))
    & (df_unique['patient_us_core_ethnicity'] == 'not hispanic or latino')
]
Get the necessary data for the lifelines package.
In [8]:
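# durations: patients without a recorded days_to_death get the largest observed value
T = df_cohort['days_to_death'].fillna(df_cohort['days_to_death'].max())
# event indicator: True when the patient is recorded as deceased
E = df_cohort['patient_deceasedBoolean'].astype(bool)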
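To make the roles of the durations `T` and the event flags `E` concrete, here is a minimal pure-Python sketch of the Kaplan-Meier (product-limit) estimate. It is an illustration of the general technique, not the lifelines implementation, and the toy numbers are invented. At each time with at least one death, the running survival estimate is multiplied by `1 - deaths / at_risk`; censored subjects (event flag `False`) leave the risk set without lowering the curve.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct event time (product-limit estimate)."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose duration equals t (deaths and censored alike)
        while i < len(pairs) and pairs[i][0] == t:
            deaths += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy data: four subjects, one of them censored at day 8
toy_curve = kaplan_meier([5, 8, 8, 12], [True, True, False, True])
print(toy_curve)
```

In the toy data the estimate drops to about 0.75 at day 5 (1 death among 4 at risk), to about 0.5 at day 8 (1 death among 3 at risk), and to 0 at day 12. In the notebook, patients without a recorded death are assigned the largest observed `days_to_death` as their duration.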
Plot the survival curves
In [9]:
kmf = KaplanMeierFitter()
fig = plt.figure(figsize=(13, 8), dpi=80)
ax = plt.subplot(111, title="Survival Curve")
# fit and draw one Kaplan-Meier curve per race group on the same axes
for r in df_cohort['patient_us_core_race'].sort_values().unique():
    if r is not None:
        cohort = df_cohort['patient_us_core_race'] == r
        kmf.fit(T.loc[cohort], E.loc[cohort], label=r)
        kmf.plot(ax=ax)
ax.set_ylabel("Survival probability")
ax.set_xlabel("Days")
plt.show()