Loading a multi-omics dataset

Suppose you have your own -omics dataset(s) and you’d like to load them. One of OpenOmics’s primary goals is to encapsulate the data import process in one line of code with a few parameters. Given any processed single-omic dataset, the library loads the data as a table where rows correspond to samples (observations) and columns correspond to measurements of different biomolecules.

Import the TCGA LUAD data included in the test datasets (preprocessed with TCGA-Assembler). It is located at tests/data/TCGA_LUAD.

folder_path = "tests/data/TCGA_LUAD/"

Load the multi-omics data: gene expression, microRNA expression, lncRNA expression, copy number variation, somatic mutation, DNA methylation, and protein expression.

from openomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path + "LUAD__geneExp.txt",
                    transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path + "LUAD__miRNAExp__RPM.txt",
                 transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path + "TCGA-rnaexpr.tsv",
                transpose=True, usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path + "LUAD__somaticMutation_geneLevel.txt",
                      transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol")
pro = Protein(data=folder_path + "protein_RPPA.txt",
              transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD")
luad_data.add_clinical_data(
    clinical=folder_path + "nationwidechildrens.org_clinical_patient_luad.txt")

luad_data.add_omic(mRNA)
luad_data.add_omic(miRNA)
luad_data.add_omic(lncRNA)
luad_data.add_omic(som)
luad_data.add_omic(pro)

luad_data.build_samples()
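
The transpose=True and usecols="GeneSymbol|TCGA" arguments above suggest that the raw files store one gene per row and one sample per column. As a rough illustration only, the sketch below reshapes such a file into a samples-by-genes table with plain pandas. It is not OpenOmics's actual loading code; the two-gene table, its values, and the extra EntrezID column are made up for the example.

```python
import io

import pandas as pd

# Made-up, tab-separated stand-in for a gene-by-sample expression file.
# The EntrezID column is hypothetical and only there to be filtered out.
raw = io.StringIO(
    "GeneSymbol\tEntrezID\tTCGA-05-4244-01A\tTCGA-05-4249-01A\n"
    "A1BG\t1\t4.76\t6.92\n"
    "A2M\t2\t13.27\t14.65\n"
)
table = pd.read_csv(raw, sep="\t")

# Keep only the columns whose name matches the regular expression,
# i.e. the gene symbol column and the TCGA sample columns.
table = table.filter(regex="GeneSymbol|TCGA")

# Index by gene, then transpose so that rows are samples and columns are genes.
expressions = table.set_index("GeneSymbol").T

print(expressions.shape)
print(list(expressions.index))
```

After the transpose, each row is one sample barcode and each column one gene, which is the layout described at the top of this page.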

Each dataset is stored as a pandas DataFrame. Below are all the tables imported for TCGA LUAD. For each, the first number is the number of samples and the second is the number of features.

PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)
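
A listing like the one above can be produced by iterating over the tables and reading each DataFrame's shape. The sketch below assumes the tables are held in a dict-like mapping from name to DataFrame; the `tables` dictionary and its values are toy stand-ins, not the real LUAD data.

```python
import pandas as pd

# Toy stand-in for a mapping of table names to DataFrames (made-up values).
tables = {
    "PATIENTS": pd.DataFrame(
        {"gender": ["MALE", "FEMALE"]},
        index=["TCGA-05-4244", "TCGA-05-4250"],
    ),
    "MessengerRNA": pd.DataFrame(
        {"A1BG": [4.76, 6.92, 5.70], "A2M": [13.27, 14.65, 14.05]},
        index=["TCGA-05-4244-01A", "TCGA-05-4249-01A", "TCGA-05-4250-01A"],
    ),
}

# DataFrame.shape is a (rows, columns) tuple: (samples, features) here.
shapes = {name: table.shape for name, table in tables.items()}

for name, shape in shapes.items():
    print(name, shape)
```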

You may notice that in this dataset the sample indices (e.g. TCGA-XX-XXXX) do not match across the different omics. It may be necessary to truncate them all to the same 12-character barcode so that they line up.

# For lncRNA, keep the last 12 characters of each sample label
lncRNA.expressions.index = lncRNA.expressions.index.str.slice(-12)
# For the other omics, keep the first 12 characters
miRNA.expressions.index = miRNA.expressions.index.str.slice(0, 12)
mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12)
som.expressions.index = som.expressions.index.str.slice(0, 12)
pro.expressions.index = pro.expressions.index.str.slice(0, 12)

luad_data.build_samples()
luad_data.samples

Index(['TCGA-05-4244', 'TCGA-05-4249', 'TCGA-05-4250', 'TCGA-05-4382',
       'TCGA-05-4384', 'TCGA-05-4389', 'TCGA-05-4390', 'TCGA-05-4395',
       'TCGA-05-4396', 'TCGA-05-4397',
       ...
       'TCGA-NJ-A4YG', 'TCGA-NJ-A4YI', 'TCGA-NJ-A4YP', 'TCGA-NJ-A4YQ',
       'TCGA-NJ-A55A', 'TCGA-NJ-A55O', 'TCGA-NJ-A55R', 'TCGA-NJ-A7XG',
       'TCGA-O1-A52J', 'TCGA-S2-AA1A'],
      dtype='object', length=952)
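
To see why truncating the labels helps, here is a minimal pandas sketch on made-up barcodes of two different lengths. The barcode strings and variable names are illustrative assumptions and are not taken from the LUAD files.

```python
import pandas as pd

# Made-up sample labels in two different barcode lengths.
mrna_index = pd.Index(["TCGA-05-4244-01A", "TCGA-05-4249-01A", "TCGA-05-4250-01A"])
protein_index = pd.Index(["TCGA-05-4244-01A-21-1898-20", "TCGA-05-4250-01A-21-1898-20"])

# As loaded, no label is shared between the two indexes.
shared_before = mrna_index.intersection(protein_index)

# Keep the first 12 characters of every label, e.g. "TCGA-05-4244".
mrna_patients = mrna_index.str.slice(0, 12)
protein_patients = protein_index.str.slice(0, 12)

# After truncation the labels can be compared directly.
shared_after = mrna_patients.intersection(protein_patients)

print(len(shared_before))
print(sorted(shared_after))
```

Note that truncating to 12 characters drops the part of the label that follows the first three fields, so two labels that differ only in that suffix become identical.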

Load single omics expressions for MessengerRNA, MicroRNA, LncRNA

We instantiate the MessengerRNA, MicroRNA and LncRNA -omics expression data from gtex.data. Since the gene expression data were not separated by RNA type, we use GENCODE and Ensembl gene annotations to filter the lists of mRNAs, miRNAs, and lncRNAs.

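import pandas as pd
from openomics import MessengerRNA, MicroRNA, LncRNA

# gtex_transcripts, gtex_transcripts_gene_id, gencode and ensembl are assumed
# to have been loaded beforehand.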

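# Gene Expression
# Index.intersection() is used for the set intersection; since pandas 2.0 the
# `&` operator on Index objects is a logical operation, not a set operation.
messengerRNA_id = pd.Index(
    gencode.data[gencode.data["gene_type"] == "protein_coding"]["gene_id"].unique()
).intersection(gtex_transcripts_gene_id)

messengerRNA = MessengerRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(messengerRNA_id)],
                            transpose=True, gene_index="gene_name", usecols=None, npartitions=4)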

# MicroRNA expression
microRNA_id = pd.Index(
    ensembl.data[ensembl.data["gene_biotype"] == "miRNA"]["gene_id"].unique()
).intersection(gtex_transcripts_gene_id)

microRNA = MicroRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(microRNA_id)],
                    gene_index="gene_id", transpose=True, usecols=None)

# LncRNA expression
lncRNA_id = pd.Index(
    gencode.data[gencode.data["gene_type"] == "lncRNA"]["gene_id"].unique()
).intersection(gtex_transcripts_gene_id)
lncRNA = LncRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(lncRNA_id)],
                gene_index="gene_id", transpose=True, usecols=None)
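
The filtering pattern used above (pick the gene IDs of one biotype from an annotation table, intersect them with the IDs present in the expression table, then subset the rows) can be tried on toy data. In the sketch below, the `annotation` and `expression` tables, the gene IDs, and the `select_by_type` helper are all hypothetical stand-ins for the GENCODE and GTEx objects.

```python
import pandas as pd

# Toy stand-in for a gene annotation table (hypothetical IDs).
annotation = pd.DataFrame({
    "gene_id": ["ENSG01", "ENSG02", "ENSG03", "ENSG04"],
    "gene_type": ["protein_coding", "lncRNA", "protein_coding", "miRNA"],
})

# Toy stand-in for an expression table with one row per gene.
expression = pd.DataFrame({
    "gene_id": ["ENSG01", "ENSG02", "ENSG05"],
    "tissue_a": [1.0, 2.0, 3.0],
    "tissue_b": [0.5, 0.0, 4.0],
})


def select_by_type(expr, annot, gene_type):
    """Return the rows of `expr` whose gene_id is annotated as `gene_type`."""
    ids = pd.Index(annot.loc[annot["gene_type"] == gene_type, "gene_id"].unique())
    # Set intersection of annotated IDs and IDs actually measured.
    keep = ids.intersection(pd.Index(expr["gene_id"]))
    return expr[expr["gene_id"].isin(keep)]


coding = select_by_type(expression, annotation, "protein_coding")
lnc = select_by_type(expression, annotation, "lncRNA")
mirna = select_by_type(expression, annotation, "miRNA")

print(list(coding["gene_id"]), list(lnc["gene_id"]), len(mirna))
```

Genes that are annotated but not measured (ENSG03, ENSG04) and genes that are measured but not annotated (ENSG05) both drop out of the result.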

Create a MultiOmics dataset

Now, we create a MultiOmics dataset object by combining the messengerRNA, microRNA, and lncRNA.

from openomics import MultiOmics

gtex_data = MultiOmics(cohort_name="GTEx Tissue Avg Expressions")

gtex_data.add_omic(messengerRNA)
gtex_data.add_omic(microRNA)
gtex_data.add_omic(lncRNA)

gtex_data.build_samples()

Accessing clinical data

Each omics and clinical table can be accessed by name through luad_data.data[...], for example:

luad_data.data["PATIENTS"]

	gender	race	histologic_subtype	pathologic_stage
bcr_patient_barcode
TCGA-05-4244	MALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage IV
TCGA-05-4245	MALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage III
TCGA-05-4249	MALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage I
TCGA-05-4250	FEMALE	NaN	Lung Adenocarcinoma- Not Otherwise Specified (...	Stage III
TCGA-05-4382	MALE	NaN	Lung Adenocarcinoma Mixed Subtype	Stage I

522 rows × 4 columns

luad_data.data["MessengerRNA"]

gene_name	A1BG	A1BG-AS1	A1CF	A2M	A2ML1	A4GALT	A4GNT	AAAS	AACS	AACSP1	...	ZXDA	ZXDB	ZXDC	ZYG11A	ZYG11B	ZYX	ZZEF1	ZZZ3	psiTPTE22
TCGA-05-4244-01A	4.756500	5.239211	0.000000	13.265291	0.431997	7.043317	1.033652	9.348765	9.652057	0.763921	...	5.350285	8.197321	9.907260	0.763921	10.088859	11.471139	9.768648	9.170597	2.932118
TCGA-05-4249-01A	6.920471	7.056843	0.402722	14.650247	1.383939	9.178805	0.717123	9.241537	9.967223	0.000000	...	5.980428	8.950001	10.204971	4.411650	9.622978	11.199826	10.153700	9.433116	7.499637
TCGA-05-4250-01A	5.696542	6.136327	0.000000	14.048541	0.000000	8.481646	0.996244	9.203535	9.560412	0.733962	...	5.931168	8.517334	9.722642	4.782796	8.895339	12.408981	10.194168	9.060342	2.867956
TCGA-05-4382-01A	7.198727	6.809804	0.000000	14.509730	2.532591	9.117559	1.657045	9.251035	10.078124	1.860883	...	5.373036	8.441914	9.888267	6.041142	9.828389	12.725186	10.192589	9.376841	5.177029

576 rows × 20472 columns

To match samples across different omics, use

luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])

Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
       'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
       'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
       'TCGA-05-4427-01A',
       ...
       'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
       'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
       'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
       'TCGA-S2-AA1A-01A'],
      dtype='object', length=465)
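
Conceptually, matching samples across modalities amounts to intersecting the row indexes of the selected tables. The sketch below shows that idea in plain pandas on made-up data. It is not the implementation of match_samples; the `tables` dictionary and the `common_samples` helper are hypothetical.

```python
from functools import reduce

import pandas as pd

# Toy tables with made-up values, one per modality, indexed by sample label.
tables = {
    "MessengerRNA": pd.DataFrame(
        {"A1BG": [4.76, 6.92, 5.70]},
        index=["TCGA-05-4244-01A", "TCGA-05-4249-01A", "TCGA-05-4250-01A"],
    ),
    "MicroRNA": pd.DataFrame(
        {"hsa-let-7a-1": [10.1, 11.3]},
        index=["TCGA-05-4249-01A", "TCGA-05-4250-01A"],
    ),
}


def common_samples(tables, modalities):
    """Return the sample labels present in every requested table."""
    indexes = [tables[name].index for name in modalities]
    return reduce(lambda left, right: left.intersection(right), indexes)


matched = common_samples(tables, ["MicroRNA", "MessengerRNA"])

# Restrict every table to the matched samples, in the same row order.
aligned = {name: table.loc[matched] for name, table in tables.items()}

print(sorted(matched))
```

Once every table is restricted to the same labels in the same order, rows at the same position in different tables refer to the same sample.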