# Loading a multi-omics dataset

Suppose you have your own -omics dataset(s) and you'd like to load them. One of OpenOmics's primary goals is to encapsulate the data import process in one line of code with a few parameters. Given any processed single-omic dataset, the library loads the data as a tabular structure where rows correspond to observation samples and columns correspond to measurements of different biomolecules.

Import the TCGA LUAD data included in the tests dataset (preprocessed from TCGA-Assembler). It is located at [tests/data/TCGA_LUAD](https://github.com/BioMeCIS-Lab/OpenOmics/tree/master/tests/data/TCGA_LUAD).

```{code-block} python
folder_path = "tests/data/TCGA_LUAD/"
```

Load the multi-omics data: gene expression, microRNA expression, lncRNA expression, somatic mutation, and protein expression.

```{code-block} python
from openomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path + "LUAD__geneExp.txt",
                    transpose=True, usecols="GeneSymbol|TCGA",
                    gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path + "LUAD__miRNAExp__RPM.txt",
                 transpose=True, usecols="GeneSymbol|TCGA",
                 gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path + "TCGA-rnaexpr.tsv",
                transpose=True, usecols="Gene_ID|TCGA",
                gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path + "LUAD__somaticMutation_geneLevel.txt",
                      transpose=True, usecols="GeneSymbol|TCGA",
                      gene_index="gene_name")
pro = Protein(data=folder_path + "protein_RPPA.txt",
              transpose=True, usecols="GeneSymbol|TCGA",
              gene_index="GeneSymbol", gene_level="protein_name")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD")
luad_data.add_clinical_data(
    clinical=folder_path + "nationwidechildrens.org_clinical_patient_luad.txt")

luad_data.add_omic(mRNA)
luad_data.add_omic(miRNA)
luad_data.add_omic(lncRNA)
luad_data.add_omic(som)
luad_data.add_omic(pro)

luad_data.build_samples()
```

Each dataset is stored as a Pandas DataFrame. Below are all the data imported for TCGA LUAD. For each, the first number is the number of samples and the second is the number of features.

    PATIENTS (522, 5)
    SAMPLES (1160, 6)
    DRUGS (461, 4)
    MessengerRNA (576, 20472)
    SomaticMutation (587, 21070)
    MicroRNA (494, 1870)
    LncRNA (546, 12727)
    Protein (364, 154)

## Load single omics expressions for MessengerRNA, MicroRNA, LncRNA

We instantiate the MessengerRNA, MicroRNA and LncRNA -omics expression data from `gtex.data`. Since the gene expression values were not separated by RNA type, we use GENCODE and Ensembl gene annotations to filter the lists of mRNAs, miRNAs, and lncRNAs.

```{code-block} python
import pandas as pd
from openomics import MessengerRNA, MicroRNA, LncRNA

# `gtex_transcripts`, `gtex_transcripts_gene_id`, `gencode` and `ensembl`
# are assumed to have been loaded beforehand.

# Gene Expression: GENCODE protein-coding genes that are present in GTEx
messengerRNA_id = pd.Index(
    gencode.data[gencode.data["gene_type"] == "protein_coding"]["gene_id"].unique()
).intersection(gtex_transcripts_gene_id)

messengerRNA = MessengerRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(messengerRNA_id)],
                            transpose=True, gene_index="gene_name", usecols=None, npartitions=4)

# MicroRNA expression: Ensembl miRNA genes that are present in GTEx
microRNA_id = pd.Index(
    ensembl.data[ensembl.data["gene_biotype"] == "miRNA"]["gene_id"].unique()
).intersection(gtex_transcripts_gene_id)

microRNA = MicroRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(microRNA_id)],
                    gene_index="gene_id", transpose=True, usecols=None)

# LncRNA expression: GENCODE lncRNA genes that are present in GTEx
lncRNA_id = pd.Index(
    gencode.data[gencode.data["gene_type"] == "lncRNA"]["gene_id"].unique()
).intersection(gtex_transcripts_gene_id)

lncRNA = LncRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(lncRNA_id)],
                gene_index="gene_id", transpose=True, usecols=None)
```

## Create a MultiOmics dataset

Now, we create a MultiOmics dataset object by combining the messengerRNA, microRNA, and lncRNA.
```{code-block} python
from openomics import MultiOmics

gtex_data = MultiOmics(cohort_name="GTEx Tissue Avg Expressions")

gtex_data.add_omic(messengerRNA)
gtex_data.add_omic(microRNA)
gtex_data.add_omic(lncRNA)

gtex_data.build_samples()
```

## Accessing clinical data

Each multi-omics and clinical dataset can be accessed through `luad_data.data[]`, for example:

```{code-block} python
luad_data.data["PATIENTS"]
```

| | gender | race | histologic_subtype | pathologic_stage |
|---|---|---|---|---|
| TCGA-05-4244 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage IV |
| TCGA-05-4245 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage III |
| TCGA-05-4249 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage I |
| TCGA-05-4250 | FEMALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage III |
| TCGA-05-4382 | MALE | NaN | Lung Adenocarcinoma Mixed Subtype | Stage I |

522 rows × 4 columns

```{code-block} python
luad_data.data["MessengerRNA"]
```
TCGA-05-4244-01A 4.756500 5.239211 0.000000 13.265291 0.431997 7.043317 1.033652 9.348765 9.652057 0.763921 ... 5.350285 8.197321 9.907260 0.763921 10.088859 11.471139 9.768648 9.170597 2.932118 0.000000
TCGA-05-4249-01A 6.920471 7.056843 0.402722 14.650247 1.383939 9.178805 0.717123 9.241537 9.967223 0.000000 ... 5.980428 8.950001 10.204971 4.411650 9.622978 11.199826 10.153700 9.433116 7.499637 0.000000
TCGA-05-4250-01A 5.696542 6.136327 0.000000 14.048541 0.000000 8.481646 0.996244 9.203535 9.560412 0.733962 ... 5.931168 8.517334 9.722642 4.782796 8.895339 12.408981 10.194168 9.060342 2.867956 0.000000
TCGA-05-4382-01A 7.198727 6.809804 0.000000 14.509730 2.532591 9.117559 1.657045 9.251035 10.078124 1.860883 ... 5.373036 8.441914 9.888267 6.041142 9.828389 12.725186 10.192589 9.376841 5.177029 0.000000

576 rows × 20472 columns
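
The two tables above use different index levels: `PATIENTS` is indexed by patient ID (for example `TCGA-05-4244`), while `MessengerRNA` is indexed by sample barcode (for example `TCGA-05-4244-01A`). In the standard TCGA barcode layout, the first three dash-separated fields identify the patient. The following is a minimal sketch of how a sample barcode can be mapped back to its patient row; `patient_id_from_barcode` is a hypothetical helper written for illustration and is not part of the OpenOmics API.

```python
def patient_id_from_barcode(sample_barcode: str) -> str:
    """Return the patient-level ID: the first three dash-separated fields.

    Hypothetical helper for illustration only (not an OpenOmics function).
    """
    fields = sample_barcode.split("-")
    if len(fields) < 3 or fields[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {sample_barcode!r}")
    return "-".join(fields[:3])


# Sample barcodes taken from the MessengerRNA index shown above
sample_barcodes = [
    "TCGA-05-4244-01A",
    "TCGA-05-4249-01A",
    "TCGA-05-4250-01A",
    "TCGA-05-4382-01A",
]

# Map every sample to the patient it was taken from
sample_to_patient = {s: patient_id_from_barcode(s) for s in sample_barcodes}

for sample, patient in sample_to_patient.items():
    print(sample, "->", patient)
```

With a mapping like this, a sample's expression row can be looked up alongside the matching row of the clinical table.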

## Matching samples across different multi-omics

To match samples across different multi-omics, use:

```{code-block} python
luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])
```

    Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
           'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
           'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
           'TCGA-05-4427-01A',
           ...
           'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
           'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
           'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
           'TCGA-S2-AA1A-01A'],
          dtype='object', length=465)
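
Conceptually, matching samples amounts to keeping the sample IDs that appear in every requested modality. The sketch below illustrates that idea with plain Python lists. It is an assumption about the behaviour, not the OpenOmics implementation: `common_samples` is a hypothetical helper, and the sample IDs are made-up placeholders rather than real TCGA barcodes.

```python
def common_samples(sample_ids_by_modality: dict) -> list:
    """Return the sample IDs present in every modality.

    The result keeps the order of the first modality. Hypothetical helper
    for illustration only (not the OpenOmics implementation).
    """
    id_lists = list(sample_ids_by_modality.values())
    if not id_lists:
        return []
    # Intersect the ID sets of all modalities
    shared = set(id_lists[0]).intersection(*map(set, id_lists[1:]))
    # Preserve the ordering of the first modality
    return [sample for sample in id_lists[0] if sample in shared]


# Placeholder sample IDs (not real data)
toy_indexes = {
    "MicroRNA": ["S2", "S3", "S4"],
    "MessengerRNA": ["S1", "S2", "S3"],
}

print(common_samples(toy_indexes))
```

When the sample IDs are held in pandas indexes, as they are for DataFrames, `Index.intersection` performs the same kind of set operation.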