# Getting started

Welcome! This tutorial highlights the OpenOmics API’s core features; for in-depth details and conceptual guides, see the links within, or the documentation index, which links to use cases and API reference sections.

## Loading a single-omics dataframe

Suppose you have a single-omics dataset and would like to load it as a dataframe.

As an example, we use the `TCGA` Lung Adenocarcinoma dataset from [tests/data/TCGA_LUAD](https://github.com/BioMeCIS-Lab/OpenOmics/tree/master/tests/data/TCGA_LUAD). Data tables are tab-delimited and have the following format:

| GeneSymbol | EntrezID  | TCGA-05-4244-01A-01R-1107-07 | TCGA-05-4249-01A-01R-1107-07 | ...  |
| ---------- | --------- | ---------------------------- | ---------------------------- | ---- |
| A1BG       | 100133144 | 10.8123                      | 3.7927                       | ...  |
| ⋮          | ⋮         | ⋮                            | ⋮                            | ⋮    |

Depending on whether your data table is stored as a single file, split into multiple files, or already loaded as a dataframe, you can load it using the class {class}`openomics.transcriptomics.Expression` or any of its subclasses.

````{tab} From a single file
If the dataset is a single file in a tabular format, OpenOmics can load it into a Pandas dataframe.

```{code-block} python
from openomics.multiomics import MessengerRNA

mrna = MessengerRNA(
    data="https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",  # A regex that matches all column names containing the substring "GeneSymbol" or "TCGA"
    gene_index="GeneSymbol",  # This column contains the gene index
    )
```

Note that the raw data file is column-oriented, with each column corresponding to a sample, so we pass `transpose=True` to convert it to a row-oriented table with one row per sample.

> MessengerRNA (576, 20472)
````

````{tab} From multiple files (glob)
If your dataset is large, it may be broken up into multiple files with a similar file name prefix/suffix.
Assuming all the files share a similar tabular format, OpenOmics can load all of them and construct an integrated data table using the memory-efficient Dask dataframe.

```python
from openomics.multiomics import MessengerRNA

mrna = MessengerRNA("TCGA_LUAD/LUAD__*",  # Glob pattern; files must be stored locally
                    transpose=True,
                    usecols="GeneSymbol|TCGA",
                    gene_index="GeneSymbol")
```

> INFO: Files matched: ['LUAD__miRNAExp__RPM.txt', 'LUAD__protein_RPPA.txt', 'LUAD__geneExp.txt']
````

````{tab} From DataFrame
If your workflow already produced a dataframe, you can encapsulate it directly with {class}`openomics.transcriptomics.Expression`.

```python
import pandas as pd
import numpy as np

from openomics.multiomics import MessengerRNA

# A random expression dataframe with Ensembl gene IDs as columns and 5 samples as rows.
df = pd.DataFrame(
    data={"ENSG00000194717": np.random.rand(5),
          "ENSG00000198973": np.random.rand(5),
          "ENSG00000198974": np.random.rand(5),
          "ENSG00000198975": np.random.rand(5),
          "ENSG00000198976": np.random.rand(5),
          "ENSG00000198982": np.random.rand(5),
          "ENSG00000198983": np.random.rand(5)},
    index=range(5))

mrna = MessengerRNA(df, transpose=False, sample_level="sample_id")
```
````

---

To access the {class}`DataFrame`, simply use {obj}`mrna.expressions`:

```python
print(mrna.expressions)
```

| GeneSymbol                   | A1BG     | A1BG-AS1 | A1CF  | A2M        |
| ---------------------------- | -------- | -------- | ----- | ---------- |
| sample_index                 |          |          |       |            |
| TCGA-05-4244-01A-01R-1107-07 | 26.0302  | 36.7711  | 0.000 | 9844.7858  |
| TCGA-05-4249-01A-01R-1107-07 | 120.1349 | 132.1439 | 0.322 | 25712.6617 |

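To make the roles of `usecols` and `transpose` more concrete, here is a minimal standard-library sketch of the same two steps on a toy table. It is not how OpenOmics is implemented; it assumes the `usecols` regex is applied as a substring search on column names, and the second data row (`A2M`) is made up for illustration.

```python
import csv
import io
import re

# Toy tab-delimited table. The A1BG row follows the format example above;
# the A2M row is made up for illustration.
raw = (
    "GeneSymbol\tEntrezID\tTCGA-05-4244-01A-01R-1107-07\tTCGA-05-4249-01A-01R-1107-07\n"
    "A1BG\t100133144\t10.8123\t3.7927\n"
    "A2M\t2\t5.5\t7.25\n"
)

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, body = rows[0], rows[1:]

# Keep only the columns whose name contains "GeneSymbol" or "TCGA" (assumed search semantics).
pattern = re.compile("GeneSymbol|TCGA")
keep = [i for i, name in enumerate(header) if pattern.search(name)]
kept_header = [header[i] for i in keep]

# Transpose: the file has one row per gene, the result has one entry per sample.
gene_col = header.index("GeneSymbol")
genes = [row[gene_col] for row in body]
by_sample = {
    header[i]: {gene: float(row[i]) for gene, row in zip(genes, body)}
    for i in keep
    if i != gene_col
}

print(kept_header)
print(by_sample["TCGA-05-4244-01A-01R-1107-07"])
```

The `EntrezID` column is dropped because its name matches neither alternative in the regex, and each sample barcode ends up keyed to its own gene-to-value mapping.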
## Creating a multi-omics dataset

With multiple single-omics datasets, each with a different set of genes and samples, you can use {class}`openomics.MultiOmics` to integrate them.

```{code-block} python
from openomics.multiomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

path = "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/"

# Load each expression dataframe
mRNA = MessengerRNA(path+"LUAD__geneExp.txt",
                    transpose=True,
                    usecols="GeneSymbol|TCGA",
                    gene_index="GeneSymbol")
miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt",
                 transpose=True,
                 usecols="GeneSymbol|TCGA",
                 gene_index="GeneSymbol")
lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv",
                transpose=True,
                usecols="Gene_ID|TCGA",
                gene_index="Gene_ID")
som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt",
                      transpose=True,
                      usecols="GeneSymbol|TCGA",
                      gene_index="gene_name")
pro = Protein(path+"LUAD__protein_RPPA.txt",
              transpose=True,
              usecols="GeneSymbol|TCGA",
              gene_index="GeneSymbol")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, miRNA, lncRNA, som, pro])
# You can also add individual -omics one at a time with `luad_data.add_omic(mRNA)`

luad_data.build_samples()
```

`luad_data` is a {class}`MultiOmics` object that builds the samples list from all the samples given in each -omics data.

> MessengerRNA (576, 20472)
> MicroRNA (494, 1870)
> LncRNA (546, 12727)
> SomaticMutation (587, 21070)
> Protein (364, 154)

To access individual -omics data within `luad_data`, such as the {obj}`mRNA`, simply use the `.` accessor with the class name {class}`MessengerRNA`:

```python
luad_data.MessengerRNA
# or
luad_data.data["MessengerRNA"]
```
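Because each -omics table covers a different set of samples, it can help to think of the combined samples list as a set operation. The sketch below is plain Python with toy barcodes, not the `build_samples()` implementation: it shows the union (all samples seen in any -omics) next to the intersection (samples present in every -omics), which is often what downstream analyses need.

```python
# Toy sample lists per -omics type; the barcodes are illustrative only.
omics_samples = {
    "MessengerRNA": ["TCGA-05-4244", "TCGA-05-4249"],
    "MicroRNA": ["TCGA-05-4245", "TCGA-05-4249"],
    "Protein": ["TCGA-05-4249"],
}

sample_sets = [set(samples) for samples in omics_samples.values()]

# Union: every sample that appears in at least one -omics table.
all_samples = sorted(set.union(*sample_sets))

# Intersection: only the samples measured in every -omics table.
shared_samples = sorted(set.intersection(*sample_sets))

print(all_samples)
print(shared_samples)
```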
## Adding clinical data as sample attributes

When sample attributes are provided for the study cohort, load them as a data table with {class}`openomics.clinical.ClinicalData`, then add it to the {class}`openomics.multiomics.MultiOmics` dataset to enable querying for subsets of samples across the multi-omics.

```python
from openomics import ClinicalData

clinical = ClinicalData(
    "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt",
    patient_index="bcr_patient_barcode")

luad_data.add_clinical_data(clinical)

luad_data.clinical.patient
```

| bcr_patient_barcode | bcr_patient_uuid                     | form_completion_date | histologic_diagnosis | prospective_collection | retrospective_collection |
| ------------------- | ------------------------------------ | -------------------- | -------------------- | ---------------------- | ------------------------ |
| TCGA-05-4244        | 34040b83-7e8a-4264-a551-b16621843e28 | 2010-7-22            | Lung Adenocarcinoma  | NO                     | YES                      |
| TCGA-05-4245        | 03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2 | 2010-7-22            | Lung Adenocarcinoma  | NO                     | YES                      |
| TCGA-05-4249        | 4addf05f-3668-4b3f-a17f-c0227329ca52 | 2010-7-22            | Lung Adenocarcinoma  | NO                     | YES                      |

Note that in the clinical data table, `bcr_patient_barcode` is the column with `TCGA-XX-XXXX` patient IDs, which should match the `sample_index` index of the `mRNA.expressions` dataframe.

````{note}
In our `TCGA_LUAD` example, the `bcr_patient_barcode` index of the clinical dataframe may not match the `sample_index` in `mRNA`, because the latter may use a longer form `TCGA-XX-XXXX-XXX-XXX-XXXX-XX` that also contains the sample number and aliquot IDs. To make them match, you can modify the index strings on the fly using the [Pandas's extensible API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html):

```python
mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12)  # Selects only the first 12 characters
```
````
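Before truncating an index this way, it can be worth checking what the 12-character slice produces. The following is a small plain-Python sketch of the same slicing on strings; the third barcode is a hypothetical second sample from the same patient, added to show that truncation can map several samples onto one patient ID.

```python
from collections import Counter

sample_barcodes = [
    "TCGA-05-4244-01A-01R-1107-07",
    "TCGA-05-4249-01A-01R-1107-07",
    "TCGA-05-4249-11A-01R-1107-07",  # hypothetical second sample from the same patient
]

# The first 12 characters are the TCGA-XX-XXXX patient barcode.
patient_ids = [barcode[:12] for barcode in sample_barcodes]

# Patient IDs that occur more than once after truncation.
duplicates = [pid for pid, count in Counter(patient_ids).items() if count > 1]

print(patient_ids)
print(duplicates)
```

If `duplicates` is non-empty, the truncated index is no longer unique, so you may want to filter or aggregate those samples before matching them against the clinical table.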
## Import an external database

Next, we may want to annotate the gene list in our RNA-seq expression dataset with genomic annotations. To do so, we'd need to download annotations from the [GENCODE database](https://www.gencodegenes.org/), preprocess the annotation files into a dataframe, and then match them with the genes in our dataset.

OpenOmics provides a simple API to download the GENCODE annotation files via FTP with these steps:

1. First, provide the base `path` of the FTP download server, usually found in the direct download link on GENCODE's website. In most cases, the base `path` determines the species, genome assembly, and database version for your study.
2. Second, use the `file_resources` dict parameter to select the data files and the file paths required to construct the annotation dataframe. For each entry in `file_resources`, the key is the alias of the required file, and the value is the filename relative to the FTP base `path`.

For example, the entry `{"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"}` instructs the GENCODE class to preprocess a `.gtf` file with the alias `"long_noncoding_RNAs.gtf"`, downloaded from the FTP path `ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz`.

To see which file alias keys are required to construct a dataframe, refer to the docstring in {class}`openomics.database.sequence.GENCODE`.
```python
from openomics.database import GENCODE

gencode = GENCODE(
    path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
    file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                    "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",  # lncRNA sequences
                    "transcripts.fa": "gencode.v32.transcripts.fa.gz"  # mRNA sequences
                    },
    npartitions=0,  # If > 1, use Dask to partition the dataframe and leverage out-of-core multiprocessing
)
```

To access the attributes constructed from the combination of the annotations `long_noncoding_RNAs.gtf` and `basic.annotation.gtf`, use:

```python
gencode.data
```

|     | gene_id         | gene_name   | index | seqname | source | feature    | start | end   |
| --- | --------------- | ----------- | ----- | ------- | ------ | ---------- | ----- | ----- |
| 0   | ENSG00000243485 | MIR1302-2HG | 0     | chr1    | HAVANA | gene       | 29554 | 31109 |
| 1   | ENSG00000243485 | MIR1302-2HG | 1     | chr1    | HAVANA | transcript | 29554 | 31097 |

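The relationship between the base `path` and the `file_resources` entries described above can be sketched with the standard library alone. This only illustrates how each alias would map to a full download location; it is an assumption about the joining rule, not the GENCODE class's actual download code.

```python
from urllib.parse import urljoin

base_path = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/"
file_resources = {
    "long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
}

# Resolve each filename against the base path (assumed joining rule).
download_urls = {
    alias: urljoin(base_path, filename)
    for alias, filename in file_resources.items()
}

for alias, url in download_urls.items():
    print(alias, "->", url)
```

Note that with `urljoin` the trailing slash on the base path matters: without it, the last path segment (`release_32`) would be replaced by the filename instead of being kept as a directory.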
## Annotate your expression dataset with attributes

With the annotation database, you can perform a join operation to add gene attributes to your {class}`openomics.transcriptomics.Expression` dataset. To annotate attributes for the `gene_id` list in `mRNA.expressions`, you must first select the corresponding column in `gencode.data` with matching `gene_id` keys. The following are code snippets for a variety of database types.

````{tab} Genomics attributes
```python
luad_data.MessengerRNA.annotate_attributes(gencode,
    index="gene_id",
    columns=['gene_name', 'start', 'end', 'strand']  # Add these columns to the .annotations dataframe
    )
```
````

````{tab} Sequences
```python
luad_data.MessengerRNA.annotate_sequences(gencode,
    index="gene_name",
    agg_sequences="all",  # Collect all sequences with the gene_name into a list
    )
```
````

````{tab} Disease Associations
```python
from openomics.database.disease import DisGeNet

disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True)

luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name")
```
````

---

To view the resulting annotations dataframe, use:

```python
luad_data.MessengerRNA.annotations
```

For a more detailed guide, refer to the [annotation interfaces API](../modules/openomics.annotate.md).
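Conceptually, the join described in this section behaves like a left join keyed on the gene identifier: every gene in the dataset keeps its row, and genes without a match get empty attributes. The plain-Python sketch below illustrates that idea using the two `gencode.data` rows shown earlier; it is not the `annotate_attributes` implementation, and the second gene ID is a hypothetical placeholder with no annotation.

```python
# Annotation rows taken from the gencode.data preview shown earlier.
annotation_rows = [
    {"gene_id": "ENSG00000243485", "gene_name": "MIR1302-2HG", "feature": "gene",
     "start": 29554, "end": 31109},
    {"gene_id": "ENSG00000243485", "gene_name": "MIR1302-2HG", "feature": "transcript",
     "start": 29554, "end": 31097},
]

# Genes in the expression dataset; the second ID is a hypothetical placeholder.
dataset_gene_ids = ["ENSG00000243485", "ENSG00000000000"]

# Build a lookup keyed on gene_id, keeping only gene-level rows so keys are unique.
gene_lookup = {
    row["gene_id"]: row for row in annotation_rows if row["feature"] == "gene"
}

# Left join: one entry per dataset gene, with None where no annotation matches.
columns = ["gene_name", "start", "end"]
annotations = {
    gene_id: {col: gene_lookup.get(gene_id, {}).get(col) for col in columns}
    for gene_id in dataset_gene_ids
}

print(annotations)
```

Filtering to one row per key before joining is a design choice in this sketch: with several annotation rows per gene, a plain lookup would otherwise keep whichever row happened to come last.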