# Getting started
Welcome! This tutorial highlights the core features of the OpenOmics API. For in-depth details and conceptual guides, follow the links within, or see the documentation index, which links to use cases and API reference sections.
## Loading a single-omics dataframe
Suppose you have a single-omics dataset and would like to load it as a dataframe.
As an example, we use the `TCGA` Lung Adenocarcinoma dataset from [tests/data/TCGA_LUAD](https://github.com/BioMeCIS-Lab/OpenOmics/tree/master/tests/data/TCGA_LUAD). Data tables are tab-delimited and have the following format:
| GeneSymbol | EntrezID | TCGA-05-4244-01A-01R-1107-07 | TCGA-05-4249-01A-01R-1107-07 | ... |
| ---------- | --------- | ---------------------------- | ---------------------------- | ---- |
| A1BG | 100133144 | 10.8123 | 3.7927 | ... |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Depending on whether your data table is stored locally as a single file, split into multiple files, or already loaded as a dataframe, you can load it using the class {class}`openomics.transcriptomics.Expression` or any of its subclasses.
````{tab} From a single file
If the dataset is a local file in a tabular format, OpenOmics can load it into a pandas dataframe.
```{code-block} python
from openomics.multiomics import MessengerRNA
mrna = MessengerRNA(
    data="https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",  # A regex that matches all column names containing either "GeneSymbol" or "TCGA"
    gene_index="GeneSymbol",  # This column contains the gene index
)
```
Note that the raw data file is column-oriented, with columns corresponding to samples, so we use the argument `transpose=True` to convert it to a row-oriented table in which each row is a sample.
> MessengerRNA (576, 20472)
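
To make the two arguments concrete, here is a minimal standard-library sketch of what `usecols` and `transpose` are described as doing. It is not OpenOmics's implementation: the toy table, its values, and the use of `re.search` to apply the pattern are assumptions made for illustration.

```python
import csv
import io
import re

# A toy tab-delimited table in the same layout as the example above (values are illustrative).
raw = (
    "GeneSymbol\tEntrezID\tTCGA-05-4244-01A-01R-1107-07\tTCGA-05-4249-01A-01R-1107-07\n"
    "A1BG\t1\t26.0302\t120.1349\n"
    "A2M\t2\t9844.7858\t25712.6617\n"
)
rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, body = rows[0], rows[1:]

# usecols: keep only the columns whose name contains "GeneSymbol" or "TCGA".
pattern = re.compile("GeneSymbol|TCGA")
keep = [i for i, name in enumerate(header) if pattern.search(name)]
selected_header = [header[i] for i in keep]

# transpose: samples (columns in the file) become rows, genes become columns.
gene_col = header.index("GeneSymbol")
sample_cols = [i for i in keep if i != gene_col]
by_sample = {
    header[i]: {row[gene_col]: float(row[i]) for row in body}
    for i in sample_cols
}

print(selected_header)
print(by_sample["TCGA-05-4244-01A-01R-1107-07"])
```

The `EntrezID` column is dropped because its name matches neither alternative in the pattern.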
````
````{tab} From multiple files (glob)
If your dataset is large, it may be broken up into multiple files with a similar file name prefix or suffix. Assuming all the files share a similar tabular format, OpenOmics can load them all and construct an integrated data table using a memory-efficient Dask dataframe.
```python
from openomics.multiomics import MessengerRNA
mrna = MessengerRNA("TCGA_LUAD/LUAD__*",  # Files must be stored locally
                    transpose=True,
                    usecols="GeneSymbol|TCGA",
                    gene_index="GeneSymbol")
```
> INFO: Files matched: ['LUAD__miRNAExp__RPM.txt', 'LUAD__protein_RPPA.txt', 'LUAD__geneExp.txt']
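
A glob can match more files than intended, so it may help to preview the matches before loading. The sketch below uses Python's `fnmatch` on a hand-written list of file names taken from this tutorial; the directory listing itself and the narrower pattern `LUAD__geneExp*` are assumptions for illustration.

```python
import fnmatch

# File names mentioned in this tutorial; the listing is assumed, not read from disk.
files = [
    "LUAD__geneExp.txt",
    "LUAD__miRNAExp__RPM.txt",
    "LUAD__protein_RPPA.txt",
    "TCGA-rnaexpr.tsv",
]

# The broad pattern picks up every file with the "LUAD__" prefix.
matched = fnmatch.filter(files, "LUAD__*")

# A narrower, hypothetical pattern restricts the match to gene expression files.
narrower = fnmatch.filter(files, "LUAD__geneExp*")

print(matched)
print(narrower)
```

If only one data type should end up in the table, a narrower pattern like the second one is probably the safer choice.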
````
````{tab} From DataFrame
If your workflow already produced a dataframe, you can encapsulate it directly with {class}`openomics.transcriptomics.Expression`.
```python
import pandas as pd
import numpy as np
from openomics.multiomics import MessengerRNA
# A random dataframe of microRNA gene_id's.
df = pd.DataFrame(data={"ENSG00000194717": np.random.rand(5),
                        "ENSG00000198973": np.random.rand(5),
                        "ENSG00000198974": np.random.rand(5),
                        "ENSG00000198975": np.random.rand(5),
                        "ENSG00000198976": np.random.rand(5),
                        "ENSG00000198982": np.random.rand(5),
                        "ENSG00000198983": np.random.rand(5)},
                  index=range(5))
mrna = MessengerRNA(df, transpose=False, sample_level="sample_id")
```
````
---
To access the {class}`DataFrame`, simply use {obj}`mrna.expressions`:
```python
print(mrna.expressions)
```
| GeneSymbol | A1BG | A1BG-AS1 | A1CF | A2M |
| ---------- | ---- | -------- | ---- | --- |
| **sample_index** | | | | |
| TCGA-05-4244-01A-01R-1107-07 | 26.0302 | 36.7711 | 0.000 | 9844.7858 |
| TCGA-05-4249-01A-01R-1107-07 | 120.1349 | 132.1439 | 0.322 | 25712.6617 |
## Creating a multi-omics dataset
When you have multiple single-omics datasets, each with a different set of genes and samples, you can use {class}`openomics.MultiOmics` to integrate them.
```{code-block} python
from openomics.multiomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

path = "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/"

# Load each expression dataframe
mRNA = MessengerRNA(path+"LUAD__geneExp.txt",
                    transpose=True,
                    usecols="GeneSymbol|TCGA",
                    gene_index="GeneSymbol")
miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt",
                 transpose=True,
                 usecols="GeneSymbol|TCGA",
                 gene_index="GeneSymbol")
lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv",
                transpose=True,
                usecols="Gene_ID|TCGA",
                gene_index="Gene_ID")
som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt",
                      transpose=True,
                      usecols="GeneSymbol|TCGA",
                      gene_index="gene_name")
pro = Protein(path+"protein_RPPA.txt",
              transpose=True,
              usecols="GeneSymbol|TCGA",
              gene_index="GeneSymbol")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, miRNA, lncRNA, som, pro])
# You can also add individual -omics one at a time `luad_data.add_omic(mRNA)`

luad_data.build_samples()
```
`luad_data` is a {class}`MultiOmics` object that builds the sample list from all the samples given in each -omics dataset.
> MessengerRNA (576, 20472)
> MicroRNA (494, 1870)
> LncRNA (546, 12727)
> SomaticMutation (587, 21070)
> Protein (364, 154)
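
Because each -omics covers a different number of samples, it can be useful to reason about which samples are shared. The sketch below uses plain Python sets with hypothetical sample IDs. Treating the combined sample list as a union is an assumption based on the description above, not a confirmed detail of `build_samples()`.

```python
# Hypothetical sample IDs per -omics; real datasets hold hundreds of samples.
samples_by_omics = {
    "MessengerRNA": {"TCGA-05-4244", "TCGA-05-4249"},
    "MicroRNA": {"TCGA-05-4249", "TCGA-05-4250"},
    "Protein": {"TCGA-05-4249"},
}

# Every sample seen in at least one -omics (assumed behaviour of the combined list).
all_samples = sorted(set.union(*samples_by_omics.values()))

# Samples measured in every -omics, which is what a fully paired analysis would need.
shared_samples = sorted(set.intersection(*samples_by_omics.values()))

print(all_samples)
print(shared_samples)
```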
To access individual -omics data within `luad_data`, such as the {obj}`mRNA`, simply use the `.` accessor with the class name {class}`MessengerRNA`:
```python
luad_data.MessengerRNA
# or
luad_data.data["MessengerRNA"]
```
## Adding clinical data as sample attributes
When sample attributes are provided for the study cohort, load them as a data table with {class}`openomics.clinical.ClinicalData`, then add it to the {class}`openomics.multiomics.MultiOmics` dataset to enable querying for subsets of samples across the multi-omics data.
```python
from openomics import ClinicalData
clinical = ClinicalData(
    "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt",
    patient_index="bcr_patient_barcode")
luad_data.add_clinical_data(clinical)
luad_data.clinical.patient
```
| bcr_patient_barcode | bcr_patient_uuid | form_completion_date | histologic_diagnosis | prospective_collection | retrospective_collection |
| ------------------- | ---------------- | -------------------- | -------------------- | ---------------------- | ------------------------ |
| TCGA-05-4244 | 34040b83-7e8a-4264-a551-b16621843e28 | 2010-7-22 | Lung Adenocarcinoma | NO | YES |
| TCGA-05-4245 | 03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2 | 2010-7-22 | Lung Adenocarcinoma | NO | YES |
| TCGA-05-4249 | 4addf05f-3668-4b3f-a17f-c0227329ca52 | 2010-7-22 | Lung Adenocarcinoma | NO | YES |
Note that in the clinical data table, `bcr_patient_barcode` is the column with `TCGA-XX-XXXX` patient IDs, which should match the `sample_index` index of the `mRNA.expressions` dataframe.
````{note}
In our `TCGA_LUAD` example, the `bcr_patient_barcode` index of the clinical dataframe may not match the `sample_index` in `mRNA`, because the latter may use a longer form, `TCGA-XX-XXXX-XXX-XXX-XXXX-XX`, that also contains the sample number and aliquot IDs. To make them match, you can modify the index strings on the fly using [pandas's extensible API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html):
```python
mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12) # Selects only the first 12 characters
```
````
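Before and after truncating the index, it may be worth checking how many sample IDs actually line up with the clinical table. The following plain-Python sketch uses the IDs shown in this tutorial. The variable names are hypothetical, and it only illustrates the matching logic rather than any OpenOmics call.

```python
# Sample IDs as they appear in the expression index, and patient IDs from the clinical table.
sample_ids = ["TCGA-05-4244-01A-01R-1107-07", "TCGA-05-4249-01A-01R-1107-07"]
patient_ids = {"TCGA-05-4244", "TCGA-05-4245", "TCGA-05-4249"}

# Full-length barcodes never equal the 12-character patient barcodes.
matched_before = [s for s in sample_ids if s in patient_ids]

# Keeping the first 12 characters yields the TCGA-XX-XXXX patient barcode.
matched_after = [s[:12] for s in sample_ids if s[:12] in patient_ids]

# Patients with clinical records but no expression sample.
without_expression = sorted(patient_ids - set(matched_after))

print(len(matched_before), matched_after, without_expression)
```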
## Import an external database
Next, we may want to annotate the gene list in our RNA-seq expression dataset with genomic annotations. To do so, we need to download annotations from the [GENCODE database](https://www.gencodegenes.org/), preprocess the annotation files into a dataframe, and then match them with the genes in our dataset.
OpenOmics provides a simple, hassle-free API to download the GENCODE annotation files via FTP with these steps:
1. First, provide the base `path` of the FTP download server, usually found in the direct download link on GENCODE's website. In most cases, the base `path` determines the species, genome assembly, and database version used for your study.
2. Second, use the `file_resources` dict parameter to select the data files and the file paths required to construct the annotation dataframe. For each entry in `file_resources`, the key is the alias of the required file, and the value is the filename relative to the FTP base `path`.
For example, the entry `{"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"}` tells the GENCODE class to preprocess a `.gtf` file with the alias `"long_noncoding_RNAs.gtf"`, downloaded from the FTP path `ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz`.
To see which file alias keys are required to construct a dataframe, refer to the docstring in {class}`openomics.database.sequence.GENCODE`.
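The relationship between the base `path`, the alias keys, and the download locations described above can be sketched with the standard library. This only illustrates how the pieces compose using `urllib.parse.urljoin`; how OpenOmics builds the URLs internally is not shown in this tutorial and is assumed here.

```python
from urllib.parse import urljoin

path = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/"
file_resources = {
    "long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
}

# Map each alias to the full location of the file it refers to.
urls = {alias: urljoin(path, filename) for alias, filename in file_resources.items()}

for alias, url in urls.items():
    print(alias, "->", url)
```

With `urljoin`, the trailing slash on `path` matters: without it, the last path segment (`release_32`) would be replaced by the filename instead of extended.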
```python
from openomics.database import GENCODE
gencode = GENCODE(
    path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
    file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                    "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",  # lncRNA sequences
                    "transcripts.fa": "gencode.v32.transcripts.fa.gz"  # mRNA sequences
                    },
    npartitions=0,  # if > 1, use Dask to partition the dataframe and leverage out-of-core multiprocessing
)
```
To access the attributes constructed from the combination of the annotations `long_noncoding_RNAs.gtf` and `basic.annotation.gtf`, use:
```python
gencode.data
```
| | gene_id | gene_name | index | seqname | source | feature | start | end |
| --- | ------- | --------- | ----- | ------- | ------ | ------- | ----- | --- |
| 0 | ENSG00000243485 | MIR1302-2HG | 0 | chr1 | HAVANA | gene | 29554 | 31109 |
| 1 | ENSG00000243485 | MIR1302-2HG | 1 | chr1 | HAVANA | transcript | 29554 | 31097 |
## Annotate your expression dataset with attributes
With the annotation database, you can perform a join operation to add gene attributes to your {class}`openomics.transcriptomics.Expression` dataset. To annotate attributes for the `gene_id` list in `mRNA.expressions`, you must first select the corresponding column in `gencode.data` with matching `gene_id` keys. The following code snippets cover a variety of database types.
````{tab} Genomics attributes
```python
luad_data.MessengerRNA.annotate_attributes(
    gencode,
    index="gene_id",
    columns=['gene_name', 'start', 'end', 'strand'],  # Add these columns to the .annotations dataframe
)
```
````
````{tab} Sequences
```python
luad_data.MessengerRNA.annotate_sequences(
    gencode,
    index="gene_name",
    agg_sequences="all",  # Collect all sequences with the gene_name into a list
)
```
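As a rough illustration of what collecting all sequences per gene name means, the sketch below groups toy records into lists with a `defaultdict`. The gene names and sequences are made up, and this is not the code OpenOmics runs for `agg_sequences="all"`.

```python
from collections import defaultdict

# Toy (gene_name, sequence) records; names and sequences are hypothetical.
records = [
    ("GENE_A", "ACGTAC"),
    ("GENE_A", "ACGT"),
    ("GENE_B", "TTGACA"),
]

# Group every sequence under its gene name, preserving the input order.
sequences_by_gene = defaultdict(list)
for gene_name, sequence in records:
    sequences_by_gene[gene_name].append(sequence)

print(dict(sequences_by_gene))
```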
````
````{tab} Disease Associations
```python
from openomics.database.disease import DisGeNet
disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True)
luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name")
```
````
---
To view the resulting annotations dataframe, use:
```python
luad_data.MessengerRNA.annotations
```
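Conceptually, the annotation step is a left join from the genes in the dataset to the rows of the database. The plain-Python sketch below reuses the two GENCODE rows shown earlier. Keeping the first row per `gene_id` and the placeholder ID `ENSG_NOT_IN_DATABASE` are assumptions for illustration, since this tutorial does not say how OpenOmics resolves multiple rows per gene.

```python
columns = ["gene_name", "start", "end"]

# Rows copied from the gencode.data preview above.
annotation_rows = [
    {"gene_id": "ENSG00000243485", "gene_name": "MIR1302-2HG", "feature": "gene", "start": 29554, "end": 31109},
    {"gene_id": "ENSG00000243485", "gene_name": "MIR1302-2HG", "feature": "transcript", "start": 29554, "end": 31097},
]

# Genes in the dataset; the second ID is a hypothetical one with no annotation.
gene_ids = ["ENSG00000243485", "ENSG_NOT_IN_DATABASE"]

# Index the database by gene_id, keeping the first row for each gene (an assumption).
first_row_by_gene = {}
for row in annotation_rows:
    first_row_by_gene.setdefault(row["gene_id"], row)

# Left join: every gene keeps an entry, with None where no annotation exists.
annotations = {
    gene_id: {col: first_row_by_gene.get(gene_id, {}).get(col) for col in columns}
    for gene_id in gene_ids
}

print(annotations)
```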
For a more detailed guide, refer to the [annotation interfaces API](../modules/openomics.annotate.md).