Welcome to OpenOmics’s documentation!¶
This Python package provide a series of tools to integrate and query the genomics, transcriptomics, proteomics, and clinical data (aka, the multi-omics data). With scalable data-frame manipulation tools, OpenOmics facilitates the common coding tasks when preparing data for RNA-seq bioinformatics analysis.
Features¶
Provides a bioinformatics workflow to generate integrative results from multi-omics data.
Facilitates integration of various bio-databases, multi-omics expression, genomics, and clinical data.
Highly flexible to different data types and missing data.
Provides researchers with means to consistently store and explore their experimental datasets.
Enables scalable performance with parallel computing, while easily configurable to deploy on both single machine and a cluster.
Table of Content¶
Installation¶
Stable release¶
To install OpenOmics, run this command in your terminal:
$ pip install openomics
This is the preferred method to install OpenOmics, as it will always install the most recent stable release.
If you don’t have pip
installed, it is recommended to install
the Anaconda Python distribution to get started with Python 3.7, 3.8, or
3.9 on either Mac OS, Linux, or Windows.
From sources¶
The sources for OpenOmics can be downloaded from the Github repo.
You can either clone the public repository:
$ git clone https://github.com/BioMeCIS-Lab/OpenOmics
Once you have a copy of the source, you can install it with:
$ cd OpenOmics
$ pip install ./ -e
The versioned release history can be found at OpenOmics/releases.
Getting started¶
Welcome! This tutorial highlights the OpenOmics API’s core features; for in-depth details and conceptual guides, see the links within, or the documentation index which has links to use cases, and API reference sections.
Loading a single-omics dataframe¶
Suppose you have a single-omics dataset and would like to load them as a dataframe.
As an example, we use the TGCA
Lung Adenocarcinoma dataset from tests/data/TCGA_LUAD. Data tables are tab-delimited and have the following format:
GeneSymbol |
EntrezID |
TCGA-05-4244-01A-01R-1107-07 |
TCGA-05-4249-01A-01R-1107-07 |
… |
---|---|---|---|---|
A1BG |
100133144 |
10.8123 |
3.7927 |
… |
⋮ |
⋮ |
⋮ |
⋮ |
Depending on whether your data table is stored locally as a single file, splitted into multiple files, or was already a dataframe, you can load it using the class openomics.Expression
or any of its subclasses.
If the dataset is a local file in a tabular format, OpenOmics can help you load them to Pandas dataframe.
from openomics.multiomics import MessengerRNA
mrna = MessengerRNA(
data="https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt",
transpose=True,
usecols="GeneSymbol|TCGA", # A regex that matches all column name with either "GeneSymbol" or "TCGA substring
gene_index="GeneSymbol", # This column contains the gene index
)
One thing to pay attention is that the raw data file given is column-oriented where columns corresponds to samples, so we have use the argument transpose=True
to convert to row-oriented.
MessengerRNA (576, 20472)
If your dataset is large, it may be broken up into multiple files with a similar file name prefix/suffix. Assuming all the files have similar tabular format, OpenOmics can load all files and contruct an integrated data table using the memory-efficient Dask dataframe.
from openomics.multiomics import MessengerRNA
mrna = MessengerRNA("TCGA_LUAD/LUAD__*", # Files must be stored locally
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")
INFO: Files matched: [‘LUAD__miRNAExp__RPM.txt’, ‘LUAD__protein_RPPA.txt’, ‘LUAD__geneExp.txt’]
If your workflow already produced a dataframe, you can encapsulate it directly with openomics.Expression
.
import pandas as pd
import numpy as np
from openomics.multiomics import MessengerRNA
# A random dataframe of microRNA gene_id's.
df = pd.DataFrame(data={"ENSG00000194717": np.random.rand(5),
"ENSG00000198973": np.random.rand(5),
"ENSG00000198974": np.random.rand(5),
"ENSG00000198975": np.random.rand(5),
"ENSG00000198976": np.random.rand(5),
"ENSG00000198982": np.random.rand(5),
"ENSG00000198983": np.random.rand(5)},
index=range(5))
mrna = MessengerRNA(df, transpose=False, sample_level="sample_id")
To access the DataFrame
, simply use mrna.expressions
:
print(mrna.expressions)
GeneSymbol | A1BG | A1BG-AS1 | A1CF | A2M |
---|---|---|---|---|
sample_index | ||||
TCGA-05-4244-01A-01R-1107-07 | 26.0302 | 36.7711 | 0.000 | 9844.7858 |
TCGA-05-4249-01A-01R-1107-07 | 120.1349 | 132.1439 | 0.322 | 25712.6617 |
Creating a multi-omics dataset¶
With multiple single-omics, each with different sets of genes and samples, you can use the openomics.MultiOmics
to integrate them.
from openomics.multiomics import MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein
path = "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/"
# Load each expression dataframe
mRNA = MessengerRNA(path+"LUAD__geneExp.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")
miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")
lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv",
transpose=True,
usecols="Gene_ID|TCGA",
gene_index="Gene_ID")
som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="gene_name")
pro = Protein(path+"protein_RPPA.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")
# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, mRNA, lncRNA, som, pro])
# You can also add individual -omics one at a time `luad_data.add_omic(mRNA)`
luad_data.build_samples()
The luad_data
is a MultiOmics
object builds the samples list from all the samples given in each -omics data.
MessengerRNA (576, 20472) MicroRNA (494, 1870) LncRNA (546, 12727) SomaticMutation (587, 21070) Protein (364, 154)
To access individual -omics data within luad_data
, such as the mRNA
, simply use the .
accessor with the class name MessengerRNA
:
luad_data.MessengerRNA
# or
luad_data.data["MessengerRNA"]
Adding clinical data as sample attributes¶
When sample attributes are provided for the study cohort, load it as a data table with the openomics.ClinicalData
, then add it to the openomics.multiomics.MultiOmics
dataset to enable querying for subsets of samples across the multi-omics.
from openomics import ClinicalData
clinical = ClinicalData(
"https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt",
patient_index="bcr_patient_barcode")
luad_data.add_clinical_data(clinical)
luad_data.clinical.patient
bcr_patient_uuid | form_completion_date | histologic_diagnosis | prospective_collection | retrospective_collection | |
---|---|---|---|---|---|
bcr_patient_barcode | |||||
TCGA-05-4244 | 34040b83-7e8a-4264-a551-b16621843e28 | 2010-7-22 | Lung Adenocarcinoma | NO | YES |
TCGA-05-4245 | 03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2 | 2010-7-22 | Lung Adenocarcinoma | NO | YES |
TCGA-05-4249 | 4addf05f-3668-4b3f-a17f-c0227329ca52 | 2010-7-22 | Lung Adenocarcinoma | NO | YES |
Note that in the clinical data table, bcr_patient_barcode
is the column with TCGA-XX-XXXX
patient IDs, which matches
that of the sample_index
index column in the mrna.expressions
dataframe.
Note
In our TCGA_LUAD
example, mismatches in the bcr_patient_barcode
sample index of clinical dataframe may happen because the sample_index
in mRNA
may have a longer form TCGA-XX-XXXX-XXX-XXX-XXXX-XX
that contain the samples number and aliquot ID’s. To make them match, you can modify the index strings on-the-fly using the Pandas’s extensible API:
mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12) # Selects only the first 12 characters
Import an external database¶
Next, we may want to annotate the genes list in our RNA-seq expression dataset with genomics annotation. To do so, we’d need to download annotations from the GENCODE database, preprocess annotation files into a dataframe, and then match them with the genes in our dataset.
OpenOmics provides a simple, hassle-free API to download the GENCODE annotation files via FTP with these steps:
First, provide the base
path
of the FTP download server - usually found in the direct download link on GENCODE’s website. Most of the time, selecting the right basepath
allows you to specify the specific species, genome assembly, and database version for your study.Secondly, use the
file_resources
dict parameter to select the data files and the file paths required to construct the annotation dataframe. For each entry in thefile_resources
, the key is the alias of the file required, and the value is the filename with the FTP basepath
.For example, the entry
{"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"}
indicates the GENCODE class to preprocess a.gtf
file with the alias"long_noncoding_RNAs.gtf"
, downloaded from the FTP pathftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz
To see which file alias keys are required to construct a dataframe, refer to the docstring in
openomics.database.sequence.GENCODE
.
from openomics.database import GENCODE
gencode = GENCODE(
path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
"basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
"lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz", # lncRNA sequences
"transcripts.fa": "gencode.v32.transcripts.fa.gz" # mRNA sequences
},
npartitions=0, # if > 1, then use Dask partition the dataframe and leverage out-of-core multiprocessing
)
To access the attributes constructed from the combination of annotations long_noncoding_RNAs.gtf
and basic.annotation.gtf
, use:
gencode.data
gene_id | gene_name | index | seqname | source | feature | start | end | |
---|---|---|---|---|---|---|---|---|
0 | ENSG00000243485 | MIR1302-2HG | 0 | chr1 | HAVANA | gene | 29554 | 31109 |
1 | ENSG00000243485 | MIR1302-2HG | 1 | chr1 | HAVANA | transcript | 29554 | 31097 |
Annotate your expression dataset with attributes¶
With the annotation database, you can perform a join operation to add gene attributes to your openomics.Expression
dataset. To annotate attributes for the gene_id
list mRNA.expression
, you must first select the corresponding column in gencode.data
with matching gene_id
keys. The following are code snippets for a variety of database types.
luad_data.MessengerRNA.annotate_attributes(gencode,
index="gene_id",
columns=['gene_name', 'start', 'end', 'strand'] # Add these columns to the .annotations dataframe
)
luad_data.MessengerRNA.annotate_sequences(gencode,
index="gene_name",
agg_sequences="all", # Collect all sequences with the gene_name into a list
)
from openomics.database.disease import DisGeNet
disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True)
luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name")
To view the resulting annotations dataframe, use:
luad_data.MessengerRNA.annotations
For more detailed guide, refer to the annotation interfaces API.
Loading a multi-omics dataset¶
Suppose you have your own -omics dataset(s) and you’d like to load them. One of OpenOmics’s primary goal is to encapsulate the data import process with one line of code along with a few parameters. Given any processed single-omic dataset, the library loads the data as a tabular structure where rows correspond to observation samples and columns correspond to measurements of different biomolecules.
Import TCGA LUAD data included in tests dataset (preprocessed from TCGA-Assembler). It is located at tests/data/TCGA_LUAD.
folder_path = "tests/data/TCGA_LUAD/"
Load the multiomics: Gene Expression, MicroRNA expression lncRNA expression, Copy Number Variation, Somatic Mutation, DNA Methylation, and Protein Expression data
from openomics import MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein
# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path + "LUAD__geneExp.txt",
transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path + "LUAD__miRNAExp__RPM.txt",
transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path + "TCGA-rnaexpr.tsv",
transpose=True, usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path + "LUAD__somaticMutation_geneLevel.txt",
transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
pro = Protein(data=folder_path + "protein_RPPA.txt",
transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")
# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD")
luad_data.add_clinical_data(
clinical=folder_path + "nationwidechildrens.org_clinical_patient_luad.txt")
luad_data.add_omic(mRNA)
luad_data.add_omic(miRNA)
luad_data.add_omic(lncRNA)
luad_data.add_omic(som)
luad_data.add_omic(pro)
luad_data.build_samples()
Each data is stored as a Pandas DataFrame. Below are all the data imported for TCGA LUAD. For each, the first number represents the number of samples, the second number is the number of features.
PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)
Load single omics expressions for MessengerRNA, MicroRNA, LncRNA¶
We instantiate the MessengerRNA, MicroRNA and LncRNA -omics expression data from gtex.data
. Since the gene expression
were not seperated by RNA type, we use GENCODE and Ensembl gene annotations to filter the list of mRNA, miRNA, and
lncRNAs.
from openomics import MessengerRNA, MicroRNA, LncRNA
# Gene Expression
messengerRNA_id = gtex_transcripts_gene_id & pd.Index(gencode.data[gencode.data["gene_type"] == "protein_coding"]["gene_id"].unique())
messengerRNA = MessengerRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(messengerRNA_id)],
transpose=True, gene_index="gene_name", usecols=None, npartitions=4)
# MicroRNA expression
microRNA_id = pd.Index(ensembl.data[ensembl.data["gene_biotype"] == "miRNA"]["gene_id"].unique()) & gtex_transcripts_gene_id
microRNA = MicroRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(microRNA_id)],
gene_index="gene_id", transpose=True, usecols=None, )
# LncRNA expression
lncRNA_id = pd.Index(gencode.data[gencode.data["gene_type"] == "lncRNA"]["gene_id"].unique()) & gtex_transcripts_gene_id
lncRNA = LncRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(lncRNA_id)],
gene_index="gene_id", transpose=True, usecols=None, )
Create a MultiOmics dataset¶
Now, we create a MultiOmics dataset object by combining the messengerRNA, microRNA, and lncRNA.
from openomics import MultiOmics
gtex_data = MultiOmics(cohort_name="GTEx Tissue Avg Expressions")
gtex_data.add_omic(messengerRNA)
gtex_data.add_omic(microRNA)
gtex_data.add_omic(lncRNA)
gtex_data.build_samples()
Accessing clinical data¶
Each multi-omics and clinical data can be accessed through luad_data.data[], like:
luad_data.data["PATIENTS"]
gender | race | histologic_subtype | pathologic_stage | |
---|---|---|---|---|
bcr_patient_barcode | ||||
TCGA-05-4244 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage IV |
TCGA-05-4245 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage III |
TCGA-05-4249 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage I |
TCGA-05-4250 | FEMALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage III |
TCGA-05-4382 | MALE | NaN | Lung Adenocarcinoma Mixed Subtype | Stage I |
522 rows × 4 columns
luad_data.data["MessengerRNA"]
gene_name | A1BG | A1BG-AS1 | A1CF | A2M | A2ML1 | A4GALT | A4GNT | AAAS | AACS | AACSP1 | ... | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 | psiTPTE22 | tAKR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-05-4244-01A | 4.756500 | 5.239211 | 0.000000 | 13.265291 | 0.431997 | 7.043317 | 1.033652 | 9.348765 | 9.652057 | 0.763921 | ... | 5.350285 | 8.197321 | 9.907260 | 0.763921 | 10.088859 | 11.471139 | 9.768648 | 9.170597 | 2.932118 | 0.000000 |
TCGA-05-4249-01A | 6.920471 | 7.056843 | 0.402722 | 14.650247 | 1.383939 | 9.178805 | 0.717123 | 9.241537 | 9.967223 | 0.000000 | ... | 5.980428 | 8.950001 | 10.204971 | 4.411650 | 9.622978 | 11.199826 | 10.153700 | 9.433116 | 7.499637 | 0.000000 |
TCGA-05-4250-01A | 5.696542 | 6.136327 | 0.000000 | 14.048541 | 0.000000 | 8.481646 | 0.996244 | 9.203535 | 9.560412 | 0.733962 | ... | 5.931168 | 8.517334 | 9.722642 | 4.782796 | 8.895339 | 12.408981 | 10.194168 | 9.060342 | 2.867956 | 0.000000 |
TCGA-05-4382-01A | 7.198727 | 6.809804 | 0.000000 | 14.509730 | 2.532591 | 9.117559 | 1.657045 | 9.251035 | 10.078124 | 1.860883 | ... | 5.373036 | 8.441914 | 9.888267 | 6.041142 | 9.828389 | 12.725186 | 10.192589 | 9.376841 | 5.177029 | 0.000000 |
576 rows × 20472 columns
To match samples accross different multi-omics, use¶
luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])
Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
'TCGA-05-4427-01A',
...
'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
'TCGA-S2-AA1A-01A'],
dtype='object', length=465)
External annotation databases¶
Import GENCODE human release 32¶
Next, we can annotate the genes in our GTEx expression dataset with genomics annotation from GENCODE. In this example,
we use the URL path prefix “ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/” which specifies the
species and release version. We also pass a dictionary file_resources
, with key-value pairs where the key is name of
file and value is the suffix of the file download URL.
For example, file_resources={“long_noncoding_RNAs.gtf”: “gencode.v32.long_noncoding_RNAs.gtf.gz”} will download file
located at ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz
to process the long_noncoding_RNAs.gtf
file.
Here, we loads both “long_noncoding_RNAs.gtf” and “basic.annotation.gtf” which builds a dataframe of combined
annotations for both lncRNAs and mRNAs. You can specify different annotation files options from GENCODE by modifying
the file_resources
dict argument.
from openomics.database import GENCODE, EnsemblGenes
gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
"basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
"lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",
"transcripts.fa": "gencode.v32.transcripts.fa.gz"},
remove_version_num=True)
# We also loads Ensembl genes to get list of miRNA gene IDs
ensembl = EnsemblGenes(biomart='hsapiens_gene_ensembl', npartitions=8, )
Setting the cache download directory¶
The package astropy
is used to automatically cache downloaded files. It defaults to saving the files at
~/.astropy/cache/
, where the cached content is retrieved given the matching URL. To change the path for the cache download file, run:
import openomics
openomics.set_cache_dir(path="PATH/OF/YOUR/CHOICE/")
Note
Note that this setting doesn’t persist across different programming sessions. Ideally, the cache dir should be in one location to minimize automatic FTP downloads, which may cause unnecessary stress on the database server.
Creating biological interaction networks¶
Preparing data for downstream analyses¶
To prepare the data for classification¶
# This function selects only patients with patholotic stages "Stage I" and "Stage II"
X_multiomics, y = luad_data.load_dataframe(,,
print(X_multiomics['MessengerRNA'].shape, X_multiomics['MicroRNA'].shape, X_multiomics['LncRNA'].shape, y.shape)
(336, 20472) (336, 1870) (336, 12727) (336, 1)
y
pathologic_stage | |
---|---|
TCGA-05-4390-01A | Stage I |
TCGA-05-4405-01A | Stage I |
TCGA-05-4410-01A | Stage I |
TCGA-05-4417-01A | Stage I |
TCGA-05-4424-01A | Stage II |
TCGA-05-4427-01A | Stage II |
TCGA-05-4433-01A | Stage I |
TCGA-05-5423-01A | Stage II |
TCGA-05-5425-01A | Stage II |
TCGA-05-5428-01A | Stage II |
TCGA-05-5715-01A | Stage I |
TCGA-38-4631-01A | Stage I |
TCGA-38-7271-01A | Stage I |
TCGA-38-A44F-01A | Stage I |
TCGA-44-2655-11A | Stage I |
336 rows × 1 columns
Log2 transform the mRNA, microRNA, and lncRNA expression values¶
def expression_val_transform(x):
return np.log2(x+1)
X_multiomics['MessengerRNA'] = X_multiomics['MessengerRNA'].applymap(expression_val_transform)
X_multiomics['MicroRNA'] = X_multiomics['MicroRNA'].applymap(expression_val_transform)
# X_multiomics['LncRNA'] = X_multiomics['LncRNA'].applymap(expression_val_transform)
Classification of Cancer Stage¶
from sklearn import preprocessing
from sklearn import metrics
from sklearn.svm import SVC, LinearSVC
import sklearn.linear_model
from sklearn.model_selection import train_test_split
binarizer = preprocessing.LabelEncoder()
binarizer.fit(y)
binarizer.transform(y)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
for omic in ["MessengerRNA", "MicroRNA"]:
print(omic)
scaler = sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=False)
scaler.fit(X_multiomics[omic])
X_train, X_test, Y_train, Y_test = \
train_test_split(X_multiomics[omic], y, test_size=0.3, random_state=np.random.randint(0, 10000), stratify=y)
print(X_train.shape, X_test.shape)
X_train = scaler.transform(X_train)
model = LinearSVC(C=1e-2, penalty='l1', class_weight='balanced', dual=False, multi_class="ovr")
# model = sklearn.linear_model.LogisticRegression(C=1e-0, penalty='l1', fit_intercept=False, class_weight="balanced")
# model = SVC(C=1e0, kernel="rbf", class_weight="balanced", decision_function_shape="ovo")
model.fit(X=X_train, y=Y_train)
print("NONZERO", len(np.nonzero(model.coef_)[0]))
print("Training accuracy", metrics.accuracy_score(model.predict(X_train), Y_train))
print(metrics.classification_report(y_pred=model.predict(X_test), y_true=Y_test))
MessengerRNA
(254, 20472) (109, 20472)
NONZERO 0
Training accuracy 0.6929133858267716
precision recall f1-score support
Stage I 0.69 1.00 0.82 75
Stage II 0.00 0.00 0.00 34
avg / total 0.47 0.69 0.56 109
MicroRNA
(254, 1870) (109, 1870)
NONZERO 0
Training accuracy 0.6929133858267716
precision recall f1-score support
Stage I 0.69 1.00 0.82 75
Stage II 0.00 0.00 0.00 34
avg / total 0.47 0.69 0.56 109
MultiOmics data¶
openomics Package¶
Functions¶
Classes¶
|
This class manages the clinical data tables to handle the patient’s phenotype data, as well as the treatment, and sample data associated to each patient. |
|
|
|
|
|
This class handles importing of any quantitative omics data that is in a table format (e.g. |
|
|
|
|
|
|
|
A data object which holds multiple -omics data for a single clinical cohort. |
|
|
|
Class Inheritance Diagram¶

Annotation interfaces¶
Classes¶
This abstract class provides an interface for the -omics ( |
|
|
This is a base class used to instantiate an external Database given a a set of files from either local files or URLs. |
Genomic databases¶
openomics.database.annotation Module¶
Classes¶
|
Interface to the BioMart service |
|
|
|
This is a base class used to instantiate an external Database given a a set of files from either local files or URLs. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Text I/O implementation using an in-memory buffer. |
|
Class Inheritance Diagram¶

Sequence databases¶
openomics.database.sequence Module¶
Classes¶
|
This is a base class used to instantiate an external Database given a a set of files from either local files or URLs. |
|
Loads the GENCODE database from https://www.gencodegenes.org/ . |
|
|
|
Provides a series of methods to extract sequence data from SequenceDataset. |
Class Inheritance Diagram¶

Interaction databases¶
openomics.database.interaction Module¶
Classes¶
|
|
|
Interface to the BioMart service |
|
|
|
This is a base class used to instantiate an external Database given a a set of files from either local files or URLs. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Provides a series of methods to extract sequence data from SequenceDataset. |
|
|
|
Text I/O implementation using an in-memory buffer. |
|
|
|
|
|
|
|
Class Inheritance Diagram¶

Disease databases¶
openomics.database.disease Module¶
Classes¶
|
Interface to the BioMart service |
|
|
|
This is a base class used to instantiate an external Database given a a set of files from either local files or URLs. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Text I/O implementation using an in-memory buffer. |
|
Class Inheritance Diagram¶

Ontology databases¶
openomics.database.ontology Module¶
Functions¶
|
|
|
|
|
|
|
|
|
|
|
Returns all successor terms from seed_node by traversing the ontology network with edges == type. |
|
|
Classes¶
|
This is a base class used to instantiate an external Database given a a set of files from either local files or URLs. |
|
|
|
|
|
Class Inheritance Diagram¶

Utilities¶
openomics.utils Package¶
Functions¶
|
Downloads a remote file given the url, then caches it to the user’s home folder. |
|
Parse a GTF into a dictionary mapping column names to sequences of values. |
Frequently Asked Questions (FAQ)¶
How to change the download directory where database files are cached?¶
You can make the setting at openomics.set_cache_dir(path="PATH/OF/YOUR/CHOICE/")
which will change the default
cache location. However, this setting doesn’t persist across user sessions. You can also change the default cache
directory at a ~/.openomics/conf.json
to persist the setting for each user.
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions¶
Web development with the Dash for the OpenOmics dashboard webserver.
Documentation standards for the OpenOmics Python API.
Organize the library of genomics, functional ontologies, interactions, and sequence databases for variety of biological studies.
Implement general purpose utilities for importing various fasta, gtf and sequencing files.
Report Bugs¶
Report bugs at openomics/issues.
If you are reporting a bug, please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
Fix Bugs¶
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features¶
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation¶
OpenOmics could always use more documentation, whether as part of the official OpenOmics docs, in docstrings within the API, or even on the web in blog posts, articles, and such.
If you’d like to help write RTD documentations, note:
Documentation pages are written in markdown using myst-parser
The Sphinx theme used is furo
The autodoc package used is sphinx-automodapi
Submit Feedback¶
The best way to send feedback is to file an issue at openomics/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!¶
Ready to contribute? Here’s how to set up openomics
for local development.
Fork the
openomics
repo on GitHub.Clone your fork locally and work on the develop branch:
$ git clone git@github.com:your_name_here/openomics.git
$ git checkout develop
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
$ mkvirtualenv openomics
$ cd openomics/
$ python setup.py develop
Create a branch for local development:
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 openomics tests
$ python setup.py test or py.test $ tox
To get flake8 and tox, just pip install them into your virtualenv.
Commit your changes and push your branch to GitHub:
$ git add . $ git commit -m "Your detailed description of your changes."
$ git push develop name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website to the develop branch. Once major features are tested, we can create another pull-request to the master branch.
Pull Request Guidelines¶
Before you submit a pull request, check that it meets these guidelines:
The pull request should include tests. Run tests by with
pytest ./
and make sure tests are 100% passing.If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in docs/history.md.
The pull request should work for Python 3.6 or higher, and for PyPi. Check Github Actions Tests and make sure that the tests pass for all supported Python versions and operating systems.
Tips¶
To run the automated tests locally, run this at the root directory:
pytest ./
Hint
To run a subset of tests:
$ py.test tests.test_openomics
To run tests targeting various operating systems and Python versions, make a pull-request to the master branch which will run as Github Actions Tests.
Deploying¶
A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.rst). Then run:
$ bumpversion patch # possible: major / minor / patch
$ git push --tags
Github Actions will then deploy to PyPI if tests pass.
Code of Conduct¶
Please note that the OpenOmics project is released with a Contributor Code of Conduct. By contributing to this project you agree to abide by its terms.
Release history¶
0.9.0 (Future)¶
Build web-app dashboard interface for importing user multiomics data files
0.8.6 (Pending)¶
Changed to Github Actions CI from Travis CI
Revamped openomics.readthedocs.io
Fixed bugs from pyOpenSci reviews.
0.8.5 (2020-02-19)¶
Importing GENCODE gtf files using dask now works with gzip compressed files.
Improved coverage and performance of automated tests.
0.8.4 (2020-01-07)¶
Enabled the support for Dask dataframes for ExpressionData and Annotation Dataset’s. To use this feature, simply use the npartitions argument when instantiating an ExpressionData or Annotation/Sequence/Interaction Dataset.
0.7.2 (2019-09-01)¶
Added compatibility for Python 2.7
Refactored ClinicalData
Built working documentations with Sphinx on readthedocs
Added pytests for MultiOmicsData
First successful build on Travis CI on Python 3.4-3.7, 2.7
0.6.0 (2019-08-31)¶
First release on PyPI.