Welcome to OpenOmics’s documentation!

PyPI version Documentation Status OpenOmics codecov

This Python package provide a series of tools to integrate and query the genomics, transcriptomics, proteomics, and clinical data (aka, the multi-omics data). With scalable data-frame manipulation tools, OpenOmics facilitates the common coding tasks when preparing data for RNA-seq bioinformatics analysis.

Features

  • Provides a bioinformatics workflow to generate integrative results from multi-omics data.

  • Facilitates integration of various bio-databases, multi-omics expression, genomics, and clinical data.

  • Highly flexible to different data types and missing data.

  • Provides researchers with means to consistently store and explore their experimental datasets.

  • Enables scalable performance with parallel computing, while easily configurable to deploy on both single machine and a cluster.

Table of Content

Installation

Stable release

To install OpenOmics, run this command in your terminal:

$ pip install openomics

This is the preferred method to install OpenOmics, as it will always install the most recent stable release.

If you don’t have pip installed, it is recommended to install the Anaconda Python distribution to get started with Python 3.7, 3.8, or 3.9 on either Mac OS, Linux, or Windows.

From sources

The sources for OpenOmics can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone https://github.com/BioMeCIS-Lab/OpenOmics

Once you have a copy of the source, you can install it with:

$ cd OpenOmics
$ pip install ./ -e

The versioned release history can be found at OpenOmics/releases.

Getting started

Welcome! This tutorial highlights the OpenOmics API’s core features; for in-depth details and conceptual guides, see the links within, or the documentation index which has links to use cases, and API reference sections.

Loading a single-omics dataframe

Suppose you have a single-omics dataset and would like to load them as a dataframe.

As an example, we use the TGCA Lung Adenocarcinoma dataset from tests/data/TCGA_LUAD. Data tables are tab-delimited and have the following format:

GeneSymbol

EntrezID

TCGA-05-4244-01A-01R-1107-07

TCGA-05-4249-01A-01R-1107-07

A1BG

100133144

10.8123

3.7927

Depending on whether your data table is stored locally as a single file, splitted into multiple files, or was already a dataframe, you can load it using the class openomics.Expression or any of its subclasses.

If the dataset is a local file in a tabular format, OpenOmics can help you load them to Pandas dataframe.

from openomics.multiomics import MessengerRNA

mrna = MessengerRNA(
    data="https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA", # A regex that matches all column name with either "GeneSymbol" or "TCGA substring
    gene_index="GeneSymbol", # This column contains the gene index
    )

One thing to pay attention is that the raw data file given is column-oriented where columns corresponds to samples, so we have use the argument transpose=True to convert to row-oriented.

MessengerRNA (576, 20472)

If your dataset is large, it may be broken up into multiple files with a similar file name prefix/suffix. Assuming all the files have similar tabular format, OpenOmics can load all files and contruct an integrated data table using the memory-efficient Dask dataframe.

from openomics.multiomics import MessengerRNA

mrna = MessengerRNA("TCGA_LUAD/LUAD__*", # Files must be stored locally
                    transpose=True,
                    usecols="GeneSymbol|TCGA",
                    gene_index="GeneSymbol")

INFO: Files matched: [‘LUAD__miRNAExp__RPM.txt’, ‘LUAD__protein_RPPA.txt’, ‘LUAD__geneExp.txt’]

If your workflow already produced a dataframe, you can encapsulate it directly with openomics.Expression.

import pandas as pd
import numpy as np
from openomics.multiomics import MessengerRNA

# A random dataframe of microRNA gene_id's.
df = pd.DataFrame(data={"ENSG00000194717": np.random.rand(5),
                        "ENSG00000198973": np.random.rand(5),
                        "ENSG00000198974": np.random.rand(5),
                        "ENSG00000198975": np.random.rand(5),
                        "ENSG00000198976": np.random.rand(5),
                        "ENSG00000198982": np.random.rand(5),
                        "ENSG00000198983": np.random.rand(5)},
                  index=range(5))
mrna = MessengerRNA(df, transpose=False, sample_level="sample_id")

To access the DataFrame, simply use mrna.expressions:

print(mrna.expressions)
GeneSymbol A1BG A1BG-AS1 A1CF A2M
sample_index
TCGA-05-4244-01A-01R-1107-07 26.0302 36.7711 0.000 9844.7858
TCGA-05-4249-01A-01R-1107-07 120.1349 132.1439 0.322 25712.6617

Creating a multi-omics dataset

With multiple single-omics, each with different sets of genes and samples, you can use the openomics.MultiOmics to integrate them.

from openomics.multiomics import MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

path = "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/"

# Load each expression dataframe
mRNA = MessengerRNA(path+"LUAD__geneExp.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",
    gene_index="GeneSymbol")
miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",
    gene_index="GeneSymbol")
lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv",
    transpose=True,
    usecols="Gene_ID|TCGA",
    gene_index="Gene_ID")
som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",
    gene_index="gene_name")
pro = Protein(path+"protein_RPPA.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",
    gene_index="GeneSymbol")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, mRNA, lncRNA, som, pro])
# You can also add individual -omics one at a time `luad_data.add_omic(mRNA)`

luad_data.build_samples()

The luad_data is a MultiOmics object builds the samples list from all the samples given in each -omics data.

MessengerRNA (576, 20472) MicroRNA (494, 1870) LncRNA (546, 12727) SomaticMutation (587, 21070) Protein (364, 154)

To access individual -omics data within luad_data, such as the mRNA, simply use the . accessor with the class name MessengerRNA:

luad_data.MessengerRNA
# or
luad_data.data["MessengerRNA"]

Adding clinical data as sample attributes

When sample attributes are provided for the study cohort, load it as a data table with the openomics.ClinicalData, then add it to the openomics.multiomics.MultiOmics dataset to enable querying for subsets of samples across the multi-omics.

from openomics import ClinicalData

clinical = ClinicalData(
    "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt",
    patient_index="bcr_patient_barcode")

luad_data.add_clinical_data(clinical)

luad_data.clinical.patient
bcr_patient_uuid form_completion_date histologic_diagnosis prospective_collection retrospective_collection
bcr_patient_barcode
TCGA-05-4244 34040b83-7e8a-4264-a551-b16621843e28 2010-7-22 Lung Adenocarcinoma NO YES
TCGA-05-4245 03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2 2010-7-22 Lung Adenocarcinoma NO YES
TCGA-05-4249 4addf05f-3668-4b3f-a17f-c0227329ca52 2010-7-22 Lung Adenocarcinoma NO YES

Note that in the clinical data table, bcr_patient_barcode is the column with TCGA-XX-XXXX patient IDs, which matches that of the sample_index index column in the mrna.expressions dataframe.

Note

In our TCGA_LUAD example, mismatches in the bcr_patient_barcode sample index of clinical dataframe may happen because the sample_index in mRNA may have a longer form TCGA-XX-XXXX-XXX-XXX-XXXX-XX that contain the samples number and aliquot ID’s. To make them match, you can modify the index strings on-the-fly using the Pandas’s extensible API:

mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12) # Selects only the first 12 characters

Import an external database

Next, we may want to annotate the genes list in our RNA-seq expression dataset with genomics annotation. To do so, we’d need to download annotations from the GENCODE database, preprocess annotation files into a dataframe, and then match them with the genes in our dataset.

OpenOmics provides a simple, hassle-free API to download the GENCODE annotation files via FTP with these steps:

  1. First, provide the base path of the FTP download server - usually found in the direct download link on GENCODE’s website. Most of the time, selecting the right base path allows you to specify the specific species, genome assembly, and database version for your study.

  2. Secondly, use the file_resources dict parameter to select the data files and the file paths required to construct the annotation dataframe. For each entry in the file_resources, the key is the alias of the file required, and the value is the filename with the FTP base path.

    For example, the entry {"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"} indicates the GENCODE class to preprocess a .gtf file with the alias "long_noncoding_RNAs.gtf", downloaded from the FTP path ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz

    To see which file alias keys are required to construct a dataframe, refer to the docstring in openomics.database.sequence.GENCODE.

from openomics.database import GENCODE

gencode = GENCODE(
    path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
    file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                    "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz", # lncRNA sequences
                    "transcripts.fa": "gencode.v32.transcripts.fa.gz" # mRNA sequences
                    },
    npartitions=0, # if > 1, then use Dask partition the dataframe and leverage out-of-core multiprocessing
)

To access the attributes constructed from the combination of annotations long_noncoding_RNAs.gtf and basic.annotation.gtf, use:

gencode.data
gene_id gene_name index seqname source feature start end
0 ENSG00000243485 MIR1302-2HG 0 chr1 HAVANA gene 29554 31109
1 ENSG00000243485 MIR1302-2HG 1 chr1 HAVANA transcript 29554 31097

Annotate your expression dataset with attributes

With the annotation database, you can perform a join operation to add gene attributes to your openomics.Expression dataset. To annotate attributes for the gene_id list mRNA.expression, you must first select the corresponding column in gencode.data with matching gene_id keys. The following are code snippets for a variety of database types.

luad_data.MessengerRNA.annotate_attributes(gencode,
    index="gene_id",
    columns=['gene_name', 'start', 'end', 'strand'] # Add these columns to the .annotations dataframe
    )
luad_data.MessengerRNA.annotate_sequences(gencode,
    index="gene_name",
    agg_sequences="all", # Collect all sequences with the gene_name into a list
    )
from openomics.database.disease import DisGeNet
disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True)

luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name")

To view the resulting annotations dataframe, use:

luad_data.MessengerRNA.annotations

For more detailed guide, refer to the annotation interfaces API.

Loading a multi-omics dataset

Suppose you have your own -omics dataset(s) and you’d like to load them. One of OpenOmics’s primary goal is to encapsulate the data import process with one line of code along with a few parameters. Given any processed single-omic dataset, the library loads the data as a tabular structure where rows correspond to observation samples and columns correspond to measurements of different biomolecules.

Import TCGA LUAD data included in tests dataset (preprocessed from TCGA-Assembler). It is located at tests/data/TCGA_LUAD.

folder_path = "tests/data/TCGA_LUAD/"

Load the multiomics: Gene Expression, MicroRNA expression lncRNA expression, Copy Number Variation, Somatic Mutation, DNA Methylation, and Protein Expression data

from openomics import MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path + "LUAD__geneExp.txt",
                    transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path + "LUAD__miRNAExp__RPM.txt",
                 transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path + "TCGA-rnaexpr.tsv",
                transpose=True, usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path + "LUAD__somaticMutation_geneLevel.txt",
                      transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
pro = Protein(data=folder_path + "protein_RPPA.txt",
              transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")

# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD")
luad_data.add_clinical_data(
    clinical=folder_path + "nationwidechildrens.org_clinical_patient_luad.txt")

luad_data.add_omic(mRNA)
luad_data.add_omic(miRNA)
luad_data.add_omic(lncRNA)
luad_data.add_omic(som)
luad_data.add_omic(pro)

luad_data.build_samples()

Each data is stored as a Pandas DataFrame. Below are all the data imported for TCGA LUAD. For each, the first number represents the number of samples, the second number is the number of features.

PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)

Load single omics expressions for MessengerRNA, MicroRNA, LncRNA

We instantiate the MessengerRNA, MicroRNA and LncRNA -omics expression data from gtex.data. Since the gene expression were not seperated by RNA type, we use GENCODE and Ensembl gene annotations to filter the list of mRNA, miRNA, and lncRNAs.

from openomics import MessengerRNA, MicroRNA, LncRNA

# Gene Expression
messengerRNA_id = gtex_transcripts_gene_id & pd.Index(gencode.data[gencode.data["gene_type"] == "protein_coding"]["gene_id"].unique())

messengerRNA = MessengerRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(messengerRNA_id)],
                           transpose=True, gene_index="gene_name", usecols=None, npartitions=4)

# MicroRNA expression
microRNA_id = pd.Index(ensembl.data[ensembl.data["gene_biotype"] == "miRNA"]["gene_id"].unique()) & gtex_transcripts_gene_id

microRNA = MicroRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(microRNA_id)],
                   gene_index="gene_id", transpose=True, usecols=None, )

# LncRNA expression
lncRNA_id = pd.Index(gencode.data[gencode.data["gene_type"] == "lncRNA"]["gene_id"].unique()) & gtex_transcripts_gene_id
lncRNA = LncRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(lncRNA_id)],
               gene_index="gene_id", transpose=True, usecols=None, )

Create a MultiOmics dataset

Now, we create a MultiOmics dataset object by combining the messengerRNA, microRNA, and lncRNA.

   from openomics import MultiOmics

   gtex_data = MultiOmics(cohort_name="GTEx Tissue Avg Expressions")

   gtex_data.add_omic(messengerRNA)
   gtex_data.add_omic(microRNA)
   gtex_data.add_omic(lncRNA)

   gtex_data.build_samples()

Accessing clinical data

Each multi-omics and clinical data can be accessed through luad_data.data[], like:

luad_data.data["PATIENTS"]
gender race histologic_subtype pathologic_stage
bcr_patient_barcode
TCGA-05-4244 MALE NaN Lung Adenocarcinoma- Not Otherwise Specified (... Stage IV
TCGA-05-4245 MALE NaN Lung Adenocarcinoma- Not Otherwise Specified (... Stage III
TCGA-05-4249 MALE NaN Lung Adenocarcinoma- Not Otherwise Specified (... Stage I
TCGA-05-4250 FEMALE NaN Lung Adenocarcinoma- Not Otherwise Specified (... Stage III
TCGA-05-4382 MALE NaN Lung Adenocarcinoma Mixed Subtype Stage I

522 rows × 4 columns

luad_data.data["MessengerRNA"]
gene_name A1BG A1BG-AS1 A1CF A2M A2ML1 A4GALT A4GNT AAAS AACS AACSP1 ... ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3 psiTPTE22 tAKR
TCGA-05-4244-01A 4.756500 5.239211 0.000000 13.265291 0.431997 7.043317 1.033652 9.348765 9.652057 0.763921 ... 5.350285 8.197321 9.907260 0.763921 10.088859 11.471139 9.768648 9.170597 2.932118 0.000000
TCGA-05-4249-01A 6.920471 7.056843 0.402722 14.650247 1.383939 9.178805 0.717123 9.241537 9.967223 0.000000 ... 5.980428 8.950001 10.204971 4.411650 9.622978 11.199826 10.153700 9.433116 7.499637 0.000000
TCGA-05-4250-01A 5.696542 6.136327 0.000000 14.048541 0.000000 8.481646 0.996244 9.203535 9.560412 0.733962 ... 5.931168 8.517334 9.722642 4.782796 8.895339 12.408981 10.194168 9.060342 2.867956 0.000000
TCGA-05-4382-01A 7.198727 6.809804 0.000000 14.509730 2.532591 9.117559 1.657045 9.251035 10.078124 1.860883 ... 5.373036 8.441914 9.888267 6.041142 9.828389 12.725186 10.192589 9.376841 5.177029 0.000000

576 rows × 20472 columns

To match samples accross different multi-omics, use

luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])
Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
       'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
       'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
       'TCGA-05-4427-01A',
       ...
       'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
       'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
       'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
       'TCGA-S2-AA1A-01A'],
      dtype='object', length=465)

External annotation databases

Import GENCODE human release 32

Next, we can annotate the genes in our GTEx expression dataset with genomics annotation from GENCODE. In this example, we use the URL path prefix “ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/” which specifies the species and release version. We also pass a dictionary file_resources, with key-value pairs where the key is name of file and value is the suffix of the file download URL.

For example, file_resources={“long_noncoding_RNAs.gtf”: “gencode.v32.long_noncoding_RNAs.gtf.gz”} will download file located at ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz to process the long_noncoding_RNAs.gtf file.

Here, we loads both “long_noncoding_RNAs.gtf” and “basic.annotation.gtf” which builds a dataframe of combined annotations for both lncRNAs and mRNAs. You can specify different annotation files options from GENCODE by modifying the file_resources dict argument.

from openomics.database import GENCODE, EnsemblGenes

gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
                 file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                                 "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                                 "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",
                                 "transcripts.fa": "gencode.v32.transcripts.fa.gz"},
                 remove_version_num=True)

# We also loads Ensembl genes to get list of miRNA gene IDs
ensembl = EnsemblGenes(biomart='hsapiens_gene_ensembl', npartitions=8, )

Setting the cache download directory

The package astropy is used to automatically cache downloaded files. It defaults to saving the files at ~/.astropy/cache/, where the cached content is retrieved given the matching URL. To change the path for the cache download file, run:

import openomics

openomics.set_cache_dir(path="PATH/OF/YOUR/CHOICE/")

Note

Note that this setting doesn’t persist across different programming sessions. Ideally, the cache dir should be in one location to minimize automatic FTP downloads, which may cause unnecessary stress on the database server.

Creating biological interaction networks

Preparing data for downstream analyses

To prepare the data for classification

# This function selects only patients with patholotic stages "Stage I" and "Stage II"
X_multiomics, y = luad_data.load_dataframe(,,
print(X_multiomics['MessengerRNA'].shape, X_multiomics['MicroRNA'].shape, X_multiomics['LncRNA'].shape, y.shape)
(336, 20472) (336, 1870) (336, 12727) (336, 1)
y
pathologic_stage
TCGA-05-4390-01A Stage I
TCGA-05-4405-01A Stage I
TCGA-05-4410-01A Stage I
TCGA-05-4417-01A Stage I
TCGA-05-4424-01A Stage II
TCGA-05-4427-01A Stage II
TCGA-05-4433-01A Stage I
TCGA-05-5423-01A Stage II
TCGA-05-5425-01A Stage II
TCGA-05-5428-01A Stage II
TCGA-05-5715-01A Stage I
TCGA-38-4631-01A Stage I
TCGA-38-7271-01A Stage I
TCGA-38-A44F-01A Stage I
TCGA-44-2655-11A Stage I

336 rows × 1 columns

Log2 transform the mRNA, microRNA, and lncRNA expression values

def expression_val_transform(x):
    return np.log2(x+1)
X_multiomics['MessengerRNA'] = X_multiomics['MessengerRNA'].applymap(expression_val_transform)
X_multiomics['MicroRNA'] = X_multiomics['MicroRNA'].applymap(expression_val_transform)
# X_multiomics['LncRNA'] = X_multiomics['LncRNA'].applymap(expression_val_transform)

Classification of Cancer Stage

from sklearn import preprocessing
from sklearn import metrics
from sklearn.svm import SVC, LinearSVC
import sklearn.linear_model
from sklearn.model_selection import train_test_split

binarizer = preprocessing.LabelEncoder()
binarizer.fit(y)
binarizer.transform(y)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
for omic in ["MessengerRNA", "MicroRNA"]:
    print(omic)
    scaler = sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=False)
    scaler.fit(X_multiomics[omic])

    X_train, X_test, Y_train, Y_test = \
        train_test_split(X_multiomics[omic], y, test_size=0.3, random_state=np.random.randint(0, 10000), stratify=y)
    print(X_train.shape, X_test.shape)


    X_train = scaler.transform(X_train)

    model = LinearSVC(C=1e-2, penalty='l1', class_weight='balanced', dual=False, multi_class="ovr")
#     model = sklearn.linear_model.LogisticRegression(C=1e-0, penalty='l1', fit_intercept=False, class_weight="balanced")
#     model = SVC(C=1e0, kernel="rbf", class_weight="balanced", decision_function_shape="ovo")

    model.fit(X=X_train, y=Y_train)
    print("NONZERO", len(np.nonzero(model.coef_)[0]))
    print("Training accuracy", metrics.accuracy_score(model.predict(X_train), Y_train))
    print(metrics.classification_report(y_pred=model.predict(X_test), y_true=Y_test))

MessengerRNA
(254, 20472) (109, 20472)
NONZERO 0
Training accuracy 0.6929133858267716
             precision    recall  f1-score   support

    Stage I       0.69      1.00      0.82        75
   Stage II       0.00      0.00      0.00        34

avg / total       0.47      0.69      0.56       109

MicroRNA
(254, 1870) (109, 1870)
NONZERO 0
Training accuracy 0.6929133858267716
             precision    recall  f1-score   support

    Stage I       0.69      1.00      0.82        75
   Stage II       0.00      0.00      0.00        34

avg / total       0.47      0.69      0.56       109

MultiOmics data

openomics Package

Functions
Classes

ClinicalData(file, patient_index[, columns])

This class manages the clinical data tables to handle the patient’s phenotype data, as well as the treatment, and sample data associated to each patient.

CopyNumberVariation(data, transpose, gene_index)

DNAMethylation(data, transpose, gene_index)

Expression(data, transpose[, gene_index, …])

This class handles importing of any quantitative omics data that is in a table format (e.g.

LncRNA(data, transpose[, gene_index, …])

MessengerRNA(data, transpose[, gene_index, …])

MicroRNA(data, transpose[, gene_index, …])

MultiOmics(cohort_name[, omics_data])

A data object which holds multiple -omics data for a single clinical cohort.

Protein(data, transpose[, gene_index, …])

SomaticMutation(data, transpose, gene_index)

Class Inheritance Diagram
Inheritance diagram of openomics.clinical.ClinicalData, openomics.genomics.CopyNumberVariation, openomics.genomics.DNAMethylation, openomics.transcriptomics.Expression, openomics.transcriptomics.LncRNA, openomics.transcriptomics.MessengerRNA, openomics.transcriptomics.MicroRNA, openomics.multiomics.MultiOmics, openomics.proteomics.Protein, openomics.genomics.SomaticMutation

Annotation interfaces

Classes

Annotatable()

This abstract class provides an interface for the -omics (Expression) to annotate its genes list with the external data downloaded from various databases.

Database(path[, file_resources, col_rename, …])

This is a base class used to instantiate an external Database given a a set of files from either local files or URLs.

Genomic databases

openomics.database.annotation Module

Classes

BioMart([host, verbose, cache, secure])

Interface to the BioMart service

BioMartManager(dataset, attributes, host, …)

Database(path[, file_resources, col_rename, …])

This is a base class used to instantiate an external Database given a a set of files from either local files or URLs.

EnsemblGeneSequences([biomart, attributes, …])

EnsemblGenes([biomart, attributes, host, …])

EnsemblSNP([biomart, attributes, host, …])

EnsemblSomaticVariation([biomart, …])

EnsemblTranscriptSequences([biomart, …])

GTEx([path, file_resources, col_rename, …])

NONCODE(path[, file_resources, col_rename, …])

ProteinAtlas([path, file_resources, …])

RNAcentral([path, file_resources, …])

StringIO([initial_value, newline])

Text I/O implementation using an in-memory buffer.

TANRIC(path[, file_resources, col_rename, …])

Class Inheritance Diagram
Inheritance diagram of openomics.database.annotation.BioMartManager, openomics.database.annotation.EnsemblGeneSequences, openomics.database.annotation.EnsemblGenes, openomics.database.annotation.EnsemblSNP, openomics.database.annotation.EnsemblSomaticVariation, openomics.database.annotation.EnsemblTranscriptSequences, openomics.database.annotation.GTEx, openomics.database.annotation.NONCODE, openomics.database.annotation.ProteinAtlas, openomics.database.annotation.RNAcentral, openomics.database.annotation.TANRIC

Sequence databases

openomics.database.sequence Module

Classes

Database(path[, file_resources, col_rename, …])

This is a base class used to instantiate an external Database given a a set of files from either local files or URLs.

GENCODE([path, file_resources, col_rename, …])

Loads the GENCODE database from https://www.gencodegenes.org/ .

MirBase([path, file_resources, sequence, …])

SequenceDatabase([replace_U2T])

Provides a series of methods to extract sequence data from SequenceDataset.

Class Inheritance Diagram
Inheritance diagram of openomics.database.sequence.GENCODE, openomics.database.sequence.MirBase, openomics.database.sequence.SequenceDatabase

Interaction databases

openomics.database.interaction Module

Classes

BioGRID([path, file_resources, …])

BioMart([host, verbose, cache, secure])

Interface to the BioMart service

BioMartManager(dataset, attributes, host, …)

Database(path[, file_resources, col_rename, …])

This is a base class used to instantiate an external Database given a a set of files from either local files or URLs.

Dict(*args, **kwds)

EnsemblGeneSequences([biomart, attributes, …])

EnsemblGenes([biomart, attributes, host, …])

EnsemblSNP([biomart, attributes, host, …])

EnsemblSomaticVariation([biomart, …])

EnsemblTranscriptSequences([biomart, …])

GTEx([path, file_resources, col_rename, …])

GeneMania(path[, file_resources, …])

Interactions(path, file_resources[, …])

List(*args, **kwds)

LncBase(path[, file_resources, …])

LncRNA2Target([path, file_resources, …])

LncReg(path, file_resources[, …])

MiRTarBase([path, file_resources, …])

NONCODE(path[, file_resources, col_rename, …])

NPInter([path, file_resources, …])

ProteinAtlas([path, file_resources, …])

RNAcentral([path, file_resources, …])

STRING([path, file_resources, species_id, …])

SequenceDatabase([replace_U2T])

Provides a series of methods to extract sequence data from SequenceDataset.

StarBase(path, file_resources[, …])

StringIO([initial_value, newline])

Text I/O implementation using an in-memory buffer.

TANRIC(path[, file_resources, col_rename, …])

TargetScan(path[, file_resources, …])

lncRInter(path[, file_resources, …])

lncRNome(path, file_resources[, …])

Class Inheritance Diagram
Inheritance diagram of openomics.database.interaction.BioGRID, openomics.database.interaction.GeneMania, openomics.database.interaction.Interactions, openomics.database.interaction.LncBase, openomics.database.interaction.LncRNA2Target, openomics.database.interaction.LncReg, openomics.database.interaction.MiRTarBase, openomics.database.interaction.NPInter, openomics.database.interaction.STRING, openomics.database.interaction.StarBase, openomics.database.interaction.TargetScan, openomics.database.interaction.lncRInter, openomics.database.interaction.lncRNome

Disease databases

openomics.database.disease Module

Classes

BioMart([host, verbose, cache, secure])

Interface to the BioMart service

BioMartManager(dataset, attributes, host, …)

Database(path[, file_resources, col_rename, …])

This is a base class used to instantiate an external Database given a a set of files from either local files or URLs.

DisGeNet([path, file_resources, curated, …])

DiseaseAssociation(path[, file_resources])

EnsemblGeneSequences([biomart, attributes, …])

EnsemblGenes([biomart, attributes, host, …])

EnsemblSNP([biomart, attributes, host, …])

EnsemblSomaticVariation([biomart, …])

EnsemblTranscriptSequences([biomart, …])

GTEx([path, file_resources, col_rename, …])

HMDD([path, file_resources, col_rename])

LncRNADisease([path, file_resources, …])

MalaCards([path, file_resources, col_rename])

NONCODE(path[, file_resources, col_rename, …])

OMIM(path[, file_resources])

ProteinAtlas([path, file_resources, …])

RNAcentral([path, file_resources, …])

StringIO([initial_value, newline])

Text I/O implementation using an in-memory buffer.

TANRIC(path[, file_resources, col_rename, …])

Class Inheritance Diagram
Inheritance diagram of openomics.database.disease.DisGeNet, openomics.database.disease.DiseaseAssociation, openomics.database.disease.HMDD, openomics.database.disease.LncRNADisease, openomics.database.disease.MalaCards, openomics.database.disease.OMIM

Ontology databases

openomics.database.ontology Module

Functions

dfs_path(graph, path)

filter_dfs_paths(paths_df)

flatten(lst)

flatten_list(list_in)

slice_adj(adj, node_list, nodes_A[, nodes_B])

param adj

traverse_predecessors(network, seed_node[, type])

Returns all successor terms from seed_node by traversing the ontology network with edges == type.

write_taxonomy(network, root_nodes, file_path)

param network

A network with edge(i, j) where i is a node and j is a child of i.

Classes

Database(path[, file_resources, col_rename, …])

This is a base class used to instantiate an external Database given a a set of files from either local files or URLs.

GeneOntology([path, file_resources, …])

HumanPhenotypeOntology(path[, …])

Ontology(path[, file_resources, col_rename, …])

Class Inheritance Diagram
Inheritance diagram of openomics.database.ontology.GeneOntology, openomics.database.ontology.HumanPhenotypeOntology, openomics.database.ontology.Ontology

Utilities

openomics.utils Package

Functions

get_pkg_data_filename(dataurl, file)

Downloads a remote file given the url, then caches it to the user’s home folder.

read_gtf(filepath_or_buffer[, npartitions, …])

Parse a GTF into a dictionary mapping column names to sequences of values.

Frequently Asked Questions (FAQ)

How to change the download directory where database files are cached?

You can make the setting at openomics.set_cache_dir(path="PATH/OF/YOUR/CHOICE/") which will change the default cache location. However, this setting doesn’t persist across user sessions. You can also change the default cache directory at a ~/.openomics/conf.json to persist the setting for each user.

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

  • Web development with the Dash for the OpenOmics dashboard webserver.

  • Documentation standards for the OpenOmics Python API.

  • Organize the library of genomics, functional ontologies, interactions, and sequence databases for variety of biological studies.

  • Implement general purpose utilities for importing various fasta, gtf and sequencing files.

Report Bugs

Report bugs at openomics/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting.

  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

OpenOmics could always use more documentation, whether as part of the official OpenOmics docs, in docstrings within the API, or even on the web in blog posts, articles, and such.

If you’d like to help write RTD documentations, note:

Submit Feedback

The best way to send feedback is to file an issue at openomics/issues.

If you are proposing a feature:

  • Explain in detail how it would work.

  • Keep the scope as narrow as possible, to make it easier to implement.

  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up openomics for local development.

  1. Fork the openomics repo on GitHub.

  2. Clone your fork locally and work on the develop branch:

$ git clone git@github.com:your_name_here/openomics.git
$ git checkout develop
  1. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

$ mkvirtualenv openomics
$ cd openomics/
$ python setup.py develop
  1. Create a branch for local development:

$ git checkout -b name-of-your-bugfix-or-feature

Now you can make your changes locally.

  1. When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

$ flake8 openomics tests
$ python setup.py test or py.test $ tox

To get flake8 and tox, just pip install them into your virtualenv.

  1. Commit your changes and push your branch to GitHub:

$ git add . $ git commit -m "Your detailed description of your changes."
$ git push develop name-of-your-bugfix-or-feature
  1. Submit a pull request through the GitHub website to the develop branch. Once major features are tested, we can create another pull-request to the master branch.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests. Run tests by with pytest ./ and make sure tests are 100% passing.

  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in docs/history.md.

  3. The pull request should work for Python 3.6 or higher, and for PyPi. Check Github Actions Tests and make sure that the tests pass for all supported Python versions and operating systems.

Tips

To run the automated tests locally, run this at the root directory:

pytest ./

Hint

To run a subset of tests:

$ py.test tests.test_openomics

To run tests targeting various operating systems and Python versions, make a pull-request to the master branch which will run as Github Actions Tests.

Deploying

A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.rst). Then run:

$ bumpversion patch # possible: major / minor / patch
$ git push --tags

Github Actions will then deploy to PyPI if tests pass.

Code of Conduct

Please note that the OpenOmics project is released with a Contributor Code of Conduct. By contributing to this project you agree to abide by its terms.

Release history

0.9.0 (Future)

  • Build web-app dashboard interface for importing user multiomics data files

0.8.6 (Pending)

  • Changed to Github Actions CI from Travis CI

  • Revamped openomics.readthedocs.io

  • Fixed bugs from pyOpenSci reviews.

0.8.5 (2020-02-19)

  • Importing GENCODE gtf files using dask now works with gzip compressed files.

  • Improved coverage and performance of automated tests.

0.8.4 (2020-01-07)

  • Enabled the support for Dask dataframes for ExpressionData and Annotation Dataset’s. To use this feature, simply use the npartitions argument when instantiating an ExpressionData or Annotation/Sequence/Interaction Dataset.

0.7.2 (2019-09-01)

  • Added compatibility for Python 2.7

  • Refactored ClinicalData

  • Built working documentations with Sphinx on readthedocs

  • Added pytests for MultiOmicsData

  • First successful build on Travis CI on Python 3.4-3.7, 2.7

0.6.0 (2019-08-31)

  • First release on PyPI.