Database

class openomics.database.base.Database(path, file_resources=None, index_col=None, keys=None, usecols=None, col_rename=None, blocksize=None, verbose=False, **kwargs)

Bases: object

This is a base class used to instantiate an external Database given a set of files, either local files or URLs. When a Database is created, load_dataframe() is called; it uses the file_resources to load (Pandas or Dask) DataFrames and then performs data wrangling to yield a DataFrame at self.data. This class also provides an interface for -omics tables, e.g. ExpressionData, to add annotations, expressions, sequences, and disease associations.

Attributes Summary

Methods Summary

close()

get_annotations(on, columns[, agg, agg_for, ...])

Returns the Database's DataFrame indexed by the column(s) given in on, after applying a groupby operation that aggregates the other columns by concatenating all unique values.

get_expressions(index)

param index

list_databases()

load_dataframe(file_resources[, blocksize])

Handles data preprocessing given the file_resources input, and returns a DataFrame.

name()

Attributes Documentation

COLUMNS_RENAME_DICT = None[source]#

Methods Documentation

close()

get_annotations(on, columns, agg='unique', agg_for=None, keys=None)

Returns the Database’s DataFrame indexed by the column(s) given in on, after applying a groupby operation that aggregates the other columns by concatenating all unique values.

Parameters
  • on (str, list) – The column name(s) of the DataFrame to group by.

  • columns (list) – a list of column names to aggregate.

  • agg (str) – Function used to aggregate when there is more than one value for an index key. E.g. [‘first’, ‘last’, ‘sum’, ‘mean’, ‘size’, ‘concat’]; the default in the signature is ‘unique’.

  • agg_for (Dict[str, Any]) – Overrides agg for specific columns. A dict mapping column names to the aggregation function to use for that column.

  • keys (pd.Index) – The values on the index column to filter before performing the groupby-agg operations.

Returns

A filtered, grouped, and aggregated DataFrame to be used for annotation.

Return type

values
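The filter, groupby, and aggregate steps described for get_annotations() can be sketched in plain pandas. This is only an illustration of the technique, not the openomics implementation: the annotate() and concat_unique() helpers, the "|" separator, and the toy column names are all assumptions made for this example.

```python
import pandas as pd

# Toy annotation table; the column names are hypothetical.
df = pd.DataFrame({
    "gene_name": ["TP53", "TP53", "BRCA1", "EGFR"],
    "transcript_id": ["T1", "T2", "T3", "T4"],
    "disease": ["cancer", "cancer", "breast cancer", "lung cancer"],
})


def concat_unique(series, sep="|"):
    """Join the distinct non-null values of a group into one string."""
    return sep.join(series.dropna().astype(str).unique())


def annotate(df, on, columns, agg=concat_unique, agg_for=None, keys=None):
    """Hypothetical helper: filter rows by keys, group by `on`, aggregate `columns`."""
    if keys is not None:
        # Keep only rows whose grouping column matches one of the keys.
        df = df[df[on].isin(keys)]
    # Use `agg` for every requested column unless agg_for overrides it.
    agg_funcs = {col: agg for col in columns}
    if agg_for:
        agg_funcs.update(agg_for)
    return df.groupby(on).agg(agg_funcs)


result = annotate(
    df,
    on="gene_name",
    columns=["transcript_id", "disease"],
    agg_for={"disease": "first"},  # per-column override
    keys=["TP53", "BRCA1"],
)
print(result)
```

Here the two TP53 rows collapse into a single row whose transcript_id holds both transcript identifiers, while disease keeps only the first value because of the per-column override.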

get_expressions(index)
Parameters

index

static list_databases()

abstract load_dataframe(file_resources, blocksize=None)

Handles data preprocessing given the file_resources input, and returns a DataFrame.

Parameters
  • file_resources (dict) – A dict with filenames as keys and full file paths as values.

  • blocksize (int) –

Return type

DataFrame
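Because load_dataframe() is abstract, each concrete database is expected to supply its own loading and wrangling logic. The sketch below mimics that pattern with a stand-in base class rather than importing openomics; TableSource, GeneTable, the "genes.tsv" key, and the column names are hypothetical, and an in-memory buffer takes the place of a real file path so the example is self-contained.

```python
import io
from abc import ABC, abstractmethod

import pandas as pd


class TableSource(ABC):
    """Hypothetical stand-in for the base-class pattern; not openomics code."""

    def __init__(self, file_resources):
        self.file_resources = file_resources
        # Loading happens at construction time and the result lands in self.data.
        self.data = self.load_dataframe(file_resources)

    @abstractmethod
    def load_dataframe(self, file_resources, blocksize=None):
        """Read the given resources and return a wrangled DataFrame."""


class GeneTable(TableSource):
    def load_dataframe(self, file_resources, blocksize=None):
        # blocksize is accepted for signature compatibility but unused here.
        df = pd.read_csv(file_resources["genes.tsv"], sep="\t")
        # Example wrangling step: rename a column and use it as the index.
        df = df.rename(columns={"symbol": "gene_name"})
        return df.set_index("gene_name")


# An in-memory buffer stands in for a full file path (assumption for the demo).
resources = {"genes.tsv": io.StringIO("symbol\tchrom\nTP53\t17\nBRCA1\t17\n")}
db = GeneTable(resources)
print(db.data)
```

Instantiating TableSource directly raises a TypeError, which is how the abstract-method requirement is enforced in this sketch.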

classmethod name()