Database

class openomics.database.base.Database(path, file_resources=None, col_rename=None, npartitions=None, verbose=False)

Bases: object

This is a base class used to instantiate an external Database from a set of files, given either as local paths or as URLs. When a Database is created, load_dataframe() is called: it uses file_resources to load (Pandas or Dask) DataFrames and then performs data wrangling to yield a dataframe at self.data. This class also provides an interface for -omics tables, e.g. ExpressionData, to obtain annotations, expressions, sequences, and disease associations.
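The construction flow described above can be sketched as follows. This is a minimal illustration that assumes pandas is installed. MiniDatabase and GeneTable are hypothetical stand-ins written for this example and are not openomics classes; the file name and column names are likewise invented.

```python
import io
from abc import ABC, abstractmethod

import pandas as pd


class MiniDatabase(ABC):
    """Hypothetical stand-in for the pattern above; not openomics code."""

    def __init__(self, file_resources, col_rename=None):
        # The constructor calls load_dataframe() and stores the result at self.data.
        self.data = self.load_dataframe(file_resources)
        if col_rename:
            self.data = self.data.rename(columns=col_rename)

    @abstractmethod
    def load_dataframe(self, file_resources):
        """Subclasses turn the file resources into a DataFrame."""


class GeneTable(MiniDatabase):
    def load_dataframe(self, file_resources):
        # Read one tab-separated resource; real subclasses would do more wrangling.
        return pd.read_csv(file_resources["genes.tsv"], sep="\t")


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO("gene_id\tsymbol\nG1\tABC\nG2\tXYZ\n")
db = GeneTable({"genes.tsv": buffer}, col_rename={"symbol": "gene_name"})
```

Here db.data holds the loaded table with the symbol column renamed to gene_name, mirroring the role of col_rename in the constructor signature.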

Attributes Summary

Methods Summary

close()

get_annotations(index, columns[, agg, …])

Returns the Database’s DataFrame indexed by the index column, after applying a groupby operation that aggregates all other columns by concatenating all unique values.

get_expressions(index)


info()

list_databases()

load_dataframe(file_resources[, npartitions])

Handles data preprocessing given the file_resources input, and returns a DataFrame.

name()

validate_file_resources(path, file_resources)

For each file in file_resources, download the file if path+file is a URL or load from disk if a local path.

Attributes Documentation

COLUMNS_RENAME_DICT = None[source]

Methods Documentation

close()
get_annotations(index, columns, agg='concat', filter_values=None)

Returns the Database’s DataFrame indexed by the index column, after applying a groupby operation that aggregates all other columns by concatenating all unique values.

Parameters
  • index (str) – The column name of the DataFrame to join by.

  • columns (list) – A list of column names.

  • agg (str) – Function used to aggregate when there is more than one value for each index instance. One of [‘first’, ‘last’, ‘sum’, ‘mean’, ‘size’, ‘concat’]; default ‘concat’.

  • filter_values (pd.Series) – The values on the index column to filter before performing the groupby-agg operations.

Returns

A dataframe to be used for annotation

Return type

DataFrame
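The groupby-and-concatenate behaviour described for agg='concat' can be approximated with plain pandas. This is a sketch under stated assumptions, not the library's implementation: the helper names concat_unique and annotate, the "|" separator, and the toy data are all invented for illustration.

```python
import pandas as pd


def concat_unique(values):
    """Join the distinct non-null values of one group into a single string."""
    unique = values.dropna().astype(str).unique()
    return "|".join(unique)  # the "|" separator is an assumption


def annotate(df, index, columns, filter_values=None):
    """Hypothetical helper: filter on the index column, then group and aggregate."""
    if filter_values is not None:
        df = df[df[index].isin(filter_values)]
    return df.groupby(index)[columns].agg(concat_unique)


# Toy table with two rows for gene "A" and one row for gene "B".
df = pd.DataFrame({
    "gene_name": ["A", "A", "B"],
    "go_id": ["term1", "term2", "term3"],
    "disease": ["d1", "d1", "d2"],
})
annotations = annotate(df, "gene_name", ["go_id", "disease"])
filtered = annotate(df, "gene_name", ["go_id"], filter_values=pd.Series(["B"]))
```

The result has one row per gene_name value, with repeated values collapsed, so gene "A" keeps both go_id terms but only one copy of its disease value.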

get_expressions(index)
Parameters

index

info()
static list_databases()
abstract load_dataframe(file_resources, npartitions=None)

Handles data preprocessing given the file_resources input, and returns a DataFrame.

Parameters
  • file_resources (dict) – A dict with filenames as keys and full file paths as values.

  • npartitions (int) – Set to a value greater than 0 if the files will be used to create a Dask DataFrame. Default None.

classmethod name()
validate_file_resources(path, file_resources, npartitions=None, verbose=False)

For each file in file_resources, download the file if path+file is a URL or load from disk if a local path. Additionally unzip or unrar if the file is compressed.

Parameters
  • path (str) – The folder or URL path containing the data file resources. If a URL path, the files will be downloaded and cached to the user’s home folder (at ~/.astropy/).

  • file_resources (dict) – Lists the files required for preprocessing the database. A dictionary where keys are required filenames and values are file paths. If None, the class constructor should automatically build the required file resources dict. Default None.

  • npartitions (int) – Set to a value greater than 0 if the files will be used to create a Dask DataFrame. Default None.

  • verbose

Return type

None
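The URL-versus-local decision described for validate_file_resources can be sketched with the standard library alone. The functions is_url and plan_resources are hypothetical names introduced for this example; the actual download, caching, and decompression steps are deliberately left out, and the example.org address is a placeholder.

```python
import os
from urllib.parse import urlparse


def is_url(location):
    """Treat http, https, and ftp locations as remote resources."""
    return urlparse(location).scheme in ("http", "https", "ftp")


def plan_resources(path, file_resources):
    """Hypothetical helper: decide per file whether to download or read from disk."""
    plan = {}
    for filename, location in file_resources.items():
        if is_url(location) or os.path.isabs(location):
            full = location  # already a complete URL or absolute path
        elif is_url(path):
            full = path.rstrip("/") + "/" + location
        else:
            full = os.path.join(path, location)
        plan[filename] = ("download" if is_url(full) else "local", full)
    return plan


remote = plan_resources("https://example.org/data/", {"genes.tsv": "genes.tsv"})
local = plan_resources("data", {"genes.tsv": "genes.tsv"})
```

Each entry in the returned plan pairs an action with the resolved location, which a real implementation would then download or open.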