Database

class openomics.database.base.Database(path, file_resources=None, col_rename=None, npartitions=None, verbose=False)

Bases: object

This is the base class for instantiating an external Database from a set of files, given either as local paths or as URLs. When a Database is created, load_dataframe() is called: it uses file_resources to load (Pandas or Dask) DataFrames, then performs data wrangling to yield a dataframe at self.data. This class also provides an interface for -omics tables, e.g. ExpressionData, to add annotations, expressions, sequences, and disease associations.

Attributes Summary

COLUMNS_RENAME_DICT

Methods Summary

`close`()
`get_annotations`(index, columns[, agg, …])	Returns the Database’s DataFrame indexed by index, after applying a groupby operation that aggregates all other columns by concatenating their unique values.
`get_expressions`(index)
`info`()
`list_databases`()
`load_dataframe`(file_resources[, npartitions])	Handles data preprocessing given the file_resources input, and returns a DataFrame.
`name`()
`validate_file_resources`(path, file_resources)	For each file in file_resources, download the file if path+file is a URL or load from disk if a local path.

Attributes Documentation

COLUMNS_RENAME_DICT = None

Methods Documentation

close()

get_annotations(index, columns, agg='concat', filter_values=None)

Returns the Database’s DataFrame indexed by index, after applying a groupby operation that aggregates all other columns by concatenating their unique values.

Parameters

index (str) – The column name of the DataFrame to join by.
columns (list) – A list of column names.
agg (str) – Function used to aggregate when there is more than one value for an index instance. One of [‘first’, ‘last’, ‘sum’, ‘mean’, ‘size’, ‘concat’]. Default ‘concat’.
filter_values (pd.Series) – The values on the index column to filter before performing the groupby-agg operations.

Returns

A dataframe to be used for annotation

Return type

DataFrame
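
The groupby-and-concatenate behaviour described above can be illustrated with plain pandas. This is a minimal sketch, not the openomics implementation: the helper names annotations_sketch and concat_unique, the toy table, and the "|" separator are all assumptions made for illustration.

```python
import pandas as pd


def concat_unique(values: pd.Series) -> str:
    """Join the unique non-null values of one group into a single string.

    The "|" separator is an assumption; the library may use another one.
    """
    return "|".join(sorted(set(values.dropna().astype(str))))


def annotations_sketch(df, index, columns, filter_values=None):
    """Hypothetical stand-in for get_annotations(..., agg='concat')."""
    if filter_values is not None:
        # Keep only rows whose index column matches the requested values.
        df = df[df[index].isin(filter_values)]
    # One row per unique index value; each column holds its unique values.
    return df.groupby(index)[columns].agg(concat_unique)


# Toy annotation table (made-up values).
table = pd.DataFrame({
    "gene_name": ["TP53", "TP53", "BRCA1", "EGFR"],
    "go_id": ["GO:1", "GO:2", "GO:3", "GO:4"],
    "disease": ["cancer", "cancer", "cancer", None],
})

annotations = annotations_sketch(
    table,
    index="gene_name",
    columns=["go_id", "disease"],
    filter_values=pd.Series(["TP53", "EGFR"]),
)
print(annotations)
```

Filtering before the groupby keeps the aggregation limited to the index values of interest, which is the role the filter_values parameter plays in the description above.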

get_expressions(index)

Parameters

index –

info()

static list_databases()

abstract load_dataframe(file_resources, npartitions=None)

Handles data preprocessing given the file_resources input, and returns a DataFrame.

Parameters

file_resources (dict) – A dict with keys as filenames and values as full file path.
npartitions (int) – Number of partitions when the data is loaded as a Dask DataFrame. Default None.
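
To make the abstract-method contract concrete, here is a minimal sketch of a subclass that implements load_dataframe. It uses a simplified stand-in base class rather than importing openomics, so DatabaseSketch, ToyGeneTable, the genes.tsv key, and the in-memory file are all hypothetical. The only behaviour taken from the description above is that the constructor calls load_dataframe() and stores the result at self.data.

```python
import io
from abc import ABC, abstractmethod

import pandas as pd


class DatabaseSketch(ABC):
    """Simplified stand-in for the base class (not the openomics code)."""

    def __init__(self, file_resources, npartitions=None):
        # The constructor delegates loading and wrangling to the subclass.
        self.data = self.load_dataframe(file_resources, npartitions=npartitions)

    @abstractmethod
    def load_dataframe(self, file_resources, npartitions=None):
        """Return a DataFrame built from the given file resources."""


class ToyGeneTable(DatabaseSketch):
    """Hypothetical subclass that reads one tab-separated file."""

    def load_dataframe(self, file_resources, npartitions=None):
        # file_resources maps a filename to something pandas can read.
        df = pd.read_csv(file_resources["genes.tsv"], sep="\t")
        # Example wrangling step: drop exact duplicate rows.
        return df.drop_duplicates().reset_index(drop=True)


# An in-memory file stands in for a real path, for illustration only.
toy_file = io.StringIO("symbol\tchrom\nTP53\tchr17\nTP53\tchr17\nEGFR\tchr7\n")
db = ToyGeneTable({"genes.tsv": toy_file})
print(db.data)
```

Because load_dataframe is abstract, instantiating the base class directly raises a TypeError; only subclasses that implement it can be created.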

classmethod name()

validate_file_resources(path, file_resources, npartitions=None, verbose=False)

For each file in file_resources, download the file if path+file is a URL or load from disk if a local path. Additionally unzip or unrar if the file is compressed.

Parameters

path (str) – The folder or URL path containing the data file resources. If a URL path, the files will be downloaded and cached to the user’s home folder (at ~/.astropy/).
file_resources (dict) – Lists the files required for preprocessing the database, as a dictionary where keys are required filenames and values are file paths. If None, the class constructor should automatically build the required file resources dict. Default None.
npartitions (int) – A positive integer if the files will be used to create a Dask DataFrame. Default None.
verbose (bool) – Default False.

Return type

None
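
The decision described for validate_file_resources (download when path+file is a URL, load from disk when it is a local path, and extract compressed files) can be sketched with the standard library alone. This is an illustration of the logic under stated assumptions, not the library's code: plan_resources, is_url, the recognised URL schemes, and the example.org address are all hypothetical, and nothing is actually downloaded or extracted.

```python
import os
from urllib.parse import urlparse

# Assumed set of extensions that would trigger extraction.
COMPRESSED_EXTENSIONS = (".zip", ".rar")


def is_url(location: str) -> bool:
    """Treat http, https, and ftp locations as remote (an assumption)."""
    return urlparse(location).scheme in ("http", "https", "ftp")


def plan_resources(path, filenames):
    """Decide, per file, where it lives and what would be done with it."""
    plan = {}
    for filename in filenames:
        if is_url(path):
            location = path.rstrip("/") + "/" + filename
            action = "download"
        else:
            location = os.path.join(path, filename)
            action = "load"
        plan[filename] = {
            "location": location,
            "action": action,
            "extract": filename.lower().endswith(COMPRESSED_EXTENSIONS),
        }
    return plan


plan = plan_resources("https://example.org/data/", ["genes.tsv", "annotations.zip"])
for filename, step in plan.items():
    print(filename, step)
```

Separating the planning step from the download and extraction steps keeps the URL-versus-local decision easy to inspect before any network or disk access happens.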