Database
- class openomics.database.base.Database(path, file_resources=None, col_rename=None, npartitions=None, verbose=False)
Bases: object
This is a base class used to instantiate an external Database from a set of files, given either as local files or as URLs. When a Database is created, load_dataframe() is called: it uses file_resources to load (Pandas or Dask) DataFrames, then performs data wrangling to yield a DataFrame at self.data. This class also provides an interface through which -omics tables, e.g. ExpressionData, can be annotated with annotations, expressions, sequences, and disease associations.
Attributes Summary
Methods Summary
close()
get_annotations(index, columns[, agg, …]) – Returns the Database’s DataFrame indexed by the index column, after applying a groupby operation that aggregates all other columns by concatenating all unique values.
get_expressions(index)
info()
load_dataframe(file_resources[, npartitions]) – Handles data preprocessing given the file_resources input, and returns a DataFrame.
name()
validate_file_resources(path, file_resources) – For each file in file_resources, downloads the file if path+file is a URL, or loads it from disk if it is a local path.
Attributes Documentation
Methods Documentation
- get_annotations(index, columns, agg='concat', filter_values=None)
Returns the Database’s DataFrame indexed by the index column, after applying a groupby operation that aggregates all other columns by concatenating all unique values.
- Parameters
index (str) – The column name of the DataFrame to join by.
columns (list) – A list of column names.
agg (str) – Function used to aggregate when there is more than one value for each index instance, e.g. one of [‘first’, ‘last’, ‘sum’, ‘mean’, ‘size’, ‘concat’]. Default ‘concat’.
filter_values (pd.Series) – The values on the index column to filter before performing the groupby-agg operations.
- Returns
A dataframe to be used for annotation
- Return type
DataFrame
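The 'concat' aggregation described above can be illustrated with a small library-free sketch. This is not the openomics implementation: the helper name concat_unique_by_index, the list-of-dicts input, and the toy values are all assumptions made for illustration. It only shows the idea of grouping rows by an index column and keeping the unique values of every other requested column.

```python
from collections import defaultdict


def concat_unique_by_index(rows, index, columns, filter_values=None):
    """Group rows (a list of dicts) by `index`, keeping unique values per column.

    Hypothetical helper that mirrors the documented semantics only.
    """
    grouped = defaultdict(lambda: {col: [] for col in columns})
    for row in rows:
        key = row[index]
        # Optionally restrict the index values before grouping.
        if filter_values is not None and key not in filter_values:
            continue
        for col in columns:
            value = row.get(col)
            # Skip missing values and values already collected for this key.
            if value is not None and value not in grouped[key][col]:
                grouped[key][col].append(value)
    return {key: dict(cols) for key, cols in grouped.items()}


# Toy rows with made-up identifiers.
rows = [
    {"gene_id": "g1", "go_id": "GO:1", "disease": "d1"},
    {"gene_id": "g1", "go_id": "GO:2", "disease": "d1"},
    {"gene_id": "g2", "go_id": "GO:3", "disease": None},
]
annotations = concat_unique_by_index(rows, index="gene_id", columns=["go_id", "disease"])
filtered = concat_unique_by_index(
    rows, index="gene_id", columns=["go_id"], filter_values={"g2"}
)
```

In this sketch each index value maps to one entry, so the result can be joined onto another table keyed by the same column, which is the role the returned annotation DataFrame plays.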
- abstract load_dataframe(file_resources, npartitions=None)
Handles data preprocessing given the file_resources input, and returns a DataFrame.
- Parameters
file_resources (dict) – A dict with keys as filenames and values as full file paths.
npartitions (int) –
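Because load_dataframe is abstract, each concrete database has to supply its own loader. The following is a minimal sketch of that pattern using only the standard library. SketchDatabase and ToyGeneTable are hypothetical stand-ins, not openomics classes, and a list of dicts stands in for the DataFrame.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod


class SketchDatabase(ABC):
    """Hypothetical stand-in for the base class; not the openomics code."""

    def __init__(self, file_resources, npartitions=None):
        self.file_resources = file_resources
        # The loader runs at construction time and its result is kept at self.data.
        self.data = self.load_dataframe(file_resources, npartitions=npartitions)

    @abstractmethod
    def load_dataframe(self, file_resources, npartitions=None):
        """Turn the file resources into a table."""


class ToyGeneTable(SketchDatabase):
    def load_dataframe(self, file_resources, npartitions=None):
        # npartitions is ignored here; it only matters for a Dask-backed loader.
        with open(file_resources["genes.csv"], newline="") as handle:
            return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genes.csv")
    with open(path, "w", newline="") as handle:
        handle.write("gene_id,gene_name\ng1,alpha\ng2,beta\n")
    # Keys are filenames, values are full file paths.
    db = ToyGeneTable({"genes.csv": path})
```

Instantiating SketchDatabase directly raises TypeError, which is how the abstract marker forces subclasses to implement the loader.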
- validate_file_resources(path, file_resources, npartitions=None, verbose=False)
For each file in file_resources, downloads the file if path+file is a URL, or loads it from disk if it is a local path. Additionally unzips or unrars the file if it is compressed.
- Parameters
path (str) – The folder or URL path containing the data file resources. If a URL path, the files will be downloaded and cached to the user’s home folder (at ~/.astropy/).
file_resources (dict) – Lists the files required for preprocessing the database: a dictionary where keys are required filenames and values are file paths. Default None, in which case the class constructor should automatically build the required file resources dict.
npartitions (int) – Set to a value >0 if the files will be used to create a Dask DataFrame. Default None.
verbose –
- Return type
None
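The URL-versus-local decision described for validate_file_resources can be sketched without touching the network or the disk. The function names is_url and plan_file_resources are hypothetical, as are the accepted URL schemes and the extension check for compressed files; the real method downloads, caches, and decompresses rather than returning a plan.

```python
import os
from urllib.parse import urlparse


def is_url(location):
    # Assumption: only these schemes count as remote resources.
    return urlparse(location).scheme in ("http", "https", "ftp")


def plan_file_resources(path, file_resources):
    """Return {filename: (action, location, is_compressed)} for each resource.

    Hypothetical helper: it only decides what would happen to each file.
    """
    plan = {}
    for filename, filepath in file_resources.items():
        if is_url(path):
            action = "download"
            location = path.rstrip("/") + "/" + filepath
        else:
            action = "local"
            location = os.path.join(path, filepath)
        # Assumption: compression is detected from the file extension.
        is_compressed = filepath.endswith((".zip", ".rar"))
        plan[filename] = (action, location, is_compressed)
    return plan


remote_plan = plan_file_resources(
    "https://example.org/data/", {"genes.tsv": "genes.tsv.zip"}
)
local_plan = plan_file_resources("/tmp/data", {"notes.txt": "notes.txt"})
```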