Annotatable

class openomics.database.base.Annotatable

Bases: abc.ABC

This abstract class provides an interface for an -omics (Expression) object to annotate its gene list with external data downloaded from various databases. The database is imported either as attribute information in the genes' annotations, or as interactions between the genes.

Attributes Summary

Methods Summary

annotate_attributes(database, on, columns[, ...])

Performs a left outer join between the annotation and the Database's DataFrame, on the keys in the on column.

annotate_diseases(database, on)

annotate_expressions(database, index[, ...])

annotate_interactions(database, index)

annotate_sequences(database, on[, agg, omic])

Annotate a genes list (based on index) with a dictionary of <gene_name: sequence>.

get_annotation_expressions()

get_annotations()

get_rename_dict(from_index, to_index)

Utility function used to retrieve a lookup dictionary to convert from one index to another, e.g., gene_id to gene_name, obtained from two columns in the dataframe.

set_index(new_index)

Resets the index of the annotations to new_index (str).

Attributes Documentation

DISEASE_ASSOCIATIONS_COL = 'disease_associations'

Methods Documentation

annotate_attributes(database, on, columns, agg='unique', agg_for=None, fuzzy_match=False, list_match=False)

Performs a left outer join between the annotation and the Database’s DataFrame, on the keys in the on column. The on argument must be a column present in both DataFrames. If the join produces overlapping columns, .fillna() is used to fill NaN values in the old column with non-NaN values from the new column.
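The join-then-fill behaviour described above can be sketched with plain pandas. This is only an illustration under stated assumptions, not the library's implementation: the two tables, their column names, and the "_new" suffix are all hypothetical.

```python
import pandas as pd

# Hypothetical annotation table, indexed by gene name, with gaps in one column.
annotations = pd.DataFrame(
    {"gene_name": ["TP53", "BRCA1", "EGFR"],
     "gene_biotype": ["protein_coding", None, None]}
).set_index("gene_name")

# Hypothetical database table; it has no row for EGFR.
database_df = pd.DataFrame(
    {"gene_name": ["TP53", "BRCA1"],
     "gene_biotype": ["protein_coding", "protein_coding"],
     "chromosome": ["17", "17"]}
).set_index("gene_name")

# Left outer join on the shared index: every annotation row is kept.
# The overlapping column from the database gets the "_new" suffix.
joined = annotations.join(database_df, how="left", rsuffix="_new")

# Fill NaN values in the old column with values from the new column,
# then drop the temporary suffixed column.
joined["gene_biotype"] = joined["gene_biotype"].fillna(joined["gene_biotype_new"])
joined = joined.drop(columns="gene_biotype_new")

print(joined)
```

Rows without a match in the database (EGFR here) stay in the result with NaN in the joined columns, which is what distinguishes a left outer join from an inner join.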

Parameters
  • database (Database) – Database which contains a DataFrame.

  • on (str) – The column name which exists in both the annotations and Database dataframe to perform the join on.

  • columns ([str]) – A list of column names to join to the annotation.

  • agg (str) – Function used to aggregate when there is more than one value for each index instance. E.g. [‘first’, ‘last’, ‘sum’, ‘mean’, ‘unique’, ‘concat’], default ‘unique’.

  • agg_for (Dict[str, Any]) – Bypass the agg function for certain columns with functions specified in this dict of column names and the agg function to aggregate for that column.

  • fuzzy_match (bool) – default False. Whether to join the annotation by applying a fuzzy match on the string value index with difflib.get_close_matches(). It can be slow and thus should only be used sparingly.

  • list_match (bool) – default False.
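To illustrate the fuzzy_match option, the sketch below builds a lookup from annotation index values to their closest database index values using difflib.get_close_matches() from the standard library. The helper name fuzzy_lookup, the cutoff of 0.8, and the example identifiers are assumptions made for this example; how the library applies the matches internally is not shown here.

```python
import difflib

def fuzzy_lookup(keys, candidates, cutoff=0.8):
    """Map each key to its closest candidate string, or None if no
    candidate reaches the similarity cutoff. (Hypothetical helper.)"""
    mapping = {}
    for key in keys:
        # n=1 keeps only the best match; cutoff is a similarity ratio in [0, 1].
        matches = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
        mapping[key] = matches[0] if matches else None
    return mapping

# Hypothetical index values that differ in case or suffix.
annotation_index = ["hsa-mir-21", "hsa-let-7a", "unknown-gene"]
database_index = ["hsa-miR-21", "hsa-let-7a-1", "hsa-miR-155"]

lookup = fuzzy_lookup(annotation_index, database_index)
print(lookup)
```

Each key is compared against every candidate, so the cost grows with the product of the two index sizes, which is consistent with the advice to use fuzzy matching sparingly.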

annotate_diseases(database, on)
Parameters
  • database

  • on

annotate_expressions(database, index, fuzzy_match=False)
Parameters
  • database

  • index

  • fuzzy_match

annotate_interactions(database, index)
Parameters
  • database

  • index

annotate_sequences(database, on, agg='longest', omic=None, **kwargs)

Annotate a gene list (based on index) with a dictionary of <gene_name: sequence>. If a gene name has multiple sequences, they are aggregated according to agg.

Parameters
  • database (SequenceDatabase) – The database

  • on (str) – The gene index column name.

  • agg (str) – The aggregation method, one of [“longest”, “shortest”, or “all”]. Default longest.

  • omic (str) – Default None. Declare the omic type to fetch sequences for.

  • **kwargs
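The three aggregation modes listed for agg can be sketched with standard-library Python. This is a minimal illustration, assuming the sequences arrive as (gene_name, sequence) pairs; the function name aggregate_sequences and the example records are hypothetical and do not reflect how SequenceDatabase stores its data.

```python
from collections import defaultdict

def aggregate_sequences(records, agg="longest"):
    """Collapse multiple sequences per gene name into one dictionary entry.
    (Hypothetical helper for illustration.)"""
    grouped = defaultdict(list)
    for gene_name, sequence in records:
        grouped[gene_name].append(sequence)

    if agg == "longest":
        return {gene: max(seqs, key=len) for gene, seqs in grouped.items()}
    if agg == "shortest":
        return {gene: min(seqs, key=len) for gene, seqs in grouped.items()}
    if agg == "all":
        # Keep every sequence for the gene as a list.
        return dict(grouped)
    raise ValueError(f"Unsupported agg: {agg!r}")

# Hypothetical records: TP53 has two transcripts of different lengths.
records = [
    ("TP53", "ATGGAG"),
    ("TP53", "ATGGAGGAGCCG"),
    ("BRCA1", "ATGGAT"),
]

longest = aggregate_sequences(records, agg="longest")
shortest = aggregate_sequences(records, agg="shortest")
everything = aggregate_sequences(records, agg="all")
print(longest)
```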

get_annotation_expressions()
get_annotations()
get_rename_dict(from_index, to_index)

Utility function used to retrieve a lookup dictionary to convert from one index to another, e.g., gene_id to gene_name, obtained from two columns in the dataframe.

Returns

Dict[str, str]: the lookup dictionary.

Parameters
  • from_index (str) – an index on the DataFrame for key

  • to_index (str) – an index on the DataFrame for value
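A lookup dictionary of this kind can be built from two DataFrame columns in a few lines of pandas. The sketch below is an assumption about the general approach, not the method's actual code: the helper name build_rename_dict and the example table are hypothetical, and keeping the first mapping for a duplicated key is a choice made only for this example.

```python
import pandas as pd

def build_rename_dict(df, from_index, to_index):
    """Return {from_index value: to_index value} from two columns.
    (Hypothetical helper for illustration.)"""
    pairs = (
        df[[from_index, to_index]]
        .dropna()                             # skip rows missing either value
        .drop_duplicates(subset=from_index)   # keep the first mapping per key
    )
    return dict(zip(pairs[from_index], pairs[to_index]))

# Hypothetical annotation table with both identifier columns.
annotations = pd.DataFrame(
    {"gene_id": ["ENSG00000141510", "ENSG00000012048", "ENSG00000141510"],
     "gene_name": ["TP53", "BRCA1", "TP53"]}
)

rename = build_rename_dict(annotations, "gene_id", "gene_name")
print(rename)
```

A dictionary like this can then be passed to pandas methods such as Series.map or DataFrame.rename to convert from one identifier to the other.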

set_index(new_index)

Resets the index of the annotations to new_index.

Parameters
  • new_index (str)