SequenceDatabase

class openomics.database.sequence.SequenceDatabase(replace_U2T=False, **kwargs)

Bases: openomics.database.base.Database

Provides a series of methods to extract sequence data from SequenceDataset.

Methods Summary

get_aggregator([agg])

Returns a function used to aggregate a list of sequences from a groupby on a given key.

get_sequences(index, omic, agg_sequences, …)

Returns a dictionary whose keys are values of the ‘index’ field and whose values are the corresponding sequence(s).

read_fasta(fasta_file, replace_U2T[, …])

Returns a pandas DataFrame containing the fasta sequence entries.

Methods Documentation

static get_aggregator(agg=None)

Returns a function used to aggregate a list of sequences from a groupby on a given key.

Parameters

agg – One of (“all”, “shortest”, “longest”), default “all”. If “all”, then for all
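As a rough illustration of what such an aggregator could look like, the sketch below maps each documented option to a plain Python function. The name make_aggregator is hypothetical and this is not the library's own code; it also assumes that “all” keeps every sequence in the group, and that “shortest” and “longest” pick a single sequence by length.

```python
from typing import Callable, List, Optional, Union

Aggregated = Union[List[str], str]


def make_aggregator(agg: Optional[str] = None) -> Callable[[List[str]], Aggregated]:
    """Hypothetical stand-in for get_aggregator (not the library's implementation)."""
    if agg is None or agg == "all":
        # Assumption: "all" keeps every sequence in the group as a list.
        return lambda seqs: list(seqs)
    if agg == "shortest":
        # Assumption: pick the single sequence with the fewest characters.
        return lambda seqs: min(seqs, key=len)
    if agg == "longest":
        # Assumption: pick the single sequence with the most characters.
        return lambda seqs: max(seqs, key=len)
    raise ValueError(f"agg must be one of 'all', 'shortest', 'longest'; got {agg!r}")


pick_longest = make_aggregator("longest")
longest = pick_longest(["ACGT", "AC", "ACGTAC"])
```

The returned function can then be applied to each group of sequences, for example through a pandas groupby on the chosen key.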

abstract get_sequences(index, omic, agg_sequences, **kwargs)

Returns a dictionary whose keys are values of the ‘index’ field and whose values are the corresponding sequence(s).

Parameters
  • index (str) – {“gene_id”, “gene_name”, “transcript_id”, “transcript_name”}

  • omic (str) – {“lncRNA”, “microRNA”, “messengerRNA”}

  • agg_sequences (str) – {“all”, “shortest”, “longest”}

  • **kwargs – any additional argument to pass to SequenceDataset.get_sequences()
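Because get_sequences is abstract, each subclass supplies its own logic. The sketch below only illustrates the general shape of the result, assuming the sequence entries sit in a pandas DataFrame with a ‘sequence’ column. The function sequences_by_index and the toy data are hypothetical, and the omic filter is left out for brevity.

```python
import pandas as pd

# Toy entries; real data would come from a parsed FASTA file.
entries = pd.DataFrame(
    {
        "gene_id": ["G1", "G1", "G2"],
        "transcript_id": ["T1", "T2", "T3"],
        "sequence": ["ACGT", "ACGTACGT", "TTGA"],
    }
)


def sequences_by_index(df: pd.DataFrame, index: str, agg_sequences: str = "all") -> dict:
    """Hypothetical helper: group sequences by `index` and aggregate each group."""
    grouped = df.groupby(index)["sequence"]
    if agg_sequences == "all":
        result = grouped.apply(list)  # keep every sequence per key
    elif agg_sequences == "shortest":
        result = grouped.apply(lambda s: min(s, key=len))
    elif agg_sequences == "longest":
        result = grouped.apply(lambda s: max(s, key=len))
    else:
        raise ValueError(f"unsupported agg_sequences: {agg_sequences!r}")
    return result.to_dict()


by_gene = sequences_by_index(entries, index="gene_id", agg_sequences="longest")
```

Under these assumptions, by_gene maps each gene_id to its longest sequence, while agg_sequences="all" would map each gene_id to a list of sequences.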

abstract read_fasta(fasta_file, replace_U2T, npartitions=None)

Returns a pandas DataFrame containing the fasta sequence entries, with a column named ‘sequence’.

Parameters
  • fasta_file (str) – path to the fasta file, usually as self.file_resources[<file_name>]

  • replace_U2T (bool) –

  • npartitions
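Since read_fasta is abstract, the parsing itself is left to subclasses. The following is a minimal sketch of one way to build such a DataFrame, not the library's implementation. The function parse_fasta and the ‘header’ column are hypothetical, and it assumes from the parameter name that replace_U2T means replacing the letter U with T in each sequence. The npartitions parameter is not covered here.

```python
import io

import pandas as pd


def parse_fasta(handle, replace_U2T: bool = False) -> pd.DataFrame:
    """Hypothetical parser: one row per FASTA record, with a 'sequence' column."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # start a new record
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))

    df = pd.DataFrame(records, columns=["header", "sequence"])
    if replace_U2T:
        # Assumption: convert RNA-style sequences by replacing U with T.
        df["sequence"] = df["sequence"].str.replace("U", "T", regex=False)
    return df


demo = io.StringIO(">tx1 demo\nACGU\nUUGA\n>tx2\nGGCU\n")
fasta_df = parse_fasta(demo, replace_U2T=True)
```

An in-memory handle is used here so the sketch runs without a file on disk; with a real file, the handle would come from open() on the fasta_file path.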