UniProt

class openomics.database.sequence.UniProt(path='https://ftp.uniprot.org/pub/databases/uniprot/current_release/', file_resources=None, species_id='9606', remove_version_num=True, index_col='UniProtKB-AC', keys=None, col_rename={'Ensembl': 'gene_id', 'Ensembl_PRO': 'protein_embl_id', 'Ensembl_TRS': 'transcript_id', 'GN': 'gene_name', 'GO': 'go_id', 'GeneID (EntrezGene)': 'entrezgene_id', 'NCBI-taxon': 'species_id', 'OS': 'species', 'OX': 'species_id', 'PE': 'ProteinExistence', 'SV': 'version', 'UniProtKB-AC': 'protein_id', 'UniProtKB-ID': 'protein_name', 'accession': 'UniProtKB-AC', 'gene': 'gene_name', 'geneLocation': 'subcellular_location', 'keyword': 'keywords', 'name': 'protein_name'}, **kwargs)

Bases: openomics.database.sequence.SequenceDatabase
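
The col_rename argument in the signature above is a plain mapping from source field names to output column names. As a minimal sketch of what such a mapping does, assuming it is applied as a single-pass key rename (the helper rename_fields and the sample record are hypothetical and not part of openomics):

```python
# A subset of the default col_rename mapping shown in the class signature.
COL_RENAME = {
    "UniProtKB-ID": "protein_name",
    "GeneID (EntrezGene)": "entrezgene_id",
    "Ensembl": "gene_id",
    "GO": "go_id",
}


def rename_fields(record, col_rename):
    """Hypothetical helper: rename the keys of one record in a single pass.

    Keys that are absent from the mapping are kept unchanged.
    """
    return {col_rename.get(key, key): value for key, value in record.items()}


# Illustrative record; "RefSeq" has no entry in the mapping and is left as-is.
row = {
    "UniProtKB-ID": "P53_HUMAN",
    "GeneID (EntrezGene)": "7157",
    "Ensembl": "ENSG00000141510",
    "RefSeq": "NP_000537.3",
}
renamed = rename_fields(row, COL_RENAME)
print(sorted(renamed))
```

Passing a different dict as col_rename would presumably change the resulting column names in the same way; check the source before relying on this.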

Attributes Summary

Methods Summary

assign_transforms(idmapping)

Return type: Dict[str, Union[Series, Series]]

get_sequences(index[, omic, agg])

Returns a dictionary where keys are 'index' and values are sequence(s).

get_species_list(file_path)

load_dataframe(file_resources[, blocksize])

Parameters: file_resources

load_sequences(fasta_file[, index, keys, ...])

Returns a pandas DataFrame containing the fasta sequence entries.

load_uniprot_parquet(file_resources[, blocksize])

Return type: Union[DataFrame, DataFrame]

load_uniprot_xml(file_path[, keys, blocksize])

Return type: DataFrame

Attributes Documentation

COLUMNS_RENAME_DICT = {'Ensembl': 'gene_id', 'Ensembl_PRO': 'protein_embl_id', 'Ensembl_TRS': 'transcript_id', 'GN': 'gene_name', 'GO': 'go_id', 'GeneID (EntrezGene)': 'entrezgene_id', 'NCBI-taxon': 'species_id', 'OS': 'species', 'OX': 'species_id', 'PE': 'ProteinExistence', 'SV': 'version', 'UniProtKB-AC': 'protein_id', 'UniProtKB-ID': 'protein_name', 'accession': 'UniProtKB-AC', 'gene': 'gene_name', 'geneLocation': 'subcellular_location', 'keyword': 'keywords', 'name': 'protein_name'}

SPECIES_ID_NAME = {'10090': 'MOUSE', '10116': 'RAT', '226900': 'BACCR', '243273': 'MYCGE', '284812': 'SCHPO', '287': 'PSEAI', '3702': 'ARATH', '44689': 'DICDI', '4577': 'MAIZE', '559292': 'YEAST', '6239': 'CAEEL', '7227': 'DROME', '7955': 'DANRE', '83333': 'ECOLI', '9606': 'HUMAN', '9823': 'PIG', '99287': 'SALTY'}

SPECIES_ID_TAXONOMIC = {'ARATH': 'plants', 'BACCR': 'bacteria', 'CAEEL': 'vertebrates', 'DANRE': 'vertebrates', 'DICDI': 'bacteria', 'DROME': 'invertebrates', 'ECOLI': 'bacteria', 'HUMAN': 'human', 'MAIZE': 'plants', 'MOUSE': 'rodents', 'MYCGE': 'bacteria', 'PIG': 'mammals', 'PSEAI': 'bacteria', 'RAT': 'rodents', 'SALTY': 'bacteria', 'SCHPO': 'fungi', 'YEAST': 'fungi'}
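
The two species dictionaries above chain together: SPECIES_ID_NAME maps an NCBI taxonomy ID to a UniProt organism mnemonic, and SPECIES_ID_TAXONOMIC maps that mnemonic to a taxonomic division. The sketch below illustrates the two-step lookup with a few entries copied from the attributes. The helper species_division is hypothetical, and how the class uses these dictionaries internally is not stated here.

```python
# A few entries copied from the class attributes documented above.
SPECIES_ID_NAME = {"9606": "HUMAN", "10090": "MOUSE", "559292": "YEAST"}
SPECIES_ID_TAXONOMIC = {"HUMAN": "human", "MOUSE": "rodents", "YEAST": "fungi"}


def species_division(species_id):
    """Hypothetical helper: taxonomy ID -> (mnemonic, taxonomic division)."""
    # The dictionary keys are strings, so integer IDs are normalised first.
    mnemonic = SPECIES_ID_NAME[str(species_id)]
    return mnemonic, SPECIES_ID_TAXONOMIC[mnemonic]


print(species_division("9606"))  # ('HUMAN', 'human')
print(species_division(10090))   # ('MOUSE', 'rodents')
```

A species_id that is missing from SPECIES_ID_NAME raises KeyError in this sketch.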

Methods Documentation

assign_transforms(idmapping)

Return type

Dict[str, Union[Series, Series]]

get_sequences(index, omic=None, agg='first', **kwargs)

Returns a dictionary where keys are ‘index’ and values are sequence(s).

Parameters
  • index (str) – {“gene_id”, “gene_name”, “transcript_id”, “transcript_name”}

  • omic (str) – {“lncRNA”, “microRNA”, “messengerRNA”}

  • agg (str) – {“all”, “shortest”, “longest”}; the signature default is “first”

  • **kwargs – any additional arguments to pass to SequenceDatabase.get_sequences()
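
The agg option decides what happens when one index value has several sequences. The sketch below shows one plausible reading of the option names, assuming "first" keeps the first sequence encountered, "shortest" and "longest" select by length, and "all" keeps every sequence. The function aggregate_sequences is hypothetical and is not the openomics implementation.

```python
def aggregate_sequences(sequences, agg="first"):
    """Hypothetical helper: reduce several sequences for one key to one value."""
    if agg == "all":
        return list(sequences)
    if agg == "first":
        return sequences[0]
    if agg == "shortest":
        return min(sequences, key=len)
    if agg == "longest":
        return max(sequences, key=len)
    raise ValueError(f"unsupported agg: {agg!r}")


# Illustrative sequences sharing one gene-level key.
isoforms = ["MEEPQSDPSV", "MEEPQ", "MEEPQSDPSVEPPLSQ"]

print(aggregate_sequences(isoforms, "first"))     # MEEPQSDPSV
print(aggregate_sequences(isoforms, "shortest"))  # MEEPQ
print(aggregate_sequences(isoforms, "longest"))   # MEEPQSDPSVEPPLSQ
```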

classmethod get_species_list(file_path)

load_dataframe(file_resources, blocksize=None)

Parameters
  • file_resources

  • blocksize

load_sequences(fasta_file, index=None, keys=None, blocksize=None)

Returns a pandas DataFrame containing the fasta sequence entries, with a column named ‘sequence’.

Parameters
  • index –

  • fasta_file (str) – path to the fasta file, usually as self.file_resources[<file_name>]

  • keys (pd.Index) – list of keys to

  • blocksize

Return type

OrderedDict
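
UniProtKB FASTA headers carry two-letter tagged fields (OS, OX, GN, PE, SV), which are the same keys that appear in the default col_rename mapping. The sketch below shows one way to turn such a file into records with a 'sequence' field, using only the standard library. The functions parse_uniprot_header and read_fasta are hypothetical, the sample sequence is truncated for illustration, and none of this is the parser that load_sequences itself uses.

```python
import re

# >db|accession|entry_name description OS=... OX=... GN=... PE=... SV=...
HEADER_RE = re.compile(
    r"^>(?P<db>\w+)\|(?P<accession>[^|]+)\|(?P<name>\S+)\s+"
    r"(?P<description>.*?)(?=\s+[A-Z]{2}=|$)"
)
# Two-letter tags; each value runs up to the next tag or the end of the line.
FIELD_RE = re.compile(r"\b([A-Z]{2})=(.*?)(?=\s+[A-Z]{2}=|$)")


def parse_uniprot_header(header):
    """Hypothetical helper: split one UniProtKB FASTA header into a dict."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    record = match.groupdict()
    record.update(dict(FIELD_RE.findall(header)))
    return record


def read_fasta(text):
    """Hypothetical helper: return one dict per entry, including 'sequence'."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = parse_uniprot_header(line)
            current["sequence"] = ""
            records.append(current)
        elif line and current is not None:
            # Sequence lines are concatenated until the next header.
            current["sequence"] += line
    return records


fasta_text = (
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
    "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4\n"
    "MEEPQSDPSV\n"
    "EPPLSQETFS\n"
)
records = read_fasta(fasta_text)
print(records[0]["accession"], records[0]["GN"], records[0]["sequence"])
```

Records in this shape can be passed to pandas.DataFrame(records) to get a table with a 'sequence' column, which matches the description above.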

load_uniprot_parquet(file_resources, blocksize=None)

Return type

Union[DataFrame, DataFrame]

load_uniprot_xml(file_path, keys=None, blocksize=None)

Return type

DataFrame