UniProt
- class openomics.database.sequence.UniProt(path='https://ftp.uniprot.org/pub/databases/uniprot/current_release/', file_resources=None, species_id='9606', remove_version_num=True, index_col='UniProtKB-AC', keys=None, col_rename={'Ensembl': 'gene_id', 'Ensembl_PRO': 'protein_embl_id', 'Ensembl_TRS': 'transcript_id', 'GN': 'gene_name', 'GO': 'go_id', 'GeneID (EntrezGene)': 'entrezgene_id', 'NCBI-taxon': 'species_id', 'OS': 'species', 'OX': 'species_id', 'PE': 'ProteinExistence', 'SV': 'version', 'UniProtKB-AC': 'protein_id', 'UniProtKB-ID': 'protein_name', 'accession': 'UniProtKB-AC', 'gene': 'gene_name', 'geneLocation': 'subcellular_location', 'keyword': 'keywords', 'name': 'protein_name'}, **kwargs)
Bases: openomics.database.sequence.SequenceDatabase
Attributes Summary
Methods Summary
- assign_transforms(idmapping): return type Dict[str, Union[Series, Series]]
- get_sequences(index[, omic, agg]): Returns a dictionary where keys are 'index' and values are sequence(s).
- get_species_list(file_path)
- load_dataframe(file_resources[, blocksize])
- load_sequences(fasta_file[, index, keys, ...]): Returns a pandas DataFrame containing the fasta sequence entries.
- load_uniprot_parquet(file_resources[, blocksize]): return type Union[DataFrame, DataFrame]
- load_uniprot_xml(file_path[, keys, blocksize]): return type DataFrame
Attributes Documentation
- COLUMNS_RENAME_DICT = {'Ensembl': 'gene_id', 'Ensembl_PRO': 'protein_embl_id', 'Ensembl_TRS': 'transcript_id', 'GN': 'gene_name', 'GO': 'go_id', 'GeneID (EntrezGene)': 'entrezgene_id', 'NCBI-taxon': 'species_id', 'OS': 'species', 'OX': 'species_id', 'PE': 'ProteinExistence', 'SV': 'version', 'UniProtKB-AC': 'protein_id', 'UniProtKB-ID': 'protein_name', 'accession': 'UniProtKB-AC', 'gene': 'gene_name', 'geneLocation': 'subcellular_location', 'keyword': 'keywords', 'name': 'protein_name'}
- SPECIES_ID_NAME = {'10090': 'MOUSE', '10116': 'RAT', '226900': 'BACCR', '243273': 'MYCGE', '284812': 'SCHPO', '287': 'PSEAI', '3702': 'ARATH', '44689': 'DICDI', '4577': 'MAIZE', '559292': 'YEAST', '6239': 'CAEEL', '7227': 'DROME', '7955': 'DANRE', '83333': 'ECOLI', '9606': 'HUMAN', '9823': 'PIG', '99287': 'SALTY'}
- SPECIES_ID_TAXONOMIC = {'ARATH': 'plants', 'BACCR': 'bacteria', 'CAEEL': 'vertebrates', 'DANRE': 'vertebrates', 'DICDI': 'bacteria', 'DROME': 'invertebrates', 'ECOLI': 'bacteria', 'HUMAN': 'human', 'MAIZE': 'plants', 'MOUSE': 'rodents', 'MYCGE': 'bacteria', 'PIG': 'mammals', 'PSEAI': 'bacteria', 'RAT': 'rodents', 'SALTY': 'bacteria', 'SCHPO': 'fungi', 'YEAST': 'fungi'}
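The two species attributes can be read as a two-step lookup: an NCBI taxonomy ID (as a string) maps to a UniProt organism mnemonic, which in turn maps to a taxonomic division. The sketch below copies a small subset of the values listed above. The helper `describe_species` is hypothetical and is not part of openomics; how the class itself uses these dictionaries is not stated here.

```python
# Subsets of the two class attributes, copied from the values listed above.
SPECIES_ID_NAME = {"9606": "HUMAN", "10090": "MOUSE", "559292": "YEAST", "83333": "ECOLI"}
SPECIES_ID_TAXONOMIC = {"HUMAN": "human", "MOUSE": "rodents", "YEAST": "fungi", "ECOLI": "bacteria"}


def describe_species(species_id):
    """Resolve an NCBI taxonomy ID to (mnemonic, taxonomic division).

    Hypothetical helper for illustration; it is not part of openomics.
    """
    mnemonic = SPECIES_ID_NAME[str(species_id)]  # keys are strings, not ints
    return mnemonic, SPECIES_ID_TAXONOMIC[mnemonic]


print(describe_species("9606"))  # ('HUMAN', 'human')
print(describe_species(10090))   # ('MOUSE', 'rodents')
```

Note that the dictionary keys are strings, which matches the default `species_id='9606'` in the class signature.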
Methods Documentation
- get_sequences(index, omic=None, agg='first', **kwargs)
Returns a dictionary where keys are ‘index’ and values are sequence(s).
- Parameters
index (str) – {“gene_id”, “gene_name”, “transcript_id”, “transcript_name”}
omic (str) – {“lncRNA”, “microRNA”, “messengerRNA”}
agg (str) – {“all”, “shortest”, “longest”}
**kwargs – any additional arguments to pass to SequenceDatabase.get_sequences()
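To make the `agg` parameter concrete, here is a minimal sketch of how several sequences sharing one index key could be reduced to a single value. The meaning of each option is assumed from its name (including the default `'first'` from the signature) and is not taken from the openomics source. `aggregate_sequences` is a hypothetical helper, and the sequences are made-up toy data.

```python
from collections import OrderedDict


def aggregate_sequences(pairs, agg="first"):
    """Group (key, sequence) pairs by key and reduce each group.

    Hypothetical helper; the semantics of each `agg` value are assumed
    from the option names, not taken from the openomics source.
    """
    groups = OrderedDict()
    for key, seq in pairs:
        groups.setdefault(key, []).append(seq)

    reducers = {
        "all": list,                                  # keep every sequence
        "first": lambda seqs: seqs[0],                # first one encountered
        "shortest": lambda seqs: min(seqs, key=len),  # fewest residues
        "longest": lambda seqs: max(seqs, key=len),   # most residues
    }
    if agg not in reducers:
        raise ValueError(f"unknown agg: {agg!r}")
    return {key: reducers[agg](seqs) for key, seqs in groups.items()}


# Toy data: two sequences for geneA, one for geneB.
pairs = [("geneA", "MKTAYIAK"), ("geneA", "MKT"), ("geneB", "MSLQ")]
print(aggregate_sequences(pairs, agg="shortest"))
print(aggregate_sequences(pairs, agg="all"))
```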
- load_dataframe(file_resources, blocksize=None)
- Parameters
file_resources –
blocksize –
- load_sequences(fasta_file, index=None, keys=None, blocksize=None)
Returns a pandas DataFrame containing the FASTA sequence entries, with a column named ‘sequence’.
- Parameters
index –
fasta_file (str) – path to the fasta file, usually as self.file_resources[<file_name>]
keys (pd.Index) – list of keys to
blocksize –
- Return type
OrderedDict
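For orientation, the sketch below parses UniProtKB-style FASTA text into records that each carry a ‘sequence’ field, using only the standard library. It is not the openomics implementation: `parse_header` and `read_fasta` are hypothetical helpers, the entry is made up, and the result is a plain list of dicts rather than the DataFrame described above. The header tags `OS`, `OX`, `GN`, `PE` and `SV` are the same keys that appear in COLUMNS_RENAME_DICT.

```python
import re

# Tags in a UniProtKB FASTA header; the same keys appear in COLUMNS_RENAME_DICT.
TAG_PATTERN = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_header(header):
    """Split a UniProtKB FASTA header line into a flat dict of fields."""
    ids, _, rest = header.lstrip(">").partition(" ")
    db, accession, entry_name = ids.split("|")
    record = {"db": db, "UniProtKB-AC": accession, "UniProtKB-ID": entry_name}
    # re.split with one capture group yields [text, tag, value, tag, value, ...]
    parts = TAG_PATTERN.split(rest)
    record["name"] = parts[0].strip()
    for tag, value in zip(parts[1::2], parts[2::2]):
        record[tag] = value.strip()
    return record


def read_fasta(lines):
    """Collect one record per '>' header, joining wrapped sequence lines."""
    records, current = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = parse_header(line)
            current["sequence"] = ""
            records.append(current)
        elif current is not None:
            current["sequence"] += line
    return records


# Made-up entry for illustration; not a real UniProt record.
example = [
    ">sp|X00001|DEMO_HUMAN Demo protein OS=Homo sapiens OX=9606 GN=DEMO PE=1 SV=1",
    "MKTAY",
    "IAKQR",
]
records = read_fasta(example)
print(records[0]["UniProtKB-AC"], records[0]["GN"], records[0]["sequence"])
```

Keeping the raw tag names as keys means a rename step like the one COLUMNS_RENAME_DICT describes can be applied afterwards.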