read_gtf

openomics.utils.read_gtf(filepath_or_buffer, npartitions=None, compression=None, expand_attribute_column=True, infer_biotype_column=False, column_converters={}, usecols=None, features=None, chunksize=1048576)

Parse a GTF into a dictionary mapping column names to sequences of values.

Parameters

filepath_or_buffer (str or buffer object) – Path to GTF file (may be gzip compressed) or buffer object such as StringIO
npartitions (int) – Number of partitions for the dask dataframe. Default None.
compression (str) – Compression type to be passed into dask.dataframe.read_table(). Default None.
expand_attribute_column (bool) – Replace strings of semicolon-separated key-value pairs in the ‘attribute’ column with one column per distinct key, with a list of values for each row (using None for rows where the key didn’t occur).
infer_biotype_column (bool) – Due to the annoying ambiguity of the second GTF column across multiple Ensembl releases, figure out if an older GTF’s source column is actually the gene_biotype or transcript_biotype.
column_converters (dict, optional) – Dictionary mapping column names to conversion functions. Empty strings are replaced with None; all other values are passed to the given conversion function.
usecols (list of str or None) – Restrict which columns are loaded to the given set. If None, then load all columns.
features (set of str or None) – Drop rows whose feature is not in the supplied set.
chunksize (int) – Default 1048576.
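The behaviour described for expand_attribute_column, features and column_converters can be illustrated with a small standalone sketch. This is not the openomics implementation: sketch_read_gtf and parse_attributes are hypothetical helpers, the nine column names follow the usual GTF layout, the sample records are made up, and each attribute key is assumed to occur at most once per row (so a single string is stored rather than a list of values).

```python
import io

# Conventional names for the nine tab-separated GTF columns
# (an assumption based on the GTF format, not read from openomics).
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_attributes(attribute):
    """Hypothetical helper: turn 'key "value"; key "value";' into a dict."""
    pairs = {}
    for field in attribute.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the trailing semicolon
        key, _, value = field.partition(" ")
        pairs[key] = value.strip().strip('"')
    return pairs


def sketch_read_gtf(buffer, features=None, column_converters=None):
    """Illustrative stand-in for read_gtf; returns {column name: list of values}."""
    column_converters = column_converters or {}
    rows = []
    for line in buffer:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        row = dict(zip(GTF_COLUMNS, line.split("\t")))
        # features: drop rows whose feature is not in the supplied set
        if features is not None and row["feature"] not in features:
            continue
        rows.append(row)

    # expand_attribute_column: one column per distinct key, None where absent
    parsed = [parse_attributes(row.pop("attribute", "")) for row in rows]
    keys = []
    for attrs in parsed:
        for key in attrs:
            if key not in keys:
                keys.append(key)
    columns = {name: [row[name] for row in rows] for name in GTF_COLUMNS[:-1]}
    for key in keys:
        columns[key] = [attrs.get(key) for attrs in parsed]

    # column_converters: empty strings (and missing values) become None,
    # everything else goes through the conversion function
    for name, convert in column_converters.items():
        columns[name] = [None if value in ("", None) else convert(value)
                         for value in columns[name]]
    return columns


# Made-up records, for illustration only.
GTF_TEXT = (
    '1\tdemo\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "demo";\n'
    '1\tdemo\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    '1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
)

result = sketch_read_gtf(
    io.StringIO(GTF_TEXT),
    features={"gene", "transcript"},
    column_converters={"start": int, "end": int},
)
```

Here the exon row is dropped by the features filter, start and end are converted to integers, and the gene_name and transcript_id columns hold None for the rows where those keys do not appear.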