read_gtf

openomics.utils.read_gtf(filepath_or_buffer, npartitions=None, compression=None, expand_attribute_column=True, infer_biotype_column=False, column_converters={}, usecols=None, features=None, chunksize=1048576)

Parse a GTF into a dictionary mapping column names to sequences of values.

Parameters
  • filepath_or_buffer (str or buffer object) – Path to GTF file (may be gzip compressed) or buffer object such as StringIO

  • npartitions (int) – Number of partitions for the dask dataframe. Default None.

  • compression (str) – Compression type to be passed into dask.dataframe.read_table(). Default None.

  • expand_attribute_column (bool) – Replace the strings of semicolon-separated key-value pairs in the ‘attribute’ column with one column per distinct key, holding one value per row (None for rows where the key didn’t occur).

  • infer_biotype_column (bool) – The meaning of the second GTF column is ambiguous across Ensembl releases. If True, infer whether an older GTF’s source column actually holds the gene_biotype or transcript_biotype.

  • column_converters (dict, optional) – Dictionary mapping column names to conversion functions. Empty strings are replaced with None; all other values are passed to the given conversion function.

  • usecols (list of str or None) – Restrict which columns are loaded to the given set. If None, load all columns.

  • features (set of str or None) – Drop rows whose feature is not in the supplied set. If None, keep all rows.

  • chunksize (int) – Default 1048576.
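
The behaviour described for expand_attribute_column, features and column_converters can be illustrated with a small standard-library sketch. This is not the openomics implementation: read_gtf_sketch, parse_attributes, GTF_COLUMNS and the sample records are hypothetical names and made-up data used only for illustration. The sketch also simplifies by splitting attributes on every semicolon and keeping only the last value when a key repeats.

```python
import io

# The nine fixed GTF columns, in file order.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

# Made-up sample records (tab-separated), for illustration only.
SAMPLE = (
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "ABC";\n'
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    'chr1\tsrc\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1";\n'
)


def parse_attributes(attribute_string):
    """Split 'key "value"; key "value";' into a dict of key -> value."""
    pairs = {}
    for field in attribute_string.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition(" ")
        pairs[key] = value.strip().strip('"')
    return pairs


def read_gtf_sketch(buffer, features=None, column_converters=None):
    """Hypothetical parser returning a dict of column name -> list of values."""
    column_converters = column_converters or {}
    rows = []
    for line in buffer:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        row = dict(zip(GTF_COLUMNS, line.split("\t")))
        # features: drop rows whose feature is not in the supplied set.
        if features is not None and row["feature"] not in features:
            continue
        # expand_attribute_column: replace 'attribute' with one entry per key.
        row.update(parse_attributes(row.pop("attribute")))
        rows.append(row)

    # Collect column names in first-seen order; fill missing keys with None.
    column_names = []
    for row in rows:
        for name in row:
            if name not in column_names:
                column_names.append(name)
    columns = {name: [row.get(name) for row in rows] for name in column_names}

    # column_converters: empty strings become None, other values are converted.
    for name, convert in column_converters.items():
        if name in columns:
            columns[name] = [None if value in ("", None) else convert(value)
                             for value in columns[name]]
    return columns


table = read_gtf_sketch(
    io.StringIO(SAMPLE),
    features={"gene", "transcript"},
    column_converters={"start": int, "end": int},
)
print(table["feature"])
print(table["transcript_id"])
```

In this sketch the exon row is dropped by the features filter, the transcript_id column holds None for the gene row, and start and end come back as integers.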