Scikit-bio: extract genomic features from a GFF3 file

Question


Is it possible in scikit-bio to extract the sequences of genomic features stored in a GFF3 file from a genome FASTA file?

Example:

genome.fasta

    >sequence1
    ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA

annotation.gff3

    ##gff-version 3
    sequence1	source	gene	1	78	.	+	.	ID=gene1
    sequence1	source	mRNA	1	78	.	+	.	ID=transcript1;parent=gene1
    sequence1	source	CDS	1	6	.	+	0	ID=CDS1;parent=transcript1
    sequence1	source	CDS	73	78	.	+	0	ID=CDS2;parent=transcript1

The desired sequence for the mRNA feature (transcript1) would be the concatenation of its two child CDS features. So in this case it would be 'ATGGAGCTATGA'.
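The coordinate arithmetic behind that expected result can be checked with a few lines of plain Python. This is only a sketch: `gff_slice` is a hypothetical helper, not part of scikit-bio, and it assumes the usual GFF3 convention of 1-based coordinates with an inclusive end.

```python
# Sequence and CDS coordinates copied from the example files above.
genome = "ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA"
cds_coords = [(1, 6), (73, 78)]  # GFF3 style: 1-based, end inclusive


def gff_slice(seq, start, end):
    """Hypothetical helper: return seq[start..end] for 1-based, inclusive coordinates."""
    # Python slices are 0-based and end-exclusive, so only the start shifts by one.
    return seq[start - 1:end]


# Concatenate the CDS pieces in the order they are listed.
transcript = "".join(gff_slice(genome, start, end) for start, end in cds_coords)
print(transcript)
```

If the assumption about coordinates holds, this should print the sequence given in the question.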


python python-3.x bioinformatics skbio

holmrenser Jul 11 '16 at 7:59


1 answer

m00am · Answer 1 · 2017-12-15T16:02:15+0000

This feature has been added to scikit-bio; however, the currently packaged release does not support it yet (as of 2017-12-15). The format file for GFF3 is present in the GitHub repository.

You can clone the repo and install it locally using:

    $ git clone https://github.com/biocore/scikit-bio.git
    $ cd scikit-bio
    $ python setup.py install

Based on the example in that format file, the following code should work:

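    import io

    from skbio.io import read
    from skbio.metadata import IntervalMetadata

    # Load the annotation file from the question into an in-memory text buffer.
    with open("annotation.gff3") as fh:
        gff = io.StringIO(fh.read())

    # Parse only the features annotated on sequence1.
    im = read(gff, format='gff3', into=IntervalMetadata, seq_id="sequence1")
    print(im)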

For me, this raises a FormatIdentificationWarning, but the records are reported correctly:

    4 interval features
    -------------------
    Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'gene', 'score': '.', 'strand': '+', 'ID': 'gene1'})
    Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'mRNA', 'score': '.', 'strand': '+', 'ID': 'transcript1', 'parent': 'gene1'})
    Interval(interval_metadata=<140154121000104>, bounds=[(0, 6)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS1', 'parent': 'transcript1'})
    Interval(interval_metadata=<140154121000104>, bounds=[(72, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS2', 'parent': 'transcript1'})

In the example in the GFF3 format code, the FASTA sequence is appended to the input string passed to the read function. Perhaps that can solve this problem. Also, I'm not 100% sure how the returned intervals can be used to extract the feature sequence.
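One possible way to get from intervals like these to the spliced sequence is sketched below in plain Python, without scikit-bio. The `records` list is copied by hand from the printed output (trimmed to the keys that are used), and `spliced_cds` is a hypothetical helper. With real scikit-bio objects the same values should be readable from each interval's `bounds` and `metadata` attributes, but that part is an assumption and is not tested here. The bounds are treated as 0-based and end-exclusive, which is how they appear in the output, and the lowercase `parent` key matches the example file rather than the `Parent` spelling used in the GFF3 specification.

```python
from collections import defaultdict

# Genome sequence from the question.
genome = "ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA"

# (bounds, metadata) pairs copied from the printed intervals.
records = [
    ([(0, 78)], {"type": "gene", "strand": "+", "ID": "gene1"}),
    ([(0, 78)], {"type": "mRNA", "strand": "+", "ID": "transcript1", "parent": "gene1"}),
    ([(0, 6)], {"type": "CDS", "strand": "+", "ID": "CDS1", "parent": "transcript1"}),
    ([(72, 78)], {"type": "CDS", "strand": "+", "ID": "CDS2", "parent": "transcript1"}),
]

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def spliced_cds(seq, records):
    """Hypothetical helper: map each parent ID to its concatenated CDS sequence."""
    by_parent = defaultdict(list)
    for bounds, meta in records:
        if meta["type"] != "CDS":
            continue
        for start, end in bounds:
            by_parent[meta["parent"]].append((start, end, meta["strand"]))

    result = {}
    for parent, parts in by_parent.items():
        parts.sort()  # order the pieces by start coordinate
        joined = "".join(seq[start:end] for start, end, _ in parts)
        # Minus-strand features are reverse-complemented after joining.
        if parts[0][2] == "-":
            joined = joined.translate(COMPLEMENT)[::-1]
        result[parent] = joined
    return result


print(spliced_cds(genome, records)["transcript1"])
```

Grouping by the parent ID keeps the sketch usable for files with more than one transcript; phase and mixed-strand parents are ignored for simplicity.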
