MultiFASTA file processing

Question

MultiFASTA file processing

I was interested to know if there is any bioinformatics tool capable of processing a multiFASTA file giving me information such as the number of sequences, length, nucleotide / amino acid content, etc. and possibly automatically draw descriptive graphics. An R BIoconductor or BioPerl module would also be executed, but I could not find anything.

Can you help me? Many thanks: -)

+7

bioinformatics bioconductor biopython fasta bioperl

Federico giorgi Nov 24 '09 at 10:55

source share

3 answers

You may also be interested in faSize, which is a tool from Kent's Source Tree , although it takes a bit more effort (you have to dload and compile it) than just using grep ... here is an example output:

 me@my-lab ~/data $ time faSize myfile.fna 215400419 bases (104761 N 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307 N count: mean 0.1 sd 0.4 U count: mean 294.3 sd 138.5 L count: mean 0.0 sd 0.0 %0.00 masked total, %0.00 masked real real 0m3.710s user 0m3.541s sys 0m0.164s

+2

brant.faircloth Jan 16 '10 at 17:42

source share

It should be noted (for those who stumbled upon this, as I did) that there is a robust python library specifically designed to handle these tasks called Biopython . In a few lines of code, you can quickly get answers to all of the above questions. Here are some very simple examples, mostly adapted by reference. The tutorial also presents GC% graphs and sequence length tables.

 In [1]: from Bio import SeqIO In [2]: allSeqs = [seq_record for seq_record in SeqIO.parse('/home/kevin/stack/ls_orchid.fasta', """fasta""")] In [3]: allSeqs[0] Out[3]: SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[]) In [4]: len(allSeqs) #number of unique sequences in the file Out[4]: 94 In [5]: len(allSeqs[0].seq) # call len() on each SeqRecord.seq object Out[5]: 740 In [6]: A_count = allSeqs[0].seq.count('A') C_count = allSeqs[0].seq.count('C') G_count = allSeqs[0].seq.count('G') T_count = allSeqs[0].seq.count('T') print A_count # number of A's 144 In [7]: allSeqs[0].seq.count("AUG") # or count how many start codons Out[7]: 0 In [8]: allSeqs[0].seq.translate() # translate DNA -> Amino Acid Out[8]: Seq('RNKVSVGEPAEGSLMRPWNKRSSESGGPVYSAHRGHCSRGDPDLLLGRLGSVHG...*VY', HasStopCodon(ExtendedIUPACProtein(), '*'))

0

Kevin Jan 27 '16 at 19:30

source share

Yannick wurm · Accepted Answer · 2009-11-24T14:53:14+0000

Some embossing tools are a collection of small tools that can help you.

seqstats returns the length of the sequence
pepstats should give you amino acid content, etc. Some of the tools also offer charting functions. Very comfortably. http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html

To count the number of fasta entries, I use: grep -c '^>' mySequences.fasta .

MultiFASTA file processing

More articles: