MultiFASTA file processing

I was interested to know if there is any bioinformatics tool capable of processing a multiFASTA file giving me information such as the number of sequences, length, nucleotide / amino acid content, etc. and possibly automatically draw descriptive graphics. An R BIoconductor or BioPerl module would also be executed, but I could not find anything.

Can you help me? Many thanks: -)

+7
bioinformatics bioconductor biopython fasta bioperl
source share
3 answers

Some embossing tools are a collection of small tools that can help you.

To count the number of fasta entries, I use: grep -c '^>' mySequences.fasta .

To make sure that none of the entries are duplicated, I make sure that I get the same number: grep '^>' mySequences.fasta | sort | uniq | wc -l grep '^>' mySequences.fasta | sort | uniq | wc -l

+7
source share

You may also be interested in faSize, which is a tool from Kent's Source Tree , although it takes a bit more effort (you have to dload and compile it) than just using grep ... here is an example output:

 me@my-lab ~/data $ time faSize myfile.fna 215400419 bases (104761 N 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307 N count: mean 0.1 sd 0.4 U count: mean 294.3 sd 138.5 L count: mean 0.0 sd 0.0 %0.00 masked total, %0.00 masked real real 0m3.710s user 0m3.541s sys 0m0.164s 
+2
source share

It should be noted (for those who stumbled upon this, as I did) that there is a robust python library specifically designed to handle these tasks called Biopython . In a few lines of code, you can quickly get answers to all of the above questions. Here are some very simple examples, mostly adapted by reference. The tutorial also presents GC% graphs and sequence length tables.

 In [1]: from Bio import SeqIO In [2]: allSeqs = [seq_record for seq_record in SeqIO.parse('/home/kevin/stack/ls_orchid.fasta', """fasta""")] In [3]: allSeqs[0] Out[3]: SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[]) In [4]: len(allSeqs) #number of unique sequences in the file Out[4]: 94 In [5]: len(allSeqs[0].seq) # call len() on each SeqRecord.seq object Out[5]: 740 In [6]: A_count = allSeqs[0].seq.count('A') C_count = allSeqs[0].seq.count('C') G_count = allSeqs[0].seq.count('G') T_count = allSeqs[0].seq.count('T') ​print A_count # number of A's 144 In [7]: allSeqs[0].seq.count("AUG") # or count how many start codons Out[7]: 0 In [8]: allSeqs[0].seq.translate() # translate DNA -> Amino Acid Out[8]: Seq('RNKVSVGEPAEGSLMRPWNKRSSESGGPVYSAHRGHCSRGDPDLLLGRLGSVHG...*VY', HasStopCodon(ExtendedIUPACProtein(), '*')) 
0
source share

All Articles