English 中文(简体)
multiFASTA file processing
原标题:

I was curious to know if there is any bioinformatics tool out there able to process a multiFASTA file giving me infos like number of sequences, length, nucleotide/aminoacid content, etc. and maybe automatically draw descriptive plots. Also an R BIoconductor solution or a BioPerl module would do, but I didn t manage to find anything.

Can you help me? Thanks a lot :-)

最佳回答

Some of the emboss tools are a collection of small tools that can help you out.

To count number of fasta entries, I use: grep -c ^> mySequences.fasta.

To make sure none of the entries are duplicate, I check that I get the same number when doing this: grep ^> mySequences.fasta | sort | uniq | wc -l

问题回答

You may also be interested in faSize, which is a tool from the Kent Source Tree, although this requires a bit more effort (you must dload and compile) than just using grep... here is some example output:

me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N s 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real

real    0m3.710s
user    0m3.541s
sys     0m0.164s

Screed in python is brilliant:

import screed

for record in screed.open(fastafilename):
    print(len(record.sequence))

It should be noted (for anyone stumbling upon this, like I just did) that there is a robust python library specifically designed to handle these tasks called Biopython. In a few lines of code, you can quickly access answers for all of the above questions. Here are some very basic examples, mostly adapted from the link. There are boiler-plate GC% graphs and sequence length graphs in the tutorial also.

In [1]: from Bio import SeqIO

In [2]: allSeqs = [seq_record for seq_record in SeqIO.parse( /home/kevin/stack/ls_orchid.fasta , """fasta""")]

In [3]: allSeqs[0]
Out[3]: SeqRecord(seq=Seq( CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC , SingleLetterAlphabet()), id= gi|2765658|emb|Z78533.1|CIZ78533 , name= gi|2765658|emb|Z78533.1|CIZ78533 , description= gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA , dbxrefs=[])

In [4]: len(allSeqs) #number of unique sequences in the file
Out[4]: 94

In [5]: len(allSeqs[0].seq) # call len() on each SeqRecord.seq object
Out[5]: 740

In [6]: A_count = allSeqs[0].seq.count( A )
        C_count = allSeqs[0].seq.count( C )
        G_count = allSeqs[0].seq.count( G )
        T_count = allSeqs[0].seq.count( T )

        ​print A_count # number of A s

        144

In [7]: allSeqs[0].seq.count("AUG") # or count how many start codons
Out[7]: 0

In [8]: allSeqs[0].seq.translate() # translate DNA -> Amino Acid
Out[8]: Seq( RNKVSVGEPAEGSLMRPWNKRSSESGGPVYSAHRGHCSRGDPDLLLGRLGSVHG...*VY , HasStopCodon(ExtendedIUPACProtein(),  * ))




相关问题
multiFASTA file processing

I was curious to know if there is any bioinformatics tool out there able to process a multiFASTA file giving me infos like number of sequences, length, nucleotide/aminoacid content, etc. and maybe ...

Customizing output of BLAST?

I know this is a very specific question relating to BLAST and Bioinformatics but here goes: I am attempting to use standalone BLAST (I already have downloaded it and tested it running on the command ...

Calculation of DNA sequences

Could you tell me how I can calculate the DNA sequences by Java using Levenshtein algorithm

R statistical package: wrapping GOFrame objects

I m trying to generate GOFrame objects to generate a gene ontology mapping in R for unsupported organisms (see http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/...

热门标签