Possibly useful programs for filtering sequences
DNA feature finders
Protein feature finders
Provides general predictions of which regions within
a DNA sequence encode protein. Cannot identify precise termini of protein-encoding
regions. Based mostly on codon bias, but does not require the user to provide
codon frequency table.
Ref: Fickett J (1982) Nucl Acids Res 10:5303-5318.
(w/donor GC) [http://www.softberry.com/berry.phtml?topic=gfind]
Good for eukaryotic genomic sequences. Uses hidden Markov
model (HMM), trained on humans, Drosophila, C. elegans, and plants (you
specify) to predict exons and introns. Assembles best guess at translated
sequence. Easy to use.
Ref: Salamov A.A., Solovyev V.V. (1999), unpublished
data. "Please, reference: Computational Genomics Group (Sanger Centre)
WEB server." Maybe Salamov AA, Solovyev VV (2000) Genome Res 10:516-22.
Good for eukaryotic cDNA
sequences. Uses Markov analysis, trained on humans and plants (you specify)
to predict translated sequence. Accuracy nearly 100%, thus depends on the
quality of the EST or cDNA sequence you provide.
Ref: Solovyev V.V., Salamov A.A., unpublished data.
Good for eukaryotic genomic sequences. I've never used
it and it is included here only for completeness because it is a very popular
Good for prokaryotic sequences or eukaryotic sequences
Uses Markov models based on (your choice) training on specific genomes
(many!) or on general features of translational start sites.
Ref: Lukashin A, Borodovsky M (1998) Nucl Acids Res 26:1107-1115;
Besemer J, Lomsadze A, Borodovsky M (2001) Nucl Acids Res 29:2607-2618.
Simple-minded program that can be used on any sequence.
Finds every open reading frame beginning with a start codon and ending
with a stop codon. For eukaryotic sequences, specify the -ALL command line
parameter to force display of all stop codons (otherwise only the first
after a start codon is shown). Still, the program is of limited utility
with eukaryotic sequences.
Displays translation of any sequence in one or more of
the six possible reading frames. Will show stop codons as "*" but go right
on translating. The program is really designed to list restriction sites,
but if you just want the translation, give NotI (for AT-rich sequences)
or SwaI (for GC-rich sequences) as the enzyme to be mapped. There
will probably be no sites and thus no clutter.
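If all you want is the six-frame translation itself, it is easy to sketch in a few lines of Python. This is only an illustration of the idea, not the program described above: the function names are made up for this example, the standard genetic code is assumed, and any codon containing an ambiguous base is shown as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; stops appear as '*' and translation continues."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAAGGG").items():
        print(label, protein)
```

As in the program above, stop codons are printed as "*" and translation goes right on past them.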
Orf Finder [http://www.ncbi.nlm.nih.gov/gorf/]
Like Frames (above) but with a much nicer user interface. Finds every
open reading frame within a (max) 30 kb segment of DNA, without
regard to its source. Scans through all six possible reading frames looking
for long stretches between start codons and stop codons. Clicking on an
ORF gets you to the predicted amino acid sequence and to BLAST. May have
problems if the gene does not begin with ATG.
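The six-frame scan for open reading frames can be sketched as follows. This is a minimal illustration, not the algorithm used by any of the programs listed here: the function name, the `min_codons` parameter, and the coordinate convention are assumptions made for the example. It recognizes only ATG as a start codon, which is exactly the limitation noted above.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=2):
    """Find ATG...stop open reading frames in all six frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive (the stop codon is included), and refer to the strand
    that was scanned: for "-" they index the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # remember the first in-frame ATG
                elif start is not None and codon in STOPS:
                    # count codons before the stop, start codon included
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGCC"))
```

Raising `min_codons` is the simplest way to keep only the "long stretches" that are likely to be real genes.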
(Basic local alignment search tool) [http://www.ncbi.nlm.nih.gov/BLAST/]
Compares DNA or protein sequences against nonredundant or other databases.
It comes in different flavors (see below). Graphical output allows you
to see at a glance the region of the query that is similar for each
hit -- very valuable information.
Ref: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang
Z, Miller W, Lipman DJ (1997) Nucl Acids Res 25:3389-3402.
How to confine searches to specific databases
or specific organisms
This is often a good idea in order to get more accurate E values or
because you're interested only in hits within a specific class of sequences.
Or you may want to search only cDNA, in order to figure out exon/intron
structure. There are many ways to get the job done:
BlastP: Compares your protein
sequence against a database of protein sequences. You'll
probably use this most often, or derivatives of it (Psi-BLAST or Phi-BLAST).
Also provides results from a CD (conserved domain)-Search of Pfam motifs,
which is often worth the price of admission by itself.
BlastN: Compares your DNA sequence against
a database of DNA sequences. Use this if you're interested
in finding conserved DNA features, such as transposon ends or regulatory
sites. If you want to find what gene may be contained within your DNA sequence,
BlastX: Translates your DNA sequence in all
six possible reading frames and compares the resulting amino acid sequences
against a database of protein sequences.
Blast 2 sequences: Compares your DNA or protein sequence
against another of your DNA or protein sequences. This is a quick way of
getting an alignment of two sequences.
DNA Feature Finders
Within BLAST (any flavor), scroll down to "Options for advanced blasting".
On the line beginning "Limit by…" select the desired class of organisms
from the pull down menu at the right. Or (same line) specify a key word
(like the name of the species) in the box to the left. This latter choice
provides a very leaky filter.
To search only ESTs (expressed sequence tags; cDNAs) within BLASTN, click
arrow by the box next to "Choose database" and select "est_human" (or whatever
database is appropriate).
Find an organism-specific web site. There are many lists of such sites.
Here are some (please pardon my ignorance of eukaryotes):
PROTEIN Feature Finders
(Transcriptional Start Site - Wingender) [http://www.softberry.com/berry.phtml?topic=promoter]
Predicts eukaryotic transcriptional start sites (called
in output "promoters") for RNA Polymerase II using Wingender's database
of conserved elements. Claims that it will recognize about 50% of true
promoters and give a false positive once every 4000 bp.
Ref: Solovyev V.V., Salamov A.A., Lawrence C.B. (unpublished).
Predicts prokaryotic promoter sites. Claims that it will
recognize about 80% of true E. coli sigma-70 promoters. Advises
that program be used only on intergenic regions. Warning: no reference
given, and algorithm not explained!
Predicts eukaryotic polyA processing sites,
with program trained on collection of human sequences containing bona fide
AATAAA processing signal. Claims that (with an appropriate threshold value) it finds
86% of correct sites with 8% false positives.
Ref: Salamov AA, Solovyev VV (1997). Computer Applic
Repeats Finder [http://tandem.bu.edu/trf/trf.submit.options.html]
Finds tandem DNA repeats (repeated
sequences that occur one after another). Can work on huge chunks of DNA
at one time and find repeated sequences with a periodicity between 1 and
500 nucleotides (no need for user to specify size). Permits short gaps
Ref: Benson G (1999) Nucl Acids Res 27:573-580.
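To make the notion of a tandem repeat concrete, here is a deliberately naive sketch that finds only exact tandem repeats with a regular expression. It is not the algorithm of the program above, which tolerates mismatches and gaps and handles much longer periods. The function name and the `max_period` and `min_copies` parameters are invented for this example.

```python
import re


def exact_tandem_repeats(seq, max_period=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats in seq."""
    seq = seq.upper()
    hits = []
    for period in range(1, max_period + 1):
        # a unit of `period` bases followed by at least min_copies - 1 more copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # skip units that are themselves repeats of a shorter unit (e.g. "ATAT")
            if any(
                unit == unit[:k] * (period // k)
                for k in range(1, period)
                if period % k == 0
            ):
                continue
            hits.append((m.start(), unit, len(m.group(0)) // period))
    return hits


if __name__ == "__main__":
    print(exact_tandem_repeats("GGCACACACATT"))
```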
Finds repeats in DNA or protein sequences, where the
sequences occur within a distance specified by the user (so may be tandem
or separated by hundreds of bases). Works by simple comparisons, like a
dot plot. You need to specify the window size (the minimum length of the
repeats to be found) and the stringency (the number of allowed mismatches).
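The window and stringency idea can be sketched like this. It is a brute-force illustration under my own assumptions (the function and parameter names are hypothetical, and stringency is expressed as a maximum number of mismatches), not the actual program.

```python
def find_repeats(seq, window, max_mismatches, max_distance):
    """Compare every window with later windows starting within max_distance.

    Returns (i, j, mismatches) for each pair of window start positions whose
    windows differ at no more than max_mismatches positions.
    """
    seq = seq.upper()
    hits = []
    last = len(seq) - window  # last valid window start
    for i in range(last + 1):
        for j in range(i + 1, min(i + max_distance, last) + 1):
            mismatches = sum(
                a != b for a, b in zip(seq[i:i + window], seq[j:j + window])
            )
            if mismatches <= max_mismatches:
                hits.append((i, j, mismatches))
    return hits


if __name__ == "__main__":
    for hit in find_repeats("GATTACACCCGATTGCA", window=6,
                            max_mismatches=1, max_distance=20):
        print(hit)
```

Each reported pair corresponds to one dot on a dot plot. A small `max_distance` restricts the search to near-tandem repeats, while a large one also picks up copies separated by hundreds of bases.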
Predicts positions of prokaryotic transcriptional terminators
by comparison with set of known terminators.
Ref: Brendel V, Trifonov EN (1984) Nucl Acids Res 12:4411-4427.
TIGR Software Tools [http://www.tigr.org/softlab/]
Treasure trove of free, downloadable software to find transcriptional
terminators, splice sites, coding regions, ribosome binding sites, and
vector sequences (to be removed from new sequences).
Represents information content of aligned sequences as height. Tells
you much more than mere consensus sequences.
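The height at each position is the information content of that column of the alignment, measured in bits. A minimal sketch of the calculation for DNA follows, assuming the usual definition (maximum possible entropy minus the observed entropy of the column) and leaving out the small-sample correction that real logo programs usually apply. The function name is made up for this example.

```python
import math
from collections import Counter


def information_content(aligned, alphabet_size=4):
    """Bits of information per column of equal-length aligned sequences."""
    max_bits = math.log2(alphabet_size)  # 2 bits for DNA
    bits = []
    for column in zip(*aligned):
        total = len(column)
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in Counter(column).values()
        )
        bits.append(max_bits - entropy)
    return bits


if __name__ == "__main__":
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
    for position, value in enumerate(information_content(sites), start=1):
        print(position, round(value, 2))
```

A fully conserved position scores 2 bits, while a position where two bases are equally common scores 1 bit, a distinction that a plain consensus sequence hides.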
alignment surface method) [http://www.sbc.su.se/~miklos/DAS/]
Predicts transmembrane regions. Performs dot plots against
collection of archetypal transmembrane proteins.
Ref: Cserzo M, Wallin E, Simon I, von Heijne G, and Elofsson
A (1997). Prot Engineering 10:673-676
Predicts motifs within amino acid sequences by comparison
against a database of consensus sequences.
Ref: Bateman A, Birney E, Durbin R, Eddy SR, Howe KL,
Sonnhammer ELL (2000) Nucl Acids Res 28:263-266.
modular architecture research tool) [http://smart.embl-heidelberg.de/]
Predicts motifs within amino acid sequences by comparison
against a database of consensus sequences. Sounds like Pfam? SMART uses
a different database, biased towards regulatory motifs, that only partially
overlaps with that used by Pfam. Actually, SMART is a suite of separate
programs, allowing you to get different kinds of structural information.
Ref: Letunic I et al (2002) Nucl Acids Res 30:242-244.
BLAST searches automatically give you searches through Pfam and
SMART databases (they call it CD-Search; CD = Conserved Domain).
Very convenient, but for some reason you don't always get the same answers
as separate searches through Pfam and SMART.
Ref: Marchler-Bauer A., Panchenko AR, Shoemaker BA, Thiessen
PA, Geer LY, Bryant SH (2002) Nucl Acids Res 30:281-283.
Predicts signal peptide cleavage sites and to a lesser
extent membrane anchor peptides. Consists of two programs: SignalP-NN (neural
net), better for predicting cleavage sites; and SignalP-hmm (hidden Markov
models), better for discriminating between cleavage sites and uncleaved
anchor sequences. Trained on (your choice) eukaryotes, Gram-negative bacteria,
or Gram-positive bacteria.
Ref (Review): Nielsen H, Brunak S, von Heijne G (1999)
Protein Engineering 12:3-9.
Multiple sequence alignment. Uses a method derived from
Clustal (see below). I haven't determined how their results differ except
Multiple sequence alignment. First constructs phylogenetic
tree by pairwise comparisons, then aligns sequences with aid of tree. Sometimes
fails with sequences whose similarity is masked by large regions of dissimilarity.
For Windows users who want to download their own free copy of Clustal (ClustalX)
and get much prettier output than obtained from either PILEUP or ClustalW,
look at the second reference below.
Ref: (ClustalW) Thompson JD, Higgins DG, Gibson TJ (1994)
Nucl Acids Res 22:4673-4680
(ClustalX) Thompson JD et al (1997) Nucl Acids Res 25:4876-4882.
A variety of useful tools to characterize peptide sequences
(e.g. mass, protease cleavage products, etc) and to identify proteins
from those characteristics. It also provides tools for secondary
and tertiary structure analysis and those that predict protein
Examples of user-specified patterns:
Looks for short, user-specified patterns in DNA sequences.
You can use it to find restriction sites, consensus binding sites, etc.
It allows ambiguities (by using standard one-letter symbols) and mismatches
(by using the -MIS = # of mismatches parameter).
looks for AvaI sites (Y=pyrimidine, R=purine)
looks for AvaII sites (W=A or T)
looks for consensus prokaryotic promoter (two 6-base sequences separated
by 16 to 18 bases)
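Searches of this kind are easy to imitate with regular expressions. The sketch below is an illustration only: it uses Python's `re` module rather than the syntax of the program above, the function names are invented, and it does not support mismatches. The recognition sequences used are the standard ones for AvaI (CYCGRG) and AvaII (GGWCC), and the promoter is taken to be TTGACA and TATAAT separated by 16 to 18 bases.

```python
import re

# Standard one-letter nucleotide ambiguity codes written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def to_regex(pattern):
    """Turn a pattern written in one-letter codes into a regular expression."""
    return "".join(IUPAC[ch] for ch in pattern.upper())


def find_pattern(pattern, seq):
    """Start positions (0-based) of every match, overlapping ones included."""
    regex = "(?=(%s))" % to_regex(pattern)  # lookahead reports overlaps too
    return [m.start() for m in re.finditer(regex, seq.upper())]


def find_spaced(left, right, min_gap, max_gap, seq):
    """Start positions of left...right separated by min_gap to max_gap bases."""
    regex = "(?=(%s[ACGT]{%d,%d}%s))" % (
        to_regex(left), min_gap, max_gap, to_regex(right)
    )
    return [m.start() for m in re.finditer(regex, seq.upper())]


if __name__ == "__main__":
    print(find_pattern("CYCGRG", "AACTCGAGTTCCCGGGA"))   # AvaI
    print(find_pattern("GGWCC", "TTGGACCAGGTCCA"))       # AvaII
    promoter_like = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
    print(find_spaced("TTGACA", "TATAAT", 16, 18, promoter_like))
```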
Finds nucleotide or amino acid frequencies at each position
of a set of sequences that are already aligned. Then, at each position,
it lists the most common residue or set of residues whose frequencies sum
to a value exceeding a value you specify. You can call this list the consensus
Uses frequency table made by Consensus (above) to search
a sequence for subsequences that best match the frequencies within the
Ref: Staden (1984) Nucl Acids Res 12:505-519
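The two steps (build a frequency table from aligned sequences, then slide it along a new sequence) can be sketched as below. This is a simplified illustration with hypothetical function names. In particular, it scores a window by simply summing the frequencies of its residues, which is an assumption made for brevity and not necessarily the weighting used by the programs above.

```python
from collections import Counter


def frequency_table(aligned):
    """Per-position residue frequencies from equal-length aligned sequences."""
    n = len(aligned)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


def consensus(table, threshold=0.5):
    """At each position, list the most common residues until their summed
    frequency exceeds the threshold."""
    result = []
    for freqs in table:
        chosen, total = [], 0.0
        for residue, f in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
            chosen.append(residue)
            total += f
            if total > threshold:
                break
        result.append("".join(sorted(chosen)))
    return result


def best_match(table, seq):
    """Return (position, score) of the window that best fits the table."""
    width = len(table)

    def score(i):
        return sum(
            freqs.get(residue, 0.0)
            for freqs, residue in zip(table, seq[i:i + width])
        )

    best = max(range(len(seq) - width + 1), key=score)
    return best, score(best)


if __name__ == "__main__":
    table = frequency_table(["TATAAT", "TATGAT", "TACAAT", "TATAAT"])
    print(consensus(table, threshold=0.8))
    print(best_match(table, "GGCGTATAATGCC"))
```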
MEME [GCG] and
Finds statistically overrepresented motifs
in user provided collection of sequences (DNA or protein). If you think
you have several genes that probably share an upstream regulatory sequence,
the program may find it for you. Or it might find a short amino acid sequence common
to a set of proteins of similar function (but if the proteins are TOO similar,
use PILEUP or Clustal). Once you've found one motif or a set of motifs,
you can use the MEME-generated prf file (containing the description of
the motifs) to search a sequence or set of sequences (e.g. all of PIR)
for other instances. In the web version, the program called MAST does the
Motif Sampler [http://bayesweb.wadsworth.org/gibbs/gibbs.html]
Similar in purpose to MEME and works in a very similar
way as well. It differs in that it (but not MEME) requires that the user
specify the width of the motif sought (i.e., number of amino acids or nucleotides).
Also, I think it uses a different strategy to find candidate motifs amongst
the huge number possible, so it may find the optimal motif in considerably
less time. It overlaps with MEME in its answers, but the answers are not
Ref: Lawrence CE et al (1993) Science 262:208-214.
Also similar in purpose to MEME (but works only on DNA
sequences, not protein) and also uses Gibbs sampling. However, the quality
of motifs is assessed through zero- to third-order Markov models. The local
version (but not the version on the web) supplies estimates of the significance
of the motifs found.
Ref: Liu X, Brutlag DL, Liu JS (2001). Pac Symp Biocomput.
(Colorado State University)
Addendum: Commercial Sites
Many tools for quick manipulation of DNA and protein sequences: Inverse,
translate, reverse translate, hydrophobicity plots, dot plots.
Many companies sprang up during the genomics fever, offering
integrated gene finding services. Below is a list of some of these companies
(in alphabetical order). They are still evolving: some might continue this
line of business, the market permitting; others might turn into pharmaceutical
companies in their own right; and others might simply disappear. The list is just
for your information and in no way implies an endorsement on our part
of any of them.
Discovery System (CDS) [http://www.celera.com/index.cfm],
Rosetta Inpharmatics (acquired by Merck).