Gene finders

Possibly useful programs for filtering sequences

Gene finders
Similarity finders
DNA feature finders
Protein feature finders
Pattern finders
Miscellaneous tools
Commercial sites

Gene Finders

TestCode [GCG]

general predictions

Ref: Fickett J (1982) Nucl Acids Res 10:5303-5318.

FGeneSH (w/donor GC) [http://www.softberry.com/berry.phtml?topic=gfind]

eukaryotic genomic

Ref: Salamov A.A., Solovyev V.V. (1999), unpublished data. "Please, reference: Computational Genomics Group (Sanger Centre) WEB server." Possibly also: Salamov AA, Solovyev VV (2000) Genome Res 10:516-522.

BestOrf [http://genomic.sanger.ac.uk/gf/gf.html]

eukaryotic cDNA

Ref: Solovyev V.V., Salamov A.A., unpublished data.

Grail [http://compbio.ornl.gov/Grail-1.3/]

eukaryotic genomic

GeneMark [http://opal.biology.gatech.edu/GeneMark/]

prokaryotic sequences

eukaryotic sequences

Ref: Lukashin A, Borodovsky M (1998) Nucl Acids Res 26:1107-1115; Besemer J, Lomsadze A, Borodovsky M (2001) Nucl Acids Res 29:2607-2618.

FRAMES [GCG]

any sequence

MAP [GCG]

translation of any sequence

Not

Swa

Orf Finder [http://www.ncbi.nlm.nih.gov/gorf/]

every open reading frame
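
As a rough illustration of what reporting "every open reading frame" involves, here is a minimal Python sketch that scans all six reading frames for stretches running from an ATG to the next in-frame stop codon. It is not NCBI's implementation: the function names, the `min_codons` cutoff, and the decision to require an ATG start are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=10):
    """List ORFs as (strand, frame, start, end) tuples.

    Coordinates are 0-based and half-open. For the "-" strand they refer
    to the reverse-complemented sequence, not to the input sequence.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the ATG that opened the current ORF
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    # Count coding codons (ATG included, stop excluded).
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None
    return orfs
```

An ORF that starts but never reaches a stop codon before the end of the sequence is not reported by this sketch; a real tool may treat such partial ORFs differently.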

Similarity Finders

BLAST (Basic local alignment search tool) [http://www.ncbi.nlm.nih.gov/BLAST/]

region

Ref: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Nucl Acids Res 25:3389-3402.

BlastP: Compares your protein sequence against a database of protein sequences. You'll probably use this most often, or derivatives of it (Psi-BLAST or Phi-BLAST). Also provides results from a CD (conserved domain)-Search of Pfam motifs, which is often worth the price of admission by itself.
BlastN: Compares your DNA sequence against a database of DNA sequences. Use this if you're interested in finding conserved DNA features, such as transposon ends or regulatory sites. If you want to find what gene may be contained within your DNA sequence, use BlastX.
BlastX: Translates your DNA sequence in all six possible reading frames and compares the resulting amino acid sequences against a database of protein sequences.
Blast 2 sequences: Compares your DNA or protein sequence against another of your DNA or protein sequences. This is a quick way of getting an alignment of two sequences.
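
The six-frame translation that BlastX performs before searching can be sketched in a few lines of Python. This is only an illustration using the standard genetic code; the function names are made up for the example, and unrecognized codons are written as "X" by assumption.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; a trailing partial codon is dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the six translations keyed "+1", "+2", "+3", "-1", "-2", "-3"."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives `"MA*"`, where `*` marks a stop codon.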

FastA [http://www.genome.ad.jp/]

How to confine searches to specific databases or specific organisms

Confining a search is often a good idea, either to get more accurate E values or because you're interested only in hits within a specific class of sequences. Or you may want to search only cDNA, in order to figure out exon/intron structure. There are many ways to get the job done:

Within BLAST (any flavor), scroll down to "Options for advanced blasting". On the line beginning "Limit by…" select the desired class of organisms from the pull down menu at the right. Or (same line) specify a key word (like the name of the species) in the box to the left. This latter choice provides a very leaky filter.

To search only ESTs (expressed sequence tags; cDNAs) within BLASTN, click the arrow by the box next to "Choose database" and select "est_human" (or whatever database is appropriate).

Find an organism-specific web site. There are many lists of such sites. Here are some (please pardon my ignorance of eukaryotes):

Prokaryotic Ongoing Genome Projects [http://ergo.integratedgenomics.com/GOLD/prokaryagenomes.html]

TIGR (The Institute for Genomic Research) Comprehensive Microbial Database [http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl]

Go online (e.g. www.google.com) and search for [your favorite organism] + "genome"

DNA Feature Finders

TSSW (Transcriptional Start Site - Wingender) [http://www.softberry.com/berry.phtml?topic=promoter]

eukaryotic transcriptional start sites

Ref: Solovyev V.V., Salamov A.A., Lawrence C.B. (unpublished).

BProm [http://www.softberry.com/berry.phtml?topic=promoter]

prokaryotic promoter sites

E. coli

Warning:

PolyAH [http://www.softberry.com/berry.phtml?topic=promoter]

Predicts eukaryotic polyA processing sites, using a program trained on a collection of human sequences containing bona fide AATAAA processing signals. The authors claim that, with an appropriate threshold value, it finds 86% of correct sites with 8% false positives.

Ref: Salamov AA, Solovyev VV (1997). Computer Applic Biosci 13:23-28.

Tandem Repeats Finder [http://tandem.bu.edu/trf/trf.submit.options.html]

tandem DNA repeats

(repeated sequences that occur one after another). Can work on huge chunks of DNA at one time and find repeated sequences with a periodicity between 1 and 500 nucleotides (no need for user to specify size). Permits short gaps between repeats.

Ref: Benson G (1999) Nucl Acids Res 27:573-580.
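
To make the idea of a tandem repeat concrete, here is a deliberately simplified Python sketch that reports only exact, ungapped repeats. It is not the Tandem Repeats Finder algorithm, which also tolerates mismatches and short gaps; the function name and the `max_period` and `min_copies` parameters are assumptions for this example.

```python
import re


def find_exact_tandem_repeats(dna, max_period=6, min_copies=3):
    """Return sorted (start, unit, copies) tuples for exact tandem repeats.

    Only perfect, ungapped repeats are detected, and matches for a given
    period do not overlap one another.
    """
    dna = dna.upper()
    hits = []
    for period in range(1, max_period + 1):
        # A unit of `period` bases followed by at least min_copies - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_copies - 1))
        for match in pattern.finditer(dna):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. "CACA"), since the shorter period already reports them.
            if any(unit == unit[:k] * (period // k)
                   for k in range(1, period) if period % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group()) // period))
    return sorted(hits)
```

For example, `find_exact_tandem_repeats("GGCACACACATT")` reports four copies of the unit `CA` starting at position 2 (0-based).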

REPEATS [GCG]

repeats

TERMINATOR [GCG]

prokaryotic transcriptional terminators

Ref: Brendel V, Trifonov EN (1984) Nucl Acids Res 12:4411-4427.

TIGR Software Tools [http://www.tigr.org/softlab/]

WebLogo [http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi]

Protein Feature Finders

DAS (Dense alignment surface method) [http://www.sbc.su.se/~miklos/DAS/]

transmembrane regions

Ref: Cserzo M, Wallin E, Simon I, von Heijne G, and Elofsson A (1997). Prot Engineering 10:673-676

Pfam (protein families) [http://pfam.wustl.edu/]

motifs

Ref: Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL (2000) Nucl Acids Res 28:263-266.

SMART (Simple modular architecture research tool) [http://smart.embl-heidelberg.de/]

motifs

Ref: Letunic I et al (2002) Nucl Acids Res 30:242-244.

BLASTP (and derivatives) [http://www.ncbi.nlm.nih.gov/BLAST/]

searches through Pfam and SMART

Ref: Marchler-Bauer A., Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH (2002) Nucl Acids Res 30:281-283.

SignalP v2.0 [http://www.cbs.dtu.dk/services/SignalP-2.0/]

signal peptide cleavage sites

Ref (Review): Nielsen H, Brunak S, von Heijne G (1999) Protein Engineering 12:3-9.

PILEUP [GCG]

Multiple sequence alignment

ClustalW [http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html]

Multiple sequence alignment

Ref: (ClustalW) Thompson JD, Higgins DG, Gibson TJ (1994) Nucl Acids Res 22:4673-4680

(ClustalX) Thompson JD et al (1997) Nucl Acids Res 25:4876-4882.

ExPASy Proteomics Tools [http://www.expasy.ch/tools/]

characterize peptide sequences

identify proteins

secondary and tertiary structure

protein modifications

Pattern Finders

FindPatterns [GCG]

patterns in DNA sequences

Examples of user-specified patterns, written with the standard one-letter nucleotide codes:

CYCGRG	looks for AvaI sites (Y=pyrimidine, R=purine)
GGWCC	looks for AvaII sites (W=A or T)
TTGACA(N){16,18}TATAAT	looks for consensus prokaryotic promoter (two 6-base sequences separated by 16 to 18 bases)
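
Patterns like those above map closely onto regular expressions. The following Python sketch expands the one-letter ambiguity codes into character classes and leaves the repeat syntax `(N){16,18}` untouched, since it is already valid regex. This is an illustration, not GCG's FindPatterns: the function names are hypothetical, only the given strand is searched, and overlapping sites are not reported.

```python
import re

# IUPAC one-letter nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pattern_to_regex(pattern):
    """Compile a pattern such as CYCGRG or TTGACA(N){16,18}TATAAT."""
    # Characters that are not nucleotide codes (parentheses, braces,
    # digits, commas) pass through unchanged as regex syntax.
    return re.compile("".join(IUPAC.get(ch, ch) for ch in pattern.upper()))


def find_sites(pattern, dna):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = pattern_to_regex(pattern)
    return [(m.start(), m.group()) for m in regex.finditer(dna.upper())]
```

For example, `find_sites("GGWCC", "GGACCGGTCC")` finds both AvaII variants, `GGACC` at position 0 and `GGTCC` at position 5.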

Consensus [GCG]

nucleotide or amino acid frequencies

FitConsensus [GCG]

Uses frequency table made by Consensus (above) to search a sequence for subsequences that best match the frequencies within the table.

Ref: Staden (1984) Nucl Acids Res 12:505-519
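
The general idea behind the Consensus and FitConsensus pair can be sketched as follows: tabulate per-position base frequencies from a set of aligned sites, then slide that table along a sequence and keep the best-scoring window. The scoring used here (a plain sum of frequencies) and the function names are assumptions for illustration; the GCG programs may weight and report matches differently.

```python
def frequency_table(aligned_sites):
    """Per-position base frequencies from equal-length aligned DNA sites."""
    width = len(aligned_sites[0])
    table = []
    for col in range(width):
        column = [site[col] for site in aligned_sites]
        table.append({base: column.count(base) / len(column) for base in "ACGT"})
    return table


def best_match(table, dna):
    """Return (score, start, window) for the best-scoring window in dna.

    The score of a window is the sum, over positions, of the frequency of
    the observed base at that position. Returns None if dna is too short.
    """
    width = len(table)
    best = None
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        score = sum(table[i].get(base, 0.0) for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best
```

With the sites `["TATAAT", "TATGAT", "TACAAT", "TATAAT"]`, the best window in `"GGGTATAATCCC"` is `TATAAT` at position 3 (0-based).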

MEME [GCG] and [http://meme.sdsc.edu/meme/website/]

Finds statistically overrepresented motifs in a user-provided collection of sequences (DNA or protein). If you think you have several genes that probably share an upstream regulatory sequence, the program may find it for you. Or it might find a short amino acid sequence common to a set of proteins of similar function (but if the proteins are TOO similar, use PILEUP or Clustal). Once you've found one motif or a set of motifs, you can use the MEME-generated prf file (containing the description of the motifs) to search a sequence or set of sequences (e.g. all of PIR) for other instances. In the web version, the program called MAST does the same thing.

Gibbs Motif Sampler [http://bayesweb.wadsworth.org/gibbs/gibbs.html]

Similar in purpose to MEME, and it works in a very similar way as well. It differs in that it (but not MEME) requires that the user specify the width of the motif sought (i.e., number of amino acids or nucleotides). Also, I think it uses a different strategy to find candidate motifs amongst the huge number possible, so it may find the optimal motif in considerably less time. Its answers overlap with MEME's, but they are not identical.

Ref: Lawrence CE et al (1993) Science 262:208-214.

BioProspector [http://bioprospector.stanford.edu/]

Also similar in purpose to MEME (but works only on DNA sequences, not protein) and also uses Gibbs sampling. However, the quality of motifs is assessed through zero- to third-order Markov models. The local version (but not the version on the web) supplies estimates of the significance of the motifs found.

Ref: Liu X, Brutlag DL, Liu JS (2001) Pac Symp Biocomput, pp 127-138.

Miscellaneous Tools

Molecular Toolkit [http://arbl.cvmbs.colostate.edu/molkit/index.html] (Colorado State University)
Many tools for quick manipulation of DNA and protein sequences: Inverse, translate, reverse translate, hydrophobicity plots, dot plots.

Addendum: Commercial Sites

Many companies sprang up with the genomics fever, offering integrated gene-finding services. Below is a list of some of these companies (in alphabetical order). They are still evolving: some might continue this line of business, the market permitting; others might turn into pharmaceutical companies in their own right; and others might simply disappear. The list is just for your information and in no way implies an endorsement on our part of any of them.

Celera Discovery System (CDS) [http://www.celera.com/index.cfm], Celera Genomics.

LifeSeq® [http://www.incyte.com/#], Incyte Genomics.

genomeSCOUT® [http://www.lionbioscience.com/solutions/genomescout], LION Bioscience.

Rosetta Resolver® System [http://www.rosettabio.com/home.html], Rosetta Inpharmatics (acquired by Merck).