Sequence Analysis Tools and Databases
Many of the software tools used in studying
genomes involve sequence analysis, which is one of the many subfields
of computational molecular biology. The field of sequence analysis
includes pattern and motif searching, sequence comparison, multiple
sequence alignment, sequence composition determination, and secondary
structure prediction. Because sequence data consists primarily of
character strings, it's relatively easy to process
the sequence entries in a flat file. Bioinformaticians use a variety
of tools to perform sequence analysis,
including:
Standard Unix tools (e.g., the grep family, sed, awk, and cut).
Publicly available tools (e.g., BLAST, the EMBOSS package).
Open source libraries (e.g., BioPerl, BioJava, BioPython, BioRuby).
Custom tools.
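To illustrate why string-based sequence data is easy to process, here is a minimal sketch of a custom tool that reads FASTA-style entries and counts residue composition. It assumes the common FASTA convention (a header line starting with ">" followed by sequence lines); the function names and the sample entries are hypothetical and invented for illustration, not taken from any database.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-style lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


def composition(sequence):
    """Count how often each character occurs in a sequence (case-insensitive)."""
    return Counter(sequence.upper())


# Invented sample data, for illustration only.
SAMPLE = """\
>seq1 hypothetical example entry
ACGTACGT
ACGG
>seq2 another hypothetical entry
TTTTAA
"""

records = list(read_fasta(SAMPLE.splitlines()))
for header, seq in records:
    print(header, len(seq), dict(composition(seq)))
```

In practice, a library such as BioPerl or BioPython would usually be a better choice than hand-rolled parsing, since real flat files contain more edge cases than this sketch handles.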
Finding these tools is pretty easy, but remembering all the
command-line options for your favorites is often more difficult.
Nearly all of these tools were written to manipulate and analyze data
stored in databases. Many of the most important biological databases
have existed for a decade or more, making them almost ancient in this
fast-moving field. The first public release of GenBank (Release 3),
in December 1982, contained 606 sequences and 680,338
base pairs. Release 132, from October 2002, had 19,808,101 sequences
containing 26,525,934,656 base pairs. SWISS-PROT has grown from 3939
protein sequences containing 900,163 amino acids (Release 2.0 in
September 1986) to 101,602 protein sequences containing 37,315,215
amino acids (Release 40.0 in October 2001).
Plenty of data is available, and finding it is easy. Downloading it
is almost as simple, assuming you've got a broadband
Internet connection and plenty of disk space. The hard part is
dealing with the plethora of flat file formats and trying to remember
what their specific field codes mean. Most of us survive by either
having hard copies of README files lying around or remembering
exactly where to go look for something we need. The need to remember
details about our favorite tools and databases prompted us to gather
the information and organize it into this book.
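As a rough illustration of what working with field codes looks like, the sketch below groups the lines of a SWISS-PROT-style entry by their two-letter line code (such as ID, AC, DE, and SQ, with "//" ending the entry). It assumes the code occupies the first two columns and the data starts at column six. The record is invented and simplified (a real SQ line carries more information), and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_line_code(record_text):
    """Return (fields, sequence) for one SWISS-PROT-style flat file entry.

    fields maps each two-letter line code to the list of its data strings;
    sequence is the concatenated sequence data from the unlabeled lines.
    """
    fields = defaultdict(list)
    sequence_parts = []
    for line in record_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            break  # "//" terminates the entry
        code = line[:2].strip()
        data = line[5:].strip()
        if code:
            fields[code].append(data)
        else:
            # Lines without a code hold sequence data; drop the spacing.
            sequence_parts.append(data.replace(" ", ""))
    return dict(fields), "".join(sequence_parts)


# Invented, simplified record for illustration only.
RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   12 AA.
AC   X00000;
DE   Hypothetical example protein.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

fields, sequence = group_by_line_code(RECORD)
print(fields["AC"], sequence)
```

Knowing what each code means still requires the database's documentation; the sketch only shows that, once the codes are known, pulling out a field is a simple matter of string handling.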