- .TH SPIDEY 1 2005-01-25 NCBI "NCBI Tools User's Manual"
- .SH NAME
- spidey \- align mRNA sequences to a genome
- .SH SYNOPSIS
- .B spidey
- [\|\fB\-\fP\|]
- [\|\fB\-F\fP\ \fIN\fP\|]
- [\|\fB\-G\fP\|]
- [\|\fB\-L\fP\ \fIN\fP\|]
- [\|\fB\-M\fP\ \fIfilename\fP\|]
- [\|\fB\-N\fP\ \fIfilename\fP\|]
- [\|\fB\-R\fP\ \fIfilename\fP\|]
- [\|\fB\-S\fP\ \fIp/m\fP\|]
- [\|\fB\-T\fP\ \fIN\fP\|]
- [\|\fB\-X\fP\|]
- [\|\fB\-a\fP\ \fIfilename\fP\|]
- [\|\fB\-c\fP\ \fIN\fP\|]
- [\|\fB\-d\fP\|]
- [\|\fB\-e\fP\ \fIX\fP\|]
- [\|\fB\-f\fP\ \fIX\fP\|]
- [\|\fB\-g\fP\ \fIX\fP\|]
- \fB\-i\fP\ \fIfilename\fP
- [\|\fB\-j\fP\|]
- [\|\fB\-k\fP\ \fIfilename\fP\|]
- [\|\fB\-l\fP\ \fIN\fP\|]
- \fB\-m\fP\ \fIfilename\fP
- [\|\fB\-n\fP\ \fIN\fP\|]
- [\|\fB\-o\fP\ \fIstr\fP\|]
- [\|\fB\-p\fP\ \fIN\fP\|]
- [\|\fB\-r\fP\ \fIc/d/m/p/v\fP\|]
- [\|\fB\-s\fP\|]
- [\|\fB\-t\fP\ \fIfilename\fP\|]
- [\|\fB\-u\fP\|]
- [\|\fB\-w\fP\|]
- .SH DESCRIPTION
- \fBspidey\fP is a tool for aligning one or more mRNA sequences to a
- given genomic sequence. \fBspidey\fP was written with two main goals
- in mind: to find good alignments regardless of intron size, and to
- avoid being confused by nearby pseudogenes and paralogs. Towards the
- first goal, \fBspidey\fP uses BLAST and DotView to find its
- alignments; since both are local alignment tools, \fBspidey\fP does
- not intrinsically favor shorter or longer introns and has no maximum
- intron size. To avoid mistakenly
- including exons from paralogs and pseudogenes, \fBspidey\fP first
- defines windows on the genomic sequence and then performs the
- mRNA-to-genomic alignment separately within each window. Because of
- the way the windows are constructed, neighboring paralogs or
- pseudogenes should be in separate windows and should not be included
- in the final spliced alignment.
- .SS Initial alignments and construction of genomic windows
- \fBspidey\fP takes as input a single genomic sequence and a set of
- mRNA accessions or FASTA sequences. All processing is done one mRNA
- sequence at a time. The first step for each mRNA sequence is a
- high-stringency BLAST against the genomic sequence. The resulting
- hits are analyzed to find the genomic windows.
- .PP
- The BLAST alignments are sorted by score and then assigned into
- windows by a recursive function which takes the first alignment and
- then goes down the alignment list to find all alignments that are
- consistent with the first (same strand of mRNA, both the mRNA and
- genomic coordinates are nonoverlapping and linearly consistent). On
- subsequent passes, the remaining alignments are examined and are put
- into their own nonoverlapping, consistent windows, until no alignments
- are left. Depending on how many gene models are desired, the
- top \fIn\fP windows are chosen to go on to the next step and the others
- are deleted.
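- .PP
- The window assignment described above can be sketched in Python as
- follows. This is only an illustration, not the code \fBspidey\fP
- runs: the \fBHit\fP record, the function names, the use of
- plus-strand genomic coordinates for both strands, and the ranking of
- windows by summed score are all assumptions, and the loop is
- iterative where the text describes a recursive function.
- .PP
- .nf
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical record; coordinates are inclusive)."""
    m_start: int   # start on the mRNA
    m_stop: int    # stop on the mRNA
    g_start: int   # start on the genomic sequence (plus-strand coordinates)
    g_stop: int    # stop on the genomic sequence (plus-strand coordinates)
    strand: str    # '+' or '-'
    score: float


def consistent(a, b):
    """Same strand, no overlap on either sequence, same linear order."""
    if a.strand != b.strand:
        return False
    first, second = (a, b) if a.m_start <= b.m_start else (b, a)
    if first.m_stop >= second.m_start:
        return False  # the two hits overlap on the mRNA
    if a.strand == '+':
        return first.g_stop < second.g_start
    # Minus strand: genomic position decreases as mRNA position increases.
    return second.g_stop < first.g_start


def build_windows(hits, n):
    """Group hits into consistent windows and keep the top n."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows = []
    while remaining:
        window = [remaining[0]]          # seed with the best remaining hit
        rest = []
        for hit in remaining[1:]:
            if all(consistent(hit, w) for w in window):
                window.append(hit)
            else:
                rest.append(hit)         # left for a later pass
        windows.append(window)
        remaining = rest
    # Assumption: windows are ranked by the sum of their hit scores.
    windows.sort(key=lambda w: sum(h.score for h in w), reverse=True)
    return windows[:n]
```
- .fi
- .PP
- In this sketch, hits that cover the same mRNA region at two distant
- genomic locations can never share a window, which is the property
- that keeps a neighboring paralog out of the main gene model.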
- .SS Aligning in each window
- Once the genomic windows are constructed, the initial BLAST alignments
- are freed and another BLAST search is performed, this time with the
- entire mRNA against the genomic region defined by the window, and at a
- lower stringency than the initial search. \fBspidey\fP then uses a
- greedy algorithm to generate a high-scoring, nonoverlapping subset of
- the alignments from the second BLAST search. This consistent set is
- analyzed carefully to make sure that the entire mRNA sequence is
- covered by the alignments. When gaps are found between the
- alignments, the appropriate region of genomic sequence is searched
- against the missing mRNA, first using a very low-stringency BLAST and,
- if the BLAST fails to find a hit, using DotView functions to locate
- the alignment. When gaps are found at the ends of the alignments, the
- BLAST and DotView searches are allowed to extend past the
- boundaries of the window. If the 3' end of the mRNA does not align
- completely, it is first examined for the presence of a poly(A) tail.
- No attempt is made to align the portion of the mRNA that seems to be a
- poly(A) tail; sometimes there is a poly(A) tail that does align to the
- genomic sequence, and these are noted because they indicate the
- possibility of a pseudogene.
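- .PP
- The greedy selection and the coverage check can be illustrated with
- a short Python sketch. It is a simplification under stated
- assumptions, not the real implementation: hits are reduced to
- (mRNA start, mRNA stop, score) tuples with inclusive 0-based
- coordinates, any overlap on the mRNA counts as a conflict, and the
- names \fBgreedy_subset\fP and \fBuncovered\fP are hypothetical.
- .PP
- .nf
```python
def greedy_subset(hits):
    """Pick hits best-score-first, skipping any that overlap a chosen hit.

    hits: list of (m_start, m_stop, score) tuples on the mRNA.
    Returns the chosen hits sorted by mRNA position.
    """
    chosen = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if all(hit[1] < c[0] or hit[0] > c[1] for c in chosen):
            chosen.append(hit)
    return sorted(chosen)


def uncovered(chosen, mrna_len):
    """Return the mRNA intervals not covered by the chosen hits."""
    gaps = []
    pos = 0                                # next mRNA position still uncovered
    for m_start, m_stop, _score in chosen:
        if m_start > pos:
            gaps.append((pos, m_start - 1))
        pos = max(pos, m_stop + 1)
    if pos < mrna_len:
        gaps.append((pos, mrna_len - 1))   # unaligned 3' end
    return gaps
```
- .fi
- .PP
- Each interval returned by \fBuncovered\fP corresponds to a region
- that would then be searched again at lower stringency.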
- .PP
- Now that the mRNA is completely covered by the set of alignments, the
- boundaries of the alignments (there should be one alignment per exon
- now) are adjusted so that the alignments abut each other precisely and
- so that they are adjacent to good splice donor and acceptor sites.
- Most commonly, two adjacent exons' alignments overlap by as much as 20
- or 30 base pairs on the mRNA sequence. The true exon boundary may lie
- anywhere within this overlap, or (as we have seen empirically) even a
- few base pairs outside the overlap. To position the exon boundaries,
- the overlap plus a few base pairs on each side is examined for splice
- donor sites, using functions that have different splice matrices
- depending on the organism chosen. The top few splice donor sites (by
- score) are then evaluated for how much they affect the original
- alignment boundaries. The site that affects the boundaries the least
- is chosen and is then checked for the presence of an acceptor site.
- The alignments are truncated or extended as necessary so that they
- terminate at the splice donor site and so that they do not overlap.
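- .PP
- The choice among candidate donor sites can be sketched as below.
- This is a hedged illustration only: the candidate scores are taken as
- given (the splice matrices are not modeled), "the top few" is assumed
- to mean the best three, the effect on the boundaries is assumed to be
- the distance from the original boundary, and \fBpick_boundary\fP is
- a hypothetical name.
- .PP
- .nf
```python
def pick_boundary(candidates, original_boundary, top_k=3):
    """Choose an exon boundary from scored donor-site candidates.

    candidates: list of (mrna_position, donor_score) tuples found in the
    overlap region plus a few bases on each side.
    Keeps the top_k candidates by score, then returns the position that
    moves the original alignment boundary the least.
    """
    if not candidates:
        return original_boundary           # nothing better to offer
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    return min(best, key=lambda c: abs(c[0] - original_boundary))[0]
```
- .fi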
- .SS Final result
- The windows are examined carefully to get the percent identity per
- exon, the number of gaps per exon, the overall percent identity, the
- percent coverage of the mRNA, presence of an aligning or non-aligning
- poly(A) tail, number of splice donor sites and the presence or absence
- of splice donor and acceptor sites for each exon, and the occurrence
- of an mRNA that has a 5' or 3' end (or both) that does not align to
- the genomic sequence. If the overall percent identity and percent
- length coverage are above the user-defined cutoffs, a summary report
- is printed, and, if requested, a text alignment showing identities and
- mismatches is also printed.
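- .PP
- As a rough illustration of the final quality check, the following
- Python sketch computes an overall percent identity and percent
- coverage from per-exon counts and compares them with the cutoffs
- given by \fB\-c\fP and \fB\-l\fP. The exact formulas used by
- \fBspidey\fP are not stated here, so the definitions below (matches
- over aligned length, aligned length over mRNA length, cutoffs treated
- as inclusive) and the name \fBsummarize\fP are assumptions.
- .PP
- .nf
```python
def summarize(exons, mrna_len, min_identity=0.0, min_coverage=0.0):
    """Return (percent_identity, percent_coverage, passed).

    exons: list of (aligned_length, matches) tuples, one per exon.
    mrna_len: length of the mRNA in bases.
    """
    aligned = sum(length for length, _matches in exons)
    matches = sum(m for _length, m in exons)
    identity = 100.0 * matches / aligned if aligned else 0.0
    coverage = 100.0 * aligned / mrna_len if mrna_len else 0.0
    passed = identity >= min_identity and coverage >= min_coverage
    return identity, coverage, passed
```
- .fi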
- .SS Interspecies alignments
- \fBspidey\fP is capable of performing interspecies alignments. The
- major difference in interspecies alignments is that the mRNA-genomic
- identity will not be close to 100% as it is in intraspecies
- alignments; also, the alignments have numerous and lengthy gaps. If
- \fBspidey\fP is used in its normal mode to do interspecies alignments,
- it produces gene models with a large number of short exons. When the
- interspecies flag (\fB\-s\fP) is set, \fBspidey\fP uses different BLAST
- parameters that allow more and longer gaps and penalize mismatches
- less heavily. This way, the alignments for the exons are much longer
- and more closely approximate the actual gene structure.
- .SS Extracting CDS alignments
- When \fBspidey\fP is run in network-aware mode or when ASN.1 files are
- used for the mRNA records, it is capable of extracting a CDS alignment
- from an mRNA alignment and printing the CDS information as well. Since
- the CDS alignment is just a subset of the mRNA alignment, it is
- relatively straightforward to truncate the exon alignments as
- necessary and to generate a CDS alignment. Furthermore, the
- untranslated regions are now defined, so the percent identity for the
- 5' and 3' untranslated regions is also calculated.
- .PP
- .SH OPTIONS
- A summary of options is included below.
- .TP
- \fB\-\fP
- Print usage message.
- .TP
- \fB\-F\fP\ \fIN\fP
- Start of genomic interval desired (from; 0-based).
- .TP
- \fB\-G\fP
- Input file is a GI list.
- .TP
- \fB\-L\fP\ \fIN\fP
- The extra-large intron size to use (default = 220000).
- .TP
- \fB\-M\fP\ \fIfilename\fP
- File with donor splice matrix.
- .TP
- \fB\-N\fP\ \fIfilename\fP
- File with acceptor splice matrix.
- .TP
- \fB\-R\fP\ \fIfilename\fP
- File (including path) to repeat blast database for filtering.
- .TP
- \fB\-S\fP\ \fIp/m\fP
- Restrict to plus (p) or minus (m) strand of genomic sequence.
- .TP
- \fB\-T\fP\ \fIN\fP
- Stop of genomic interval desired (to; 0-based).
- .TP
- \fB\-X\fP
- Use extra-large intron sizes (increases the limit for initial and
- terminal introns from 100kb to 240kb and for all others from 35kb to
- 120kb); may result in significantly longer compute times.
- .TP
- \fB\-a\fP\ \fIfilename\fP
- Output file for alignments when directed to a separate file with
- \fB\-p\ 3\fP (default = spidey.aln).
- .TP
- \fB\-c\fP\ \fIN\fP
- Identity cutoff, in percent, for quality control purposes.
- .TP
- \fB\-d\fP
- Also try to align coding sequences corresponding to the given mRNA
- records (may require network access).
- .TP
- \fB\-e\fP\ \fIX\fP
- First-pass e-value (default = 1.0e-10). More stringent (smaller) values
- increase speed at the cost of sensitivity.
- .TP
- \fB\-f\fP\ \fIX\fP
- Second-pass e-value (default = 0.001).
- .TP
- \fB\-g\fP\ \fIX\fP
- Third-pass e-value (default = 10).
- .TP
- \fB\-i\fP\ \fIfilename\fP
- Input file containing the genomic sequence in ASN.1 or FASTA format.
- If your computer is running on a network that can access GenBank, you
- can substitute the desired accession number for the filename.
- .TP
- \fB\-j\fP
- Print the alignment in ASN.1 format.
- .TP
- \fB\-k\fP\ \fIfilename\fP
- File for ASN.1 output with \fB\-j\fP (default = spidey.asn).
- .TP
- \fB\-l\fP\ \fIN\fP
- Length coverage cutoff, in percent.
- .TP
- \fB\-m\fP\ \fIfilename\fP
- Input file containing the mRNA sequence(s) in ASN.1 or FASTA format,
- or a list of their GIs (with \fB\-G\fP). If your computer is
- running on a network that can access GenBank, you can substitute a
- single accession number for the filename.
- .TP
- \fB\-n\fP\ \fIN\fP
- Number of gene models to return per input mRNA (default = 1).
- .TP
- \fB\-o\fP\ \fIstr\fP
- Main output file (default = stdout; contents controlled by \fB\-p\fP).
- .TP
- \fB\-p\fP\ \fIN\fP
- What to print:
- .RS
- .PD 0
- .IP \fB0\fP
- summary and alignments together (default)
- .IP \fB1\fP
- just the summary
- .IP \fB2\fP
- just the alignments
- .IP \fB3\fP
- summary and alignments in different files
- .PD
- .RE
- .TP
- \fB\-r\fP\ \fIc/d/m/p/v\fP
- Organism of genomic sequence, used to determine splice matrices.
- .RS
- .PD 0
- .IP \fBc\fP
- C. elegans
- .IP \fBd\fP
- Drosophila
- .IP \fBm\fP
- Dictyostelium discoideum
- .IP \fBp\fP
- plant
- .IP \fBv\fP
- vertebrate (default)
- .PD
- .RE
- .TP
- \fB\-s\fP
- Tune for interspecies alignments.
- .TP
- \fB\-t\fP\ \fIfilename\fP
- File with feature table, in 4 tab-delimited columns:
- .RS
- .PD 0
- .IP \fIseqid\fP
- (e.g., \fBNM_04377.1\fP)
- .IP \fIname\fP
- (only \fBrepetitive_region\fP is currently supported)
- .IP \fIstart\fP
- (0-based)
- .IP \fIstop\fP
- (0-based)
- .PD
- .RE
- .TP
- \fB\-u\fP
- Make a multiple alignment of all input mRNAs (which must overlap on
- the genomic sequence).
- .TP
- \fB\-w\fP
- Consider lowercase characters in input FASTA sequences to be masked.
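- .PP
- For scripted use, a command line can be assembled from the options
- above. The Python helper below is a minimal sketch: the name
- \fBspidey_argv\fP and the file names in the example are
- hypothetical, and only a few of the documented options are covered.
- It builds the argument list without running anything.
- .PP
- .nf
```python
def spidey_argv(genomic, mrna, organism="v", models=1, print_mode=0,
                interspecies=False, output=None):
    """Build an argument list for spidey from a few documented options."""
    if organism not in {"c", "d", "m", "p", "v"}:
        raise ValueError("organism must be one of c, d, m, p, v")
    if print_mode not in {0, 1, 2, 3}:
        raise ValueError("print_mode must be 0, 1, 2 or 3")
    argv = ["spidey",
            "-i", genomic,           # genomic sequence (required)
            "-m", mrna,              # mRNA sequence(s) (required)
            "-r", organism,          # selects the splice matrices
            "-n", str(models),       # gene models per mRNA
            "-p", str(print_mode)]   # summary and/or alignments
    if interspecies:
        argv.append("-s")            # tune for interspecies alignments
    if output is not None:
        argv += ["-o", output]       # main output file instead of stdout
    return argv


# Example with placeholder file names; pass the list to subprocess.run
# on a system where spidey is installed.
example = spidey_argv("genomic.fasta", "mrna.fasta", print_mode=1)
```
- .fi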
- .SH AUTHOR
- Sarah Wheelan and others at the National Center for Biotechnology
- Information; Steffen Moeller contributed to this documentation.
- .SH SEE ALSO
- .BR blast (1),
- <http://www.ncbi.nlm.nih.gov/spidey>