.TH SPIDEY 1 2005-01-25 NCBI "NCBI Tools User's Manual"
.SH NAME
spidey \- align mRNA sequences to a genome
.SH SYNOPSIS
.B spidey
[\|\fB\-\fP\|]
[\|\fB\-F\fP\ \fIN\fP\|]
[\|\fB\-G\fP\|]
[\|\fB\-L\fP\ \fIN\fP\|]
[\|\fB\-M\fP\ \fIfilename\fP\|]
[\|\fB\-N\fP\ \fIfilename\fP\|]
[\|\fB\-R\fP\ \fIfilename\fP\|]
[\|\fB\-S\fP\ \fIp/m\fP\|]
[\|\fB\-T\fP\ \fIN\fP\|]
[\|\fB\-X\fP\|]
[\|\fB\-a\fP\ \fIfilename\fP\|]
[\|\fB\-c\fP\ \fIN\fP\|]
[\|\fB\-d\fP\|]
[\|\fB\-e\fP\ \fIX\fP\|]
[\|\fB\-f\fP\ \fIX\fP\|]
[\|\fB\-g\fP\ \fIX\fP\|]
\fB\-i\fP\ \fIfilename\fP
[\|\fB\-j\fP\|]
[\|\fB\-k\fP\ \fIfilename\fP\|]
[\|\fB\-l\fP\ \fIN\fP\|]
\fB\-m\fP\ \fIfilename\fP
[\|\fB\-n\fP\ \fIN\fP\|]
[\|\fB\-o\fP\ \fIstr\fP\|]
[\|\fB\-p\fP\ \fIN\fP\|]
[\|\fB\-r\fP\ \fIc/d/m/p/v\fP\|]
[\|\fB\-s\fP\|]
[\|\fB\-t\fP\ \fIfilename\fP\|]
[\|\fB\-u\fP\|]
[\|\fB\-w\fP\|]
.SH DESCRIPTION
\fBspidey\fP is a tool for aligning one or more mRNA sequences to a
given genomic sequence. \fBspidey\fP was written with two main goals
in mind: to find good alignments regardless of intron size, and to
avoid being confused by nearby pseudogenes and paralogs. Towards the
first goal, \fBspidey\fP uses BLAST and DotView to find its
alignments; since both are local alignment tools, \fBspidey\fP does
not intrinsically favor shorter or longer introns and has no maximum
intron size. To avoid mistakenly including exons from paralogs and
pseudogenes, \fBspidey\fP first defines windows on the genomic
sequence and then performs the mRNA-to-genomic alignment separately
within each window. Because of the way the windows are constructed,
neighboring paralogs or pseudogenes should fall in separate windows
and should not be included in the final spliced alignment.
.SS Initial alignments and construction of genomic windows
\fBspidey\fP takes as input a single genomic sequence and a set of
mRNA accessions or FASTA sequences. All processing is done one mRNA
sequence at a time. The first step for each mRNA sequence is a
high-stringency BLAST against the genomic sequence. The resulting
hits are analyzed to find the genomic windows.
.PP
The BLAST alignments are sorted by score and then assigned to
windows by a recursive function, which takes the first alignment and
then goes down the alignment list to find all alignments that are
consistent with the first (same strand of the mRNA, with both the
mRNA and genomic coordinates nonoverlapping and linearly consistent).
On subsequent passes, the remaining alignments are examined and put
into their own nonoverlapping, consistent windows, until no alignments
are left. Depending on how many gene models are desired, the
top \fIn\fP windows are chosen to go on to the next step and the others
are deleted.
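.PP
As an illustration only, the window assignment described above can be
sketched in Python. This is not \fBspidey\fP's own code: the \fBHit\fP
record, its field names, the loop standing in for the recursive
function, and the choice to keep windows in the order their seed hits
were scored are all assumptions made for the example.
.PP
.nf
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical record for one BLAST hit; intervals are half-open.
    score: float
    strand: str   # "+" or "-"
    m_start: int  # interval on the mRNA
    m_stop: int
    g_start: int  # interval on the genomic sequence
    g_stop: int


def consistent(a, b):
    """True if two hits could belong to the same gene model."""
    if a.strand != b.strand:
        return False
    first, second = sorted((a, b), key=lambda h: h.g_start)
    if first.g_stop > second.g_start:
        return False  # genomic intervals overlap
    if a.strand == "+":
        return first.m_stop <= second.m_start
    # On the minus strand the mRNA runs against genomic order.
    return second.m_stop <= first.m_start


def build_windows(hits, n_models=1):
    """Group hits into windows, each seeded by the best remaining hit."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows = []
    while remaining:
        window = [remaining[0]]
        leftover = []
        for hit in remaining[1:]:
            if all(consistent(hit, kept) for kept in window):
                window.append(hit)
            else:
                leftover.append(hit)
        windows.append(window)
        remaining = leftover
    # Assumption: the "top n" windows are the first n that were seeded.
    return windows[:n_models]
```
.fi
.PP
With \fBn_models=1\fP only the window seeded by the best hit is kept,
which mirrors the default of \fB\-n\fP.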
.SS Aligning in each window
Once the genomic windows are constructed, the initial BLAST alignments
are freed and another BLAST search is performed, this time with the
entire mRNA against the genomic region defined by the window, and at a
lower stringency than the initial search. \fBspidey\fP then uses a
greedy algorithm to generate a high-scoring, nonoverlapping subset of
the alignments from the second BLAST search. This consistent set is
analyzed carefully to make sure that the entire mRNA sequence is
covered by the alignments. When gaps are found between the
alignments, the appropriate region of genomic sequence is searched
against the missing mRNA, first using a very low-stringency BLAST and,
if the BLAST fails to find a hit, using DotView functions to locate
the alignment. When gaps are found at the ends of the alignments, the
BLAST and DotView searches are allowed to extend past the
boundaries of the window. If the 3' end of the mRNA does not align
completely, it is first examined for the presence of a poly(A) tail.
No attempt is made to align the portion of the mRNA that seems to be a
poly(A) tail. Sometimes a poly(A) tail does align to the genomic
sequence; such cases are noted because they indicate the possibility
of a pseudogene.
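.PP
The greedy selection and the coverage check can be pictured with the
following Python sketch. It is a simplification under stated
assumptions, not \fBspidey\fP's implementation: alignments are reduced
to hypothetical (score, start, stop) tuples on the mRNA, overlap is
tested on the mRNA only, and the poly(A) thresholds are arbitrary.
.PP
.nf
```python
def greedy_subset(alignments):
    """Pick high-scoring alignments that do not overlap on the mRNA.

    Each alignment is a (score, m_start, m_stop) tuple with a
    half-open mRNA interval (a layout assumed for this sketch).
    """
    chosen = []
    for aln in sorted(alignments, key=lambda a: a[0], reverse=True):
        _, start, stop = aln
        if all(stop <= s or start >= e for _, s, e in chosen):
            chosen.append(aln)
    return sorted(chosen, key=lambda a: a[1])


def uncovered_regions(chosen, mrna_length):
    """Return the mRNA intervals that no chosen alignment covers."""
    gaps = []
    position = 0
    for _, start, stop in sorted(chosen, key=lambda a: a[1]):
        if start > position:
            gaps.append((position, start))
        position = max(position, stop)
    if position < mrna_length:
        gaps.append((position, mrna_length))
    return gaps


def looks_like_polya(tail, min_length=10, min_fraction=0.9):
    """Crude poly(A) test; both thresholds are arbitrary assumptions."""
    if len(tail) < min_length:
        return False
    return tail.upper().count("A") / len(tail) >= min_fraction
```
.fi
.PP
Each interval returned by \fBuncovered_regions\fP corresponds to a
stretch of mRNA that would be searched again at lower stringency,
unless it is an unaligned 3' end that \fBlooks_like_polya\fP accepts.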
.PP
Now that the mRNA is completely covered by the set of alignments, the
boundaries of the alignments (there should now be one alignment per
exon) are adjusted so that the alignments abut each other precisely and
so that they are adjacent to good splice donor and acceptor sites.
Most commonly, two adjacent exons' alignments overlap by as much as 20
or 30 base pairs on the mRNA sequence. The true exon boundary may lie
anywhere within this overlap, or (as we have seen empirically) even a
few base pairs outside the overlap. To position the exon boundaries,
the overlap plus a few base pairs on each side is examined for splice
donor sites, using functions that have different splice matrices
depending on the organism chosen. The top few splice donor sites (by
score) are then evaluated by how much they affect the original
alignment boundaries. The site that affects the boundaries the least
is chosen and is then checked for the presence of an acceptor site.
The alignments are truncated or extended as necessary so that they
terminate at the splice donor site and so that they do not overlap.
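.PP
A minimal Python sketch of the boundary choice follows. The function
name, the candidate format, the \fBtop_k\fP default, and the way
disruption is measured (bases trimmed from or added to either
alignment) are assumptions for illustration; the acceptor-site check
and the organism-specific splice matrices are not modeled.
.PP
.nf
```python
def choose_boundary(candidates, left_end, right_start, top_k=3):
    """Pick an exon boundary from scored donor-site candidates.

    candidates  -- list of (mrna_position, donor_score) pairs
    left_end    -- mRNA position where the upstream alignment ends
    right_start -- mRNA position where the downstream alignment starts
    Returns the chosen mRNA position, where the upstream exon would
    end and the downstream exon would begin.
    """
    if not candidates:
        raise ValueError("no candidate donor sites")
    # Keep only the best-scoring donor sites.
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

    def disruption(pos):
        # Total number of bases by which the two alignments change.
        return abs(left_end - pos) + abs(pos - right_start)

    # Least disruption wins; the donor score breaks ties.
    return min(best, key=lambda c: (disruption(c[0]), -c[1]))[0]
```
.fi
.PP
Under this measure every position inside the overlap is equally
cheap, so among candidates within the overlap the best-scoring donor
site is chosen, while positions outside the overlap cost more the
farther out they lie.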
.SS Final result
The windows are examined carefully to get the percent identity per
exon, the number of gaps per exon, the overall percent identity, the
percent coverage of the mRNA, presence of an aligning or non-aligning
poly(A) tail, number of splice donor sites and the presence or absence
of splice donor and acceptor sites for each exon, and the occurrence
of an mRNA that has a 5' or 3' end (or both) that does not align to
the genomic sequence. If the overall percent identity and percent
length coverage are above the user-defined cutoffs, a summary report
is printed, and, if requested, a text alignment showing identities and
mismatches is also printed.
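.PP
The quality-control step can be illustrated with a short Python
sketch. The exact definitions \fBspidey\fP uses are not given here, so
this example assumes that identity is the share of matching alignment
columns, that coverage is the share of mRNA bases placed in some exon,
and that a value equal to a cutoff passes; all names are hypothetical.
.PP
.nf
```python
def percent_identity(mrna_row, genomic_row):
    """Percent of alignment columns that match; "-" marks a gap."""
    if len(mrna_row) != len(genomic_row):
        raise ValueError("alignment rows must have equal length")
    if not mrna_row:
        return 0.0
    matches = sum(1 for a, b in zip(mrna_row, genomic_row)
                  if a == b and a != "-")
    return 100.0 * matches / len(mrna_row)


def summarize(exons, mrna_length):
    """exons: list of (mrna_row, genomic_row) pairs, one per exon."""
    per_exon = [percent_identity(m, g) for m, g in exons]
    columns = sum(len(m) for m, _ in exons)
    matches = sum(p * len(m) / 100.0
                  for p, (m, _) in zip(per_exon, exons))
    aligned = sum(len(m) - m.count("-") for m, _ in exons)
    return {
        "per_exon_identity": per_exon,
        "overall_identity": 100.0 * matches / columns if columns else 0.0,
        "coverage": 100.0 * aligned / mrna_length,
    }


def passes_cutoffs(summary, identity_cutoff, length_cutoff):
    """Apply cutoffs given in percent, as with the -c and -l options."""
    return (summary["overall_identity"] >= identity_cutoff
            and summary["coverage"] >= length_cutoff)
```
.fi
.PP
A model failing either cutoff would simply not be reported.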
.SS Interspecies alignments
\fBspidey\fP is capable of performing interspecies alignments. The
major difference in interspecies alignments is that the mRNA-genomic
identity will not be close to 100% as it is in intraspecies
alignments; also, the alignments have numerous and lengthy gaps. If
\fBspidey\fP is used in its normal mode to do interspecies alignments,
it produces gene models with a great many short exons. When the
interspecies flag (\fB\-s\fP) is set, \fBspidey\fP uses different
BLAST parameters that allow more and longer gaps and penalize
mismatches less heavily. This way, the alignments for the exons are
much longer and more closely approximate the actual gene structure.
.SS Extracting CDS alignments
When \fBspidey\fP is run in network-aware mode or when ASN.1 files are
used for the mRNA records, it can extract a CDS alignment
from an mRNA alignment and print the CDS information as well. Since
the CDS alignment is just a subset of the mRNA alignment, it is
relatively straightforward to truncate the exon alignments as
necessary and to generate a CDS alignment. Furthermore, the
untranslated regions are now defined, so the percent identity for the
5' and 3' untranslated regions is also calculated.
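.PP
Truncating exon alignments to the coding region amounts to interval
clipping, as in this Python sketch. It works on mRNA coordinates only
and uses hypothetical function names and half-open, 0-based intervals;
mapping the clipped ends back to genomic coordinates is left out.
.PP
.nf
```python
def clip_exons_to_cds(exons, cds_start, cds_stop):
    """Truncate exon intervals to the coding region.

    exons     -- list of (m_start, m_stop) half-open mRNA intervals
    cds_start -- first coding position on the mRNA (0-based)
    cds_stop  -- one past the last coding position
    """
    clipped = []
    for start, stop in exons:
        new_start = max(start, cds_start)
        new_stop = min(stop, cds_stop)
        if new_start < new_stop:  # drop exons that are entirely UTR
            clipped.append((new_start, new_stop))
    return clipped


def utr_intervals(mrna_length, cds_start, cds_stop):
    """Return the 5' and 3' untranslated intervals on the mRNA."""
    return (0, cds_start), (cds_stop, mrna_length)
```
.fi
.PP
Percent identity for each untranslated region could then be computed
over the alignment columns that fall inside the returned intervals.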
.SH OPTIONS
A summary of options is included below.
.TP
\fB\-\fP
Print usage message.
.TP
\fB\-F\fP\ \fIN\fP
Start of genomic interval desired (from; 0-based).
.TP
\fB\-G\fP
Input file is a GI list.
.TP
\fB\-L\fP\ \fIN\fP
The extra-large intron size to use (default = 220000).
.TP
\fB\-M\fP\ \fIfilename\fP
File with donor splice matrix.
.TP
\fB\-N\fP\ \fIfilename\fP
File with acceptor splice matrix.
.TP
\fB\-R\fP\ \fIfilename\fP
Repeat BLAST database (file name including path) to use for filtering.
.TP
\fB\-S\fP\ \fIp/m\fP
Restrict to plus (p) or minus (m) strand of genomic sequence.
.TP
\fB\-T\fP\ \fIN\fP
Stop of genomic interval desired (to; 0-based).
.TP
\fB\-X\fP
Use extra-large intron sizes (increases the limit for initial and
terminal introns from 100kb to 240kb and for all others from 35kb to
120kb); may result in significantly longer compute times.
.TP
\fB\-a\fP\ \fIfilename\fP
Output file for alignments when directed to a separate file with
\fB\-p\ 3\fP (default = spidey.aln).
.TP
\fB\-c\fP\ \fIN\fP
Identity cutoff, in percent, for quality control purposes.
.TP
\fB\-d\fP
Also try to align coding sequences corresponding to the given mRNA
records (may require network access).
.TP
\fB\-e\fP\ \fIX\fP
First-pass e-value (default = 1.0e\-10). Higher values increase speed
at the cost of sensitivity.
.TP
\fB\-f\fP\ \fIX\fP
Second-pass e-value (default = 0.001).
.TP
\fB\-g\fP\ \fIX\fP
Third-pass e-value (default = 10).
.TP
\fB\-i\fP\ \fIfilename\fP
Input file containing the genomic sequence in ASN.1 or FASTA format.
If your computer is running on a network that can access GenBank, you
can substitute the desired accession number for the filename.
.TP
\fB\-j\fP
Also print the alignment in ASN.1 format.
.TP
\fB\-k\fP\ \fIfilename\fP
File for ASN.1 output with \fB\-j\fP (default = spidey.asn).
.TP
\fB\-l\fP\ \fIN\fP
Length coverage cutoff, in percent.
.TP
\fB\-m\fP\ \fIfilename\fP
Input file containing the mRNA sequence(s) in ASN.1 or FASTA format,
or a list of their accessions (with \fB\-G\fP). If your computer is
running on a network that can access GenBank, you can substitute a
single accession number for the filename.
.TP
\fB\-n\fP\ \fIN\fP
Number of gene models to return per input mRNA (default = 1).
.TP
\fB\-o\fP\ \fIstr\fP
Main output file (default = stdout; contents controlled by \fB\-p\fP).
.TP
\fB\-p\fP\ \fIN\fP
Select what is printed:
.RS
.PD 0
.IP \fB0\fP
summary and alignments together (default)
.IP \fB1\fP
just the summary
.IP \fB2\fP
just the alignments
.IP \fB3\fP
summary and alignments in different files
.PD
.RE
.TP
\fB\-r\fP\ \fIc/d/m/p/v\fP
Organism of genomic sequence, used to determine splice matrices.
.RS
.PD 0
.IP \fBc\fP
C. elegans
.IP \fBd\fP
Drosophila
.IP \fBm\fP
Dictyostelium discoideum
.IP \fBp\fP
plant
.IP \fBv\fP
vertebrate (default)
.PD
.RE
.TP
\fB\-s\fP
Tune for interspecies alignments.
.TP
\fB\-t\fP\ \fIfilename\fP
File with feature table, in 4 tab-delimited columns:
.RS
.PD 0
.IP \fIseqid\fP
(e.g., \fBNM_04377.1\fP)
.IP \fIname\fP
(only \fBrepetitive_region\fP is currently supported)
.IP \fIstart\fP
(0-based)
.IP \fIstop\fP
(0-based)
.PD
.RE
.TP
\fB\-u\fP
Make a multiple alignment of all input mRNAs (which must overlap on
the genomic sequence).
.TP
\fB\-w\fP
Consider lowercase characters in input FASTA sequences to be masked.
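.SH EXAMPLES
The following Python sketch assembles a \fBspidey\fP command line from
the options documented above. It is only an illustration: the helper
name and the file names are hypothetical, and the command is printed
rather than executed.
.PP
.nf
```python
import shlex


def build_spidey_command(genomic, mrna, organism="v", models=1,
                         print_mode=0, output=None, interspecies=False):
    """Return an argument list that uses only documented options."""
    if organism not in ("c", "d", "m", "p", "v"):
        raise ValueError("organism must be one of c, d, m, p, v")
    if print_mode not in (0, 1, 2, 3):
        raise ValueError("print_mode must be 0, 1, 2 or 3")
    args = ["spidey", "-i", genomic, "-m", mrna,
            "-r", organism, "-n", str(models), "-p", str(print_mode)]
    if output is not None:
        args += ["-o", output]
    if interspecies:
        args.append("-s")
    return args


# Hypothetical input files: two gene models, summary only.
command = build_spidey_command("genomic.fa", "mrna.fa", models=2,
                               print_mode=1, output="summary.txt")
print(shlex.join(command))
```
.fi
.PP
The resulting list could be passed to a process launcher such as
Python's \fBsubprocess.run\fP on a system where \fBspidey\fP is
installed.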
.SH AUTHOR
Sarah Wheelan and others at the National Center for Biotechnology
Information; Steffen Moeller contributed to this documentation.
.SH SEE ALSO
.BR blast (1),
<http://www.ncbi.nlm.nih.gov/spidey>