@node POSIX regular expressions
@section Regular expressions
@cindex string matching

The procedures in this section provide access to POSIX regular
expression matching. The regular expression syntax and semantics are
too complex to describe here; consult the POSIX specification for
details.

@strong{Note:} Because the C interface uses ASCII @code{NUL} bytes to
mark the ends of strings, patterns & strings that contain @code{NUL}
characters will not work correctly.

@subsection Direct POSIX regular expression interface
@stindex posix-regexps

The first interface to regular expressions is a thin layer over the
interface that POSIX provides. It is exported by the structures
@code{posix-regexps} & @code{posix}.

@deffn procedure make-regexp string option @dots{} @returns{} regexp
@deffnx procedure regexp? object @returns{} boolean
@code{Make-regexp} creates a regular expression with the given string
pattern. The arguments after @var{string} specify various options for
the regular expression; see @code{regexp-option} below. The regular
expression is not compiled until it is matched against a string, so any
errors in the pattern string will not be reported until that point.

@code{Regexp?} is the disjoint type predicate for regular expression
objects.
@end deffn

@deffn syntax regexp-option name @returns{} regexp-option
Evaluates to a regular expression option, suitable to be passed to
@code{make-regexp}, with the given name. The possible option names
are:

@table @code
@item extended
use the extended patterns
@item ignore-case
ignore case differences when matching
@item submatches
report submatches
@item newline
treat newlines specially
@end table
@end deffn
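
As a brief illustration of the two definitions above, the following
sketch builds a case-insensitive pattern in the extended syntax. It
assumes that the @code{posix-regexps} structure has been opened (for
instance with @code{,open posix-regexps} at the REPL); the pattern
string and the name @code{r} are made up for the example.

@lisp
;; R is a hypothetical name; the pattern is only an illustration.
(define r
  (make-regexp "ab+c"
               (regexp-option extended)
               (regexp-option ignore-case)))

(regexp? r)      @result{} #t
(regexp? "ab+c") @result{} #f   ; a pattern string is not a regexp object
@end lisp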

@deffn procedure regexp-match regexp string start submatches? starts-line? ends-line? @returns{} boolean or list of matches
@code{Regexp-match} matches @var{regexp} against the characters in
@var{string}, starting at position @var{start}. If the string does not
match the regular expression, @code{regexp-match} returns @code{#f}.
If the string does match, then a list of match records is returned if
@var{submatches?} is true or @code{#t} if @var{submatches?} is false.
The first match record gives the location of the substring that matched
@var{regexp}. If the pattern in @var{regexp} contained submatches,
then the submatches are returned in order, with match records in the
positions where submatches succeeded and @code{#f} in the positions
where submatches failed.

@var{Starts-line?} should be true if @var{string} starts at the
beginning of a line, and @var{ends-line?} should be true if it ends
at the end of one.
@end deffn
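
A minimal sketch of the simplest use, where only a yes-or-no answer is
wanted. It assumes the @code{posix-regexps} structure is open; the
pattern and the name @code{r} are illustrative only.

@lisp
(define r (make-regexp "b+" (regexp-option extended)))

;; start = 0, submatches? = #f, starts-line? & ends-line? both true
(regexp-match r "aabbbcc" 0 #f #t #t) @result{} #t
(regexp-match r "xyz" 0 #f #t #t)     @result{} #f
@end lisp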

@deffn procedure match? object @returns{} boolean
@deffnx procedure match-start match @returns{} integer
@deffnx procedure match-end match @returns{} integer
@deffnx procedure match-submatches match @returns{} alist
@code{Match?} is the disjoint type predicate for match records. Match
records contain three values: the beginning & end of the substring that
matched the pattern and an association list of submatch keys and
corresponding match records for any named submatches that also matched.

@code{Match-start} returns the index of the first character in the
matching substring, and @code{match-end} gives the index of the
first character after the matching substring. @code{Match-submatches}
returns the alist of submatches.
@end deffn
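
Because @code{match-start} and @code{match-end} delimit a half-open
range, they can be passed directly to @code{substring}. The following
sketch extracts the matched text with the direct interface. Passing
the @code{submatches} option to @code{make-regexp} is an assumption
about what is needed for match locations to be reported; the names
@code{r} and @code{s} are hypothetical.

@lisp
(define r
  (make-regexp "b+"
               (regexp-option extended)
               (regexp-option submatches)))   ; assumed to be required

(define s "aabbbcc")

;; With submatches? true, a successful match yields a list whose
;; first element is the match record for the whole pattern.
(let ((ms (regexp-match r s 0 #t #t #t)))
  (and ms
       (let ((m (car ms)))
         (substring s (match-start m) (match-end m)))))
@end lisp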

@subsection High-level regular expression construction
@stindex regexp

This section describes a functional interface for building regular
expressions and matching them against strings, higher-level than the
direct POSIX interface. The matching is done using the POSIX regular
expression package. Regular expressions constructed by procedures
listed here are compatible with those in the previous section; that is,
they satisfy the predicate @code{regexp?} from the @code{posix-regexps}
structure. These names are exported by the structure @code{regexps}.

@subsubsection Character sets

Character sets may be defined using a list of characters and strings,
using a range or ranges of characters, or by using set operations on
existing character sets.

@deffn procedure set char-or-string @dots{} @returns{} char-set-regexp
@deffnx procedure range low-char high-char @returns{} char-set-regexp
@deffnx procedure ranges low-char high-char @dots{} @returns{} char-set-regexp
@deffnx procedure ascii-range low-char high-char @returns{} char-set-regexp
@deffnx procedure ascii-ranges low-char high-char @dots{} @returns{} char-set-regexp
@code{Set} returns a character set that contains all of the character
arguments and all of the characters in all of the string arguments.
@code{Range} returns a character set that contains all characters
between @var{low-char} and @var{high-char}, inclusive. @code{Ranges}
returns a set that contains all of the characters in the given set of
ranges. @code{Range} & @code{ranges} use the ordering imposed by
@code{char->integer}. @code{Ascii-range} & @code{ascii-ranges} are
like @code{range} & @code{ranges}, but they use the ASCII ordering.
@code{Ranges} & @code{ascii-ranges} must be given an even number of
arguments. It is an error for a @var{high-char} to be less than the
preceding @var{low-char} in the appropriate ordering.
@end deffn
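
The constructors above might be used as in the following sketch, which
assumes the @code{regexps} structure is open. The names
@code{vowel}, @code{digit}, and @code{hex-digit} are hypothetical, and
the matching procedures used to test the sets are the ones described
under ``Submatches and matching''.

@lisp
(define vowel     (set "aeiou" #\y))               ; strings & characters mix
(define digit     (range #\0 #\9))
(define hex-digit (ranges #\0 #\9 #\a #\f #\A #\F)) ; three low/high pairs

(any-match? vowel "rhythm")  @result{} #t   ; contains y
(any-match? digit "abc")     @result{} #f
(exact-match? hex-digit "F") @result{} #t
(exact-match? hex-digit "g") @result{} #f
@end lisp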

@deffn procedure negate char-set @returns{} char-set-regexp
@deffnx procedure union char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
@deffnx procedure intersection char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
@deffnx procedure subtract char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
Set operations on character sets. @code{Negate} returns a character
set of all characters that are not in @var{char-set}. @code{Union}
returns a character set that contains all of the characters in
@var{char-set@suba{a}} and all of the characters in
@var{char-set@suba{b}}. @code{Intersection} returns a character set of
all of the characters that are in both @var{char-set@suba{a}} and
@var{char-set@suba{b}}. @code{Subtract} returns a character set of all
the characters in @var{char-set@suba{a}} that are not also in
@var{char-set@suba{b}}.
@end deffn

@defvr {character set} lower-case = @code{(set "abcdefghijklmnopqrstuvwxyz")}
@defvrx {character set} upper-case = @code{(set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")}
@defvrx {character set} alphabetic = @code{(union lower-case upper-case)}
@defvrx {character set} numeric = @code{(set "0123456789")}
@defvrx {character set} alphanumeric = @code{(union alphabetic numeric)}
@defvrx {character set} punctuation = @code{(set "!\"#$%&'()*+,-./:;<=>?@@[\\]^_`@{|@}~")}
@defvrx {character set} graphic = @code{(union alphanumeric punctuation)}
@defvrx {character set} printing = @code{(union graphic (set #\space))}
@defvrx {character set} control = @code{(negate printing)}
@defvrx {character set} blank = @code{(set #\space (ascii->char 9)) ; ASCII 9 = TAB}
@defvrx {character set} whitespace = @code{(union (set #\space) (ascii-range 9 13))}
@defvrx {character set} hexdigit = @code{(set "0123456789ABCDEF")}
Predefined character sets.
@end defvr
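
The set operations combine naturally with the predefined sets. The
following is a small sketch, assuming the @code{regexps} structure is
open; @code{consonant} and @code{identifier-char} are hypothetical
names introduced only for illustration.

@lisp
(define consonant       (subtract lower-case (set "aeiou")))
(define identifier-char (union alphanumeric (set #\_)))

(any-match? consonant "aeiou")     @result{} #f
(any-match? consonant "audio")     @result{} #t   ; d is a consonant
(exact-match? identifier-char "_") @result{} #t

;; The letters A-F are the only upper-case hexadecimal digits.
(any-match? (intersection hexdigit upper-case) "0123") @result{} #f
@end lisp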

@subsubsection Anchoring

@deffn procedure string-start @returns{} regexp
@deffnx procedure string-end @returns{} regexp
@code{String-start} returns a regular expression that matches the
beginning of the string being matched against; @code{string-end}
returns one that matches the end.
@end deffn

@subsubsection Composite expressions

@deffn procedure sequence regexp @dots{} @returns{} regexp
@deffnx procedure one-of regexp @dots{} @returns{} regexp
@code{Sequence} returns a regular expression that matches the
concatenation of all of its arguments; @code{one-of} returns a regular
expression that matches any one of its arguments.
@end deffn

@deffn procedure text string @returns{} regexp
Returns a regular expression that matches exactly the characters in
@var{string}, in order.
@end deffn
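
As a sketch of how anchors and composite expressions fit together, the
following pattern matches a string only if it begins with one of two
fixed words. It assumes the @code{regexps} structure is open; the name
@code{greeting} and the sample strings are made up.

@lisp
(define greeting
  (sequence (string-start)                 ; anchor at the beginning
            (one-of (text "hello")
                    (text "goodbye"))
            (text ", world")))

(any-match? greeting "hello, world!")    @result{} #t
(any-match? greeting "goodbye, world")   @result{} #t
(any-match? greeting "oh, hello, world") @result{} #f   ; not at the start
@end lisp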

@deffn procedure repeat regexp @returns{} regexp
@deffnx procedure repeat count regexp @returns{} regexp
@deffnx procedure repeat min max regexp @returns{} regexp
@code{Repeat} returns a regular expression that matches repeated
occurrences of its @var{regexp} argument. With only one argument, the
result will match @var{regexp} any number of times, including zero.
With two arguments, @ie{} one @var{count} argument, the returned regular
expression will match @var{regexp} exactly that number of times. The
final case will match from @var{min} to @var{max} repetitions,
inclusive. @var{Max} may be @code{#f}, in which case there is no
maximum number of matches. @var{Count} & @var{min} must be exact,
non-negative integers; @var{max} should be either @code{#f} or an
exact, non-negative integer.
@end deffn
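
The three calling conventions can be compared side by side. This is a
sketch only, assuming the @code{regexps} structure is open; the name
@code{digit} is hypothetical.

@lisp
(define digit (range #\0 #\9))

(exact-match? (repeat digit) "2024")      @result{} #t   ; any number of times
(exact-match? (repeat 4 digit) "2024")    @result{} #t   ; exactly four
(exact-match? (repeat 4 digit) "202")     @result{} #f
(exact-match? (repeat 2 3 digit) "2024")  @result{} #f   ; at most three
(exact-match? (repeat 2 #f digit) "2024") @result{} #t   ; two or more
@end lisp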

@subsubsection Case sensitivity

Regular expressions are normally case-sensitive, but case sensitivity
can be manipulated simply.

@deffn procedure ignore-case regexp @returns{} regexp
@deffnx procedure use-case regexp @returns{} regexp
The regular expression returned by @code{ignore-case} is identical to
its argument except that the case will be ignored when matching. The
value returned by @code{use-case} is protected from future applications
of @code{ignore-case}. The expressions returned by @code{use-case} and
@code{ignore-case} are unaffected by any enclosing uses of these
procedures.

By way of example, the following matches @code{"ab"}, but not
@code{"aB"}, @code{"Ab"}, or @code{"AB"}:

@lisp
(text "ab")
@end lisp

@noindent
while

@lisp
(ignore-case (text "ab"))
@end lisp

@noindent
matches all of those, and

@lisp
(ignore-case (sequence (text "a")
                       (use-case (text "b"))))
@end lisp

@noindent
matches @code{"ab"} or @code{"Ab"}, but not @code{"aB"} or @code{"AB"}.
@end deffn

@subsubsection Submatches and matching

A subexpression within a larger expression can be marked as a submatch.
When an expression is matched against a string, the success or failure
of each submatch within that expression is reported, as well as the
location of the substring matched by each successful submatch.

@deffn procedure submatch key regexp @returns{} regexp
@deffnx procedure no-submatches regexp @returns{} regexp
@code{Submatch} returns a regular expression that is equivalent to
@var{regexp} in every way except that the regular expression returned by
@code{submatch} will produce a submatch record in the output for the
part of the string matched by @var{regexp}. @code{No-submatches}
returns a regular expression that is equivalent to @var{regexp} in every
respect except that all submatches generated by @var{regexp} will be
ignored & removed from the output.
@end deffn

@deffn procedure any-match? regexp string @returns{} boolean
@deffnx procedure exact-match? regexp string @returns{} boolean
@deffnx procedure match regexp string @returns{} match or @code{#f}
@code{Any-match?} returns @code{#t} if @var{string} matches
@var{regexp} or contains a substring that does, or @code{#f}
otherwise. @code{Exact-match?} returns @code{#t} if @var{string}
matches @var{regexp} exactly, or @code{#f} if it does not.

@code{Match} returns @code{#f} if @var{string} does not match
@var{regexp}, or a match record if it does, as described in the
previous section. Matching occurs according to POSIX. The match
returned is the one with the lowest starting index in @var{string}. If
there is more than one such match, the longest is returned. Within
that match, the longest possible submatches are returned.

All three matching procedures cache a compiled version of @var{regexp}.
Subsequent calls with the same input regular expression will be more
efficient.
@end deffn

Here are some examples of the high-level regular expression interface:

@lisp
(define pattern (text "abc"))

(any-match? pattern "abc")       @result{} #t
(any-match? pattern "abx")       @result{} #f
(any-match? pattern "xxabcxx")   @result{} #t

(exact-match? pattern "abc")     @result{} #t
(exact-match? pattern "abx")     @result{} #f
(exact-match? pattern "xxabcxx") @result{} #f

(let ((m (match (sequence (text "ab")
                          (submatch 'foo (text "cd"))
                          (text "ef"))
                "xxxabcdefxx")))
  (list m (match-submatches m)))
    @result{} (#@{Match 3 9@} ((foo . #@{Match 5 7@})))

(match-submatches
 (match (sequence (set "a")
                  (one-of (submatch 'foo (text "bc"))
                          (submatch 'bar (text "BC"))))
        "xxxaBCd"))
    @result{} ((bar . #@{Match 4 6@}))
@end lisp
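
Since @code{match-submatches} returns an association list keyed by the
submatch keys, the text of a named submatch can be recovered with
@code{assq} and @code{substring}. The following is only a sketch
under the assumption that the @code{regexps} structure is open;
@code{date-pattern} and the helper @code{submatch-text} are
hypothetical names, not part of the interface.

@lisp
(define date-pattern
  (sequence (submatch 'year (repeat 4 numeric))
            (text "-")
            (submatch 'month (repeat 2 numeric))))

;; Hypothetical helper: the text matched by submatch KEY, or #f if
;; STRING does not match or that submatch did not take part.
(define (submatch-text regexp key string)
  (let ((m (match regexp string)))
    (and m
         (let ((entry (assq key (match-submatches m))))
           (and entry
                (substring string
                           (match-start (cdr entry))
                           (match-end (cdr entry))))))))

(submatch-text date-pattern 'year "on 2024-05")  @result{} "2024"
(submatch-text date-pattern 'month "on 2024-05") @result{} "05"
(submatch-text date-pattern 'year "no date")     @result{} #f
@end lisp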