- @node POSIX regular expressions
- @section Regular expressions
- @cindex string matching
- The procedures in this section provide access to POSIX regular
- expression matching. The regular expression syntax and semantics are
- too complex to describe here; consult the POSIX specification for
- details.
- @strong{Note:} Because the C interface uses ASCII @code{NUL} bytes to
- mark the ends of strings, patterns & strings that contain @code{NUL}
- characters will not work correctly.
- @subsection Direct POSIX regular expression interface
- @stindex posix-regexps
- The first interface to regular expressions is a thin layer over the
- interface that POSIX provides. It is exported by the structures
- @code{posix-regexps} & @code{posix}.
- @deffn procedure make-regexp string option @dots{} @returns{} regexp
- @deffnx procedure regexp? object @returns{} boolean
- @code{Make-regexp} creates a regular expression with the given string
- pattern. The arguments after @var{string} specify various options for
- the regular expression; see @code{regexp-option} below. The regular
- expression is not compiled until it is matched against a string, so any
- errors in the pattern string will not be reported until that point.
- @code{Regexp?} is the disjoint type predicate for regular expression
- objects.
- @end deffn
- @deffn syntax regexp-option name @returns{} regexp-option
- Evaluates to a regular expression option, suitable to be passed to
- @code{make-regexp}, with the given name. The possible option names
- are:
- @table @code
- @item extended
- use the extended patterns
- @item ignore-case
- ignore case differences when matching
- @item submatches
- report submatches
- @item newline
- treat newlines specially
- @end table
- @end deffn
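- As a minimal sketch of how options are passed, the following creates a
- regular expression that uses the extended syntax and reports
- submatches. The pattern string and the name @code{date-regexp} are
- illustrative assumptions, not part of the interface:
- @lisp
- ;; Hypothetical pattern: two groups of digits separated by a hyphen.
- (define date-regexp
-   (make-regexp "([0-9]+)-([0-9]+)"
-                (regexp-option extended)
-                (regexp-option submatches)))
- (regexp? date-regexp)                       @result{} #t@end lisp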
- @deffn procedure regexp-match regexp string start submatches? starts-line? ends-line? @returns{} boolean or list of matches
- @code{Regexp-match} matches @var{regexp} against the characters in
- @var{string}, starting at position @var{start}. If the string does not
- match the regular expression, @code{regexp-match} returns @code{#f}.
- If the string does match, then a list of match records is returned if
- @var{submatches?} is true or @code{#t} if @var{submatches?} is false.
- The first match record gives the location of the substring that matched
- @var{regexp}. If the pattern in @var{regexp} contained submatches,
- then the submatches are returned in order, with match records in the
- positions where submatches succeeded and @code{#f} in the positions
- where submatches failed.
- @var{Starts-line?} should be true if @var{string} starts at the
- beginning of a line, and @var{ends-line?} should be true if it ends
- one.
- @end deffn
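- For illustration, a call that only asks whether a match exists might
- look as follows. The pattern and the input strings are made up for the
- example, and each input is treated as one complete line:
- @lisp
- (define b-run (make-regexp "b+" (regexp-option extended)))
- ;; start = 0, submatches? = #f, starts-line? = #t, ends-line? = #t
- (regexp-match b-run "aabbbcc" 0 #f #t #t)   @result{} #t
- (regexp-match b-run "aacc" 0 #f #t #t)      @result{} #f@end lisp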
- @deffn procedure match? object @returns{} boolean
- @deffnx procedure match-start match @returns{} integer
- @deffnx procedure match-end match @returns{} integer
- @deffnx procedure match-submatches match @returns{} alist
- @code{Match?} is the disjoint type predicate for match records. Match
- records contain three values: the beginning & end of the substring that
- matched the pattern and an association list of submatch keys and
- corresponding match records for any named submatches that also matched.
- @code{Match-start} returns the index of the first character in the
- matching substring, and @code{match-end} gives the index of the
- first character after the matching substring. @code{Match-submatches}
- returns the alist of submatches.
- @end deffn
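- The accessors can be combined with @code{substring} to recover the
- matched text. This is a sketch that assumes the regular expression was
- created with the @code{submatches} option so that match records are
- reported; @code{first-match-text} is a hypothetical helper, not part
- of either structure:
- @lisp
- ;; Returns the text of the first match of REGEXP in STRING, or #f.
- (define (first-match-text regexp string)
-   (let ((matches (regexp-match regexp string 0 #t #t #t)))
-     (and matches
-          (let ((m (car matches)))      ; first record = whole match
-            (substring string (match-start m) (match-end m))))))
- (first-match-text
-  (make-regexp "b+" (regexp-option extended) (regexp-option submatches))
-  "aabbbcc")@end lisp
- @noindent
- Under the semantics described above, the final expression would be
- expected to return @code{"bbb"}.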
- @subsection High-level regular expression construction
- @stindex regexp
- This section describes a functional interface for building regular
- expressions and matching them against strings, higher-level than the
- direct POSIX interface. The matching is done using the POSIX regular
- expression package. Regular expressions constructed by procedures
- listed here are compatible with those in the previous section; that is,
- they satisfy the predicate @code{regexp?} from the @code{posix-regexps}
- structure. These names are exported by the structure @code{regexps}.
- @subsubsection Character sets
- Character sets may be defined using a list of characters and strings,
- using a range or ranges of characters, or by using set operations on
- existing character sets.
- @deffn procedure set char-or-string @dots{} @returns{} char-set-regexp
- @deffnx procedure range low-char high-char @returns{} char-set-regexp
- @deffnx procedure ranges low-char high-char @dots{} @returns{} char-set-regexp
- @deffnx procedure ascii-range low-char high-char @returns{} char-set-regexp
- @deffnx procedure ascii-ranges low-char high-char @dots{} @returns{} char-set-regexp
- @code{Set} returns a character set that contains all of the character
- arguments and all of the characters in all of the string arguments.
- @code{Range} returns a character set that contains all characters
- between @var{low-char} and @var{high-char}, inclusive. @code{Ranges}
- returns a set that contains all of the characters in the given set of
- ranges. @code{Range} & @code{ranges} use the ordering imposed by
- @code{char->integer}. @code{Ascii-range} & @code{ascii-ranges} are
- like @code{range} & @code{ranges}, but they use the ASCII ordering.
- @code{Ranges} & @code{ascii-ranges} must be given an even number of
- arguments. It is an error for a @var{high-char} to be less than the
- preceding @var{low-char} in the appropriate ordering.
- @end deffn
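- A few hedged examples of these constructors, checked here with
- @code{exact-match?} (described later in this section). The particular
- characters and the variable names are arbitrary:
- @lisp
- (define vowels (set #\a "eiou"))        ; characters and strings may be mixed
- (define hex-letters (range #\a #\f))    ; a through f, inclusive
- (define word-chars (ranges #\a #\z #\A #\Z #\0 #\9))
- (exact-match? vowels "e")                   @result{} #t
- (exact-match? hex-letters "g")              @result{} #f
- (exact-match? word-chars "Q")               @result{} #t@end lisp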
- @deffn procedure negate char-set @returns{} char-set-regexp
- @deffnx procedure union char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
- @deffnx procedure intersection char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
- @deffnx procedure subtract char-set@suba{a} char-set@suba{b} @returns{} char-set-regexp
- Set operations on character sets. @code{Negate} returns a character
- set of all characters that are not in @var{char-set}. @code{Union}
- returns a character set that contains all of the characters in
- @var{char-set@suba{a}} and all of the characters in
- @var{char-set@suba{b}}. @code{Intersection} returns a character set of
- all of the characters that are in both @var{char-set@suba{a}} and
- @var{char-set@suba{b}}. @code{Subtract} returns a character set of all
- the characters in @var{char-set@suba{a}} that are not also in
- @var{char-set@suba{b}}.
- @end deffn
- @defvr {character set} lower-case = @code{(set "abcdefghijklmnopqrstuvwxyz")}
- @defvrx {character set} upper-case = @code{(set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")}
- @defvrx {character set} alphabetic = @code{(union lower-case upper-case)}
- @defvrx {character set} numeric = @code{(set "0123456789")}
- @defvrx {character set} alphanumeric = @code{(union alphabetic numeric)}
- @defvrx {character set} punctuation = @code{(set "!\"#$%&'()*+,-./:;<=>?@@[\\]^_`@{|@}~")}
- @defvrx {character set} graphic = @code{(union alphanumeric punctuation)}
- @defvrx {character set} printing = @code{(union graphic (set #\space))}
- @defvrx {character set} control = @code{(negate printing)}
- @defvrx {character set} blank = @code{(set #\space (ascii->char 9)) ; ASCII 9 = TAB}
- @defvrx {character set} whitespace = @code{(union (set #\space) (ascii-range 9 13))}
- @defvrx {character set} hexdigit = @code{(set "0123456789ABCDEF")}
- Predefined character sets.
- @end defvr
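- The set operations compose with the predefined character sets. As a
- sketch (the names @code{consonants} and @code{non-digit} are
- illustrative only, and @code{exact-match?} is described later in this
- section):
- @lisp
- (define consonants (subtract alphabetic (set "aeiouAEIOU")))
- (define non-digit (negate numeric))
- (exact-match? consonants "k")               @result{} #t
- (exact-match? consonants "e")               @result{} #f
- (exact-match? non-digit "7")                @result{} #f
- (exact-match? (intersection alphanumeric numeric) "7")
-     @result{} #t@end lisp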
- @subsubsection Anchoring
- @deffn procedure string-start @returns{} regexp
- @deffnx procedure string-end @returns{} regexp
- @code{String-start} returns a regular expression that matches the
- beginning of the string being matched against; @code{string-end}
- returns one that matches the end.
- @end deffn
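- For example, a pattern can be anchored to the start of the string.
- This sketch uses @code{sequence}, @code{text}, and @code{any-match?},
- which are described later in this section; the name
- @code{starts-with-abc} is illustrative:
- @lisp
- (define starts-with-abc (sequence (string-start) (text "abc")))
- (any-match? starts-with-abc "abcdef")       @result{} #t
- (any-match? starts-with-abc "xxabc")        @result{} #f@end lisp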
- @subsubsection Composite expressions
- @deffn procedure sequence regexp @dots{} @returns{} regexp
- @deffnx procedure one-of regexp @dots{} @returns{} regexp
- @code{Sequence} returns a regular expression that matches the
- concatenation of all of its arguments; @code{one-of} returns a regular
- expression that matches any one of its arguments.
- @end deffn
- @deffn procedure text string @returns{} regexp
- Returns a regular expression that matches exactly the characters in
- @var{string}, in order.
- @end deffn
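- As a short sketch of how these constructors combine (the name
- @code{greeting} and the strings are illustrative; @code{exact-match?}
- is described later in this section):
- @lisp
- (define greeting
-   (sequence (text "hello, ")
-             (one-of (text "world") (text "there"))))
- (exact-match? greeting "hello, world")      @result{} #t
- (exact-match? greeting "hello, there")      @result{} #t
- (exact-match? greeting "hello, ")           @result{} #f@end lisp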
- @deffn procedure repeat regexp @returns{} regexp
- @deffnx procedure repeat count regexp @returns{} regexp
- @deffnx procedure repeat min max regexp @returns{} regexp
- @code{Repeat} returns a regular expression that matches repeated
- occurrences of its @var{regexp} argument. With only one argument, the
- result will match @var{regexp} any number of times, including zero. With two
- arguments, @ie{} one @var{count} argument, the returned regular
- expression will match @var{regexp} exactly that number of times. The
- final case will match from @var{min} to @var{max} repetitions,
- inclusive. @var{Max} may be @code{#f}, in which case there is no
- maximum number of matches. @var{Count} & @var{min} must be exact,
- non-negative integers; @var{max} should be either @code{#f} or an
- exact, non-negative integer.
- @end deffn
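- The three argument conventions can be sketched as follows, assuming
- the behaviour described above (the name @code{ab} is illustrative, and
- @code{exact-match?} is described later in this section):
- @lisp
- (define ab (text "ab"))
- (exact-match? (repeat ab) "ababab")         @result{} #t   ; any number
- (exact-match? (repeat 2 ab) "abab")         @result{} #t   ; exactly two
- (exact-match? (repeat 2 ab) "ababab")       @result{} #f
- (exact-match? (repeat 1 3 ab) "ababab")     @result{} #t   ; one to three
- (exact-match? (repeat 1 #f ab) "abababab")  @result{} #t   ; one or more@end lisp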
- @subsubsection Case sensitivity
- Regular expressions are normally case-sensitive, but case sensitivity
- can be manipulated simply.
- @deffn procedure ignore-case regexp @returns{} regexp
- @deffnx procedure use-case regexp @returns{} regexp
- The regular expression returned by @code{ignore-case} is identical to
- its argument except that the case will be ignored when matching. The
- value returned by @code{use-case} is protected from future applications
- of @code{ignore-case}. The expressions returned by @code{use-case} and
- @code{ignore-case} are unaffected by any enclosing uses of these
- procedures.
- By way of example, the following matches @code{"ab"}, but not
- @code{"aB"}, @code{"Ab"}, or @code{"AB"}:
- @lisp
- (text "ab")@end lisp
- @noindent
- while
- @lisp
- (ignore-case (text "ab"))@end lisp
- @noindent
- matches all of those, and
- @lisp
- (ignore-case (sequence (text "a")
-                        (use-case (text "b"))))@end lisp
- @noindent
- matches @code{"ab"} or @code{"Ab"}, but not @code{"aB"} or @code{"AB"}.
- @end deffn
- @subsubsection Submatches and matching
- A subexpression within a larger expression can be marked as a submatch.
- When an expression is matched against a string, the success or failure
- of each submatch within that expression is reported, as well as the
- location of the substring matched by each successful submatch.
- @deffn procedure submatch key regexp @returns{} regexp
- @deffnx procedure no-submatches regexp @returns{} regexp
- @code{Submatch} returns a regular expression that is equivalent to
- @var{regexp} in every way except that the regular expression returned by
- @code{submatch} will produce a submatch record in the output for the
- part of the string matched by @var{regexp}. @code{No-submatches}
- returns a regular expression that is equivalent to @var{regexp} in every
- respect except that all submatches generated by @var{regexp} will be
- ignored & removed from the output.
- @end deffn
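- As a hedged sketch, the following marks two keyed submatches and then
- suppresses them with @code{no-submatches}. The name @code{keyed}, the
- keys, and the input string are illustrative; @code{match} is described
- below:
- @lisp
- (define keyed
-   (sequence (submatch 'user (repeat 1 #f alphabetic))
-             (text "@@")
-             (submatch 'host (repeat 1 #f alphabetic))))
- (match-submatches (match keyed "bob@@example"))
- (match-submatches (match (no-submatches keyed) "bob@@example"))@end lisp
- @noindent
- Under the description above, the first call would be expected to
- return an alist with entries for @code{user} and @code{host}, and the
- second an empty alist.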
- @deffn procedure any-match? regexp string @returns{} boolean
- @deffnx procedure exact-match? regexp string @returns{} boolean
- @deffnx procedure match regexp string @returns{} match or @code{#f}
- @code{Any-match?} returns @code{#t} if @var{string} matches
- @var{regexp} or contains a substring that does, or @code{#f}
- otherwise. @code{Exact-match?} returns @code{#t} if @var{string}
- matches @var{regexp} exactly, or @code{#f} if it does not.
- @code{Match} returns @code{#f} if @var{string} does not match
- @var{regexp}, or a match record if it does, as described in the
- previous section. Matching occurs according to POSIX. The match
- returned is the one with the lowest starting index in @var{string}. If
- there is more than one such match, the longest is returned. Within
- that match, the longest possible submatches are returned.
- All three matching procedures cache a compiled version of @var{regexp}.
- Subsequent calls with the same input regular expression will be more
- efficient.
- @end deffn
- Here are some examples of the high-level regular expression interface:
- @lisp
- (define pattern (text "abc"))
- (any-match? pattern "abc") @result{} #t
- (any-match? pattern "abx") @result{} #f
- (any-match? pattern "xxabcxx") @result{} #t
- (exact-match? pattern "abc") @result{} #t
- (exact-match? pattern "abx") @result{} #f
- (exact-match? pattern "xxabcxx") @result{} #f
- (let ((m (match (sequence (text "ab")
-                           (submatch 'foo (text "cd"))
-                           (text "ef"))
-                 "xxxabcdefxxx")))
-   (list m (match-submatches m)))
-     @result{} (#@{Match 3 9@} ((foo . #@{Match 5 7@})))
- (match-submatches
-  (match (sequence (set "a")
-                   (one-of (submatch 'foo (text "bc"))
-                           (submatch 'bar (text "BC"))))
-         "xxxaBCd"))
-     @result{} ((bar . #@{Match 4 6@}))@end lisp
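- The rule that the leftmost match is chosen, and the longest among
- those, can be illustrated with a sketch like the following (the
- pattern and input are made up for the example):
- @lisp
- ;; Both alternatives can match starting at index 1; the longer one wins.
- (let ((m (match (one-of (text "ab") (text "abcd")) "xabcdx")))
-   (list (match-start m) (match-end m)))@end lisp
- @noindent
- Under POSIX matching as described above, this would be expected to
- return @code{(1 5)}.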