123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804805806807808809810811812813814815816817818819820821822823824825826827828829830831832833834835836837838839840841842843844845846847848849850851852853854855856857858859860861862863864865866867868869870871872873874875876877878879880881882883884885886887888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951952953954955956957958959960961962963964965966967968969970971972973974975976977978979980981982983984985986987988989990991992993994995996997998999100010011002100310041005100610071008100910101011101210131014101510161017101810191020102110221023102410251026102710281029103010311032103310341035103610371038 |
- @c -*-texinfo-*-
- @c This is part of the GNU Guile Reference Manual.
- @c Copyright (C) 2006, 2010, 2011
- @c Free Software Foundation, Inc.
- @c See the file guile.texi for copying conditions.
- @node PEG Parsing
- @section PEG Parsing
- Parsing Expression Grammars (PEGs) are a way of specifying formal
- languages for text processing. They can be used either for matching
- (like regular expressions) or for building recursive descent parsers
- (like lex/yacc). Guile uses a superset of PEG syntax that allows more
- control over what information is preserved during parsing.
- Wikipedia has a clear and concise introduction to PEGs if you want to
- familiarize yourself with the syntax:
- @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
- The @code{(ice-9 peg)} module works by compiling PEGs down to lambda
- expressions. These can either be stored in variables at compile-time by
- the define macros (@code{define-peg-pattern} and
- @code{define-peg-string-patterns}) or calculated explicitly at runtime
- with the compile functions (@code{compile-peg-pattern} and
- @code{peg-string-compile}).
- They can then be used for either parsing (@code{match-pattern}) or searching
- (@code{search-for-pattern}). For convenience, @code{search-for-pattern}
- also takes pattern literals in case you want to inline a simple search
- (people often use regular expressions this way).
- The rest of this documentation consists of a syntax reference, an API
- reference, and a tutorial.
- @menu
- * PEG Syntax Reference::
- * PEG API Reference::
- * PEG Tutorial::
- * PEG Internals::
- @end menu
- @node PEG Syntax Reference
- @subsection PEG Syntax Reference
- @subsubheading Normal PEG Syntax:
- @deftp {PEG Pattern} sequence a b
- Parses @var{a}. If this succeeds, continues to parse @var{b} from the
- end of the text parsed as @var{a}. Succeeds if both @var{a} and
- @var{b} succeed.
- @code{"a b"}
- @code{(and a b)}
- @end deftp
- @deftp {PEG Pattern} {ordered choice} a b
- Parses @var{a}. If this fails, backtracks and parses @var{b}.
- Succeeds if either @var{a} or @var{b} succeeds.
- @code{"a/b"}
- @code{(or a b)}
- @end deftp
- @deftp {PEG Pattern} {zero or more} a
- Parses @var{a} as many times in a row as it can, starting each @var{a}
- at the end of the text parsed by the previous @var{a}. Always
- succeeds.
- @code{"a*"}
- @code{(* a)}
- @end deftp
- @deftp {PEG Pattern} {one or more} a
- Parses @var{a} as many times in a row as it can, starting each @var{a}
- at the end of the text parsed by the previous @var{a}. Succeeds if at
- least one @var{a} was parsed.
- @code{"a+"}
- @code{(+ a)}
- @end deftp
- @deftp {PEG Pattern} optional a
- Tries to parse @var{a}. Succeeds if @var{a} succeeds.
- @code{"a?"}
- @code{(? a)}
- @end deftp
- @deftp {PEG Pattern} {followed by} a
- Makes sure it is possible to parse @var{a}, but does not actually parse
- it. Succeeds if @var{a} would succeed.
- @code{"&a"}
- @code{(followed-by a)}
- @end deftp
- @deftp {PEG Pattern} {not followed by} a
- Makes sure it is impossible to parse @var{a}, but does not actually
- parse it. Succeeds if @var{a} would fail.
- @code{"!a"}
- @code{(not-followed-by a)}
- @end deftp
- @deftp {PEG Pattern} {string literal} ``abc''
- Parses the string @var{"abc"}. Succeeds if that parsing succeeds.
- @code{"'abc'"}
- @code{"abc"}
- @end deftp
- @deftp {PEG Pattern} {any character}
- Parses any single character. Succeeds unless there is no more text to
- be parsed.
- @code{"."}
- @code{peg-any}
- @end deftp
- @deftp {PEG Pattern} {character class} a b
- Alternative syntax for ``Ordered Choice @var{a} @var{b}'' if @var{a} and
- @var{b} are characters.
- @code{"[ab]"}
- @code{(or "a" "b")}
- @end deftp
- @deftp {PEG Pattern} {range of characters} a z
- Parses any character falling between @var{a} and @var{z}.
- @code{"[a-z]"}
- @code{(range #\a #\z)}
- @end deftp
- Example:
- @example
- "(a !b / c &d*) 'e'+"
- @end example
- Would be:
- @lisp
- (and
- (or
- (and a (not-followed-by b))
- (and c (followed-by (* d))))
- (+ "e"))
- @end lisp
- @subsubheading Extended Syntax
- There is some extra syntax for S-expressions.
- @deftp {PEG Pattern} ignore a
- Ignore the text matching @var{a}
- @end deftp
- @deftp {PEG Pattern} capture a
- Capture the text matching @var{a}.
- @end deftp
- @deftp {PEG Pattern} peg a
- Embed the PEG pattern @var{a} using string syntax.
- @end deftp
- Example:
- @example
- "!a / 'b'"
- @end example
- Is equivalent to
- @lisp
- (or (peg "!a") "b")
- @end lisp
- and
- @lisp
- (or (not-followed-by a) "b")
- @end lisp
- @node PEG API Reference
- @subsection PEG API Reference
- @subsubheading Define Macros
- The most straightforward way to define a PEG is by using one of the
- define macros (both of these macroexpand into @code{define}
- expressions). These macros bind parsing functions to variables. These
- parsing functions may be invoked by @code{match-pattern} or
- @code{search-for-pattern}, which return a PEG match record. Raw data can be
- retrieved from this record with the PEG match deconstructor functions.
- More complicated (and perhaps enlightening) examples can be found in the
- tutorial.
- @deffn {Scheme Macro} define-peg-string-patterns peg-string
- Defines all the nonterminals in the PEG @var{peg-string}. More
- precisely, @code{define-peg-string-patterns} takes a superset of PEGs. A normal PEG
- has a @code{<-} between the nonterminal and the pattern.
- @code{define-peg-string-patterns} uses this symbol to determine what information it
- should propagate up the parse tree. The normal @code{<-} propagates the
- matched text up the parse tree, @code{<--} propagates the matched text
- up the parse tree tagged with the name of the nonterminal, and @code{<}
- discards that matched text and propagates nothing up the parse tree.
- Also, nonterminals may consist of any alphanumeric character or a ``-''
- character (in normal PEGs nonterminals can only be alphabetic).
- For example, if we:
- @lisp
- (define-peg-string-patterns
- "as <- 'a'+
- bs <- 'b'+
- as-or-bs <- as/bs")
- (define-peg-string-patterns
- "as-tag <-- 'a'+
- bs-tag <-- 'b'+
- as-or-bs-tag <-- as-tag/bs-tag")
- @end lisp
- Then:
- @lisp
- (match-pattern as-or-bs "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: aa>
- (match-pattern as-or-bs-tag "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
- @end lisp
- Note that in doing this, we have bound 6 variables at the toplevel
- (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
- @var{as-or-bs-tag}).
- @end deffn
- @deffn {Scheme Macro} define-peg-pattern name capture-type peg-sexp
- Defines a single nonterminal @var{name}. @var{capture-type} determines
- how much information is passed up the parse tree. @var{peg-sexp} is a
- PEG in S-expression form.
- Possible values for capture-type:
- @table @code
- @item all
- passes the matched text up the parse tree tagged with the name of the
- nonterminal.
- @item body
- passes the matched text up the parse tree.
- @item none
- passes nothing up the parse tree.
- @end table
- For Example, if we:
- @lisp
- (define-peg-pattern as body (+ "a"))
- (define-peg-pattern bs body (+ "b"))
- (define-peg-pattern as-or-bs body (or as bs))
- (define-peg-pattern as-tag all (+ "a"))
- (define-peg-pattern bs-tag all (+ "b"))
- (define-peg-pattern as-or-bs-tag all (or as-tag bs-tag))
- @end lisp
- Then:
- @lisp
- (match-pattern as-or-bs "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: aa>
- (match-pattern as-or-bs-tag "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
- @end lisp
- Note that in doing this, we have bound 6 variables at the toplevel
- (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
- @var{as-or-bs-tag}).
- @end deffn
- @subsubheading Compile Functions
- It is sometimes useful to be able to compile anonymous PEG patterns at
- runtime. These functions let you do that using either syntax.
- @deffn {Scheme Procedure} peg-string-compile peg-string capture-type
- Compiles the PEG pattern in @var{peg-string} propagating according to
- @var{capture-type} (capture-type can be any of the values from
- @code{define-peg-pattern}).
- @end deffn
- @deffn {Scheme Procedure} compile-peg-pattern peg-sexp capture-type
- Compiles the PEG pattern in @var{peg-sexp} propagating according to
- @var{capture-type} (capture-type can be any of the values from
- @code{define-peg-pattern}).
- @end deffn
- The functions return syntax objects, which can be useful if you want to
- use them in macros. If all you want is to define a new nonterminal, you
- can do the following:
- @lisp
- (define exp '(+ "a"))
- (define as (compile (compile-peg-pattern exp 'body)))
- @end lisp
- You can use this nonterminal with all of the regular PEG functions:
- @lisp
- (match-pattern as "aaaaa") @result{}
- #<peg start: 0 end: 5 string: bbbbb tree: bbbbb>
- @end lisp
- @subsubheading Parsing & Matching Functions
- For our purposes, ``parsing'' means parsing a string into a tree
- starting from the first character, while ``matching'' means searching
- through the string for a substring. In practice, the only difference
- between the two functions is that @code{match-pattern} gives up if it can't
- find a valid substring starting at index 0 and @code{search-for-pattern} keeps
- looking. They are both equally capable of ``parsing'' and ``matching''
- given those constraints.
- @deffn {Scheme Procedure} match-pattern nonterm string
- Parses @var{string} using the PEG stored in @var{nonterm}. If no match
- was found, @code{match-pattern} returns false. If a match was found, a PEG
- match record is returned.
- The @code{capture-type} argument to @code{define-peg-pattern} allows you to
- choose what information to hold on to while parsing. The options are:
- @table @code
- @item all
- tag the matched text with the nonterminal
- @item body
- just the matched text
- @item none
- nothing
- @end table
- @lisp
- (define-peg-pattern as all (+ "a"))
- (match-pattern as "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
- (define-peg-pattern as body (+ "a"))
- (match-pattern as "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: aa>
- (define-peg-pattern as none (+ "a"))
- (match-pattern as "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: ()>
- (define-peg-pattern bs body (+ "b"))
- (match-pattern bs "aabbcc") @result{}
- #f
- @end lisp
- @end deffn
- @deffn {Scheme Macro} search-for-pattern nonterm-or-peg string
- Searches through @var{string} looking for a matching subexpression.
- @var{nonterm-or-peg} can either be a nonterminal or a literal PEG
- pattern. When a literal PEG pattern is provided, @code{search-for-pattern} works
- very similarly to the regular expression searches many hackers are used
- to. If no match was found, @code{search-for-pattern} returns false. If a match
- was found, a PEG match record is returned.
- @lisp
- (define-peg-pattern as body (+ "a"))
- (search-for-pattern as "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: aa>
- (search-for-pattern (+ "a") "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: aa>
- (search-for-pattern "'a'+" "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: aa>
- (define-peg-pattern as all (+ "a"))
- (search-for-pattern as "aabbcc") @result{}
- #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
- (define-peg-pattern bs body (+ "b"))
- (search-for-pattern bs "aabbcc") @result{}
- #<peg start: 2 end: 4 string: aabbcc tree: bb>
- (search-for-pattern (+ "b") "aabbcc") @result{}
- #<peg start: 2 end: 4 string: aabbcc tree: bb>
- (search-for-pattern "'b'+" "aabbcc") @result{}
- #<peg start: 2 end: 4 string: aabbcc tree: bb>
- (define-peg-pattern zs body (+ "z"))
- (search-for-pattern zs "aabbcc") @result{}
- #f
- (search-for-pattern (+ "z") "aabbcc") @result{}
- #f
- (search-for-pattern "'z'+" "aabbcc") @result{}
- #f
- @end lisp
- @end deffn
- @subsubheading PEG Match Records
- The @code{match-pattern} and @code{search-for-pattern} functions both return PEG
- match records. Actual information can be extracted from these with the
- following functions.
- @deffn {Scheme Procedure} peg:string match-record
- Returns the original string that was parsed in the creation of
- @code{match-record}.
- @end deffn
- @deffn {Scheme Procedure} peg:start match-record
- Returns the index of the first parsed character in the original string
- (from @code{peg:string}). If this is the same as @code{peg:end},
- nothing was parsed.
- @end deffn
- @deffn {Scheme Procedure} peg:end match-record
- Returns one more than the index of the last parsed character in the
- original string (from @code{peg:string}). If this is the same as
- @code{peg:start}, nothing was parsed.
- @end deffn
- @deffn {Scheme Procedure} peg:substring match-record
- Returns the substring parsed by @code{match-record}. This is equivalent to
- @code{(substring (peg:string match-record) (peg:start match-record) (peg:end
- match-record))}.
- @end deffn
- @deffn {Scheme Procedure} peg:tree match-record
- Returns the tree parsed by @code{match-record}.
- @end deffn
- @deffn {Scheme Procedure} peg-record? match-record
- Returns true if @code{match-record} is a PEG match record, or false
- otherwise.
- @end deffn
- Example:
- @lisp
- (define-peg-pattern bs all (peg "'b'+"))
- (search-for-pattern bs "aabbcc") @result{}
- #<peg start: 2 end: 4 string: aabbcc tree: (bs bb)>
- (let ((pm (search-for-pattern bs "aabbcc")))
- `((string ,(peg:string pm))
- (start ,(peg:start pm))
- (end ,(peg:end pm))
- (substring ,(peg:substring pm))
- (tree ,(peg:tree pm))
- (record? ,(peg-record? pm)))) @result{}
- ((string "aabbcc")
- (start 2)
- (end 4)
- (substring "bb")
- (tree (bs "bb"))
- (record? #t))
- @end lisp
- @subsubheading Miscellaneous
- @deffn {Scheme Procedure} context-flatten tst lst
- Takes a predicate @var{tst} and a list @var{lst}. Flattens @var{lst}
- until all elements are either atoms or satisfy @var{tst}. If @var{lst}
- itself satisfies @var{tst}, @code{(list lst)} is returned (this is a
- flat list whose only element satisfies @var{tst}).
- @lisp
- (context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{}
- (2 2 (1 1 (2 2)) 2 2 (1 1))
- (context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{}
- ((1 1 (1 1 (2 2)) (2 2 (1 1))))
- @end lisp
- If you're wondering why this is here, take a look at the tutorial.
- @end deffn
- @deffn {Scheme Procedure} keyword-flatten terms lst
- A less general form of @code{context-flatten}. Takes a list of terminal
- atoms @code{terms} and flattens @var{lst} until all elements are either
- atoms, or lists which have an atom from @code{terms} as their first
- element.
- @lisp
- (keyword-flatten '(a b) '(c a b (a c) (b c) (c (b a) (c a)))) @result{}
- (c a b (a c) (b c) c (b a) c a)
- @end lisp
- If you're wondering why this is here, take a look at the tutorial.
- @end deffn
- @node PEG Tutorial
- @subsection PEG Tutorial
- @subsubheading Parsing /etc/passwd
- This example will show how to parse /etc/passwd using PEGs.
- First we define an example /etc/passwd file:
- @lisp
- (define *etc-passwd*
- "root:x:0:0:root:/root:/bin/bash
- daemon:x:1:1:daemon:/usr/sbin:/bin/sh
- bin:x:2:2:bin:/bin:/bin/sh
- sys:x:3:3:sys:/dev:/bin/sh
- nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
- messagebus:x:103:107::/var/run/dbus:/bin/false
- ")
- @end lisp
- As a first pass at this, we might want to have all the entries in
- /etc/passwd in a list.
- Doing this with string-based PEG syntax would look like this:
- @lisp
- (define-peg-string-patterns
- "passwd <- entry* !.
- entry <-- (! NL .)* NL*
- NL < '\n'")
- @end lisp
- A @code{passwd} file is 0 or more entries (@code{entry*}) until the end
- of the file (@code{!.} (@code{.} is any character, so @code{!.} means
- ``not anything'')). We want to capture the data in the nonterminal
- @code{passwd}, but not tag it with the name, so we use @code{<-}.
- An entry is a series of 0 or more characters that aren't newlines
- (@code{(! NL .)*}) followed by 0 or more newlines (@code{NL*}). We want
- to tag all the entries with @code{entry}, so we use @code{<--}.
- A newline is just a literal newline (@code{'\n'}). We don't want a
- bunch of newlines cluttering up the output, so we use @code{<} to throw
- away the captured data.
- Here is the same PEG defined using S-expressions:
- @lisp
- (define-peg-pattern passwd body (and (* entry) (not-followed-by peg-any)))
- (define-peg-pattern entry all (and (* (and (not-followed-by NL) peg-any))
- (* NL)))
- (define-peg-pattern NL none "\n")
- @end lisp
- Obviously this is much more verbose. On the other hand, it's more
- explicit, and thus easier to build automatically. However, there are
- some tricks that make S-expressions easier to use in some cases. One is
- the @code{ignore} keyword; the string syntax has no way to say ``throw
- away this text'' except breaking it out into a separate nonterminal.
- For instance, to throw away the newlines we had to define @code{NL}. In
- the S-expression syntax, we could have simply written @code{(ignore
- "\n")}. Also, for the cases where string syntax is really much cleaner,
- the @code{peg} keyword can be used to embed string syntax in
- S-expression syntax. For instance, we could have written:
- @lisp
- (define-peg-pattern passwd body (peg "entry* !."))
- @end lisp
- However we define it, parsing @code{*etc-passwd*} with the @code{passwd}
- nonterminal yields the same results:
- @lisp
- (peg:tree (match-pattern passwd *etc-passwd*)) @result{}
- ((entry "root:x:0:0:root:/root:/bin/bash")
- (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
- (entry "bin:x:2:2:bin:/bin:/bin/sh")
- (entry "sys:x:3:3:sys:/dev:/bin/sh")
- (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
- (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
- @end lisp
- However, here is something to be wary of:
- @lisp
- (peg:tree (match-pattern passwd "one entry")) @result{}
- (entry "one entry")
- @end lisp
- By default, the parse trees generated by PEGs are compressed as much as
- possible without losing information. It may not look like this is what
- you want at first, but uncompressed parse trees are an enormous headache
- (there's no easy way to predict how deep particular lists will nest,
- there are empty lists littered everywhere, etc. etc.). One side-effect
- of this, however, is that sometimes the compressor is too aggressive.
- No information is discarded when @code{((entry "one entry"))} is
- compressed to @code{(entry "one entry")}, but in this particular case it
- probably isn't what we want.
- There are two functions for easily dealing with this:
- @code{keyword-flatten} and @code{context-flatten}. The
- @code{keyword-flatten} function takes a list of keywords and a list to
- flatten, then tries to coerce the list such that the first element of
- all sublists is one of the keywords. The @code{context-flatten}
- function is similar, but instead of a list of keywords it takes a
- predicate that should indicate whether a given sublist is good enough
- (refer to the API reference for more details).
- What we want here is @code{keyword-flatten}.
- @lisp
- (keyword-flatten '(entry) (peg:tree (match-pattern passwd *etc-passwd*))) @result{}
- ((entry "root:x:0:0:root:/root:/bin/bash")
- (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
- (entry "bin:x:2:2:bin:/bin:/bin/sh")
- (entry "sys:x:3:3:sys:/dev:/bin/sh")
- (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
- (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
- (keyword-flatten '(entry) (peg:tree (match-pattern passwd "one entry"))) @result{}
- ((entry "one entry"))
- @end lisp
- Of course, this is a somewhat contrived example. In practice we would
- probably just tag the @code{passwd} nonterminal to remove the ambiguity
- (using either the @code{all} keyword for S-expressions or the @code{<--}
- symbol for strings)..
- @lisp
- (define-peg-pattern tag-passwd all (peg "entry* !."))
- (peg:tree (match-pattern tag-passwd *etc-passwd*)) @result{}
- (tag-passwd
- (entry "root:x:0:0:root:/root:/bin/bash")
- (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
- (entry "bin:x:2:2:bin:/bin:/bin/sh")
- (entry "sys:x:3:3:sys:/dev:/bin/sh")
- (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
- (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
- (peg:tree (match-pattern tag-passwd "one entry"))
- (tag-passwd
- (entry "one entry"))
- @end lisp
- If you're ever uncertain about the potential results of parsing
- something, remember the two absolute rules:
- @enumerate
- @item
- No parsing information will ever be discarded.
- @item
- There will never be any lists with fewer than 2 elements.
- @end enumerate
- For the purposes of (1), "parsing information" means things tagged with
- the @code{any} keyword or the @code{<--} symbol. Plain strings will be
- concatenated.
- Let's extend this example a bit more and actually pull some useful
- information out of the passwd file:
- @lisp
- (define-peg-string-patterns
- "passwd <-- entry* !.
- entry <-- login C pass C uid C gid C nameORcomment C homedir C shell NL*
- login <-- text
- pass <-- text
- uid <-- [0-9]*
- gid <-- [0-9]*
- nameORcomment <-- text
- homedir <-- path
- shell <-- path
- path <-- (SLASH pathELEMENT)*
- pathELEMENT <-- (!NL !C !'/' .)*
- text <- (!NL !C .)*
- C < ':'
- NL < '\n'
- SLASH < '/'")
- @end lisp
- This produces rather pretty parse trees:
- @lisp
- (passwd
- (entry (login "root")
- (pass "x")
- (uid "0")
- (gid "0")
- (nameORcomment "root")
- (homedir (path (pathELEMENT "root")))
- (shell (path (pathELEMENT "bin") (pathELEMENT "bash"))))
- (entry (login "daemon")
- (pass "x")
- (uid "1")
- (gid "1")
- (nameORcomment "daemon")
- (homedir
- (path (pathELEMENT "usr") (pathELEMENT "sbin")))
- (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
- (entry (login "bin")
- (pass "x")
- (uid "2")
- (gid "2")
- (nameORcomment "bin")
- (homedir (path (pathELEMENT "bin")))
- (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
- (entry (login "sys")
- (pass "x")
- (uid "3")
- (gid "3")
- (nameORcomment "sys")
- (homedir (path (pathELEMENT "dev")))
- (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
- (entry (login "nobody")
- (pass "x")
- (uid "65534")
- (gid "65534")
- (nameORcomment "nobody")
- (homedir (path (pathELEMENT "nonexistent")))
- (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
- (entry (login "messagebus")
- (pass "x")
- (uid "103")
- (gid "107")
- nameORcomment
- (homedir
- (path (pathELEMENT "var")
- (pathELEMENT "run")
- (pathELEMENT "dbus")))
- (shell (path (pathELEMENT "bin") (pathELEMENT "false")))))
- @end lisp
- Notice that when there's no entry in a field (e.g. @code{nameORcomment}
- for messagebus) the symbol is inserted. This is the ``don't throw away
- any information'' rule---we succesfully matched a @code{nameORcomment}
- of 0 characters (since we used @code{*} when defining it). This is
- usually what you want, because it allows you to e.g. use @code{list-ref}
- to pull out elements (since they all have known offsets).
- If you'd prefer not to have symbols for empty matches, you can replace
- the @code{*} with a @code{+} and add a @code{?} after the
- @code{nameORcomment} in @code{entry}. Then it will try to parse 1 or
- more characters, fail (inserting nothing into the parse tree), but
- continue because it didn't have to match the nameORcomment to continue.
- @subsubheading Embedding Arithmetic Expressions
- We can parse simple mathematical expressions with the following PEG:
- @lisp
- (define-peg-string-patterns
- "expr <- sum
- sum <-- (product ('+' / '-') sum) / product
- product <-- (value ('*' / '/') product) / value
- value <-- number / '(' expr ')'
- number <-- [0-9]+")
- @end lisp
- Then:
- @lisp
- (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2")) @result{}
- (sum (product (value (number "1")))
- "+"
- (sum (product
- (value (number "1"))
- "/"
- (product
- (value (number "2"))
- "*"
- (product (value (number "3")))))
- "+"
- (sum (product
- (value "("
- (sum (product (value (number "1")))
- "+"
- (sum (product (value (number "1")))))
- ")")
- "/"
- (product (value (number "2")))))))
- @end lisp
- There is very little wasted effort in this PEG. The @code{number}
- nonterminal has to be tagged because otherwise the numbers might run
- together with the arithmetic expressions during the string concatenation
- stage of parse-tree compression (the parser will see ``1'' followed by
- ``/'' and decide to call it ``1/''). When in doubt, tag.
- It is very easy to turn these parse trees into lisp expressions:
- @lisp
- (define (parse-sum sum left . rest)
- (if (null? rest)
- (apply parse-product left)
- (list (string->symbol (car rest))
- (apply parse-product left)
- (apply parse-sum (cadr rest)))))
- (define (parse-product product left . rest)
- (if (null? rest)
- (apply parse-value left)
- (list (string->symbol (car rest))
- (apply parse-value left)
- (apply parse-product (cadr rest)))))
- (define (parse-value value first . rest)
- (if (null? rest)
- (string->number (cadr first))
- (apply parse-sum (car rest))))
- (define parse-expr parse-sum)
- @end lisp
- (Notice all these functions look very similar; for a more complicated
- PEG, it would be worth abstracting.)
- Then:
- @lisp
- (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
- (+ 1 (+ (/ 1 (* 2 3)) (/ (+ 1 1) 2)))
- @end lisp
- But wait! The associativity is wrong! Where it says @code{(/ 1 (* 2
- 3))}, it should say @code{(* (/ 1 2) 3)}.
- It's tempting to try replacing e.g. @code{"sum <-- (product ('+' / '-')
- sum) / product"} with @code{"sum <-- (sum ('+' / '-') product) /
- product"}, but this is a Bad Idea. PEGs don't support left recursion.
- To see why, imagine what the parser will do here. When it tries to
- parse @code{sum}, it first has to try and parse @code{sum}. But to do
- that, it first has to try and parse @code{sum}. This will continue
- until the stack gets blown off.
- So how does one parse left-associative binary operators with PEGs?
- Honestly, this is one of their major shortcomings. There's no
- general-purpose way of doing this, but here the repetition operators are
- a good choice:
- @lisp
- (use-modules (srfi srfi-1))
- (define-peg-string-patterns
- "expr <- sum
- sum <-- (product ('+' / '-'))* product
- product <-- (value ('*' / '/'))* value
- value <-- number / '(' expr ')'
- number <-- [0-9]+")
- ;; take a deep breath...
- (define (make-left-parser next-func)
- (lambda (sum first . rest) ;; general form, comments below assume
- ;; that we're dealing with a sum expression
- (if (null? rest) ;; form (sum (product ...))
- (apply next-func first)
- (if (string? (cadr first));; form (sum ((product ...) "+") (product ...))
- (list (string->symbol (cadr first))
- (apply next-func (car first))
- (apply next-func (car rest)))
- ;; form (sum (((product ...) "+") ((product ...) "+")) (product ...))
- (car
- (reduce ;; walk through the list and build a left-associative tree
- (lambda (l r)
- (list (list (cadr r) (car r) (apply next-func (car l)))
- (string->symbol (cadr l))))
- 'ignore
- (append ;; make a list of all the products
- ;; the first one should be pre-parsed
- (list (list (apply next-func (caar first))
- (string->symbol (cadar first))))
- (cdr first)
- ;; the last one has to be added in
- (list (append rest '("done"))))))))))
- (define (parse-value value first . rest)
- (if (null? rest)
- (string->number (cadr first))
- (apply parse-sum (car rest))))
- (define parse-product (make-left-parser parse-value))
- (define parse-sum (make-left-parser parse-product))
- (define parse-expr parse-sum)
- @end lisp
- Then:
- @lisp
- (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
- (+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2))
- @end lisp
- As you can see, this is much uglier (it could be made prettier by using
- @code{context-flatten}, but the way it's written above makes it clear
- how we deal with the three ways the zero-or-more @code{*} expression can
- parse). Fortunately, most of the time we can get away with only using
- right-associativity.
- @subsubheading Simplified Functions
- For a more tantalizing example, consider the following grammar that
- parses (highly) simplified C functions:
- @lisp
- (define-peg-string-patterns
- "cfunc <-- cSP ctype cSP cname cSP cargs cLB cSP cbody cRB
- ctype <-- cidentifier
- cname <-- cidentifier
- cargs <-- cLP (! (cSP cRP) carg cSP (cCOMMA / cRP) cSP)* cSP
- carg <-- cSP ctype cSP cname
- cbody <-- cstatement *
- cidentifier <- [a-zA-z][a-zA-Z0-9_]*
- cstatement <-- (!';'.)*cSC cSP
- cSC < ';'
- cCOMMA < ','
- cLP < '('
- cRP < ')'
- cLB < '@{'
- cRB < '@}'
- cSP < [ \t\n]*")
- @end lisp
- Then:
- @lisp
- (match-pattern cfunc "int square(int a) @{ return a*a;@}") @result{}
- (32
- (cfunc (ctype "int")
- (cname "square")
- (cargs (carg (ctype "int") (cname "a")))
- (cbody (cstatement "return a*a"))))
- @end lisp
- And:
- @lisp
- (match-pattern cfunc "int mod(int a, int b) @{ int c = a/b;return a-b*c; @}") @result{}
- (52
- (cfunc (ctype "int")
- (cname "mod")
- (cargs (carg (ctype "int") (cname "a"))
- (carg (ctype "int") (cname "b")))
- (cbody (cstatement "int c = a/b")
- (cstatement "return a- b*c"))))
- @end lisp
- By wrapping all the @code{carg} nonterminals in a @code{cargs}
- nonterminal, we were able to remove any ambiguity in the parsing
- structure and avoid having to call @code{context-flatten} on the output
- of @code{match-pattern}. We used the same trick with the @code{cstatement}
- nonterminals, wrapping them in a @code{cbody} nonterminal.
- The whitespace nonterminal @code{cSP} used here is a (very) useful
- instantiation of a common pattern for matching syntactically irrelevant
- information. Since it's tagged with @code{<} and ends with @code{*} it
- won't clutter up the parse trees (all the empty lists will be discarded
- during the compression step) and it will never cause parsing to fail.
- @node PEG Internals
- @subsection PEG Internals
- A PEG parser takes a string as input and attempts to parse it as a given
- nonterminal. The key idea of the PEG implementation is that every
- nonterminal is just a function that takes a string as an argument and
- attempts to parse that string as its nonterminal. The functions always
- start from the beginning, but a parse is considered successful if there
- is material left over at the end.
- This makes it easy to model different PEG parsing operations. For
- instance, consider the PEG grammar @code{"ab"}, which could also be
- written @code{(and "a" "b")}. It matches the string ``ab''. Here's how
- that might be implemented in the PEG style:
- @lisp
- (define (match-and-a-b str)
- (match-a str)
- (match-b str))
- @end lisp
- As you can see, the use of functions provides an easy way to model
- sequencing. In a similar way, one could model @code{(or a b)} with
- something like the following:
- @lisp
- (define (match-or-a-b str)
- (or (match-a str) (match-b str)))
- @end lisp
- Here the semantics of a PEG @code{or} expression map naturally onto
- Scheme's @code{or} operator. This function will attempt to run
- @code{(match-a str)}, and return its result if it succeeds. Otherwise it
- will run @code{(match-b str)}.
- Of course, the code above wouldn't quite work. We need some way for the
- parsing functions to communicate. The actual interface used is below.
- @subsubheading Parsing Function Interface
- A parsing function takes three arguments - a string, the length of that
- string, and the position in that string it should start parsing at. In
- effect, the parsing functions pass around substrings in pieces - the
- first argument is a buffer of characters, and the second two give a
- range within that buffer that the parsing function should look at.
- Parsing functions return either #f, if they failed to match their
- nonterminal, or a list whose first element must be an integer
- representing the final position in the string they matched and whose cdr
- can be any other data the function wishes to return, or '() if it
- doesn't have any more data.
- The one caveat is that if the extra data it returns is a list, any
- adjacent strings in that list will be appended by @code{match-pattern}. For
- instance, if a parsing function returns @code{(13 ("a" "b" "c"))},
- @code{match-pattern} will take @code{(13 ("abc"))} as its value.
- For example, here is a function to match ``ab'' using the actual
- interface.
- @lisp
- (define (match-a-b str len pos)
- (and (<= (+ pos 2) len)
- (string= str "ab" pos (+ pos 2))
- (list (+ pos 2) '()))) ; we return no extra information
- @end lisp
- The above function can be used to match a string by running
- @code{(match-pattern match-a-b "ab")}.
- @subsubheading Code Generators and Extensible Syntax
- PEG expressions, such as those in a @code{define-peg-pattern} form, are
- interpreted internally in two steps.
- First, any string PEG is expanded into an s-expression PEG by the code
- in the @code{(ice-9 peg string-peg)} module.
- Then, the s-expression PEG that results is compiled into a parsing
- function by the @code{(ice-9 peg codegen)} module. In particular, the
- function @code{compile-peg-pattern} is called on the s-expression. It then
- decides what to do based on the form it is passed.
- The PEG syntax can be expanded by providing @code{compile-peg-pattern} more
- options for what to do with its forms. The extended syntax will be
- associated with a symbol, for instance @code{my-parsing-form}, and will
- be called on all PEG expressions of the form
- @lisp
- (my-parsing-form ...)
- @end lisp
- The parsing function should take two arguments. The first will be a
- syntax object containing a list with all of the arguments to the form
- (but not the form's name), and the second will be the
- @code{capture-type} argument that is passed to @code{define-peg-pattern}.
- New functions can be registered by calling @code{(add-peg-compiler!
- symbol function)}, where @code{symbol} is the symbol that will indicate
- a form of this type and @code{function} is the code generating function
- described above. The function @code{add-peg-compiler!} is exported from
- the @code{(ice-9 peg codegen)} module.
|