PEG syntax and semantics
========================

A PEG (Parsing expression grammar) is a simple deterministic grammar that can
be directly used for parsing. The current implementation has been designed as
a more powerful replacement for regular expressions. UTF-8 is supported.

The notation used for a PEG is similar to that of EBNF:

===============   ============================================================
notation          meaning
===============   ============================================================
``A / ... / Z``   Ordered choice: Apply expressions `A`, ..., `Z`, in this
                  order, to the text ahead, until one of them succeeds and
                  possibly consumes some text. Indicate success if one of
                  the expressions succeeded. Otherwise, do not consume any
                  text and indicate failure.
``A ... Z``       Sequence: Apply expressions `A`, ..., `Z`, in this order,
                  to consume consecutive portions of the text ahead, as long
                  as they succeed. Indicate success if all succeeded.
                  Otherwise, do not consume any text and indicate failure.
                  The sequence's precedence is higher than that of ordered
                  choice: ``A B / C`` means ``(A B) / C`` and
                  not ``A (B / C)``.
``(E)``           Grouping: Parentheses can be used to change
                  operator priority.
``{E}``           Capture: Apply expression `E` and store the substring
                  that matched `E` into a *capture* that can be accessed
                  after the matching process.
``{}``            Empty capture: Delete the last capture. No character
                  is consumed.
``$i``            Back reference to the ``i``th capture. ``i`` counts forwards
                  from 1 or backwards (last capture to first) from ^1.
``$``             Anchor: Matches at the end of the input. No character
                  is consumed. Same as ``!.``.
``^``             Anchor: Matches at the start of the input. No character
                  is consumed.
``&E``            And predicate: Indicate success if expression `E` matches
                  the text ahead; otherwise indicate failure. Do not consume
                  any text.
``!E``            Not predicate: Indicate failure if expression `E` matches
                  the text ahead; otherwise indicate success. Do not consume
                  any text.
``E+``            One or more: Apply expression `E` repeatedly to match
                  the text ahead, as long as it succeeds. Consume the matched
                  text (if any) and indicate success if there was at least
                  one match. Otherwise, indicate failure.
``E*``            Zero or more: Apply expression `E` repeatedly to match
                  the text ahead, as long as it succeeds. Consume the matched
                  text (if any). Always indicate success.
``E?``            Zero or one: If expression `E` matches the text ahead,
                  consume it. Always indicate success.
``[s]``           Character class: If the character ahead appears in the
                  string `s`, consume it and indicate success. Otherwise,
                  indicate failure.
``[a-b]``         Character range: If the character ahead is one from the
                  range `a` through `b`, consume it and indicate success.
                  Otherwise, indicate failure.
``'s'``           String: If the text ahead is the string `s`, consume it
                  and indicate success. Otherwise, indicate failure.
``i's'``          String match ignoring case.
``y's'``          String match ignoring style.
``v's'``          Verbatim string match: Use this to override a global
                  ``\i`` or ``\y`` modifier.
``i$j``           String match ignoring case for back reference.
``y$j``           String match ignoring style for back reference.
``v$j``           Verbatim string match for back reference.
``.``             Any character: If there is a character ahead, consume it
                  and indicate success. Otherwise (that is, at the end of
                  input), indicate failure.
``_``             Any Unicode character: If there is a UTF-8 character
                  ahead, consume it and indicate success. Otherwise, indicate
                  failure.
``@E``            Search: Shorthand for ``(!E .)* E``. (Search loop for the
                  pattern `E`.)
``{@} E``         Captured Search: Shorthand for ``{(!E .)*} E``. (Search
                  loop for the pattern `E`.) Everything until and excluding
                  `E` is captured.
``@@ E``          Same as ``{@} E``.
``A <- E``        Rule: Bind the expression `E` to the *nonterminal symbol*
                  `A`. **Left recursive rules are not possible and crash the
                  matching engine.**
``\identifier``   Built-in macro for a longer expression.
``\ddd``          Character with decimal code *ddd*.
``\"``, etc.      Literal ``"``, etc.
===============   ============================================================
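
A few of the operators from the table can be tried out directly. The following is a minimal sketch, assuming Nim's `std/pegs` module with its `match` proc and `=~` template (which injects a `matches` array); the input strings and the name `p` are made up for illustration.

```nim
import std/pegs

# Ordered choice: `\d+` fails on "ABC", so the second alternative is tried.
doAssert "ABC".match(peg"\d+ / \w+")

# Sequence binds tighter than choice: the pattern reads ('a' 'b') / 'c'.
let p = peg"'a' 'b' / 'c'"
doAssert "ab".match(p)
doAssert "c".match(p)
doAssert(not match("ac", p))  # would only match under 'a' ('b' / 'c')

# Captures: `{...}` stores the matched substring; `{@} E` captures the text
# up to (and excluding) `E`.
if "f(a, b)" =~ peg"{[0-9]+} / ({\ident} '(' {@} ')')":
  doAssert matches[0] == "f"
  doAssert matches[1] == "a, b"
else:
  doAssert false

# Not predicate: `!.` succeeds only at the end of the input, so the first
# alternative fails after "aa" and the second alternative is used instead.
if "aaaaaa" =~ peg"'aa' !. / ({'a'})+":
  doAssert matches[0] == "a"
else:
  doAssert false
```

The `"ac"` case is the one that distinguishes the two possible groupings of ``A B / C``: it is rejected because the choice applies to the whole sequence, not just to its last element.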

Built-in macros
---------------

==============   ============================================================
macro            meaning
==============   ============================================================
``\d``           any decimal digit: ``[0-9]``
``\D``           any character that is not a decimal digit: ``[^0-9]``
``\s``           any whitespace character: ``[ \9-\13]``
``\S``           any character that is not a whitespace character:
                 ``[^ \9-\13]``
``\w``           any "word" character: ``[a-zA-Z0-9_]``
``\W``           any "non-word" character: ``[^a-zA-Z0-9_]``
``\a``           same as ``[a-zA-Z]``
``\A``           same as ``[^a-zA-Z]``
``\n``           any newline combination: ``\10 / \13\10 / \13``
``\i``           ignore case for matching; use this at the start of the PEG
``\y``           ignore style for matching; use this at the start of the PEG
``\skip`` pat    skip pattern *pat* before trying to match other tokens;
                 this is useful for whitespace skipping, for example:
                 ``\skip(\s*) {\ident} ':' {\ident}`` matches key value
                 pairs ignoring whitespace around the ``':'``.
``\ident``       a standard ASCII identifier: ``[a-zA-Z_][a-zA-Z_0-9]*``
``\letter``      any Unicode letter
``\upper``       any Unicode uppercase letter
``\lower``       any Unicode lowercase letter
``\title``       any Unicode title letter
``\white``       any Unicode whitespace character
==============   ============================================================
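
The macros can be combined like any other expression. The following is a small sketch, again assuming `std/pegs` (`match`, `replacef`); the sample strings are illustrative only.

```nim
import std/pegs

# Character-class macros.
doAssert "0158787".match(peg"\d+")
doAssert "ABC 0232".match(peg"\w+\s+\d+")
doAssert(not match("456678", peg"(\letter)+"))

# `\y` at the start of the PEG ignores style for all string literals;
# the `v` prefix switches a single literal back to verbatim matching.
doAssert match("W_HI_Le", peg"\y 'while'")
doAssert(not match("W_HI_Le", peg"\y v'while'"))

# `\skip(\s*)` skips whitespace before each token, so the spaces around
# '=' (and the space in front of "var2") are consumed by the match.
doAssert("var1 = key; var2 = key2".replacef(
  peg"\skip(\s*) {\ident}'='{\ident}", "$1<-$2$2") ==
    "var1<-keykey;var2<-key2key2")
```

Note that the skip pattern also runs before the first token of each match, which is why the space after the `;` does not survive the replacement.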

A backslash followed by a letter is a built-in macro; otherwise it
is used for ordinary escaping:

==============   ============================================================
notation         meaning
==============   ============================================================
``\\``           a single backslash
``\*``           same as ``'*'``
``\t``           not a tabulator, but an (unknown) built-in
==============   ============================================================

Supported PEG grammar
---------------------

The PEG parser implements this grammar (written in PEG syntax)::

  # Example grammar of PEG in PEG syntax.
  # Comments start with '#'.
  # First symbol is the start symbol.

  grammar <- rule* / expr

  identifier <- [A-Za-z][A-Za-z0-9_]*
  charsetchar <- "\\" . / [^\]]
  charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]"
  stringlit <- identifier? ("\"" ("\\" . / [^"])* "\"" /
                            "'" ("\\" . / [^'])* "'")
  builtin <- "\\" identifier / [^\13\10]

  comment <- '#' @ \n
  ig <- (\s / comment)* # things to ignore

  rule <- identifier \s* "<-" expr ig
  identNoArrow <- identifier !(\s* "<-")
  prefixOpr <- ig '&' / ig '!' / ig '@' / ig '{@}' / ig '@@'
  literal <- ig identifier? '$' '^'? [0-9]+ / '$' / '^' /
             ig identNoArrow /
             ig charset /
             ig stringlit /
             ig builtin /
             ig '.' /
             ig '_' /
             (ig "(" expr ig ")") /
             (ig "{" expr? ig "}")
  postfixOpr <- ig '?' / ig '*' / ig '+'
  primary <- prefixOpr* (literal postfixOpr*)

  # Concatenation has higher priority than choice:
  # ``a b / c`` means ``(a b) / c``

  seqExpr <- primary+
  expr <- seqExpr (ig "/" expr)*
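
A grammar made of several rules is passed to `peg` as one string, with the first rule acting as the start symbol. The sketch below assumes `std/pegs`; the rule names `S`, `A`, `B`, `C`, `D` and the variable `g` are hypothetical and chosen only for illustration.

```nim
import std/pegs

# S is the start symbol because it is the first rule.
let g = peg"""S <- A B / C D
              A <- 'a'+
              B <- 'b'+
              C <- 'c'+
              D <- 'd'+
           """

doAssert "aabbb".match(g)          # first alternative: A B
doAssert "cccccdddddd".match(g)    # second alternative: C D
doAssert(not match("aaddd", g))    # neither A B nor C D applies
```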

**Note**: As a special syntactic extension, if the whole PEG is only a single
expression, identifiers are not interpreted as non-terminals, but are
interpreted as verbatim strings:

```nim
"abc" =~ peg"abc" # is true
```

So it is not necessary to write ``peg" 'abc' "`` in the above example.
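
The effect of this extension can be checked with a short sketch, assuming `std/pegs` and its `match` overloads (with and without a captures array and start index); `caps` is an illustrative name.

```nim
import std/pegs

# With a single expression, a bare identifier is a verbatim string,
# so both spellings accept the same input.
doAssert "abc".match(peg"abc")
doAssert "abc".match(peg" 'abc' ")

# `c`, `d`, `ef` and `g` are literal text here; matching starts at index 2.
var caps: array[0..2, string]
doAssert match("abcdefg", peg"c {d} ef {g}", caps, 2)
doAssert caps[0] == "d"
doAssert caps[1] == "g"
```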

Examples
--------

Check if `s` matches Nim's "while" keyword:

```nim
s =~ peg" y'while'"
```

Exchange (key, val)-pairs:

```nim
"key: val; key2: val2".replacef(peg"{\ident} \s* ':' \s* {\ident}", "$2: $1")
```

Determine the ``#include``'ed files of a C file:

```nim
import std/pegs

# Print the file name of every `#include "..."` line; comments and
# whitespace around the directive are skipped by the `ws` rule.
for line in lines("myfile.c"):
  if line =~ peg"""s <- ws '#include' ws '"' {[^"]+} '"' ws
                   comment <- '/*' @ '*/' / '//' .*
                   ws <- (comment / \s+)* """:
    echo matches[0]
```

PEG vs regular expression
-------------------------

As a regular expression, ``\[.*\]`` matches the longest possible text between
``'['`` and ``']'``. As a PEG, it never matches anything, because a PEG is
deterministic: ``.*`` consumes the rest of the input, so ``\]`` never matches.
As a PEG, this needs to be written as ``\[ ( !\] . )* \]`` (or ``\[ @ \]``).

Note that the regular expression does not behave as intended either: in the
example, ``*`` should not be greedy, so ``\[.*?\]`` should be used instead.
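
The difference can be observed with a short sketch, assuming `std/pegs`. The brackets are written here as quoted literals ``'['`` and ``']'`` instead of the ``\[`` and ``\]`` escapes, and the input string `text` is made up for illustration.

```nim
import std/pegs

let text = "[abc]"

# `.*` consumes the rest of the input, including the closing bracket,
# and a PEG never gives characters back, so the final ']' cannot match.
doAssert(not match(text, peg"'[' .* ']'"))

# Stop the loop explicitly in front of the closing bracket ...
doAssert text.match(peg"'[' (!']' .)* ']'")

# ... or use the search operator, which is shorthand for the same loop.
doAssert text.match(peg"'[' @ ']'")
```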

PEG construction
----------------

There are two ways to construct a PEG in Nim code:

(1) Parsing a string into an AST which consists of `Peg` nodes with the
    `peg` proc.
(2) Constructing the AST directly with proc calls. This method does not
    support constructing rules, only simple expressions and is not as
    convenient. Its only advantage is that it does not pull in the whole PEG
    parser into your executable.
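
Both ways can be compared side by side. The following is a minimal sketch, assuming the constructor procs and templates that `std/pegs` exports (`sequence`, `term`, the prefix `*` operator, `ident`, `whitespace`) and its `matchLen` proc; the names `parsed` and `built` are illustrative.

```nim
import std/pegs

# (1) Parse the pattern from a string with the `peg` proc.
let parsed = peg"\ident \s* '=' \s* \ident"

# (2) Build an equivalent pattern directly from proc calls.
let built = sequence(ident, *whitespace, term('='), *whitespace, ident)

# Both patterns consume the whole 11-character input.
doAssert matchLen("key1=  cal9", parsed) == 11
doAssert matchLen("key1=  cal9", built) == 11
```

Since this example also calls `peg`, it does not show the executable-size advantage of the second method; that would require using only the constructor procs.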