# ebnf.py - EBNF -> Python-Parser compilation for DHParser
#
# Copyright 2016  by Eckhart Arnold (arnold@badw.de)
#                 Bavarian Academy of Sciences and Humanities (badw.de)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.  See the License for the specific language governing
# permissions and limitations under the License.

"""
Module ``ebnf`` provides an EBNF-parser-generator that compiles an
EBNF-Grammar into Python-code that can be executed to parse source text
conforming to this grammar into concrete syntax trees.

Specifying Grammars with EBNF
-----------------------------

With DHParser, Grammars can be specified either directly in Python-code
(see :py:mod:`parse`) or in one of several EBNF-dialects. (Yes,
DHParser supports several different variants of EBNF! This makes it easy
to create a parser directly from Grammars found in external sources.)
"EBNF" stands for the "Extended-Backus-Naur-Form" which is a common
formalism for specifying Grammars for context-free-languages.
(see https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form)

The recommended way of compiling grammars with DHParser is either to
write the EBNF-specification for that grammar into a text-file and then
compile the EBNF-source to an executable as well as importable Python-module
with the help of the "dhparser"-script. Or, for bigger projects, to
create a new domain-specific-language-project with the dhparser-script
as described in the step-by-step-guide.

However, here we will show how to compile an EBNF-specified grammar
from within Python-code and how to execute the parser that was
generated by compiling the grammar.

As an example, we will realize a json-parser (https://www.json.org/).
Let's start with creating some test-data::

    >>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False}
    >>> import json
    >>> testdata = json.dumps(testobj)
    >>> testdata
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

We define the json-Grammar (see https://www.json.org/) in a
top-down manner in EBNF. We'll use a regular-expression look-alike
syntax. EBNF, as you may recall, consists of a sequence of symbol
definitions. The definiens of each definition is either a string
literal, a regular expression, another symbol, or a combination
of these with four different operators: 1. sequences,
2. alternatives, 3. options and 4. repetitions. Here is how these
elements are denoted in classical and regex-like EBNF-syntax:

========================  ==================  ================
element                   classical EBNF      regex-like
========================  ==================  ================
insignificant whitespace  ~                   ~
string literal            "..." or \\`...\\`    "..." or \\`...\\`
regular expr.             /.../               /.../
sequences                 A B C               A B C
alternatives              A | B | C           A | B | C
options                   [ ... ]             ...?
repetitions               { ... }             ...*
one or more                                   ...+
grouping                  (...)               (...)
========================  ==================  ================

"insignificant whitespace" is a speciality of DHParser. Denoting
insignificant whitespace with a particular sign ``~`` allows eliminating
it already during the parsing process without burdening later
syntax-tree-processing stages with this common task. DHParser offers
several more facilities to restrain the verbosity of the concrete
syntax tree, so that the outcome of the parsing stage comes close (or
at least closer) to the intended abstract-syntax-tree, already.

JSON consists of two complex data types, 1) associative arrays,
called "object", and 2) sequences of heterogeneous data, called "array";
and of four simple data types, 1) string, 2) number, 3) bool and 4) null.
The structure of a JSON file can easily be described in EBNF::

    >>> grammar = '''
    ...     json     = ~ _element _EOF
    ...     _EOF     = /$/
    ...     _element = object | array | string | number | bool | null
    ...     object   = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member   = string ":" ~ _element
    ...     array    = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string   = `"` _CHARS `"` ~
    ...     _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number   = _INT _FRAC? _EXP? ~
    ...     _INT     = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...     _FRAC    = `.` /[0-9]+/
    ...     _EXP     = (`E`|`e`) [`+`|`-`] /[0-9]+/
    ...     bool     = "true" ~ | "false" ~
    ...     null     = "null" ~                                   '''

This is a rather common EBNF-grammar. A few peculiarities are noteworthy, though:
First of all, you might notice that some components of the grammar
(or "production rules" as they are commonly called) have names with a leading
underscore ``_``. It is a convention to mark those elements in which we are
not interested in their own right with an underscore ``_``. When moving from the
concrete syntax-tree to a more abstract syntax-tree, these elements could be
substituted by their content, to simplify the tree.

Secondly, some production rules carry a name written in capital letters. This is
also a convention to mark those symbols that, with other parser-generators, would
represent tokens delivered by a lexical scanner. DHParser is a "scanner-less"
parser, which means that the breaking down of the string into meaningful tokens
is done in place with regular expressions (like in the definition of ``_EOF``)
or simple combinations of regular expressions (see the definition of ``_INT`` above).
There is no sharp distinction between tokens and other symbols in DHParser,
but we keep it as a loose convention. Regular expressions are enclosed in forward
slashes and follow the standard syntax of Perl-style regular expressions that is
also used by the "re"-module of the Python standard library. (Don't worry about
the number of backslashes in the line defining ``_CHARS`` for now!)

Finally, it is another helpful convention to indent the definitions of symbols
that have only been introduced to simplify an otherwise unnecessarily
complicated definition (e.g. the definition of ``number``, above) or to make
it more understandable by giving names to its components (like ``_EOF``).

Let's try this grammar on our test-string.  In order to compile
this grammar into executable Python-code, we use the high-level-function
:py:func:`~dsl.create_parser` from the :py:mod:`dsl`-module.

    >>> from DHParser.dsl import create_parser
    >>> # from DHParser.dsl import compileEBNF
    >>> # print(compileEBNF(grammar))
    >>> parser = create_parser(grammar, branding="JSON")
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

As expected, serializing the content of the resulting syntax-tree yields exactly
the input-string of the parsing process. What we cannot see here, is that the
parser has structured the string into the individual elements described in the
grammar. Since the concrete syntax-tree that the parser yields is rather
verbose, it would not make sense to print it out. We'll just look at a small
part of it, to see what it looks like. Let's just pick the sub-tree that
captures the first json-array within the syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (_element
        (number
          (_INT "1")))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (number
          (_INT "2")
          (_FRAC
            (:Text ".")
            (:RegExp "0"))))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (string
          (:Text '"')
          (_CHARS "a string")
          (:Text '"')))
      (:Text "]"))

The nodes of the syntax-tree carry the names of the production rules
by which they have been generated. Nodes that have been created by
components of a production receive the name of the parser-type
that has created the node (see :py:mod:`parse`) prefixed
with a colon ":". In DHParser, these nodes are called "anonymous",
because they lack the name of a proper grammatical component.
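
For example, the first child of the array-node displayed above stems
from the anonymous plain-text parser that matched the delimiter "[" and
has therefore received the name of that parser-type::

    >>> syntax_tree.pick('array').children[0].tag_name
    ':Text'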

.. _simplifying_syntax_trees:

Simplifying Syntax-Trees while Parsing
--------------------------------------

Usually, anonymous nodes are what you want to get rid of in the course
of transforming the concrete syntax-tree into an abstract syntax-tree.
(See :py:mod:`transform`). DHParser already eliminates by
default all anonymous nodes that are not leaf-nodes by replacing them
with their children during parsing. Anonymous leaf-nodes will be
replaced by their content, if they are a single child of some parent,
and otherwise be left in place. Without this optimization, each
construct of the EBNF-grammar would leave a node in the syntax-tree::

    >>> from DHParser.parse import CombinedParser, TreeReduction
    >>> _ = TreeReduction(parser.json, CombinedParser.NO_TREE_REDUCTION)
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (:Option
        (:Series
          (_element
            (number
              (_INT
                (:Alternative
                  (:RegExp "1")))))
          (:ZeroOrMore
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (number
                  (_INT
                    (:Alternative
                      (:RegExp "2")))
                  (:Option
                    (_FRAC
                      (:Text ".")
                      (:RegExp "0"))))))
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (string
                  (:Text '"')
                  (_CHARS
                    (:RegExp "a string"))
                  (:Text '"')))))))
      (:Text "]"))

This can be helpful for understanding how parsing that is directed by
an EBNF-grammar works (next to looking at the logs of the complete
parsing-process, see :py:mod:`trace`), but other than that it
is advisable to streamline the syntax-tree as early as possible,
because the processing time of all subsequent tree-processing stages
increases with the number of nodes in the tree.

Because of this, DHParser offers further means of simplifying
syntax-trees during the parsing stage, already. These are not turned
on by default, because they allow dropping content or removing named
nodes from the tree; they must be turned on by "directives" that
are listed at the top of an EBNF-grammar and that guide the
parser-generation process. DHParser-directives always start with an
``@``-sign. For example, the ``@drop``-directive advises the parser to
drop certain nodes entirely, including their content. In the following
example, the parser is directed to drop all insignificant whitespace::

    >>> drop_insignificant_wsp = '@drop = whitespace \\n'

Directives look similar to productions, only that on the right hand
side of the equal sign follows a list of parameters. In the case
of the drop-directive these can be either names of non-anonymous
nodes that shall be dropped or one of four particular classes of
anonymous nodes (``strings``, ``backticked``, ``regexp``, ``whitespace``) that
will be dropped.
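
For example, a drop-directive that lists several of these classes at
once - a combination that will be put to use further below - reads::

    @drop = whitespace, strings, backticked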

Another useful directive advises the parser to treat named nodes as
anonymous nodes and to eliminate them accordingly during parsing. This
is useful, if we have introduced certain names in our grammar
only as placeholders to render the definition of the grammar a bit
more readable, not because we are interested in the text that is
captured by the production associated with them in their own right::

    >>> disposable_symbols = '@disposable = /_\w+/ \\n'

Instead of passing a comma-separated list of symbols to the directive,
which would also have been possible, we have leveraged our convention
to prefix unimportant symbols with an underscore "_" by specifying the
symbols that shall be anonymized with a regular expression.
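
For the json-grammar above, the equivalent comma-separated form of the
directive would have been::

    @disposable = _EOF, _element, _CHARS, _INT, _FRAC, _EXP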

Now, let's examine the effect of these two directives::

    >>> refined_grammar = drop_insignificant_wsp + disposable_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}'

You might have noticed that all insignificant whitespace adjacent to
the delimiters has been removed this time (but, of course, not the
significant whitespace between "a" and "string" in "a string"). And
the difference that the use of these two directives makes is even more
obvious, if we look at (a section of) the syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (number "1")
      (:Text ",")
      (number
        (:RegExp "2")
        (:Text ".")
        (:RegExp "0"))
      (:Text ",")
      (string
        (:Text '"')
        (:RegExp "a string")
        (:Text '"'))
      (:Text "]"))

This tree looks more streamlined. But it still contains more structure
than we might like to see in an abstract syntax tree. In particular, it
still contains all the delimiters ("[", ",", '"', ...) next to the data. But
unlike in the UTF-8 representation of our json data, the delimiters are
not needed any more, because the structural information is now retained
in the tree-structure.

So how can we get rid of those delimiters? The rather coarse-grained tools
that DHParser offers in the parsing stage require some care to do this
properly.

The @drop-directive allows dropping all unnamed strings (i.e. strings
that are not directly assigned to a symbol) and backticked strings (for
the difference between strings and backticked strings, see below) and
regular expressions. However, using ``@drop = whitespace, strings, backticked``
would also drop those parts captured as strings that contain data::

    >>> refined_grammar = '@drop = whitespace, strings, backticked \\n' \
                          + disposable_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp "0"))
      (string "a string"))

Here, suddenly, the number "2.0" has been turned into "20"! There
are three ways to get around this problem:

1. Assigning all non-delimiter strings to symbols. In this case
   we would have to rewrite the definition of "number" as such::

      number     = _INT _FRAC? _EXP? ~
        _INT     = _MINUS? ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = _DOT /[0-9]+/
        _EXP     = (_Ecap|_Esmall) [_PLUS|_MINUS] /[0-9]+/
        _MINUS   = `-`
        _PLUS    = `+`
        _DOT     = `.`
        _Ecap    = `E`
        _Esmall  = `e`

   A simpler alternative to this technique would be to make use of
   the fact that the document-parts captured by regular expressions
   are not dropped (although regular expressions can also be listed
   in the @drop-directive, if needed) and that at the same time
   delimiters are almost always simple strings containing keywords
   or punctuation characters. Thus, one only needs to rewrite those
   string-expressions that capture data as regular expressions::

      number     = _INT _FRAC? _EXP? ~
        _INT     = /[-]/ ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = /[.]/ /[0-9]+/
        _EXP     = (/E/|/e/) [/[-+]/] /[0-9]+/

2. Assigning all delimiter strings to symbols and dropping the nodes
   and content captured by these symbols. This means doing exactly
   the opposite of the first solution. Here is an excerpt of what
   a JSON-grammar employing this technique would look like::

      @disposable = /_\w+/
      @drop = whitespace, _BEGIN_ARRAY, _END_ARRAY, _KOMMA, _BEGIN_OBJECT, ...
      ...
      array = _BEGIN_ARRAY ~ ( _element ( _KOMMA ~ _element )* )? §_END_ARRAY ~
      ...

   It is important that all symbols listed for dropping are also made
   disposable, either by listing them in the disposable-directive as well
   or by using names that the regular-expression for disposables matches.
   Otherwise, DHParser does not allow dropping the content of named nodes,
   because the default assumption is that symbols in the grammar are
   defined to capture meaningful parts of the document that contain
   relevant data.

3. Bailing out and leaving the further simplification of the syntax-tree
   to the next tree-processing stage which, if you follow DHParser's suggested
   usage pattern, is the abstract-syntax-tree-transformation proper
   and which allows for a much more fine-grained specification of
   transformation rules. See :py:mod:`transform`.

To round this section up, we present the full grammar for a streamlined
JSON-Parser according to the first solution-strategy. Observe that the
values of "bool" and "null" are now defined with regular expressions
instead of string-literals, because the latter would be dropped on account
of the ``@drop = ... strings, ...``-directive, leaving an empty named node
without a value, whenever a bool value or null occurs in the input::

    >>> json_gr = '''
    ...     @disposable = /_\\\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member     = string ":" ~ _element
    ...     array      = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string     = `"` _CHARS `"` ~
    ...       _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/ ~ | /false/ ~
    ...     null       = /null/ ~                                  '''
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp ".")
        (:RegExp "0"))
      (string "a string"))

This time the data is not distorted any more. One oddity remains, however: We
are most probably not interested in the fact that the number 2.0 consists of
three components, each of which has been captured by a regular expression.
Luckily, there exists yet another directive that allows reducing the tree
further by merging adjacent anonymous leaf-nodes::

    >>> json_gr = '@reduction = merge \\n' + json_gr
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr())
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Merging adjacent anonymous leaf-nodes takes place after the @drop-directive
comes into effect. It should be observed that merging only produces the desired
result, if any delimiters have been dropped previously, because otherwise
delimiters would be merged with content. Therefore, the ``@reduction = merge``-
directive should at best only be applied in conjunction with the ``@drop`` and
``@disposable``-directives.

Applying any of the tree-reductions described here (or "simplifications" for
that matter) requires a bit of careful planning concerning which nodes
will be named and which nodes will be dropped. This, however, pays off in
terms of speed and a considerably simplified abstract-syntax-tree-generation
stage, because most of the unnecessary structure of concrete-syntax-trees
has already been eliminated at the parsing stage.

.. _comments_and_whitespace:

Comments and Whitespace
-----------------------

Why whitespace isn't trivial
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Handling whitespace in text-documents is not at all trivial, because
whitespace can serve several different purposes and there can be
different kinds of whitespace: Whitespace can serve a syntactic function
as delimiter. But whitespace can also be purely aesthetic to render
a document more readable.

Depending on the data model, whitespace can be considered as
significant and be included in the data or as
insignificant and be excluded from the data and only be re-inserted
when displaying the data in a human-readable-form. (For example, one
can model a sentence as a sequence of words and spaces or just as
a sequence of words.) Note that "significance" does not correlate
with the syntactic or aesthetic function, but only depends on whether
you'd like to keep the whitespace in your data or not.

There can be different kinds of whitespace with different meaning
(and differing significance). For example, one can make a difference
between horizontal whitespace (spaces and tabs) and vertical
whitespace (including linefeeds). And there can be different sizes
of whitespace with different meaning. For example in LaTeX, a single
linefeed still counts as plain whitespace while an empty line (i.e.
whitespace including two or more linefeeds) signals a new
paragraph.

Finally, even the position of whitespace can make a difference.
A certain number of whitespaces at the beginning of a line can
have the meaning of "indentation" (as in Python code) while at
the end of the line or between brackets it is just plain
insignificant whitespace. (This is actually something where
the boundaries of the EBNF-formalism become visible and you'd
probably use a preprocessor or some kind of "semantic actions"
to handle such cases. There is some support for either of these
in DHParser.)

Coding significant Whitespace in EBNF-Grammars
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonable approach to coding whitespace is to use one
particular symbol for each kind of whitespace. Those kinds of
whitespace that are insignificant, i.e. that do not need to
appear in the data, should be dropped from the syntax-tree.
With DHParser this can be done already while parsing, using
the ``@disposable`` and ``@drop``-directives described earlier.

But let's first look at an example which only includes significant
whitespace. The following parser parses sequences of paragraphs which
consist of sequences of sentences which consist of sequences
of main clauses and subordinate clauses which consist of sequences
of words::

    >>> text_gr = '''
    ...     @ disposable = /_\\\\w+/
    ...     document       = PBR* S? paragraph (PBR paragraph)* PBR* S? _EOF
    ...       _EOF         = /$/
    ...     paragraph      = sentence (S sentence)*
    ...     sentence       = (clause _c_delimiter S)* clause _s_delimiter
    ...       _c_delimiter = KOMMA | COLON | SEMICOLON
    ...       _s_delimiter = DOT | QUESTION_MARK | EXCLAMATION_MARK
    ...     clause         = word (S word)*
    ...     word           = /(?:[A-Z]|[a-z])[a-z']*/
    ...     DOT            = `.`
    ...     QUESTION_MARK  = `?`
    ...     EXCLAMATION_MARK = `!`
    ...     KOMMA          = `,`
    ...     COLON          = `:`
    ...     SEMICOLON      = `;`
    ...     PBR            = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/
    ...     S              = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/ '''

Here, we have two types of significant whitespace ``PBR`` ("paragraph-break") and ``S``
("space"). Both types allow for a certain amount of flexibility, so that two
whitespaces of the same type do not need to have exactly the same content, but
we could always normalize these whitespaces in a subsequent transformation step.

Two typical design patterns for significant whitespace are noteworthy here:

1. Both whitespaces match only if there was at least one whitespace character.
   We may allow whitespace to be optional (as at the beginning and end of the
   document), but if the option has not been taken, we don't want to see an empty
   whitespace-tag in the document, later on.
   (For insignificant whitespace, the opposite convention can be more convenient,
   because, typically, insignificant whitespace is dropped anyway, whether it's
   got content or not.)

2. The grammar is constructed in such a way that the whitespace always appears
   *between* different elements at the same level, but not after the last or
   before the first element. The whitespace after the last word of a sentence
   or before the first word of a sentence is really whitespace between
   two sentences. If we pick out a sentence or a clause, we will have no
   dangling whitespace at its beginning or end.
   (Again, for soon to be dropped insignificant whitespace, another convention
   can be more advisable.)

Let's just try our grammar on an example::

    >>> text_example = '''
    ... I want to say, in all seriousness, that a great deal of harm is being
    ... done in the modern world by belief in the virtuousness of work, and that
    ... the road to happiness and prosperity lies in an organized diminution of
    ... work.
    ...
    ... First of all: what is work? Work is of two kinds: first, altering the
    ... position of matter at or near the earth's surface relatively to other
    ... such matter; second, telling other people to do so. The first kind is
    ... unpleasant and ill paid; the second is pleasant and highly paid.'''
    >>> text_parser = create_parser(text_gr, 'Text')
    >>> text_as_data = text_parser(text_example)
    >>> sentence = text_as_data.pick(\
            lambda nd: nd.tag_name == "sentence" and nd.content.startswith('First'))
    >>> print(sentence.as_sxpr())
    (sentence
      (clause
        (word "First")
        (S " ")
        (word "of")
        (S " ")
        (word "all"))
      (COLON ":")
      (S " ")
      (clause
        (word "what")
        (S " ")
        (word "is")
        (S " ")
        (word "work"))
      (QUESTION_MARK "?"))

Again, it is a question of design whether we leave whitespace in the data or
not. Leaving it has the advantage that serialization becomes as simple as
printing the content of the data-tree::

    >>> print(sentence)
    First of all: what is work?

Otherwise one would have to program a dedicated serialization routine. Especially
if you receive data from a different source, you'll appreciate not having to
do this - and so will other people receiving your data. Think about it! However,
dropping the whitespace will yield more concise data.

Coding Comments
^^^^^^^^^^^^^^^

Allowing comments in a domain-specific language almost always makes sense,
because it allows users to annotate the source texts while working on them
and to share those comments with collaborators. From a technical point of
view, adding comments to a DSL raises two questions:

1. At what places shall we allow the insertion of comments in the source code?
   Common answers are: a) at the end of a line, b) almost everywhere, or
   c) both.

2. How do we avoid pollution of the EBNF-grammar with comment markers?
   It already curtails readability that we have to put whitespace
   symbols in so many places. And speaking of comments at the end of
   the line: If linefeeds aren't important for us - as in our toy-grammar
   for prose-text, above - we probably wouldn't want to reframe our
   grammar just to allow for end-of-line comments.

Luckily, there exists a simple and highly intuitive solution that takes
care of both of these concerns: We admit comments wherever whitespace
is allowed. And we code this by defining a symbol that means: "whitespace
and, optionally, a comment".

Let's try this with our prose-text-grammar. In order to do so, we have
to define a symbol for comments, a symbol for pure whitespace, and,
finally, a symbol for whitespace with optional comment. Since, in
our grammar, we actually have two kinds of whitespace, ``S`` and ``PBR``,
we'll have to redefine both of them. As delimiters for comments, we
use curly braces::

    >>> wsp_gr = '''
    ...     PBR      = pure_PBR COMMENT (pure_PBR | pure_S)?
    ...              | (pure_S? COMMENT)? pure_PBR
    ...     S        = pure_S COMMENT pure_S? | COMMENT? pure_S
    ...     COMMENT  = /\{[^}]*\}/
    ...     pure_PBR = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/
    ...     pure_S   = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/'''

As can be seen, the concrete re-definition of the whitespace tokens
requires a bit of careful consideration, because we want to allow
additional whitespace next to comments, but at the same time avoid
ending up with two whitespaces in sequence in our data. Let's see if
we have succeeded::

    >>> extended_text_gr = text_gr[:text_gr.rfind(' PBR')] + wsp_gr
    >>> extended_parser = create_parser(extended_text_gr, 'Text')
    >>> syntax_tree = extended_parser('What {check this again!} is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> print(syntax_tree.pick('clause').as_sxpr())
    (clause
      (word "What")
      (S
        (pure_S " ")
        (COMMENT "{check this again!}")
        (pure_S " "))
      (word "is")
      (S
        (pure_S " "))
      (word "work"))

We will not worry about the finer sub-structure of the S-nodes right now. If
we are not interested in the comments, we could use the ``@disposable``,
``@drop`` and ``@reduction = merge``-directives to simplify these at the
parsing stage. Or, we could extract the comments and normalize the whitespace
at a later tree-processing stage. For now, let's just check whether our
comments work as expected::

    >>> syntax_tree = extended_parser('What{check this again!} is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What {check this again!}is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What{check this again!}is work?')
    >>> print(syntax_tree.errors[0])
    1:24: Error (1040): Parser "pure_S" did not match: »is work?«

The last error was to be expected, because we did not allow comments
to serve as substitutes for whitespace. Let's check whether putting comments
near paragraph breaks works as well::

    >>> test_text = '''Happiness lies in the diminuniation of work.
    ...
    ... { Here comes the comment }
    ...
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.children))
    paragraph PBR paragraph
    >>> test_text = '''Happiness lies in the diminuniation of work.
    ... { Here comes the comment }
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.children))
    paragraph

The last result might look surprising at first, but since a paragraph
break requires at least one empty line as a separator, the input text
is correctly understood by the parser as a single paragraph with
two sentences separated by a single whitespace which, incidentally,
contains a comment::

    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('paragraph').children))
    sentence S sentence
    >>> print(syntax_tree.pick('paragraph')['S'].as_sxpr(flatten_threshold=0))
    (S
      (pure_S
        ""
        "")
      (COMMENT "{ Here comes the comment }")
      (pure_S
        ""
        ""))

A common problem with whitespace is that it tends to pollute
the Grammar, because whereever you'd like to allow whitespace,
you'd have to insert a symbol for whitespace. The same problem
existis when it comes to allowing comments, because you'd
probably allow to insert comments in as many places as possible.

DHParser's support for insignificant whitespace and comments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Coding insignificant whitespace and comments is exactly the
same as coding significant whitespace and comments and does not
need to be repeated here. (The combination of insignificant
whitespace and significant comments is slightly more complicated,
and probably best outsourced to some degree to the post-parsing
processing stages. It will not be discussed here.) However,
DHParser offers some special support for insignificant
whitespace and comments, which can make working with these
easier in some cases.

First of all, DHParser has a special dedicated token for
insignificant whitespace, which is the tilde ``~``-character.
We have seen this earlier in the definition of the json-Grammar.

The ``~``-whitespace-marker differs from the usual pattern for
defining whitespace in that it is implicitly optional, or what
amounts to the same, it matches the empty string. Normally,
it is to be considered bad practice to define a symbol as
optional. Rather, a symbol should always match something and
only at the places where it is used, it should be marked as
optional. If this rule is obeyed, it is always easy to tell
whether some element is optional or not at a specific place
in the Grammar. Otherwise, it can become quite confusing
indeed. However, since the tilde character is usually used
very often, it is more convenient not to mark it with a
question-mark or, if you use classical EBNF-syntax, to enclose
it with square brackets.
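
To illustrate the difference with a sketch (assuming hypothetical
symbols ``word`` and ``WS``, where ``WS`` is a conventionally defined,
non-optional whitespace-symbol), the following two definitions allow
the same optional whitespace between two words::

    with_symbol = word [WS] word
    with_tilde  = word ~ word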

The default regular expression for the tilde-whitespace captures
arbitrarily many spaces and tabs and at most one linefeed, but
not an empty line (``[ \t]*(?:\n[ \t]*)?(?!\n)``), as this is
the most convenient way to define whitespace for text-data.
However, the tilde whitespace can also be defined with any
other regular expression with the ``@whitespace``-directive.

Let's go back to our JSON-grammar and define the optional
insignificant whitespace marked by the tilde-character in such a
way that it matches any amount of horizontal or vertical
whitespace, which makes much more sense in the context of json
than the default tilde-whitespace that is restricted vertically
to at most a single linefeed::

    >>> testdata = '{"array": [1, 2.0, "a string"], \\n\\n\\n "number": -1.3e+25, "bool": false}'
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors[0])
    1:32: Error (1010): 'member' expected by parser 'object', but » \\n \\n \\n  "numb...« found instead!
    >>> json_gr = '@whitespace = /\\\\s*/ \\n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors)
    []

When redefining the tilde-whitespace, make sure that your regular expression
also matches the empty string! There is no need to worry that the syntax tree
gets cluttered by empty whitespace-nodes, because tilde-whitespace always
yields anonymous nodes and DHParser drops empty anonymous nodes right away.

Comments can be defined using the ``@comment``-directive. DHParser automatically
intermingles comments and whitespace so that wherever tilde-whitespace is
allowed, a comment defined by the ``@comment``-directive is also allowed::

    >>> json_gr = '@comment = /#[^\\\\n]*(?:\\\\n|$)/ \\n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> testdata = '''{"array": [1, 2.0, "a string"], # a string
    ...                "number": -1.3e+25,  # a number
    ...                "bool": false}  # a bool'''
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr(compact = True))
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Since the json-grammar still contains the ``@drop = whitespace, ...``-
directive from earlier on (next to other tree-reductions), the comments
have been nicely dropped along with the tilde-whitespace.

There is one caveat: When using comments alongside whitespace that
captures at most one linefeed, the comments should be defined in such
a way that the last character of a comment is never a linefeed.

Also, a few limitations of the tilde-whitespace and directive-defined
comments should be kept in mind: 1. Only one kind of insignificant
whitespace can be defined this way. If there are more kinds of
insignificant whitespace, all but one need to be defined conventionally
as part of the grammar. 2. Both directive-defined comments and
tilde-whitespace can only be defined by a regular expression. In
particular, nested comments are impossible to define with regular
expressions alone.
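
Regarding the caveat above, a comment-definition that stops just short
of the linefeed satisfies it, because the linefeed is then left over for
the adjacent whitespace to capture::

    @comment = /#[^\n]*/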

However, using tilde-whitespace has yet one more benefit: With the
tilde-whitespace, cluttering of the grammar with whitespace-markers
can be avoided by adding implicit whitespace adjacent to string-literals.
Remember the definition of the JSON-Grammar earlier. If you look at
a definition like: ``object = "{" ~ member ( "," ~ §member )* "}" ~``,
you'll notice that there are three whitespace markers, one next to
each delimiter. Naturally so, because one usually wants to allow users
of a domain specific language to put whitespace around delimiters.

You may wonder why the tilde appears only on the right hand side
of the literals, although you'd probably like to allow whitespace
on both sides of a literal like "{". But if you look at the grammar
closely, you'll find that almost every symbol definition ends
either with a tilde sign or a symbol the definition of which ends
with a tilde sign, which means that they allow whitespace on the
right hand side. Now, if all elements of the grammar allow
whitespace on the right hand side, they automatically also have
whitespace on the left-hand side, too, which is, of course, the
whitespace on the right hand side of the previous element.

In order to reduce cluttering the grammar with tilde-signs, DHParser
allows turning on implicit tilde-whitespace adjacent to any
string literal with the directive ``@ literalws = right`` or
``@ literalws = left``. As the argument of the directive suggests,
whitespace is either "eaten" at the right hand side or the left
hand side of the literal. String literals can either be
enclosed in double quotes "..." or single quotes '...'. Both
kinds of literals will have implicit whitespace, if the
``@literalws``-directive is used.

(Don't confuse implicit whitespace
with insignificant whitespace: Insignificant whitespace is whitespace
you do not need any more after parsing. Implicit whitespace is
whitespace you do not denote explicitly in the grammar. It's
a speciality of DHParser and DHParser allows only the insignificant
whitespace denoted by the tilde-character to be declared as
"implicit".)

If the whitespace adjacent to literals is declared as implicit with the
``@literalws``-directive, the expression::

    object     = "{" ~ member ( "," ~ §member )* "}" ~

can be written as::

    object     = "{" member ( "," §member )* "}"

which is easier to read.

For situations where implicit whitespace is not desired, DHParser
has a special kind of string literal, written with backticks, which
never carries any implicit whitespace. This is important, when
literals are used for signs that enclose content, like the quotation
marks for the string literals in our JSON-Grammar::

    string     = `"` _CHARS '"'  # mind the difference between `"` and '"'!

Regular expressions, also, never carry implicit whitespace.
So, if you are using regular expressions as delimiters, you
still have to add the tilde character, if adjacent insignificant
whitespace is to be allowed::

    bool       = /true/~ | /false/~

The complete json-grammar now looks like this::

    >>> json_gr = '''
    ...     @disposable = /_\\\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     @reduction = merge
    ...     @whitespace= /\\\\s*/
    ...     @comment   = /#[^\\\\n]*(?:\\\\n|$)/
    ...     @literalws = right
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" member ( "," §member )* "}"
    ...     member     = string ":" _element
    ...     array      = "[" ( _element ( "," _element )* )? "]"
    ...     string     = `"` _CHARS '"'
    ...       _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/~ | /false/~
    ...     null       = /null/~                                    '''
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree_ = json_parser(testdata)
    >>> assert syntax_tree_.equals(syntax_tree)

The whitespace defined by the ``@whitespace``-directive can be accessed from
within the grammar via the name ``WHITESPACE__``. Unlike the tilde-sign,
this name refers to the pure whitespace that is not intermingled with
comments. Similarly, comments defined by the ``@comment``-directive can
be accessed via the symbol ``COMMENT__``.
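
This allows, for example, sketching a rule for a sequence of comments
separated by pure whitespace, without resorting to the tilde-sign::

    comments = COMMENT__ (WHITESPACE__ COMMENT__)*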

Lookahead and Lookbehind
------------------------

Lookahead and lookbehind operators are a convenient way to resolve or rather
avoid ambiguities while at the same time keeping the DSL lean. Assume for
example a simple DSL for writing definitions like::

    >>> definitions = '''
    ...     dog   := carnivorous quadrupel that barks
    ...     human := featherless biped'''

Now, let's try to draw up a grammar for "definitions"::

    >>> def_DSL_first_try = ''' # WARNING: This grammar doesn't work, yet!
    ...     @literalws  = right
    ...     definitions = ~ definition { definition } EOF
    ...     definition  = definiendum ":=" definiens
    ...     definiendum = word
    ...     definiens   = word { word }
    ...     word        = /[a-z]+|[A-Z][a-z]*/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL_first_try, "defDSL")

Parsing our example with the generated parser yields an error, however::

    >>> syntax_tree = def_parser(definitions)
    >>> for e in syntax_tree.errors_sorted:  print(e)
    3:11: Error (1040): Parser "word->/[a-z]+|[A-Z][a-z]*/" did not match: »:= featherless biped«

The reason for this error is that the parser ``definiens`` captures as many
words as occur in a sequence, including the definiendum of the next definition
which is the word "human". But then the next definition does not find its
definiendum any more, because it has already been captured. (All this may not
easily become clear from the error message itself, but can easily be found
out by using the post-mortem debugger of module :py:mod:`trace`.)

A common technique to avoid this problem would be to introduce an
end-of-statement marker, for example, a semicolon ";". A more elegant way to
solve the problem in this case is to make use of the fact that if a word is
followed by the definition sign ":=" it cannot be part of the definiens
any more, but must be a definiendum. This can be encoded by using the
negative look-ahead operator "!"::

    >>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
    ...                                     'definiens   = word { word !":=" }')
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser(definitions)
    >>> for d in syntax_tree.select('definition'):
    ...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
    A dog    is a carnivorous quadrupel that barks
    A human  is a featherless biped

The statement ``word !":="`` is a sequence of a ``word`` and a negative lookahead.
This whole sequence only matches if ``word`` matches and the negative lookahead
matches, which is only the case if the following text cannot be matched by ":=".

We could have achieved the same effect with a positive lookahead by checking
whether any of the possible follow-up-sequences of the parser ``definiens`` ensues::

    >>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
    ...                                     'definiens   = word { word &(word|EOF) }')
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser(definitions)
    >>> for d in syntax_tree.select('definition'):
    ...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
    A dog    is a carnivorous quadrupel that barks
    A human  is a featherless biped

Generally, lookahead operators, whether positive or negative, never capture any
text. They merely match or fail depending on whether the following parser would
match or would fail to match the next piece of text. The positive lookahead
matches if the parser would match. The negative operator matches if it would
fail. Lookahead operators can also be thought of as a boolean condition on the
following text, where the positive lookahead operator "&" resembles an "and",
and the negative lookahead operator an "and not". As in our example, these
operators are very helpful for "exploring" the surroundings of a piece of text
to be captured by a parser. They allow parsers to match or fail depending on
the ensuing text.

A negative lookahead expression can also serve to encode the meaning of
"without" if placed in front of another expression. Let's rewrite our
grammar of a definitions-DSL so as to exclude certain bad words::

    >>> def_DSL = def_DSL[:def_DSL.find('word        =')] + '''
    ...     word        = !forbidden /[a-z]+|[A-Z][a-z]*/~
    ...     forbidden   = /[sf][a-z][a-z][a-z]/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser('nice := nice word')
    >>> print(syntax_tree)
    nice := nice word
    >>> syntax_tree = def_parser('sxxx := bad word')
    >>> print(str(syntax_tree).strip())
    <<< Error on "sxxx := bad word" | Parser "definitions" did not match: »sxxx := bad word« >>>

The same effect can be achieved by using the subtraction operator "-". This
is just syntactic sugar to make the use of the negative lookahead operator
in the sense of "without" more intuitive::

    >>> def_DSL = def_DSL[:def_DSL.find('word        =')] + '''
    ...     word        = all_words - forbidden
    ...     all_words   = /[a-z]+|[A-Z][a-z]*/~
    ...     forbidden   = /[sf][a-z][a-z][a-z]/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser('sxxx := bad word')
    >>> print(str(syntax_tree).strip())
    <<< Error on "sxxx := bad word" | Parser "definitions" did not match: »sxxx := bad word« >>>

Next to the lookahead operators, there also exist lookback operators. Be warned,
though, that lookback operators are an **experimental** feature in DHParser
and that their implementation is highly idiosyncratic, that is, it is most
likely not compatible with any other parser-generator-toolkit based on EBNF-grammars.
Also, lookback operators in DHParser are more restricted than lookahead-operators.
They can only be used in combination with simple text or regular expression parsers
and - here comes the idiosyncratic part - they work in the opposite direction.
This means that if you want to check whether a parser is preceded, say, by the
keyword "BEGIN", the text phrase that you have to check for with the lookback
parser is actually "NIGEB". If that still does not put you off, here is how
lookback-operators are used: Let's assume that our definition should not only
allow for a definiens but, alternatively, for enumerations, and that the difference
is indicated by using a simple equal sign "=" instead of the definition symbol
":=". Then, using lookback-operators to distinguish the cases, we can rewrite our
grammar as follows::

    >>> def_DSL_extended = '''
    ...     @literalws  = right
    ...     definitions = ~ definition { definition } EOF
    ...     definition  = definiendum (":=" | "=") (definiens | enumeration)
    ...     definiendum = word
    ...     definiens   = <-& /\s*=:/ word { word &(word|EOF) }
    ...     enumeration = <-& /\s*=[^:]/ word { word &(word|EOF) }
    ...     word        = /[a-z]+|[A-Z][a-z]*/~
    ...     EOF         = /$/ '''
    >>> definitions = '''
    ...     dog   := carnivorous quadrupel that barks
    ...     drinks = water beer juice
    ...     human := featherless biped'''
    >>> def_parser = create_parser(def_DSL_extended, "defDSL")
    >>> syntax_tree = def_parser(definitions)
    >>> print(str(syntax_tree.pick('enumeration')).strip())
    water beer juice

The lookback operators are ``<-&`` for the positive lookback and ``<-!`` for the
negative lookback, each of which must be followed by a regular expression or a string.
Of course, this example is rather contrived, and the grammar can easily be rewritten
without the lookback-operators.
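
Since only the positive lookback ``<-&`` has been demonstrated so far, here
is a sketch (not a doctest) of how the negative lookback could be used
instead: In the grammar above, whatever is not preceded by ":=" must be
preceded by a simple "=", so the ``enumeration``-rule could just as well
have been written with ``<-!``::

    enumeration = <-! /\s*=:/ word { word &(word|EOF) }

Remember that the regular expression is applied in the reverse direction,
which is why it reads "=:" rather than ":=".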


Locating errors and customizing error messages
----------------------------------------------

Providing users with proper error information is one of the most tenacious
problems when implementing a parser for a domain-specific language.
There are three different challenges:

1. Locating the error at the correct position in the source code.
2. Providing proper error messages that explain the reason for the error.
3. Resuming the parsing process after an error has occurred at the nearest
   possible place without producing artificial follow-up errors.

In the following, DHParser's techniques for the first two challenges,
locating errors and customizing error messages, will be described.
Techniques for resuming the parsing process after an error occurred
or for passing by erroneous passages in the source code will be
explained below, under the heading "Fail-tolerant Parsing".

Farthest-Fail-Heuristics
^^^^^^^^^^^^^^^^^^^^^^^^

Without adding any hints to the grammar, DHParser applies only a
very basic technique for locating the error if the grammar does
not match a text, which is known as "farthest failure": It locates
the error at the "farthest" position where a parser failed,
reporting the last named parser in the call chain (that first reached
this location) as the cause of the failure. This approach often works
surprisingly well for locating errors, unless the grammar relies too
heavily on regular expressions capturing large chunks of text, because
the error location works only on the level of the parsing expression
grammar, not on that of the atomic regular expressions. To see how
the farthest-fail heuristics works, consider a parser for simple
arithmetic expressions::

    >>> arithmetic_grammar = '''@ literalws = right
    ... arithmetic = expression EOF
    ... expression = ~ term { ("+" | "-") term }
    ... term       = factor { ("*" | "/") factor }
    ... factor     = number | group
    ... group      = "(" expression ")"
    ... number     = /\\\\d+/~
    ... EOF        = /$/'''
    >>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
    >>> terms = arithmetic('(2 - 3 * (4 + 5)')
    >>> print(terms.errors[0])
    1:17: Error (1040): Parser "term->`*`" did not match: »«
    >>> terms = arithmetic('(2 - 3) * ( )')
    >>> print(terms.errors[0])
    1:13: Error (1040): Parser "number->/\\\\d+/" did not match: »)«

As can be seen, the location of the error is captured well enough,
at least when we keep in mind that the computer cannot guess where
we would have placed the forgotten closing bracket. It can only
report the point where the mistake becomes apparent.

However, the reported fact that it was the sub-parser \`*\` of
parser term that failed at this location does little to enlighten
us with respect to the cause of the failure. The "farthest fail"-method
as implemented by DHParser yields the first parser (of possibly several)
that has been tried at the position where the farthest fail occurred.
Thus, in this case, a failure of the parser capturing \`*\` is reported
rather than of the parser expression->\`+\`. Changing this by reporting the
last parser or all parsers that failed at this location would
do little to remedy this situation, however. In this example,
it would be just as confusing to learn that expression->\`+\` failed
at the end of the parsed string.

Marking mandatory items with "§"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Thus, "farthest fail"-method is not very suitable for explaining
the failure or pinpointing which parser really was the culprit.
Therefore, DHParser provides a simple annotation that allows to
Eckhart Arnold's avatar
Eckhart Arnold committed
1154
1155
raise a parsing error deliberately, if a certain point in the
chain of parsers has not been reached: By placing the "§"-sign
as a "mandatory-marker" in front of a parser, the parser as well
as all subsequent parsers in the same sequence will not simply
return a non-match when failing, but it will cause the entire
parsing process to stop and report an error at the location
of failure::

    >>> arithmetic_grammar = arithmetic_grammar.replace(
    ...    'group      = "(" expression ")"',
    ...    'group      = "(" § expression ")"')
    >>> arithmetic = create_parser(arithmetic_grammar, "arithmetic")
    >>> terms = arithmetic('(2 - 3 * (4 + 5)')
    >>> print(terms.errors[0])
    1:17: Error (1010): '`)` ~' expected by parser 'group', but »...« found instead!
    >>> terms = arithmetic('(2 - 3) * ( )')
    >>> print(terms.errors[0])
    1:13: Error (1010): 'expression' expected by parser 'group', but »)...« found instead!

The error messages give a much better indication of the cause of the
error. What is reported as cause is either the name of the parser that was
expected to match, as in the second case, or the rule of the parser in case
of unnamed parsers, as in the first case. This usually, though unfortunately not
always, yields a much better indication of the location and cause of an
error than the farthest failure.

However, a little care has to be taken not
to place the mandatory marker in front of a parser that might fail at a location
that could still be reached and matched by another branch of the grammar.
(In our example it is clear that round brackets enclose only groups. Thus,
if the opening round bracket has matched, we can be sure that what follows
must be an expression followed by a closing round bracket, or, if not, it is
a mistake.) Luckily, although this may sound complicated, in practice it
never is. Unless your grammar is very badly structured, you will hardly
ever make this mistake, and if you do, you will notice soon enough.
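
To illustrate the pitfall with a hypothetical grammar fragment (the symbols
``value``, ``array`` and ``slice`` are made up for this example)::

    value  = array | slice
    array  = "[" § number { "," number } "]"
    slice  = "[" number ":" number "]"

Here the §-marker inside ``array`` would be misplaced: since ``slice`` also
starts with "[", a failure behind the opening bracket would abort the parsing
process before the alternative ``slice`` has even been tried.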

Also, there is an important restriction: There is only one §-marker
allowed per named parser. In case you have a long EBNF-expression on the
right hand side of a symbol-definition, where you'd like to use the
§-marker at more than one place, you can, however, always split it into
several expressions by introducing new symbols. These symbols, if they
serve no other purpose, can be marked as disposable with the
``@ disposable``-directive (see :ref:`simplifying_syntax_trees`).
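
For example (a hypothetical grammar fragment, not part of the doctests
above): if you wanted to mark both delimiters in
``pair = key §":" value §","`` as mandatory, which the restriction to a
single §-marker forbids, the expression could be split like this::

    pair       = key §":" pair_tail
    pair_tail  = value §","

where ``pair_tail`` would be a candidate for the ``@ disposable``-directive
if it serves no other purpose.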

The §-marker has proven to be a very simple means of pinpointing errors
in the DSL-code, and I recommend to use it from early on in the process of
developing a new grammar. Plus, the §-marker offers two further benefits,
namely customizing error messages and resuming the parsing process after
a failure. The latter is particularly helpful if the DSL is to be
used with an integrated development environment, which benefits greatly
from fail-tolerant parsing. However, I recommend to start using these
only after the grammar has reached a certain amount of maturity, because
changing the grammar often requires re-adjusting customized error messages
and resume-clauses as well, which can become tedious.

Customizing error messages
^^^^^^^^^^^^^^^^^^^^^^^^^^

While the error messages produced by the use of the §-marker are often
quite understandable for the engineer designing the grammar of a DSL,
they might not be so for the user of the DSL, who might not know the names
of the parsers of the grammar, let alone the expressions of the unnamed
parsers, and will therefore not always be able to make much sense of
error messages that report just these.

In order to customize error messages, the symbol-related directive
``@ SYMBOLNAME_error = CONDITION, ERROR_STRING`` is used. The directive's
name consists of the name of a symbol that contains a §-marker and the
appendix ``_error``. The directive always takes two arguments, separated
as usual by a comma, of which the first is a condition-expression and
the second an error message. The condition can be used to make
the choice of an error-message dependent on the text following the
point of failure. It can either be a regular expression or a simple
string which must match (or be equal to, in the case of the string)
the first part of the text at the position where the parser defined
by the symbol failed to match and raised an error. Only if the
condition matches will the error message given as the second argument
be emitted. Otherwise, the fallback error-expression described above
("... expected by parser ...") will be shown. The empty string ``''``
can be used as a fallback if the customized message shall always be
emitted, no matter what the following text looks like.

The error string is a format string that may include any of the
two arguments ``{0}`` or ``{1}``, where ``{0}`` will be replaced by
the name or string representation of the parser that was expected
to match but didn't, and ``{1}`` will be replaced by the first twenty
or so letters of the unmatched rest of the text. Here is a simple
example that could be part of a JSON-parser that is intended to
deliver understandable error-messages::

    >>> grammar = '''
    ... @ string_error  = '', 'Illegal character(s) »{1}« in string.'
    ... string          = `"` §characters `"` ~
    ... characters      = { plain | escape }
    ... plain           = /[^"\\\\\\]+/
    ... escape          = /\\\\\\[\\\\/bnrt\\\\\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> print(json_string('"alpha"'))
    "alpha"
    >>> for e in json_string('"al\\pha"').errors:  print(e)
    1:4: Error (1010): Illegal character(s) »\pha"...« in string.
    1:4: Error (1040): Parser "string" stopped before end, at:  \pha"  Terminating parser.

Customized error-messages must always be specified in the grammar
before the definition of the symbol they are related to, and they can
be stacked. That is, several different error-directives with
different conditions and messages but related to the same symbol
can be specified. The conditions are evaluated in the order the
error-directives appear in the grammar, and the error message
of the first matching condition is picked. Therefore, the more
specific conditions should always be placed first and the more
general or fallback conditions should be placed below these::

    >>> grammar = ("@ string_error  = /\\\\\\/, 'Illegal escape sequence »{1}« "
    ...            "Allowed values are b,n,r,t,u'") + grammar
    >>> json_string = create_parser(grammar, 'json_string')
    >>> for e in json_string('"al\\pha"').errors:  print(e)
    1:4: Error (1010): Illegal escape sequence »\pha"...« Allowed values are b,n,r,t,u
    1:4: Error (1040): Parser "string" stopped before end, at:  \pha"  Terminating parser.

Here, the more specific and more understandable error message
has been selected. Careful readers might notice that the
more general customized error message "Illegal character(s)
... found in string" will now only be selected if the
string contains a character that not even the regular-expression
engine recognizes, because the only other characters that
are not allowed within the string are the closing quotation
marks that terminate the string, and these do not cause the
parser to fail (but only to terminate too early).

Also, it might be noticed that the errors are always caused
by a failure to match the second ``"``-sign, because the
characters-parser also matches the empty string and thus
never fails or raises any error. Nonetheless, the error
can occur in the interior of the string and can - with
the help of customized error messages - be described as such
and properly be located.

However, emitting the right error messages based on a
regular-expression-condition is not quite easy, and sometimes customized
error messages can even be more confusing for the users of the
DSL. My recommendation is to wait for user feedback or to monitor
the errors that users typically make and then to customize the
error messages to the actual needs of the users to help them
understand why the computer refuses to parse a certain construct.


Fail-tolerant Parsing
---------------------

A serious limitation of all previously described error-handling
mechanisms is that the parsing process still stops on the very
first error. This is particularly annoying for beginners learning
to code data or program code with a new DSL, because the compiler
must be run at least as many times as there are errors in the code to
find all of them. It would be much better to receive a list of
all or at least most errors on the first run. And, finally,
modern development environments can make the most of the incremental
compilation made possible by fail-tolerant parsers.

A generic method for fail-tolerant parsing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are a number of techniques for fail-tolerant parsing. One
technique that is not specific to DHParser but can be used with
any parser-generator is to add junctions for possibly erroneous
code to the grammar::

    >>> grammar = '''
    ... string          = `"` ([characters] `"` | string_error [`"`]) ~
    ...   string_error  = /[^"]*/
    ... characters      = { plain | escape }+
    ... plain           = /[^"\\\\\\]+/
    ... escape          = /\\\\\\[\\\\/bnrt\\\\\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> tree = json_string('"al\\pha"')
    >>> print(tree.as_sxpr(flatten_threshold=0))
    (string
      (:Text '"')
      (string_error "al\pha")
      (:Text '"'))

This time the parser did not need to stop at the erroneous part. The erroneous
part itself has been caught within a node that betrays only by its name
that there was an error. To produce error messages, we have to add them
explicitly to such nodes afterwards::

    >>> for string_error in tree.select('string_error'):
    ...     _ = tree.new_error(string_error, 'Fehler im String: ' + str(string_error))
    >>> for e in tree.errors:  print(e)
    1:2: Error (1000): Fehler im String: al\pha

Unfortunately, the error location is not very precise. This can be remedied
by refining our error junction code::

    >>> grammar = '''
    ... string          = `"` ([characters] `"` | string_error [`"`]) ~
    ...   string_error    = [characters] { ups [characters] }+
    ...   ups             = /[^"]/
    ... characters      = { plain | escape }+
    ... plain           = /[^"\\\\\\]+/
    ... escape          = /\\\\\\[\\\\/bnrt\\\\\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> tree = json_string('"al\\pha"')
    >>> print(tree.as_sxpr())
    (string
      (:Text '"')
      (string_error
        (characters
          (plain "al"))
        (ups "\\")
        (characters
          (plain "pha")))
      (:Text '"'))

Here, the node named "ups" pinpoints the precise error location.

Like most techniques for fail-tolerant parsing, this one is not quite
as easy to master in practice as it might look. Generally, adding
a junction for erroneous code works best when the passage that shall
be by-passed is delineated by easily recognizable follow-up strings.
In this example the follow-up string would be the ``"``-sign. The method fails,
of course, if the follow-up text is erroneous, too, or has even been
forgotten. So, to be absolutely sure, one would have to consider
different follow-up sequences, say, empty lines, keywords that mark
new parts of the document, and the like.
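
As a sketch (assuming that a string should at the very least end at the
end of the line it starts on), the error junction above could be made to
stop at a line break as well as at a quotation mark::

    string_error  = /[^"\n]*/

so that a forgotten closing quotation mark does not cause the junction to
swallow the rest of the document.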

DHParser's support for fail-tolerant parsing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DHParser offers two constructs for fail-tolerant parsing which are
quite similar to the technique just described. However, they do not
require rewriting the grammar and reuse the error-locating ability
of the §-marker. A disadvantage is that the DHParser-specific support
for fail-tolerant parsing presently relies entirely on regular
expressions for finding the right re-entry points.

DHParser allows resuming the parsing process after an error at a later
point in the text. When trying to resume parsing, two questions must be
answered:

  1. At what location should the parsing process be resumed?

  2. Which parser in the parser call-chain should resume parsing?
     E.g. the parser that failed, the parser that called the parser
     that failed, ... ?

The location where parsing should be resumed must be specified by
a regular expression or a list of regular expressions. The resumption
location is the nearest match of any of these expressions that does
not fall into a comment (as specified by the ``@comment``-directive
described above). More precisely, it is the location directly after
the match, because this allows searching for the reentry-location
both by the text preceding this location and by the text following
this location, using a lookahead operator inside the regular
expression.
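
For illustration (hypothetical patterns, not tied to the examples above):
``/"/`` places the reentry-point directly *behind* the next quotation mark,
while ``/(?=")/`` uses a lookahead and thus places the reentry-point directly
*in front of* it, so that the quotation mark can still be matched by the
resuming parser::

    /"/      # reentry right after the next »"«
    /(?=")/  # reentry right before the next »"«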

The parser that resumes parsing depends on the directive that guides
the search for the reentry-point. DHParser offers two different
directives for this purpose, the ``@..._skip``-directive and the
``@..._resume``-directive. The placeholder ... stands for the name
of a parser that contains a §-marker.

The skip-directive resumes parsing with the sequence-parser that
contains the item(s) marked by the §-marker. In the following
example, the skip-directive picks up parsing with the string-
parser when an error was raised by the string-parser::

    >>> grammar = '''
    ... @ string_error  = /\\\\\\/, 'Illegal escape sequence »{1}«'
    ... @ string_error  = '', 'Illegal character "{1}" in string.'
    ... @ string_skip   = /(?=")/
    ... string          = `"` §characters `"` ~
    ... characters      = { plain | escape }
    ... plain           = /[^"\\\\\\]+/
    ... escape          = /\\\\\\[\\\\/bnrt\\\\\\]/'''
    >>> json_string = create_parser(grammar, 'json_string')
    >>> tree = json_string('"al\\pha"')
    >>> print(tree.content)
    "al\pha"
    >>> print(tree.errors[0])
    1:4: Error (1010): Illegal escape sequence »\pha"...«
    >>> print(tree.as_sxpr())
    (string
      (:Text '"')
      (characters
        (plain "al"))
      (ZOMBIE__ `(1:4: Error (1010): Illegal escape sequence »\pha"...«) "\pha")
      (:Text '"'))

After the error has occurred at the illegal escape-sequence, the
skip-directive catches the error and skips to the location where the
`"`-character lies just ahead and continues parsing with the string-parser.
The skipped passage is stored in a ZOMBIE__-Node within the syntax-tree
and parsing can continue through to the end of the text.

In contrast to the skip-directive, the resume-directive leaves the parser
that raised the error and resumes one level higher up in the call chain.
It is the ``@ ..._resume``-directive that tells the *calling*
parsers where to continue after the failed parser - say, an ``array``-parser
in a JSON-grammar - has returned. So, the parser resuming the parsing process
is not the parser that has failed, but the first parser in the reverse
call-stack of the failed parser that catches up at the location indicated
by the ``@ ..._resume``-directive. The location itself is determined by a
regular expression, where the point for reentry is the location *after* the
next match of the regular expression.
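
As a hypothetical sketch (the symbols ``array`` and ``_element`` are made
up for illustration, and the directive is written here as it would appear
in an EBNF-file, not as a doctest)::

    @ array_resume = /\]\s*/
    array          = "[" § [ _element { "," _element } ] "]"

If an error is raised inside ``array``, the calling parser would catch up
directly after the next closing bracket "]" and continue from there.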



Semantic Actions and Storing Variables
--------------------------------------


"""


from collections import OrderedDict
from functools import partial
import keyword
import os
from typing import Callable, Dict, List, Set, FrozenSet, Tuple, Sequence, Union, Optional

from DHParser.compile import CompilerError, Compiler, CompilationResult, compile_source, visitor_name
from DHParser.configuration import access_thread_locals, get_config_value, \
    NEVER_MATCH_PATTERN, ALLOWED_PRESET_VALUES
from DHParser.error import Error, AMBIGUOUS_ERROR_HANDLING, WARNING, REDECLARED_TOKEN_WARNING,\
    REDEFINED_DIRECTIVE, UNUSED_ERROR_HANDLING_WARNING, INAPPROPRIATE_SYMBOL_FOR_DIRECTIVE, \
    DIRECTIVE_FOR_NONEXISTANT_SYMBOL, UNDEFINED_SYMBOL_IN_TRANSTABLE_WARNING, \
    UNCONNECTED_SYMBOL_WARNING, REORDERING_OF_ALTERNATIVES_REQUIRED, BAD_ORDER_OF_ALTERNATIVES, \
    EMPTY_GRAMMAR_ERROR, MALFORMED_REGULAR_EXPRESSION
from DHParser.parse import Parser, Grammar, mixin_comment, mixin_nonempty, Forward, RegExp, \
    Drop, Lookahead, NegativeLookahead, Alternative, Series, Option, ZeroOrMore, OneOrMore, \
    Text, Capture, Retrieve, Pop, optional_last_value, GrammarError, Whitespace, Always, Never, \
    INFINITE, matching_bracket, ParseFunc, update_scanner, CombinedParser
from DHParser.preprocess import nil_preprocessor, PreprocessorFunc
from DHParser.syntaxtree import Node, RootNode, WHITESPACE_PTYPE, TOKEN_PTYPE, ZOMBIE_TAG, \
    flatten_sxpr
from DHParser.toolkit import load_if_file, escape_re, escape_ctrl_chars, md5, \
    sane_parser_name, re, expand_table, unrepr, compile_python_object, DHPARSER_PARENTDIR, \
    cython
from DHParser.transform import TransformerCallable, traverse, remove_brackets, \
    reduce_single_child, replace_by_single_child, is_empty, remove_children, \
    remove_tokens, flatten, forbid, assert_content, remove_children_if, all_of, not_one_of, \
    BLOCK_LEAVES
from DHParser.versionnumber import __version__


__all__ = ('DHPARSER_IMPORTS',
           'get_ebnf_preprocessor',
           'get_ebnf_grammar',
           'get_ebnf_transformer',
           'get_ebnf_compiler',
           'parse_ebnf',
           'transform_ebnf',
           'compile_ebnf_ast',
           'EBNFGrammar',
           'EBNFTransform',
           'EBNFCompilerError',
           'EBNFDirectives',
           'EBNFCompiler',
           'grammar_changed',
           'compile_ebnf',
           'PreprocessorFactoryFunc',
           'ParserFactoryFunc',
           'TransformerFactoryFunc',
           'CompilerFactoryFunc')


########################################################################
#
# source code support
#
########################################################################


DHPARSER_IMPORTS = '''
import collections
from functools import partial
import os
import sys
from typing import Tuple, List, Union, Any, Optional, Callable

try:
    scriptpath = os.path.dirname(__file__)
except NameError:
    scriptpath = ''
dhparser_parentdir = os.path.abspath(os.path.join(scriptpath, r'{dhparser_parentdir}'))
if scriptpath not in sys.path:
    sys.path.append(scriptpath)
if dhparser_parentdir not in sys.path:
    sys.path.append(dhparser_parentdir)

try:
    import regex as re
except ImportError:
    import re
from DHParser import start_logging, suspend_logging, resume_logging, is_filename, load_if_file, \\
    Grammar, Compiler, nil_preprocessor, PreprocessorToken, Whitespace, Drop, AnyChar, \\
    Lookbehind, Lookahead, Alternative, Pop, Text, Synonym, Counted, Interleave, INFINITE, \\