# ebnf.py - EBNF -> Python-Parser compilation for DHParser
#
# Copyright 2016  by Eckhart Arnold (arnold@badw.de)
#                 Bavarian Academy of Sciences an Humanities (badw.de)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.  See the License for the specific language governing
# permissions and limitations under the License.


"""
Module ``ebnf`` provides an EBNF-parser-generator that compiles an
EBNF-Grammar into Python-code that can be executed to parse source text
conforming to this grammar into concrete syntax trees.

Specifying Grammars with EBNF
-----------------------------

With DHParser, Grammars can be specified either directly in Python-code
(see :py:mod:`parse`) or in one of several EBNF-dialects. (Yes,
DHParser supports several different variants of EBNF! This makes it easy
to create a parser directly from Grammars found in external sources.)
"EBNF" stands for the "Extended-Backus-Naur-Form" which is a common
formalism for specifying Grammars for context-free-languages.
(see https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form)

The recommended way of compiling grammars with DHParser is either to
write the EBNF-specification for that Grammar into a text-file and then
compile the EBNF-source to an executable as well as importable Python-module
with the help of the "dhparser"-script, or, for bigger projects, to
create a new domain-specific-language-project with the DHParser-script
as described in the step-by-step-guide.

However, here we will show how to compile an EBNF-specified grammar
from within Python-code and how to execute the parser that was
generated by compiling the grammar.

As an example, we will realize a json-parser (https://www.json.org/).
Let's start with creating some test-data::

    >>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False}
    >>> import json
    >>> testdata = json.dumps(testobj)
    >>> testdata
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

We define the json-Grammar (see https://www.json.org/) in a
top-down manner in EBNF. We'll use a regular-expression look-alike
syntax. EBNF, as you may recall, consists of a sequence of symbol
definitions. The definiens of each definition is either a string
literal, a regular expression, another symbol, or a combination of
these with four different operators: 1. sequences,
2. alternatives, 3. options and 4. repetitions. Here is how these
elements are denoted in classical and regex-like EBNF-syntax:

========================  ==================  ================
element                   classical EBNF      regex-like
========================  ==================  ================
insignificant whitespace  ~                   ~
string literal            "..." or \\`...\\`    "..." or \\`...\\`
regular expr.             /.../               /.../
sequences                 A B C               A B C
alternatives              A | B | C           A | B | C
options                   [ ... ]             ...?
repetitions               { ... }             ...*
one or more                                   ...+
grouping                  (...)               (...)
========================  ==================  ================

"insignificant whitespace" is a speciality of DHParser. Denoting
insignificant whitespace with a particular sign ``~`` makes it possible to
eliminate it already during the parsing process without burdening later
syntax-tree-processing stages with this common task. DHParser offers
several more facilities to restrain the verbosity of the concrete
syntax tree, so that the outcome of the parsing stage already comes
close (or at least closer) to the intended abstract syntax tree.

JSON consists of two complex data types: 1) associative arrays,
called "objects", and 2) sequences of heterogeneous data, called
"arrays"; and of four simple data types: 1) string, 2) number, 3) bool
and 4) null. The structure of a JSON file can easily be described in
EBNF::

    >>> grammar = '''
    ...     json     = ~ _element _EOF
    ...     _EOF     = /$/
    ...     _element = object | array | string | number | bool | null
    ...     object   = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member   = string ":" ~ _element
    ...     array    = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string   = `"` _CHARS `"` ~
    ...     _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number   = _INT _FRAC? _EXP? ~
    ...     _INT     = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...     _FRAC    = `.` /[0-9]+/
    ...     _EXP     = (`E`|`e`) [`+`|`-`] /[0-9]+/
    ...     bool     = "true" ~ | "false" ~
    ...     null     = "null" ~                                   '''

This is a rather common EBNF-grammar. A few peculiarities are noteworthy, though:
First of all you might notice that some components of the grammar
(or "production rules" as they are commonly called) have names with a leading
underscore ``_``. It is a convention to mark those elements in which we are
not interested on their own account with an underscore ``_``. When moving from the
concrete syntax-tree to a more abstract syntax-tree, these elements could be
substituted by their content, to simplify the tree.

Secondly, some production rules carry a name written in capital letters. This is
also a convention to mark those symbols which with other parser-generators would
represent tokens delivered by a lexical scanner. DHParser is a "scanner-less"
parser, which means that the breaking down of the string into meaningful tokens
is done in place with regular expressions (like in the definition of ``_EOF``)
or simple combinations of regular expressions (see the definition of ``_INT`` above).
There is no sharp distinction between tokens and other symbols in DHParser,
but we keep it as a loose convention. Regular expressions are enclosed in forward
slashes and follow the standard syntax of Perl-style regular expressions that is
also used by the "re"-module of the Python standard library. (Don't worry about
the number of backslashes in the line defining ``_CHARS`` for now!)

Finally, it is another helpful convention to indent the definitions of symbols
that have only been introduced to simplify an otherwise unnecessarily
complicated definition (e.g. the definition of ``number``, above) or to make
it more understandable by giving names to its components (like ``_EOF``).

Let's try this grammar on our test-string.  In order to compile
this grammar into executable Python-code, we use the high-level-function
:py:func:`~dsl.create_parser` from the :py:mod:`dsl`-module.

    >>> from DHParser.dsl import create_parser
    >>> # from DHParser.dsl import compileEBNF
    >>> # print(compileEBNF(grammar))
    >>> parser = create_parser(grammar, branding="JSON")
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

As expected, serializing the content of the resulting syntax-tree yields exactly
the input-string of the parsing process. What we cannot see here is that the
parser has structured the string into the individual elements described in the
grammar. Since the concrete syntax-tree that the parser yields is rather
verbose, it would not make sense to print it out. We'll just look at a small
part of it, to see what it looks like. Let's just pick the sub-tree that
captures the first json-array within the syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (_element
        (number
          (_INT "1")))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (number
          (_INT "2")
          (_FRAC
            (:Text ".")
            (:RegExp "0"))))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (string
          (:Text '"')
          (_CHARS "a string")
          (:Text '"')))
      (:Text "]"))

The nodes of the syntax-tree carry the names of the production rules
by which they have been generated. Nodes that have been created by
components of a production receive the name of the parser-type
that has created the node (see :py:mod:`parse`) prefixed
with a colon ":". In DHParser, these nodes are called "anonymous",
because they lack the name of a proper grammatical component.
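
For example (using only the node-methods that appear above), the
anonymous nodes among the children of the array-node that we have just
printed can be spotted by the leading colon of their tag-names::

    >>> [nd.tag_name for nd in syntax_tree.pick('array').children
    ...  if nd.tag_name.startswith(':')]
    [':Text', ':Text', ':Whitespace', ':Text', ':Whitespace', ':Text']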

.. _simplifying_syntax_trees:

Simplifying Syntax-Trees while Parsing
--------------------------------------

Usually, anonymous nodes are what you want to get rid of in the course
of transforming the concrete syntax-tree into an abstract syntax-tree.
(See :py:mod:`transform`). By default, DHParser already eliminates
all anonymous nodes that are not leaf-nodes by replacing them
with their children during parsing. Anonymous leaf-nodes will be
replaced by their content, if they are a single child of some parent,
and otherwise be left in place. Without this optimization, each
construct of the EBNF-grammar would leave a node in the syntax-tree::

    >>> from DHParser.parse import CombinedParser, TreeReduction
    >>> _ = TreeReduction(parser.json, CombinedParser.NO_TREE_REDUCTION)
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (:Option
        (:Series
          (_element
            (number
              (_INT
                (:Alternative
                  (:RegExp "1")))))
          (:ZeroOrMore
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (number
                  (_INT
                    (:Alternative
                      (:RegExp "2")))
                  (:Option
                    (_FRAC
                      (:Text ".")
                      (:RegExp "0"))))))
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (string
                  (:Text '"')
                  (_CHARS
                    (:RegExp "a string"))
                  (:Text '"')))))))
      (:Text "]"))

This can be helpful for understanding how parsing that is directed by
an EBNF-grammar works (next to looking at the logs of the complete
parsing-process, see :py:mod:`trace`), but other than that it
is advisable to streamline the syntax-tree as early as possible,
because the processing time of all subsequent tree-processing stages
increases with the number of nodes in the tree.

Because of this, DHParser offers further means of simplifying
syntax-trees already during the parsing stage. These are not turned
on by default, because they make it possible to drop content or to remove
named nodes from the tree. Instead, they must be turned on by "directives" that
are listed at the top of an EBNF-grammar and that guide the
parser-generation process. DHParser-directives always start with an
``@``-sign. For example, the ``@drop``-directive advises the parser to
drop certain nodes entirely, including their content. In the following
example, the parser is directed to drop all insignificant whitespace::

    >>> drop_insignificant_wsp = '@drop = whitespace \\n'

Directives look similar to productions, except that on the right hand
side of the equal sign follows a list of parameters. In the case
of the drop-directive these can be either names of non-anonymous
nodes that shall be dropped or one of four particular classes of
anonymous nodes (``strings``, ``backticked``, ``regexp``, ``whitespace``) that
will be dropped.
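
For example, in order to drop, next to the insignificant whitespace, also
all unnamed string-tokens, one could write (a variant that will in fact be
used further below)::

    @drop = whitespace, strings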

Another useful directive advises the parser to treat named nodes as
anonymous nodes and to eliminate them accordingly during parsing. This
is useful, if we have introduced certain names in our grammar
only as placeholders to render the definition of the grammar a bit
more readable, not because we are interested in the text that is
captured by the production associated with them in their own right::

    >>> disposable_symbols = '@disposable = /_\w+/ \\n'

Instead of passing a comma-separated list of symbols to the directive,
which would also have been possible, we have leveraged our convention
to prefix unimportant symbols with an underscore "_" by specifying the
symbols that shall be anonymized with a regular expression.
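
For our json-grammar, an equivalent comma-separated list naming the
disposable symbols explicitly would have been::

    @disposable = _EOF, _element, _CHARS, _INT, _FRAC, _EXP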

Now, let's examine the effect of these two directives::

    >>> refined_grammar = drop_insignificant_wsp + disposable_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}'

You might have noticed that this time all insignificant whitespace adjacent
to the delimiters has been removed (but, of course, not the significant
whitespace between "a" and "string" in "a string"). And the difference these
two directives make is even more obvious, if we look at (a section of) the
syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr())
    (array
      (:Text "[")
      (number "1")
      (:Text ",")
      (number
        (:RegExp "2")
        (:Text ".")
        (:RegExp "0"))
      (:Text ",")
      (string
        (:Text '"')
        (:RegExp "a string")
        (:Text '"'))
      (:Text "]"))

This tree looks more streamlined. But it still contains more structure
than we might like to see in an abstract syntax tree. In particular, it
still contains all the delimiters ("[", ",", '"', ...) next to the data. But
unlike in the UTF-8 representation of our json data, the delimiters are
not needed any more, because the structural information is now retained
in the tree-structure.

So how can we get rid of those delimiters? The rather coarse-grained tools
that DHParser offers in the parsing stage require some care to do this
properly.

The @drop-directive allows dropping all unnamed strings (i.e. strings
that are not directly assigned to a symbol), backticked strings (for
the difference between strings and backticked strings, see below) and
regular expressions. However, using ``@drop = whitespace, strings, backticked``
would also drop those parts captured as strings that contain data::

    >>> refined_grammar = '@drop = whitespace, strings, backticked \\n' \
                          + disposable_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp "0"))
      (string "a string"))

Here, suddenly, the number "2.0" has been turned into "20"! There
are three ways to get around this problem:

1. Assigning all non-delimiter strings to symbols. In this case
   we would have to rewrite the definition of "number" as such::

      number     = _INT _FRAC? _EXP? ~
        _INT     = _MINUS? ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = _DOT /[0-9]+/
        _EXP     = (_Ecap|_Esmall) [_PLUS|_MINUS] /[0-9]+/
        _MINUS   = `-`
        _PLUS    = `+`
        _DOT     = `.`
        _Ecap    = `E`
        _Esmall  = `e`

   A simpler alternative of this technique would be to make use of
   the fact that the document-parts captured by regular expressions
   are not dropped (although regular expressions can also be listed
   in the @drop-directive, if needed) and that at the same time
   delimiters are almost always simple strings containing keywords
   or punctuation characters. Thus, one only needs to rewrite those
   string-expressions that capture data as regular expressions::

      number     = _INT _FRAC? _EXP? ~
        _INT     = /[-]/ ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = /[.]/ /[0-9]+/
        _EXP     = (/E/|/e/) [/[-+]/] /[0-9]+/

2. Assigning all delimiter strings to symbols and dropping the nodes
   and content captured by these symbols. This means doing exactly
   the opposite of the first solution. Here is an excerpt of what
   a JSON-grammar employing this technique would look like::

      @disposable = /_\w+/
      @drop = whitespace, _BEGIN_ARRAY, _END_ARRAY, _KOMMA, _BEGIN_OBJECT, ...
      ...
      array = _BEGIN_ARRAY ~ ( _element ( _KOMMA ~ _element )* )? §_END_ARRAY ~
      ...

   It is important that all symbols listed for dropping are also made
   disposable, either by listing them in the disposable-directive as well
   or by using names that the regular expression for disposables matches.
   Otherwise, DHParser does not allow dropping the content of named nodes,
   because the default assumption is that symbols in the grammar are
   defined to capture meaningful parts of the document that contain
   relevant data.

3. Bailing out and leaving the further simplification of the syntax-tree
   to the next tree-processing stage which, if you follow DHParser's suggested
   usage pattern, is the abstract-syntax-tree-transformation proper
   and which allows for a much more fine-grained specification of
   transformation rules. See :py:mod:`transform`.

To round this section up, we present the full grammar for a streamlined
JSON-Parser according to the first solution-strategy. Observe that the
values of "bool" and "null" are now defined with regular expressions
instead of string-literals, because the latter would be dropped because
of the ``@drop = ... strings, ...``-directive, leaving an empty named node
without a value, whenever a bool value or null occurs in the input::

    >>> json_gr = '''
    ...     @disposable = /_\\\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member     = string ":" ~ _element
    ...     array      = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string     = `"` _CHARS `"` ~
    ...       _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/ ~ | /false/ ~
    ...     null       = /null/ ~                                  '''
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp ".")
        (:RegExp "0"))
      (string "a string"))

This time the data is not distorted any more. One oddity remains, however: We
are most probably not interested in the fact that the number 2.0 consists of
three components, each of which has been captured by a regular expression.
Luckily, there exists yet another directive that allows reducing the tree
further by merging adjacent anonymous leaf-nodes::

    >>> json_gr = '@reduction = merge \\n' + json_gr
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr())
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Merging adjacent anonymous leaf-nodes takes place after the @drop-directive
comes into effect. It should be observed that merging only produces the desired
result, if any delimiters have been dropped previously, because otherwise
delimiters would be merged with content. (For example, without dropping
the quotation marks, the three leaves of a string-node would be merged
into a single leaf that still contains the quotation marks as part of its
content.) Therefore, the ``@reduction = merge``-directive should at best
only be applied in conjunction with the ``@drop`` and
``@disposable``-directives.

Applying any of the tree-reductions described here (or "simplifications"
for that matter) requires a bit of careful planning concerning which nodes
will be named and which nodes will be dropped. This, however, pays off in
terms of speed and a considerably simplified abstract-syntax-tree generation
stage, because most of the unnecessary structure of concrete-syntax-trees
has already been eliminated at the parsing stage.

.. _comments_and_whitespace:

Comments and Whitespace
-----------------------

Why whitespace isn't trivial
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Handling whitespace in text-documents is not at all trivial, because
whitespace can serve several different purposes and there can be
different kinds of whitespace: Whitespace can serve a syntactic function
as delimiter. But whitespace can also be purely aesthetic to render
a document more readable.

Depending on the data model, whitespace can be considered as
significant and be included in the data or as
insignificant and be excluded from the data and only be re-inserted
when displaying the data in a human-readable form. (For example, one
can model a sentence as a sequence of words and spaces or just as
a sequence of words.) Note that "significance" does not correlate
with the syntactic or aesthetic function, but only depends on whether
you'd like to keep the whitespace in your data or not.

There can be different kinds of whitespace with different meanings
(and differing significance). For example, one can make a difference
between horizontal whitespace (spaces and tabs) and vertical
whitespace (including linefeeds). And there can be different sizes
of whitespace with different meanings. For example in LaTeX, a single
linefeed still counts as plain whitespace while an empty line (i.e.
whitespace including two or more linefeeds) signals a new
paragraph.

Finally, even the position of whitespace can make a difference.
A certain number of whitespaces at the beginning of a line can
have the meaning of "indentation" (as in Python code) while at
the end of the line or between brackets it is just plain
insignificant whitespace. (This is actually something where
the boundaries of the EBNF-formalism become visible and you'd
probably use a preprocessor or some kind of "semantic actions"
to handle such cases. There is some support for either of these
in DHParser.)

Coding significant Whitespace in EBNF-Grammars
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonable approach to coding whitespace is to use one
particular symbol for each kind of whitespace. Those kinds of
whitespace that are insignificant, i.e. that do not need to
appear in the data, should be dropped from the syntax-tree.
With DHParser this can be done already while parsing, using
the ``@disposable`` and ``@drop``-directives described earlier.

But let's first look at an example which only includes significant
whitespace. The following parser parses sequences of paragraphs which
consist of sequences of sentences which consist of sequences
of main clauses and subordinate clauses which consist of sequences
of words::

    >>> text_gr = '''
    ...     @ disposable = /_\\\\w+/
    ...     document       = PBR* S? paragraph (PBR paragraph)* PBR* S? _EOF
    ...       _EOF         = /$/
    ...     paragraph      = sentence (S sentence)*
    ...     sentence       = (clause _c_delimiter S)* clause _s_delimiter
    ...       _c_delimiter = KOMMA | COLON | SEMICOLON
    ...       _s_delimiter = DOT | QUESTION_MARK | EXCLAMATION_MARK
    ...     clause         = word (S word)*
    ...     word           = /(?:[A-Z]|[a-z])[a-z']*/
    ...     DOT            = `.`
    ...     QUESTION_MARK  = `?`
    ...     EXCLAMATION_MARK = `!`
    ...     KOMMA          = `,`
    ...     COLON          = `:`
    ...     SEMICOLON      = `;`
    ...     PBR            = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/
    ...     S              = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/ '''

Here, we have two types of significant whitespace ``PBR`` ("paragraph-break") and ``S``
("space"). Both types allow for a certain amount of flexibility, so that two
whitespaces of the same type do not need to have exactly the same content, but
we could always normalize these whitespaces in a subsequent transformation step.

Two typical design patterns for significant whitespace are noteworthy here:

1. Both whitespaces match only if there was at least one whitespace character.
   We may allow whitespace to be optional (as at the beginning and end of the
   document), but if the option has not been taken, we don't want to see an
   empty whitespace-tag in the document, later on.
   (For insignificant whitespace, the opposite convention can be more convenient,
   because, typically, insignificant whitespace is dropped anyway, whether it's
   got content or not.)

2. The grammar is constructed in such a way that the whitespace always appears
   *between* different elements at the same level, but not after the last or
   before the first element. The whitespace after the last word of a sentence
   or before the first word of a sentence is really whitespace between
   two sentences. If we pick out a sentence or a clause, we will have no
   dangling whitespace at its beginning or end.
   (Again, for soon to be dropped insignificant whitespace, another convention
   can be more advisable.)

Let's just try our grammar on an example::

    >>> text_example = '''
    ... I want to say, in all seriousness, that a great deal of harm is being
    ... done in the modern world by belief in the virtuousness of work, and that
    ... the road to happiness and prosperity lies in an organized diminution of
    ... work.
    ...
    ... First of all: what is work? Work is of two kinds: first, altering the
    ... position of matter at or near the earth's surface relatively to other
    ... such matter; second, telling other people to do so. The first kind is
    ... unpleasant and ill paid; the second is pleasant and highly paid.'''
    >>> text_parser = create_parser(text_gr, 'Text')
    >>> text_as_data = text_parser(text_example)
    >>> sentence = text_as_data.pick(\
            lambda nd: nd.tag_name == "sentence" and nd.content.startswith('First'))
    >>> print(sentence.as_sxpr())
    (sentence
      (clause
        (word "First")
        (S " ")
        (word "of")
        (S " ")
        (word "all"))
      (COLON ":")
      (S " ")
      (clause
        (word "what")
        (S " ")
        (word "is")
        (S " ")
        (word "work"))
      (QUESTION_MARK "?"))

Again, it is a question of design whether we leave whitespace in the data or
not. Leaving it has the advantage that serialization becomes as simple as
printing the content of the data-tree::

    >>> print(sentence)
    First of all: what is work?

Otherwise one would have to program a dedicated serialization routine.
Especially if you receive data from a different source, you'll appreciate
not having to do this - and so will other people receiving your data.
Think about it! However, dropping the whitespace will yield more concise
data.

Coding Comments
^^^^^^^^^^^^^^^

Allowing comments in a domain-specific language almost always makes sense,
because it allows users to annotate the source texts while working on them
and to share those comments with collaborators. From a technical point of
view, adding comments to a DSL raises two questions:

1. At what places shall we allow inserting comments in the source code?
   Common answers are: a) at the end of a line, b) almost everywhere, or
   c) both.

2. How do we avoid polluting the EBNF-grammar with comment markers?
   It already curtails the readability that we have to put whitespace
   symbols in so many places. And speaking of comments at the end of
   the line: If linefeeds aren't important for us - as in our toy-grammar
   for prose-text, above - we probably wouldn't want to reframe our
   grammar just to allow for end-of-line comments.

Luckily, there exists a simple and highly intuitive solution that takes
care of both of these concerns: We admit comments wherever whitespace
is allowed. And we code this by defining a symbol that means: "whitespace
and, optionally, a comment".

Let's try this with our prose-text-grammar. In order to do so, we have
to define a symbol for comments, a symbol for pure whitespace, and,
finally, a symbol for whitespace with optional comment. Since, in
our grammar, we actually have two kinds of whitespace, ``S`` and ``PBR``,
we'll have to redefine both of them. As delimiters for comments, we
use curly braces::

    >>> wsp_gr = '''
    ...     PBR      = pure_PBR COMMENT (pure_PBR | pure_S)?
    ...              | (pure_S? COMMENT)? pure_PBR
    ...     S        = pure_S COMMENT pure_S? | COMMENT? pure_S
    ...     COMMENT  = /\{[^}]*\}/
    ...     pure_PBR = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/
    ...     pure_S   = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/'''

As can be seen, the concrete re-definition of the whitespace tokens
requires a bit of careful consideration, because we want to allow
additional whitespace next to comments, but at the same time avoid
ending up with two whitespaces in sequence in our data. Let's see if
we have succeeded::

    >>> extended_text_gr = text_gr[:text_gr.rfind(' PBR')] + wsp_gr
    >>> extended_parser = create_parser(extended_text_gr, 'Text')
    >>> syntax_tree = extended_parser('What {check this again!} is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> print(syntax_tree.pick('clause').as_sxpr())
    (clause
      (word "What")
      (S
        (pure_S " ")
        (COMMENT "{check this again!}")
        (pure_S " "))
      (word "is")
      (S
        (pure_S " "))
      (word "work"))

We will not worry about the sub-structure of the S-nodes right now. If
we are not interested in the comments, we could use the ``@disposable``,
``@drop`` and ``@reduction = merge``-directives to simplify these at the
parsing stage. Or, we could extract the comments and normalize the whitespace
at a later tree-processing stage. For now, let's just check whether our
comments work as expected::

    >>> syntax_tree = extended_parser('What{check this again!} is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What {check this again!}is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What{check this again!}is work?')
    >>> print(syntax_tree.errors[0])
    1:24: Error (1040): Parser "pure_S" did not match: »is work?«

The last error was to be expected, because we did not allow comments
to serve as substitutes for whitespace. Let's check whether putting comments
near paragraph breaks works as well::

    >>> test_text = '''Happiness lies in the diminuniation of work.
    ...
    ... { Here comes the comment }
    ...
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.children))
    paragraph PBR paragraph
    >>> test_text = '''Happiness lies in the diminuniation of work.
    ... { Here comes the comment }
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.children))
    paragraph

The last result might look surprising at first, but since a paragraph
break requires at least one empty line as a separator, the input text
is correctly understood by the parser as a single paragraph with
two sentences interspersed by a single whitespace which, incidentally,
contains a comment::

    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('paragraph').children))
    sentence S sentence
    >>> print(syntax_tree.pick('paragraph')['S'].as_sxpr(flatten_threshold=0))
    (S
      (pure_S
        ""
        "")
      (COMMENT "{ Here comes the comment }")
      (pure_S
        ""
        ""))

A common problem with whitespace is that it tends to pollute
the grammar, because wherever you'd like to allow whitespace,
you'd have to insert a symbol for whitespace. The same problem
exists when it comes to allowing comments, because you'd probably
want to allow inserting comments in as many places as possible.

DHParser's support for insignificant whitespace and comments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Coding insignificant whitespace and comments is exactly the
same as coding significant whitespace and comments and does not
need to be repeated here. (The combination of insignificant
whitespace and significant comments is slightly more complicated,
and probably best outsourced to some degree to the post-parsing
processing stages. It will not be discussed here.) However,
DHParser offers some special support for insignificant
whitespace and comments, which can make working with these
easier in some cases.

First of all, DHParser has a special dedicated token for
insignificant whitespace which is the tilde ``~``-character.
We have seen this earlier in the definition of the json-Grammar.

The ``~``-whitespace-marker differs from the usual pattern for
defining whitespace in that it is implicitly optional, or, what
amounts to the same, it matches the empty string. Normally,
it is to be considered bad practice to define a symbol as
optional. Rather, a symbol should always match something and
only at the places where it is used should it be marked as
optional. If this rule is obeyed, it is always easy to tell
whether some element is optional or not at a specific place
in the grammar. Otherwise, it can become quite confusing
indeed. However, since the tilde character is usually used
very often, it is more convenient not to mark it with a
question-mark or, if you use classical EBNF-syntax, to enclose
it with square brackets.
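
To sketch what this rule means in practice (with ``word`` standing in
for any other symbol): rather than defining a whitespace-symbol so that
it also matches the empty string, one would let it always capture at
least one character and mark it as optional only at the places where it
is used::

    ws       = /\s+/                # always captures at least one character
    sequence = ws? word (ws word)*  # optionality is marked at the place of use

The tilde is the deliberate exception to this rule.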

The default regular expression for the tilde-whitespace captures
arbitrarily many spaces and tabs and at most one linefeed, but
not an empty line (``[ \t]*(?:\n[ \t]*)?(?!\n)``), as this is
the most convenient way to define whitespace for text-data.
However, the tilde-whitespace can also be defined with any
other regular expression with the ``@whitespace``-directive.

Let's go back to our JSON-grammar and define the optional
insignificant whitespace marked by the tilde-character in such a
way that it matches any amount of horizontal or vertical
whitespace, which makes much more sense in the context of json
than the default tilde-whitespace that is restricted vertically
to at most a single linefeed::

    >>> testdata = '{"array": [1, 2.0, "a string"], \\n\\n\\n "number": -1.3e+25, "bool": false}'
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors[0])
    1:32: Error (1010): member expected by parser 'object', » \\n \\n \\n  "numb...« found!
    >>> json_gr = '@whitespace = /\\\\s*/ \\n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors)
    []

When redefining the tilde-whitespace, make sure that your regular expression
also matches the empty string! There is no need to worry that the syntax tree
gets cluttered by empty whitespace-nodes, because tilde-whitespace always
yields anonymous nodes and DHParser drops empty anonymous nodes right away.

Comments can be defined using the ``@comment``-directive. DHParser automatically
intermingles comments and whitespace so that wherever tilde-whitespace is
allowed, a comment defined by the ``@comment``-directive is also allowed:

    >>> json_gr = '@comment = /#[^\\\\n]*(?:\\\\n|$)/ \\n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> testdata = '''{"array": [1, 2.0, "a string"], # a string
    ...                "number": -1.3e+25,  # a number
    ...                "bool": false}  # a bool'''
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr(compact = True))
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Since the json-grammar still contains the ``@drop = whitespace, ...``-
directive from earlier on (next to other tree-reductions), the comments
have been nicely dropped along with the tilde-whitespace.

There is one caveat: When using comments alongside whitespace that
captures at most one linefeed, the comments should be defined in such
a way that the last character of a comment is never a linefeed.
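
For example, next to a whitespace-regex that captures at most one
linefeed, a comment would better be defined in the first of the
following two ways, not the second (a sketch)::

    @comment = /#[^\n]*/     # safe: the comment never captures a linefeed
    @comment = /#[^\n]*\n/   # risky: the comment captures the trailing linefeed

Otherwise, the linefeed captured by the comment plus a linefeed captured
by the adjacent whitespace could add up to an empty line, which whitespace
of this kind is meant to exclude.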

Also, a few limitations of the tilde-whitespace and directive-defined
comments should be kept in mind: 1. Only one kind of insignificant
whitespace can be defined this way. If there are more kinds of
insignificant whitespace, all but one need to be defined conventionally
as part of the grammar. 2. Both directive-defined comments and
tilde-whitespace can only be defined by a regular expression. In
particular, nested comments are impossible to define with regular
expressions alone.

However, using tilde-whitespace has yet one more benefit: With the
tilde-whitespace, cluttering of the grammar with whitespace-markers
can be avoided by adding implicit whitespace adjacent to string-literals.
Remember the definition of the JSON-Grammar earlier. If you look at
a definition like: ``object = "{" ~ member ( "," ~ §member )* "}" ~``,
you'll notice that there are three whitespace markers, one next to
each delimiter. Naturally so, because one usually wants to allow users
of a domain specific language to put whitespace around delimiters.

You may wonder why the tilde appears only on the right hand side
of the literals, although you'd probably like to allow whitespace
on both sides of a literal like "{". But if you look at the grammar
closely, you'll find that almost every symbol definition ends
either with a tilde sign or with a symbol the definition of which ends
with a tilde sign, which means that they allow whitespace on the
right hand side. Now, if all elements of the grammar allow
whitespace on the right hand side, this means that they automatically
also have whitespace on the left-hand side, which is, of course,
the whitespace on the right hand side of the previous element.

In order to reduce cluttering the grammar with tilde-signs, DHParser
allows turning on implicit tilde-whitespace adjacent to any
string literal with the directive ``@ literalws = right`` or
``@ literalws = left``. As the argument of the directive suggests,
whitespace is either "eaten" at the right hand side or the left
hand side of the literal. String literals can either be
enclosed in double quotes "..." or single quotes '...'. Both
kinds of literals will have implicit whitespace, if the
``@literalws``-directive is used.

(Don't confuse implicit whitespace
with insignificant whitespace: Insignificant whitespace is whitespace
you do not need any more after parsing. Implicit whitespace is
whitespace you do not denote explicitly in the grammar. It's
a speciality of DHParser and DHParser allows only the insignificant
whitespace denoted by the tilde-character to be declared as
"implicit".)

If right-adjacent whitespace is declared as implicit with the
``@literalws``-directive, the expression::

    object     = "{" ~ member ( "," ~ §member )* "}" ~

can be written as::

    object     = "{" member ( "," §member )* "}"

which is easier to read.

For situations where implicit whitespace is not desired, DHParser
has a special kind of string literal, written with backticks, which
never carries any implicit whitespace. This is important when
literals are used for signs that enclose content, like the quotation
marks for the string literals in our JSON-Grammar::

    string     = `"` _CHARS '"'  # mind the difference between `"` and '"'!

Regular expressions, also, never carry implicit whitespace.
So, if you are using regular expressions as delimiters, you
still have to add the tilde character, if adjacent insignificant
whitespace is to be allowed::

    bool       = /true/~ | /false/~

The complete json-grammar now looks like this::

    >>> json_gr = '''
    ...     @disposable = /_\\\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     @reduction = merge
    ...     @whitespace= /\\\\s*/
    ...     @comment   = /#[^\\\\n]*(?:\\\\n|$)/
    ...     @literalws = right
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" member ( "," §member )* "}"
    ...     member     = string ":" _element
    ...     array      = "[" ( _element ( "," _element )* )? "]"
    ...     string     = `"` _CHARS '"'
    ...       _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/~ | /false/~
    ...     null       = /null/~                                    '''
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree_ = json_parser(testdata)
    >>> assert syntax_tree_.equals(syntax_tree)

The whitespace defined by the ``@whitespace``-directive can be accessed from
within the grammar via the name ``WHITESPACE__``. Unlike the tilde-sign,
this name refers to the pure whitespace that is not intermingled with
comments. Similarly, comments defined by the ``@comment``-directive can
be accessed via the symbol ``COMMENT__``.
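
So, if pure whitespace or a bare comment is needed somewhere in the
grammar, these can simply be bound to symbols of their own, e.g. (a
sketch)::

    pure_whitespace = WHITESPACE__
    comment         = COMMENT__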

Lookahead and Lookbehind
------------------------

Lookahead and lookbehind operators are a convenient way to resolve or rather
avoid ambiguities while at the same time keeping the DSL lean. Assume for
example a simple DSL for writing definitions like::

    >>> definitions = '''
    ...     dog   := carnivorous quadrupel that barks
    ...     human := featherless biped'''

Now, let's try to draw up a grammar for "definitions"::

    >>> def_DSL_first_try = ''' # WARNING: This grammar doesn't work, yet!
    ...     @literalws  = right
    ...     definitions = ~ definition { definition } EOF
    ...     definition  = definiendum ":=" definiens
    ...     definiendum = word
    ...     definiens   = word { word }
    ...     word        = /[a-z]+|[A-Z][a-z]*/~
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL_first_try, "defDSL")

Parsing our example with the generated parser yields an error, however::

    >>> syntax_tree = def_parser(definitions)
    >>> for e in syntax_tree.errors_sorted:  print(e)
    3:11: Error (1040): Parser "word->/[a-z]+|[A-Z][a-z]*/" did not match: »:= featherless biped«

The reason for this error is that the parser ``definiens`` captures as many
words as occur in a sequence, including the definiendum of the next
definition, which is the word "human". But then the next definition does
not find its definiendum any more, because it has already been captured.
(All this may not easily become clear from the error message itself, but
can easily be found
out by using the post-mortem debugger of module :py:mod:`trace`.)

A common technique to avoid this problem would be to introduce an
end-of-statement marker, for example, a semicolon ";". A more elegant way
to solve the problem in this case is to make use of the fact that if a word
is followed by the definition sign ":=" it cannot be part of the definiens
any more, but must be a definiendum. This can be encoded by using the
negative look-ahead operator "!":

    >>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
    ...                                     'definiens   = word { word !":=" }')
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser(definitions)
    >>> for d in syntax_tree.select('definition'):
    ...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
    A dog    is a carnivorous quadrupel that barks
    A human  is a featherless biped

The statement ``word !":="`` is a sequence of a ``word`` and a negative lookahead.
This whole sequence only matches, if ``word`` matches and the negative lookahead
matches, which is only the case if the following text cannot be matched by ":=".

We could have achieved the same effect with a positive lookahead by checking
whether any of the possible follow-up-sequences of the parser ``definiens`` ensues::

    >>> def_DSL = def_DSL_first_try.replace('definiens   = word { word }',
    ...                                     'definiens   = word { word &(word|EOF) }')
    >>> def_parser = create_parser(def_DSL, "defDSL")
    >>> syntax_tree = def_parser(definitions)
    >>> for d in syntax_tree.select('definition'):
    ...    print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}')
    A dog    is a carnivorous quadrupel that barks
    A human  is a featherless biped

Generally, lookahead operators, whether positive or negative, never capture any
text. They merely match or fail depending on whether the following parser would
match or would fail to match the next piece of text. The positive lookahead