# ebnf.py - EBNF -> Python-Parser compilation for DHParser
#
# Copyright 2016  by Eckhart Arnold (arnold@badw.de)
#                 Bavarian Academy of Sciences and Humanities (badw.de)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.  See the License for the specific language governing
# permissions and limitations under the License.


"""
Module ``ebnf`` provides an EBNF-parser-generator that compiles an
EBNF-grammar into Python-code that can be executed to parse source text
conforming to this grammar into concrete syntax trees.

Specifying Grammars with EBNF
-----------------------------

With DHParser, grammars can be specified either directly in Python-code
(see :py:mod:`DHParser.parse`) or in one of several EBNF-dialects. (Yes,
DHParser supports several different variants of EBNF! This makes it easy
to create a parser directly from grammars found in external sources.)
"EBNF" stands for the "Extended Backus-Naur-Form" which is a common
formalism for specifying grammars for context-free languages.
(see https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form)

The recommended way of compiling grammars with DHParser is either to
write the EBNF-specification for that grammar into a text-file and then
compile the EBNF-source to an executable as well as importable Python-module
with the help of the "dhparser"-script, or, for bigger projects, to
create a new domain-specific-language-project with the DHParser-script
as described in the step-by-step-guide.

However, here we will show how to compile an EBNF-specified grammar
from within Python-code and how to execute the parser that was
generated by compiling the grammar.

As an example, we will realize a json-parser (https://www.json.org/).
Let's start with creating some test-data::

    >>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False}
    >>> import json
    >>> testdata = json.dumps(testobj)
    >>> testdata
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

We define the json-grammar (see https://www.json.org/) in a
top-down manner in EBNF. We'll use a regular-expression look-alike
syntax. EBNF, as you may recall, consists of a sequence of symbol
definitions. The definiens of such a definition is either a string
literal, a regular expression, another symbol, or a combination of
these with four different operators: 1. sequences, 2. alternatives,
3. options and 4. repetitions. Here is how these elements are
denoted in classical and regex-like EBNF-syntax:

========================  ==================  ================
element                   classical EBNF      regex-like
========================  ==================  ================
insignificant whitespace  ~                   ~
string literal            "..." or \\`...\\`    "..." or \\`...\\`
regular expr.             /.../               /.../
sequences                 A B C               A B C
alternatives              A | B | C           A | B | C
options                   [ ... ]             ...?
repetitions               { ... }               ...*
one or more                                   ...+
grouping                  (...)               (...)
========================  ==================  ================

"insignificant whitespace" is a speciality of DHParser. Denoting
insignificant whitespace with a particular sign `~` allows eliminating
it already during the parsing process without burdening later
syntax-tree-processing stages with this common task. DHParser offers
several more facilities to restrain the verbosity of the concrete
syntax tree, so that the outcome of the parsing stage already comes
close (or at least closer) to the intended abstract-syntax-tree.

JSON consists of two complex data types: 1) associative arrays,
called "object", and 2) sequences of heterogeneous data, called "array";
and of four simple data types: 1) string, 2) number, 3) bool and 4) null.
The structure of a JSON file can easily be described in EBNF::

    >>> grammar = '''
    ...     json     = ~ _element _EOF
    ...     _EOF     = /$/
    ...     _element = object | array | string | number | bool | null
    ...     object   = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member   = string ":" ~ _element
    ...     array    = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string   = `"` _CHARS `"` ~
    ...     _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number   = _INT _FRAC? _EXP? ~
    ...     _INT     = `-`? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...     _FRAC    = `.` /[0-9]+/
    ...     _EXP     = (`E`|`e`) [`+`|`-`] /[0-9]+/
    ...     bool     = "true" ~ | "false" ~
    ...     null     = "null" ~                                   '''

This is a rather common EBNF-grammar. A few peculiarities are noteworthy, though:
First of all, you might notice that some components of the grammar
(or "production rules" as they are commonly called) have names with a leading
underscore "_". It is a convention to mark those elements in which we are
not interested on their own account with an underscore "_". When moving from the
concrete syntax-tree to a more abstract syntax-tree, these elements could be
substituted by their content, to simplify the tree.

Secondly, some production rules carry a name written in capital letters. This is also
a convention to mark those symbols which with other parser-generators would
represent tokens delivered by a lexical scanner. DHParser is a "scanner-less"
parser, which means that the breaking down of the string into meaningful tokens
is done in place with regular expressions (like in the definition of "_EOF")
or simple combinations of regular expressions (see the definition of "_INT" above).
There is no sharp distinction between tokens and other symbols in DHParser,
but we keep it as a loose convention. Regular expressions are enclosed in forward
slashes and follow the standard syntax of Perl-style regular expressions that is
also used by the "re"-module of the Python standard library. (Don't worry about
the number of backslashes in the line defining "_CHARS" for now!)

Finally, it is another helpful convention to indent the definitions of symbols
that have only been introduced to simplify an otherwise unnecessarily
complicated definition (e.g. the definition of "number", above) or to make
it more understandable by giving names to its components (like "_EOF").

Let's try this grammar on our test-string.  In order to compile
this grammar into executable Python-code, we use the high-level-function
:py:func:`create_parser()` from the :py:mod:`DHParser.dsl`-module.

    >>> from DHParser.dsl import create_parser
    >>> # from DHParser.dsl import compileEBNF
    >>> # print(compileEBNF(grammar))
    >>> parser = create_parser(grammar, branding="JSON")
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'

As expected, serializing the content of the resulting syntax-tree yields exactly
the input-string of the parsing process. What we cannot see here is that the
parser has structured the string into the individual elements described in the
grammar. Since the concrete syntax-tree that the parser yields is rather
verbose, it would not make sense to print it out. We'll just look at a small
part of it, to see what it looks like. Let's just pick the sub-tree that
captures the first json-array within the syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr(compact=True))
    (array
      (:Text "[")
      (_element
        (number
          (_INT "1")))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (number
          (_INT "2")
          (_FRAC
            (:Text ".")
            (:RegExp "0"))))
      (:Text ",")
      (:Whitespace " ")
      (_element
        (string
          (:Text '"')
          (_CHARS "a string")
          (:Text '"')))
      (:Text "]"))

The nodes of the syntax-tree carry the names of the production rules
by which they have been generated. Nodes that have been created by
components of a production receive the name of the parser-type
that has created the node (see :py:mod:`DHParser.parse`) prefixed
with a colon ":". In DHParser, these nodes are called "anonymous",
because they lack the name of a proper grammatical component.

Simplifying Syntax-Trees while Parsing
--------------------------------------

Usually, anonymous nodes are what you want to get rid of in the course
of transforming the concrete syntax-tree into an abstract syntax-tree.
(See :py:mod:`DHParser.transform`). DHParser already eliminates by
default all anonymous nodes that are not leaf-nodes by replacing them
with their children during parsing. Anonymous leaf-nodes will be
replaced by their content, if they are a single child of some parent,
and otherwise be left in place. Without this optimization, each
construct of the EBNF-grammar would leave a node in the syntax-tree::

    >>> from DHParser.parse import CombinedParser, TreeReduction
    >>> _ = TreeReduction(parser.json, CombinedParser.NO_TREE_REDUCTION)
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(compact=True))
    (array
      (:Text "[")
      (:Option
        (:Series
          (_element
            (number
              (_INT
                (:Alternative
                  (:RegExp "1")))))
          (:ZeroOrMore
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (number
                  (_INT
                    (:Alternative
                      (:RegExp "2")))
                  (:Option
                    (_FRAC
                      (:Text ".")
                      (:RegExp "0"))))))
            (:Series
              (:Text ",")
              (:Whitespace " ")
              (_element
                (string
                  (:Text '"')
                  (_CHARS
                    (:RegExp "a string"))
                  (:Text '"')))))))
      (:Text "]"))

This can be helpful for understanding how parsing directed by
an EBNF-grammar works (next to looking at the logs of the complete
parsing-process, see :py:mod:`DHParser.trace`), but other than that it
is advisable to streamline the syntax-tree as early on as possible,
because the processing time of all subsequent tree-processing stages
increases with the number of nodes in the tree.

Because of this, DHParser offers further means of simplifying
syntax-trees during the parsing stage, already. These are not turned
on by default, because they allow dropping content or removing named
nodes from the tree; rather, they must be turned on by "directives" that
are listed at the top of an EBNF-grammar and that guide the
parser-generation process. DHParser-directives always start with an
`@`-sign. For example, the `@drop`-directive advises the parser to
drop certain nodes entirely, including their content. In the following
example, the parser is directed to drop all insignificant whitespace::

    >>> drop_insignificant_wsp = '@drop = whitespace \\n'

Directives look similar to productions, only that on the right hand
side of the equal sign follows a list of parameters. In the case
of the drop-directive these can be either names of non-anonymous
nodes that shall be dropped or one of four particular classes of
anonymous nodes (`strings`, `backticked`, `regexp`, `whitespace`) that
will be dropped.
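
For illustration, either of the following would be a valid drop-directive
(a sketch; the underscored symbol names are merely hypothetical)::

    @drop = whitespace, strings     # two classes of anonymous nodes
    @drop = _BEGIN, _END            # two particular (disposable) symbols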

Another useful directive advises the parser to treat named nodes as
anonymous nodes and to eliminate them accordingly during parsing. This
is useful if we have introduced certain names in our grammar
only as placeholders to render the definition of the grammar a bit
more readable, not because we are interested in the text that is
captured by the production associated with them in their own right::

    >>> disposable_symbols = '@disposable = /_\w+/ \\n'

Instead of passing a comma-separated list of symbols to the directive,
which would also have been possible, we have leveraged our convention
to prefix unimportant symbols with an underscore "_" by specifying the
symbols that shall be anonymized with a regular expression.
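
Spelled out, the following two directives would thus be equivalent for
the json-grammar above (the first one lists the underscored symbols
explicitly)::

    @disposable = _EOF, _element, _CHARS, _INT, _FRAC, _EXP
    @disposable = /_\w+/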

Now, let's examine the effect of these two directives::

    >>> refined_grammar = drop_insignificant_wsp + disposable_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> syntax_tree.content
    '{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}'

You might have noticed that this time all insignificant whitespace adjacent to
the delimiters has been removed (but, of course, not the
significant whitespace between "a" and "string" in "a string"). And
the difference the use of these two directives makes is even more
obvious, if we look at (a section of) the syntax-tree::

    >>> print(syntax_tree.pick('array').as_sxpr(compact=True))
    (array
      (:Text "[")
      (number "1")
      (:Text ",")
      (number
        (:RegExp "2")
        (:Text ".")
        (:RegExp "0"))
      (:Text ",")
      (string
        (:Text '"')
        (:RegExp "a string")
        (:Text '"'))
      (:Text "]"))

This tree looks more streamlined. But it still contains more structure
than we might like to see in an abstract syntax tree. In particular, it
still contains all the delimiters ("[", ",", '"', ...) next to the data. But
unlike in the UTF-8 representation of our json data, the delimiters are
not needed any more, because the structural information is now retained
in the tree-structure.

So how can we get rid of those delimiters? The rather coarse-grained tools
that DHParser offers in the parsing stage require some care to do this
properly.

The @drop-directive allows dropping all unnamed strings (i.e. strings
that are not directly assigned to a symbol), backticked strings (for
the difference between strings and backticked strings, see below) and
regular expressions. However, using `@drop = whitespace, strings, backticked`
would also drop those parts captured as strings that contain data::

    >>> refined_grammar = '@drop = whitespace, strings, backticked \\n' \
                          + disposable_symbols + grammar
    >>> parser = create_parser(refined_grammar, 'JSON')
    >>> syntax_tree = parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(compact=True))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp "0"))
      (string "a string"))

Here, suddenly, the number "2.0" has been turned into "20"! There
are three ways to get around this problem:

1. Assigning all non-delimiter strings to symbols. In this case
   we would have to rewrite the definition of "number" as such::

      number     = _INT _FRAC? _EXP? ~
        _INT     = _MINUS? ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = _DOT /[0-9]+/
        _EXP     = (_Ecap|_Esmall) [_PLUS|_MINUS] /[0-9]+/
        _MINUS   = `-`
        _PLUS    = `+`
        _DOT     = `.`
        _Ecap    = `E`
        _Esmall  = `e`

   A simpler alternative of this technique would be to make use of
   the fact that the document-parts captured by regular expressions
   are not dropped (although regular expressions can also be listed
   in the @drop-directive, if needed) and that at the same time
   delimiters are almost always simple strings containing keywords
   or punctuation characters. Thus, one only needs to rewrite those
   string-expressions that capture data as regular expressions::

      number     = _INT _FRAC? _EXP? ~
        _INT     = /[-]/ ( /[1-9][0-9]+/ | /[0-9]/ )
        _FRAC    = /[.]/ /[0-9]+/
        _EXP     = (/E/|/e/) [/[-+]/] /[0-9]+/

2. Assigning all delimiter strings to symbols and dropping the nodes
   and content captured by these symbols. This means doing exactly
   the opposite of the first solution. Here is an excerpt of what
   a JSON-grammar employing this technique would look like::

      @disposable = /_\w+/
      @drop = whitespace, _BEGIN_ARRAY, _END_ARRAY, _KOMMA, _BEGIN_OBJECT, ...
      ...
      array = _BEGIN_ARRAY ~ ( _element ( _KOMMA ~ _element )* )? §_END_ARRAY ~
      ...

   It is important that all symbols listed for dropping are also made
   disposable, either by listing them in the disposable-directive as well
   or by using names that the regular expression for disposables matches.
   Otherwise, DHParser does not allow dropping the content of named nodes,
   because the default assumption is that symbols in the grammar are
   defined to capture meaningful parts of the document that contain
   relevant data.

3. Bailing out and leaving the further simplification of the syntax-tree
   to the next tree-processing stage which, if you follow DHParser's suggested
   usage pattern, is the abstract-syntax-tree-transformation proper
   and which allows for a much more fine-grained specification of
   transformation rules. See :py:mod:`DHParser.transform`.

To round this section up, we present the full grammar for a streamlined
JSON-Parser according to the first solution-strategy. Observe that the
values of "bool" and "null" are now defined with regular expressions
instead of string-literals, because the latter would be dropped due to
the `@drop = ... strings, ...`-directive, leaving an empty named node
without a value, whenever a bool value or null occurs in the input::

    >>> json_gr = '''
    ...     @disposable = /_\\\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" ~ member ( "," ~ §member )* "}" ~
    ...     member     = string ":" ~ _element
    ...     array      = "[" ~ ( _element ( "," ~ _element )* )? "]" ~
    ...     string     = `"` _CHARS `"` ~
    ...       _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/ ~ | /false/ ~
    ...     null       = /null/ ~                                  '''
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.pick('array').as_sxpr(compact=True))
    (array
      (number "1")
      (number
        (:RegExp "2")
        (:RegExp ".")
        (:RegExp "0"))
      (string "a string"))

This time the data is not distorted any more. One oddity remains, however: We
are most probably not interested in the fact that the number 2.0 consists of
three components, each of which has been captured by a regular expression.
Luckily, there exists yet another directive that allows reducing the tree
further by merging adjacent anonymous leaf-nodes::

    >>> json_gr = '@reduction = merge \\n' + json_gr
    >>> json_parser = create_parser(json_gr, 'JSON')
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr(compact=True))
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Merging adjacent anonymous leaf-nodes takes place after the @drop-directive
comes into effect. It should be observed that merging only produces the desired
result, if any delimiters have been dropped previously, because otherwise
delimiters would be merged with content. Therefore, the `@reduction = merge`-
directive should at best only be applied in conjunction with the `@drop` and
`@disposable`-directives.
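
Taken together, a typical set of tree-streamlining directives, like the
one used by the complete json-grammar further below, looks like this::

    @disposable = /_\w+/
    @drop       = whitespace, strings, backticked, _EOF
    @reduction  = merge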

Applying any of the tree-reductions described here (or "simplifications", for
that matter) requires a bit of careful planning concerning which nodes
will be named and which nodes will be dropped. This, however, pays off in
terms of speed and a considerably simplified abstract-syntax-tree generation
stage, because most of the unnecessary structure of concrete-syntax-trees
has already been eliminated at the parsing stage.

Comments and Whitespace
-----------------------

Why whitespace isn't trivial
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Handling whitespace in text-documents is not at all trivial, because
whitespace can serve several different purposes and there can be
different kinds of whitespace: Whitespace can serve a syntactic function
as delimiter. But whitespace can also be purely aesthetic, to render
a document more readable.

Depending on the data model, whitespace can be considered as
significant and be included in the data or as
insignificant and be excluded from the data and only be re-inserted
when displaying the data in a human-readable-form. (For example, one
can model a sentence as a sequence of words and spaces or just as
a sequence of words.) Note that "significance" does not correlate
with the syntactic or aesthetic function, but only depends on whether
you'd like to keep the whitespace in your data or not.

There can be different kinds of whitespace with different meaning
(and differing significance). For example, one can make a difference
between horizontal whitespace (spaces and tabs) and vertical
whitespace (including linefeeds). And there can be different sizes
of whitespace with different meaning. For example in LaTeX, a single
linefeed still counts as plain whitespace while an empty line (i.e.
whitespace including two or more linefeeds) signals a new
paragraph.

Finally, even the position of whitespace can make a difference.
A certain number of whitespaces at the beginning of a line can
have the meaning of "indentation" (as in Python code) while at
the end of the line or between brackets it is just plain
insignificant whitespace. (This is actually something where
the boundaries of the EBNF-formalism become visible and you'd
probably use a preprocessor or some kind of "semantic actions"
to handle such cases. There is some support for either of these
in DHParser.)

Coding significant Whitespace in EBNF-Grammars
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonable approach to coding whitespace is to use one
particular symbol for each kind of whitespace. Those kinds of
whitespace that are insignificant, i.e. that do not need to
appear in the data, should be dropped from the syntax-tree.
With DHParser this can be done already while parsing, using
the `@disposable` and `@drop`-directives described earlier.

But let's first look at an example which only includes significant
whitespace. The following parser parses sequences of paragraphs which
consist of sequences of sentences which consist of sequences
of main clauses and subordinate clauses which consist of sequences
of words::

    >>> text_gr = '''
    ...     @ disposable = /_\\\\w+/
    ...     document       = PBR* S? paragraph (PBR paragraph)* PBR* S? _EOF
    ...       _EOF         = /$/
    ...     paragraph      = sentence (S sentence)*
    ...     sentence       = (clause _c_delimiter S)* clause _s_delimiter
    ...       _c_delimiter = KOMMA | COLON | SEMICOLON
    ...       _s_delimiter = DOT | QUESTION_MARK | EXCLAMATION_MARK
    ...     clause         = word (S word)*
    ...     word           = /(?:[A-Z]|[a-z])[a-z']*/
    ...     DOT            = `.`
    ...     QUESTION_MARK  = `?`
    ...     EXCLAMATION_MARK = `!`
    ...     KOMMA          = `,`
    ...     COLON          = `:`
    ...     SEMICOLON      = `;`
    ...     PBR            = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/
    ...     S              = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/ '''

Here, we have two types of significant whitespace, `PBR` ("paragraph-break") and `S`
("space"). Both types allow for a certain amount of flexibility, so that two
whitespaces of the same type do not need to have exactly the same content, but
we could always normalize these whitespaces in a subsequent transformation step.

Two typical design patterns for significant whitespace are noteworthy here:

1. Both whitespaces match only if there was at least one whitespace character.
   We may allow whitespace to be optional (as at the beginning and end of the
   document), but if the option has not been taken, we don't want to see an empty
   whitespace-tag in the document, later on.
   (For insignificant whitespace, the opposite convention can be more convenient,
   because, typically, insignificant whitespace is dropped anyway, whether it's
   got content or not.)

2. The grammar is constructed in such a way that the whitespace always appears
   *between* different elements at the same level, but not after the last or
   before the first element. The whitespace after the last word of a sentence
   or before the first word of a sentence is really whitespace between
   two sentences. If we pick out a sentence or a clause, we will have no
   dangling whitespace at its beginning or end.
   (Again, for soon to be dropped insignificant whitespace, another convention
   can be more advisable.)

Let's just try our grammar on an example::

    >>> text_example = '''
    ... I want to say, in all seriousness, that a great deal of harm is being
    ... done in the modern world by belief in the virtuousness of work, and that
    ... the road to happiness and prosperity lies in an organized diminution of
    ... work.
    ...
    ... First of all: what is work? Work is of two kinds: first, altering the
    ... position of matter at or near the earth's surface relatively to other
    ... such matter; second, telling other people to do so. The first kind is
    ... unpleasant and ill paid; the second is pleasant and highly paid.'''
    >>> text_parser = create_parser(text_gr, 'Text')
    >>> text_as_data = text_parser(text_example)
    >>> sentence = text_as_data.pick(\
            lambda nd: nd.tag_name == "sentence" and nd.content.startswith('First'))
    >>> print(sentence.as_sxpr(compact=True))
    (sentence
      (clause
        (word "First")
        (S " ")
        (word "of")
        (S " ")
        (word "all"))
      (COLON ":")
      (S " ")
      (clause
        (word "what")
        (S " ")
        (word "is")
        (S " ")
        (word "work"))
      (QUESTION_MARK "?"))

Again, it is a question of design, whether we leave whitespace in the data or
not. Leaving it in has the advantage that serialization becomes as simple as
printing the content of the data-tree::

    >>> print(sentence)
    First of all: what is work?

Otherwise one would have to program a dedicated serialization routine. Especially
if you receive data from a different source, you'll appreciate not having to
do this - and so will other people receiving your data. Think about it! However,
dropping the whitespace will yield more concise data.

Coding Comments
^^^^^^^^^^^^^^^

Allowing comments in a domain-specific language almost always makes sense,
because it allows users to annotate the source texts while working on them
and to share those comments with collaborators. From a technical point of
view, adding comments to a DSL raises two questions:

1. At what places shall we allow to insert comments in the source code?
   Common answers are: a) at the end of a line, b) almost everywhere, or
   c) both.

2. How do we avoid pollution of the EBNF-grammar with comment markers?
   It already curtails readability that we have to put whitespace
   symbols in so many places. And speaking of comments at the end of
   the line: If linefeeds aren't important for us - as in our toy-grammar
   for prose-text, above - we probably wouldn't want to reframe our
   grammar just to allow for comments at the end of the line.

Luckily, there exists a simple and highly intuitive solution that takes
care of both of these concerns: We admit comments wherever whitespace
is allowed. And we code this by defining a symbol that means: "whitespace
and, optionally, a comment".

Let's try this with our prose-text-grammar. In order to do so, we have
to define a symbol for comments, a symbol for pure whitespace, and,
finally, a symbol for whitespace with optional comment. Since, in
our grammar, we actually have two kinds of whitespace, `S` and `PBR`,
we'll have to redefine both of them. As delimiters for comments, we
use curly braces::

    >>> wsp_gr = '''
    ...     PBR      = pure_PBR COMMENT (pure_PBR | pure_S)?
    ...              | (pure_S? COMMENT)? pure_PBR
    ...     S        = pure_S COMMENT pure_S? | COMMENT? pure_S
    ...     COMMENT  = /\{[^}]*\}/
    ...     pure_PBR = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/
    ...     pure_S   = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/'''

As can be seen, the concrete re-definition of the whitespace tokens
requires a bit of careful consideration, because we want to allow
additional whitespace next to comments, but at the same time avoid
ending up with two whitespaces in sequence in our data. Let's see if
we have succeeded::

    >>> extended_text_gr = text_gr[:text_gr.rfind(' PBR')] + wsp_gr
    >>> extended_parser = create_parser(extended_text_gr, 'Text')
    >>> syntax_tree = extended_parser('What {check this again!} is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> print(syntax_tree.pick('clause').as_sxpr(compact=True))
    (clause
      (word "What")
      (S
        (pure_S " ")
        (COMMENT "{check this again!}")
        (pure_S " "))
      (word "is")
      (S
        (pure_S " "))
      (word "work"))

We will not worry about the sub-structure of the S-nodes right now. If
we are not interested in the comments, we could use the `@disposable`,
`@drop` and `@reduction = merge`-directives to simplify these at the
parsing stage. Or, we could extract the comments and normalize the whitespace
at a later tree-processing stage. For now, let's just check whether our
comments work as expected::

    >>> syntax_tree = extended_parser('What{check this again!} is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What {check this again!}is work?')
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children))
    word S word S word
    >>> syntax_tree = extended_parser('What{check this again!}is work?')
    >>> print(syntax_tree.errors[0])
    1:1: Error (1040): Parser "document = {PBR} [S] paragraph {PBR paragraph} {PBR} [S] _EOF" did not match!

The last error was to be expected, because we did not allow comments
to serve as substitutes for whitespace. The error message might not be
as clear about the actual error as we might wish, but this is a topic
for later. Let's check whether putting comments near paragraph breaks
works as well::

    >>> test_text = '''Happiness lies in the diminuniation of work.
    ...
    ... { Here comes the comment }
    ...
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.children))
    paragraph PBR paragraph
    >>> test_text = '''Happiness lies in the diminuniation of work.
    ... { Here comes the comment }
    ... What is work?'''
    >>> syntax_tree = extended_parser(test_text)
    >>> print(' '.join(nd.tag_name for nd in syntax_tree.children))
    paragraph

The last result might look surprising at first, but since a paragraph
break requires at least one empty line as a separator, the input text
is correctly understood by the parser as a single paragraph with
two sentences interspersed by a single whitespace which, incidentally,
contains a comment::

    >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('paragraph').children))
    sentence S sentence
    >>> print(syntax_tree.pick('paragraph')['S'].as_sxpr(compact=True))
    (S
      (pure_S
        ""
        "")
      (COMMENT "{ Here comes the comment }")
      (pure_S
        ""
        ""))

A common problem with whitespace is that it tends to pollute
the grammar, because wherever you'd like to allow whitespace,
you'd have to insert a symbol for whitespace. The same problem
exists when it comes to allowing comments, because you'd
probably want to allow comments in as many places as possible.

DHParser's support for insignificant whitespace and comments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Coding insignificant whitespace and comments is exactly the
same as coding significant whitespace and comments and does not
need to be repeated here. (The combination of insignificant
whitespace and significant comments is slightly more complicated,
and probably best outsourced to some degree to the post-parsing
processing stages. It will not be discussed here.) However,
DHParser offers some special support for insignificant
whitespace and comments, which can make working with these
easier in some cases.

First of all, DHParser has a special dedicated token for
insignificant whitespace which is the tilde `~`-character.
We have seen this earlier in the definition of the json-Grammar.

The `~` whitespace marker differs from the usual pattern for
defining whitespace in that it is implicitly optional or, what
amounts to the same, it matches the empty string. Normally,
it is considered bad practice to define a symbol as
optional. Rather, a symbol should always match something, and
only at the places where it is used should it be marked as
optional. If this rule is obeyed, it is always easy to tell
whether some element is optional or not at a specific place
in the grammar. Otherwise, it can become quite confusing
indeed. However, since the tilde character is usually used
very often, it is more convenient not to mark it with a
question-mark or, if you use classical EBNF-syntax, to enclose
it in square brackets.
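
To illustrate the difference (a sketch; `WSP` is a hypothetical symbol,
not part of any of the grammars above)::

    word = /\w+/ WSP?    # an explicitly optional whitespace-symbol
    word = /\w+/ ~       # tilde-whitespace is implicitly optional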

The default regular expression for the tilde-whitespace captures
arbitrarily many spaces and tabs and at most one linefeed, but
not an empty line (`[ \t]*(?:\n[ \t]*)?(?!\n)`), as this is
the most convenient way to define whitespace for text-data.
However, the tilde-whitespace can also be defined with any
other regular expression with the `@whitespace`-directive.

Let's go back to our JSON-grammar and define the optional
insignificant whitespace marked by the tilde-character in such a
way that it matches any amount of horizontal or vertical
whitespace, which makes much more sense in the context of json
than the default tilde-whitespace that is restricted vertically
to at most a single linefeed::

    >>> testdata = '{"array": [1, 2.0, "a string"], \\n\\n\\n "number": -1.3e+25, "bool": false}'
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors[0])
    1:32: Error (1010): member expected by parser 'object', » \\n \\n \\n  "numb...« found!
    >>> json_gr = '@whitespace = /\\\\s*/ \\n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.errors)
    []

When redefining the tilde-whitespace, make sure that your regular expression
also matches the empty string! There is no need to worry that the syntax tree
gets cluttered by empty whitespace-nodes, because tilde-whitespace always
yields anonymous nodes and DHParser drops empty anonymous nodes right away.

Comments can be defined using the `@comment`-directive. DHParser automatically
intermingles comments and whitespace so that wherever tilde-whitespace is
allowed, a comment defined by the `@comment`-directive is also allowed::

    >>> json_gr = '@comment = /#[^\\\\n]*(?:\\\\n|$)/ \\n' + json_gr
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> testdata = '''{"array": [1, 2.0, "a string"], # a string
    ...                "number": -1.3e+25,  # a number
    ...                "bool": false}  # a bool'''
    >>> syntax_tree = json_parser(testdata)
    >>> print(syntax_tree.as_sxpr(compact = True))
    (json
      (object
        (member
          (string "array")
          (array
            (number "1")
            (number "2.0")
            (string "a string")))
        (member
          (string "number")
          (number "-1.3e+25"))
        (member
          (string "bool")
          (bool "false"))))

Since the json-grammar still contains the `@drop = whitespace, ...`-
directive from earlier on (next to other tree-reductions), the comments
have been nicely dropped along with the tilde-whitespace.

There is one caveat: When using comments alongside whitespace that
captures at most one linefeed, the comments should be defined in such
a way that the last character of a comment is never a linefeed.
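
For example, with the hash-comments defined above, a comment-definition
that stops before the linefeed rather than consuming it would satisfy
this rule (a sketch)::

    @comment = /#[^\n]*/    # ends before the linefeed, not after it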

Also, a few limitations of the tilde-whitespace and directive-defined
comments should be kept in mind: 1. Only one kind of insignificant
whitespace can be defined this way. If there are more kinds of
insignificant whitespace, all but one need to be defined conventionally
as part of the grammar. 2. Both directive-defined comments and
tilde-whitespace can only be defined by a regular expression. In
particular, nested comments are impossible to define with regular
expressions alone.
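
An additional kind of insignificant whitespace would therefore be coded
as an ordinary symbol and dropped during parsing, for example (a sketch;
`_HWSP` is a hypothetical name)::

    @disposable = _HWSP
    @drop       = _HWSP
    _HWSP       = /[ \t]+/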

However, using tilde-whitespace has yet one more benefit: With the
tilde-whitespace, cluttering of the grammar with whitespace-markers
can be avoided by adding implicit whitespace adjacent to string-literals.
Remember the definition of the JSON-Grammar earlier. If you look at
a definition like: `object = "{" ~ member ( "," ~ §member )* "}" ~`,
you'll notice that there are three whitespace markers, one next to
each delimiter. Naturally so, because one usually wants to allow users
of a domain specific language to put whitespace around delimiters.

You may wonder why the tilde appears only on the right hand side
of the literals, although you'd probably like to allow whitespace
on both sides of a literal like "{". But if you look at the grammar
closely, you'll find that almost every symbol definition ends
either with a tilde sign or with a symbol the definition of which ends
with a tilde sign, which means that they allow whitespace on the
right hand side. Now, if all elements of the grammar allow
whitespace on the right hand side, this means that they automatically
also have whitespace on the left-hand side, which is, of
course, the whitespace on the right hand side of the previous
element.

In order to reduce cluttering the grammar with tilde-signs, DHParser
allows turning on implicit tilde-whitespace adjacent to any
string literal with the directive `@ literalws = right` or
`@ literalws = left`. As the argument of the directive suggests,
whitespace is either "eaten" at the right hand side or the left
hand side of the literal. String literals can either be
enclosed in double quotes "..." or single quotes '...'. Both
kinds of literals will have implicit whitespace, if the
`@literalws`-directive is used.

(Don't confuse implicit whitespace
with insignificant whitespace: Insignificant whitespace is whitespace
you do not need any more after parsing. Implicit whitespace is
whitespace you do not denote explicitly in the grammar. It's
a speciality of DHParser, and DHParser allows only the insignificant
whitespace denoted by the tilde-character to be declared as
"implicit".)

If right-adjacent whitespace is declared as implicit with the
`@literalws`-directive, the expression::

    object     = "{" ~ member ( "," ~ §member )* "}" ~

can be written as::

    object     = "{" member ( "," §member )* "}"

which is easier to read.

For situations where implicit whitespace is not desired, DHParser
has a special kind of string literal, written with backticks, which
never carries any implicit whitespace. This is important when
literals are used for signs that enclose content, like the quotation
marks for the string literals in our JSON-Grammar::

    string     = `"` _CHARS '"'  # mind the difference between `"` and '"'!

Regular expressions, also, never carry implicit whitespace.
So, if you are using regular expressions as delimiters, you
still have to add the tilde character, if adjacent insignificant
whitespace is to be allowed::

    bool       = /true/~ | /false/~

The complete json-grammar now looks like this::

    >>> json_gr = '''
    ...     @disposable = /_\\\\w+/
    ...     @drop      = whitespace, strings, backticked, _EOF
    ...     @reduction = merge
    ...     @whitespace= /\\\\s*/
    ...     @comment   = /#[^\\\\n]*(?:\\\\n|$)/
    ...     @literalws = right
    ...     json       = ~ _element _EOF
    ...       _EOF     = /$/
    ...     _element   = object | array | string | number | bool | null
    ...     object     = "{" member ( "," §member )* "}"
    ...     member     = string ":" _element
    ...     array      = "[" ( _element ( "," _element )* )? "]"
    ...     string     = `"` _CHARS '"'
    ...       _CHARS   = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/
    ...     number     = _INT _FRAC? _EXP? ~
    ...       _INT     = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ )
    ...       _FRAC    = /[.]/ /[0-9]+/
    ...       _EXP     = /[Ee]/ [/[-+]/] /[0-9]+/
    ...     bool       = /true/~ | /false/~
    ...     null       = /null/~                                    '''
    >>> json_parser = create_parser(json_gr, "JSON")
    >>> syntax_tree_ = json_parser(testdata)
    >>> assert syntax_tree_.equals(syntax_tree)

The whitespace defined by the `@whitespace`-directive can be accessed from
within the grammar via the name `WHITESPACE__`. Unlike the tilde-sign,
this name refers to the pure whitespace that is not intermingled with
comments. Similarly, comments defined by the `@comment`-directive can
be accessed via the symbol `COMMENT__`.
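
A grammar could, for instance, reuse these definitions explicitly, as in
the following sketch (not part of the json-grammar above)::

    pure_whitespace = WHITESPACE__    # whitespace without any comment
    comment         = COMMENT__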

Lookahead and Lookbehind
------------------------

Lookahead and lookbehind operators are a convenient way to resolve or rather
avoid ambiguities while at the same time keeping the DSL lean. Assume for
example a simple DSL for writing definitions like::

    >>> definitions = '''
    ...     dog   := carnivorous quadrupel that barks
    ...     human := featherless biped'''

Now, let's try to draw up a grammar for definitions::

    >>> def_DSL_first_try = ''' # WARNING: This grammar doesn't work, yet!
    ...     @literalws = right
    ...     definitions = ~ definition { definition } EOF
    ...     definition  = definiendum ":=" definiens
    ...     definiendum = word
    ...     definiens   = word { word }
    ...     word        = /[A-Z]?[a-z]*/
    ...     EOF         = /$/ '''
    >>> def_parser = create_parser(def_DSL_first_try, "defDSL")



Fail-tolerant Parsing
---------------------



Semantic Actions and Storing Variables
--------------------------------------


"""


from collections import OrderedDict
from functools import partial
import keyword
import os
from typing import Callable, Dict, List, Set, FrozenSet, Tuple, Sequence, Union, Optional

from DHParser.compile import CompilerError, Compiler, ResultTuple, compile_source, visitor_name
from DHParser.configuration import access_thread_locals, get_config_value, \
    NEVER_MATCH_PATTERN, ALLOWED_PRESET_VALUES
from DHParser.error import Error, AMBIGUOUS_ERROR_HANDLING, WARNING, REDECLARED_TOKEN_WARNING,\
    REDEFINED_DIRECTIVE, UNUSED_ERROR_HANDLING_WARNING, INAPPROPRIATE_SYMBOL_FOR_DIRECTIVE, \
    DIRECTIVE_FOR_NONEXISTANT_SYMBOL, UNDEFINED_SYMBOL_IN_TRANSTABLE_WARNING, \
    UNCONNECTED_SYMBOL_WARNING, REORDERING_OF_ALTERNATIVES_REQUIRED, BAD_ORDER_OF_ALTERNATIVES, \
    EMPTY_GRAMMAR_ERROR, MALFORMED_REGULAR_EXPRESSION
from DHParser.parse import Parser, Grammar, mixin_comment, mixin_nonempty, Forward, RegExp, \
    Drop, Lookahead, NegativeLookahead, Alternative, Series, Option, ZeroOrMore, OneOrMore, \
    Text, Capture, Retrieve, Pop, optional_last_value, GrammarError, Whitespace, Always, Never, \
    INFINITE, matching_bracket, ParseFunc, update_scanner, CombinedParser
from DHParser.preprocess import nil_preprocessor, PreprocessorFunc
from DHParser.syntaxtree import Node, RootNode, WHITESPACE_PTYPE, TOKEN_PTYPE
from DHParser.toolkit import load_if_file, escape_re, escape_ctrl_chars, md5, \
    sane_parser_name, re, expand_table, unrepr, compile_python_object, DHPARSER_PARENTDIR, \
    cython
from DHParser.transform import TransformationFunc, traverse, remove_brackets, \
    reduce_single_child, replace_by_single_child, is_empty, remove_children, \
    remove_tokens, flatten, forbid, assert_content, remove_children_if, all_of, not_one_of, \
    BLOCK_LEAVES
from DHParser.versionnumber import __version__


__all__ = ('DHPARSER_IMPORTS',
           'get_ebnf_preprocessor',
           'get_ebnf_grammar',
           'get_ebnf_transformer',
           'get_ebnf_compiler',
           'parse_ebnf',
           'transform_ebnf',
           'compile_ebnf_ast',
           'EBNFGrammar',
           'EBNFTransform',
           'EBNFCompilerError',
           'EBNFDirectives',
           'EBNFCompiler',
           'grammar_changed',
           'compile_ebnf',
           'PreprocessorFactoryFunc',
           'ParserFactoryFunc',
           'TransformerFactoryFunc',
           'CompilerFactoryFunc')


########################################################################
#
# source code support
#
########################################################################


DHPARSER_IMPORTS = '''
import collections
from functools import partial
import os
import sys
from typing import Tuple, List, Union, Any, Optional, Callable

try:
    scriptpath = os.path.dirname(__file__)
except NameError:
    scriptpath = ''
dhparser_parentdir = os.path.abspath(os.path.join(scriptpath, r'{dhparser_parentdir}'))
if scriptpath not in sys.path:
    sys.path.append(scriptpath)
if dhparser_parentdir not in sys.path:
    sys.path.append(dhparser_parentdir)

try:
    import regex as re
except ImportError:
    import re
from DHParser import start_logging, suspend_logging, resume_logging, is_filename, load_if_file, \\
    Grammar, Compiler, nil_preprocessor, PreprocessorToken, Whitespace, Drop, AnyChar, \\
    Lookbehind, Lookahead, Alternative, Pop, Text, Synonym, Counted, Interleave, INFINITE, \\
    Option, NegativeLookbehind, OneOrMore, RegExp, Retrieve, Series, Capture, TreeReduction, \\
    ZeroOrMore, Forward, NegativeLookahead, Required, CombinedParser, mixin_comment, \\
    compile_source, grammar_changed, last_value, matching_bracket, PreprocessorFunc, is_empty, \\
    remove_if, Node, TransformationFunc, TransformationDict, transformation_factory, traverse, \\
    remove_children_if, move_adjacent, normalize_whitespace, is_anonymous, matches_re, \\
    reduce_single_child, replace_by_single_child, replace_or_reduce, remove_whitespace, \\
    replace_by_children, remove_empty, remove_tokens, flatten, all_of, any_of, \\
    merge_adjacent, collapse, collapse_children_if, transform_content, WHITESPACE_PTYPE, \\
    TOKEN_PTYPE, remove_children, remove_content, remove_brackets, change_tag_name, \\
    remove_anonymous_tokens, keep_children, is_one_of, not_one_of, has_content, apply_if, peek, \\
    remove_anonymous_empty, keep_nodes, traverse_locally, strip, lstrip, rstrip, \\
    transform_content, replace_content_with, forbid, assert_content, remove_infix_operator, \\
    add_error, error_on, recompile_grammar, left_associative, lean_left, set_config_value, \\
    get_config_value, node_maker, access_thread_locals, access_presets, PreprocessorResult, \\
    finalize_presets, ErrorCode, RX_NEVER_MATCH, set_tracer, resume_notices_on, \\
    trace_history, has_descendant, neg, has_ancestor, optional_last_value, insert, \\
    positions_of, replace_tag_names, add_attributes, delimit_children, merge_connected, \\
    has_attr, has_parent, ThreadLocalSingletonFactory, Error, canonical_error_strings, \\
    has_errors, ERROR, FATAL, set_preset_value, get_preset_value, NEVER_MATCH_PATTERN, \\
    gen_find_include_func, preprocess_includes, make_preprocessor, chain_preprocessors
'''


########################################################################
#
# EBNF scanning
#
########################################################################


def get_ebnf_preprocessor() -> PreprocessorFunc:
    """
    Returns the preprocessor function for the EBNF compiler.
    As of now, no preprocessing is needed for EBNF-sources. Therefore,
    just a dummy function is returned.
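
    Example (the returned function is simply DHParser's do-nothing
    preprocessor)::

        >>> get_ebnf_preprocessor() is nil_preprocessor
        True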
    """
    return nil_preprocessor


########################################################################
#
# EBNF parsing
#
########################################################################


class EBNFGrammar(Grammar):
    r"""Parser for a FlexibleEBNF source file.

    This grammar is tuned for flexibility, that is, it supports as many
    different flavors of EBNF as possible. However, this flexibility
    comes at the cost of some ambiguities. In particular:

       1. the alternative OR-operator / could be mistaken for the start
          of a regular expression and vice versa, and
       2. character ranges [a-z] can be mistaken for optional blocks
          and vice versa

    A strategy to avoid these ambiguities is to do all of the following:

        - replace the free_char-parser by a never matching parser
        - if this is done, it is safe to replace the char_range_heuristics-
          parser by an always matching parser
        - replace the regex_heuristics by an always matching parser

    Ambiguities can also be avoided by NOT using all the syntactic variants
    made possible by this EBNF-grammar within one and the same EBNF-document.
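
    The ``mode``-property defined further below switches between such
    setups in one step; 'strict', for example, applies all three of the
    replacements listed above (a sketch)::

        >>> parser = EBNFGrammar()
        >>> parser.mode = 'strict'
        >>> parser.mode
        'strict'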

    EBNF-definition of the Grammar::

        @ comment    = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/
        # comments can be either C-Style: /* ... */
        # or pascal/modula/oberon-style: (* ... *)
        # or python-style: # ... \n,
        # excluding, however, character markers: #x20

        @ whitespace = /\s*/                            # whitespace includes linefeeds
        @ literalws  = right                            # trailing whitespace of literals will be ignored tacitly
        @ disposable = pure_elem, countable, FOLLOW_UP, SYM_REGEX, ANY_SUFFIX, EOF
        @ drop       = whitespace, EOF                  # do not include these even in the concrete syntax tree
        @ RNG_BRACE_filter = matching_bracket()         # filter or transform content of RNG_BRACE on retrieve

        # re-entry-rules for resuming after parsing-error

        @ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/
        @ directive_resume  = /\n\s*(?=@|\w+\w*\s*=)/

        # specialized error messages for certain cases

        @ definition_error  = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to '
                                   'be a directive and the directive symbol @ is missing\nor the error is '
                                   'due to inconsistent use of the comma as a delimiter\nfor the elements '
                                   'of a sequence.'


        #: top-level

        syntax     = ~ { definition | directive } EOF
        definition = symbol §:DEF~ [ :OR~ ] expression :ENDL~ & FOLLOW_UP  # [:OR~] to support v. Rossum's syntax

        directive  = "@" §symbol "=" (regexp | literals | procedure | symbol !DEF)
                     { "," (regexp | literals | procedure | symbol !DEF) } & FOLLOW_UP
        literals   = { literal }+                       # string chaining, only allowed in directives!
        procedure  = SYM_REGEX "()"                     # procedure name, only allowed in directives!

        FOLLOW_UP  = `@` | symbol | EOF


        #: components

        expression = sequence { :OR~ sequence }
        sequence   = ["§"] ( interleave | lookaround )  # "§" means all following terms mandatory
                     { :AND~ ["§"] ( interleave | lookaround ) }
        interleave = difference { "°" ["§"] difference }
        lookaround = flowmarker § (oneormore | pure_elem)
        difference = term ["-" § (oneormore | pure_elem)]
        term       = oneormore | counted | repetition | option | pure_elem


        #: elements

        countable  = option | oneormore | element
        pure_elem  = element § !ANY_SUFFIX              # element strictly without a suffix
        element    = [retrieveop] symbol !:DEF          # negative lookahead to be sure it's not a definition
                   | literal
                   | plaintext
                   | regexp
                   | char_range
                   | character ~
                   | any_char
                   | whitespace
                   | group


        ANY_SUFFIX = /[?*+]/


        #: flow-operators

        flowmarker = "!"  | "&"                         # '!' negative lookahead, '&' positive lookahead
                   | "<-!" | "<-&"                      # '<-' negative lookbehind, '<-&' positive lookbehind
        retrieveop = "::" | ":?" | ":"                  # '::' pop, ':?' optional pop, ':' retrieve


        #: groups

        group      = "(" no_range §expression ")"
        oneormore  = "{" no_range expression "}+" | element "+"
        repetition = "{" no_range §expression "}" | element "*" no_range
        option     = !char_range "[" §expression "]" | element "?"
        counted    = countable range | countable :TIMES~ multiplier | multiplier :TIMES~ §countable

        range      = RNG_BRACE~ multiplier [ :RNG_DELIM~ multiplier ] ::RNG_BRACE~
        no_range   = !multiplier | &multiplier :TIMES
        multiplier = /[1-9]\d*/~


        #: leaf-elements

        symbol     = SYM_REGEX ~                        # e.g. expression, term, parameter_list
        literal    = /"(?:(?<!\\)\\"|[^"])*?"/~         # e.g. "(", '+', 'while'
                   | /'(?:(?<!\\)\\'|[^'])*?'/~         # whitespace following literals will be ignored tacitly.
        plaintext  = /`(?:(?<!\\)\\`|[^`])*?`/~         # like literal but does not eat whitespace
                   | /´(?:(?<!\\)\\´|[^´])*?´/~
        regexp     = :RE_LEADIN RE_CORE :RE_LEADOUT ~   # e.g. /\w+/, ~/#.*(?:\n|$)/~
        # regexp     = /\/(?:(?<!\\)\\(?:\/)|[^\/])*?\//~     # e.g. /\w+/, ~/#.*(?:\n|$)/~
        char_range = `[` &char_range_heuristics
                         [`^`] (character | free_char) { [`-`] character | free_char } "]"
        character  = :CH_LEADIN HEXCODE
        free_char  = /[^\n\[\]\\]/ | /\\[nrt`´'"(){}\[\]\/\\]/
        any_char   = "."
        whitespace = /~/~                               # insignificant whitespace

        #: delimiters

        EOF = !/./ [:?DEF] [:?OR] [:?AND] [:?ENDL]      # [:?DEF], [:?OR], ... clear stack by eating stored value
                   [:?RNG_DELIM] [:?BRACE_SIGN] [:?CH_LEADIN] [:?TIMES] [:?RE_LEADIN] [:?RE_LEADOUT]

        DEF        = `=` | `:=` | `::=` | `<-` | /:\n/ | `: `  # if `: `, retrieve marker mustn't be followed by blank!
        OR         = `|` | `/` !regex_heuristics
        AND        = `,` | ``
        ENDL       = `;` | ``

        RNG_BRACE  = :BRACE_SIGN
        BRACE_SIGN = `{` | `(`
        RNG_DELIM  = `,`
        TIMES      = `*`

        RE_LEADIN  = `/` &regex_heuristics | `^/`
        RE_LEADOUT = `/`

        CH_LEADIN  = `0x` | `#x`

        #: heuristics

        char_range_heuristics  = ! ( /[\n\t ]/
                                   | ~ literal_heuristics
                                   | [`::`|`:?`|`:`] SYM_REGEX /\s*\]/ )
        literal_heuristics     = /~?\s*"(?:[\\]\]|[^\]]|[^\\]\[[^"]*)*"/
                               | /~?\s*'(?:[\\]\]|[^\]]|[^\\]\[[^']*)*'/
                               | /~?\s*`(?:[\\]\]|[^\]]|[^\\]\[[^`]*)*`/
                               | /~?\s*´(?:[\\]\]|[^\]]|[^\\]\[[^´]*)*´/
                               | /~?\s*\/(?:[\\]\]|[^\]]|[^\\]\[[^\/]*)*\//
        regex_heuristics       = /[^ ]/ | /[^\/\n*?+\\]*[*?+\\][^\/\n]\//


        #: basic-regexes

        RE_CORE    = /(?:(?<!\\)\\(?:\/)|[^\/])*/       # core of a regular expression, i.e. the dots in /.../
        SYM_REGEX  = /(?!\d)\w+/                        # regular expression for symbols
        HEXCODE    = /[A-Fa-f0-9]{1,8}/
    """
    countable = Forward()
    element = Forward()
    expression = Forward()
    source_hash__ = "3bda01686407a47a9fd0a709bda53ae3"
    disposable__ = re.compile('component$|pure_elem$|countable$|FOLLOW_UP$|SYM_REGEX$|ANY_SUFFIX$|EOF$')
    static_analysis_pending__ = []  # type: List[bool]
    parser_initialization__ = ["upon instantiation"]
    error_messages__ = {'definition': [
        (re.compile(r','),
        'Delimiter "," not expected in definition!\\nEither this was meant to be a directive '
        'and the directive symbol @ is missing\\nor the error is due to inconsistent use of the '
        'comma as a delimiter\\nfor the elements of a sequence.')]}
    resume_rules__ = {'definition': [re.compile(r'\n\s*(?=@|\w+\w*\s*=)')],
                      'directive': [re.compile(r'\n\s*(?=@|\w+\w*\s*=)')]}
    COMMENT__ = r'(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)'
    comment_rx__ = re.compile(COMMENT__)
    WHITESPACE__ = r'\s*'
    WSP_RE__ = mixin_comment(whitespace=WHITESPACE__, comment=COMMENT__)
    wsp__ = Whitespace(WSP_RE__)
    dwsp__ = Drop(Whitespace(WSP_RE__))
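    # wsp__ matches insignificant whitespace (including comments) and keeps
    # it in the tree, while dwsp__ additionally drops what it has matched
    # from the concrete syntax tree (cf. the @drop-directive above)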
    HEXCODE = RegExp('[A-Fa-f0-9]{1,8}')
    SYM_REGEX = RegExp('(?!\\d)\\w+')
    RE_CORE = RegExp('(?:(?<!\\\\)\\\\(?:/)|[^/])*')
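    # The following heuristics-parsers decide whether a `/` starts a regular
    # expression and whether a `[` opens a character-range rather than an
    # optional block (see the ambiguity-discussion in the class-docstring)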
    regex_heuristics = Alternative(RegExp('[^ ]'), RegExp('[^/\\n*?+\\\\]*[*?+\\\\][^/\\n]/'))
    literal_heuristics = Alternative(RegExp('~?\\s*"(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^"]*)*"'),
                                     RegExp("~?\\s*'(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^']*)*'"),
                                     RegExp('~?\\s*`(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^`]*)*`'),
                                     RegExp('~?\\s*´(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^´]*)*´'),
                                     RegExp('~?\\s*/(?:[\\\\]\\]|[^\\]]|[^\\\\]\\[[^/]*)*/'))
    char_range_heuristics = NegativeLookahead(Alternative(
        RegExp('[\\n\\t ]'), Series(dwsp__, literal_heuristics),
        Series(Option(Alternative(Text("::"), Text(":?"), Text(":"))),
               SYM_REGEX, RegExp('\\s*\\]'))))
    CH_LEADIN = Capture(Alternative(Text("0x"), Text("#x")))
    RE_LEADOUT = Capture(Text("/"))
    RE_LEADIN = Capture(Alternative(Series(Text("/"), Lookahead(regex_heuristics)), Text("^/")))
    TIMES = Capture(Text("*"))
    RNG_DELIM = Capture(Text(","))
    BRACE_SIGN = Capture(Alternative(Text("{"), Text("(")))
    RNG_BRACE = Capture(Retrieve(BRACE_SIGN))
    ENDL = Capture(Alternative(Text(";"), Text("")))
    AND = Capture(Alternative(Text(","), Text("")))
    OR = Capture(Alternative(Text("|"), Series(Text("/"), NegativeLookahead(regex_heuristics))))
    DEF = Capture(Alternative(Text("="), Text(":="), Text("::="),
                              Text("<-"), RegExp(':\\n'), Text(": ")))
    EOF = Drop(Drop(Series(Drop(NegativeLookahead(RegExp('.'))),
                           Drop(Option(Drop(Pop(DEF, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(OR, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(AND, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(ENDL, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(RNG_DELIM, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(BRACE_SIGN, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(CH_LEADIN, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(TIMES, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(RE_LEADIN, match_func=optional_last_value)))),
                           Drop(Option(Drop(Pop(RE_LEADOUT, match_func=optional_last_value)))))))
    whitespace = Series(RegExp('~'), dwsp__)
    any_char = Series(Text("."), dwsp__)
    free_char = Alternative(RegExp('[^\\n\\[\\]\\\\]'), RegExp('\\\\[nrt`´\'"(){}\\[\\]/\\\\]'))
    character = Series(Retrieve(CH_LEADIN), HEXCODE)
    char_range = Series(Text("["), Lookahead(char_range_heuristics), Option(Text("^")),
                        Alternative(character, free_char),
                        ZeroOrMore(Alternative(Series(Option(Text("-")), character), free_char)),
                        Series(Text("]"), dwsp__))
    regexp = Series(Retrieve(RE_LEADIN), RE_CORE, Retrieve(RE_LEADOUT), dwsp__)
    plaintext = Alternative(Series(RegExp('`(?:(?<!\\\\)\\\\`|[^`])*?`'), dwsp__),
                            Series(RegExp('´(?:(?<!\\\\)\\\\´|[^´])*?´'), dwsp__))
    literal = Alternative(Series(RegExp('"(?:(?<!\\\\)\\\\"|[^"])*?"'), dwsp__),
                          Series(RegExp("'(?:(?<!\\\\)\\\\'|[^'])*?'"), dwsp__))
    symbol = Series(SYM_REGEX, dwsp__)
    multiplier = Series(RegExp('[1-9]\\d*'), dwsp__)
    no_range = Alternative(NegativeLookahead(multiplier),
                           Series(Lookahead(multiplier), Retrieve(TIMES)))
    range = Series(RNG_BRACE, dwsp__, multiplier,
                   Option(Series(Retrieve(RNG_DELIM), dwsp__, multiplier)),
                   Pop(RNG_BRACE, match_func=matching_bracket), dwsp__)
    counted = Alternative(Series(countable, range),
                          Series(countable, Retrieve(TIMES), dwsp__, multiplier),
                          Series(multiplier, Retrieve(TIMES), dwsp__, countable, mandatory=3))
    option = Alternative(Series(NegativeLookahead(char_range), Series(Text("["), dwsp__),
                                expression, Series(Text("]"), dwsp__), mandatory=2),
                         Series(element, Series(Text("?"), dwsp__)))
    repetition = Alternative(Series(Series(Text("{"), dwsp__), no_range,
                                    expression, Series(Text("}"), dwsp__), mandatory=2),
                             Series(element, Series(Text("*"), dwsp__), no_range))
    oneormore = Alternative(Series(Series(Text("{"), dwsp__), no_range, expression,
                                   Series(Text("}+"), dwsp__)),
                            Series(element, Series(Text("+"), dwsp__)))
    group = Series(Series(Text("("), dwsp__), no_range,
                   expression, Series(Text(")"), dwsp__), mandatory=2)
    retrieveop = Alternative(Series(Text("::"), dwsp__),
                             Series(Text(":?"), dwsp__),
                             Series(Text(":"), dwsp__))
    flowmarker = Alternative(Series(Text("!"), dwsp__), Series(Text("&"), dwsp__),
                             Series(Text("<-!"), dwsp__), Series(Text("<-&"), dwsp__))
    ANY_SUFFIX = RegExp('[?*+]')
    element.set(Alternative(Series(Option(retrieveop), symbol, NegativeLookahead(Retrieve(DEF))),
                            literal, plaintext, regexp, char_range, Series(character, dwsp__),
                            any_char, whitespace, group))
    pure_elem = Series(element, NegativeLookahead(ANY_SUFFIX), mandatory=1)
    countable.set(Alternative(option, oneormore, element))
    term = Alternative(oneormore, counted, repetition, option, pure_elem)
    difference = Series(term, Option(Series(Series(Text("-"), dwsp__),
                                            Alternative(oneormore, pure_elem), mandatory=1)))
    lookaround = Series(flowmarker, Alternative(oneormore, pure_elem), mandatory=1)
    interleave = Series(difference, ZeroOrMore(Series(Series(Text("°"), dwsp__),
                                                      Option(Series(Text("§"), dwsp__)),
                                                      difference)))
    sequence = Series(Option(Series(Text("§"), dwsp__)), Alternative(interleave, lookaround),
                      ZeroOrMore(Series(Retrieve(AND), dwsp__, Option(Series(Text("§"), dwsp__)),
                                        Alternative(interleave, lookaround))))
    expression.set(Series(sequence, ZeroOrMore(Series(Retrieve(OR), dwsp__, sequence))))
    FOLLOW_UP = Alternative(Text("@"), symbol, EOF)
    procedure = Series(SYM_REGEX, Series(Text("()"), dwsp__))
    literals = OneOrMore(literal)
    component = Alternative(regexp, literals, procedure, Series(symbol, NegativeLookahead(DEF)))
    directive = Series(
        Series(Text("@"), dwsp__), symbol, Series(Text("="), dwsp__),
        Alternative(Series(component, ZeroOrMore(Series(Series(Text(","), dwsp__), component))),
                    expression),
        Lookahead(FOLLOW_UP), mandatory=1)
    definition = Series(symbol, Retrieve(DEF), dwsp__, Option(Series(Retrieve(OR), dwsp__)),
                        expression, Retrieve(ENDL), dwsp__, Lookahead(FOLLOW_UP), mandatory=1)
    syntax = Series(dwsp__, ZeroOrMore(Alternative(definition, directive)), EOF)
    root__ = syntax

    def __init__(self, root: Optional[Parser] = None, static_analysis: Optional[bool] = None) -> None:
        Grammar.__init__(self, root, static_analysis)
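        # remember the original _parse-methods of the heuristics-parsers,
        # so that the mode-setter below can restore them after they have
        # been replaced by Always- or Never-parser behaviour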
        self.free_char_parsefunc__ = self.free_char._parse
        self.char_range_heuristics_parsefunc__ = self.char_range_heuristics._parse
        self.regex_heuristics_parserfunc__ = self.regex_heuristics._parse

    @property
    def mode(self) -> str:
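        """Returns the syntax-variant-mode the grammar is currently set
        to: one of 'heuristic', 'strict', 'peg-like', 'regex-like' or
        'undefined', inferred from which heuristics-parsers are active."""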
        def which(p: Parser) -> str:
            if p._parse_proxy.__qualname__ == 'Never._parse':
                return 'never'
            elif p._parse_proxy.__qualname__ == 'Always._parse':
                return 'always'
            else:
                return 'custom'
        signature = (
            which(self.free_char),
            which(self.regex_heuristics),
            which(self.char_range_heuristics)
        )
        if signature == ('custom', 'custom', 'custom'):
            return 'heuristic'
        elif signature == ('never', 'always', 'always'):
            return 'strict'  # or 'classic'
        elif signature == ('custom', 'never', 'always'):
            return 'peg-like'
        elif signature == ('custom', 'always', 'always'):
1412
            return 'regex-like'
        else:
            return "undefined"

    @mode.setter
    def mode(self, mode: str):
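        """Sets the syntax-variant-mode of the grammar. `mode` must be one
        of 'heuristic', 'strict' (alias 'classic'), 'peg-like' or
        'regex-like'."""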
        if mode == self.mode:
            return

        def set_parsefunc(p: Parser, f: ParseFunc):
            method = f.__get__(p, type(p))  # bind function f to parser p
            p._parse_proxy = method

        always = Always._parse
        never = Never._parse
        if mode == 'heuristic':
            set_parsefunc(self.free_char, self.free_char_parsefunc__)
            set_parsefunc(self.regex_heuristics, self.regex_heuristics_parserfunc__)
            set_parsefunc(self.char_range_heuristics, self.char_range_heuristics_parsefunc__)
        elif mode in ('strict', 'classic'):
            set_parsefunc(self.free_char, never)
            set_parsefunc(self.regex_heuristics, always)
            set_parsefunc(self.char_range_heuristics, always)
        elif mode == 'peg-like':
            set_parsefunc(self.free_char, self.free_char_parsefunc__)
            set_parsefunc(self.regex_heuristics, never)
            set_parsefunc(self.char_range_heuristics, always)
        elif mode == 'regex-like':
            set_parsefunc(self.free_char, self.free_char_parsefunc__)
            set_parsefunc(self.regex_heuristics, always)
            set_parsefunc(self.char_range_heuristics, always)
        else:
            raise ValueError('Mode must be one of: ' + ', '.join(
                ALLOWED_PRESET_VALUES['syntax_variant']))


class FixedEBNFGrammar(Grammar):
    r"""Faster version of EBNF, where delimiters are not determined on
    first use, but defined as constant Text-parsers. They can still be
    adjusted with function `parse.update_scanner()`.

    Different syntactical variants can be configured either by adjusting
    the definitions of DEF, OR, AND, ENDL, RNG_OPEN, RNG_CLOSE, RNG_DELIM,