# ebnf.py - EBNF -> Python-Parser compilation for DHParser # # Copyright 2016 by Eckhart Arnold (arnold@badw.de) # Bavarian Academy of Sciences and Humanities (badw.de) # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or # implied. See the License for the specific language governing # permissions and limitations under the License. """ Module ``ebnf`` provides an EBNF-parser-generator that compiles an EBNF-Grammar into Python-code that can be executed to parse source text conforming to this grammar into concrete syntax trees. Specifying Grammars with EBNF ----------------------------- With DHParser, Grammars can be specified either directly in Python-code (see :py:mod:`parse`) or in one of several EBNF-dialects. (Yes, DHParser supports several different variants of EBNF! This makes it easy to create a parser directly from Grammars found in external sources.) "EBNF" stands for the "Extended-Backus-Naur-Form" which is a common formalism for specifying Grammars for context-free-languages. (see https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) The recommended way of compiling grammars with DHParser is either to write the EBNF-specification for that Grammar into a text-file and then compile the EBNF-source into an executable as well as importable Python-module with the help of the "dhparser"-script, or, for bigger projects, to create a new domain-specific-language-project with the DHParser-script as described in the step-by-step-guide. However, here we will show how to compile an EBNF-specified grammar from within Python-code and how to execute the parser that was generated by compiling the grammar. As an example, we will realize a json-parser (https://www.json.org/). Let's start with creating some test-data:: >>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False} >>> import json >>> testdata = json.dumps(testobj) >>> testdata '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}' We define the json-Grammar (see https://www.json.org/) in a top-down manner in EBNF. We'll use a regular-expression look-alike syntax. EBNF, as you may recall, consists of a sequence of symbol definitions. The definiens of each of these definitions is either a string literal, a regular expression, another symbol, or a combination of these built with four different operators: 1. sequences, 2. alternatives, 3. options and 4. repetitions. Here is how these elements are denoted in classical and regex-like EBNF-syntax: ======================== ================== ================ element classical EBNF regex-like ======================== ================== ================ insignificant whitespace ~ ~ string literal "..." or \\`...\\` "..." or \\`...\\` regular expr. /.../ /.../ sequences A B C A B C alternatives A | B | C A | B | C options [ ... ] ...? repetitions { ... } ...* one or more ...+ grouping (...) (...) ======================== ================== ================ "insignificant whitespace" is a speciality of DHParser. Denoting insignificant whitespace with a particular sign ``~`` allows eliminating it already during the parsing process without burdening later syntax-tree-processing stages with this common task. 
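Before moving on, here is a brief illustration of the two notations listed in the table above. A hypothetical rule for a bracketed, comma-separated list (it is not part of the JSON-grammar that follows, but merely serves as an example) would be written in classical EBNF as::

    list = "[" [ element { "," element } ] "]"

and in the regex-like dialect as::

    list = "[" ( element ( "," element )* )? "]"

The examples in this module use the regex-like style.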
DHParser offers several more facilities to restrain the verbosity of the concrete syntax tree, so that the outcome of the parsing stage comes close (or at least closer) to the intended abstract-syntax-tree, already. JSON consists of two complex data types, 1) associative arrays, called "object", and 2) sequences of heterogeneous data, called "array"; and of four simple data types, 1) string, 2) number, 3) bool and 4) null. The structure of a JSON file can easily be described in EBNF:: >>> grammar = ''' ... json = ~ _element _EOF ... _EOF = /$/ ... _element = object | array | string | number | bool | null ... object = "{" ~ member ( "," ~ §member )* "}" ~ ... member = string ":" ~ _element ... array = "[" ~ ( _element ( "," ~ _element )* )? "]" ~ ... string = `"` _CHARS `"` ~ ... _CHARS = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/ ... number = _INT _FRAC? _EXP? ~ ... _INT = `-`? ( /[1-9][0-9]+/ | /[0-9]/ ) ... _FRAC = `.` /[0-9]+/ ... _EXP = (`E`|`e`) [`+`|`-`] /[0-9]+/ ... bool = "true" ~ | "false" ~ ... null = "null" ~ ''' This is a rather common EBNF-grammar. A few peculiarities are noteworthy, though: First of all you might notice that some components of the grammar (or "production rules" as they are commonly called) have names with a leading underscore ``_``. It is a convention to mark those elements in which we are not interested on their own account with an underscore ``_``. When moving from the concrete syntax-tree to a more abstract syntax-tree, these elements could be substituted by their content, to simplify the tree. Secondly, some production rules carry a name written in capital letters. This is also a convention to mark those symbols which, with other parser-generators, would represent tokens delivered by a lexical scanner. DHParser is a "scanner-less" parser, which means that the breaking down of the string into meaningful tokens is done in place with regular expressions (like in the definition of ``_EOF``) or simple combinations of regular expressions (see the definition of ``_INT`` above). There is no sharp distinction between tokens and other symbols in DHParser, but we keep it as a loose convention. Regular expressions are enclosed in forward slashes and follow the standard syntax of Perl-style regular expressions that is also used by the "re"-module of the Python standard library. (Don't worry about the number of backslashes in the line defining ``_CHARS`` for now!) Finally, it is another helpful convention to indent the definitions of symbols that have only been introduced to simplify an otherwise unnecessarily complicated definition (e.g. the definition of ``number``, above) or to make it more understandable by giving names to its components (like ``_EOF``). Let's try this grammar on our test-string. In order to compile this grammar into executable Python-code, we use the high-level-function :py:func:`~dsl.create_parser` from the :py:mod:`dsl`-module. >>> from DHParser.dsl import create_parser >>> # from DHParser.dsl import compileEBNF >>> # print(compileEBNF(grammar)) >>> parser = create_parser(grammar, branding="JSON") >>> syntax_tree = parser(testdata) >>> syntax_tree.content '{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}' As expected, serializing the content of the resulting syntax-tree yields exactly the input-string of the parsing process. What we cannot see here is that the parser has structured the string into the individual elements described in the grammar. 
Since the concrete syntax-tree that the parser yields is rather verbose, it would not make sense to print it out. We'll just look at a small part of it, to see what it looks like. Let's just pick the sub-tree that captures the first json-array within the syntax-tree:: >>> print(syntax_tree.pick('array').as_sxpr()) (array (:Text "[") (_element (number (_INT "1"))) (:Text ",") (:Whitespace " ") (_element (number (_INT "2") (_FRAC (:Text ".") (:RegExp "0")))) (:Text ",") (:Whitespace " ") (_element (string (:Text '"') (_CHARS "a string") (:Text '"'))) (:Text "]")) The nodes of the syntax-tree carry the names of the production rules by which they have been generated. Nodes that have been created by components of a production receive the name of the parser-type that has created the node (see :py:mod:`parse`) prefixed with a colon ":". In DHParser, these nodes are called "anonymous", because they lack the name of a proper grammatical component. .. _simplifying_syntax_trees: Simplifying Syntax-Trees while Parsing -------------------------------------- Usually, anonymous nodes are what you want to get rid of in the course of transforming the concrete syntax-tree into an abstract syntax-tree. (See :py:mod:`transform`). DHParser already eliminates by default all anonymous nodes that are not leaf-nodes by replacing them with their children during parsing. Anonymous leaf-nodes will be replaced by their content, if they are a single child of some parent, and otherwise be left in place. Without this optimization, each construct of the EBNF-grammar would leave a node in the syntax-tree:: >>> from DHParser.parse import CombinedParser, TreeReduction >>> _ = TreeReduction(parser.json, CombinedParser.NO_TREE_REDUCTION) >>> syntax_tree = parser(testdata) >>> print(syntax_tree.pick('array').as_sxpr()) (array (:Text "[") (:Option (:Series (_element (number (_INT (:Alternative (:RegExp "1"))))) (:ZeroOrMore (:Series (:Text ",") (:Whitespace " ") (_element (number (_INT (:Alternative (:RegExp "2"))) (:Option (_FRAC (:Text ".") (:RegExp "0")))))) (:Series (:Text ",") (:Whitespace " ") (_element (string (:Text '"') (_CHARS (:RegExp "a string")) (:Text '"'))))))) (:Text "]")) This can be helpful for understanding how parsing that is directed by an EBNF-grammar works (next to looking at the logs of the complete parsing-process, see :py:mod:`trace`), but other than that it is advisable to streamline the syntax-tree as early as possible, because the processing time of all subsequent tree-processing stages increases with the number of nodes in the tree. Because of this, DHParser offers further means of simplifying syntax-trees during the parsing stage, already. These are not turned on by default, because they allow dropping content or removing named nodes from the tree; instead, they must be turned on by "directives" that are listed at the top of an EBNF-grammar and that guide the parser-generation process. DHParser-directives always start with an ``@``-sign. For example, the ``@drop``-directive advises the parser to drop certain nodes entirely, including their content. In the following example, the parser is directed to drop all insignificant whitespace:: >>> drop_insignificant_wsp = '@drop = whitespace \\n' Directives look similar to productions, except that on the right hand side of the equal sign follows a list of parameters. 
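For example, a drop-directive with more than one parameter (this particular combination will be used further below) looks like this::

    @drop = whitespace, strings, backticked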
In the case of the drop-directive these can be either names of non-anonymous nodes that shall be dropped or one of four particular classes of anonymous nodes (``strings``, ``backticked``, ``regexp``, ``whitespace``) that will be dropped. Another useful directive advises the parser to treat named nodes as anonymous nodes and to eliminate them accordingly during parsing. This is useful, if we have introduced certain names in our grammar only as placeholders to render the definition of the grammar a bit more readable, not because we are interested in the text that is captured by the production associated with them in their own right:: >>> disposable_symbols = '@disposable = /_\w+/ \\n' Instead of passing a comma-separated list of symbols to the directive, which would also have been possible, we have leveraged our convention to prefix unimportant symbols with an underscore "_" by specifying the symbols that shall be anonymized with a regular expression. Now, let's examine the effect of these two directives:: >>> refined_grammar = drop_insignificant_wsp + disposable_symbols + grammar >>> parser = create_parser(refined_grammar, 'JSON') >>> syntax_tree = parser(testdata) >>> syntax_tree.content '{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}' You might have noticed that all insignificant whitespace adjacent to the delimiters has been removed this time (but, of course, not the significant whitespace between "a" and "string" in "a string"). And the difference the use of these two directives makes is even more obvious, if we look at (a section of) the syntax-tree:: >>> print(syntax_tree.pick('array').as_sxpr()) (array (:Text "[") (number "1") (:Text ",") (number (:RegExp "2") (:Text ".") (:RegExp "0")) (:Text ",") (string (:Text '"') (:RegExp "a string") (:Text '"')) (:Text "]")) This tree looks more streamlined. But it still contains more structure than we might like to see in an abstract syntax tree. In particular, it still contains all the delimiters ("[", ",", '"', ...) next to the data. But unlike in the UTF-8 representation of our json data, the delimiters are not needed any more, because the structural information is now retained in the tree-structure. So how can we get rid of those delimiters? The rather coarse-grained tools that DHParser offers in the parsing stage require some care to do this properly. The @drop-directive allows dropping all unnamed strings (i.e. strings that are not directly assigned to a symbol) and backticked strings (for the difference between strings and backticked strings, see below) and regular expressions. However, using ``@drop = whitespace, strings, backticked`` would also drop those parts captured as strings that contain data:: >>> refined_grammar = '@drop = whitespace, strings, backticked \\n' \ + disposable_symbols + grammar >>> parser = create_parser(refined_grammar, 'JSON') >>> syntax_tree = parser(testdata) >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0)) (array (number "1") (number (:RegExp "2") (:RegExp "0")) (string "a string")) Here, suddenly, the number "2.0" has been turned into "20"! There are three ways to get around this problem: 1. Assigning all non-delimiter strings to symbols. In this case we would have to rewrite the definition of "number" as such:: number = _INT _FRAC? _EXP? ~ _INT = _MINUS? 
( /[1-9][0-9]+/ | /[0-9]/ ) _FRAC = _DOT /[0-9]+/ _EXP = (_Ecap|_Esmall) [_PLUS|_MINUS] /[0-9]+/ _MINUS = `-` _PLUS = `+` _DOT = `.` _Ecap = `E` _Esmall = `e` A simpler alternative of this technique would be to make use of the fact that the document-parts captured by regular expressions are not dropped (although regular expressions can also be listed in the @drop-directive, if needed) and that at the same time delimiters are almost always simple strings containing keywords or punctuation characters. Thus, one only needs to rewrite those string-expressions that capture data as regular expressions:: number = _INT _FRAC? _EXP? ~ _INT = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ ) _FRAC = /[.]/ /[0-9]+/ _EXP = (/E/|/e/) [/[-+]/] /[0-9]+/ 2. Assigning all delimiter strings to symbols and dropping the nodes and content captured by these symbols. This means doing exactly the opposite of the first solution. Here is an excerpt of what a JSON-grammar employing this technique would look like:: @disposable = /_\w+/ @drop = whitespace, _BEGIN_ARRAY, _END_ARRAY, _KOMMA, _BEGIN_OBJECT, ... ... array = _BEGIN_ARRAY ~ ( _element ( _KOMMA ~ _element )* )? §_END_ARRAY ~ ... It is important that all symbols listed for dropping are also made disposable, either by listing them in the disposable-directive as well or by using names that the regular expression for disposables matches. Otherwise, DHParser does not allow dropping the content of named nodes, because the default assumption is that symbols in the grammar are defined to capture meaningful parts of the document that contain relevant data. 3. Bailing out and leaving the further simplification of the syntax-tree to the next tree-processing stage which, if you follow DHParser's suggested usage pattern, is the abstract-syntax-tree-transformation proper and which allows for a much more fine-grained specification of transformation rules. See :py:mod:`transform`. To round this section up, we present the full grammar for a streamlined JSON-Parser according to the first solution-strategy. Observe that the values of "bool" and "null" are now defined with regular expressions instead of string-literals, because the latter would be dropped because of the ``@drop = ... strings, ...``-directive, leaving an empty named node without a value, whenever a bool value or null occurs in the input:: >>> json_gr = ''' ... @disposable = /_\\\\w+/ ... @drop = whitespace, strings, backticked, _EOF ... json = ~ _element _EOF ... _EOF = /$/ ... _element = object | array | string | number | bool | null ... object = "{" ~ member ( "," ~ §member )* "}" ~ ... member = string ":" ~ _element ... array = "[" ~ ( _element ( "," ~ _element )* )? "]" ~ ... string = `"` _CHARS `"` ~ ... _CHARS = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/ ... number = _INT _FRAC? _EXP? ~ ... _INT = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ ) ... _FRAC = /[.]/ /[0-9]+/ ... _EXP = /[Ee]/ [/[-+]/] /[0-9]+/ ... bool = /true/ ~ | /false/ ~ ... null = /null/ ~ ''' >>> json_parser = create_parser(json_gr, 'JSON') >>> syntax_tree = json_parser(testdata) >>> print(syntax_tree.pick('array').as_sxpr(flatten_threshold=0)) (array (number "1") (number (:RegExp "2") (:RegExp ".") (:RegExp "0")) (string "a string")) This time the data is not distorted any more. One oddity remains, however: We are most probably not interested in the fact that the number 2.0 consists of three components, each of which has been captured by a regular expression. 
Luckily, there exists yet another directive that allows reducing the tree further by merging adjacent anonymous leaf-nodes:: >>> json_gr = '@reduction = merge \\n' + json_gr >>> json_parser = create_parser(json_gr, 'JSON') >>> syntax_tree = json_parser(testdata) >>> print(syntax_tree.as_sxpr()) (json (object (member (string "array") (array (number "1") (number "2.0") (string "a string"))) (member (string "number") (number "-1.3e+25")) (member (string "bool") (bool "false")))) Merging adjacent anonymous leaf-nodes takes place after the @drop-directive comes into effect. It should be observed that merging only produces the desired result, if delimiters have been dropped previously, because otherwise delimiters would be merged with content. Therefore, the ``@reduction = merge``-directive should at best only be applied in conjunction with the ``@drop`` and ``@disposable``-directives. Applying any of the tree-reductions (or "simplifications", for that matter) described here requires a bit of careful planning concerning which nodes will be named and which will be dropped. This, however, pays off in terms of speed and a considerably simplified abstract-syntax-tree generation stage, because most of the unnecessary structure of concrete-syntax-trees has already been eliminated at the parsing stage. .. _comments_and_whitespace: Comments and Whitespace ----------------------- Why whitespace isn't trivial ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Handling whitespace in text-documents is not at all trivial, because whitespace can serve several different purposes and there can be different kinds of whitespace: Whitespace can serve a syntactic function as delimiter. But whitespace can also be purely aesthetic to render a document more readable. Depending on the data model, whitespace can be considered as significant and be included in the data or as insignificant and be excluded from the data and only be re-inserted when displaying the data in a human-readable-form. (For example, one can model a sentence as a sequence of words and spaces or just as a sequence of words.) Note that "significance" does not correlate with the syntactic or aesthetic function, but only depends on whether you'd like to keep the whitespace in your data or not. There can be different kinds of whitespace with different meaning (and differing significance). For example, one can make a difference between horizontal whitespace (spaces and tabs) and vertical whitespace (including linefeeds). And there can be different sizes of whitespace with different meaning. For example in LaTeX, a single linefeed still counts as plain whitespace while an empty line (i.e. whitespace including two or more linefeeds) signals a new paragraph. Finally, even the position of whitespace can make a difference. A certain number of whitespaces at the beginning of a line can have the meaning of "indentation" (as in Python code) while at the end of the line or between brackets it is just plain insignificant whitespace. (This is actually something where the boundaries of the EBNF-formalism become visible and you'd probably use a preprocessor or some kind of "semantic actions" to handle such cases. There is some support for either of these in DHParser.) Coding significant Whitespace in EBNF-Grammars ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A reasonable approach to coding whitespace is to use one particular symbol for each kind of whitespace. Those kinds of whitespace that are insignificant, i.e. 
that do not need to appear in the data, should be dropped from the syntax-tree. With DHParser this can be done already while parsing, using the ``@disposable`` and ``@drop``-directives described earlier. But let's first look at an example which only includes significant whitespace. The following parser parses sequences of paragraphs which consist of sequences of sentences which consist of sequences of main clauses and subordinate clauses which consist of sequences of words:: >>> text_gr = ''' ... @ disposable = /_\\\\w+/ ... document = PBR* S? paragraph (PBR paragraph)* PBR* S? _EOF ... _EOF = /$/ ... paragraph = sentence (S sentence)* ... sentence = (clause _c_delimiter S)* clause _s_delimiter ... _c_delimiter = KOMMA | COLON | SEMICOLON ... _s_delimiter = DOT | QUESTION_MARK | EXCLAMATION_MARK ... clause = word (S word)* ... word = /(?:[A-Z]|[a-z])[a-z']*/ ... DOT = `.` ... QUESTION_MARK = `?` ... EXCLAMATION_MARK = `!` ... KOMMA = `,` ... COLON = `:` ... SEMICOLON = `;` ... PBR = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/ ... S = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/ ''' Here, we have two types of significant whitespace ``PBR`` ("paragraph-break") and ``S`` ("space"). Both types allow for a certain amount of flexibility, so that two whitespaces of the same type do not need to have exactly the same content, but we could always normalize these whitespaces in a subsequent transformation step. Two typical design patterns for significant whitespace are noteworthy, here: 1. Both whitespaces match only if there was at least one whitespace character. We may allow whitespace to be optional (as at the beginning and end of the document), but if the option has not been taken, we don't to see an empty whitespace-tag in the document, later on. (For insignificant whitespace, the opposite convention can be more convenient, because, typically, insignificant whitespace is dropped anyway, whether it's got content or not.) 2. The grammar is construed in such a way that the whitespace always appears *between* different elements at the same level, but not after the last or before the first element. The whitespace after the last word of a sentence or before the first word of a sentence is really whitespace between two sentences. If we pick out a sentence or a clause, we will have no dangling whitespace at its beginning or end. (Again, for soon to be dropped insignificant whitespace, another convention can be more advisable.) Let's just try our grammar on an example:: >>> text_example = ''' ... I want to say, in all seriousness, that a great deal of harm is being ... done in the modern world by belief in the virtuousness of work, and that ... the road to happiness and prosperity lies in an organized diminution of ... work. ... ... First of all: what is work? Work is of two kinds: first, altering the ... position of matter at or near the earth's surface relatively to other ... such matter; second, telling other people to do so. The first kind is ... 
unpleasant and ill paid; the second is pleasant and highly paid.''' >>> text_parser = create_parser(text_gr, 'Text') >>> text_as_data = text_parser(text_example) >>> sentence = text_as_data.pick(\ lambda nd: nd.tag_name == "sentence" and nd.content.startswith('First')) >>> print(sentence.as_sxpr()) (sentence (clause (word "First") (S " ") (word "of") (S " ") (word "all")) (COLON ":") (S " ") (clause (word "what") (S " ") (word "is") (S " ") (word "work")) (QUESTION_MARK "?")) Again, it is a question of design whether we leave whitespace in the data or not. Leaving it has the advantage that serialization becomes as simple as printing the content of the data-tree:: >>> print(sentence) First of all: what is work? Otherwise one would have to program a dedicated serialization routine. Especially if you receive data from a different source, you'll appreciate not having to do this - and so will other people receiving your data. Think about it! However, dropping the whitespace will yield more concise data. Coding Comments ^^^^^^^^^^^^^^^ Allowing comments in a domain-specific language almost always makes sense, because it allows users to annotate the source texts while working on them and to share those comments with collaborators. From a technical point of view, adding comments to a DSL raises two questions: 1. At what places shall we allow comments to be inserted in the source code? Common answers are: a) at the end of a line, b) almost everywhere, or c) both. 2. How do we avoid pollution of the EBNF-grammar with comment markers? It already curtails the readability that we have to put whitespace symbols in so many places. And speaking of comments at the end of the line: If linefeeds aren't important for us - as in our toy-grammar for prose-text, above - we probably wouldn't want to reframe our grammar just to allow for end-of-line comments. Luckily, there exists a simple and highly intuitive solution that takes care of both of these concerns: We admit comments wherever whitespace is allowed. And we code this by defining a symbol that means: "whitespace and, optionally, a comment". Let's try this with our prose-text-grammar. In order to do so, we have to define a symbol for comments, a symbol for pure whitespace, and, finally, a symbol for whitespace with optional comment. Since, in our grammar, we actually have two kinds of whitespace, ``S`` and ``PBR``, we'll have to redefine both of them. As delimiters for comments, we use curly braces:: >>> wsp_gr = ''' ... PBR = pure_PBR COMMENT (pure_PBR | pure_S)? ... | (pure_S? COMMENT)? pure_PBR ... S = pure_S COMMENT pure_S? | COMMENT? pure_S ... COMMENT = /\{[^}]*\}/ ... pure_PBR = /[ \\\\t]*\\\\n[ \\\\t]*\\\\n[ \\\\t]*/ ... pure_S = /(?=[ \\\\n\\\\t])[ \\\\t]*(?:\\\\n[ \\\\t]*)?(?!\\\\n)/''' As can be seen, the concrete re-definition of the whitespace tokens requires a bit of careful consideration, because we want to allow additional whitespace next to comments, but at the same time avoid ending up with two whitespaces in sequence in our data. 
Let's see if we have succeeded:: >>> extended_text_gr = text_gr[:text_gr.rfind(' PBR')] + wsp_gr >>> extended_parser = create_parser(extended_text_gr, 'Text') >>> syntax_tree = extended_parser('What {check this again!} is work?') >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children)) word S word S word >>> print(syntax_tree.pick('clause').as_sxpr()) (clause (word "What") (S (pure_S " ") (COMMENT "{check this again!}") (pure_S " ")) (word "is") (S (pure_S " ")) (word "work")) We will not worry about the sub-structure of the S-nodes right now. If we are not interested in the comments, we could use the ``@disposable``, ``@drop`` and ``@reduction = merge``-directives to simplify these at the parsing stage. Or, we could extract the comments and normalize the whitespace at a later tree-processing stage. For now, let's just check whether our comments work as expected:: >>> syntax_tree = extended_parser('What{check this again!} is work?') >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children)) word S word S word >>> syntax_tree = extended_parser('What {check this again!}is work?') >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('clause').children)) word S word S word >>> syntax_tree = extended_parser('What{check this again!}is work?') >>> print(syntax_tree.errors[0]) 1:24: Error (1040): Parser "pure_S" did not match: »is work?« The last error was to be expected, because we did not allow comments to serve as substitutes for whitespace. Let's check whether putting comments near paragraph breaks works as well:: >>> test_text = '''Happiness lies in the diminuniation of work. ... ... { Here comes the comment } ... ... What is work?''' >>> syntax_tree = extended_parser(test_text) >>> print(' '.join(nd.tag_name for nd in syntax_tree.children)) paragraph PBR paragraph >>> test_text = '''Happiness lies in the diminuniation of work. ... { Here comes the comment } ... What is work?''' >>> syntax_tree = extended_parser(test_text) >>> print(' '.join(nd.tag_name for nd in syntax_tree.children)) paragraph The last result might look surprising at first, but since a paragraph break requires at least one empty line as a separator, the input text is correctly understood by the parser as a single paragraph with two sentences interspersed by a single whitespace which, incidentally, contains a comment:: >>> print(' '.join(nd.tag_name for nd in syntax_tree.pick('paragraph').children)) sentence S sentence >>> print(syntax_tree.pick('paragraph')['S'].as_sxpr(flatten_threshold=0)) (S (pure_S "" "") (COMMENT "{ Here comes the comment }") (pure_S "" "")) A common problem with whitespace is that it tends to pollute the Grammar, because wherever you'd like to allow whitespace, you'd have to insert a symbol for whitespace. The same problem exists when it comes to allowing comments, because you'd probably want to allow comments in as many places as possible. DHParser's support for insignificant whitespace and comments ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Coding insignificant whitespace and comments is exactly the same as coding significant whitespace and comments and does not need to be repeated here. (The combination of insignificant whitespace and significant comments is slightly more complicated and probably best outsourced to some degree to the post-parsing processing stages. It will not be discussed here.) 
However, DHParser offers some special support for insignificant whitespace and comments, which can make working with these easier in some cases. First of all, DHParser has a special dedicated token for insignificant whitespace which is the tilde ``~``-character. We have seen this earlier in the definition of the json-Grammar. The ``~``-whitespace-marker differs from the usual pattern for defining whitespace in that it is implicitly optional, or, what amounts to the same, it matches the empty string. Normally, it is to be considered bad practice to define a symbol as optional. Rather, a symbol should always match something and should be marked as optional only at the places where it is used. If this rule is obeyed, it is always easy to tell whether some element is optional or not at a specific place in the Grammar. Otherwise, it can become quite confusing indeed. However, since the tilde character is usually used very often, it is more convenient not to mark it with a question-mark or, if you use classical EBNF-syntax, to enclose it in square brackets. The default regular expression for the tilde-whitespace captures arbitrarily many spaces and tabs and at most one linefeed, but not an empty line (``[ \t]*(?:\n[ \t]*)?(?!\n)``), as this is the most convenient way to define whitespace for text-data. However, the tilde whitespace can also be defined with any other regular expression with the ``@whitespace``-directive. Let's go back to our JSON-grammar and define the optional insignificant whitespace marked by the tilde-character in such a way that it matches any amount of horizontal or vertical whitespace, which makes much more sense in the context of json than the default tilde-whitespace that is restricted vertically to at most a single linefeed:: >>> testdata = '{"array": [1, 2.0, "a string"], \\n\\n\\n "number": -1.3e+25, "bool": false}' >>> syntax_tree = json_parser(testdata) >>> print(syntax_tree.errors[0]) 1:32: Error (1010): member expected by parser 'object', » \\n \\n \\n "numb...« found! >>> json_gr = '@whitespace = /\\\\s*/ \\n' + json_gr >>> json_parser = create_parser(json_gr, "JSON") >>> syntax_tree = json_parser(testdata) >>> print(syntax_tree.errors) [] When redefining the tilde-whitespace, make sure that your regular expression also matches the empty string! There is no need to worry that the syntax tree gets cluttered by empty whitespace-nodes, because tilde-whitespace always yields anonymous nodes and DHParser drops empty anonymous nodes right away. Comments can be defined using the ``@comment``-directive. DHParser automatically intermingles comments and whitespace so that wherever tilde-whitespace is allowed, a comment defined by the ``@comment``-directive is also allowed: >>> json_gr = '@comment = /#[^\\\\n]*(?:\\\\n|$)/ \\n' + json_gr >>> json_parser = create_parser(json_gr, "JSON") >>> testdata = '''{"array": [1, 2.0, "a string"], # a string ... "number": -1.3e+25, # a number ... "bool": false} # a bool''' >>> syntax_tree = json_parser(testdata) >>> print(syntax_tree.as_sxpr(compact = True)) (json (object (member (string "array") (array (number "1") (number "2.0") (string "a string"))) (member (string "number") (number "-1.3e+25")) (member (string "bool") (bool "false")))) Since the json-grammar still contains the ``@drop = whitespace, ...``-directive from earlier on (next to other tree-reductions), the comments have been nicely dropped along with the tilde-whitespace. 
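(A side note in anticipation of the caveat discussed in the next paragraph: the comment-regex used here consumes the linefeed at the end of the line. A variant that stops just before the linefeed - a sketch, not a doctest - would be::

    @comment = /#[^\n]*/

With this variant the linefeed remains part of the adjacent whitespace rather than of the comment.)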
There is one caveat: When using comments alongside whitespace that captures at most one linefeed, the comments should be defined in such a way that the last character of a comment is never a linefeed. Also, a few limitations of the tilde-whitespace and directive-defined comments should be kept in mind: 1. Only one kind of insignificant whitespace can be defined this way. If there are more kinds of insignificant whitespace, all but one need to be defined conventionally as part of the grammar. 2. Both directive-defined comments and tilde-whitespace can only be defined by a regular expression. In particular, nested comments are impossible to define with regular expressions only. However, using tilde-whitespace has yet one more benefit: With the tilde-whitespace, cluttering of the grammar with whitespace-markers can be avoided by adding implicit whitespace adjacent to string-literals. Remember the definition of the JSON-Grammar earlier. If you look at a definition like: ``object = "{" ~ member ( "," ~ §member )* "}" ~``, you'll notice that there are three whitespace markers, one next to each delimiter. Naturally so, because one usually wants to allow users of a domain specific language to put whitespace around delimiters. You may wonder why the tilde appears only on the right hand side of the literals, although you'd probably like to allow whitespace on both sides of a literal like "{". But if you look at the grammar closely, you'll find that almost every symbol definition ends either with a tilde sign or a symbol the definition of which ends with a tilde sign, which means that they allow whitespace on the right hand side. Now, if all elements of the grammar allow whitespace on the right hand side, this means that they automatically also have whitespace on the left-hand side, which is, of course, the whitespace on the right hand side of the previous element. In order to reduce cluttering the grammar with tilde-signs, DHParser allows turning on implicit tilde-whitespace adjacent to any string literal with the directive ``@ literalws = right`` or ``@ literalws = left``. As the argument of the directive suggests, whitespace is either "eaten" at the right hand side or the left hand side of the literal. String literals can either be enclosed in double quotes "..." or single quotes '...'. Both kinds of literals will have implicit whitespace, if the ``@literalws``-directive is used. (Don't confuse implicit whitespace with insignificant whitespace: Insignificant whitespace is whitespace you do not need any more after parsing. Implicit whitespace is whitespace you do not denote explicitly in the grammar. It's a speciality of DHParser, and DHParser allows only the insignificant whitespace denoted by the tilde-character to be declared as "implicit".) If right-adjacent whitespace is declared as implicit with the ``@literalws``-directive, the expression:: object = "{" ~ member ( "," ~ §member )* "}" ~ can be written as:: object = "{" member ( "," §member )* "}" which is easier to read. For situations where implicit whitespace is not desired, DHParser has a special kind of string literal, written with backticks, which never carries any implicit whitespace. This is important when literals are used for signs that enclose content, like the quotation marks for the string literals in our JSON-Grammar:: string = `"` _CHARS '"' # mind the difference between `"` and '"'! Regular expressions, also, never carry implicit whitespace. 
So, if you are using regular expressions as delimiters, you still have to add the tilde character, if adjacent insignificant whitespace is to be allowed:: bool = /true/~ | /false/~ The complete json-grammar now looks like this:: >>> json_gr = ''' ... @disposable = /_\\\\w+/ ... @drop = whitespace, strings, backticked, _EOF ... @reduction = merge ... @whitespace= /\\\\s*/ ... @comment = /#[^\\\\n]*(?:\\\\n|$)/ ... @literalws = right ... json = ~ _element _EOF ... _EOF = /$/ ... _element = object | array | string | number | bool | null ... object = "{" member ( "," §member )* "}" ... member = string ":" _element ... array = "[" ( _element ( "," _element )* )? "]" ... string = `"` _CHARS '"' ... _CHARS = /[^"\\\\\\]+/ | /\\\\\\[\\\\/bnrt\\\\\\]/ ... number = _INT _FRAC? _EXP? ~ ... _INT = /[-]/? ( /[1-9][0-9]+/ | /[0-9]/ ) ... _FRAC = /[.]/ /[0-9]+/ ... _EXP = /[Ee]/ [/[-+]/] /[0-9]+/ ... bool = /true/~ | /false/~ ... null = /null/~ ''' >>> json_parser = create_parser(json_gr, "JSON") >>> syntax_tree_ = json_parser(testdata) >>> assert syntax_tree_.equals(syntax_tree) The whitespace defined by the ``@whitespace``-directive can be accessed from within the grammar via the name ``WHITESPACE__``. Other than the tilde-sign, this name refers to the pure whitespace that is not intermingled with comments. Similarly, comments defined by the ``@comment``-directive can be accessed via the symbol ``COMMENT__``. Lookahead and Lookbehind ------------------------ Lookahead and lookbehind operators are a convenient way to resolve or rather avoid ambiguities while at the same time keeping the DSL lean. Assume for example a simple DSL for writing definitions like:: >>> definitions = ''' ... dog := carnivorous quadrupel that barks ... human := featherless biped''' Now, let's try to draw up a grammar for "definitions":: >>> def_DSL_first_try = ''' # WARNING: This grammar doesn't work, yet! ... @literalws = right ... definitions = ~ definition { definition } EOF ... definition = definiendum ":=" definiens ... definiendum = word ... definiens = word { word } ... word = /[a-z]+|[A-Z][a-z]*/~ ... EOF = /$/ ''' >>> def_parser = create_parser(def_DSL_first_try, "defDSL") Parsing our example with the generated parser yields an error, however:: >>> syntax_tree = def_parser(definitions) >>> for e in syntax_tree.errors_sorted: print(e) 3:11: Error (1040): Parser "word->/[a-z]+|[A-Z][a-z]*/" did not match: »:= featherless biped« The reason for this error is that the parser ``definiens`` captures as many words as occur in a sequence, including the definiendum of the next definition which is the word "human". But then the next definition does not find its definiendum any more, because it has already been captured. (All this may not easily become clear from the error message itself, but can easily be found out by using the post-mortem debugger of module :py:mod:`trace`.) A common technique to avoid this problem would be to introduce an end-of-statement marker, for example, a semicolon ";". A more elegant way to solve the problem in this case is to make use of the fact that if a word is followed by the definition sign ":=" it cannot be part of the definiens any more, but must be a definiendum. This can be encoded by using the negative look-ahead operator "!": >>> def_DSL = def_DSL_first_try.replace('definiens = word { word }', ... 'definiens = word { word !":=" }') >>> def_parser = create_parser(def_DSL, "defDSL") >>> syntax_tree = def_parser(definitions) >>> for d in syntax_tree.select('definition'): ... 
print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}') A dog is a carnivorous quadrupel that barks A human is a featherless biped The statement ``word !":="`` is a sequence of a ``word`` and a negative lookahead. This whole sequence only matches, if ``word`` matches and the negative lookahead matches, which is only the case if the following text cannot be matched by ":=". We could have achieved the same effect with a positive lookahead by checking whether any of the possible follow-up-sequences of parser ``definiens`` ensues:: >>> def_DSL = def_DSL_first_try.replace('definiens = word { word }', ... 'definiens = word { word &(word|EOF) }') >>> def_parser = create_parser(def_DSL, "defDSL") >>> syntax_tree = def_parser(definitions) >>> for d in syntax_tree.select('definition'): ... print(f'A {d["definiendum"]} is a {str(d["definiens"]).strip()}') A dog is a carnivorous quadrupel that barks A human is a featherless biped Generally, lookahead operators, whether positive or negative, never capture any text. They merely match or fail depending on whether the following parser would match or would fail to match the next piece of text. The positive lookahead matches, if the parser would match. The negative operator matches, if it would fail. Lookahead operators can also be thought of as a boolean condition on the following text, where the positive lookahead operator "&" resembles an "and", and the negative lookahead operator an "and not". As in our example, these operators are very helpful for "exploring" the surroundings of a piece of text to be captured by a parser. They allow parsers to match or fail depending on the ensuing text. A negative lookahead expression can also serve to encode the meaning of "without" if placed in front of another expression. Let's rewrite our grammar of a definitions-DSL so as to exclude certain bad words:: >>> def_DSL = def_DSL[:def_DSL.find('word =')] + ''' ... word = !forbidden /[a-z]+|[A-Z][a-z]*/~ ... forbidden = /[sf][a-z][a-z][a-z]/~ ... EOF = /$/ ''' >>> def_parser = create_parser(def_DSL, "defDSL") >>> syntax_tree = def_parser('nice := nice word') >>> print(syntax_tree) nice := nice word >>> syntax_tree = def_parser('sxxx := bad word') >>> print(str(syntax_tree).strip()) <<< Error on "sxxx := bad word" | Parser "definitions" did not match: »sxxx := bad word« >>> The same effect can be achieved by using the subtraction operator "-". This is just syntactic sugar to make the use of the negative lookahead operator in the sense of "without" more intuitive:: >>> def_DSL = def_DSL[:def_DSL.find('word =')] + ''' ... word = all_words - forbidden ... all_words = /[a-z]+|[A-Z][a-z]*/~ ... forbidden = /[sf][a-z][a-z][a-z]/~ ... EOF = /$/ ''' >>> def_parser = create_parser(def_DSL, "defDSL") >>> syntax_tree = def_parser('sxxx := bad word') >>> print(str(syntax_tree).strip()) <<< Error on "sxxx := bad word" | Parser "definitions" did not match: »sxxx := bad word« >>> Next to the lookahead operators, there also exist lookback operators. Be warned, though, that lookback operators are an **experimental** feature in DHParser and that their implementation is highly idiosyncratic, that is, it is most likely not compatible with any other parser-generator-toolkit based on EBNF-grammars. Also, lookback operators in DHParser are more restricted than lookahead-operators. They can only be used in combination with simple text or regular expression parsers and - here comes the idiosyncratic part - they work in the opposite direction. 
This means that if you want to check whether a parser is preceded, say, by the keyword "BEGIN", the text phrase that you have to check for with the lookback parser is actually "NIGEB". If that still does not put you off, here is how lookback-operators are used: Let's assume that our definition should not only allow for a definiens but, alternatively, for enumerations, and that the difference is indicated by using a simple equal sign "=" instead of the definition symbol ":=". Then, using lookback-operators to distinguish the cases, we can rewrite our grammar as follows:: >>> def_DSL_extended = ''' ... @literalws = right ... definitions = ~ definition { definition } EOF ... definition = definiendum (":=" | "=") (definiens | enumeration) ... definiendum = word ... definiens = <-& /\s*=:/ word { word &(word|EOF) } ... enumeration = <-& /\s*=[^:]/ word { word &(word|EOF) } ... word = /[a-z]+|[A-Z][a-z]*/~ ... EOF = /$/ ''' >>> def_parser = create_parser(def_DSL_extended, "defDSL") >>> definitions = ''' ... dog := carnivorous quadrupel that barks ... drinks = water beer juice ... human := featherless biped''' >>> def_parser = create_parser(def_DSL_extended, "defDSL") >>> syntax_tree = def_parser(definitions) >>> print(str(syntax_tree.pick('enumeration')).strip()) water beer juice The lookback operators are ``<-&`` for the positive lookback and ``<-!`` for the negative lookback, each of which must be followed by a regular expression or a string. Of course, this example is rather contrived and the grammar can easily be rewritten without the lookback-operators. Locating errors and customizing error messages ---------------------------------------------- Providing users with proper error information is one of the most tenacious problems when implementing a parser for a domain specific language. There are three different challenges: 1. Locating the error at the correct position in the source code. 2. Providing proper error messages that explain the reason for the error. 3. Resuming the parsing process after an error has occurred at the nearest possible place without producing artificial follow-up errors. In the following, DHParser's techniques for the first two challenges, locating errors and customizing error messages, will be described. Techniques for resuming the parsing process after an error occurred or for passing by erroneous passages in the source code will be explained below, under the heading "Fail-tolerant Parsing". Farthest-Fail-Heuristics ^^^^^^^^^^^^^^^^^^^^^^^^ Without adding any hints to the grammar, DHParser applies only a very basic technique for locating the error if the grammar does not match a text, which is known as "farthest failure" and locates the error at the "farthest" position where a parser failed, reporting the last named parser in the call chain (that first reached this location) as the cause of the failure. This approach often works surprisingly well for locating errors, unless the grammar relies too heavily on regular expressions capturing large chunks of text, because the error location works only on the level of the parsing expression grammar, not at that of the atomic regular expressions. To see how the farthest fail works, consider a parser for simple arithmetic expressions:: >>> arithmetic_grammar = '''@ literalws = right ... arithmetic = expression EOF ... expression = ~ term { ("+" | "-") term } ... term = factor { ("*" | "/") factor } ... factor = number | group ... group = "(" expression ")" ... number = /\\\\d+/~ ... 
EOF = /$/''' >>> arithmetic = create_parser(arithmetic_grammar, "arithmetic") >>> terms = arithmetic('(2 - 3 * (4 + 5)') >>> print(terms.errors[0]) 1:17: Error (1040): Parser "term->`*`" did not match: »« >>> terms = arithmetic('(2 - 3) * ( )') >>> print(terms.errors[0]) 1:13: Error (1040): Parser "number->/\\\\d+/" did not match: »)« As can be seen, the location of the error is captured well enough, at least when we keep in mind that the computer cannot guess where we would have placed the forgotten closing bracket. It can only report the point where the mistake becomes apparent. However, the reported fact that it was the sub-parser \`*\` of parser term that failed at this location does little to enlighten us with respect to the cause of the failure. The "farthest fail"-method as implemented by DHParser yields the first parser (of possibly several) that has been tried at the position where the farthest fail occurred. Thus, in this case, a failure of the parser capturing \`*\` is reported rather than of the parser expression->\`+\`. Changing this by reporting the last parser or all parsers that failed at this location would do little to remedy this situation, however. In this example, it would just be as confusing to learn that expression->\`+\` failed at the end of the parsed string. Marking mandatory items with "§" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Thus, the "farthest fail"-method is not very suitable for explaining the failure or pinpointing which parser really was the culprit. Therefore, DHParser provides a simple annotation that allows raising a parsing error deliberately if a certain point in the chain of parsers has not been reached: By placing the "§"-sign as a "mandatory-marker" in front of a parser, the parser as well as all subsequent parsers in the same sequence will not simply return a non-match when failing, but will cause the entire parsing process to stop and report an error at the location of failure:: >>> arithmetic_grammar = arithmetic_grammar.replace( ... 'group = "(" expression ")"', ... 'group = "(" § expression ")"') >>> arithmetic = create_parser(arithmetic_grammar, "arithmetic") >>> terms = arithmetic('(2 - 3 * (4 + 5)') >>> print(terms.errors[0]) 1:17: Error (1010): `)` ~ expected by parser 'group', »...« found! >>> terms = arithmetic('(2 - 3) * ( )') >>> print(terms.errors[0]) 1:13: Error (1010): expression expected by parser 'group', »)...« found! The error messages give a much better indication of the cause of the error. What is reported as cause is either the name of the parser that was expected to match, as in the second case, or the rule of the parser in case of unnamed parsers, as in the first case. This usually, though unfortunately not always, yields a much better indication of the location and cause of an error than the farthest failure. However, a little care has to be taken not to place the mandatory marker in front of a parser that might fail at a location that could still be reached and matched by another branch of the grammar. (In our example it is clear that round brackets enclose only groups. Thus, if the opening round bracket has matched, we can be sure that what follows must be an expression followed by a closing round bracket, or, if not, it is a mistake.) Luckily, although this may sound complicated, in practice it never is. Unless your grammar is very badly structured, you will hardly ever make this mistake, and if you do, you will notice soon enough. Also, there is an important restriction: There is only one §-marker allowed per named parser. 
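The usual workaround for this restriction, which is described in the following paragraph, is to split a definition that would need several §-markers into helper symbols, each of which contains a single §-marker. A minimal sketch (the symbols are made up for illustration and are not part of any grammar above)::

    pair   = _key _value
    _key   = `<` § /[a-z]+/ `>`
    _value = `=` § /[0-9]+/

If the helper symbols serve no other purpose, they can be marked as disposable, just like the other underscore-prefixed symbols in this module's examples.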
In case you have a long EBNF-expression on the right hand side of a symbol-definition, where you'd like to use the §-marker at more than one place, you can, however, always split it into several expressions by introducing new symbols. These symbols, if they serve no other purpose, can be marked as disposable with the ``@ disposable``-directive (see :ref:`simplifying_syntax_trees`). The §-marker has proven to be a very simple means of pinpointing errors in the DSL-code, and I recommend using it from early on in the process of developing a new grammar. Plus, the §-marker offers two further benefits, namely customizing error messages and resuming the parsing process after a failure. The latter is particularly helpful if the DSL is to be used with an integrated development environment, which benefits greatly from fail-tolerant parsing. However, I recommend starting to use these only after the grammar has reached a certain amount of maturity, because changing the grammar often requires re-adjusting customized error messages and resume-clauses as well, which can become tedious. Customizing error messages ^^^^^^^^^^^^^^^^^^^^^^^^^^ While the error messages produced by the use of the §-marker are often quite understandable for the engineer designing the grammar of a DSL, they might not be so for the user of the DSL, who might not know the names of the parsers of the grammar, let alone the expressions of the unnamed parsers, and will therefore not always be able to make much sense of error-messages that report just these. In order to customize error messages, the symbol-related directive ``@ SYMBOLNAME_error = CONDITION, ERROR_STRING`` is used. The directive's name consists of the name of a symbol that contains a §-marker and the appendix ``_error``. The directive always takes two arguments, separated as usual by a comma, of which the first is a condition-expression and the second an error message. The condition can be used to make the choice of an error-message dependent on the text following the point of failure. It can either be a regular expression or a simple string which must match (or be equal to in the case of the string) the first part of the text at the position where the parser defined by the symbol failed to match and raised an error. Only if the condition matches will the error message given as the second argument be emitted. Otherwise, the fallback error-expression described above ("... expected by parser ...") will be shown. The empty string ``''`` can be used as a fallback if the customized message shall always be emitted, no matter what the following text looks like. The error string is a format string that may include any of the two arguments ``{0}`` or ``{1}`` where ``{0}`` will be replaced by the name or string representation of the parser that was expected to match but didn't and ``{1}`` will be replaced by the first twenty or so letters of the unmatched rest of the text. Here is a simple example that could be part of a JSON-parser that is intended to deliver understandable error-messages:: >>> grammar = ''' ... @ string_error = '', 'Illegal character(s) »{1}« in string.' ... string = `"` §characters `"` ~ ... characters = { plain | escape } ... plain = /[^"\\\\\\]+/ ... escape = /\\\\\\[\\\\/bnrt\\\\\\]/''' >>> json_string = create_parser(grammar, 'json_string') >>> print(json_string('"alpha"')) "alpha" >>> for e in json_string('"al\\pha"').errors: print(e) 1:4: Error (1010): Illegal character(s) »\pha"...« in string. 
1:4: Error (1040): Parser "string" stopped before end, at: \pha" Terminating parser. Customized error-messages must always be specified in the grammar before the definition of the symbol they are related to, and they can be stacked. That is, several different error-directives with different conditions and messages but related to the same symbol can be specified. The conditions are evaluated in the order the error-directives appear in the grammar and the error message of the first matching condition is picked. Therefore, the more specific conditions should always be placed first and the more general or fallback conditions should be placed below these:: >>> grammar = ("@ string_error = /\\\\\\/, 'Illegal escape sequence »{1}« " ... "Allowed values are b,n,r,t,u'") + grammar >>> json_string = create_parser(grammar, 'json_string') >>> for e in json_string('"al\\pha"').errors: print(e) 1:4: Error (1010): Illegal escape sequence »\pha"...« Allowed values are b,n,r,t,u 1:4: Error (1040): Parser "string" stopped before end, at: \pha" Terminating parser. Here, the more specific and more understandable error message has been selected. Careful readers might notice that the more general customized error message "Illegal character(s) ... found in string" will now only be selected if the string contains a character that not even the regular expression engine recognizes, because the only other characters that are not allowed within the string are the closing quotation marks that terminate the string, and these do not cause the parser to fail (but only to terminate too early). Also, it might be noticed that the errors are always caused by a failure to match the second ``"``-sign, because the characters-parser also matches the empty string and thus never fails or raises any error. Nonetheless, the error can occur in the interior of the string and can - with the help of customized error messages - be described as such and properly be located. However, emitting the right error messages based on a regular-expression-condition is not quite easy, and sometimes customized error messages can even be more confusing for the users of the DSL. My recommendation is to wait for user feedback or to monitor the errors that users typically make and then to customize the error messages to the actual needs of the users to help them understand why the computer refuses to parse a certain construct. Fail-tolerant Parsing --------------------- A serious limitation of all previously described error-handling mechanisms is that the parsing process still stops on the very first error. This is particularly annoying for beginners learning to code data or program code with a new DSL, because the compiler must be run at least as many times as there are errors in the code to find all of them. It would be much better to receive a list of all or at least most errors on the first run. And, finally, modern development environments can make the most of the incremental compilation made possible by fail-tolerant parsers. A generic method for fail-tolerant parsing ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ There are a number of techniques for fail-tolerant parsing. One technique that is not specific to DHParser but can be used with any parser-generator is to add junctions for possibly erroneous code to the grammar:: >>> grammar = ''' ... string = `"` ([characters] `"` | string_error [`"`]) ~ ... string_error = /[^"]*/ ... characters = { plain | escape }+ ... plain = /[^"\\\\\\]+/ ... 
... escape = /\\\\\\[\\\\/bnrt\\\\\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.as_sxpr(flatten_threshold=0))
(string (:Text '"') (string_error "al\pha") (:Text '"'))

This time the parser did not need to stop at the erroneous part. The erroneous part itself has been caught within a node that betrays only by its name that there was an error. To produce error messages we have to add them explicitly to such nodes afterwards:

>>> for string_error in tree.select('string_error'):
...     _ = tree.new_error(string_error, 'Error in string: ' + str(string_error))
>>> for e in tree.errors: print(e)
1:2: Error (1000): Error in string: al\pha

Unfortunately, the error location is not very precise. This can be remedied by refining our error junction code::

>>> grammar = '''
... string = `"` ([characters] `"` | string_error [`"`]) ~
... string_error = [characters] { ups [characters] }+
... ups = /[^"]/
... characters = { plain | escape }+
... plain = /[^"\\\\\\]+/
... escape = /\\\\\\[\\\\/bnrt\\\\\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.as_sxpr())
(string (:Text '"') (string_error (characters (plain "al")) (ups "\\") (characters (plain "pha"))) (:Text '"'))

Here, the node named "ups" pinpoints the precise error location.

Like most techniques for fail-tolerant parsing, this one is not quite as easy to master in practice as it might look. Generally, adding a junction for erroneous code works best when the passage that shall be by-passed is delineated by easily recognizable follow-up strings. In this example the follow-up string would be the ``"``-sign. The method fails, of course, if the follow-up text is erroneous, too, or has even been forgotten. So, to be absolutely sure, one would have to consider different follow-up sequences, say, empty lines, keywords that mark new parts of the document, and the like.

DHParser's support for fail-tolerant parsing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DHParser offers two constructs for fail-tolerant parsing which are quite similar to the technique just described. However, they do not require rewriting the grammar and they reuse the error-locating ability of the §-marker. A disadvantage is that the DHParser-specific support for fail-tolerant parsing presently relies entirely on regular expressions for finding the right re-entry points.

DHParser allows resuming parsing after an error at a later point in the text. When trying to resume parsing, two questions must be answered:

1. At what location should the parsing process be resumed?

2. Which parser in the parser call-chain should resume parsing? E.g. the parser that failed, the parser that called the parser that failed, ... ?

The location where parsing should be resumed must be specified by a regular expression or a list of regular expressions. The resumption location is the nearest match of any of these expressions that does not fall into a comment (as specified by the ``@comment``-directive described above). More precisely, it is the location directly after the match, because this allows searching for the reentry-location both by the text preceding and by the text following this location, using a lookahead operator inside the regular expression.
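Since the reentry-point is the location directly after the match, it makes a practical difference whether the regular expression merely looks ahead or actually consumes text. The following two patterns are given purely for illustration (they are not part of a complete example)::

    /(?=")/    zero-width match: parsing resumes directly in front of the next quotation mark
    /"/        the quotation mark is part of the match: parsing resumes directly after it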
The parser that resumes parsing depends on the directive that guides the search for the reentry-point. DHParser offers two different directives for this purpose, the ``@..._skip``-directive and the ``@..._resume``-directive. The placeholder ... stands for the name of a parser that contains a §-marker.

The skip-directive resumes parsing with the sequence-parser that contains the item(s) marked by the §-marker. In the following example, the skip-directive picks up parsing with the string-parser when an error was raised by the string-parser::

>>> grammar = '''
... @ string_error = /\\\\\\/, 'Illegal escape sequence »{1}«'
... @ string_error = '', 'Illegal character "{1}" in string.'
... @ string_skip = /(?=")/
... string = `"` §characters `"` ~
... characters = { plain | escape }
... plain = /[^"\\\\\\]+/
... escape = /\\\\\\[\\\\/bnrt\\\\\\]/'''
>>> json_string = create_parser(grammar, 'json_string')
>>> tree = json_string('"al\\pha"')
>>> print(tree.content)
"al\pha"
>>> print(tree.errors[0])
1:4: Error (1010): Illegal escape sequence »\pha"...«
>>> print(tree.as_sxpr())
(string (:Text '"') (characters (plain "al")) (ZOMBIE__ `(1:4: Error (1010): Illegal escape sequence »\pha"...«) "\pha") (:Text '"'))

After the error has occurred at the illegal escape-sequence, the skip-directive catches the error, skips to the location where the `"`-character lies just ahead, and continues parsing with the string-parser. The skipped passage is stored in a ZOMBIE__-Node within the syntax-tree, and parsing can continue through to the end of the text.

In contrast to the skip-directive, the resume-directive leaves the parser that raised the error and resumes one level higher up in the call chain. The ``@ ..._resume``-directive tells the *calling* parsers where to continue after a parser has failed. So, if, for example, an "array"-parser fails, the parser resuming the parsing process is not the array-parser itself, but the first parser in the reverse call-stack of "array" that catches up at the location indicated by the ``@ ..._resume``-directive.
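To make this more concrete, here is a schematic sketch of how a resume-directive for such an array-parser could look. It is not a tested doctest; the rule and the regular expression are assumptions chosen for illustration only::

    @ array_resume = /\]\s*/
    array = "[" ~ ( _element ( "," ~ §_element )* )? "]" ~

If the mandatory ``§_element`` fails somewhere inside the array, the error is reported, the text up to and including the next ``]`` (plus any following whitespace) is passed over, and the parser that *called* ``array`` - not ``array`` itself - continues parsing right after that point.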
The location itself is determined by a regular expression, where the point for reentry is the location *after* the next match of the regular expression:: Semantic Actions and Storing Variables -------------------------------------- """ from collections import OrderedDict from functools import partial import keyword import os from typing import Callable, Dict, List, Set, FrozenSet, Tuple, Sequence, Union, Optional from DHParser.compile import CompilerError, Compiler, ResultTuple, compile_source, visitor_name from DHParser.configuration import access_thread_locals, get_config_value, \ NEVER_MATCH_PATTERN, ALLOWED_PRESET_VALUES from DHParser.error import Error, AMBIGUOUS_ERROR_HANDLING, WARNING, REDECLARED_TOKEN_WARNING,\ REDEFINED_DIRECTIVE, UNUSED_ERROR_HANDLING_WARNING, INAPPROPRIATE_SYMBOL_FOR_DIRECTIVE, \ DIRECTIVE_FOR_NONEXISTANT_SYMBOL, UNDEFINED_SYMBOL_IN_TRANSTABLE_WARNING, \ UNCONNECTED_SYMBOL_WARNING, REORDERING_OF_ALTERNATIVES_REQUIRED, BAD_ORDER_OF_ALTERNATIVES, \ EMPTY_GRAMMAR_ERROR, MALFORMED_REGULAR_EXPRESSION from DHParser.parse import Parser, Grammar, mixin_comment, mixin_nonempty, Forward, RegExp, \ Drop, Lookahead, NegativeLookahead, Alternative, Series, Option, ZeroOrMore, OneOrMore, \ Text, Capture, Retrieve, Pop, optional_last_value, GrammarError, Whitespace, Always, Never, \ INFINITE, matching_bracket, ParseFunc, update_scanner, CombinedParser from DHParser.preprocess import nil_preprocessor, PreprocessorFunc from DHParser.syntaxtree import Node, RootNode, WHITESPACE_PTYPE, TOKEN_PTYPE, ZOMBIE_TAG, \ flatten_sxpr from DHParser.toolkit import load_if_file, escape_re, escape_ctrl_chars, md5, \ sane_parser_name, re, expand_table, unrepr, compile_python_object, DHPARSER_PARENTDIR, \ cython from DHParser.transform import TransformationFunc, traverse, remove_brackets, \ reduce_single_child, replace_by_single_child, is_empty, remove_children, \ remove_tokens, flatten, forbid, assert_content, remove_children_if, all_of, not_one_of, \ BLOCK_LEAVES from DHParser.versionnumber import __version__ __all__ = ('DHPARSER_IMPORTS', 'get_ebnf_preprocessor', 'get_ebnf_grammar', 'get_ebnf_transformer', 'get_ebnf_compiler', 'parse_ebnf', 'transform_ebnf', 'compile_ebnf_ast', 'EBNFGrammar', 'EBNFTransform', 'EBNFCompilerError', 'EBNFDirectives', 'EBNFCompiler', 'grammar_changed', 'compile_ebnf', 'PreprocessorFactoryFunc', 'ParserFactoryFunc', 'TransformerFactoryFunc', 'CompilerFactoryFunc') ######################################################################## # # source code support # ######################################################################## DHPARSER_IMPORTS = ''' import collections from functools import partial import os import sys from typing import Tuple, List, Union, Any, Optional, Callable try: scriptpath = os.path.dirname(__file__) except NameError: scriptpath = '' dhparser_parentdir = os.path.abspath(os.path.join(scriptpath, r'{dhparser_parentdir}')) if scriptpath not in sys.path: sys.path.append(scriptpath) if dhparser_parentdir not in sys.path: sys.path.append(dhparser_parentdir) try: import regex as re except ImportError: import re from DHParser import start_logging, suspend_logging, resume_logging, is_filename, load_if_file, \\ Grammar, Compiler, nil_preprocessor, PreprocessorToken, Whitespace, Drop, AnyChar, \\ Lookbehind, Lookahead, Alternative, Pop, Text, Synonym, Counted, Interleave, INFINITE, \\ Option, NegativeLookbehind, OneOrMore, RegExp, Retrieve, Series, Capture, TreeReduction, \\ ZeroOrMore, Forward, NegativeLookahead, Required, CombinedParser, 
mixin_comment, \\ compile_source, grammar_changed, last_value, matching_bracket, PreprocessorFunc, is_empty, \\ remove_if, Node, TransformationFunc, TransformationDict, transformation_factory, traverse, \\ remove_children_if, move_adjacent, normalize_whitespace, is_anonymous, matches_re, \\ reduce_single_child, replace_by_single_child, replace_or_reduce, remove_whitespace, \\ replace_by_children, remove_empty, remove_tokens, flatten, all_of, any_of, \\ merge_adjacent, collapse, collapse_children_if, transform_content, WHITESPACE_PTYPE, \\ TOKEN_PTYPE, remove_children, remove_content, remove_brackets, change_tag_name, \\ remove_anonymous_tokens, keep_children, is_one_of, not_one_of, has_content, apply_if, peek, \\ remove_anonymous_empty, keep_nodes, traverse_locally, strip, lstrip, rstrip, \\ transform_content, replace_content_with, forbid, assert_content, remove_infix_operator, \\ add_error, error_on, recompile_grammar, left_associative, lean_left, set_config_value, \\ get_config_value, node_maker, access_thread_locals, access_presets, PreprocessorResult, \\ finalize_presets, ErrorCode, RX_NEVER_MATCH, set_tracer, resume_notices_on, \\ trace_history, has_descendant, neg, has_ancestor, optional_last_value, insert, \\ positions_of, replace_tag_names, add_attributes, delimit_children, merge_connected, \\ has_attr, has_parent, ThreadLocalSingletonFactory, Error, canonical_error_strings, \\ has_errors, ERROR, FATAL, set_preset_value, get_preset_value, NEVER_MATCH_PATTERN, \\ gen_find_include_func, preprocess_includes, make_preprocessor, chain_preprocessors ''' ######################################################################## # # EBNF scanning # ######################################################################## def get_ebnf_preprocessor() -> PreprocessorFunc: """ Returns the preprocessor function for the EBNF compiler. As of now, no preprocessing is needed for EBNF-sources. Therefore, just a dummy function is returned. """ return nil_preprocessor ######################################################################## # # EBNF parsing # ######################################################################## class EBNFGrammar(Grammar): r"""Parser for a FlexibleEBNF source file. This grammar is tuned for flexibility, that is, it supports as many different flavors of EBNF as possible. However, this flexibility comes at the cost of some ambiguities. In particular: 1. the alternative OR-operator / could be mistaken for the start of a regular expression and vice versa, and 2. character ranges [a-z] can be mistaken for optional blocks and vice versa A strategy to avoid these ambiguities is to do all of the following: - replace the free_char-parser by a never matching parser - if this is done, it is safe to replace the char_range_heuristics- parser by an always matching parser - replace the regex_heuristics by an always matching parser Ambiguities can also be avoided by NOT using all the syntactic variants made possible by this EBNF-grammar within one and the same EBNF-document. EBNF-definition of the Grammar:: @ comment = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/ # comments can be either C-Style: /* ... */ # or pascal/modula/oberon-style: (* ... *) # or python-style: # ... 
\n, excluding, however, character markers: #x20 @ whitespace = /\s*/ # whitespace includes linefeed @ literalws = right # trailing whitespace of literals will be ignored tacitly @ disposable = component, pure_elem, countable, FOLLOW_UP, SYM_REGEX, ANY_SUFFIX, EOF @ drop = whitespace, EOF # do not include these even in the concrete syntax tree @ RNG_BRACE_filter = matching_bracket() # filter or transform content of RNG_BRACE on retrieve # re-entry-rules for resuming after parsing-error @ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/ @ directive_resume = /\n\s*(?=@|\w+\w*\s*=)/ # specialized error messages for certain cases @ definition_error = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to ' 'be a directive and the directive symbol @ is missing\nor the error is ' 'due to inconsistent use of the comma as a delimiter\nfor the elements ' 'of a sequence.' #: top-level syntax = ~ { definition | directive } EOF definition = symbol §:DEF~ [ :OR~ ] expression :ENDL~ & FOLLOW_UP # [:OR~] to support v. Rossum's syntax directive = "@" §symbol "=" component { "," component } & FOLLOW_UP component = literals | procedure | expression literals = { literal }+ # string chaining, only allowed in directives! procedure = SYM_REGEX "()" # procedure name, only allowed in directives! FOLLOW_UP = `@` | symbol | EOF #: components expression = sequence { :OR~ sequence } sequence = ["§"] ( interleave | lookaround ) # "§" means all following terms mandatory { !`@` !(symbol :DEF) :AND~ ["§"] ( interleave | lookaround ) } interleave = difference { "°" ["§"] difference } lookaround = flowmarker § (oneormore | pure_elem) difference = term ["-" § (oneormore | pure_elem)] term = oneormore | counted | repetition | option | pure_elem #: elements countable = option | oneormore | element pure_elem = element § !ANY_SUFFIX # element strictly without a suffix element = [retrieveop] symbol !:DEF # negative lookahead to be sure it's not a definition | literal | plaintext | regexp | char_range | character ~ | any_char | whitespace | group ANY_SUFFIX = /[?*+]/ #: flow-operators flowmarker = "!" | "&" # '!' negative lookahead, '&' positive lookahead | "<-!" | "<-&" # '<-!' negative lookbehind, '<-&' positive lookbehind retrieveop = "::" | ":?" | ":" # '::' pop, ':?' optional pop, ':' retrieve #: groups group = "(" no_range §expression ")" oneormore = "{" no_range expression "}+" | element "+" repetition = "{" no_range §expression "}" | element "*" no_range option = !char_range "[" §expression "]" | element "?" counted = countable range | countable :TIMES~ multiplier | multiplier :TIMES~ §countable range = RNG_BRACE~ multiplier [ :RNG_DELIM~ multiplier ] ::RNG_BRACE~ no_range = !multiplier | &multiplier :TIMES multiplier = /[1-9]\d*/~ #: leaf-elements symbol = SYM_REGEX ~ # e.g. expression, term, parameter_list literal = /"(?:(? 
None: Grammar.__init__(self, root, static_analysis) self.free_char_parsefunc__ = self.free_char._parse self.char_range_heuristics_parsefunc__ = self.char_range_heuristics._parse self.regex_heuristics_parserfunc__ = self.regex_heuristics._parse @property def mode(self) -> str: def which(p: Parser) -> str: if p._parse_proxy.__qualname__ == 'Never._parse': return 'never' elif p._parse_proxy.__qualname__ == 'Always._parse': return 'always' else: return 'custom' signature = ( which(self.free_char), which(self.regex_heuristics), which(self.char_range_heuristics) ) if signature == ('custom', 'custom', 'custom'): return 'heuristic' elif signature == ('never', 'always', 'always'): return 'strict' # or 'classic' elif signature == ('custom', 'never', 'always'): return 'peg-like' elif signature == ('custom', 'always', 'always'): return 'regex-like' else: return "undefined" @mode.setter def mode(self, mode: str): if mode == self.mode: return def set_parsefunc(p: Parser, f: ParseFunc): method = f.__get__(p, type(p)) # bind function f to parser p p._parse_proxy = method always = Always._parse never = Never._parse if mode == 'heuristic': set_parsefunc(self.free_char, self.free_char_parsefunc__) set_parsefunc(self.regex_heuristics, self.regex_heuristics_parserfunc__) set_parsefunc(self.char_range_heuristics, self.char_range_heuristics_parsefunc__) elif mode in ('strict', 'classic'): set_parsefunc(self.free_char, never) set_parsefunc(self.regex_heuristics, always) set_parsefunc(self.char_range_heuristics, always) elif mode == 'peg-like': set_parsefunc(self.free_char, self.free_char_parsefunc__) set_parsefunc(self.regex_heuristics, never) set_parsefunc(self.char_range_heuristics, always) elif mode == 'regex-like': set_parsefunc(self.free_char, self.free_char_parsefunc__) set_parsefunc(self.regex_heuristics, always) set_parsefunc(self.char_range_heuristics, always) else: raise ValueError('Mode must be one of: ' + ', '.join( ALLOWED_PRESET_VALUES['syntax_variant'])) class FixedEBNFGrammar(Grammar): r"""Faster version of EBNF, where delimiters are not determined on first use, but defined as constant Text-parsers. They can still be adjusted with function `parse.update_scanner()`. Different syntactical variants can be configured either by adjusting the definitions of DEF, OR, AND, ENDL, RNG_OPEN, RNG_CLOSE, RNG_DELIM, CH_LEADIN, TIMES, RE_LEADIN, RE_LEADOUT either within this grammar definition or in the Grammar-object changing the `text`-field of the respective parser objects. EBNF-Definition of the Grammar:: @ comment = /(?!#x[A-Fa-f0-9])#.*(?:\n|$)|\/\*(?:.|\n)*?\*\/|\(\*(?:.|\n)*?\*\)/ # comments can be either C-Style: /* ... */ # or pascal/modula/oberon-style: (* ... *) # or python-style: # ... 
\n, excluding, however, character markers: #x20 @ whitespace = /\s*/ # whitespace includes linefeed @ literalws = right # trailing whitespace of literals will be ignored tacitly @ disposable = component, pure_elem, countable, FOLLOW_UP, SYM_REGEX, ANY_SUFFIX, EOF @ drop = whitespace, EOF # do not include these even in the concrete syntax tree @ RNG_BRACE_filter = matching_bracket() # filter or transform content of RNG_BRACE on retrieve # re-entry-rules for resuming after parsing-error @ definition_resume = /\n\s*(?=@|\w+\w*\s*=)/ @ directive_resume = /\n\s*(?=@|\w+\w*\s*=)/ # specialized error messages for certain cases @ definition_error = /,/, 'Delimiter "," not expected in definition!\nEither this was meant to ' 'be a directive and the directive symbol @ is missing\nor the error is ' 'due to inconsistent use of the comma as a delimiter\nfor the elements ' 'of a sequence.' #: top-level syntax = ~ { definition | directive } EOF definition = symbol §DEF~ [ OR~ ] expression ENDL~ & FOLLOW_UP # [OR~] to support v. Rossum's syntax directive = "@" §symbol "=" component { "," component } & FOLLOW_UP component = literals | procedure | expression literals = { literal }+ # string chaining, only allowed in directives! procedure = SYM_REGEX "()" # procedure name, only allowed in directives! FOLLOW_UP = `@` | symbol | EOF #: components expression = sequence { OR~ sequence } sequence = ["§"] ( interleave | lookaround ) # "§" means all following terms mandatory { AND~ ["§"] ( interleave | lookaround ) } interleave = difference { "°" ["§"] difference } lookaround = flowmarker § (oneormore | pure_elem) difference = term ["-" § (oneormore | pure_elem)] term = oneormore | counted | repetition | option | pure_elem #: elements countable = option | oneormore | element pure_elem = element § !ANY_SUFFIX # element strictly without a suffix element = [retrieveop] symbol !DEF # negative lookahead to be sure it's not a definition | literal | plaintext | regexp # | char_range | character ~ | any_char | whitespace | group ANY_SUFFIX = /[?*+]/ #: flow-operators flowmarker = "!" | "&" # '!' negative lookahead, '&' positive lookahead | "<-!" | "<-&" # '<-!' negative lookbehind, '<-&' positive lookbehind retrieveop = "::" | ":?" | ":" # '::' pop, ':?' optional pop, ':' retrieve #: groups group = "(" no_range §expression ")" oneormore = "{" no_range expression "}+" | element "+" repetition = "{" no_range §expression "}" | element "*" no_range option = # !char_range "[" §expression "]" | element "?" counted = countable range | countable TIMES~ multiplier | multiplier TIMES~ §countable range = RNG_OPEN~ multiplier [ RNG_DELIM~ multiplier ] RNG_CLOSE~ no_range = !multiplier | &multiplier TIMES multiplier = /[1-9]\d*/~ #: leaf-elements symbol = SYM_REGEX ~ # e.g. expression, term, parameter_list literal = /"(?:(? 
bool: """ Returns ``True`` if ``grammar_class`` does not reflect the latest changes of ``grammar_source`` Parameters: grammar_class: the parser class representing the grammar or the file name of a compiler suite containing the grammar grammar_source: File name or string representation of the EBNF code of the grammar Returns (bool): True, if the source text of the grammar is different from the source from which the grammar class was generated """ grammar = load_if_file(grammar_source) chksum = md5(grammar, __version__) if isinstance(grammar_class, str): # grammar_class = load_compiler_suite(grammar_class)[1] with open(grammar_class, 'r', encoding='utf8') as f: pycode = f.read() m = re.search(r'class \w*\(Grammar\)', pycode) if m: m = re.search(' {4}source_hash__ *= *"([a-z0-9]*)"', pycode[m.span()[1]:]) if m is None: return False return not (m.groups() and m.groups()[-1] == chksum) else: return True else: return chksum != grammar_class.source_hash__ def get_ebnf_grammar() -> EBNFGrammar: """Returns a thread-local EBNF-Grammar-object for parsing EBNF sources.""" THREAD_LOCALS = access_thread_locals() mode = get_config_value('syntax_variant') try: grammar = THREAD_LOCALS.ebnf_grammar_singleton if mode in ('fixed', 'configurable'): if not isinstance(grammar, FixedEBNFGrammar): raise AttributeError else: if not isinstance(grammar, EBNFGrammar): raise AttributeError except AttributeError: if mode in ('fixed', 'configurable'): grammar = FixedEBNFGrammar(static_analysis=False) if mode == "fixed": # configure grammar once update_scanner(grammar, get_config_value('delimiter_set')) THREAD_LOCALS.ebnf_grammar_singleton = grammar else: grammar = EBNFGrammar(static_analysis=False) THREAD_LOCALS.ebnf_grammar_singleton = grammar if mode == 'configurable': # configure grammar on each request of the grammar object update_scanner(grammar, get_config_value('delimiter_set')) elif mode != 'fixed': grammar.mode = mode return grammar def parse_ebnf(ebnf: str) -> Node: """Parses and EBNF-source-text and returns the concrete syntax tree of the EBNF-code..""" return get_ebnf_grammar()(ebnf) ######################################################################## # # EBNF concrete to abstract syntax tree transformation and validation # ######################################################################## EBNF_AST_transformation_table = { # AST Transformations for EBNF-grammar "<": [BLOCK_LEAVES, remove_children_if(all_of(not_one_of('regexp'), is_empty))], "syntax": [], "directive": [flatten, remove_tokens('@', '=', ',')], "procedure": [remove_tokens('()'), reduce_single_child], "literals": [replace_by_single_child], "definition": [flatten, remove_children('DEF', 'ENDL'), remove_tokens('=')], # remove_tokens('=') is only for backwards-compatibility "expression": [replace_by_single_child, flatten, remove_children('OR'), remove_tokens('|')], # remove_tokens('|') is only for backwards-compatibility "sequence": [replace_by_single_child, flatten, remove_children('AND')], "interleave": [replace_by_single_child, flatten, remove_tokens('°')], "lookaround": [], "difference": [remove_tokens('-'), replace_by_single_child], "term, pure_elem, element": [replace_by_single_child], "flowmarker, retrieveop": [reduce_single_child], "group": [remove_brackets], "oneormore, repetition, option": [reduce_single_child, remove_brackets, # remove_tokens('?', '*', '+'), forbid('repetition', 'option', 'oneormore'), assert_content(r'(?!§)(?:.|\n)*')], "counted": [remove_children('TIMES')], "range": [remove_children('BRACE_SIGN', 'RNG_BRACE', 
'RNG_DELIM', 'RNG_OPEN', 'RNG_CLOSE')], "symbol, literal, any_char": [reduce_single_child], "plaintext": [], "regexp": [remove_children('RE_LEADIN', 'RE_LEADOUT'), reduce_single_child], "char_range": [flatten, remove_tokens('[', ']')], "character": [remove_children('CH_LEADIN'), reduce_single_child], "free_char": [], (TOKEN_PTYPE, WHITESPACE_PTYPE, "whitespace"): [reduce_single_child], "EOF, DEF, OR, AND, ENDL, BRACE_SIGN, RNG_BRACE, RNG_DELIM, RNG_OPEN, " "RNG_CLOSE, TIMES, RE_LEADIN, RE_CORE, RE_LEADOUT, CH_LEADIN": [], "*": [replace_by_single_child] } def EBNFTransform() -> TransformationFunc: return partial(traverse, processing_table=EBNF_AST_transformation_table.copy()) def get_ebnf_transformer() -> TransformationFunc: THREAD_LOCALS = access_thread_locals() try: transformer = THREAD_LOCALS.EBNF_transformer_singleton except AttributeError: THREAD_LOCALS.EBNF_transformer_singleton = EBNFTransform() transformer = THREAD_LOCALS.EBNF_transformer_singleton return transformer def transform_ebnf(cst: Node) -> None: """Transforms the concrete-syntax-tree of an EBNF-source-code into the abstract-syntax-tree. The transformation changes the syntax tree in place. No value is returned.""" get_ebnf_transformer()(cst) ######################################################################## # # EBNF abstract syntax tree to Python parser compilation # ######################################################################## PreprocessorFactoryFunc = Callable[[], PreprocessorFunc] ParserFactoryFunc = Callable[[], Grammar] TransformerFactoryFunc = Callable[[], TransformationFunc] CompilerFactoryFunc = Callable[[], Compiler] PREPROCESSOR_FACTORY = ''' RE_INCLUDE = NEVER_MATCH_PATTERN # To capture includes, replace the NEVER_MATCH_PATTERN # by a pattern with group "name" here, e.g. r'\\input{{(?P.*)}}' def {NAME}Tokenizer(original_text) -> Tuple[str, List[Error]]: # Here, a function body can be filled in that adds preprocessor tokens # to the source code and returns the modified source. return original_text, [] def preprocessor_factory() -> PreprocessorFunc: # below, the second parameter must always be the same as {NAME}Grammar.COMMENT__! 
find_next_include = gen_find_include_func(RE_INCLUDE, {COMMENT__}) include_prep = partial(preprocess_includes, find_next_include=find_next_include) tokenizing_prep = make_preprocessor({NAME}Tokenizer) return chain_preprocessors(include_prep, tokenizing_prep) get_preprocessor = ThreadLocalSingletonFactory(preprocessor_factory, ident=1) def preprocess_{NAME}(source): return get_preprocessor()(source) ''' GRAMMAR_FACTORY = ''' _raw_grammar = ThreadLocalSingletonFactory({NAME}Grammar, ident={ID}) def get_grammar() -> {NAME}Grammar: grammar = _raw_grammar() if get_config_value('resume_notices'): resume_notices_on(grammar) elif get_config_value('history_tracking'): set_tracer(grammar, trace_history) try: if not grammar.__class__.python_src__: grammar.__class__.python_src__ = get_grammar.python_src__ except AttributeError: pass return grammar def parse_{NAME}(document, start_parser = "root_parser__", *, complete_match=True): return get_grammar()(document, start_parser, complete_match) ''' TRANSFORMER_FACTORY = ''' def {NAME}Transformer() -> TransformationFunc: """Creates a transformation function that does not share state with other threads or processes.""" return partial(traverse, processing_table={NAME}_AST_transformation_table.copy()) get_transformer = ThreadLocalSingletonFactory({NAME}Transformer, ident={ID}) def transform_{NAME}(cst): get_transformer()(cst) ''' COMPILER_FACTORY = ''' get_compiler = ThreadLocalSingletonFactory({NAME}Compiler, ident={ID}) def compile_{NAME}(ast): return get_compiler()(ast) ''' WHITESPACE_TYPES = {'linefeed': r'[ \t]*(?:\n[ \t]*)?(?!\n)', # default 'horizontal': r'[\t ]*', 'vertical': r'\s*'} DROP_STRINGS = 'strings' DROP_BACKTICKED = 'backticked' DROP_WSPC = 'whitespace' DROP_REGEXP = 'regexps' DROP_VALUES = {DROP_STRINGS, DROP_BACKTICKED, DROP_WSPC, DROP_REGEXP} # Representation of Python code or, rather, something that will be output as Python code ReprType = Union[str, unrepr] VALID_DIRECTIVES = { 'comment': r'Regular expression for comments, e.g. /#.*(?:\n|$)/', 'whitespace': r'Regular expression for whitespace or one of: horizontal, linefeed, vertical', 'literalws': 'Controls implicit whitespace adjacent to literals: left, right, both, none', 'ignorecase': 'Controls case-sensitivity: True, False', '[preprocessor_]tokens': 'List of the names of all preprocessor tokens', 'disposable': 'List of symbols that shall not be turned into tag-names', 'drop': 'List of tags to be dropped with all their content from the tree, ' 'special values: strings, backticked, whitespace, regexps', '[tree_]reduction': 'Reduction level for simplifying the tree while parsing.' 'Possible levels: none, flatten, merge_treetops, merge', '$SYMBOL_filer': 'Function that transforms captured values of the given symbol on retrieval', '$SYMBOL_error': 'Pair of regular expression an custom error message if regex matches', '$SYMBOL_skip': 'List of regexes or functions to find reentry point after an error', '$SYMBOL_resume': 'List or regexes or functions to find reentry point for parent parser' } class EBNFDirectives: """ A Record that keeps information about compiler directives during the compilation process. :ivar whitespace: the regular expression string for (insignificant) whitespace :ivar comment: the regular expression string for comments :ivar literalws: automatic whitespace eating next to literals. 
Can be either 'left', 'right', 'none', 'both' :ivar tokens: set of the names of preprocessor tokens :ivar filter: mapping of symbols to python match functions that will be called on any retrieve / pop - operations on these symbols :ivar error: mapping of symbols to tuples of match conditions and customized error messages. A match condition can be either a string or a regular expression. The first error message where the search condition matches will be displayed. An empty string '' as search condition always matches, so in case of multiple error messages, this condition should be placed at the end. :ivar skip: mapping of symbols to a list of search expressions. A search expressions can be either a string ot a regular expression. The closest match is the point of reentry for the series- or interleave-parser when a mandatory item failed to match the following text. :ivar resume: mapping of symbols to a list of search expressions. A search expressions can be either a string ot a regular expression. The closest match is the point of reentry for after a parsing error has error occurred. Other than the skip field, this configures resuming after the failing parser (`parser.Series` or `parser.Interleave`) has returned. :ivar disposable: A regular expression to identify "disposable" symbols, i.e. symbols that will not appear as tag-names. Instead the nodes produced by the parsers associated with these symbols will yield anonymous nodes just like "inner" parsers that are not directly assigned to a symbol. :ivar drop: A set that may contain the elements ``DROP_STRINGS`` and ``DROP_WSP``, ``DROP_REGEXP`` or any name of a symbol of a disposable parser (e.g. '_linefeed') the results of which will be dropped during the parsing process, already. :ivar reduction: The reduction level (0-3) for early tree-reduction during the parsing stage. :ivar super_ws: Cache for the "super whitespace" which is a regular expression that merges whitespace and comments. This property should only be accessed after the `whitespace` and `comment` field have been filled with the values parsed from the EBNF source. """ __slots__ = ['whitespace', 'comment', 'literalws', 'tokens', 'filter', 'error', 'skip', 'resume', 'disposable', 'drop', 'reduction', '_super_ws'] REPEATABLE_DIRECTIVES = {'tokens', 'preprocessor_tokens', 'disposable', 'drop'} def __init__(self): self.whitespace = WHITESPACE_TYPES['linefeed'] # type: str self.comment = '' # type: str self.literalws = {get_config_value('default_literalws')} # type: Set[str] self.tokens = set() # type: Set[str] self.filter = dict() # type: Dict[str, str] self.error = dict() # type: Dict[str, List[Tuple[ReprType, ReprType]]] self.skip = dict() # type: Dict[str, List[Union[unrepr, str]]] self.resume = dict() # type: Dict[str, List[Union[unrepr, str]]] self.disposable = get_config_value('default_disposable_regexp') # type: str self.drop = set() # type: Set[str] self.reduction = 1 # type: int self._super_ws = None # type: Optional[str] def __getitem__(self, key): return getattr(self, key) def __setitem__(self, key, value): assert hasattr(self, key) setattr(self, key, value) @property def super_ws(self): if self._super_ws is None: self._super_ws = mixin_comment(self.whitespace, self.comment) return self._super_ws def keys(self): return self.__slots__ class EBNFCompilerError(CompilerError): """Error raised by `EBNFCompiler` class. 
(Not compilation errors in the strict sense, see `CompilationError` in module ``dsl.py``)""" pass class EBNFCompiler(Compiler): """ Generates a Parser from an abstract syntax tree of a grammar specified in EBNF-Notation. Usually, this class will not be instantiated or instances of this class be called directly. Rather high-level functions like :py:func:`~dsl.create_parser` or :py:func:`~dsl.compile_EBNF` will be used to generate callable :py:class:`~parse.Grammar`-objects or Python-source-code from an EBNF-grammar. Instances of this class must be called with the root-node of the abstract syntax tree from an EBNF-specification of a formal language. The returned value is the Python-source-code of a Grammar class for this language that can be used to parse texts in this language. See classes `parser.Compiler` and `parser.Grammar` for more information. Additionally, class EBNFCompiler provides helper methods to generate code-skeletons for a preprocessor, AST-transformation and full compilation of the formal language. These method's names start with the prefix `gen_`. :ivar current_symbols: During compilation, a list containing the root node of the currently compiled definition as first element and then the nodes of the symbols that are referred to in the currently compiled definition. :ivar cache_literal_symbols: A cache for all symbols that are defined by literals, e.g. ``head = ""``. This is used by the on_expression()-method. :ivar rules: Dictionary that maps rule names to a list of Nodes that contain symbol-references in the definition of the rule. The first item in the list is the node of the rule- definition itself. Example:: alternative = a | b Now ``[node.content for node in self.rules['alternative']]`` yields ``['alternative = a | b', 'a', 'b']`` :ivar referred_symbols_cache: A dictionary that caches the results of method ``referred_symbols()``. ``referred_symbols()`` maps a to the set of symbols that are directly or indirectly referred to in the definition of the symbol. :ivar directly_referred_cache: A dictionary that caches the the results of method `directly_referred_symbols()`, which yields the set of symbols that are referred to in the definition of a particular symbol. :ivar symbols: A mapping of symbol names to their first usage (not their definition!) in the EBNF source. :ivar variables: A set of symbols names that are used with the Pop or Retrieve operator. Because the values of these symbols need to be captured they are called variables. See `test_parser.TestPopRetrieve` for an example. :ivar forward: A set of symbols that require a forward operator. :ivar definitions: A dictionary of definitions. Other than `rules` this maps the symbols to their compiled definienda. :ivar required_keywords: A list of keywords (like ``comment__`` or ``whitespace__``) that need to be defined at the beginning of the grammar class because they are referred to later. :ivar deferred_tasks: A list of callables that is filled during compilation, but that will be executed only after compilation has finished. Typically, it contains semantic checks that require information that is only available upon completion of compilation. :ivar root_symbol: The name of the root symbol. :ivar drop_flag: This flag is set temporarily when compiling the definition of a parser that shall drop its content. If this flag is set all contained parser will also drop their content as an optimization. :ivar directives: A record of all directives and their default values. 
:ivar defined_directives: A dictionary of all directives that have already been defined, mapped onto the list of nodes where they have been (re-)defined. With the exception of those directives contained in EBNFDirectives.REPEATABLE_DIRECTIVES, directives must only be defined once. :ivar consumed_custom_errors: A set of symbols for which a custom error has been defined and(!) consumed during compilation. This allows to add a compiler error in those cases where (i) an error message has been defined but will never used or (ii) an error message is accidently used twice. For examples, see `test_ebnf.TestErrorCustomization`. :ivar consumed_skip_rules: The same as `consumed_custom_errors` only for in-series-resume-rules (aka 'skip-rules') for Series-parsers. :ivar re_flags: A set of regular expression flags to be added to all regular expressions found in the current parsing process :ivar disposable_regexp: A regular expression to identify symbols that stand for parsers that shall yield anonymous nodes. The pattern of the regular expression is configured in configuration.py but can also be set by a directive. The default value is a regular expression that catches names with a leading underscore. See also `parser.Grammar.disposable__` :ivar python_src: A string that contains the python source code that was the outcome of the last EBNF-compilation. :ivar grammar_name: The name of the grammar to be compiled :ivar grammar_source: The source code of the grammar to be compiled. :ivar grammar_id: a unique id for every compiled grammar. (Required for disambiguation of of thread local variables storing compiled texts.) """ COMMENT_KEYWORD = "COMMENT__" COMMENT_PARSER_KEYWORD = "comment__" DROP_COMMENT_PARSER_KEYWORD = "dcomment__" COMMENT_RX_KEYWORD = "comment_rx__" WHITESPACE_KEYWORD = "WSP_RE__" RAW_WS_KEYWORD = "WHITESPACE__" RAW_WS_PARSER_KEYWORD = "whitespace__" DROP_RAW_WS_PARSER_KEYWORD = "dwhitespace__" WHITESPACE_PARSER_KEYWORD = "wsp__" DROP_WHITESPACE_PARSER_KEYWORD = "dwsp__" RESUME_RULES_KEYWORD = "resume_rules__" SKIP_RULES_KEYWORD = 'skip_rules__' ERR_MSGS_KEYWORD = 'error_messages__' COMMENT_OR_WHITESPACE = {COMMENT_PARSER_KEYWORD, DROP_COMMENT_PARSER_KEYWORD, RAW_WS_PARSER_KEYWORD, DROP_RAW_WS_PARSER_KEYWORD, WHITESPACE_PARSER_KEYWORD, DROP_WHITESPACE_PARSER_KEYWORD} RESERVED_SYMBOLS = {COMMENT_KEYWORD, COMMENT_RX_KEYWORD, COMMENT_PARSER_KEYWORD, WHITESPACE_KEYWORD, RAW_WS_KEYWORD, RAW_WS_PARSER_KEYWORD, WHITESPACE_PARSER_KEYWORD, DROP_WHITESPACE_PARSER_KEYWORD, RESUME_RULES_KEYWORD} KEYWORD_SUBSTITUTION = {COMMENT_KEYWORD: COMMENT_PARSER_KEYWORD, COMMENT_PARSER_KEYWORD: COMMENT_PARSER_KEYWORD, RAW_WS_KEYWORD: RAW_WS_PARSER_KEYWORD, RAW_WS_PARSER_KEYWORD: RAW_WS_PARSER_KEYWORD, WHITESPACE_KEYWORD: WHITESPACE_PARSER_KEYWORD, WHITESPACE_PARSER_KEYWORD: WHITESPACE_PARSER_KEYWORD, DROP_WHITESPACE_PARSER_KEYWORD: DROP_WHITESPACE_PARSER_KEYWORD} AST_ERROR = "Badly structured syntax tree. " \ "Potentially due to erroneous AST transformation." 
PREFIX_TABLE = {'§': 'Required', '&': 'Lookahead', '!': 'NegativeLookahead', '<-&': 'Lookbehind', '<-!': 'NegativeLookbehind', '::': 'Pop', ':?': 'Pop', ':': 'Retrieve'} def __init__(self, grammar_name="DSL", grammar_source=""): self.grammar_id = 0 # type: int super(EBNFCompiler, self).__init__() # calls the reset()-method self.set_grammar_name(grammar_name, grammar_source) def reset(self): super(EBNFCompiler, self).reset() self.python_src = '' # type: str self.re_flags = set() # type: Set[str] self.rules = OrderedDict() # type: OrderedDict[str, List[Node]] self.referred_symbols_cache = dict() # type: Dict[str, FrozenSet[str]] self.directly_referred_cache = dict() # type: Dict[str, FrozenSet[str]] self.current_symbols = [] # type: List[Node] self.cache_literal_symbols = None # type: Optional[Dict[str, str]] self.symbols = {} # type: Dict[str, Node] self.variables = set() # type: Set[str] self.forward = set() # type: Set[str] self.definitions = {} # type: Dict[str, str] self.required_keywords = set() # type: Set[str] self.deferred_tasks = [] # type: List[Callable] self.root_symbol = "" # type: str self.drop_flag = False # type: bool self.directives = EBNFDirectives() # type: EBNFDirectives self.defined_directives = dict() # type: Dict[str, List[Node]] self.consumed_custom_errors = set() # type: Set[str] self.consumed_skip_rules = set() # type: Set[str] self.grammar_id += 1 @property def result(self) -> str: return self.python_src def set_grammar_name(self, grammar_name: str = "", grammar_source: str = ""): """ Changes the grammar name and source. The grammar name and the source text are metadata that do not affect the compilation process. It is used to name and annotate the output. Returns `self`. """ assert grammar_name == "" or re.match(r'\w+\Z', grammar_name) if not grammar_name and re.fullmatch(r'[\w/:\\]+', grammar_source): grammar_name = os.path.splitext(os.path.basename(grammar_source))[0] self.grammar_name = grammar_name or "NameUnknown" self.grammar_source = load_if_file(grammar_source) return self # methods for generating skeleton code for preprocessor, transformer, and compiler def gen_preprocessor_skeleton(self) -> str: """ Returns Python-skeleton-code for a preprocessor-function for the previously compiled formal language. """ comment_re = self.directives.comment rs = repr(comment_re) if comment_re else 'NEVER_MATCH_PATTERN' return PREPROCESSOR_FACTORY.format(NAME=self.grammar_name, COMMENT__=rs) # name = self.grammar_name + "Preprocessor" # return "def nop(pos, source_name, source_text):\n"\ # " return SourceLocation(source_name, source_text, pos)\n\n\n" \ # "def %s(source_text, source_name):\n"\ # " return PreprocessorResult(\n"\ # " source_text, source_text,\n"\ # " partial(nop, source_name=source_name, source_text=source_text))\n" % name \ # + PREPROCESSOR_FACTORY.format(NAME=self.grammar_name) def gen_transformer_skeleton(self) -> str: """ Returns Python-skeleton-code for the AST-transformation for the previously compiled formal language. 
""" if not self.rules and not self._dirty_flag: raise EBNFCompilerError('Compiler must be run before calling ' '"gen_transformer_Skeleton()"!') tt_name = self.grammar_name + '_AST_transformation_table' transtable = [tt_name + ' = {', ' # AST Transformations for the ' + self.grammar_name + '-grammar', ' # "<": flatten', ' # "*": replace_by_single_child', ' # ">: []'] for name in self.rules: transformations = '[]' transtable.append(' "' + name + '": %s,' % transformations) transtable += ['}', ''] transtable += [TRANSFORMER_FACTORY.format(NAME=self.grammar_name, ID=self.grammar_id)] return '\n'.join(transtable) def gen_compiler_skeleton(self) -> str: """ Returns Python-skeleton-code for a Compiler-class for the previously compiled formal language. """ if not self.rules and not self._dirty_flag: raise EBNFCompilerError('Compiler has not been run before calling ' '"gen_Compiler_Skeleton()"!') compiler = ['class ' + self.grammar_name + 'Compiler(Compiler):', ' """Compiler for the abstract-syntax-tree of a ' + self.grammar_name + ' source file.', ' """', '', ' def __init__(self):', ' super(' + self.grammar_name + 'Compiler, self).__init__()', '', ' def reset(self):', ' super().reset()', ' # initialize your variables here, not in the constructor!', ''] for name in self.rules: method_name = visitor_name(name) if name == self.root_symbol: compiler += [' def ' + method_name + '(self, node):', ' return self.fallback_compiler(node)', ''] else: compiler += [' # def ' + method_name + '(self, node):', ' # return node', ''] compiler += [COMPILER_FACTORY.format(NAME=self.grammar_name, ID=self.grammar_id)] return '\n'.join(compiler) def verify_transformation_table(self, transtable): """ Checks for symbols that occur in the transformation-table but have never been defined in the grammar. Usually, this kind of inconsistency results from an error like a typo in the transformation table. """ assert self._dirty_flag table_entries = set(expand_table(transtable).keys()) - {'*', '<', '>', '~'} symbols = self.rules.keys() messages = [] for entry in table_entries: if entry not in symbols and not entry[:1] == ":": messages.append(Error(('Symbol "%s" is not defined in grammar %s but appears in ' 'the transformation table!') % (entry, self.grammar_name), 0, UNDEFINED_SYMBOL_IN_TRANSTABLE_WARNING)) return messages def verify_compiler(self, compiler): """ Checks for on_XXXX()-methods that occur in the compiler, although XXXX has never been defined in the grammar. Usually, this kind of inconsistency results from an error like a typo in the compiler-code. """ pass # TODO: add verification code here def check_rx(self, node: Node, rx: str) -> str: """ Checks whether the string `rx` represents a valid regular expression. Makes sure that multi-line regular expressions are prepended by the multi-line-flag. Returns the regular expression string. 
""" # TODO: Support atomic grouping: https://stackoverflow.com/questions/13577372/ # do-python-regular-expressions-have-an-equivalent-to-rubys-atomic-grouping flags = self.re_flags | {'x'} if rx.find('\n') >= 0 else self.re_flags if flags: rx = "(?%s)%s" % ("".join(flags), rx) try: re.compile(rx) except Exception as re_error: self.tree.new_error(node, "malformed regular expression %s: %s" % (repr(rx), str(re_error)), MALFORMED_REGULAR_EXPRESSION) return rx def extract_regex(self, node: Node) -> str: """Extracts regular expression string from regexp-Node.""" value = node.content if node.tag_name in ('literal', 'plaintext'): assert value assert value[0] + value[-1] in ('""', "''", "``") value = escape_re(value[1:-1]) else: assert node.tag_name == "regexp""" value = self.check_rx(node, value) return value def gen_search_rule(self, nd: Node) -> ReprType: """Generates a search rule, which can be either a string for simple string search or a regular expression from the node's content. Returns an empty string in case the node is neither regexp nor literal. """ if nd.tag_name == 'regexp': super_ws = self.directives.super_ws nonempty_ws = mixin_nonempty(super_ws) search_regex = self.extract_regex(nd)\ .replace(r'\~!', nonempty_ws).replace(r'\~', super_ws) return unrepr("re.compile(r'%s')" % search_regex) elif nd.tag_name == 'literal': s = nd.content[1:-1] # remove quotation marks return unrepr("re.compile(r'(?=%s)')" % escape_re(s)) elif nd.tag_name == 'procedure': return unrepr(nd.content) elif nd.tag_name == 'symbol': return unrepr(nd.content.strip()) else: return '' # self.tree.new_error(nd, 'Only regular expressions, string literals and external ' # 'procedures are allowed as search rules, but not: ' + nd.tag_name) # return unrepr('') def directly_referred(self, symbol: str) -> FrozenSet[str]: """Returns the set of symbols that are referred to in the definition of `symbol`.""" try: return self.directly_referred_cache[symbol] except KeyError: pass try: referred_nodes = self.rules[symbol] except KeyError: referred_nodes = [] # Missing Symbol definition error will be caught later result = frozenset({nd.content for nd in referred_nodes[1:]}) self.directly_referred_cache[symbol] = result return result def referred_symbols(self, symbol: str) -> FrozenSet[str]: """Returns the set of all symbols that are directly or indirectly referred to in the definition of `symbol`. The symbol itself can be contained in this set, if and only if its rule is recursive. `referred_symbols()` only yields reliable results if the collection of definitions has been completed. """ try: return self.referred_symbols_cache[symbol] except KeyError: pass collected = set() # type: Set[str] def gather(sym: str): for s in self.directly_referred(sym): if s not in collected and s not in EBNFCompiler.RESERVED_SYMBOLS: collected.add(s) if s != symbol: gather(s) gather(symbol) result = frozenset(collected) self.referred_symbols_cache[symbol] = result return result def recursive_paths(self, symbol: str) -> FrozenSet[Tuple[str]]: """Returns the recursive paths from symbol to itself. If sym is not recursive, the returned tuple (of paths) will be empty. 
This method exists only for debugging (so far...).""" path = [] # type: List[str] recursive_paths = set() # type: Set[Tuple[str]] def gather(sym: str): nonlocal path, recursive_paths path.append(sym) for s in self.directly_referred(sym): if s not in EBNFCompiler.RESERVED_SYMBOLS: if s == symbol: recursive_paths.add(tuple(path)) elif s not in path: gather(s) path.pop() gather(symbol) return frozenset(recursive_paths) @cython.locals(N=cython.int, top=cython.int, pointer=cython.int, i=cython.int, k = cython.int, j=cython.int) def optimize_definitions_order(self, definitions: List[Tuple[str, str]]): """Reorders the definitions so as to minimize the number of Forward declarations. Forward declarations remain inevitable only where recursion is involved. """ if not get_config_value('reorder_definitions'): return N = len(definitions) root = definitions[0][0] if N > 0 else '' recursive = {sym for sym in self.forward if sym in self.referred_symbols(sym) or sym == root} # move recursive symbols to the top of the list top = 1 # however, leave the root symbol at the top while top < N and definitions[top][0] in recursive: top += 1 for i in range(top, N): if definitions[i][0] in recursive: definitions[top], definitions[i] = definitions[i], definitions[top] top += 1 # order the other definitions pointer = top while pointer < N: topsym = definitions[pointer][0] for i in range(pointer + 1, N): if topsym in self.directly_referred(definitions[i][0]): definitions[pointer], definitions[i] = definitions[i], definitions[pointer] break else: pointer += 1 # try to place as many recursive symbols as possible among the # non-forwardly-defined-symbols i = top - len(recursive) while i < top: sym = definitions[i][0] referred = self.directly_referred(sym) if root not in referred: k = i while k < N and definitions[k][0] not in referred: k += 1 j = k while j < N and sym not in self.directly_referred(definitions[j][0]): j += 1 if j == N: definitions.insert(k, definitions[i]) del definitions[i] top -= 1 recursive.remove(sym) i += 1 self.forward = recursive def assemble_parser(self, definitions: List[Tuple[str, str]]) -> str: """ Creates the Python code for the parser after compilation of the EBNF-Grammar """ def pp_rules(rule_name: str, ruleset: Dict[str, List]) -> Tuple[str, str]: """Pretty-print skip- and resume-rule and error-messages dictionaries to avoid excessively long lines in the generated python source.""" assert ruleset indent = ",\n" + " " * (len(rule_name) + 8) rule_repr = [] for k, v in ruleset.items(): delimiter = indent + ' ' * (len(k) + 5) val = '[' + delimiter.join(str(it) for it in v) + ']' rule_repr.append("'{key}': {value}".format(key=k, value=val)) rule_repr[0] = '{' + rule_repr[0] rule_repr[-1] = rule_repr[-1] + '}' return rule_name, indent.join(rule_repr) def verify_directive_against_symbol(directive: str, related_symbol: str): """Adds an error if the symbol that the directive relates to does not exist.""" if related_symbol not in self.rules: nd = self.defined_directives[directive][0] self.tree.new_error(nd, 'Directive "%s" relates to a symbol that is ' 'nowhere defined!' % directive, DIRECTIVE_FOR_NONEXISTANT_SYMBOL) # execute deferred tasks, for example semantic checks that cannot # be done before the symbol table is complete for task in self.deferred_tasks: task() # minimize the necessary number of forward declarations self.optimize_definitions_order(definitions) self.root_symbol = definitions[0][0] if definitions else "" # provide for capturing of symbols that are variables, i.e. 
the # value of which will be retrieved at some point during the parsing process # The following is needed for cases, where the first retrival-use of a symbol # follows its definition if self.variables: for i in range(len(definitions)): if definitions[i][0] in self.variables: definitions[i] = (definitions[i][0], 'Capture(%s)' % definitions[i][1]) self.definitions[definitions[i][0]] = definitions[i][1] # add special fields for Grammar class if DROP_WSPC in self.directives.drop or DROP_STRINGS in self.directives.drop: definitions.append((EBNFCompiler.DROP_WHITESPACE_PARSER_KEYWORD, 'Drop(Whitespace(%s))' % EBNFCompiler.WHITESPACE_KEYWORD)) definitions.append((EBNFCompiler.WHITESPACE_PARSER_KEYWORD, 'Whitespace(%s)' % EBNFCompiler.WHITESPACE_KEYWORD)) definitions.append((EBNFCompiler.WHITESPACE_KEYWORD, ("mixin_comment(whitespace=" + EBNFCompiler.RAW_WS_KEYWORD + ", comment=" + EBNFCompiler.COMMENT_KEYWORD + ")"))) if EBNFCompiler.RAW_WS_PARSER_KEYWORD in self.required_keywords: definitions.append((EBNFCompiler.RAW_WS_PARSER_KEYWORD, "Whitespace(%s)" % EBNFCompiler.RAW_WS_KEYWORD)) definitions.append((EBNFCompiler.RAW_WS_KEYWORD, "r'{}'".format(self.directives.whitespace))) comment_rx = ("re.compile(%s)" % EBNFCompiler.COMMENT_KEYWORD) \ if self.directives.comment else "RX_NEVER_MATCH" if EBNFCompiler.COMMENT_PARSER_KEYWORD in self.required_keywords: definitions.append((EBNFCompiler.COMMENT_PARSER_KEYWORD, "RegExp(%s)" % EBNFCompiler.COMMENT_RX_KEYWORD)) definitions.append((EBNFCompiler.COMMENT_RX_KEYWORD, comment_rx)) definitions.append((EBNFCompiler.COMMENT_KEYWORD, "r'{}'".format(self.directives.comment))) # prepare and add resume-rules resume_rules = dict() # type: Dict[str, List[ReprType]] for symbol, raw_rules in self.directives.resume.items(): refined_rules = [] # type: List[ReprType] verify_directive_against_symbol(symbol + '_resume', symbol) for rule in raw_rules: if isinstance(rule, unrepr) and rule.s.isidentifier(): try: nd = self.rules[rule.s][0].children[1] refined = self.gen_search_rule(nd) if not refined: refined = unrepr(rule.s) except IndexError: nd = self.tree # TODO: Allow arbitrary parsers, here refined = '' # refined = rule except KeyError: # rule represents a procedure name nd = self.tree refined = rule if refined: refined_rules.append(refined) else: self.tree.new_error( nd, 'Symbol "%s" cannot be used in resume rule, since it represents ' 'neither literal nor regexp nor procedure!', INAPPROPRIATE_SYMBOL_FOR_DIRECTIVE) else: refined_rules.append(rule) resume_rules[symbol] = refined_rules if resume_rules: definitions.insert(0, pp_rules(self.RESUME_RULES_KEYWORD, resume_rules)) # prepare and add skip-rules skip_rules = dict() # # type: Dict[str, List[ReprType]] for symbol, skip in self.directives.skip.items(): rules = [] # type: List[ReprType] verify_directive_against_symbol(symbol + '_skip', symbol) for search in skip: if isinstance(search, unrepr) and search.s.isidentifier(): try: nd = self.rules[search.s][0].children[1] search = self.gen_search_rule(nd) except IndexError: search = '' except KeyError: # rule represents a procedure name pass rules.append(search) skip_rules[symbol] = rules if skip_rules: definitions.insert(0, pp_rules(self.SKIP_RULES_KEYWORD, skip_rules)) for symbol in self.directives.skip.keys(): if symbol not in self.consumed_skip_rules: # check for indirectly reachable unconsumed mandatory markers for s in self.referred_symbols(symbol): if s not in self.directives.skip and s not in self.directives.resume: if self.definitions[s].find('mandatory=') >= 0: 
self.consumed_skip_rules.add(symbol) break else: try: def_node = self.rules[symbol][0] self.tree.new_error( def_node, '"Skip-rules" for symbol "{}" will never be used, because ' 'the mandatory marker "§" does not appear in its definiendum or has ' 'already been consumed earlier.' .format(symbol), UNUSED_ERROR_HANDLING_WARNING) except KeyError: pass # error has already been notified earlier! # prepare and add customized error-messages error_messages = dict() # type: Dict[str, List[Tuple[ReprType, ReprType]]] for symbol, err_msgs in self.directives.error.items(): custom_errors = [] # type: List[Tuple[ReprType, ReprType]] verify_directive_against_symbol(symbol + '_error', symbol) for search, message in err_msgs: if isinstance(search, unrepr) and search.s.isidentifier(): try: nd = self.rules[search.s][0].children[1] search = self.gen_search_rule(nd) except IndexError: search = '' except KeyError: pass custom_errors.append((search, message)) error_messages[symbol] = custom_errors if error_messages: definitions.append(pp_rules(self.ERR_MSGS_KEYWORD, error_messages)) for symbol in self.directives.error.keys(): if symbol in self.rules and symbol not in self.consumed_custom_errors: # try: def_node = self.rules[symbol][0] self.tree.new_error( def_node, 'Customized error message for symbol "{}" will never be used, ' 'because the mandatory marker "§" appears nowhere in its definiendum!' .format(symbol), UNUSED_ERROR_HANDLING_WARNING) # prepare parser class header and docstring and # add EBNF grammar to the doc string of the parser class article = 'an ' if self.grammar_name[0:1] in "AaEeIiOoUu" else 'a ' # what about 'hour', 'universe' etc.? show_source = get_config_value('add_grammar_source_to_parser_docstring') declarations = ['class ' + self.grammar_name + 'Grammar(Grammar):', 'r"""Parser for ' + article + self.grammar_name + ' source file' + ('. 
Grammar:' if self.grammar_source and show_source else '.')] definitions.append(('parser_initialization__', '["upon instantiation"]')) definitions.append(('static_analysis_pending__', '[True]')) definitions.append(('disposable__', 're.compile(' + repr(self.directives.disposable) + ')')) if self.grammar_source: definitions.append(('source_hash__', '"%s"' % md5(self.grammar_source, __version__))) declarations.append('') if show_source: declarations += [line for line in self.grammar_source.split('\n')] while declarations[-1].strip() == '': declarations = declarations[:-1] declarations.append('"""') # turn definitions into declarations in reverse order definitions.reverse() declarations += [symbol + ' = Forward()' for symbol in sorted(list(self.forward))] for symbol, statement in definitions: if symbol in self.forward: declarations += [symbol + '.set(' + statement + ')'] else: declarations += [symbol + ' = ' + statement] # check for symbols used but never defined defined_symbols = set(self.rules.keys()) | self.RESERVED_SYMBOLS for symbol in self.symbols: if symbol not in defined_symbols: self.tree.new_error(self.symbols[symbol], "Missing definition for symbol '%s'" % symbol) # check for unconnected rules defined_symbols.difference_update(self.RESERVED_SYMBOLS) def remove_connections(symbol): """Recursively removes all symbols which appear in the definiens of a particular symbol.""" if symbol in defined_symbols: defined_symbols.remove(symbol) for related in self.rules[symbol][1:]: remove_connections(str(related)) remove_connections(self.root_symbol) for leftover in defined_symbols: self.tree.new_error(self.rules[leftover][0], 'Rule "%s" is not connected to parser root "%s" !' % (leftover, self.root_symbol), UNCONNECTED_SYMBOL_WARNING) # check for filters assigned to non-existing or uncaptured symbols def directive_node(tree, directive) -> Node: """Returns the node, where the given directive was stated in the EBNF-source.""" for dr in tree.select('directive'): if dr.pick('symbol').content == directive: return dr return tree for symbol in self.directives.filter: if symbol not in self.symbols: self.tree.new_error(directive_node(self.tree, symbol + '_filter'), 'Filter declared for non-existent symbol "%s"' % symbol, WARNING) else: if not self.definitions[symbol][:8] == 'Capture(': self.tree.new_error(directive_node(self.tree, symbol + '_filter'), 'Filter declared for uncaptured symbol "%s"' % symbol, WARNING) # assemble python grammar definition if self.root_symbol: if self.directives.reduction != CombinedParser.DEFAULT_OPTIMIZATION: opt = ', CombinedParser.' 
+ ('NO_TREE_REDUCTION', 'FLATTEN', 'MERGE_TREETOPS', 'MERGE_LEAVES')[self.directives.reduction] + ')' declarations.append('root__ = TreeReduction(' + self.root_symbol + opt) else: declarations.append('root__ = ' + self.root_symbol) else: declarations.append(f'root__ = RegExp(r"{NEVER_MATCH_PATTERN}")') declarations.append('') self.python_src = '\n '.join(declarations) \ + GRAMMAR_FACTORY.format(NAME=self.grammar_name, ID=self.grammar_id) return self.python_src # compilation methods ### def on_ZOMBIE__(self, node: Node) -> str: result = ['Illegal node in AST generated from EBNF-Source!'] if node.children: result.append(' Fragments found: ') result.extend([str(self.compile(child)) for child in node.children]) return '\n'.join(result) def on_syntax(self, node: Node) -> str: definitions = [] # type: List[Tuple[str, str]] # drop the wrapping sequence node if len(node.children) == 1 and node.children[0].anonymous: node = node.children[0] # compile definitions and directives and collect definitions for nd in node.children: if nd.tag_name == "definition": definitions.append(self.compile(nd)) else: assert nd.tag_name == "directive", nd.as_sxpr() self.compile(nd) self.definitions.update(definitions) if not self.definitions: self.tree.new_error(node, "Grammar does not contain any rules!", EMPTY_GRAMMAR_ERROR) python_src = self.assemble_parser(definitions) if get_config_value('static_analysis') == 'early' and not \ any(e.code == MALFORMED_REGULAR_EXPRESSION for e in self.tree.errors): errors = [] try: grammar_class = compile_python_object( DHPARSER_IMPORTS.format(dhparser_parentdir=DHPARSER_PARENTDIR) + python_src, (self.grammar_name or "DSL") + "Grammar") errors = grammar_class().static_analysis_errors__ python_src = python_src.replace( 'static_analysis_pending__ = [True]', 'static_analysis_pending__ = [] # type: List[bool]', 1) except NameError: pass # undefined names in the grammar are already caught and reported except GrammarError as ge: errors = ge.errors for sym, _, err in errors: symdef_node = self.rules[sym][0] err.pos = self.rules[sym][0].pos self.tree.add_error(symdef_node, err) self.python_src = python_src return python_src def on_definition(self, node: Node) -> Tuple[str, str]: rule = node.children[0].content if rule.endswith('_error') or rule.endswith('_skip') \ or rule.endswith('_resume') or rule.endswith('_filter'): self.tree.new_error(node, 'Symbol name "%s" suggests directive, ' % rule + 'but directive marker @ is missing...', WARNING) if rule in self.rules: first = self.rules[rule][0] if not id(first) in self.tree.error_nodes: self.tree.new_error(first, 'First definition of rule "%s" ' 'followed by illegal redefinitions.' % rule) self.tree.new_error(node, 'A rule "%s" has already been defined earlier.' % rule) elif rule in EBNFCompiler.RESERVED_SYMBOLS: self.tree.new_error(node, 'Symbol "%s" is a reserved symbol.' % rule) elif not sane_parser_name(rule): self.tree.new_error(node, 'Illegal symbol "%s". Symbols must not start or ' ' end with a double underscore "__".' % rule) elif rule in self.directives.tokens: self.tree.new_error(node, 'Symbol "%s" has already been defined as ' 'a preprocessor token.' % rule) elif keyword.iskeyword(rule): self.tree.new_error(node, 'Python keyword "%s" may not be used as a symbol. 
' % rule + '(This may change in the future.)') try: self.current_symbols = [node] self.rules[rule] = self.current_symbols self.drop_flag = rule in self.directives['drop'] and rule not in DROP_VALUES defn = self.compile(node.children[1]) if defn.find("(") < 0: # assume it's a synonym, like 'page = REGEX_PAGE_NR' defn = 'Synonym(%s)' % defn if self.drop_flag and defn[:5] != "Drop(": defn = 'Drop(%s)' % defn # TODO: Recursively drop all contained parsers for optimization? except TypeError as error: from traceback import extract_tb, format_list trace = ''.join(format_list((extract_tb(error.__traceback__)))) errmsg = "%s (TypeError: %s;\n%s)\n%s" \ % (EBNFCompiler.AST_ERROR, str(error), trace, node.as_sxpr()) self.tree.new_error(node, errmsg) rule, defn = rule + ':error', '"' + errmsg + '"' finally: self.drop_flag = False return rule, defn @staticmethod def join_literals(nd): assert nd.tag_name == "literals" parts = [nd.children[0].content[:-1]] for child in nd.children[1:-1]: parts.append(child.content[1:-1]) parts.append(nd.children[-1].content[1:]) nd.result = "".join(parts) nd.tag_name = "literal" def add_to_disposable_regexp(self, pattern): if self.directives.disposable == NEVER_MATCH_PATTERN: self.directives.disposable = pattern else: old_pattern = self.directives.disposable self.directives.disposable = '(?:%s)|(?:%s)' % (old_pattern, pattern) def on_directive(self, node: Node) -> str: for child in node.children: if child.tag_name == "literal": child.result = escape_ctrl_chars(child.content) elif child.tag_name == "literals": self.join_literals(child) child.result = escape_ctrl_chars(child.content) key = node.children[0].content assert key not in self.directives.tokens if key not in EBNFDirectives.REPEATABLE_DIRECTIVES and not key.endswith('_error') \ and key in self.defined_directives: self.tree.new_error(node, 'Directive "%s" has already been defined earlier. ' % key + 'Later definition will be ignored!', code=REDEFINED_DIRECTIVE) return "" self.defined_directives.setdefault(key, []).append(node) def check_argnum(n: int = 1): if len(node.children) > n + 1: self.tree.new_error(node, 'Directive "%s" can have at most %i values.' % (key, n)) if key in {'comment', 'whitespace'}: check_argnum() if node.children[1].tag_name == "symbol": value = node.children[1].content if key == 'whitespace' and value in WHITESPACE_TYPES: value = WHITESPACE_TYPES[value] # replace whitespace-name by regex else: self.tree.new_error(node, 'Value "%s" not allowed for directive "%s".' % (value, key)) else: value = self.extract_regex(node.children[1]) if key == 'whitespace' and not re.match(value, ''): self.tree.new_error(node, "Implicit whitespace should always " "match the empty string, /%s/ does not." % value) self.directives[key] = value elif key == 'disposable': if node.children[1].tag_name == "regexp": check_argnum() re_pattern = node.children[1].content if re.match(re_pattern, ''): self.tree.new_error( node, "The regular expression r'%s' matches any symbol, " "which is not allowed!" 
                        % re_pattern)
                else:
                    self.add_to_disposable_regexp(re_pattern)
            else:
                args = node.children[1:]
                assert all(child.tag_name == "symbol" for child in args)
                alist = [child.content for child in args]
                for asym in alist:
                    if asym not in self.symbols:
                        self.symbols[asym] = node
                self.add_to_disposable_regexp('$|'.join(alist) + '$')

        elif key == 'drop':
            if len(node.children) <= 1:
                self.tree.new_error(node, 'Directive "@ drop" requires at least one argument!')
            unmatched = []  # type: List[str]  # dropped syms that are not already anonymous syms
            for child in node.children[1:]:
                content = child.content
                if re.match(self.directives.disposable, content):
                    self.directives[key].add(content)
                elif content.lower() in DROP_VALUES:
                    self.directives[key].add(content.lower())
                else:
                    unmatched.append(content)
                    if self.directives.disposable == NEVER_MATCH_PATTERN:
                        self.tree.new_error(node, 'Illegal value "%s" for Directive "@ drop"! '
                                            'Should be one of %s or a disposable parser, where '
                                            'the "@disposable"-directive must precede the '
                                            '@drop-directive.' % (content, str(DROP_VALUES)))
                    else:
                        self.tree.new_error(
                            node, 'Illegal value "%s" for Directive "@ drop"! Should be one of '
                            '%s or a string matching r"%s".'
                            % (content, str(DROP_VALUES), self.directives.disposable))
            if unmatched:
                self.directives[key].add(content)
                self.add_to_disposable_regexp('$|'.join(unmatched) + '$')

        elif key in ('reduction', 'tree_reduction'):
            check_argnum(1)
            arg = node.children[1].content
            if arg not in ('none', 'flatten', 'merge_treetops', 'merge',
                           'merge_leaves', '0', '1', '2', '3'):
                self.tree.new_error(node, 'Wrong value "%s" for Directive "%s". ' % (arg, key)
                                    + 'Choose from: 0-3 or none, flatten, merge_treetops, merge.')
            elif arg.isdigit():
                self.directives.reduction = int(arg)
            else:
                level = {'none': 0, 'flatten': 1, 'merge_treetops': 2,
                         'merge': 3, 'merge_leaves': 3}[arg]
                self.directives.reduction = level

        elif key == 'ignorecase':
            check_argnum(1)
            if node.children[1].content.lower() not in {"off", "false", "no"}:
                self.re_flags.add('i')

        elif key == 'literalws':
            values = {child.content.strip().lower() for child in node.children[1:]}
            if ((values - {'left', 'right', 'both', 'none'})
                    or ('none' in values and len(values) > 1)):
                self.tree.new_error(node, 'Directive "literalws" allows only `left`, `right`, '
                                    '`both` or `none`, not `%s`' % ", ".join(values))
            wsp = {'left', 'right'} if 'both' in values \
                else set() if 'none' in values else values
            self.directives.literalws = wsp

        elif key in ('tokens', 'preprocessor_tokens'):
            tokens = {child.content.strip() for child in node.children[1:]}
            redeclared = self.directives.tokens & tokens
            if redeclared:
                self.tree.new_error(node, 'Tokens %s have already been declared earlier. '
                                    % str(redeclared) + 'Later declaration will be ignored',
                                    code=REDECLARED_TOKEN_WARNING)
            self.directives.tokens |= tokens - redeclared

        elif key.endswith('_filter'):
            check_argnum()
            symbol = key[:-7]
            if node.children[1].tag_name != "procedure":
                self.tree.new_error(
                    node, 'Filter must be a procedure, denoted as "foo()", not a %s: %s'
                    % (node.children[1].tag_name, node.children[1].content))
            self.directives.filter[symbol] = node.children[1].content.strip()

        elif key.endswith('_error'):
            check_argnum(2)
            symbol = key[:-6]
            error_msgs = self.directives.error.get(symbol, [])
            # if symbol in self.rules:
            #     self.tree.new_error(node, 'Custom error message for symbol "%s"' % symbol
            #                         + ' must be defined before the symbol!')
            if node.children[1 if len(node.children) == 2 else 2].tag_name != 'literal':
                self.tree.new_error(
                    node, 'Directive "%s" requires a message string or a pair ' % key
                    + '(regular expression or search string, message string) as argument!')
            if len(node.children) == 2:
                error_msgs.append(('', unrepr(node[1].content)))
            elif len(node.children) == 3:
                rule = self.gen_search_rule(node[1])
                error_msgs.append((rule if rule else unrepr(node[1].content),
                                   unrepr(node[2].content)))
            else:
                self.tree.new_error(node, 'Directive "%s" allows at most two parameters' % key)
            self.directives.error[symbol] = error_msgs

        elif key.endswith('_skip'):
            symbol = key[:-5]
            # if symbol in self.rules:
            #     self.tree.new_error(node, 'Skip list for resuming in series for symbol "{}"'
            #                         ' must be defined before the symbol!'.format(symbol))
            self.directives.skip[symbol] = [self.gen_search_rule(nd) for nd in node[1:]]

        elif key.endswith('_resume'):
            symbol = key[:-7]
            self.directives.resume[symbol] = [self.gen_search_rule(nd) for nd in node[1:]]

        else:
            if any(key.startswith(directive) for directive in ('skip', 'error', 'resume')):
                kl = key.split('_')
                proper_usage = '_'.join(kl[1:] + kl[0:1])
                self.tree.new_error(node, 'Directive "%s" must be used as a postfix, not a '
                                    'prefix, to the symbol name. Please write: "%s"'
                                    % (kl[0], proper_usage))
            else:
                self.tree.new_error(node, 'Unknown directive %s! (Known directives: %s.)'
                                    % (key, ', '.join(k for k in VALID_DIRECTIVES.keys())))
        return ""

    def non_terminal(self, node: Node, parser_class: str, custom_args: List[str] = []) -> str:
        """
        Compiles any non-terminal, where `parser_class` indicates the Parser class
        name for the particular non-terminal.
        """
        arguments = [self.compile(r) for r in node.children] + custom_args
        assert all(isinstance(arg, str) for arg in arguments), str(arguments)
        # remove drop clause for non-dropping definitions of forms like "/\w+/~"
        if (parser_class == "Series" and node.tag_name not in self.directives.drop
                and DROP_REGEXP in self.directives.drop
                and self.context[-2].tag_name == "definition"
                and all((arg[:12] == 'Drop(RegExp(' or arg[:10] == 'Drop(Text('
                         or arg in EBNFCompiler.COMMENT_OR_WHITESPACE) for arg in arguments)):
            arguments = [arg.replace('Drop(', '').replace('))', ')') for arg in arguments]
        if self.drop_flag:
            return 'Drop(' + parser_class + '(' + ', '.join(arguments) + '))'
        else:
            return parser_class + '(' + ', '.join(arguments) + ')'

    def on_expression(self, node) -> str:
        # The following algorithm reorders literal alternatives, so that earlier alternatives
        # do not pre-empt later alternatives, e.g. 'ID' | 'IDREF' will be reordered as
        # 'IDREF' | 'ID'

        def move_items(l: List, a: int, b: int):
            """Moves all items in the interval [a:b[ one position forward and
            moves the item at position b to position a."""
            if a > b:
                a, b = b, a
            last = l[b]
            for i in range(b, a, -1):
                l[i] = l[i-1]
            l[a] = last

        def literal_content(nd: Node) -> Optional[str]:
            """Returns the literal content of either a literal-Node, a plaintext-Node
            or a symbol-Node, where the symbol is defined as a literal or plaintext."""
            if nd.tag_name in ('literal', 'plaintext'):
                return nd.content[1:-1]
            elif nd.tag_name == 'symbol':
                if self.cache_literal_symbols is None:
                    self.cache_literal_symbols = dict()
                    for df in self.tree.select_children('definition'):
                        if df.children[1].tag_name in ('literal', 'plaintext'):
                            self.cache_literal_symbols[df.children[0].content] = \
                                df.children[1].content[1:-1]
                return self.cache_literal_symbols.get(nd.content, None)
            return None

        literals = []  # type: List[List]
        for i, child in enumerate(node.children):
            content = literal_content(child)
            if content is not None:
                literals.append([i, content])

        move = []  # type: List[Tuple[int, int]]
        snapshots = set()  # type: Set[Tuple[int]]
        start_over = True  # type: bool
        while start_over:
            start_over = False
            for i in range(1, len(literals)):
                later = literals[i][1]
                for k in range(i):
                    earlier = literals[k][1]
                    if later.startswith(earlier):
                        a, b = literals[k][0], literals[i][0]
                        move.append((a, b))
                        literals[i][0] = a
                        for n in range(k, i):
                            literals[n][0] += 1
                        self.tree.new_error(
                            node, 'Alternative "%s" should not precede "%s" where it occurs '
                            'at the beginning! Reorder to avoid warning.'
                            % (earlier, later), REORDERING_OF_ALTERNATIVES_REQUIRED)
                        start_over = True
                        break
                if start_over:
                    break
            if start_over:
                move_items(literals, k, i)
                snapshot = tuple(item[1] for item in literals)
                if snapshot in snapshots:
                    self.tree.new_error(
                        node, 'Reordering of alternatives "%s" and "%s" required '
                        'but not possible!' % (earlier, later), BAD_ORDER_OF_ALTERNATIVES)
                    move = []
                    start_over = False
                else:
                    snapshots.add(snapshot)
        if move:
            children = list(node.children)
            for mv in move:
                move_items(children, mv[0], mv[1])
            node.result = tuple(children)
        return self.non_terminal(node, 'Alternative')

    def _error_customization(self, node) -> Tuple[Tuple[Node, ...], List[str]]:
        """Generates the customization arguments (mandatory, error_msgs, skip) for
        `MandatoryNary`-parsers (Series, Interleave, ...)."""
        mandatory_marker = []
        filtered_children = []  # type: List[Node]
        for nd in node.children:
            if nd.tag_name == TOKEN_PTYPE and nd.content == "§":
                mandatory_marker.append(len(filtered_children))
                if len(mandatory_marker) > 1:
                    self.tree.new_error(nd, 'One mandatory marker (§) is sufficient to declare '
                                        'the rest of the elements as mandatory.', WARNING)
            else:
                filtered_children.append(nd)
        custom_args = ['mandatory=%i' % mandatory_marker[0]] if mandatory_marker else []
        # add custom error message if it has been declared for the current definition
        if custom_args:
            current_symbol = next(reversed(self.rules.keys()))
            # add customized error messages, if defined
            if current_symbol in self.directives.error:
                if current_symbol in self.consumed_custom_errors:
                    self.tree.new_error(
                        node, "Cannot apply customized error messages unambiguously, because "
                        "symbol {} contains more than one parser with a mandatory marker '§' "
                        "in its definiens.".format(current_symbol), AMBIGUOUS_ERROR_HANDLING)
                else:
                    # use class field instead of direct representation of error messages!
                    # custom_args.append('err_msgs={err_msgs_name}["{symbol}"]'
                    #                    .format(err_msgs_name=self.ERR_MSGS_KEYWORD,
                    #                            symbol=current_symbol))
                    self.consumed_custom_errors.add(current_symbol)

            # add skip-rules to resume parsing of a series, if rules have been declared
            if current_symbol in self.directives.skip:
                if current_symbol in self.consumed_skip_rules:
                    self.tree.new_error(
                        node, "Cannot apply 'skip-rules' unambiguously, because symbol "
                        "{} contains more than one parser with a mandatory marker '§' "
                        "in its definiens.".format(current_symbol), AMBIGUOUS_ERROR_HANDLING)
                else:
                    # use class field instead of direct representation of error messages!
                    # custom_args.append('skip={skip_rules_name}["{symbol}"]'
                    #                    .format(skip_rules_name=self.SKIP_RULES_KEYWORD,
                    #                            symbol=current_symbol))
                    self.consumed_skip_rules.add(current_symbol)
        return tuple(filtered_children), custom_args

    def on_sequence(self, node) -> str:
        filtered_result, custom_args = self._error_customization(node)
        mock_node = Node(node.tag_name, filtered_result)
        return self.non_terminal(mock_node, 'Series', custom_args)

    def on_interleave(self, node) -> str:
        children = []
        repetitions = []
        filtered_result, custom_args = self._error_customization(node)
        for child in filtered_result:
            if child.tag_name == "group":
                assert len(child.children) == 1
                if child.children[0].tag_name == 'repetition':
                    child.tag_name = "option"
                    child.children[0].tag_name = "oneormore"
                elif child.children[0].tag_name == 'option':
                    child = child.children[0]
                else:
                    repetitions.append((1, 1))
                    children.append(child.children[0])
                    continue
            if child.tag_name == "oneormore":
                repetitions.append((1, INFINITE))
                assert len(child.children) == 1
                children.append(child.children[0])
            elif child.tag_name == "repetition":
                repetitions.append((0, INFINITE))
                assert len(child.children) == 1
                children.append(child.children[0])
            elif child.tag_name == "option":
                repetitions.append((0, 1))
                assert len(child.children) == 1
                children.append(child.children[0])
            elif child.tag_name == "counted":
                what, r = self.extract_counted(child)
                repetitions.append(r)
                children.append(what)
            else:
                repetitions.append((1, 1))
                children.append(child)
        custom_args.append('repetitions=' + str(repetitions).replace(str(INFINITE), 'INFINITE'))
        mock_node = Node(node.tag_name, tuple(children))
        return self.non_terminal(mock_node, 'Interleave', custom_args)

    def on_lookaround(self, node: Node) -> str:
        assert node.children
        assert len(node.children) == 2
        assert node.children[0].tag_name == 'flowmarker'
        prefix = node.children[0].content
        # arg_node = node.children[1]
        node.result = node.children[1:]
        assert prefix in {'&', '!', '<-&', '<-!'}
        parser_class = self.PREFIX_TABLE[prefix]
        result = self.non_terminal(node, parser_class)
        if prefix[:2] == '<-':
            def verify(node: Optional[Node]):
                nd = node
                if len(nd.children) >= 1:
                    nd = nd.children[0]
                while nd.tag_name == "symbol":
                    symlist = self.rules.get(nd.content, [])
                    if len(symlist) == 2:
                        nd = symlist[1]
                    else:
                        if len(symlist) == 1:
                            nd = symlist[0].children[1]
                        break
                # content = nd.content
                if nd.tag_name != "regexp":  # outdated: or content[:1] != '/' or content[-1:] != '/'):
                    self.tree.new_error(node, "Lookbehind-parser can only be used with RegExp"
                                        "-parsers, not: " + nd.tag_name)
            if not result[:7] == 'RegExp(':
                self.deferred_tasks.append(lambda: verify(node))
        return result

    def on_difference(self, node: Node) -> str:
        assert len(node.children) == 2, str(node.as_sxpr())
        left = self.compile(node.children[0])
        right = self.compile(node.children[1])
        return "Series(NegativeLookahead(%s), %s)" % (right, left)

    def on_element(self, node: Node) -> str:
        assert node.children
        assert len(node.children) == 2
        assert node.children[0].tag_name == "retrieveop"
        assert node.children[1].tag_name == "symbol"
        prefix = node.children[0].content  # type: str
        arg = node.children[1].content  # type: str
        node.result = node.children[1:]
        assert prefix in {'::', ':?', ':'}
        if re.match(self.directives.disposable, arg):
            self.tree.new_error(
                node, 'Retrieve operator "%s" does not work with disposable parsers like %s'
                % (prefix, arg))
            return arg
        custom_args = []  # type: List[str]
        match_func = 'last_value'  # type: str
        if arg in self.directives.filter:
            match_func = self.directives.filter[arg]
        if prefix.endswith('?'):
            match_func = 'optional_' + match_func
        if match_func != 'last_value':
            custom_args = ['match_func=%s' % match_func]
        self.variables.add(arg)
        parser_class = self.PREFIX_TABLE[prefix]
        return self.non_terminal(node, parser_class, custom_args)

    def on_option(self, node) -> str:
        return self.non_terminal(node, 'Option')

    def on_repetition(self, node) -> str:
        return self.non_terminal(node, 'ZeroOrMore')

    def on_oneormore(self, node) -> str:
        return self.non_terminal(node, 'OneOrMore')

    def on_group(self, node) -> str:
        assert len(node.children) == 1, node.as_sxpr()
        return self.compile(node.children[0])

    def extract_range(self, node) -> Tuple[int, int]:
        """Returns the range-value of a range-node as a tuple of two integers."""
        assert node.tag_name == "range"
        assert all(child.tag_name == 'multiplier' for child in node.children)
        if len(node.children) == 2:
            r = (int(node.children[0].content), int(node.children[1].content))
            if r[0] > r[1]:
                self.tree.new_error(
                    node, "Lower bound %i of range is greater than upper bound %i!" % r)
                return r[1], r[0]
            return r
        else:
            assert len(node.children) == 1
            return (int(node.children[0].content), int(node.children[0].content))

    def extract_counted(self, node) -> Tuple[Node, Tuple[int, int]]:
        """Returns the content of a counted-node in a normalized form:
        (node, (n, m)) where node is the root of the sub-parser that is counted,
        i.e. repeated n or n up to m times.
        """
        assert node.tag_name == 'counted'
        if len(node.children) != 2:
            self.tree.new_error(node, f'Wrong number of arguments for repetition: '
                                f'{len(node.children)} (two expected)!')
            return Node(ZOMBIE_TAG, ''), (0, 0)
        rng = node.get('range', None)
        if rng:
            r = self.extract_range(rng)
            what = node.children[0]
            assert what.tag_name != 'range'
        else:  # multiplier specified instead of range
            if node.children[0].tag_name == 'multiplier':
                m = int(node.children[0].content)
                what = node.children[1]
            else:
                assert node.children[1].tag_name == 'multiplier'
                m = int(node.children[1].content)
                what = node.children[0]
            if what.tag_name == 'option':
                what = what.children[0]
                r = (0, m)
            elif what.tag_name == 'oneormore':
                what = what.children[0]
                r = (m, INFINITE)
            elif what.tag_name == 'repetition':
                self.tree.new_error(
                    node, 'Counting zero or more repetitions of something does not make sense! '
                    "It's still zero or more repetitions all the same.")
                what = what.children[0]
                r = (0, INFINITE)
            else:
                r = (m, m)
        return what, r

    def on_counted(self, node) -> str:
        what, r = self.extract_counted(node)
        # whatstr = self.compile(what)
        rstr = str(r).replace(str(INFINITE), 'INFINITE')
        mock_node = Node(node.tag_name, (what,))
        return self.non_terminal(mock_node, 'Counted', ['repetitions=' + rstr])

    def on_symbol(self, node: Node) -> str:     # called only for symbols on the right hand side!
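        # Note added for clarity (not original code): judging from the logic below, a
        # right-hand-side symbol compiles to 'PreprocessorToken("<symbol>")' if it has
        # been declared via the @tokens-directive, to one of the reserved keyword-parsers
        # if it occurs in KEYWORD_SUBSTITUTION (e.g. the comment- or whitespace-parser),
        # and otherwise simply to its own name. Symbols that are already registered in
        # self.rules at the point of use are collected in self.forward, so that the
        # assembled grammar class declares them as '<symbol> = Forward()' and later
        # assigns them via '<symbol>.set(...)'.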
symbol = node.content # ; assert result == cast(str, node.result) if symbol in self.directives.tokens: return 'PreprocessorToken("' + symbol + '")' else: self.current_symbols.append(node) if symbol not in self.symbols: # remember first use of symbol, so that dangling references or # redundant definitions or usages of symbols can be detected later self.symbols[symbol] = node if symbol in self.rules: self.forward.add(symbol) if symbol in EBNFCompiler.KEYWORD_SUBSTITUTION: keyword = EBNFCompiler.KEYWORD_SUBSTITUTION[symbol] self.required_keywords.add(keyword) return keyword elif symbol.endswith('__'): self.tree.new_error(node, 'Illegal use of reserved symbol name "%s"!' % symbol) return symbol def drop_on(self, category): return category in self.directives.drop and self.context[-2].tag_name != "definition" def TEXT_PARSER(self, text, drop): return 'Drop(Text(' + text + '))' if drop else 'Text(' + text + ')' def REGEXP_PARSER(self, regexp, drop): return 'Drop(RegExp(' + regexp + '))' if drop else 'RegExp(' + regexp + ')' def WSPC_PARSER(self, force_drop=False): if ((force_drop or DROP_WSPC in self.directives.drop) and (self.context[-2].tag_name != "definition" or self.context[-1].tag_name == 'literal')): return 'dwsp__' return 'wsp__' def on_literals(self, node: Node) -> str: self.join_literals(node) return self.on_literal(node) def on_literal(self, node: Node) -> str: center = self.TEXT_PARSER(escape_ctrl_chars(node.content), self.drop_on(DROP_STRINGS)) force = DROP_STRINGS in self.directives.drop left = self.WSPC_PARSER(force) if 'left' in self.directives.literalws else '' right = self.WSPC_PARSER(force) if 'right' in self.directives.literalws else '' if left or right: return 'Series(' + ", ".join(item for item in (left, center, right) if item) + ')' return center def on_plaintext(self, node: Node) -> str: tk = escape_ctrl_chars(node.content) rpl = '"' if tk.find('"') < 0 else "'" if tk.find("'") < 0 else '' if rpl: tk = rpl + tk[1:-1] + rpl else: tk = rpl + tk.replace('"', '\\"')[1:-1] + rpl return self.TEXT_PARSER(tk, self.drop_on(DROP_BACKTICKED)) def on_regexp(self, node: Node) -> str: rx = node.content try: arg = repr(self.check_rx(node, rx.replace(r'\/', '/'))) except AttributeError as error: from traceback import extract_tb, format_list trace = ''.join(format_list(extract_tb(error.__traceback__))) errmsg = "%s (AttributeError: %s;\n%s)\n%s" \ % (EBNFCompiler.AST_ERROR, str(error), trace, node.as_sxpr()) self.tree.new_error(node, errmsg) return '"' + errmsg + '"' return self.REGEXP_PARSER(arg, self.drop_on(DROP_REGEXP)) def on_char_range(self, node) -> str: for child in node.children: if child.tag_name == 'character': child.result = self.extract_character(child) elif child.tag_name == 'free_char': child.result = self.extract_free_char(child) re_str = re.sub(r"(? 
str:
        hexcode = node.content
        if len(hexcode) > 4:
            assert len(hexcode) <= 8
            code = '\\U' + (8 - len(hexcode)) * '0' + hexcode
        elif len(hexcode) > 2:
            code = '\\u' + (4 - len(hexcode)) * '0' + hexcode
        else:
            code = '\\x' + (2 - len(hexcode)) * '0' + hexcode
        return code

    def on_character(self, node: Node) -> str:
        return "RegExp('%s')" % self.extract_character(node)

    def extract_free_char(self, node: Node) -> str:
        assert len(node.content) == 1
        bytestr = node.content.encode('unicode-escape')
        return bytestr.decode()

    def on_any_char(self, node: Node) -> str:
        return 'AnyChar()'

    def on_whitespace(self, node: Node) -> str:
        return self.WSPC_PARSER()


def get_ebnf_compiler(grammar_name="", grammar_source="") -> EBNFCompiler:
    THREAD_LOCALS = access_thread_locals()
    try:
        compiler = THREAD_LOCALS.ebnf_compiler_singleton
        compiler.set_grammar_name(grammar_name, grammar_source)
        return compiler
    except AttributeError:
        compiler = EBNFCompiler(grammar_name, grammar_source)
        compiler.set_grammar_name(grammar_name, grammar_source)
        THREAD_LOCALS.ebnf_compiler_singleton = compiler
        return compiler


def compile_ebnf_ast(ast: RootNode) -> str:
    """Compiles the abstract-syntax-tree of an EBNF-source-text into
    python code of a class derived from `parse.Grammar` that can
    parse text following the grammar described with the EBNF-code."""
    return get_ebnf_compiler()(ast)


########################################################################
#
# EBNF compiler
#
########################################################################


def compile_ebnf(ebnf_source: str, branding: str = 'DSL', *, preserve_AST: bool = False) \
        -> ResultTuple:
    """
    Compiles an `ebnf_source` (file name or EBNF-string) and returns a tuple
    of the Python code generated by the EBNF-compiler, a list of warnings or
    errors and the abstract syntax tree of the EBNF-source. This function is
    merely syntactic sugar.
    """
    return compile_source(ebnf_source,
                          get_ebnf_preprocessor(),
                          get_ebnf_grammar(),
                          get_ebnf_transformer(),
                          get_ebnf_compiler(branding, ebnf_source),
                          preserve_AST=preserve_AST)
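

# ---------------------------------------------------------------------------
# Usage sketch (added for illustration; not part of the original module): the
# guarded demo below shows one way `compile_ebnf` defined above might be
# called. The grammar string `demo_ebnf` and the branding "Pair" are made up
# for this example.
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    demo_ebnf = '''
        pair  = key ":" ~ value
        key   = /\\w+/~
        value = /\\w+/~
    '''
    # compile_ebnf returns (result, messages, AST); the result - if compilation
    # succeeded - is the Python source of a Grammar-class (here "PairGrammar").
    code, messages, _ = compile_ebnf(demo_ebnf, branding="Pair")
    for message in messages:
        print(message)
    if code:
        print(code)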