Commit 6cba61bd authored by eckhart

- renamed to to avoid name conflicts with python stock parser module

parent 9447d26f
""" - parser combinators for DHParser
Copyright 2016 by Eckhart Arnold (
Bavarian Academy of Sciences and Humanities (
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License.
Module ```` contains a number of classes that together
make up parser combinators for left-recursive grammars. For each
element of the extended Backus-Naur-Form as well as for a regular
expression token a class is defined. The set of classes can be used to
define a parser for (ambiguous) left-recursive grammars.
References and Acknowledgements:
Dominikus Herzberg: Objekt-orientierte Parser-Kombinatoren in Python,
Blog-Post, September, 18th 2008 on denkspuren. gedanken, ideen,
anregungen und links rund um informatik-themen, URL:
Dominikus Herzberg: Eine einfache Grammatik für LaTeX, Blog-Post,
September, 18th 2008 on denkspuren. gedanken, ideen, anregungen und
links rund um informatik-themen, URL:
Dominikus Herzberg: Uniform Syntax, Blog-Post, February, 27th 2007 on
denkspuren. gedanken, ideen, anregungen und links rund um
informatik-themen, URL:
Richard A. Frost, Rahmatullah Hafiz and Paul Callaghan: Parser
Combinators for Ambiguous Left-Recursive Grammars, in: P. Hudak and
D.S. Warren (Eds.): PADL 2008, LNCS 4902, pp. 167–181, Springer-Verlag
Berlin Heidelberg 2008.
Elizabeth Scott and Adrian Johnstone, GLL Parsing,
in: Electronic Notes in Theoretical Computer Science 253 (2010) 177–189,
Juancarlo Añez: grako, a PEG parser generator in Python,
Vegard Øye: General Parser Combinators in Racket, 2012,
import collections
import copy
import html
import os
from functools import partial
from DHParser.error import Error, is_error, has_errors, linebreaks, line_col
from DHParser.stringview import StringView, EMPTY_STRING_VIEW
from DHParser.syntaxtree import Node, TransformationFunc, ParserBase, WHITESPACE_PTYPE, \
from DHParser.toolkit import is_logging, log_dir, logfile_basename, escape_re, sane_parser_name, \
load_if_file, re, typing
from typing import Any, Callable, cast, Dict, List, Set, Tuple, Union, Optional
__all__ = ('PreprocessorFunc',
# 'UnaryOperator',
# 'NaryOperator',
# Grammar and parsing infrastructure
PreprocessorFunc = Union[Callable[[str], str], partial]
LEFT_RECURSION_DEPTH = 8 # type: int
# because of python's recursion depth limit, this value ought not to be
# set too high. PyPy allows higher values than CPython
MAX_DROPOUTS = 3 # type: int
# stop trying to recover parsing after so many errors
class HistoryRecord:
Stores debugging information about one completed step in the
parsing history.
A parsing step is "completed" when the last one of a nested
sequence of parser-calls returns. The call stack including
the last parser call will be frozen in the ``HistoryRecord``-
object. In addition a reference to the generated leaf node
(if any) will be stored and the result status of the last
parser call, which is either MATCH, FAIL (i.e. no match) or ERROR.
__slots__ = ('call_stack', 'node', 'text', 'line_col')
Snapshot = collections.namedtuple('Snapshot', ['line', 'column', 'stack', 'status', 'text'])
COLGROUP = '<colgroup>\n<col style="width:2%"/><col style="width:2%"/><col style="width:75%"/>' \
'<col style="width:6%"/><col style="width:15%"/>\n</colgroup>\n'
'<html>\n<head>\n<meta charset="utf-8"/>\n<style>\n'
'td.line, td.column {font-family:monospace;color:darkgrey}\n'
'table{border-spacing: 0px; border: thin solid darkgrey; width:100%}\n'
'td{border-right: thin solid grey; border-bottom: thin solid grey}\n'
'\n</style>\n</head>\n<body>\n<table>\n' + COLGROUP)
HTML_LEAD_OUT = '\n</table>\n</body>\n</html>\n'
def __init__(self, call_stack: List['Parser'], node: Node, text: StringView) -> None:
# copy call stack, dropping uninformative Forward-Parsers
self.call_stack = [p for p in call_stack if p.ptype != ":Forward"] # type: List['Parser']
self.node = node # type: Node
self.text = text # type: StringView
self.line_col = (1, 1) # type: Tuple[int, int]
if call_stack:
grammar = call_stack[-1].grammar
document = grammar.document__
lbreaks = grammar.document_lbreaks__
self.line_col = line_col(lbreaks, len(document) - len(text))
def __str__(self):
return '%4i, %2i: %s; %s; "%s"' % self.as_tuple()
def as_tuple(self) -> Snapshot:
Returns history record formatted as a snapshot tuple.
return self.Snapshot(self.line_col[0], self.line_col[1],
self.stack, self.status, self.excerpt)
def as_csv_line(self) -> str:
Returns history record formatted as a csv table row.
return '"{}", "{}", "{}", "{}", "{}"'.format(*self.as_tuple())
def as_html_tr(self) -> str:
Returns history record formatted as an html table row.
stack = html.escape(self.stack).replace(
'-&gt;', '<span class="delimiter">&shy;-&gt;</span>')
status = html.escape(self.status)
excerpt = html.escape(self.excerpt)
if status == self.MATCH:
status = '<span class="match">' + status + '</span>'
i = stack.rfind('-&gt;')
chr = stack[i+12:i+13]
while not chr.isidentifier() and i >= 0:
i = stack.rfind('-&gt;', 0, i)
chr = stack[i+12:i+13]
if i >= 0:
i += 12
k = stack.find('<', i)
if k < 0:
stack = stack[:i] + '<span class="matchstack">' + stack[i:]
stack = stack[:i] + '<span class="matchstack">' + stack[i:k] \
+ '</span>' + stack[k:]
elif status == self.FAIL:
status = '<span class="fail">' + status + '</span>'
stack += '<br/>\n' + status
status = '<span class="error">ERROR</span>'
tpl = self.Snapshot(str(self.line_col[0]), str(self.line_col[1]), stack, status, excerpt)
# return ''.join(['<tr>'] + [('<td>%s</td>' % item) for item in tpl] + ['</tr>'])
return ''.join(['<tr>'] + [('<td class="%s">%s</td>' % (cls, item))
for cls, item in zip(tpl._fields, tpl)] + ['</tr>'])
def err_msg(self) -> str:
return self.ERROR + ": " + "; ".join(
str(e) for e in (self.node._errors if self.node._errors else
def stack(self) -> str:
return "->".join((p.repr if p.ptype == ':RegExp' else or p.ptype)
for p in self.call_stack)
def status(self) -> str:
return self.FAIL if self.node is None else \
('"%s"' % self.err_msg()) if self.node.error_flag else self.MATCH # has_errors(self.node._errors)
def excerpt(self):
length = len(self.node) if self.node else len(self.text)
excerpt = str(self.node)[:min(length, 20)] if self.node else self.text[:20]
excerpt = excerpt.replace('\n', '\\n')
if length > 20:
excerpt += '...'
return excerpt
# @property
# def extent(self) -> slice:
# return (slice(-self.remaining - len(self.node), -self.remaining) if self.node
# else slice(-self.remaining, None))
def remaining(self) -> int:
return len(self.text) - (len(self.node) if self.node else 0)
def last_match(history: List['HistoryRecord']) -> Union['HistoryRecord', None]:
Returns the last match from the parsing-history.
history: the parsing-history as a list of HistoryRecord objects
the history record of the last match or None if either history is
empty or no parser could match
for record in reversed(history):
if record.status == HistoryRecord.MATCH:
return record
return None
def most_advanced_match(history: List['HistoryRecord']) -> Union['HistoryRecord', None]:
Returns the closest-to-the-end-match from the parsing-history.
history: the parsing-history as a list of HistoryRecord objects
the history record of the closest-to-the-end-match or None if either history is
empty or no parser could match
remaining = -1
result = None
for record in history:
if (record.status == HistoryRecord.MATCH and
(record.remaining < remaining or remaining < 0)):
result = record
remaining = record.remaining
return result
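The difference between the two lookups above can be seen in a minimal sketch. `Rec` is a hypothetical, simplified stand-in for `HistoryRecord` that carries only the two fields the functions read; the constants `MATCH` and `FAIL` are likewise stand-ins for the class attributes.

```python
from typing import List, NamedTuple, Optional

MATCH = "MATCH"   # stand-in for HistoryRecord.MATCH
FAIL = "FAIL"     # stand-in for HistoryRecord.FAIL


class Rec(NamedTuple):
    """Hypothetical, simplified stand-in for HistoryRecord."""
    status: str
    remaining: int  # characters left unparsed after this step


def last_match(history: List[Rec]) -> Optional[Rec]:
    # walk backwards and return the first matching record found
    for record in reversed(history):
        if record.status == MATCH:
            return record
    return None


def most_advanced_match(history: List[Rec]) -> Optional[Rec]:
    # keep the matching record with the fewest remaining characters
    remaining, result = -1, None
    for record in history:
        if record.status == MATCH and (record.remaining < remaining or remaining < 0):
            result, remaining = record, record.remaining
    return result


history = [Rec(MATCH, 12), Rec(MATCH, 3), Rec(FAIL, 3), Rec(MATCH, 7)]
```

With this history, `last_match` picks the record added last (7 characters remaining), whereas `most_advanced_match` picks the one that got closest to the end of the document (3 characters remaining).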
def add_parser_guard(parser_func):
Adds a wrapper function to a parser function (i.e. the Parser.__call__ method)
that takes care of memoizing, left recursion and, optionally, tracing
(aka "history tracking") of parser calls. Returns the wrapped call.
def guarded_call(parser: 'Parser', text: StringView) -> Tuple[Optional[Node], StringView]:
location = len(text) # mind that location is always the distance to the end
grammar = parser.grammar # grammar may be 'None' for unconnected parsers!
if grammar.last_rb__loc__ <= location:
# if location has already been visited by the current parser,
# return saved result
if location in parser.visited:
return parser.visited[location]
if grammar.history_tracking__:
grammar.moving_forward__ = True
# break left recursion at the maximum allowed depth
if grammar.left_recursion_handling__:
if parser.recursion_counter.setdefault(location, 0) > LEFT_RECURSION_DEPTH:
return None, text
parser.recursion_counter[location] += 1
# run original __call__ method
node, rest = parser_func(parser, text)
if grammar.left_recursion_handling__:
parser.recursion_counter[location] -= 1
if node is None:
# retrieve an earlier match result (from left recursion) if it exists
if location in grammar.recursion_locations__:
if location in parser.visited:
node, rest = parser.visited[location]
# TODO: maybe add a warning about occurrence of left-recursion here?
# don't overwrite any positive match (i.e. node not None) in the cache
# and don't add empty entries for parsers returning from left recursive calls!
elif grammar.memoization__:
# otherwise also cache None-results
parser.visited[location] = (None, rest)
elif (grammar.last_rb__loc__ > location
and (grammar.memoization__ or location in grammar.recursion_locations__)):
# - variable manipulating parsers will not be entered into the cache,
# because caching would interfere with changes of variable state
# - in case of left recursion, the first recursive step that
# matches will store its result in the cache
parser.visited[location] = (node, rest)
if grammar.history_tracking__:
# don't track returning parsers except in case an error has occurred
remaining = len(rest)
if grammar.moving_forward__ or (node and node.error_flag): # node._errors
record = HistoryRecord(grammar.call_stack__, node, text)
# print(record.stack, record.status, rest[:20].replace('\n', '|'))
grammar.moving_forward__ = False
except RecursionError:
node = Node(None, str(text[:min(10, max(1, text.find("\n")))]) + " ...")
node.add_error("maximum recursion depth of parser reached; "
"potentially due to too many errors!")
return node, rest
return guarded_call
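The memoizing part of the guard can be illustrated in isolation. The following is only a sketch under simplifying assumptions: it leaves out left-recursion handling, history tracking and error recovery, works on plain strings instead of `StringView`, and the names `add_guard` and `Digits` are made up for the example.

```python
from typing import Callable, Dict, Optional, Tuple

Result = Tuple[Optional[str], str]   # (matched text or None, rest of the text)


def add_guard(parser_func: Callable[..., Result]) -> Callable[..., Result]:
    """Wraps a parser call with a cache keyed by the distance to the end."""
    def guarded_call(parser: "Digits", text: str) -> Result:
        location = len(text)              # distance to the end of the document
        if location in parser.visited:    # already tried here: reuse the result
            return parser.visited[location]
        result = parser_func(parser, text)
        parser.visited[location] = result
        return result
    return guarded_call


class Digits:
    """Hypothetical toy parser matching a run of leading digits."""
    def __init__(self) -> None:
        self.visited = {}  # type: Dict[int, Result]
        self.calls = 0     # counts how often the unguarded body actually ran

    @add_guard
    def __call__(self, text: str) -> Result:
        self.calls += 1
        i = 0
        while i < len(text) and text[i].isdigit():
            i += 1
        return (text[:i], text[i:]) if i > 0 else (None, text)


parser = Digits()
first = parser("42abc")
second = parser("42abc")   # served from parser.visited, body not run again
```

Keying the cache by `len(text)` only works as long as all calls refer to the same document, which is why the cache has to be emptied before another document is parsed.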
class Parser(ParserBase):
(Abstract) Base class for Parser combinator parsers. Any parser
object that is actually used for parsing (i.e. no mock parsers)
should be derived from this class.
Since parsers can contain other parsers (see classes UnaryOperator
and NaryOperator) they form a cyclical directed graph. A root
parser is a parser from which all other parsers can be reached.
Usually, there is one root parser which serves as the starting
point of the parsing process. When speaking of "the root parser"
it is this root parser object that is meant.
There are two different types of parsers:
1. *Named parsers* for which a name is set in field ``.
The results produced by these parsers can later be retrieved in
the AST by the parser name.
2. *Anonymous parsers* where the name-field just contains the empty
string. AST-transformation of Anonymous parsers can be hooked
only to their class name, and not to the individual parser.
Parser objects are callable and parsing is done by calling a parser
object with the text to parse.
If the parser matches it returns a tuple consisting of a node
representing the root of the concrete syntax tree resulting from the
match as well as the substring `text[i:]` where i is the length of
matched text (which can be zero in the case of parsers like
`ZeroOrMore` or `Option`). If `i > 0` then the parser has "moved forward".
If the parser does not match it returns `(None, text)`. **Note** that
this is not the same as an empty match `("", text)`. An empty match
can for example be returned by the `ZeroOrMore`-parser in case the
contained parser is repeated zero times.
Attributes and Properties:
visited: Mapping of places this parser has already been to
during the current parsing process onto the results the
parser returned at the respective place. This dictionary
is used to implement memoizing.
recursion_counter: Mapping of places to how often the parser
has already been called recursively at this place. This
is needed to implement left recursion. The number of
calls becomes irrelevant once a result has been memoized.
cycle_detection: The apply()-method uses this variable to make
sure that one and the same function will not be applied
(recursively) a second time, if it has already been
applied to this parser.
grammar: A reference to the Grammar object to which the parser
is attached.
ApplyFunc = Callable[['Parser'], None]
def __init__(self, name: str = '') -> None:
# assert isinstance(name, str), str(name)
super(Parser, self).__init__(name)
self._grammar = None # type: 'Grammar'
# add "aspect oriented" wrapper around parser calls
# for memoizing, left recursion and tracing
guarded_parser_call = add_parser_guard(self.__class__.__call__)
# The following check is necessary for classes that don't override
# the __call__() method, because in these cases the non-overridden
# __call__()-method would be substituted a second time!
if self.__class__.__call__.__code__ != guarded_parser_call.__code__:
self.__class__.__call__ = guarded_parser_call
def __deepcopy__(self, memo):
"""Deepcopy method of the parser. Upon instantiation of a Grammar-
object, parsers will be deep-copied to the Grammar object. If a
derived parser-class changes the signature of the constructor,
the `__deepcopy__`-method must be replaced (i.e. overridden without
calling the same method from the superclass) by the derived class.
return self.__class__(
def reset(self):
"""Initializes or resets any parser variables. If overwritten,
the `reset()`-method of the parent class must be called from the
`reset()`-method of the derived class."""
self.visited = dict() # type: Dict[int, Tuple[Optional[Node], StringView]]
self.recursion_counter = dict() # type: Dict[int, int]
self.cycle_detection = set() # type: Set[Callable]
def __call__(self, text: StringView) -> Tuple[Optional[Node], StringView]:
"""Applies the parser to the given `text` and returns a node with
the results or None as well as the text at the position right behind
the matching string."""
return None, text # default behaviour: don't match
def __add__(self, other: 'Parser') -> 'Series':
"""The + operator generates a series-parser that applies two
parsers in sequence."""
return Series(self, other)
def __or__(self, other: 'Parser') -> 'Alternative':
"""The | operator generates an alternative parser that applies
the first parser and, if that does not match, the second parser.
return Alternative(self, other)
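How the two operators compose can be shown with a minimal sketch. The classes below are hypothetical stand-ins that only record how parsers were combined; they do not parse anything, and, unlike what a real implementation might do, they nest combined parsers rather than flattening them.

```python
class P:
    """Hypothetical minimal parser base; only records how parsers are combined."""
    def __add__(self, other: "P") -> "Series":
        return Series(self, other)

    def __or__(self, other: "P") -> "Alternative":
        return Alternative(self, other)


class Lit(P):
    def __init__(self, text: str) -> None:
        self.text = text

    def __repr__(self) -> str:
        return repr(self.text)


class Series(P):
    def __init__(self, *parsers: P) -> None:
        self.parsers = parsers

    def __repr__(self) -> str:
        return "Series(" + ", ".join(map(repr, self.parsers)) + ")"


class Alternative(P):
    def __init__(self, *parsers: P) -> None:
        self.parsers = parsers

    def __repr__(self) -> str:
        return "Alternative(" + ", ".join(map(repr, self.parsers)) + ")"


# "+" binds more tightly than "|" in Python, so this reads as (a + b) | c
combined = Lit("a") + Lit("b") | Lit("c")
```

Because of Python's operator precedence, a grammar rule written as `a + b | c` is a choice between the sequence `a b` and `c`; parentheses are needed to express `a + (b | c)`.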
def grammar(self) -> 'Grammar':
return self._grammar
def grammar(self, grammar: 'Grammar'):
if self._grammar is None:
self._grammar = grammar
assert self._grammar == grammar, \
"Parser has already been assigned to a different Grammar object!"
def _grammar_assigned_notifier(self):
"""A function that notifies the parser object that it has been
assigned to a grammar."""
def apply(self, func: ApplyFunc) -> bool:
Applies function `func(parser)` recursively to this parser and all
descendant parsers if any exist. The same function can never
be applied twice between calls of the ``reset()``-method!
Returns `True`, if function has been applied, `False` if function
had been applied earlier already and thus has not been applied again.
if func in self.cycle_detection:
return False
assert not self.visited, "No calls to Parser.apply() during or " \
"after ongoing parsing process. (Call Parser.reset() first.)"
return True
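Why `apply()` needs the `cycle_detection` set becomes clear once parsers refer to each other in a cycle. The sketch below assumes a hypothetical `ToyParser` class with an explicit `children` list; in this sketch the descent into child parsers happens inside `apply()` itself, which is a simplification.

```python
from typing import Callable, List, Set


class ToyParser:
    """Hypothetical stand-in for a parser node with child parsers."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.children = []            # type: List[ToyParser]
        self.cycle_detection = set()  # type: Set[Callable]

    def apply(self, func: Callable[["ToyParser"], None]) -> bool:
        if func in self.cycle_detection:
            return False              # func already visited this parser
        self.cycle_detection.add(func)
        func(self)
        for child in self.children:
            child.apply(func)
        return True


# expression -> term -> expression forms a cycle, as recursive grammars do
expression, term = ToyParser("expression"), ToyParser("term")
expression.children.append(term)
term.children.append(expression)

seen = []  # type: List[str]


def collect(parser: ToyParser) -> None:
    seen.append(parser.name)


applied = expression.apply(collect)
applied_again = expression.apply(collect)
```

Without the membership check the traversal would recurse forever on the cycle; with it, every parser is visited exactly once and a second call with the same function returns `False`.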
def mixin_comment(whitespace: str, comment: str) -> str:
Returns a regular expression that merges comment and whitespace
regexps. Thus comments can occur wherever whitespace is allowed
and will be skipped just as implicit whitespace.
Note that, because this works on the level of regular expressions,
nesting comments is not possible. It also makes it much harder to
use directives inside comments (which isn't recommended, anyway).
wspc = '(?:' + whitespace + '(?:' + comment + whitespace + ')*)'
return wspc
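A short usage sketch shows what the merged expression matches. The function body repeats the construction above so that the example is self-contained; the comment pattern is the one used in the `Arithmetic` example, and the sample input is made up.

```python
import re


def mixin_comment(whitespace: str, comment: str) -> str:
    # same construction as above: whitespace, then any number of
    # (comment, whitespace) repetitions, in a non-capturing group
    return '(?:' + whitespace + '(?:' + comment + whitespace + ')*)'


COMMENT = r'#.*(?:\n|$)'      # Python style comments
WSP = mixin_comment(whitespace=r'[\t ]*', comment=COMMENT)

source = "  # a comment\n\t42"
skipped = re.match(WSP, source)
rest = source[skipped.end():]   # what is left once whitespace and comment are skipped
```

Here `rest` is `"42"`: the leading blanks, the comment including its line break and the following tab are all consumed by the single merged expression.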
class Grammar:
Class Grammar directs the parsing process and stores global state
information of the parsers, i.e. state information that is shared
across parsers.
Grammars are basically collections of parser objects, which are
connected to an instance object of class Grammar. There exist two
ways of connecting parsers to grammar objects: Either by passing
the root parser object to the constructor of a Grammar object
("direct instantiation"), or by assigning the root parser to the
class variable "root__" of a descendant class of class Grammar.
Example for direct instantiation of a grammar:
>>> number = RE('\d+') + RE('\.') + RE('\d+') | RE('\d+')
>>> number_parser = Grammar(number)
>>> number_parser("3.1416").content
Collecting the parsers that define a grammar in a descendant class of
class Grammar and assigning the named parsers to class variables
rather than global variables has several advantages:
1. It keeps the namespace clean.
2. The parser names of named parsers do not need to be passed to the
constructor of the Parser object explicitly, but it suffices to
assign them to class variables, which results in better
readability of the Python code.
3. The parsers in the class do not necessarily need to be connected
to one single root parser, which is helpful for testing and
building up a parser successively of several components.
As a consequence, though, it is highly recommended that a Grammar
class should not define any other variables or methods with names
that are legal parser names. A name ending with a double
underscore '__' is *not* a legal parser name and can safely be
class Arithmetic(Grammar):
# special fields for implicit whitespace and comment configuration
COMMENT__ = r'#.*(?:\n|$)' # Python style comments
wspR__ = mixin_comment(whitespace=r'[\t ]*', comment=COMMENT__)
# parsers
expression = Forward()
INTEGER = RE('\\d+')
factor = INTEGER | Token("(") + expression + Token(")")
term = factor + ZeroOrMore((Token("*") | Token("/")) + factor)
expression.set(term + ZeroOrMore((Token("+") | Token("-")) + term))
root__ = expression
Upon instantiation the parser objects are deep-copied to the
Grammar object and assigned to object variables of the same name.
Any parser that is directly assigned to a class variable is a
'named' parser and its field `` contains the variable
name after instantiation of the Grammar class. All other parsers,
i.e. parsers that are defined within a `named` parser, remain
"anonymous parsers" where `` is the empty string, unless
a name has been passed explicitly upon instantiation.
If one and the same parser is assigned to several class variables
such as, for example, the parser `expression` in the example above,
the first name sticks.
Grammar objects are callable. Calling a grammar object with a UTF-8
encoded document initiates the parsing of the document with the
root parser. The return value is the concrete syntax tree. Grammar
objects can be reused (i.e. called again) after parsing. Thus, it
is not necessary to instantiate more than one Grammar object per
Grammar classes contain a few special class fields for implicit
whitespace and comments that should be overwritten, if the defaults
(no comments, horizontal right aligned whitespace) don't fit:
COMMENT__: regular expression string for matching comments
WSP__: regular expression for whitespace and comments
wspL__: regular expression string for left aligned whitespace,
which either equals WSP__ or is empty.
wspR__: regular expression string for right aligned whitespace,
which either equals WSP__ or is empty.
root__: The root parser of the grammar. Theoretically, all parsers of the
grammar should be reachable by the root parser. However, for testing
of yet incomplete grammars class Grammar does not assume that this
is the case.
parser_initializiation__: Before the parser class (!) has been initialized,
which happens upon the first time it is instantiated (see docstring for
method `_assign_parser_names()` for an explanation), this class
field contains a value other than "done". A value of "done" indicates
that the class has already been initialized.