                that serves the same function.

        static_analysis_errors__: A pointer to the class attribute of the same name.
                (See the description above.) If the class is instantiated with a
                parser, this pointer will be overwritten with an instance variable
                that serves the same function.

        # tracing and debugging support

        # These parameters are needed by the debugging functions in module
        # `trace.py`. They should not be manipulated directly by users of
        # class Grammar.

        history_tracking__:  A flag indicating that the parsing history is
                being tracked. This flag should not be manipulated by the
                user. Use `trace.set_tracer(grammar, trace.trace_history)` to
                turn (full) history tracking on and
                `trace.set_tracer(grammar, None)` to turn it off. Default is off.

        resume_notices__: A flag indicating that resume messages are generated
                in addition to the error messages, in case the parser was able
                to resume after an error. Use `trace.resume_notices(grammar)` to
                turn resume messages on and `trace.set_tracer(grammar, None)`
                to turn resume messages (as well as history recording) off.
                Default is off.

        call_stack__:  A stack of the tag names and locations of all parsers
                in the call chain to the currently processed parser during
                parsing. The call stack can be thought of as a breadcrumb trail.
                This is required for recording the parser history (for
                debugging) and may, in the future, also serve for tracing
                through the parsing process.

        history__:  A list of history records. A history record is appended to
                the list each time a parser either matches, fails or if a
                parser-error occurs. See class `log.HistoryRecord`. History
                records store copies of the current call stack.

        moving_forward__: This flag indicates that the parsing process is currently
                moving forward. It is needed to reduce noise in history recording
                and should not be considered as having a valid value if history
                recording is turned off! (See :func:`Parser.__call__`)

        most_recent_error__: The most recent parser error that has occurred,
                or `None`. This can be read by tracers. See module `trace`.

        # Configuration parameters.

        # The values of these parameters are copied from the global configuration
        # in the constructor of the Grammar object. (See module `configuration.py`.)

        flatten_tree__:  If True (default), anonymous nodes will be flattened
                during parsing already. This greatly reduces the size of the
                concrete syntax tree and simplifies and speeds up abstract
                syntax tree generation. Default is on.

        left_recursion_depth__: The maximum allowed depth for left-recursion.
                A depth of zero means that no left-recursion handling will
                take place. Default is 5.

        max_parser_dropouts__: The maximum allowed number of retries after errors
                where the parser would exit before the complete document has
                been parsed. Default is 1, as the retry-attempts usually lead
                to a proliferation of senseless error messages.

        reentry_search_window__: The number of following characters that the
                parser considers when searching for a reentry point after a
                syntax error has been encountered. Default is 10,000 characters.
    """
    python_src__ = ''  # type: str
    root__ = PARSER_PLACEHOLDER  # type: Parser
    # root__ must be overwritten with the root-parser by grammar subclass
    parser_initialization__ = ["pending"]  # type: List[str]
    resume_rules__ = dict()  # type: Dict[str, ResumeList]
    anonymous__ = RX_NEVER_MATCH  # type: RxPatternType
    # some default values
    COMMENT__ = r''  # type: str  # r'#.*(?:\n|$)'
    WSP_RE__ = mixin_comment(whitespace=r'[\t ]*', comment=COMMENT__)  # type: str
    static_analysis_pending__ = [True]  # type: List[bool]
    static_analysis_errors__ = []  # type: List[AnalysisError]

    @classmethod
    def _assign_parser_names__(cls):
        """
        Initializes the `parser.pname` fields of those
        Parser objects that are directly assigned to a class field with
        the field's name, e.g.::

            class Grammar(Grammar):
                ...
                symbol = RE(r'(?!\\d)\\w+')

        After the call of this method symbol.pname == "symbol" holds.
        Parser names starting or ending with a double underscore like
        ``root__`` will be ignored. See :func:`sane_parser_name()`.

        This is done only once, upon the first instantiation of the
        grammar class!

        Attention: If there exist multiple references to the same
        parser, only the first one will be chosen for Python versions
        3.6 and above. For Python versions <= 3.5 an arbitrarily
        selected reference will be chosen. See PEP 520
        (www.python.org/dev/peps/pep-0520/) for an explanation of why.
        """
        if cls.parser_initialization__[0] != "done":
            cdict = cls.__dict__
            for entry, parser in cdict.items():
                if isinstance(parser, Parser) and sane_parser_name(entry):
                    anonymous = bool(cls.anonymous__.match(entry))
                    assert anonymous or not parser.drop_content, entry
                    if isinstance(parser, Forward):
                        if not cast(Forward, parser).parser.pname:
                            cast(Forward, parser).parser.pname = entry
                            cast(Forward, parser).parser.anonymous = anonymous
                    else:
                        parser.pname = entry
                        parser.anonymous = anonymous
            if cls != Grammar:
                cls.parser_initialization__ = ["done"]  # (over-)write subclass variable


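The field-name assignment performed by `_assign_parser_names__` can be illustrated by a self-contained sketch (hypothetical `Rule` class and `assign_names` helper, not DHParser code). Since PEP 520 (Python 3.6+), class attributes are iterated in definition order, so the first reference to a shared object wins:

```python
class Rule:
    """Stand-in for a parser object that receives its name later."""
    def __init__(self):
        self.pname = ''

def assign_names(cls):
    # scan the class dict and copy each field name onto the object it holds
    for name, attr in cls.__dict__.items():
        if isinstance(attr, Rule) and not name.endswith('__'):
            if not attr.pname:  # first reference wins (definition order, PEP 520)
                attr.pname = name

class G:
    word = Rule()
    alias = word  # second reference to the same object - keeps the first name

assign_names(G)
```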
    def __deepcopy__(self, memo):
        """Deepcopy method of the parser. Upon instantiation of a Grammar-
        object, parsers will be deep-copied to the Grammar object. If a
        derived parser-class changes the signature of the `__init__`-constructor,
        `__deepcopy__`-method must be replaced (i.e. overridden without
        calling the same method from the superclass) by the derived class.
        """
        duplicate = self.__class__(self.root_parser__)
        duplicate.history_tracking__ = self.history_tracking__
        duplicate.resume_notices__ = self.resume_notices__
        duplicate.flatten_tree__ = self.flatten_tree__
        duplicate.left_recursion_depth__ = self.left_recursion_depth__
        duplicate.max_parser_dropouts__ = self.max_parser_dropouts__
        duplicate.reentry_search_window__ = self.reentry_search_window__
        return duplicate


    def __init__(self, root: Optional[Parser] = None) -> None:
        self.all_parsers__ = set()             # type: Set[Parser]
        # add compiled regular expression for comments, if it does not already exist
        if not hasattr(self, 'comment_rx__') or self.comment_rx__ is None:
            if hasattr(self.__class__, 'COMMENT__') and self.__class__.COMMENT__:
                self.comment_rx__ = re.compile(self.__class__.COMMENT__)
            else:
                self.comment_rx__ = RX_NEVER_MATCH
        else:
            assert ((self.__class__.COMMENT__
                     and self.__class__.COMMENT__ == self.comment_rx__.pattern)
                    or (not self.__class__.COMMENT__ and self.comment_rx__ == RX_NEVER_MATCH))
        self.start_parser__ = None             # type: Optional[Parser]
        self._dirty_flag__ = False             # type: bool
        self.memoization__ = True              # type: bool
        self.history_tracking__ = False        # type: bool
        self.resume_notices__ = False          # type: bool
        self.flatten_tree__ = get_config_value('flatten_tree')                    # type: bool
        self.left_recursion_depth__ = get_config_value('left_recursion_depth')    # type: int
        self.max_parser_dropouts__ = get_config_value('max_parser_dropouts')      # type: int
        self.reentry_search_window__ = get_config_value('reentry_search_window')  # type: int
        self._reset__()

        # prepare parsers in the class, first
        self._assign_parser_names__()

        # then deep-copy the parser tree from class to instance;
        # parsers not connected to the root object will be copied later
        # on demand (see Grammar.__getitem__()).
        # (Usually, all parsers should be connected to the root object. But
        # during testing and development this does not need to be the case.)
        if root:
            self.root_parser__ = copy.deepcopy(root)
            self.static_analysis_pending__ = [True]  # type: List[bool]
            self.static_analysis_errors__ = []       # type: List[AnalysisError]
        else:
            assert self.__class__ == Grammar or self.__class__.root__ != PARSER_PLACEHOLDER, \
                "Please add `root__` field to definition of class " + self.__class__.__name__
            self.root_parser__ = copy.deepcopy(self.__class__.root__)
            self.static_analysis_pending__ = self.__class__.static_analysis_pending__
            self.static_analysis_errors__ = self.__class__.static_analysis_errors__
        self.static_analysis_caches__ = dict()  # type: Dict[str, Dict]

        self.root_parser__.apply(self._add_parser__)
        assert 'root_parser__' in self.__dict__
        assert self.root_parser__ == self.__dict__['root_parser__']

        if self.static_analysis_pending__ \
                and get_config_value('static_analysis') in {'early', 'late'}:
            result = self.static_analysis()
            # clears any stored errors without overwriting the pointer
            while self.static_analysis_errors__:
                self.static_analysis_errors__.pop()
            self.static_analysis_errors__.extend(result)
            has_errors = any(is_error(tpl[-1].code) for tpl in result)
            if has_errors:
                raise GrammarError(result)
            self.static_analysis_pending__.pop()

    def __str__(self):
        return self.__class__.__name__


    def __getitem__(self, key):
        try:
            return self.__dict__[key]
        except KeyError:
            parser_template = getattr(self.__class__, key, None)
            if parser_template:
                # add parser to grammar object on the fly...
                parser = copy.deepcopy(parser_template)
                parser.apply(self._add_parser__)
                assert self[key] == parser
                return self[key]
            raise UnknownParserError('Unknown parser "%s"!' % key)

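The on-demand copying that `__getitem__` performs can be sketched in isolation (hypothetical `Registry` class, not the actual DHParser API): a class-level template is deep-copied into the instance the first time the key is looked up, and the cached copy is returned on every subsequent access.

```python
import copy

# Sketch of lazy copy-from-class-template lookup (not the DHParser API).
class Registry:
    template = ['shared', 'default']  # class-level template object

    def __getitem__(self, key):
        try:
            return self.__dict__[key]      # already copied to this instance?
        except KeyError:
            proto = getattr(self.__class__, key, None)
            if proto is not None:
                setattr(self, key, copy.deepcopy(proto))  # instance-private copy
                return self.__dict__[key]
            raise KeyError(key)

r = Registry()
```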
    def __contains__(self, key):
        return key in self.__dict__ or hasattr(self, key)


    def _reset__(self):
        self.tree__ = RootNode()              # type: RootNode
        self.document__ = EMPTY_STRING_VIEW   # type: StringView
        self._reversed__ = EMPTY_STRING_VIEW  # type: StringView
        self.document_length__ = 0            # type: int
        self._document_lbreaks__ = []         # type: List[int]
        # variables stored and recalled by Capture and Retrieve parsers
        self.variables__ = defaultdict(lambda: [])  # type: DefaultDict[str, List[str]]
        self.rollback__ = []                  # type: List[Tuple[int, Callable]]
        self.last_rb__loc__ = -1              # type: int
        # support for call stack tracing
        self.call_stack__ = []                # type: List[CallItem]  # tag_name, location
        # snapshots of call stacks
        self.history__ = []                   # type: List[HistoryRecord]
        # also needed for call stack tracing
        self.moving_forward__ = False         # type: bool
        self.recursion_locations__ = set()    # type: Set[int]
        self.last_recursion_location__ = -1   # type: int
        self.most_recent_error__ = None       # type: Optional[ParserError]


    @property
    def reversed__(self) -> StringView:
        """
        Returns a reversed version of the currently parsed document. As
        about the only case where this is needed is the Lookbehind-parser,
        this is done lazily.
        """
        if not self._reversed__:
            self._reversed__ = StringView(self.document__.get_text()[::-1])
        return self._reversed__


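The lazy computation used by `reversed__` follows a common caching-property pattern; a minimal sketch (hypothetical `Doc` class, not DHParser code):

```python
# Sketch of the lazy caching used by `reversed__`: the reversed text is
# computed at most once, on first access.
class Doc:
    def __init__(self, text: str):
        self.text = text
        self._reversed = ''

    @property
    def reversed(self) -> str:
        if not self._reversed:
            self._reversed = self.text[::-1]
        return self._reversed

d = Doc("abc")
```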
    def _add_parser__(self, context: List[Parser]) -> None:
        """
        Adds the particular copy of the parser object to this
        particular instance of Grammar.
        """
        parser = context[-1]
        if parser.pname:
            # prevent overwriting instance variables or parsers of a different class
            assert (parser.pname not in self.__dict__
                    or isinstance(self.__dict__[parser.pname], parser.__class__)), \
                ('Cannot add parser "%s" because a field with the same name '
                 'already exists in grammar object: %s!'
                 % (parser.pname, str(self.__dict__[parser.pname])))
            setattr(self, parser.pname, parser)
        parser.tag_name = parser.ptype if parser.anonymous else parser.pname
        self.all_parsers__.add(parser)
        parser.grammar = self


    def __call__(self,
                 document: str,
                 start_parser: Union[str, Parser] = "root_parser__",
                 *, complete_match: bool = True) -> RootNode:
        """
        Parses a document with parser-combinators.

        Args:
            document (str): The source text to be parsed.
            start_parser (str or Parser): The name of the parser with which
                to start. This is useful for testing particular parsers
                (i.e. particular parts of the EBNF-grammar).
            complete_match (bool): If True, an error is generated, if
                `start_parser` did not match the entire document.
        Returns:
            Node: The root node of the parse tree.
        """

        def tail_pos(predecessors: Union[List[Node], Tuple[Node, ...], None]) -> int:
            """Adds the position after the last node in the list of
            predecessors to the node."""
            return predecessors[-1].pos + len(predecessors[-1]) if predecessors else 0

        def lookahead_failure_only(parser):
            """EXPERIMENTAL!

            Checks if the failure to match the document was only due to a
            succeeding lookahead parser, which is a common design pattern
            that can break test cases. Testing for this case makes it
            possible to modify the error message, so that the testing
            framework knows that the failure is only a test-case-artifact
            and not a real failure.
            (See test/test_testing.TestLookahead !)
            """
            def is_lookahead(tag_name: str) -> bool:
                return (tag_name in self and isinstance(self[tag_name], Lookahead)
                        or tag_name[0] == ':' and issubclass(eval(tag_name[1:]), Lookahead))
            last_record = self.history__[-2] if len(self.history__) > 1 \
                else None  # type: Optional[HistoryRecord]
            return last_record and parser != self.root_parser__ \
                and any(h.status == HistoryRecord.MATCH
                        and any(is_lookahead(tn) and location >= len(self.document__)
                                for tn, location in h.call_stack)
                        for h in self.history__[:-1])

        parser = self[start_parser] if isinstance(start_parser, str) else start_parser
        assert parser.grammar == self, "Cannot run parsers from a different grammar object!" \
                                       " %s vs. %s" % (str(self), str(parser.grammar))
        if self._dirty_flag__:
            self._reset__()
            parser.apply(lambda ctx: ctx[-1].reset())
        else:
            self._dirty_flag__ = True

        self.start_parser__ = parser.parser if isinstance(parser, Forward) else parser
        self.document__ = StringView(document)
        self.document_length__ = len(self.document__)
        self._document_lbreaks__ = linebreaks(document) if self.history_tracking__ else []
        self.last_rb__loc__ = -1  # rollback location
        result = None  # type: Optional[Node]
        stitches = []  # type: List[Node]
        rest = self.document__
        if not rest:
            result, _ = parser(rest)
            if result is None:
                result = Node(ZOMBIE_TAG, '').with_pos(0)
                if lookahead_failure_only(parser):
                    self.tree__.new_error(
                        result, 'Parser "%s" only did not match empty document '
                                'because of lookahead' % str(parser),
                        PARSER_LOOKAHEAD_FAILURE_ONLY)
                else:
                    self.tree__.new_error(
                        result, 'Parser "%s" did not match empty document.' % str(parser),
                        PARSER_STOPPED_BEFORE_END)

        # copy to local variable, so break condition can be triggered manually
        max_parser_dropouts = self.max_parser_dropouts__
        while rest and len(stitches) < max_parser_dropouts:
            result, rest = parser(rest)
            if rest and complete_match:
                fwd = rest.find("\n") + 1 or len(rest)
                skip, rest = rest[:fwd], rest[fwd:]
                if result is None:
                    err_info = '' if not self.history_tracking__ else \
                               '\n    Most advanced: %s\n    Last match:    %s;' % \
                               (str(HistoryRecord.most_advanced_match(self.history__)),
                                str(HistoryRecord.last_match(self.history__)))
                    # Check if a Lookahead-Parser did match. Needed for testing, because
                    # in a test case this is not necessarily an error.
                    if lookahead_failure_only(parser):
                        error_msg = 'Parser "%s" only did not match because of lookahead! ' \
                                    % str(parser) + err_info
                        error_code = PARSER_LOOKAHEAD_FAILURE_ONLY
                    else:
                        error_msg = 'Parser "%s" did not match!' % str(parser) + err_info
                        error_code = PARSER_STOPPED_BEFORE_END
                else:
                    stitches.append(result)
                    for h in reversed(self.history__):
                        if h.node and h.node.tag_name != EMPTY_NODE.tag_name \
                                and any('Lookahead' in tag for tag, _ in h.call_stack):
                            break
                    else:
                        h = HistoryRecord([], EMPTY_NODE, StringView(''), (0, 0))
                    if h.status == h.MATCH and (h.node.pos + len(h.node) == len(self.document__)):
                        # TODO: this case still needs unit-tests and support in testing.py
                        error_msg = "Parser stopped before end, but matched with lookahead."
                        error_code = PARSER_LOOKAHEAD_MATCH_ONLY
                        max_parser_dropouts = -1  # no further retries!
                    else:
                        error_msg = "Parser stopped before end" \
                            + (("! trying to recover"
                                + (" but stopping history recording at this point."
                                   if self.history_tracking__ else "..."))
                                if len(stitches) < self.max_parser_dropouts__
                                else " too often!" if self.max_parser_dropouts__ > 1 else "!"
                                     + " Terminating parser.")
                        error_code = PARSER_STOPPED_BEFORE_END
                stitch = Node(ZOMBIE_TAG, skip).with_pos(tail_pos(stitches))
                stitches.append(stitch)
                error = Error(error_msg, stitch.pos, error_code)
                self.tree__.add_error(stitch, error)
                if self.history_tracking__:
                    lc = line_col(self.document_lbreaks__, error.pos)
                    self.history__.append(HistoryRecord([(stitch.tag_name, stitch.pos)], stitch,
                                                        self.document__[error.pos:], lc, [error]))
            else:
                # if complete_match is False, ignore the rest and leave while loop
                rest = StringView('')
        if stitches:
            if rest:
                stitches.append(Node(ZOMBIE_TAG, rest))
            result = Node(ZOMBIE_TAG, tuple(stitches)).with_pos(0)
        if any(self.variables__.values()):
            error_msg = "Capture-stack not empty after end of parsing: " \
                + ', '.join(k for k, i in self.variables__.items() if len(i) >= 1)
            if parser.apply(has_non_autocaptured_symbols):
                error_code = CAPTURE_STACK_NOT_EMPTY
            else:
                error_code = AUTORETRIEVED_SYMBOL_NOT_CLEARED
            if result:
                if result.children:
                    # add another child node at the end to ensure that the position
                    # of the error will be the end of the text. Otherwise, the error
                    # message above ("...after end of parsing") would appear illogical.
                    error_node = Node(ZOMBIE_TAG, '').with_pos(tail_pos(result.children))
                    self.tree__.new_error(error_node, error_msg, error_code)
                    result.result = result.children + (error_node,)
                else:
                    self.tree__.new_error(result, error_msg, error_code)
        self.tree__.swallow(result)
        self.start_parser__ = None
        return self.tree__


    def push_rollback__(self, location, func):
        """
        Adds a rollback function that either removes or re-adds
        values on the variable stack (`self.variables`) that have been
        added (or removed) by Capture or Pop Parsers, the results of
        which have been dismissed.
        """
        self.rollback__.append((location, func))
        self.last_rb__loc__ = location


    @property
    def document_lbreaks__(self) -> List[int]:
        if not self._document_lbreaks__:
            self._document_lbreaks__ = linebreaks(self.document__)
        return self._document_lbreaks__


    def rollback_to__(self, location):
        """
        Rolls back the variable stacks (`self.variables`) to its
        state at an earlier location in the parsed document.
        """
        while self.rollback__ and self.rollback__[-1][0] >= location:
            _, rollback_func = self.rollback__.pop()
            rollback_func()
        self.last_rb__loc__ = self.rollback__[-1][0] if self.rollback__ \
            else (len(self.document__) + 1)


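The cooperation of `push_rollback__` and `rollback_to__` can be sketched in a self-contained form (illustrative only, not DHParser code): undo-closures are pushed together with their document location and executed in reverse order when the parser backtracks.

```python
# Sketch of the rollback-stack technique: each capture registers an
# undo-function; rolling back pops and runs all undos at or after a location.
variables = {'sym': []}
rollback = []  # list of (location, undo-function) pairs

def capture(location, value):
    variables['sym'].append(value)
    rollback.append((location, lambda: variables['sym'].pop()))

def rollback_to(location):
    # undo everything recorded at or after `location`, newest first
    while rollback and rollback[-1][0] >= location:
        _, undo = rollback.pop()
        undo()

capture(3, 'a')
capture(7, 'b')
rollback_to(5)  # undoes the capture at location 7 only
```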
    def as_ebnf(self) -> str:
        """
        Serializes the Grammar object as a grammar-description in
        Extended Backus-Naur-Form. Directives are not serialized, and
        very long expressions may be abbreviated with three dots " ... ".
        """
        ebnf = ['# This grammar does not include any of the DHParser-specific ',
                '# directives and may contain abbreviations ("...")!', '']
        for entry, parser in self.__dict__.items():
            if isinstance(parser, Parser) and sane_parser_name(entry):
                ebnf.append(str(parser))
        ebnf.append('')
        return '\n'.join(ebnf)


    def associated_symbol(self, parser: Parser) -> Parser:
        r"""Returns the closest named parser that contains `parser`.
        If `parser` is a named parser itself, `parser` is returned.
        If `parser` is not connected to any symbol in the Grammar,
        an AttributeError is raised.

        >>> word = Series(RegExp(r'\w+'), Whitespace(r'\s*'))
        >>> word.pname = 'word'
        >>> gr = Grammar(word)
        >>> anonymous_re = gr['word'].parsers[0]
        >>> gr.associated_symbol(anonymous_re).pname
        'word'
        """
        symbol = None   # type: Optional[Parser]

        def find_symbol_for_parser(context: List[Parser]) -> Optional[bool]:
            nonlocal symbol, parser
            if parser in context[-1].sub_parsers():
                for p in reversed(context):
                    if p.pname:
                        # save the name of the closest containing named parser
                        symbol = p
                        return True  # stop searching
            return False  # continue searching

        if parser.pname:
            return parser
        self.root_parser__.apply(find_symbol_for_parser)
        if symbol is None:
            raise AttributeError('Parser %s (%i) is not contained in Grammar!'
                                 % (str(parser), id(parser)))
        return symbol


    def static_analysis(self) -> List[AnalysisError]:
        """
        Checks the parser tree statically for possible errors.

        This function is called by the constructor of class Grammar and does
        not need to (and should not) be called externally.

        :return: a list of error-tuples consisting of the narrowest containing
            named parser (i.e. the symbol on which the failure occurred),
            the actual parser that failed and an error object.
        """
        error_list = []  # type: List[AnalysisError]
        leaf_state = dict()  # type: Dict[Parser, Optional[bool]]

        def has_leaf_parsers(prsr: Parser) -> bool:
            def leaf_parsers(p: Parser) -> Optional[bool]:
                if p in leaf_state:
                    return leaf_state[p]
                sub_list = p.sub_parsers()
                if sub_list:
                    leaf_state[p] = None
                    state = any(leaf_parsers(s) for s in sub_list)
                    if not state and any(leaf_state[s] is None for s in sub_list):
                        state = None
                else:
                    state = True
                leaf_state[p] = state
                return state

            # remove parsers with unknown state (None) from cache
            state_unknown = [p for p, s in leaf_state.items() if s is None]
            for p in state_unknown:
                del leaf_state[p]

            result = leaf_parsers(prsr) or False
            leaf_state[prsr] = result
            return result

        # cache = dict()  # type: Dict[Parser, Set[Parser]]
        # for debugging: all_parsers = sorted(list(self.all_parsers__), key=lambda p:p.pname)
        for parser in self.all_parsers__:
            error_list.extend(parser.static_analysis())
            if parser.pname and not has_leaf_parsers(parser):
                error_list.append((parser.symbol, parser, Error(
                    'Parser %s is entirely cyclical and, therefore, cannot even '
                    'touch the parsed document' % parser.location_info(),
                    0, PARSER_NEVER_TOUCHES_DOCUMENT)))
        return error_list
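# For illustration, the leaf-reachability test used by has_leaf_parsers can be
# sketched outside of DHParser on a plain dict-based symbol graph. Everything
# below (`has_leaf`, `demo`) is a hypothetical stand-in, not DHParser API; a
# `None` cache entry plays the same "visit in progress" role as in the method
# above, cutting cycles short.

```python
from typing import Dict, List, Optional

def has_leaf(grammar: Dict[str, List[str]], sym: str,
             state: Optional[dict] = None) -> bool:
    """Return True if at least one leaf (empty sub-parser list) is
    reachable from `sym`; an unresolved cycle counts as False."""
    if state is None:
        state = {}                   # cache: True / False / None (in progress)
    if sym in state:
        return bool(state[sym])      # None (cycle) is treated as False here
    subs = grammar[sym]
    if not subs:
        state[sym] = True            # no sub-parsers: a leaf parser
        return True
    state[sym] = None                # mark "in progress" to cut cycles
    state[sym] = any(has_leaf(grammar, s, state) for s in subs)
    return state[sym]

# 'loop' refers only to itself and could never touch the document:
demo = {'root': ['loop', 'word'], 'loop': ['loop'], 'word': []}
```

# Unlike the method above, this sketch collapses the "unknown" state to False
# instead of re-examining it on later calls.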


def dsl_error_msg(parser: Parser, error_str: str) -> str:
    """
    Returns an error message for errors in the parser configuration,
    e.g. errors that result in infinite loops.

    Args:
        parser (Parser):  The parser where the error was noticed. Note
            that this is not necessarily the parser that caused the
            error but only where the error became apparent.
        error_str (str):  A short string describing the error.
    Returns:
        str: An error message including the call stack if history
        tracking has been turned on in the grammar object.
    """
    msg = ["DSL parser specification error:", error_str, 'Caught by parser "%s".' % str(parser)]
    if parser.grammar.history__:
        msg.extend(["\nCall stack:", parser.grammar.history__[-1].stack])
    else:
        msg.extend(["\nEnable history tracking in Grammar object to display call stack."])
    return " ".join(msg)


GRAMMAR_PLACEHOLDER = Grammar()


########################################################################
#
# Special parser classes: Always, Never, PreprocessorToken (leaf classes)
#
########################################################################

class Always(Parser):
    """A parser that always matches, but does not capture anything."""
    def _parse(self, text: StringView) -> Tuple[Optional[Node], StringView]:
        return EMPTY_NODE, text


class Never(Parser):
    """A parser that never matches."""
    def _parse(self, text: StringView) -> Tuple[Optional[Node], StringView]:
        return None, text


class AnyChar(Parser):
    """A parser that returns the next unicode character of the document,
    whatever it may be. The parser fails only at the very end of the text."""
    def _parse(self, text: StringView) -> Tuple[Optional[Node], StringView]:
        if len(text) >= 1:
            return Node(self.tag_name, text[:1]), text[1:]
        else:
            return None, text


class PreprocessorToken(Parser):
    """
    Parses tokens that have been inserted by a preprocessor.

    Preprocessors can generate Tokens with the ``make_token``-function.
    These tokens start and end with magic characters that can only be
    matched by the PreprocessorToken Parser. Such tokens can be used to
    insert BEGIN - END delimiters at the beginning or ending of a
    quoted block, for example.
    """

    def __init__(self, token: str) -> None:
        assert token and token.isupper()
        assert RX_TOKEN_NAME.match(token)
        super(PreprocessorToken, self).__init__()
        self.pname = token
        if token:
            self.anonymous = False

    def __deepcopy__(self, memo):
        duplicate = self.__class__(self.pname)
        copy_parser_base_attrs(self, duplicate)
        duplicate.anonymous = self.anonymous
        duplicate.tag_name = self.tag_name
        return duplicate

    def _parse(self, text: StringView) -> Tuple[Optional[Node], StringView]:
        if text[0:1] == BEGIN_TOKEN:
            end = text.find(END_TOKEN, 1)
            if end < 0:
                node = Node(self.tag_name, '')  # type: Node
                self.grammar.tree__.new_error(
                    node,
                    'END_TOKEN delimiter missing from preprocessor token. '
                    '(Most likely due to a preprocessor bug!)')
                return node, text[1:]
            elif end == 0:
                node = Node(self.tag_name, '')
                self.grammar.tree__.new_error(
                    node,
                    'Preprocessor-token cannot have zero length. '
                    '(Most likely due to a preprocessor bug!)')
                return node, text[2:]
            elif text.find(BEGIN_TOKEN, 1, end) >= 0:
                node = Node(self.tag_name, text[len(self.pname) + 1:end])
                self.grammar.tree__.new_error(
                    node,
                    'Preprocessor-tokens must not be nested or contain '
                    'BEGIN_TOKEN delimiter as part of their argument. '
                    '(Most likely due to a preprocessor bug!)')
                return node, text[end:]
            if text[1:len(self.pname) + 1] == self.pname:
                if self.drop_content:
                    return EMPTY_NODE, text[end + 1:]
                return Node(self.tag_name, text[len(self.pname) + 2:end]), text[end + 1:]
        return None, text
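# The delimiter layout that _parse above expects can be sketched independently:
# a token is a name and an argument enclosed between magic characters. The
# sentinels and `match_token` below are illustrative placeholders only, not
# DHParser's actual BEGIN_TOKEN/END_TOKEN constants or API.

```python
BEGIN, DELIM, END = '\x1b', '\x1c', '\x1d'   # placeholder sentinels

def match_token(name: str, text: str):
    """Return (argument, rest) if `text` starts with a token named `name`,
    otherwise (None, text)."""
    if text[:1] != BEGIN:
        return None, text
    end = text.find(END, 1)
    if end < 0:
        return None, text            # malformed: END delimiter missing
    if (text[1:1 + len(name)] == name
            and text[1 + len(name):2 + len(name)] == DELIM):
        # argument sits between the one-character delimiter and END
        return text[2 + len(name):end], text[end + 1:]
    return None, text
```

# A matching token thus consumes everything up to and including END, while
# non-token text is left untouched, mirroring the (None, text) failure case.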


########################################################################
#
# Text and Regular Expression parser classes (leaf classes)
#
########################################################################

class Text(Parser):
    """
    Parses plain text strings. (Could be done by RegExp as well, but Text is faster.)

    Example::

        >>> while_token = Text("while")
        >>> Grammar(while_token)("while").content
        'while'
    """
    assert TOKEN_PTYPE == ":Text"

    def __init__(self, text: str) -> None:
        super(Text, self).__init__()
        self.text = text
        self.len = len(text)

    def __deepcopy__(self, memo):
        duplicate = self.__class__(self.text)
        copy_parser_base_attrs(self, duplicate)
        return duplicate

    def _parse(self, text: StringView) -> Tuple[Optional[Node], StringView]:
        if text.startswith(self.text):
            if self.drop_content:
                return EMPTY_NODE, text[self.len:]
            elif self.text or not self.anonymous:
                return Node(self.tag_name, self.text, True), text[self.len:]
            return EMPTY_NODE, text
        return None, text

    def __repr__(self):
        return '`%s`' % abbreviate_middle(self.text, 80)
        # return ("'%s'" if self.text.find("'") <= 0 else '"%s"') % abbreviate_middle(self.text, 80)


class RegExp(Parser):
    r"""
    Regular expression parser.

    The RegExp-parser parses text that matches a regular expression.
    RegExp can also be considered as the "atomic parser", because all
    other parsers delegate part of the parsing job to other parsers,
    but do not match text directly.

    Example::

        >>> word = RegExp(r'\w+')
        >>> Grammar(word)("Haus").content
        'Haus'

    EBNF-Notation:  ``/ ... /``

    EBNF-Example:   ``word = /\w+/``
    """

    def __init__(self, regexp) -> None:
        super(RegExp, self).__init__()
        self.regexp = re.compile(regexp) if isinstance(regexp, str) else regexp

    def __deepcopy__(self, memo):
        # `regex` supports deep copies, but not `re`
        try:
            regexp = copy.deepcopy(self.regexp, memo)
        except TypeError:
            regexp = self.regexp.pattern
        duplicate = self.__class__(regexp)
        copy_parser_base_attrs(self, duplicate)
        return duplicate

    def _parse(self, text: StringView) -> Tuple[Optional[Node], StringView]:
        match = text.match(self.regexp)
        if match:
            capture = match.group(0)
            if capture or not self.anonymous:
                end = text.index(match.end())
                if self.drop_content:
                    return EMPTY_NODE, text[end:]
                return Node(self.tag_name, capture, True), text[end:]
            return EMPTY_NODE, text
        return None, text

    def __repr__(self):
        pattern = self.regexp.pattern
        try:
            if pattern == self._grammar.WSP_RE__:
                return '~'
            elif pattern == self._grammar.COMMENT__:
                return 'comment__'
            elif pattern == self._grammar.WHITESPACE__:
                return 'whitespace__'
        except (AttributeError, NameError):
            pass
        return '/' + escape_control_characters('%s' % abbreviate_middle(pattern, 118))\
            .replace('/', '\\/') + '/'


def DropText(text: str) -> Text:
    return cast(Text, Drop(Text(text)))


def DropRegExp(regexp) -> RegExp:
    return cast(RegExp, Drop(RegExp(regexp)))


def withWS(parser_factory, wsL='', wsR=r'\s*'):
    """Syntactic Sugar for 'Series(Whitespace(wsL), parser_factory(), Whitespace(wsR))'.
    """
    if wsL and isinstance(wsL, str):
        wsL = Whitespace(wsL)
    if wsR and isinstance(wsR, str):
        wsR = Whitespace(wsR)
    if wsL and wsR:
        return Series(wsL, parser_factory(), wsR)
    elif wsL:
        return Series(wsL, parser_factory())
    elif wsR:
        return Series(parser_factory(), wsR)
    else:
        return parser_factory()
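# For illustration, the branching of withWS can be mirrored with plain tuples
# instead of parser objects. `compose` is a hypothetical stand-in, not DHParser
# code: a whitespace side is wrapped into the series only if a non-empty
# pattern was given, otherwise the core parser is returned unwrapped.

```python
def compose(core, wsL='', wsR=''):
    """Tuple-based sketch of withWS' case analysis."""
    left = ('ws', wsL) if wsL else None
    right = ('ws', wsR) if wsR else None
    parts = [p for p in (left, core, right) if p is not None]
    # a single remaining part means neither side had whitespace
    return parts[0] if len(parts) == 1 else ('series', *parts)
```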


def RE(regexp, wsL='', wsR=r'\s*'):
    """Syntactic Sugar for 'Series(Whitespace(wsL), RegExp(regexp), Whitespace(wsR))'"""
    return withWS(lambda: RegExp(regexp), wsL, wsR)


def TKN(token, wsL='', wsR=r'\s*'):
    """Syntactic Sugar for 'Series(Whitespace(wsL), Text(token), Whitespace(wsR))'"""
    return withWS(lambda: Text(token), wsL, wsR)


def DTKN(token, wsL='', wsR=r'\s*'):
    """Syntactic Sugar for 'Series(Whitespace(wsL), DropText(token), Whitespace(wsR))'"""
    return withWS(lambda: Drop(Text(token)), wsL, wsR)


class Whitespace(RegExp):
    """A variant of RegExp that signifies through its class name that it
    is a RegExp-parser for whitespace."""
    assert WHITESPACE_PTYPE == ":Whitespace"

    def __repr__(self):
        return '~'


########################################################################
#
# Meta parser classes, i.e. parsers that contain other parsers
# to which they delegate parsing
#
########################################################################


class CombinedParser(Parser):
    """Class CombinedParser is the base class for all parsers that
    call ("combine") other parsers. It contains functions for the
    optimization of return values of such parsers
    (i.e. descendants of classes UnaryParser and NaryParser).

    The optimization consists in flattening the tree by eliminating
    anonymous nodes. This is the same as what the function
    DHParser.transform.flatten() does, only at an earlier stage.
    The reasoning is that the earlier the tree is reduced, the less work
    remains to do at all later processing stages. As these typically run
    through all nodes of the syntax tree, this results in a considerable
    speed up.
    """

    def _return_value(self, node: Optional[Node]) -> Node:
        """
        Generates a return node if a single node has been returned from
        any descendant parsers. Anonymous empty nodes will be dropped.
        If `self` is an unnamed parser, a non-empty descendant node
        will be passed through. If the descendant node is anonymous,
        it will be dropped and only its result will be kept.
        In all other cases or if the optimization is turned off by
        setting `grammar.flatten_tree__` to False, a new node will be
        generated and the descendant node will be its single child.
        """
        assert node is None or isinstance(node, Node)
        if self._grammar.flatten_tree__:
            if node is not None:
                if self.anonymous:
                    if self.drop_content:
                        return EMPTY_NODE
                    return node
                if node.tag_name[0] == ':':  # faster than node.is_anonymous()
                    return Node(self.tag_name, node._result)
                return Node(self.tag_name, node)
            elif self.anonymous:
                return EMPTY_NODE  # avoid creation of a node object for anonymous empty nodes
            return Node(self.tag_name, ())
        if self.drop_content:
            return EMPTY_NODE
        return Node(self.tag_name, node or ())  # unoptimized code

    @cython.locals(N=cython.int)
    def _return_values(self, results: Tuple[Node, ...]) -> Node:
        """
        Generates a return node from a tuple of returned nodes from
        descendant parsers. Anonymous empty nodes will be removed from
        the tuple. Anonymous child nodes will be flattened if
        `grammar.flatten_tree__` is True.
        """
        assert isinstance(results, tuple)
        if self.drop_content:
            return EMPTY_NODE
        N = len(results)
        if N > 1:
            if self._grammar.flatten_tree__:
                nr = []  # type: List[Node]
                # flatten parse tree
                for child in results:
                    if child.children and child.tag_name[0] == ':':  # faster than c.is_anonymous():
                        nr.extend(child.children)
                    elif child._result or child.tag_name[0] != ':':
                        nr.append(child)
                if nr or not self.anonymous:
                    return Node(self.tag_name, tuple(nr))
                else:
                    return EMPTY_NODE
            return Node(self.tag_name, results)  # unoptimized code
        elif N == 1:
            return self._return_value(results[0])
        elif self._grammar.flatten_tree__:
            if self.anonymous:
                return EMPTY_NODE  # avoid creation of a node object for anonymous empty nodes
            return Node(self.tag_name, ())
        return Node(self.tag_name, results)  # unoptimized code

    def location_info(self) -> str:
        """Returns a description of the location of the parser within the grammar
        for the purpose of transparent error reporting."""
        return '%s%s in definition of "%s" as %s' % (self.pname or '_', self.ptype, self.symbol, str(self))
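# The early flattening performed by _return_values can be sketched
# independently of DHParser's Node class. `DemoNode` and `flatten` below are
# illustrative stand-ins only; as in the method above, a tag name starting
# with ':' marks an anonymous node.

```python
class DemoNode:
    """Bare-bones stand-in for a syntax-tree node."""
    def __init__(self, name, children=(), content=''):
        self.name, self.children, self.content = name, tuple(children), content

def flatten(name, results):
    """Build a node named `name`, hoisting children of anonymous child
    nodes and dropping empty anonymous children."""
    nr = []
    for child in results:
        if child.children and child.name[0] == ':':
            nr.extend(child.children)    # hoist grandchildren into the parent
        elif child.content or child.name[0] != ':':
            nr.append(child)             # keep named or non-empty nodes
    return DemoNode(name, nr)
```

# Hoisting at parse time spares every later tree-traversal stage from
# re-visiting nodes that carry no information of their own.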


class UnaryParser(CombinedParser):
    """
    Base class of all unary parsers, i.e. parsers that contain
    one and only one other parser, like the optional parser for example.

    The UnaryParser base class supplies __deepcopy__ and apply
    methods for unary parsers. The __deepcopy__ method needs
    to be overwritten, however, if the constructor of a derived class
    has additional parameters.
    """

    def __init__(self, parser: Parser) -> None:
        super(UnaryParser, self).__init__()
        assert isinstance(parser, Parser), str(parser)
        self.parser = parser  # type: Parser

    def __deepcopy__(self, memo):
        parser = copy.deepcopy(self.parser, memo)
        duplicate = self.__class__(parser)
        copy_parser_base_attrs(self, duplicate)
        return duplicate

    def sub_parsers(self) -> Tuple['Parser', ...]:
        return (self.parser,)