Commit 4f6c3ae8 authored by Eckhart Arnold

further LaTeX tests

parent c13ed3d3
...@@ -294,13 +294,19 @@ class Parser(ParserBase, metaclass=ParserMetaClass): ...@@ -294,13 +294,19 @@ class Parser(ParserBase, metaclass=ParserMetaClass):
only to their class name, and not to the individual parser. only to their class name, and not to the individual parser.
Parser objects are callable and parsing is done by calling a parser Parser objects are callable and parsing is done by calling a parser
object with the text to parse. If the parser matches it returns object with the text to parse.
a tuple consisting of a node representing the root of the concrete
syntax tree resulting from the match as well as the substring If the parser matches it returns a tuple consisting of a node
`text[i:]` where i is the length of matched text (which can be representing the root of the concrete syntax tree resulting from the
zero in the case of parsers like `ZeroOrMore` or `Optional`). match as well as the substring `text[i:]` where i is the length of
If `i > 0` then the parser has "moved forward". If the parser does matched text (which can be zero in the case of parsers like
not match it returns `(None, text)`. `ZeroOrMore` or `Optional`). If `i > 0` then the parser has "moved
forward".
If the parser does not match it returns `(None, text)`. **Note** that
this is not the same as an empty match `("", text)`. An empty match
can for example be returned by the `ZeroOrMore`-parser in case the
contained parser is repeated zero times.
""" """
ApplyFunc = Callable[['Parser'], None] ApplyFunc = Callable[['Parser'], None]
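The return convention described in the docstring above can be illustrated with a minimal, self-contained sketch. It does not use the library's `Parser` or `Node` classes: `literal` and `optional` are hypothetical stand-ins, and a plain string or tuple takes the place of a node. The point is only the difference between a failed match `(None, text)` and an empty match.

```python
from typing import Callable, Optional, Tuple, Union

# Hypothetical stand-in for a syntax tree node: a matched string or a tuple of children.
Result = Union[str, tuple]
ParseFunc = Callable[[str], Tuple[Optional[Result], str]]


def literal(word: str) -> ParseFunc:
    """Return a parser that matches `word` at the start of the text."""
    def parse(text: str) -> Tuple[Optional[Result], str]:
        if text.startswith(word):
            return word, text[len(word):]   # match: (node, rest of text)
        return None, text                   # no match: (None, unchanged text)
    return parse


def optional(parser: ParseFunc) -> ParseFunc:
    """Return a parser that always matches, possibly with an empty result."""
    def parse(text: str) -> Tuple[Optional[Result], str]:
        node, rest = parser(text)
        if node is None:
            return (), text                 # empty match, not a failure
        return (node,), rest
    return parse


hello = literal("hello")
matched = hello("hello world")          # parser "moved forward"
failed = hello("goodbye")               # first element is None
empty = optional(hello)("goodbye")      # first element is empty, but not None
```

A caller can therefore test `node is None` to detect failure, while an empty result still counts as a successful match that consumed no text.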
...@@ -674,8 +680,17 @@ class Grammar: ...@@ -674,8 +680,17 @@ class Grammar:
stitches.append(Node(None, rest)) stitches.append(Node(None, rest))
result = Node(None, tuple(stitches)) result = Node(None, tuple(stitches))
if any(self.variables__.values()): if any(self.variables__.values()):
result.add_error("Capture-retrieve-stack not empty after end of parsing: " error_str = "Capture-retrieve-stack not empty after end of parsing: " + \
+ str(self.variables__)) str(self.variables__)
if result.children:
# add another child node at the end to ensure that the position
# of the error will be the end of the text. Otherwise, the error
# message above ("...after end of parsing") would appear illogical.
error_node = Node(ZOMBIE_PARSER, '')
error_node.add_error(error_str)
result.result = result.children + (error_node,)
else:
result.add_error(error_str)
result.pos = 0 # calculate all positions result.pos = 0 # calculate all positions
return result return result
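Why the extra zero-length child moves the error position to the end of the text can be seen in a small sketch. `ToyNode` and `assign_positions` are hypothetical simplifications, not the library's `Node` API; the sketch assumes that a node's position is the text offset at which it starts.

```python
from typing import List, Tuple, Union


class ToyNode:
    """Hypothetical, simplified node: either a string leaf or a tuple of children."""

    def __init__(self, result: Union[str, Tuple["ToyNode", ...]]) -> None:
        self.result = result
        self.errors = []  # type: List[str]
        self.pos = 0

    @property
    def children(self) -> Tuple["ToyNode", ...]:
        return self.result if isinstance(self.result, tuple) else ()

    def length(self) -> int:
        """Number of characters covered by this node."""
        if self.children:
            return sum(child.length() for child in self.children)
        return len(self.result)

    def assign_positions(self, start: int = 0) -> None:
        """Give every node the text offset at which it starts."""
        self.pos = start
        for child in self.children:
            child.assign_positions(start)
            start += child.length()


root = ToyNode((ToyNode("abc"), ToyNode("def")))
error_node = ToyNode("")                        # zero-length node
error_node.errors.append("stack not empty after end of parsing")
root.result = root.children + (error_node,)     # append as the last child
root.assign_positions()
```

In this sketch an error attached to `root` would be reported at offset 0, whereas the error on the appended node is reported at offset 6, the end of the text.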
...@@ -886,14 +901,14 @@ class RE(Parser): ...@@ -886,14 +901,14 @@ class RE(Parser):
Regular Expressions with optional leading or trailing whitespace. Regular Expressions with optional leading or trailing whitespace.
The RE-parser parses pieces of text that match a given regular The RE-parser parses pieces of text that match a given regular
expression. Other than the ``RegExp``-Parser it can also skip expression. Other than the ``RegExp``-Parser it can also skip
"implicit whitespace" before or after the matched text. "implicit whitespace" before or after the matched text.
The whitespace is in turn defined by a regular expression. It The whitespace is in turn defined by a regular expression. It should
should be made sure that this expression also matches the empty be made sure that this expression also matches the empty string,
string, e.g. use r'\s*' or r'[\t ]*', but not r'\s+'. If the e.g. use r'\s*' or r'[\t ]*', but not r'\s+'. If the respective
respective parameters in the constructor are set to ``None`` the parameters in the constructor are set to ``None`` the default
default whitespace expression from the Grammar object will be used. whitespace expression from the Grammar object will be used.
Example (allowing whitespace on the right hand side, but not on Example (allowing whitespace on the right hand side, but not on
the left hand side of a regular expression): the left hand side of a regular expression):
...@@ -976,9 +991,8 @@ class RE(Parser): ...@@ -976,9 +991,8 @@ class RE(Parser):
class Token(RE): class Token(RE):
""" """
Class Token parses simple strings. Any regular Class Token parses simple strings. Any regular expression
expression commands will be interpreted as a simple sequence of commands will be interpreted as a simple sequence of characters.
characters.
Other than that class Token is essentially a renamed version of Other than that class Token is essentially a renamed version of
class RE. Because tokens often have a particular semantic different class RE. Because tokens often have a particular semantic different
...@@ -1000,16 +1014,16 @@ class Token(RE): ...@@ -1000,16 +1014,16 @@ class Token(RE):
######################################################################## ########################################################################
# #
# Combinator parser classes (i.e. trunk classes of the parser tree) # Containing parser classes, i.e. parsers that contain other parsers
# to which they delegate (i.e. trunk classes)
# #
######################################################################## ########################################################################
class UnaryOperator(Parser): class UnaryOperator(Parser):
""" """
Base class of all unary parser operators, i.e. parser that Base class of all unary parser operators, i.e. parser that contains
contains one and only one other parser, like the optional one and only one other parser, like the optional parser for example.
parser for example.
The UnaryOperator base class supplies __deepcopy__ and apply The UnaryOperator base class supplies __deepcopy__ and apply
methods for unary parser operators. The __deepcopy__ method needs methods for unary parser operators. The __deepcopy__ method needs
...@@ -1036,10 +1050,10 @@ class NaryOperator(Parser): ...@@ -1036,10 +1050,10 @@ class NaryOperator(Parser):
contains one or more other parsers, like the alternative contains one or more other parsers, like the alternative
parser for example. parser for example.
The NaryOperator base class supplies __deepcopy__ and apply The NaryOperator base class supplies __deepcopy__ and apply methods
methods for n-ary parser operators. The __deepcopy__ method needs for n-ary parser operators. The __deepcopy__ method needs to be
to be overwritten, however, if the constructor of a derived class overwritten, however, if the constructor of a derived class has
has additional parameters. additional parameters.
""" """
def __init__(self, *parsers: Parser, name: str = '') -> None: def __init__(self, *parsers: Parser, name: str = '') -> None:
super(NaryOperator, self).__init__(name) super(NaryOperator, self).__init__(name)
...@@ -1103,6 +1117,19 @@ class Optional(UnaryOperator): ...@@ -1103,6 +1117,19 @@ class Optional(UnaryOperator):
class ZeroOrMore(Optional): class ZeroOrMore(Optional):
"""
`ZeroOrMore` applies a parser repeatedly as long as this parser
matches. Like `Optional` the `ZeroOrMore` parser always matches. In
case of zero repetitions, the empty match `((), text)` is returned.
Examples:
>>> sentence = ZeroOrMore(RE(r'\w+,?')) + Token('.')
>>> Grammar(sentence)('Wo viel der Weisheit, da auch viel des Grämens.').content()
'Wo viel der Weisheit, da auch viel des Grämens.'
EBNF-Notation: `{ ... }`
EBNF-Example: `sentence = { /\w+,?/ } "."`
"""
def __call__(self, text: str) -> Tuple[Node, str]: def __call__(self, text: str) -> Tuple[Node, str]:
results = () # type: Tuple[Node, ...] results = () # type: Tuple[Node, ...]
n = len(text) + 1 n = len(text) + 1
......
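The behaviour documented for `ZeroOrMore` can be sketched with plain functions. This is not the library's implementation: `regex` and `zero_or_more` are hypothetical helpers, whitespace is written into the pattern explicitly instead of relying on the implicit whitespace of `RE`, and the loop guard is an assumption suggested by the `n = len(text) + 1` line in the hunk above.

```python
import re
from typing import Callable, Optional, Tuple

ParseFunc = Callable[[str], Tuple[Optional[tuple], str]]


def regex(pattern: str) -> ParseFunc:
    """Hypothetical leaf parser: match `pattern` at the start of the text."""
    compiled = re.compile(pattern)

    def parse(text: str) -> Tuple[Optional[tuple], str]:
        match = compiled.match(text)
        if match is None:
            return None, text
        return (match.group(0),), text[match.end():]
    return parse


def zero_or_more(parser: ParseFunc) -> ParseFunc:
    """Apply `parser` repeatedly; zero repetitions yield the empty match."""
    def parse(text: str) -> Tuple[Optional[tuple], str]:
        results = ()  # type: tuple
        n = len(text) + 1
        # Continue only while the text keeps shrinking; this guards against
        # an endless loop if the inner parser matches without consuming input.
        while text and len(text) < n:
            n = len(text)
            node, text = parser(text)
            if node is None:
                break
            results += node
        return results, text
    return parse


words = zero_or_more(regex(r'\w+,?\s*'))
some = words("Wo viel der Weisheit.")   # four repetitions, "." is left over
none = words("...")                     # zero repetitions: empty match
```

Even when the inner parser never matches, `zero_or_more` reports success with an empty tuple, so a following parser (here, one for the closing ".") gets its turn on the unchanged text.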
...@@ -197,8 +197,6 @@ class Node: ...@@ -197,8 +197,6 @@ class Node:
# self.pos: int = 0 # continuous updating of pos values wastes a lot of time # self.pos: int = 0 # continuous updating of pos values wastes a lot of time
self._pos = -1 # type: int self._pos = -1 # type: int
self.parser = parser or ZOMBIE_PARSER self.parser = parser or ZOMBIE_PARSER
self.error_flag = any(r.error_flag for r in self._children) \
if self._children else False # type: bool
def __str__(self): def __str__(self):
if self.children: if self.children:
...@@ -242,6 +240,8 @@ class Node: ...@@ -242,6 +240,8 @@ class Node:
self._result = (result,) if isinstance(result, Node) else result or '' # type: StrictResultType self._result = (result,) if isinstance(result, Node) else result or '' # type: StrictResultType
self._children = cast(ChildrenType, self._result) \ self._children = cast(ChildrenType, self._result) \
if isinstance(self._result, tuple) else cast(ChildrenType, ()) # type: ChildrenType if isinstance(self._result, tuple) else cast(ChildrenType, ()) # type: ChildrenType
self.error_flag = any(r.error_flag for r in self._children) \
if self._children else False # type: bool
@property @property
def children(self) -> ChildrenType: def children(self) -> ChildrenType:
......
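The effect of moving the `error_flag` computation from the constructor into the `result` setter can be shown with a stripped-down sketch. `ToyNode` is a hypothetical model that keeps only the flag logic; it assumes, as the hunks above suggest, that the constructor assigns `result` through the property setter.

```python
class ToyNode:
    """Hypothetical, simplified node; only the error-flag logic is modelled."""

    def __init__(self, result=''):
        self.result = result            # goes through the setter below

    @property
    def result(self):
        return self._result

    @result.setter
    def result(self, result):
        self._result = (result,) if isinstance(result, ToyNode) else result or ''
        self._children = self._result if isinstance(self._result, tuple) else ()
        # Recomputed on every assignment, so the flag also reflects children
        # that are attached after the node has been constructed.
        self.error_flag = any(child.error_flag for child in self._children)

    def add_error(self, message):
        self.error_flag = True


parent = ToyNode(ToyNode('a'))
flag_before = parent.error_flag         # no child carries an error yet

bad = ToyNode('b')
bad.add_error('demo error')
parent.result = parent.result + (bad,)  # re-assignment triggers the setter
flag_after = parent.error_flag
```

With the flag computed only in the constructor, `flag_after` would keep its initial value; computing it in the setter lets a later assignment to `result`, such as appending an error node, propagate the flag to the parent.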
...@@ -119,11 +119,11 @@ text = { cfgtext | (BRACKETS //~) }+ ...@@ -119,11 +119,11 @@ text = { cfgtext | (BRACKETS //~) }+
cfgtext = { word_sequence | (ESCAPED //~) }+ cfgtext = { word_sequence | (ESCAPED //~) }+
word_sequence = { TEXTCHUNK //~ }+ word_sequence = { TEXTCHUNK //~ }+
no_command = "\begin{" | "\end" | structural no_command = "\begin{" | "\end" | BACKSLASH structural
blockcmd = /[\\]/ ( ( "begin{" | "end{" ) blockcmd = BACKSLASH ( ( "begin{" | "end{" )
( "enumerate" | "itemize" | "figure" | "quote" ( "enumerate" | "itemize" | "figure" | "quote"
| "quotation" | "tabular") "}" | "quotation" | "tabular") "}"
| structural | begin_generic_block | end_generic_block ) | structural | begin_generic_block | end_generic_block )
structural = "subsection" | "section" | "chapter" | "subsubsection" structural = "subsection" | "section" | "chapter" | "subsubsection"
| "paragraph" | "subparagraph" | "item" | "paragraph" | "subparagraph" | "item"
...@@ -147,7 +147,8 @@ WSPC = /[ \t]+/ # (horizontal) whitespace ...@@ -147,7 +147,8 @@ WSPC = /[ \t]+/ # (horizontal) whitespace
LF = !PARSEP /[ \t]*\n[ \t]*/ # linefeed but not an empty line LF = !PARSEP /[ \t]*\n[ \t]*/ # linefeed but not an empty line
PARSEP = /[ \t]*(?:\n[ \t]*)+\n[ \t]*/ # at least one empty line, i.e. PARSEP = /[ \t]*(?:\n[ \t]*)+\n[ \t]*/ # at least one empty line, i.e.
# [whitespace] linefeed [whitespace] linefeed # [whitespace] linefeed [whitespace] linefeed
EOF = /(?!.)/
LB = /\s*?\n|$/ # backwards line break for Lookbehind-Operator LB = /\s*?\n|$/ # backwards line break for Lookbehind-Operator
# beginning of text marker '$' added for test code # beginning of text marker '$' added for test code
\ No newline at end of file BACKSLASH = /[\\]/
EOF = /(?!.)/ # End-Of-File
...@@ -170,11 +170,11 @@ class LaTeXGrammar(Grammar): ...@@ -170,11 +170,11 @@ class LaTeXGrammar(Grammar):
cfgtext = { word_sequence | (ESCAPED //~) }+ cfgtext = { word_sequence | (ESCAPED //~) }+
word_sequence = { TEXTCHUNK //~ }+ word_sequence = { TEXTCHUNK //~ }+
no_command = "\begin{" | "\end" | structural no_command = "\begin{" | "\end" | BACKSLASH structural
blockcmd = /[\\]/ ( ( "begin{" | "end{" ) blockcmd = BACKSLASH ( ( "begin{" | "end{" )
( "enumerate" | "itemize" | "figure" | "quote" ( "enumerate" | "itemize" | "figure" | "quote"
| "quotation" | "tabular") "}" | "quotation" | "tabular") "}"
| structural | begin_generic_block | end_generic_block ) | structural | begin_generic_block | end_generic_block )
structural = "subsection" | "section" | "chapter" | "subsubsection" structural = "subsection" | "section" | "chapter" | "subsubsection"
| "paragraph" | "subparagraph" | "item" | "paragraph" | "subparagraph" | "item"
...@@ -198,24 +198,26 @@ class LaTeXGrammar(Grammar): ...@@ -198,24 +198,26 @@ class LaTeXGrammar(Grammar):
LF = !PARSEP /[ \t]*\n[ \t]*/ # linefeed but not an empty line LF = !PARSEP /[ \t]*\n[ \t]*/ # linefeed but not an empty line
PARSEP = /[ \t]*(?:\n[ \t]*)+\n[ \t]*/ # at least one empty line, i.e. PARSEP = /[ \t]*(?:\n[ \t]*)+\n[ \t]*/ # at least one empty line, i.e.
# [whitespace] linefeed [whitespace] linefeed # [whitespace] linefeed [whitespace] linefeed
EOF = /(?!.)/
LB = /\s*?\n|$/ # backwards line break for Lookbehind-Operator LB = /\s*?\n|$/ # backwards line break for Lookbehind-Operator
# beginning of text marker '$' added for test code # beginning of text marker '$' added for test code
BACKSLASH = /[\\]/
EOF = /(?!.)/ # End-Of-File
""" """
begin_generic_block = Forward() begin_generic_block = Forward()
block_environment = Forward() block_environment = Forward()
block_of_paragraphs = Forward() block_of_paragraphs = Forward()
end_generic_block = Forward() end_generic_block = Forward()
text_elements = Forward() text_elements = Forward()
source_hash__ = "7f6e1c72047e44b0b39db4d20f5186e2" source_hash__ = "06385bac4dd7cb009bd29712a8fc692c"
parser_initialization__ = "upon instantiation" parser_initialization__ = "upon instantiation"
COMMENT__ = r'%.*(?:\n|$)' COMMENT__ = r'%.*(?:\n|$)'
WSP__ = mixin_comment(whitespace=r'[ \t]*(?:\n(?![ \t]*\n)[ \t]*)?', comment=r'%.*(?:\n|$)') WSP__ = mixin_comment(whitespace=r'[ \t]*(?:\n(?![ \t]*\n)[ \t]*)?', comment=r'%.*(?:\n|$)')
wspL__ = '' wspL__ = ''
wspR__ = WSP__ wspR__ = WSP__
LB = RegExp('\\s*?\\n|$')
EOF = RegExp('(?!.)') EOF = RegExp('(?!.)')
BACKSLASH = RegExp('[\\\\]')
LB = RegExp('\\s*?\\n|$')
PARSEP = RegExp('[ \\t]*(?:\\n[ \\t]*)+\\n[ \\t]*') PARSEP = RegExp('[ \\t]*(?:\\n[ \\t]*)+\\n[ \\t]*')
LF = Series(NegativeLookahead(PARSEP), RegExp('[ \\t]*\\n[ \\t]*')) LF = Series(NegativeLookahead(PARSEP), RegExp('[ \\t]*\\n[ \\t]*'))
WSPC = RegExp('[ \\t]+') WSPC = RegExp('[ \\t]+')
...@@ -225,8 +227,8 @@ class LaTeXGrammar(Grammar): ...@@ -225,8 +227,8 @@ class LaTeXGrammar(Grammar):
NAME = Capture(RE('\\w+')) NAME = Capture(RE('\\w+'))
CMDNAME = RE('\\\\(?:(?!_)\\w)+') CMDNAME = RE('\\\\(?:(?!_)\\w)+')
structural = Alternative(Token("subsection"), Token("section"), Token("chapter"), Token("subsubsection"), Token("paragraph"), Token("subparagraph"), Token("item")) structural = Alternative(Token("subsection"), Token("section"), Token("chapter"), Token("subsubsection"), Token("paragraph"), Token("subparagraph"), Token("item"))
blockcmd = Series(RegExp('[\\\\]'), Alternative(Series(Alternative(Token("begin{"), Token("end{")), Alternative(Token("enumerate"), Token("itemize"), Token("figure"), Token("quote"), Token("quotation"), Token("tabular")), Token("}")), structural, begin_generic_block, end_generic_block)) blockcmd = Series(BACKSLASH, Alternative(Series(Alternative(Token("begin{"), Token("end{")), Alternative(Token("enumerate"), Token("itemize"), Token("figure"), Token("quote"), Token("quotation"), Token("tabular")), Token("}")), structural, begin_generic_block, end_generic_block))
no_command = Alternative(Token("\\begin{"), Token("\\end"), structural) no_command = Alternative(Token("\\begin{"), Token("\\end"), Series(BACKSLASH, structural))
word_sequence = OneOrMore(Series(TEXTCHUNK, RE(''))) word_sequence = OneOrMore(Series(TEXTCHUNK, RE('')))
cfgtext = OneOrMore(Alternative(word_sequence, Series(ESCAPED, RE('')))) cfgtext = OneOrMore(Alternative(word_sequence, Series(ESCAPED, RE(''))))
text = OneOrMore(Alternative(cfgtext, Series(BRACKETS, RE('')))) text = OneOrMore(Alternative(cfgtext, Series(BRACKETS, RE(''))))
......
...@@ -21,7 +21,8 @@ ...@@ -21,7 +21,8 @@
[fail:block_environment] [fail:block_environment]
1 : "\begin{generic}inline environment\end{generic}" 1 : """\begin{generic}inline environment\end{generic}
"""
2 : """\begin{generic}inline environment 2 : """\begin{generic}inline environment
\end{generic} \end{generic}
...@@ -33,7 +34,8 @@ ...@@ -33,7 +34,8 @@
[match:inline_environment] [match:inline_environment]
1 : "\begin{generic}inline environment\end{generic}" 1 : """\begin{generic}inline environment\end{generic}
"""
2 : """\begin{generic}inline environment 2 : """\begin{generic}inline environment
\end{generic} \end{generic}
...@@ -46,3 +48,61 @@ ...@@ -46,3 +48,61 @@
invalid environment \end{generic} invalid environment \end{generic}
""" """
[match:itemize]
1 : \begin{itemize}
\item Items do not need to be
\item separated by empty lines.
\end{itemize}
2 : \begin{itemize}
\item But items may be
\item separated by blank lines.
\item
Empty lines at the beginning of an item will be ignored.
\end{itemize}
3 : \begin{itemize}
\item Items can consist of
several paragraphs.
\item Or of one paragraph
\end{itemize}
4 : \begin{itemize}
\item
\begin{itemize}
\item Item-lists can be nested!
\end{itemize}
\end{itemize}
[fail:itemize]
1 : \begin{itemize}
Free text is not allowed within an itemized environment!
\end{itemize}
[match:enumerate]
1 : \begin{enumerate}
\item Enumerations work just like item-lists.
\item Only that the bullets are numbers.
\end{enumerate}
2: \begin{enumerate}
\item \begin{itemize}
\item Item-lists and
\item Enumeration-lists
\begin{enumerate}
\item can be nested
\item arbitrarily
\end{enumerate}
\item Another item
\end{itemize}
\item Plain numerated item.
\end{enumerate}
...@@ -15,6 +15,16 @@ ...@@ -15,6 +15,16 @@
% or like this comment. % or like this comment.
Comment lines do not break paragraphs. Comment lines do not break paragraphs.
5 : Paragraphs may contain {\em emphasized} or {\bf bold} text.
Most of these commands can have different forms as, for example:
\begin{small} small \end{small} or {\large large}.
6 : Paragraphs may also contain {\xy unknown blocks }.
7 : Paragraphs may contain \xy[xycgf]{unknown} commands.
8 : Unknown \xy commands within paragraphs may be simple
or \xy{complex}.
[fail:paragraph] [fail:paragraph]
1 : \begin{enumerate} 1 : \begin{enumerate}
......