Commit d929b790 authored by di68kap

ebnf.py: documentation extended

parent 3f3fa403
......@@ -46,11 +46,11 @@ generated by compiling the grammar.
As an example, we will realize a json-parser (https://www.json.org/).
Let's start with creating some test-data::
>>> testobj = {'array': [1,2,"string"], 'int': 3, 'bool': False}
>>> testobj = {'array': [1, 2.0, "a string"], 'number': -1.3e+25, 'bool': False}
>>> import json
>>> testdata = json.dumps(testobj)
>>> testdata
'{"array": [1, 2, "string"], "int": 3, "bool": false}'
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'
We define the json-Grammar (see https://www.json.org/) in a
top-down manner in EBNF. We'll use a regular-expression look-alike
......@@ -97,7 +97,7 @@ The structure of a JSON file can easily be described in EBNF::
'string = `"` §_CHARS `"` ~ \\n'\
' _CHARS = /[^"\\\\\]+/ | /\\\\\\\[\/bnrt\\\\\]/ \\n'\
'number = _INT _FRAC? _EXP? ~ \\n'\
' _INT = `-` /[1-9][0-9]+/ | /[0-9]/ \\n'\
' _INT = `-`? ( /[1-9][0-9]+/ | /[0-9]/ ) \\n'\
' _FRAC = `.` /[0-9]+/ \\n'\
' _EXP = (`E`|`e`) [`+`|`-`] /[0-9]+/ \\n'\
'bool = "true" ~ | "false" ~ \\n'\
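The number productions above can be mirrored with a plain Python regular expression (a hypothetical `NUMBER_RE`, for illustration only, not part of DHParser) to check that the grammar covers the test-data:

```python
import re

# Regex mirroring the EBNF productions number = _INT _FRAC? _EXP?
# with the corrected _INT = `-`? ( /[1-9][0-9]+/ | /[0-9]/ ).
# Hypothetical helper, for illustration only.
NUMBER_RE = re.compile(r'-?(?:[1-9][0-9]+|[0-9])(?:\.[0-9]+)?(?:[Ee][+-]?[0-9]+)?')

for sample in ('3', '-1.3e+25', '2.0', '0'):
    assert NUMBER_RE.fullmatch(sample), sample

# Because `-`? now precedes the whole alternative, a negative
# single digit is covered, too:
assert NUMBER_RE.fullmatch('-7')
```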
......@@ -138,7 +138,7 @@ this grammar into executable Python-code, we use the high-level-function
>>> parser = create_parser(grammar, branding="JSON")
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array": [1, 2, "string"], "int": 3, "bool": false}'
'{"array": [1, 2.0, "a string"], "number": -1.3e+25, "bool": false}'
As expected, serializing the content of the resulting syntax-tree yields exactly
the input-string of the parsing process. What we cannot see here is that the
......@@ -158,13 +158,16 @@ captures the first json-array within the syntax-tree::
(:Whitespace " ")
(_element
(number
(_INT "2")))
(_INT "2")
(_FRAC
(:Text ".")
(:RegExp "0"))))
(:Text ",")
(:Whitespace " ")
(_element
(string
(:Text '"')
(_CHARS "string")
(_CHARS "a string")
(:Text '"')))
(:Text "]"))
......@@ -196,7 +199,8 @@ construct of the EBNF-grammar would leave a node in the syntax-tree::
(_element
(number
(_INT
(:RegExp "1"))))
(:Alternative
(:RegExp "1")))))
(:ZeroOrMore
(:Series
(:Text ",")
......@@ -204,7 +208,12 @@ construct of the EBNF-grammar would leave a node in the syntax-tree::
(_element
(number
(_INT
(:RegExp "2")))))
(:Alternative
(:RegExp "2")))
(:Option
(_FRAC
(:Text ".")
(:RegExp "0"))))))
(:Series
(:Text ",")
(:Whitespace " ")
......@@ -212,7 +221,7 @@ construct of the EBNF-grammar would leave a node in the syntax-tree::
(string
(:Text '"')
(_CHARS
(:RegExp "string"))
(:RegExp "a string"))
(:Text '"')))))))
(:Text "]"))
......@@ -223,16 +232,70 @@ is advisable to streamline the syntax-tree as early on as possible,
because the processing time of all subsequent tree-processing stages
increases with the number of nodes in the tree.
Because of this DHParser offers further means of simplifying
Because of this, DHParser offers further means of simplifying
syntax-trees during the parsing stage already. These are not turned
on by default, because they allow dropping content or removing named
nodes from the tree; they must be turned on by "directives" that
are listed at the top of an EBNF-grammar and that guide the
parser-generation process. DHParser-directives always start with an
`@`-sign. For example, the `@drop`-directive advises the parser to
drop certain nodes ::
drop certain nodes entirely, including their content. In the following
example, the parser is directed to drop all insignificant whitespace::
>>> drop_insignificant_wsp = '@drop = whitespace \\n'
Directives look similar to productions, except that a list of
parameters follows on the right-hand side of the equal sign. In the
case of the drop-directive, these can either be the names of
non-anonymous nodes that shall be dropped or one of four particular
classes of anonymous nodes (`strings`, `backticked`, `regexp`,
`whitespace`) that will be dropped.
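The effect of dropping whitespace can be illustrated independently of DHParser with a toy tokenizer that matches whitespace but never adds it to the result (a sketch of the concept, not DHParser's implementation):

```python
import re

# Toy tokenizer: whitespace is recognized but discarded on the spot,
# mimicking what the @drop-directive achieves during parsing.
TOKEN_RE = re.compile(r'(?P<WS>\s+)|(?P<TEXT>[^\s,\[\]]+)|(?P<DELIM>[,\[\]])')

def tokenize_dropping_ws(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != 'WS':      # drop insignificant whitespace
            tokens.append(m.group())
    return tokens

print(tokenize_dropping_ws('[1, 2.0, true]'))
# -> ['[', '1', ',', '2.0', ',', 'true', ']']
```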
Another useful directive advises the parser to treat named nodes as
anonymous nodes and to eliminate them accordingly during parsing. This
is useful if we have introduced certain names in our grammar
only as placeholders to render the definition of the grammar a bit
more readable, not because we are interested in the text that is
captured by the production associated with them in their own right::
>>> anonymize_symbols = '@ anonymous = /_\w+/ \\n'
Instead of passing a comma-separated list of symbols to the directive,
which would also have been possible, we have leveraged our convention
to prefix unimportant symbols with an underscore "_" by specifying the
symbols that shall be anonymized with a regular expression.
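Which symbols the regular expression `/_\w+/` selects can be verified directly with Python's `re` module (the symbol list here just mirrors the names used in the example grammar):

```python
import re

anonymous_re = re.compile(r'_\w+')

# Symbols from the example grammar above; only the underscore-prefixed
# placeholders are matched and hence anonymized.
symbols = ['json', '_element', 'number', '_INT', '_FRAC', '_EXP', 'string', '_CHARS']
anonymized = [s for s in symbols if anonymous_re.fullmatch(s)]
print(anonymized)
# -> ['_element', '_INT', '_FRAC', '_EXP', '_CHARS']
```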
Now, let's examine the effect of these two directives::
>>> grammar = drop_insignificant_wsp + anonymize_symbols + grammar
>>> parser = create_parser(grammar, 'JSON')
>>> syntax_tree = parser(testdata)
>>> syntax_tree.content
'{"array":[1,2.0,"a string"],"number":-1.3e+25,"bool":false}'
You might have noticed that all insignificant whitespace adjacent to
the delimiters has been removed this time (but, of course, not the
significant whitespace between "a" and "string" in "a string"). The
difference these two directives make is even more obvious if we
look at (a section of) the syntax-tree::
>>> print(syntax_tree.pick('array').as_sxpr(compact=True))
(array
(:Text "[")
(number "1")
(:Text ",")
(number
(:RegExp "2")
(:Text ".")
(:RegExp "0"))
(:Text ",")
(string
(:Text '"')
(:RegExp "a string")
(:Text '"'))
(:Text "]"))
>>>
"""
......
......@@ -381,7 +381,8 @@ class Parser:
tag_name: The tag_name for the nodes that are created by
the parser. If the parser is named, this is the same as
`pname`, otherwise it is the name of the parser's type.
`pname`, otherwise it is the name of the parser's type
prefixed with a colon ":".
visited: Mapping of places this parser has already been to
during the current parsing process onto the results the
......@@ -1964,7 +1965,7 @@ class CombinedParser(Parser):
if self.drop_content:
return EMPTY_NODE
return node
if node.tag_name[0] == ':': # faster than node.is_anonymous()
if node.anonymous:
return Node(self.tag_name, node._result)
return Node(self.tag_name, node)
elif self.anonymous:
......@@ -1991,9 +1992,9 @@ class CombinedParser(Parser):
nr = [] # type: List[Node]
# flatten parse tree
for child in results:
if child.children and child.tag_name[0] == ':': # faster than c.is_anonymous():
if child.children and child.anonymous:
nr.extend(child.children)
elif child._result or child.tag_name[0] != ':':
elif child._result or not child.anonymous:
nr.append(child)
if nr or not self.anonymous:
return Node(self.tag_name, tuple(nr))
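The flattening step above can be sketched in isolation: anonymous child nodes (tag names prefixed with a colon) are spliced into the parent, while named or non-empty nodes are kept. This is a simplified model with a minimal stand-in `Node`, not DHParser's actual class:

```python
from typing import List, Tuple

class Node:
    # Minimal stand-in for DHParser's Node, for illustration only.
    def __init__(self, tag_name: str, children: Tuple['Node', ...] = (),
                 result: str = ''):
        self.tag_name = tag_name
        self.children = children
        self.result = result

    @property
    def anonymous(self) -> bool:
        # Anonymous nodes carry the parser's type name prefixed with ":".
        return self.tag_name[:1] == ':'

def flatten(results: Tuple[Node, ...]) -> List[Node]:
    nr: List[Node] = []
    for child in results:
        if child.children and child.anonymous:
            nr.extend(child.children)   # splice anonymous containers
        elif child.result or not child.anonymous:
            nr.append(child)            # keep named or non-empty nodes
    return nr

tree = (Node(':Series', (Node('number', result='1'), Node(':Text', result=','))),
        Node('string', result='"a"'))
print([n.tag_name for n in flatten(tree)])
# -> ['number', ':Text', 'string']
```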
......@@ -3302,7 +3303,7 @@ class Synonym(UnaryParser):
if not self.anonymous:
if node is EMPTY_NODE:
return Node(self.tag_name, ''), text
if node.tag_name[:1] == ':':
if node.anonymous:
# eliminate anonymous child-node on the fly
node.tag_name = self.tag_name
else:
......
......@@ -392,7 +392,7 @@ def grammar_unit(test_unit, parser_factory, transformer_factory, report='REPORT'
if not get_config_value('test_parallelization'):
print(' Testing parser: ' + parser_name)
track_history = False
track_history = get_config_value('history_tracking')
try:
if has_lookahead(parser_name):
set_tracer(all_descendants(parser[parser_name]), trace_history)
......
......@@ -46,7 +46,7 @@ ESCAPE = /\\[\/bnrt\\]/ | UNICODE
UNICODE = "\u" HEX HEX
HEX = /[0-9a-fA-F][0-9a-fA-F]/
INT = [NEG] /[1-9][0-9]+/ | /[0-9]/
INT = [NEG] ( /[1-9][0-9]+/ | /[0-9]/ )
NEG = `-`
FRAC = [ DOT /[0-9]+/ ]
DOT = `.`
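The point of the added parentheses can be seen with plain regular expressions: in the old rule the optional NEG only governed the multi-digit branch, so a negative single digit could not be parsed.

```python
import re

# Old rule: [NEG] bound to the first alternative only.
old_int = re.compile(r'-?[1-9][0-9]+|[0-9]')
# New rule: [NEG] governs both alternatives.
new_int = re.compile(r'-?(?:[0-9]|[1-9][0-9]+)')

assert old_int.fullmatch('-42')
assert not old_int.fullmatch('-3')   # negative single digit: no match
assert new_int.fullmatch('-3')
assert new_int.fullmatch('-42')
```

(Note that regular expressions backtrack between alternatives, so the order of the branches does not matter here the way it would in a PEG-style ordered choice.)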
......
......@@ -79,7 +79,7 @@ class jsonGrammar(Grammar):
r"""Parser for a json source file.
"""
_element = Forward()
source_hash__ = "daa269448372c300359c9c6875a23031"
source_hash__ = "bd32b246b5aa5fbdb1e18ac24d1da53b"
anonymous__ = re.compile('_[A-Za-z]+|[A-Z]+')
static_analysis_pending__ = [] # type: List[bool]
parser_initialization__ = ["upon instantiation"]
......@@ -90,22 +90,22 @@ class jsonGrammar(Grammar):
wsp__ = Whitespace(WSP_RE__)
dwsp__ = Drop(Whitespace(WSP_RE__))
_EOF = NegativeLookahead(RegExp('.'))
EXP = Option(Series(Alternative(Drop(Text("E")), Drop(Text("e"))), Option(Alternative(Drop(Text("+")), Drop(Text("-")))), RegExp('[0-9]+')))
EXP = Option(Series(Alternative(Text("E"), Text("e")), Option(Alternative(Text("+"), Text("-"))), RegExp('[0-9]+')))
DOT = Text(".")
FRAC = Option(Series(DOT, RegExp('[0-9]+')))
NEG = Text("-")
INT = Alternative(Series(Option(NEG), RegExp('[1-9][0-9]+')), RegExp('[0-9]'))
INT = Series(Option(NEG), Alternative(RegExp('[1-9][0-9]+'), RegExp('[0-9]')))
HEX = RegExp('[0-9a-fA-F][0-9a-fA-F]')
UNICODE = Series(Series(Drop(Text("\\u")), dwsp__), HEX, HEX)
ESCAPE = Alternative(RegExp('\\\\[/bnrt\\\\]'), UNICODE)
PLAIN = RegExp('[^"\\\\]+')
_CHARACTERS = ZeroOrMore(Alternative(PLAIN, ESCAPE))
null = Series(Text("null"), dwsp__)
false = Series(Drop(Text("false")), dwsp__)
true = Series(Drop(Text("true")), dwsp__)
false = Series(Text("false"), dwsp__)
true = Series(Text("true"), dwsp__)
_bool = Alternative(true, false)
number = Series(INT, FRAC, EXP, dwsp__)
string = Series(Drop(Text('"')), _CHARACTERS, Drop(Text('"')), dwsp__, mandatory=1)
string = Series(Text('"'), _CHARACTERS, Text('"'), dwsp__, mandatory=1)
array = Series(Series(Drop(Text("[")), dwsp__), Option(Series(_element, ZeroOrMore(Series(Series(Drop(Text(",")), dwsp__), _element)))), Series(Drop(Text("]")), dwsp__), mandatory=2)
member = Series(string, Series(Drop(Text(":")), dwsp__), _element, mandatory=1)
object = Series(Series(Drop(Text("{")), dwsp__), member, ZeroOrMore(Series(Series(Drop(Text(",")), dwsp__), member, mandatory=1)), Series(Drop(Text("}")), dwsp__), mandatory=3)
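The generated class composes the parser out of combinators such as `Series`, `Alternative` and `Option`. The underlying principle can be sketched with minimal stand-ins (not DHParser's actual classes): each parser maps a text and position to a match and a new position, or `None` on failure.

```python
import re

# Minimal parser combinators illustrating the Series/Alternative/Option
# principle used by the generated grammar class (illustration only).
def Text(s):
    return lambda t, i: (s, i + len(s)) if t.startswith(s, i) else None

def RegExp(pattern):
    rx = re.compile(pattern)
    def parse(t, i):
        m = rx.match(t, i)
        return (m.group(), m.end()) if m else None
    return parse

def Alternative(*parsers):
    def parse(t, i):
        for p in parsers:
            r = p(t, i)
            if r:
                return r
        return None
    return parse

def Series(*parsers):
    def parse(t, i):
        parts = []
        for p in parsers:
            r = p(t, i)
            if r is None:
                return None
            s, i = r
            parts.append(s)
        return (''.join(parts), i)
    return parse

def Option(p):
    return lambda t, i: p(t, i) or ('', i)

# INT = `-`? ( /[1-9][0-9]+/ | /[0-9]/ ), as in the corrected grammar:
INT = Series(Option(Text("-")),
             Alternative(RegExp('[1-9][0-9]+'), RegExp('[0-9]')))
print(INT('-42', 0))  # -> ('-42', 3)
print(INT('7', 0))    # -> ('7', 1)
```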
......
......@@ -66,7 +66,7 @@ ESCAPE = /\\[\/bnrt\\]/ | UNICODE
UNICODE = "\u" HEX HEX
HEX = /[0-9a-fA-F][0-9a-fA-F]/
INT = [ NEG ] /[0-9]/ | /[1-9][0-9]+/
INT = [ NEG ] ( /[0-9]/ | /[1-9][0-9]+/ )
NEG = `-`
FRAC = [ DOT /[0-9]+/ ]
DOT = `.`
......
......@@ -110,6 +110,7 @@ M2: 1.1
M3: 0
M4: 1.43E+22
M5: 20
M6: -1.3e+25
[ast:number]
......
......@@ -18,6 +18,7 @@ try:
from DHParser import dsl
import DHParser.log
from DHParser import testing
from DHParser.configuration import set_config_value, access_presets, set_preset_value, finalize_presets
except ModuleNotFoundError:
print('Could not import DHParser. Please adjust sys.path in file '
'"%s" manually' % __file__)
......@@ -52,6 +53,10 @@ if __name__ == '__main__':
if len(argv) > 1 and sys.argv[1] == "--debug":
LOGGING = 'LOGS'
del argv[1]
access_presets()
set_preset_value('history_tracking', True)
finalize_presets()
DHParser.log.start_logging(LOGGING)
if (len(argv) >= 2 and (argv[1].endswith('.ebnf') or
os.path.splitext(argv[1])[1].lower() in testing.TEST_READERS.keys())):
# if called with a single filename that is either an EBNF file or a known
......