DHParser's Step by Step Guide
*****************************

This step by step guide goes through the whole process of designing and
testing a domain specific notation from the very start. (The terms "domain
specific notation" and "domain specific language" are used interchangeably
in the
following. Both will be abbreviated by "DSL", however.) We will design a
simple domain specific notation for poems as a teaching example. On the way
we will learn:

1. how to setup a new DHParser project

2. how to use the test driven development approach to designing a DSL

3. how to describe the syntax of a DSL with the EBNF-notation

4. how to specify the transformations for converting the concrete syntax tree
   that results from parsing a text written in our DSL into an abstract
   syntax tree that represents, or comes close to representing, our data
   model.

5. how to write a compiler that transforms the abstract syntax tree into a
   target representation, which might be an HTML-page, an EPUB or a printable
   PDF in the case of typical Digital-Humanities-projects.


Setting up a new DHParser project
=================================

Since DHParser, while quite mature in terms of implemented features, is still
in a pre-first-release state, it is for the time being advisable to clone the
most current version of DHParser from the git repository rather than to
install the package from the Python Package Index (PyPI).

This section takes you from cloning the DHParser git repository to setting up
a new DHParser-project in the ``experimental``-subdirectory and testing
whether the setup works. Similarly to current web development practices, most
of the work with DHParser is done from the shell. In the following, we assume
a Unix (Linux) environment. The same can most likely be done on other
operating systems in a very similar way, but there might be subtle
differences.

Installing DHParser from the git repository
-------------------------------------------

In order to install DHParser from the git repository, open up a shell window
and type::

    $ git clone https://gitlab.lrz.de/badw-it/DHParser.git
    $ cd DHParser

The second command changes to the DHParser directory. Within this directory
you should recognise the following subdirectories and files. There are more
files and directories for sure, but those will not concern us for now::

    DHParser/            - the DHParser python packages
    documentation/       - DHParser's documentation in html-form
    documentation_source - DHParser's documentation in reStructuredText-format
    examples/            - some examples for DHParser (mostly incomplete)
    experimental/        - an empty directory for experimenting
    test/                - DHParser's unit-tests
    dhparser.py          - DHParser's command line tool for setting up projects
    README.md            - General information about DHParser
    LICENSE.txt          - DHParser's license. It's open source (hooray!)
    Introduction.md      - An introduction and appetizer for DHParser

In order to verify that the installation works, you can simply run the
"dhparser.py" script and, when asked, choose "3" for the self-test. If the
self-test runs through without error, the installation has succeeded.

Starting a new DHParser project
-------------------------------

In order to set up a new DHParser project, you run the ``dhparser.py``-script
with the name of the new project. For the sake of the example, let's type::

    $ python dhparser.py experimental/poetry
    $ cd experimental/poetry

This creates a new DHParser-project with the name "poetry" in a directory of
the same name within the subdirectory "experimental". This new directory
contains the following files::

    README.md           - a stub for a readme-file explaining the project
    poetry.ebnf         - a trivial demo grammar for the new project
    example.dsl         - an example file written in this grammar
    tst_poetry_grammar.py - a python script ("test-script") that re-compiles
                            the grammar (if necessary) and runs the unit tests
    grammar_tests/01_test_word.ini     - a demo unit test
    grammar_tests/02_test_document.ini - another unit test

Now, if you look into the file "example.dsl" you will find that it contains a
simple sequence of words, namely "Life is but a walking shadow". In fact, the
demo grammar that comes with a newly created project is nothing but a simple
grammar for sequences of words separated by whitespace. Now, since we already
have unit tests, our first exercise will be to run the unit tests by starting
the script "tst_poetry_grammar.py"::

    $ python tst_poetry_grammar.py

This will run through the unit-tests in the grammar_tests directory and print
their success or failure on the screen. If you check the contents of your
project directory after running the script, you might notice that there now
exists a new file "poetryCompiler.py" in the project directory. This is an
auto-generated compiler-script for our DSL. You can use this script to compile
any source file of your DSL, like "example.dsl". Let's try::

    $ python poetryCompiler.py example.dsl

The output is a block of pseudo-XML, looking like this::

    <document>
     <:ZeroOrMore>
       <WORD>
         <:RegExp>Life</:RegExp>
         <:Whitespace> </:Whitespace>
       </WORD>
       <WORD>
         <:RegExp>is</:RegExp>
         <:Whitespace> </:Whitespace>
       </WORD>
    ...

Now, this does not look too helpful yet, partly because it is cluttered with
all sorts of seemingly superfluous pseudo-XML-tags like "<:ZeroOrMore>".
However, you might notice that it contains the original sequence of words
"Life is but a walking shadow" in a structured form, where each word is
(among other things) surrounded by <WORD>-tags. In fact, the output of the
compiler script is a pseudo-XML-representation of the *concrete syntax tree*
of our "example.dsl"-document according to the grammar specified in "poetry.ebnf"
(which we haven't looked into yet, but we will do so soon).

If you see the pseudo-XML on screen, the setup of the new DHParser-project
has been successful.

Understanding how compilation of DSL-documents with DHParser works
------------------------------------------------------------------

Generally speaking, the compilation process consists of three stages:

1. Parsing a document. This yields a *concrete syntax tree* (CST) of the
   document.

2. Transforming. This transforms the CST into the much more concise *abstract
   syntax tree* (AST) of the document.

3. Compiling. This turns the AST into anything you'd like, for example, an
   XML-representation or a relational database record.
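
The three stages can be pictured with a toy pipeline in plain Python. This
has nothing to do with DHParser's actual API - the function names and data
structures below are made up purely for illustration - but it shows how each
stage consumes the output of the previous one:

```python
import re

def parse(text):
    # Stage 1: parsing yields a concrete syntax tree (CST). A real CST
    # records every syntactic detail; this toy version keeps only words.
    return ("document", [("WORD", w) for w in re.findall(r"\w+", text)])

def transform(cst):
    # Stage 2: the AST-transformation condenses the CST into an abstract
    # syntax tree; here we simply keep the bare words.
    tag, children = cst
    return [word for _, word in children]

def compile_ast(ast):
    # Stage 3: compiling turns the AST into a target representation,
    # e.g. an XML-string.
    return "<document>" + "".join(f"<WORD>{w}</WORD>" for w in ast) + "</document>"

print(compile_ast(transform(parse("Life is but a walking shadow"))))
```

In DHParser only the first stage is generated automatically from the
grammar; the other two stages are what you will write by hand later on.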

Now, DHParser can fully automate the generation of a parser from a
syntax-description in EBNF-form, like our "poetry.ebnf", but it cannot
automate the transformation from the concrete into the abstract syntax tree
(which for the sake of brevity we will simply call "AST-Transformation" in
the following), and neither can it automate the compilation of the abstract
syntax tree into something more useful. Therefore, the AST-Transformation in
the autogenerated compile-script is simply left empty, while the compiling
stage simply converts the syntax tree into a pseudo-XML-format.

The latter two stages have to be coded into the compile-script by hand, with
the support of templates within this script. If the grammar of the DSL is
changed - as it frequently will be during the development of a DSL - the
parser-part of this script will be regenerated by the testing-script before
the unit tests are run. The script will notice if the grammar has changed.
This also means that the parser-part of this script will be overwritten and
should never be edited by hand. The other two stages can and should be
edited by hand. Stubs for these parts of the compile-script will only be
generated if the compile-script does not yet exist, that is, on the very
first run of the test-script.

Usually, if you have adjusted the grammar, you will want to run the unit tests
anyway. Therefore, the regeneration of the parser-part of the compile-script
is triggered by the test-script.

The development workflow for DSLs
---------------------------------

When developing a domain specific notation, it is advisable first to develop
the grammar and the parser for that notation, then the abstract syntax tree
transformations, and finally the compiler. Of course one can always come
back and change the grammar later. But in order to avoid revising the
AST-transformations and the compiler time and again, it helps if the grammar
has been worked out before. A bit of interlocking between these steps does
not hurt, though.

A reasonable workflow for developing the grammar proceeds like this:

1. Set out by writing down a few example documents for your DSL. It is
   advisable to start with a few simple examples that use only a subset of the
   intended features of your DSL.

2. Next you sketch a grammar for your DSL that is just rich enough to capture
   those examples.

3. Right after sketching the grammar you should write test cases for your
   grammar. The test cases can be small parts or snippets of your example
   documents. You could also use your example documents as test cases, but
   usually the test cases should have a smaller granularity to make locating
   errors easier.

4. Next, you should run the test script. Usually, some tests will fail on
   the first attempt. So you'll keep revising the EBNF-grammar, adjusting and
   adding test cases until all tests pass.

5. Now it is time to try and compile the example documents. By this time the
   test-script should have generated the compile-script, which can be called
   with the example documents. Don't worry too much about the output yet.
   What is important at this stage is merely whether the parser can handle
   the examples or not. If not, further test cases and adjustments to the
   EBNF-grammar will be needed - or a revision of the examples, in case you
   decide to use different syntactic constructs.

   If all examples can be parsed, you go back to step one and add further,
   more complex examples, and continue to do so until you have the feeling
   that your DSL's grammar is rich enough for all intended application cases.

Let's try this with the trivial demo example that comes with creating a new
project with the "dhparser.py"-script. You have already seen that the
"example.dsl"-document merely contains a simple sequence of words: "Life is
but a walking shadow". Now, wouldn't it be nice if we could end this
sequence with a full stop to turn it into a proper sentence? So, open
"example.dsl" with a text editor and add a full stop::

    Life is but a walking shadow.

Now, try to compile "example.dsl" with the compile-script::

    $ python poetryCompiler.py example.dsl
    example.dsl:1:29: Error: EOF expected; ".\n " found!

Since the grammar obviously did not allow full stops so far, the parser
returns an error message. The error message is pretty self-explanatory in
this case. (Often, you will unfortunately find that error messages are
somewhat difficult to decipher, in particular because an error the parser
complains about may just be the consequence of an error made at an earlier
location that the parser was not able to recognize as such. We will learn
more about how to avoid such situations later.) "EOF" is actually the name
of a parser that captures the end of the file, hence "EOF"! But instead of
the expected end of file an, as of yet, unparsable construct was found,
namely a full stop followed by a line feed, signified by "\n".

Let's have a look into the grammar description "poetry.ebnf". We ignore the
beginning of the file, in particular all lines starting with "@", as these
lines do not represent any grammar rules, but meta rules or so-called
"directives" that determine some general characteristics of the grammar,
such as whitespace-handling or whether the parser is going to be
case-sensitive. Now, there are exactly three rules that make up this
grammar::

    document = ~ { WORD } §EOF
    WORD     =  /\w+/~
    EOF      =  !/./

EBNF-Grammars describe the structure of a domain specific notation in
top-down fashion. Thus, the first rule in the grammar describes the
components out of which a text or document in the domain specific notation
is composed as a whole. The following rules then break down the components
into even smaller components until, finally, only atomic components are
left, which are described by matching rules. Matching rules are rules that
do not refer to other rules any more. They consist of string literals or
regular expressions that "capture" the sequences of characters which form
the atomic components of our DSL. Rules in general always consist of a
symbol on the left hand side of a "="-sign (which in this context can be
understood as a definition signifier) and the definition of the rule on the
right hand side.

.. note:: Traditional parser technology for context-free grammars often
    distinguishes two phases, *scanning* and *parsing*, where a lexical scanner
    would take a stream of characters and yield a sequence of tokens and the
    actual parser would then operate on the stream of tokens. DHParser,
    however, is an instance of a *scannerless parser* where the functionality
    of the lexical scanner is seamlessly integrated into the
    parser. This is done by allowing regular expressions in the definiens of
    grammar symbols. The regular expressions do the work of the lexical
    scanner.

    Theoretically, one could do without scanners or regular expressions.
    Because regular languages are a subset of context-free languages, parsers
    for context-free languages can do all the work that regular expressions can
    do. But it makes things easier - and, in the case of DHParser, also faster
    - to have them.
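
The following sketch, made up for this guide (it is emphatically not
DHParser's implementation), shows the principle for the demo grammar above:
the recognizer applies regular expressions directly to the character stream
while parsing, instead of consuming tokens from a separate scanner:

```python
import re

def recognize_document(text):
    """Scannerless recognizer for:  document = ~ { WORD } EOF
    The regular expressions take over the job of the lexical scanner."""
    pos = re.match(r"\s*", text).end()          # "~": optional leading whitespace
    while True:
        word = re.match(r"\w+\s*", text[pos:])  # WORD = /\w+/~
        if not word:
            break
        pos += word.end()
    return pos == len(text)                     # EOF: no character may follow

print(recognize_document("Life is but a walking shadow"))   # True
print(recognize_document("Life is but a walking shadow."))  # False: "." unmatched
```
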

In our case the text as a whole, conveniently named "document" (any other
name would be allowed, too), consists of leading whitespace and a possibly
empty sequence of an arbitrary number of words, ending only when the end of
file has been reached. Whitespace in DHParser-grammars is always denoted by
a tilde "~". Thus, the definiens of the rule "document" starts with a "~" on
the right hand side of the definition sign ("="). Next, you find the symbol
"WORD"
enclosed in braces. "WORD", like any symbol composed of letters in DHParser,
refers to another rule further below that defines what words are. The
meaning of the braces is that whatever is enclosed by them may be repeated
zero or more times. Thus the expression "{ WORD }" describes a sequence of
arbitrarily many repetitions of WORD, whatever WORD may be. Finally, EOF
refers to yet another rule defined further below. We do not yet know what
EOF is, but we know that when the sequence of words ends, it must be
followed by an EOF. The paragraph sign "§" in front of EOF means that it is
absolutely mandatory that the sequence of WORDs is followed by an EOF. If it
isn't, the program issues
an error message. Without the "§"-sign the parser simply would not match,
which in itself is not considered an error.

Now, let's look at our two matching rules. Both of these rules contain regular
expressions. If you do not know about regular expressions yet, you should head
over to an explanation or tutorial on regular expressions, like
https://docs.python.org/3/library/re.html, before continuing, because we are
not going to discuss them here. In DHParser-Grammars regular expressions are
enclosed by simple forward slashes "/". Everything between two forward slashes
is a regular expression as it would be understood by Python's "re"-module.
Thus the rule ``WORD = /\w+/~`` means that a word consists of a sequence of
letters, numbers or underscores '_' that is at least one character long.
This is what the regular expression "\w+" inside the slashes means. In
regular expressions, "\w" stands for word-characters and "+" means that the
previous character class can be repeated one or more times. The tilde "~"
following the regular expression we already know. It means that a word can
be followed by whitespace. Strictly speaking, that whitespace is part of
"WORD" as it is defined here.
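
You can try out the behaviour of this regular expression directly with
Python's ``re``-module. The trailing ``\s*`` stands in here for the
insignificant whitespace that the ``~`` captures (a simplification, since
DHParser's whitespace handling is configurable):

```python
import re

WORD = re.compile(r"\w+\s*")   # the regular expression from  WORD = /\w+/~

m = WORD.match("Life is but a walking shadow")
print(repr(m.group()))             # 'Life ' - the word plus its trailing whitespace
print(WORD.match("...") is None)   # True - dots are not word-characters
```
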

Similarly, the EOF (for "end of file") symbol is defined by a rule that
consists of a simple regular expression, namely ".". The dot in regular
expressions means any character. However, the regular expression itself is
preceded by an exclamation mark "!". In DHParser-grammars, the exclamation
mark means "not". Therefore the whole rule means that *no* character must
follow. Since this is true only for the end of the file, the parser looking
for EOF will only match if the very end of the file has been reached.
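
The effect of ``!/./`` can be reproduced with a negative lookahead in
Python's ``re``-module. Again, this is only an analogy, not DHParser's
internal mechanics:

```python
import re

# "!/./" amounts to a negative lookahead: succeed only if no character follows.
# (re.DOTALL makes "." match newlines as well, so that a trailing line feed
# does not count as "end of file" - a simplification for this illustration.)
EOF = re.compile(r"(?!.)", re.DOTALL)

print(EOF.match("remaining text") is None)   # True: characters still follow
print(EOF.match("") is not None)             # True: matches at the very end
```
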

Now, what would be the easiest way to allow our sequence of words to be
ended like a real sentence with a dot "."? As always when defining grammars,
one can think of different choices to implement this requirement. One
possible solution is to add a dot-literal before the "§EOF"-component at the
end of the definition of the "document"-rule. So let's do that. Change the
line where the "document"-rule is defined to::

    document = ~ { WORD } "." §EOF

As you can see, string-literals are simply denoted as strings between
inverted commas in DHParser's variant of the EBNF-grammar. Now, before we
can compile the file "example.dsl", we have to regenerate our parser,
because we have changed the grammar. In order to recompile, we simply run
the test-script again::

    $ python tst_poetry_grammar.py

But what is that? A whole lot of error messages? Well, this is not
surprising: because we changed the grammar, some of our old test-cases fail
with the new grammar. So we will have to update our test-cases. Actually,
the grammar gets compiled nevertheless, and we could just ignore the test
failures and carry on with compiling our "example.dsl"-file again. But, for
this time, we'll follow good practice and adjust the test cases. So open the
test that failed, "grammar_tests/02_test_document.ini", in the editor and
add full stops at the end of the "match"-cases and remove the full stop at
the end of the "fail"-case::

    [match:document]
    M1: """This is a sequence of words
       extending over several lines."""
    M2: """  This sequence contains leading whitespace."""

    [fail:document]
    F1: """This test should fail, because neither
   comma nor full stop have been defined anywhere"""

The format of the test-files should be pretty self-explanatory. It is a
simple ini-file, where the section markers hold the name of the grammar-rule
to be tested, preceded by either "match" or "fail". "match" means that the
following examples should be matched by the grammar-rule. "fail" means that
they should *not* match. It is just as important that a grammar-rule does
not match those strings it should not match as it is that it matches those
strings that it should match. The individual test-cases all get a name,
in this case M1, M2, F1, but if you prefer more meaningful names this is
also possible. (Beware, however, that the names for the match-tests must be
different from the names for the fail-tests for the same rule!) Now, run the
test-script again
and you'll see that no errors get reported any more.
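
Since the test files are ordinary ini-files, their naming scheme can be
inspected with Python's standard ``configparser``-module. This is only a
sketch of the section layout; DHParser's own test runner does more than this
(for instance, ``configparser`` lowercases the test names by default, which
a real runner would avoid):

```python
import configparser

ini_text = '''
[match:document]
M1: """This is a sequence of words
   extending over several lines."""

[fail:document]
F1: """This test should fail, because no full stop has been defined."""
'''

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)

for section in cfg.sections():
    kind, rule = section.split(":")      # "match" or "fail", then the rule name
    for name in cfg[section]:            # configparser lowercases the names
        print(kind, rule, name)
```
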

Finally, we can recompile our "example.dsl"-file, and by its XML output we
can tell that it worked::

    $ python poetryCompiler.py example.dsl

So far, we have seen *in nuce* how the development workflow for building up
a DSL-grammar goes. Let's take this a step further by adding more
capabilities to our grammar.

Extending the example DSL further
---------------------------------

A grammar that can only digest single sentences is certainly rather boring.
So we'll extend our grammar a little further so that it can capture
paragraphs of sentences. To see where we are heading, let's first start a
new example file, call it "macbeth.dsl", and enter the following lines::

    Life’s but a walking shadow, a poor player that struts and frets his hour
    upon the stage and then is heard no more. It is a tale told by an idiot,
    full of sound and fury, signifying nothing.

What have we got there? We've got a paragraph that consists of several
sentences, each of which ends with a full stop. The sentences themselves can
consist of different parts which are separated by a comma. If, so far, we
have a clear idea (in verbal terms) of the structure of texts in our DSL, we
can now try to formulate this in the grammar::

    document = ~ { sentence } §EOF
    sentence = part {"," part } "."
    part     = { WORD }              # a subtle mistake, right here!
    WORD     =  /\w+/~               # something forgotten, here!
    EOF      =  !/./

The most important new part is the grammar rule "sentence". It reads like
this: a sentence is a part of a sentence, potentially followed by a repeated
sequence of a comma and another part of a sentence, and ultimately ending
with a full stop. (Understandable? If you have ever read Russell's
"Introduction to Mathematical Philosophy" you will be used to this kind of
prose. Other than that, I find the formal definition easier to understand.
However, for learning EBNF or any other formalism, it helps in the beginning
to translate the meaning of its statements into plain old English.)
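
Translated into a hand-written recognizer sketch (plain Python for
illustration only - this is not how DHParser's generated parsers look, and
the whitespace handling is simplified), the new rules read roughly like
this:

```python
import re

def skip_ws(text, pos):
    # insignificant whitespace, as captured by "~" after a literal
    return pos + re.match(r"\s*", text[pos:]).end()

def match_part(text, pos):
    # part = { WORD } -- zero or more repetitions of /\w+/~
    while True:
        m = re.match(r"\w+\s*", text[pos:])
        if not m:
            return pos
        pos += m.end()

def match_sentence(text, pos=0):
    # sentence = part { "," part } "." -- returns the end position or None
    pos = match_part(text, pos)
    while text.startswith(",", pos):
        pos = match_part(text, skip_ws(text, pos + 1))
    return pos + 1 if text.startswith(".", pos) else None

print(match_sentence("Plain old sentence."))   # 19 -- the whole string was matched
print(match_sentence("No full stop") is None)  # True -- no match without the dot
```
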

There are two subtle mistakes in this grammar. If you can figure them out
just by thinking about it, feel free to correct the grammar right now.
(Would you really have noticed the mistakes if they hadn't already been
marked in the code above?) For all less intelligent people, like me: let's
be prudent and - since the grammar has become more complex - add a few test
cases. This should make it easier to locate any errors. So open up an editor
with a new file in the tests subdirectory, say
``grammar_tests/03_test_sentence.ini`` (test files
should always contain the component `test_` in the filename, otherwise they
will be overlooked by DHParser's unit testing subsystem) and enter a few
test-cases like these::

423
    [match:sentence]
    M1: """It is a tale told by an idiot,
      full of sound and fury, signifying nothing."""
    M2: """Plain old sentence."""

    [fail:sentence]
    F1: """Ups, a full stop is missing"""
    F2: """No commas at the end,."""

Again, we recompile the grammar and run the test at the same time by running
the testing-script::

    $ python tst_poetry_grammar.py
    Errors found by unit test "03_test_sentence.ini":
    Fail test "F2" for parser "sentence" yields match instead of expected failure!

Too bad, something went wrong here. But what? Didn't the definition of the
rule "sentence" make sure that parts of sentences are, if at all, only
followed by a sequence of a comma *and* another part of a sentence? So how
come that between the last comma and the full stop there is nothing but
empty space? Ah, there's the rub! If we look into our grammar at how parts
of sentences have been defined, we find that the rule::

    part = { WORD }

defines a part of a sentence as a sequence of *zero* or more WORDs. This
means that a string of length zero also counts as a valid part of a
sentence. In order to avoid this, we could write::

    part = WORD { WORD }

This definition makes sure that there is at least one WORD in a part. Since
the case that at least one item is needed occurs rather frequently in
grammars, DHParser offers a special syntax for this case::

    part = { WORD }+

(The plus sign "+" must always follow directly after the curly brace "}"
without any whitespace in between, otherwise DHParser won't understand it.)
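
The difference between the two repetition operators can be demonstrated with
plain regular expressions, used here merely as an analogy for the grammar's
repetition constructs:

```python
import re

zero_or_more = re.compile(r"(?:\w+\s*)*")   # analogous to  part = { WORD }
one_or_more  = re.compile(r"(?:\w+\s*)+")   # analogous to  part = { WORD }+

# The zero-or-more version happily matches the empty string ...
print(zero_or_more.fullmatch("") is not None)   # True
# ... while the one-or-more version requires at least one WORD:
print(one_or_more.fullmatch("") is not None)    # False
print(one_or_more.fullmatch("two words ") is not None)  # True
```
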
At this point the worry may arise that the same problem could recur at
another level, if the rule for WORD matched empty strings as well. Let's
quickly add a test case for this to the file
``grammar_tests/01_test_word.ini``::

    [fail:WORD]
    F1: two words
    F2: ""

Thus, we can be sure to be warned in case the definition of rule "WORD"
matches the empty string. Luckily, it does not do so now. But if we change
this definition later for some reason, we might have forgotten about this
subtlety and introduce the same error again. With a test case we can reduce
the risk of such a regression error. This time the tests run through nicely.
So let's try the parser on our new example::

    $ python poetryCompiler.py macbeth.dsl
    macbeth.dsl:1:1: Error: EOF expected; "Life’s but" found!

That is strange. Obviously, there is an error right at the beginning (line
1, column 1). But what could possibly be wrong with the word "Life"? Now,
you might
already have guessed what the error is and that the error is not exactly
located in the first column of the first line.

Unfortunately, DHParser - like almost any other parser out there - is not
always very good at spotting the exact location of an error. Because rules
refer to other rules, a rule may fail to parse - or, what is just as bad,
succeed to parse when it should indeed fail - as a consequence of an error
in the definition of one of the rules it refers to. But this means that if
the rule for the whole document fails to match, the actual error can be
located anywhere in the document! There are different approaches to dealing
with this problem. A tool that DHParser offers is to write log-files that
document the parsing history. The log-files make it possible to spot the
location where the parsing error occurred. However, you will have to look
for the error manually. A good
495
496
497
498
499
starting point is usually either the end of the parsing process or the point
where the parser reached the farthest into the text. In order to receive the
parsing history, you need to run the compiler-script again with the debugging
option::

    $ python poetryCompiler.py macbeth.dsl

You will receive the same error messages as before, but this time various
kinds of debugging information have been written into a newly created
subdirectory "LOGS". (Beware that any files in the "LOGS" directory may be
overwritten or deleted by any of the DHParser scripts upon the next run! So
don't store any important data there.) The most interesting file in the
"LOGS"-directory is the full parser log. We'll ignore the other files and
just open the file "macbeth_full_parser.log.html" in an internet-browser. As
the parsing history tends to become quite long, this usually takes a while,
but luckily not in the case of our short demo example::

    $ firefox LOGS/macbeth_full_parser.log.html &

.. image:: parsing_history.png

What you see is a representation of the parsing history. It might look a bit
tedious in the beginning, especially the column that contains the parser
call sequence. But it is all very straightforward: For every application of
a match rule, there is a row in the table. Typically, match rules are
applied at the end of a long sequence of parser calls that is displayed in
the third column. You will recognise the parsers that represent rules by
their names, e.g. "document", "sentence" etc. Those parsers that merely
represent constructs of the EBNF grammar within a rule do not have a name
and are represented by their type, which always begins with a colon, like
":ZeroOrMore". Finally, the regular expression or literal parsers are
represented by the regular expression pattern or the string literal
themselves. (Arguably, it can be confusing that parsers are represented in
three different ways in the parser call sequence. I am still figuring out a
better way to display the parser call sequence. Any suggestions welcome!)
The first two columns display the position in the text in terms of lines and
columns. The second-to-last column, labeled "success", shows whether the
last parser in the sequence matched, failed or produced an error. In case of
an error, the error message is displayed in the third column as well. In
case the parser matched, the last column displays exactly that section of
the text that the parser did match. If the parser did not match, the last
column displays the text that still lies ahead and has not yet been parsed.

In our concrete example, we can see that the parser "WORD" matches "Life",
but not "Life’s" or "’s". And this ultimately leads to the failure of the
parsing process as a whole. The simplest solution would be to add the
apostrophe to the list of allowed characters in a word by changing the
respective line in the grammar definition to ``WORD = /[\w’]+/``. Now,
before we even change the grammar, we first add another test case to capture
this kind of error. Since we have decided that "Life’s" should be parsed as
a single word, let's open the file "grammar_tests/01_test_word.ini" and add
the following test::

    [match:WORD]
    M3: Life’s

To be sure that the new test captures the error we have found, you might
want to run the script "tst_poetry_grammar.py" and verify that it reports
the failure of test "M3" in the suite "01_test_word.ini". After that, change
the regular expression for the symbol WORD in the grammar file "poetry.ebnf"
as just described. Now both the tests and the compilation of the file
"macbeth.dsl" should run through smoothly.

.. caution:: Depending on the purpose of your DSL, the simple solution of
   allowing apostrophes within words might not be what you want. After all,
   "Life’s" is but a shorthand for the two-word phrase "Life is". Now,
   whatever alternative solution comes to your mind, be aware that there are
   also cases like Irish names, say "O’Dolan", where the apostrophe is
   actually a part of a word, and cases like "don’t" which, if expanded,
   would be two words *not* separated at the position of the apostrophe.

   We leave that as an exercise: first, to figure out what different cases
   for the use of apostrophes in the middle of a word exist; secondly, to
   make a reasonable decision which of these should be treated as a single
   and which as separate words; and, finally, if possible, to write a
   grammar that provides for these cases. These steps are quite typical for
   the kind of challenges that occur during the design of a DSL for a
   Digital-Humanities-Project.


Controlling abstract-syntax-tree generation
-------------------------------------------

Compiling the example "macbeth.dsl" with the command
``python poetryCompiler.py macbeth.dsl``, you might find yourself unable to
avoid the impression that the output is rather verbose. Just looking at the
beginning of the output, we find::

    <document>
       <:ZeroOrMore>
           <sentence>
               <part>
                   <WORD>
                       <:RegExp>Life’s</:RegExp>
                       <:Whitespace> </:Whitespace>
                   </WORD>
                   <WORD>
                       <:RegExp>but</:RegExp>
                       <:Whitespace> </:Whitespace>
                   </WORD>
    ...

But why do we need to know all those details? Why would we need a
":ZeroOrMore" element inside the "<document>" element, if the
"<sentence>"-elements could just as well be direct descendants of the
"<document>"-element? Why do we need the information that "Life’s" has been
captured by a regular expression parser? Wouldn't it suffice to know that
the word captured is "Life’s"? And is the whitespace really needed at all?
If the words in a sequence are by definition separated by whitespace, then
it would suffice to have the words without whitespace in our tree, and to
add whitespace only later when transforming the tree into some kind of
output format. (On the other hand, it might be convenient to have it in the
tree nevertheless...)

Well, the answer to most of these questions is that what our compilation
script yields is more or less the output that the parser yields, which in
turn is the *concrete syntax tree* of the parsed text. Being a concrete
syntax tree, it is by its very nature very verbose, because it captures
every minute syntactic detail described in the grammar and found in the
text, no matter how irrelevant it is if we are primarily interested in the
structure of our text. In order for our tree to become more manageable, we
have to transform it into an *abstract syntax tree* first, which is called
thus because it abstracts from all details that we deem irrelevant. Now,
which details we consider irrelevant is almost entirely up to us. And we
should think carefully about what features must be included in the abstract
syntax tree, because the abstract syntax tree more or less reflects the data
model (or is at most one step away from it) with which we want to capture
our material.

For the sake of our example, let's assume that we are not interested in
whitespace and that we want to get rid of all uninformative nodes, i.e.
nodes that merely demarcate syntactic structures but not semantic entities.

DHParser supports the transformation of the concrete syntax tree (CST) into
the abstract syntax tree (AST) with a simple technology that (in theory)
allows you to specify the necessary transformations in an almost declarative
fashion: You simply fill in a Python-dictionary of tag-names with
transformation *operators*. Technically, these operators are simply
Python-functions. DHParser comes with a rich set of predefined operators.
Should these not suffice, you can easily write your own. What does this look
like? ::

    poetry_AST_transformation_table = {
       "+": remove_empty,
       "document": [],
       "sentence": [],
       "part": [],
       "WORD": [],
       "EOF": [],
       ":_Token, :_RE": reduce_single_child,
       "*": replace_by_single_child
    }

You'll find this table in the script ``poetryCompiler.py``, which is also the
place where you edit the table, because then it is automatically used when
compiling your DSL-sources. Now, AST-Transformation works as follows: The whole
tree is scanned, starting at the deepest level and applying the specified
operators and then working its way upward. This means that the operators
specified for "WORD"-nodes will be applied before the operators of "part"-nodes
and "sentence"-nodes. This has the advantage that when a particular node is
reached the transformations for its descendant nodes have already been applied.
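
This traversal order can be illustrated with a small sketch (a simplified
model with a hypothetical ``Node``-class and ``traverse``-function, not
DHParser's actual implementation):

```python
# Sketch of the bottom-up (post-order) traversal used for
# AST-transformation. The Node class is a stand-in for illustration only.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def traverse(node, visited):
    for child in node.children:
        traverse(child, visited)   # transform the descendants first ...
    visited.append(node.name)      # ... then the node itself

tree = Node("document",
            [Node("sentence", [Node("part", [Node("WORD")])])])
order = []
traverse(tree, order)
# deepest nodes come first, the root last
assert order == ["WORD", "part", "sentence", "document"]
```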

As you can see, the transformation-table contains an entry for every known
parser, i.e. "document", "sentence", "part", "WORD", "EOF". (If any of these
are missing in the table of your ``poetryCompiler.py``, add them now!) In
the template you'll also find transformations for two anonymous parsers,
i.e. ":_Token" and ":_RE", as well as some curious entries such as "*" and
"+". The latter are considered to be "jokers". The transformations related
to the "+"-sign will be applied on any node, before any other transformation
is applied. In this case, all empty nodes will be removed first
(transformation: ``remove_empty``). The "*"-joker contains a list of
transformations that will be applied to all those tags that have not been
entered explicitly into the transformation table. For example, if the
transformation reaches a node with the tag-name ":ZeroOrMore" (i.e. an
anonymous node that has been generated by the parser ":ZeroOrMore"), the
"*"-joker-operators will be applied. In this case it is just one
transformation, namely ``replace_by_single_child``, which replaces a node
that has but one child by its child. In contrast, the transformation
``reduce_single_child`` eliminates a single child node by attaching the
child's children or content directly to the parent node. We'll see what this
means and how this works shortly.
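
The difference between these two operators can be illustrated on a toy tree.
The following is a minimal sketch with a hypothetical ``Node``-class, not
DHParser's actual API:

```python
# Illustration of replace_by_single_child vs. reduce_single_child on a
# toy tree -- simplified stand-ins, not DHParser's real implementation.

class Node:
    def __init__(self, name, children=None, content=""):
        self.name = name
        self.children = children or []
        self.content = content

def replace_by_single_child(node):
    # The node itself disappears: it assumes name and payload of its child.
    if len(node.children) == 1:
        child = node.children[0]
        node.name, node.children, node.content = \
            child.name, child.children, child.content

def reduce_single_child(node):
    # The child disappears: its children and content move up to the node.
    if len(node.children) == 1:
        child = node.children[0]
        node.children, node.content = child.children, child.content

a = Node(":ZeroOrMore", [Node("sentence", content="...")])
replace_by_single_child(a)
assert a.name == "sentence"     # the anonymous wrapper is gone

b = Node("WORD", [Node(":RegExp", content="Life’s")])
reduce_single_child(b)
assert b.name == "WORD" and b.content == "Life’s"   # the child is merged
```

Note how the node's name survives with ``reduce_single_child`` but not with
``replace_by_single_child``, which is why the former suits named nodes like
"WORD" and the latter anonymous wrappers like ":ZeroOrMore".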

.. caution:: Once the compiler-script "xxxxCompiler.py" has been generated,
    the *only* part that is changed after editing and extending the grammar
    is the parser-part of this script (i.e. the class derived from class
    Grammar), because this part is completely auto-generated and can
    therefore be overwritten safely. The other parts of that script,
    including the AST-transformation-dictionary, are never changed once they
    have been generated, because they need to be filled in by hand by the
    designer of the DSL and the hand-made changes should not be overwritten.
    They are therefore left as they are when regenerating the parser.
    However, this means that if you add symbols to your grammar later, you
    will not find them as keys in the AST-transformation-table, but will
    have to add them yourself.

    The comments in the compiler-script clearly indicate which parts can be
    edited by hand safely, i.e. without running the risk of being
    overwritten, and which cannot.

We can specify either no operator (empty list), a single operator, or a list
of operators for transforming a node. There is a difference between
specifying an empty list for a particular tag-name and leaving out a
tag-name completely. In the latter case the "*"-joker is applied in place of
the missing list of operators. In the former case only the "+"-joker is
applied. If a list of operators is specified, these operators will be
applied in sequence, one after the other. We also call the list of operators
(or the single operator, if there is only one) the *transformation* for a
particular tag (or parser name or parser type, for that matter).
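
The lookup logic described above can be sketched as follows (a simplified
model of the dispatch; the function name and the use of strings instead of
operator functions are illustrative assumptions, not DHParser's actual
code):

```python
def transformations_for(tag_name, table):
    # Simplified dispatch sketch: an explicit entry wins; otherwise the
    # "*"-joker steps in; the "+"-joker is always prepended first.
    # Real tables map tag-names to operator *functions*; strings are used
    # here only to keep the example self-contained.
    def as_list(entry):
        return entry if isinstance(entry, list) else [entry]
    specific = table[tag_name] if tag_name in table else table.get("*", [])
    return as_list(table.get("+", [])) + as_list(specific)

table = {
    "+": "remove_empty",
    "WORD": [],                          # explicit empty list
    "*": "replace_by_single_child",      # joker for everything else
}

# Explicit empty entry: only the "+"-joker applies.
assert transformations_for("WORD", table) == ["remove_empty"]
# Missing entry: the "*"-joker applies in addition.
assert transformations_for(":ZeroOrMore", table) == [
    "remove_empty", "replace_by_single_child"]
```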

Because the AST-transformation works through the tree from the inside to the
outside, it is reasonable to proceed in the same order when designing the
AST-transformations. The innermost nodes that concern us are the nodes
captured by the <WORD>-parser, or simply, <WORD>-nodes. As we can see, these
nodes usually contain a <:RegExp>-node and a <:Whitespace>-node. As the
"WORD" parser is defined as a simple regular expression followed by optional
whitespace in our grammar, we know that this must always be the case,
although the whitespace may occasionally be empty. Thus, we can eliminate
the uninformative child nodes by removing the whitespace first and then
reducing the single left-over child node. The respective line in the
AST-transformation-table in the compiler-script should be changed as
follows::

    "WORD": [remove_whitespace, reduce_single_child],

Running the "poetryCompiler.py"-script on "macbeth.dsl" again yields::

    <document>
     <:ZeroOrMore>
       <sentence>
         <part>
           <WORD>Life’s</WORD>
           <WORD>but</WORD>
           <WORD>a</WORD>
           <WORD>walking</WORD>
           <WORD>shadow</WORD>
         </part>
         <:Series>
           <:_Token>
             <:PlainText>,</:PlainText>
             <:Whitespace> </:Whitespace>
           </:_Token>
           <part>
             <WORD>a</WORD>
    ...

It starts to become more readable and concise, but there are still some
oddities. Firstly, the tokens that delimit parts of sentences still contain
whitespace. Secondly, if several <part>-nodes follow each other in a
<sentence>-node, the <part>-nodes after the first one are enclosed by a
<:Series>-node or even a cascade of <:ZeroOrMore>- and <:Series>-nodes. As
for the <:_Token>-nodes, we can do the same trick as with the WORD-nodes::

    ":_Token": [remove_whitespace, reduce_single_child],
    ":_RE": reduce_single_child,

As to the nested structure of the <part>-nodes within the <sentence>-node,
this is a rather typical case of syntactic artefacts that can be found in
concrete syntax trees. It is obviously a consequence of the grammar
definition::

    sentence = part {"," part } "."

We'd of course prefer to have a flat structure of parts and punctuation
marks following each other within the sentence. Since this is a standard
case, DHParser includes a special operator to "flatten" nested structures of
this kind::

    "sentence" = [flatten],

The ``flatten`` operator recursively eliminates all intermediary anonymous
child nodes. We do not need to do anything in particular for transforming
the <part>-nodes, except that we should explicitly assign an empty
operator-list to "part", because we do not want the "*"-joker to step in.
The reason is that a <part> with a single <WORD> should still be visible as
a part and not be replaced by the <WORD>-node, because we would like our
data model to have as regular a form as possible. (This does of course imply
a decision that we have taken on the form of our data model, which it would
lead too far to discuss here. Suffice it to say that depending on the
occasion and purpose, such decisions can also be taken otherwise.)
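
What ``flatten`` does can be sketched on a toy tree (again with a
hypothetical ``Node``-class; DHParser's real operator handles more details):

```python
# Sketch of the flatten idea: splice the children of anonymous nodes
# (tag-names starting with ':') into the parent's list of children.

class Node:
    def __init__(self, name, children=None, content=""):
        self.name = name
        self.children = children or []
        self.content = content

def flatten(node):
    flat = []
    for child in node.children:
        flatten(child)                     # flatten depth-first
        if child.name.startswith(":") and child.children:
            flat.extend(child.children)    # splice anonymous containers
        else:
            flat.append(child)
    node.children = flat

sentence = Node("sentence", [
    Node("part"),
    Node(":Series", [Node(":_Token", content=","), Node("part")]),
])
flatten(sentence)
# the nested <:Series> wrapper is gone; its children sit in <sentence>
assert [c.name for c in sentence.children] == ["part", ":_Token", "part"]
```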

The only kind of nodes left are the <document>-nodes. In the output of the
compiler-script (see above), the <document>-node had a single child-node
":ZeroOrMore". Since this child node does not have any particular semantic
meaning, it would be reasonable to eliminate it and attach its children
directly to "document". We could do so by entering ``reduce_single_child``
in the list of transformations for "document"-nodes. However, when designing
the AST-transformations, it is important not only to consider the concrete
output that a particular text yields, but all possible outputs. Therefore,
before specifying a transformation, we should also take a careful look at
the grammar again, where "document" is defined as follows::

    document = ~ { sentence } §EOF

As we can see a "document"-node may also contain whitespace and an EOF-marker.
The reason why we don't find these in the output is that empty nodes have been
eliminated by the ``remove_empty``-transformation specified in the "+"-joker,
before. While EOF is always empty (little exercise: explain why!). But there
could be ":Whitespace"-nodes next to the zero or more sentences in the document
node, in which case the "reduce_single_child"-operator would do nothing, because
there is more than a single child. (We could of course also use the
"flatten"-operator, instead. Try this as an exercise.) Test cases help to
capture those different scenarios, so adding test cases and examining the output
789
in the test report help to get a grip on this, if just looking at the grammar
eckhart's avatar
eckhart committed
790
791
792
793
794
795
796
strains you imagination too much.

Since we have decided that we do not want to include whitespace in our data
model, we can simply eliminate any whitespace before we apply the
``reduce_single_child``-operator, so we change the "document"-entry in the
AST-transformation-table thus::

    "document": [remove_whitespace, reduce_single_child],

Now that everything is set, let's have a look at the result::

    <document>
     <sentence>
       <part>
         <WORD>Life’s</WORD>
         <WORD>but</WORD>
         <WORD>a</WORD>
         <WORD>walking</WORD>
         <WORD>shadow</WORD>
       </part>
       <:_Token>,</:_Token>
       <part>
         <WORD>a</WORD>
         <WORD>poor</WORD>
         <WORD>player</WORD>
    ...

That is much better. There is but one slight blemish in the output: While
all nodes left are named nodes, i.e. nodes associated with a named parser,
there are a few anonymous <:_Token>-nodes. Here is a little exercise: Do
away with those <:_Token>-nodes by replacing them with something
semantically more meaningful. Hint: Add a new symbol "delimiter" to the
grammar definition in "poetry.ebnf". An alternative strategy to extending
the grammar would be to use the ``replace_parser`` operator. Which of the
two strategies is the better one? Explain why.