Parsing via metaprogramming

Published , updated

This blog post is adapted from my book The Tcl Programming Language.

What is metaprogramming? Roughly speaking, metaprogramming involves writing a program that in turn writes a program to do the desired task. In some cases metaprogramming makes for simpler or more succinct code while in others it optimizes performance by generating specialized code at runtime. Tcl lends itself naturally to this style of programming. This article illustrates one such use.

A common task in programming is parsing of structured data or text. One technique used for this purpose that you will see in the Tcl world is to transform the data into a Tcl script that embeds Tcl commands within the data and then execute the generated script.

This is easiest explained through an example. We will use the following code derived from Stephen Uhler's famous 4-line HTML parser.

proc html_parse {html callback} {
    set re {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $callback] {\\2} {\\1} {\\3} \{"
    regsub -all $re [string map {\{ \&ob; \} \&cb;} $html] $sub script
    eval "$callback PARSE {} {} \{ $script \}; $callback PARSE / {} {}"
}

The intent is to transform the HTML text to a Tcl script where each HTML tag results in the invocation of a command which is passed the tag as the parameter callback. To understand what html_parse is doing, let us interactively execute a slightly simplified version of the above implementation line by line.

Since we are going to interactively execute the commands in html_parse to understand what's going on, we first have to define sample values for variables html and callback corresponding to the arguments passed to html_parse.

% set html {
    <p class='important'>Something <b>really</b> important.</p>
    <p>A second paragraph</p>
}
 
    <p class='important'>Something <b>really</b> important.</p>
    <p>A second paragraph</p>
% set callback html_cb
html_cb

(We will define the html_cb callback procedure later.)

Now we execute the first line of html_parse which defines a regular expression that matches both opening and closing HTML tags.

% set re {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>

This regular expression just matches a tag (starting or ending) along with any contained attributes. The first subexpression in the match result (retrievable as \\1) will be either the empty string or / depending on whether it is a starting tag or ending one. The second subexpression (similarly retrievable as \\2) is the tag name. The third subexpression will contain attribute text if any.

The next line of html_parse defines the regsub substitution that will convert each tag to a call to the callback command. The subexpressions referenced above will be used to construct the arguments to the command.

% set sub "\}\n[list $callback] {\\2} {\\1} {\\3} \{"
}
html_cb {\2} {\1} {\3} {

Our intent is that a tag of the form <body> will result in a call to html_cb (the callback procedure passed in) with the body tag name passed as the first argument, the start/end indicator as the second argument and attributes as the third. These all come from the matched subexpressions in the regular expression. In addition, the callback also receives a fourth argument which contains all text succeeding the tag up until the next tag. This argument is not derived from the regular expression but rather is implicitly provided by the leading } and trailing { in the substitution text. The leading }, together with the following \n, will serve as the ending delimiter for the fourth argument and command corresponding to the previous tag being processed. The trailing { serves as the starting delimiter of the fourth argument (the succeeding text) to the current tag being processed. We will see this in a minute.

Having defined the matching and substition expressions above, the regsub command in the third line uses these to transform the HTML text into an equivalent Tcl script fragment.

% regsub -all $re $html $sub script
6
% puts $script

    }
html_cb {p} {} {class='important'} {Something }
html_cb {b} {} {} {really}
html_cb {b} {/} {} { important.}
...Additional lines omitted...

This is actually a script fragment, not an entire script (hence the leading and trailing brace characters). It calls the specified command passing four parameters. To see the entire script that would be evaluated, we can print the argument to the eval in the last line of html_parse.

% puts "$callback PARSE {} {} \{ $script \}; $callback PARSE / {} {}"
html_cb PARSE {} {} { 
    }
html_cb {p} {} {class='important'} {Something }
html_cb {b} {} {} {really}
html_cb {b} {/} {} { important.}
...Additional lines omitted...

Notice how each tag has been replaced with a call to our callback command with appropriate arguments, the first three coming from the regular expression matches and the fourth arising out of the extraneous braces in the expression that wraps the text between tags.

Invoking our 4-line HTML parser as follows

html_parse $html html_cb

will result in the script printed above being generated and evaluated.

To summarize the working, html_parse transforms the passed HTML text such that each HTML begin and end tag is converted to a call to the passed callback command, html_cb in our example, with four arguments:

  • the name of the tag, such as p or b,

  • an argument that is empty if it is the beginning of the tag and / if it corresponds to the tag termination

  • any attributes for the tag

  • The text content until the start of the next tag.

The script uses a special tag name PARSE that allows the callback command to recognize the start and end of parsing for any required state initialization or finalization.

All we need to do now is define our callback command to do the desired parsing action. Let us define a trivial callback to simply convert all tags to upper case.

proc html_cb {tag place attrs content} {
    if {$tag ne "PARSE"} {
        if {$attrs ne ""} {
            set attrs " $attrs"
        }
        puts -nonewline "<$place[string toupper $tag]$attrs>$content"
    }
}

Now examine the output when our sample HTML fragment is processed as below.

% html_parse $html html_cb
<P class='important'>Something <B>really</B> important.</P>
    <P>A second paragraph</P>

As slightly more complex example, here is a HTML to Latex markup converter (that only understands two tags and ignores the rest). Again, this is only illustrative of the technique and does not consider even basic requirements such as escaping of special characters.

proc html_latex {tag place attrs content} {
    switch -exact -nocase -- $tag {
        B {
            if {$place eq ""} {
                append_paragraph "\{\\em $content"
            } else {
                append_paragraph "\}$content"
            }
        }
        P {
            flush_paragraph
            append_paragraph $content
        }
        PARSE {
            if {$place eq ""} {
                append_paragraph $content
            } else {
                flush_paragraph
            }
        }
    }
}

proc append_paragraph {content} {
    if {[string length $content]} {
        append ::paragraph $content
    }
}

proc flush_paragraph {} {
    if {[info exists ::paragraph]} {
        set para [string trim $::paragraph]
        if {[string length $para]} {
            puts "$para\n"
        }
        unset ::paragraph
    }
}

We can then convert HTML to Latex input like this

% html_parse $html html_latex
Something {\em really} important.

A second paragraph

This HTML parser is simplistic and does not take into account all of HTML syntax details and idiosyncracies. It is meant to illustrate the "data to script" transform technique. The htmlparse module in tcllib provides a more robust HTML parsing solution based on the same technique. (As an aside, a fast, standards compliant solution for parsing HTML is the tdom package. This is however a binary extension and not based on the above technique).

A final note of caution: This technique of transforming data to a script has potential security issues when the data comes from an unknown (and potentially malicious) source. Although this can be guarded against with proper escaping and quoting of the input data, it is advisable to execute the generated script in a safe interpreter.