Introducing Tcl 8.7 Part 1: regsub enhancements

The long-awaited alpha release of Tcl 8.7 calls for a series of posts summarizing the enhancements in this release. The first in this series covers the new -command option to the regsub command. I have wished for this feature many times, and thanks to DKF it is now available in Tcl.

To take Tcl 8.7 for a spin, you can download a pre-alpha binary for your platform. Alternatively, you can build it yourself from the core-8-branch branch in the Tcl fossil repository.

Before Tcl 8.7, the regsub command transformed strings by substitution, where the replacement could be composed only of literal strings and matched portions of the input string. Tcl 8.7 adds the ability to substitute an arbitrary computed value, which may depend on the matched text.

A first example

Consider the problem of URL-encoding a given string, where a simple but still conforming method is to replace every non-alphanumeric character with its hexadecimal value preceded by a % character. The following does the job by invoking the enc procedure on every character matching the regular expression and replacing the character with the returned value.

% proc enc {ch} {format %%%02X [scan $ch %c]}
% regsub -command -all {[^0-9A-Za-z]} some-random+string enc
some%2Drandom%2Bstring

Or if you have an aversion to one-use procedures and are willing to tolerate readability issues with apply,

% regsub -command -all {[^0-9A-Za-z]} some-random+string {
   apply {ch {format %%%02X [scan $ch %c]}}
}
some%2Drandom%2Bstring

Or if you keep the Tcllib lambda package handy,

% package require lambda
% regsub -command -all {[^0-9A-Za-z]} some-random+string [
    lambda ch {format %%%02X [scan $ch %c]}
]
some%2Drandom%2Bstring

But I digress. The original point was to illustrate the utility of this new feature. So for comparison, you might want to look at the implementation of the ncgi::encode command from Tcllib's ncgi module that provides the functionality for Tcl 8.6 and earlier.
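To give a feel for what -command saves you, here is a minimal sketch of emulating it by hand in Tcl 8.6. It is not how ncgi::encode works; the procedure name regsub_cmd86 is hypothetical, and the sketch passes only the whole match to the command, so it does not handle subexpressions.

```tcl
# Same encoder as in the first example.
proc enc {ch} {format %%%02X [scan $ch %c]}

# Hypothetical helper: replace every match of re in str with the
# result of calling cmd (a command prefix) on the matched text.
# Assumes re contains no capturing subexpressions.
proc regsub_cmd86 {re str cmd} {
    set out ""
    set pos 0
    # With -all -inline -indices, regexp returns one {first last}
    # index pair per match when there are no subexpressions.
    foreach idx [regexp -all -inline -indices -- $re $str] {
        lassign $idx first last
        # Copy the unmatched text preceding this match.
        append out [string range $str $pos [expr {$first - 1}]]
        # Append the computed replacement for the match itself.
        append out [{*}$cmd [string range $str $first $last]]
        set pos [expr {$last + 1}]
    }
    # Copy whatever follows the final match.
    append out [string range $str $pos end]
}

puts [regsub_cmd86 {[^0-9A-Za-z]} some-random+string enc]
# prints some%2Drandom%2Bstring
```

The bookkeeping with indices and the running position is exactly what the single regsub -command call in Tcl 8.7 takes off your hands.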

Syntax

The syntax of regsub takes the form

regsub ?SWITCHES? REGEX STRING SUBSTITUTIONSPEC ?VARNAME?

Note from the earlier example that -command is a binary switch and not an option that takes the command as an argument. Rather, it causes the SUBSTITUTIONSPEC argument to regsub to be treated as a command instead of the actual substitution. When the command is invoked, it is passed one or more arguments. The first is the text matched by the whole expression. Subsequent arguments, if present, correspond to the parenthesized subexpressions in REGEX.

For illustrative purposes,

% proc print {whole first second} {puts $whole,$first,$second}
% regsub -all -command {(.)(.)} abcd print
ab,a,b
cd,c,d

The convenience of the command arises from its succinct combination of iteration (with the -all option), selective matching of the iteration operand, and execution of code of any complexity.

More examples

Ensure first character after punctuation is upper case:

% set text "First sentence. second sentence? third sentence."
% regsub -command -all {([.!?])\s+(.)} $text {
  apply {{- punc ch} {return "$punc [string toupper $ch]"}}
}
First sentence. Second sentence? Third sentence.

Converting Markdown headings to HTML:

% set md_line "# First level heading\n## A second level heading"
% regsub -all -line -command {^(#+)\s+(.*)$} $md_line [
  lambda {- level text} {
      set h h[string length $level]; return "<$h>$text</$h>"
  }
]
<h1>First level heading</h1>
<h2>A second level heading</h2>

Sometimes you may not care about transforming the string at all and use regsub purely for iteration. To find the length of the longest word in a sentence:

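# Sample input, made up for illustration
set sentence "The quick brown fox jumped over the lazy dog"
set maxlen 0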
regsub -command -all {\w+} $sentence [
  lambda word {
    set len [string length $word]
    if {$len > $::maxlen} {set ::maxlen $len}
  }
]
set maxlen

One last point about using regsub. Consider changing all numerals in a sentence to their Devanagari versions. Whip out your regsub hammer!

% regsub -all -command {\d} "42 is the answer." [
  lambda {ch} {format %c [expr {[scan $ch %c]+0x966-0x30}]}
]
४२ is the answer.

Realize, if you haven't already, that string map could do this just as well using a character map, and would likely be faster and clearer. regsub is the better choice when the map is not static, is too big, or depends on context, as in the earlier examples.
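For comparison, a string map version might look like the following sketch. The variable name digitmap is made up for illustration; the map is built from the same code point arithmetic as the regsub version, with U+0966 being the Devanagari digit zero.

```tcl
# Build the static map: 0 -> U+0966, 1 -> U+0967, ..., 9 -> U+096F.
set digitmap {}
for {set d 0} {$d < 10} {incr d} {
    lappend digitmap $d [format %c [expr {0x966 + $d}]]
}

# Apply the whole map in one pass over the string.
puts [string map $digitmap "42 is the answer."]
# prints ४२ is the answer.
```

Unlike the regsub version, this runs on Tcl 8.6 and earlier as well, and the map can be built once and reused.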

See the command reference for additional examples.

References

TIP 463: Command-Driven Substitutions for regsub

Regsub manpage