Exploring Tcl internals from script - Part I


Tcl has some commands that are undocumented because they are liable to change, or even be removed, at any time, even in a patch release. Nevertheless, these commands can be very useful for exploring and understanding the inner workings of Tcl and, in some cases, for dealing with issues related to performance or interaction with external systems like COM on Windows.

This post examines the use of one of these commands - representation - that lets us introspect on data. A future post will look at the disassemble command which provides a way to look at compiled Tcl bytecode. Both commands lie in the ::tcl::unsupported:: namespace so to save us some typing let us add it to our namespace path.

% namespace path [linsert [namespace path] 0 ::tcl::unsupported]

Warning: These undocumented commands have been placed within the ::tcl::unsupported namespace for good reason. You should restrict their use to interactive sessions during development or debugging and refrain from making decisions based on their return values in production code.
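Because these commands may disappear without notice, a debugging script can check that one exists before calling it. The following is a minimal sketch; the helper name have_unsupported is hypothetical and not part of Tcl.

```tcl
# Hypothetical helper (not part of Tcl): report whether a command
# with the given name exists in the ::tcl::unsupported namespace.
proc have_unsupported {name} {
    # info commands returns an empty list when nothing matches the pattern.
    expr {[llength [info commands ::tcl::unsupported::$name]] > 0}
}

if {[have_unsupported representation]} {
    puts [::tcl::unsupported::representation "hello"]
} else {
    puts "representation is not available in this Tcl build"
}
```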

Before we look at the use of the commands, we will disable Tcl's history feature which keeps track of previously executed commands in interactive mode. We need to do this because it interferes with reference counts of values that appear in the history list.

The easiest way to disable history is to redefine the command to do nothing.

% proc history args {}

Examining data using representation

The tcl::unsupported::representation command dumps the internal representation of a Tcl value in human-readable form. In the simplest case, let us look at how a constructed string might be stored internally and contrast that with a constructed list value. (We say "might" because, as we shall see, this can change depending on how the value is used.)

% representation [string repeat abc 2]
→ value is a pure string with a refcount of 1, object pointer at 0000000002D56AC0,
   → string representation "abcabc"
% representation [lrepeat 2 abc]
→ value is a list with a refcount of 1, object pointer at 0000000002D56820,
   → internal representation 0000000003217070:0000000000000000, no string
   → representation

Let us look at each component of the dump in the two cases. We briefly summarize here and expand on them later.

The internal types of the two returned values are shown as a pure string and list respectively.

The first value is shown to have a string representation whereas the second is shown as not having one. On the other hand, the second has an internal representation field which the first does not.

The object pointer is the memory address where the internal structure for the constructed value is stored. This structure has the type Tcl_Obj in the Tcl source code.

The refcount value reflects the reference count for that Tcl_Obj. Tcl uses reference counting internally to keep track of these structures.
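The components listed above can be pulled apart in a script. The sketch below is only an illustration: the helper name repinfo and the dictionary keys are hypothetical, and the pattern assumes the text layout shown in this post, which is undocumented and may differ in other Tcl releases.

```tcl
# Hypothetical helper: parse the text returned by representation into a dict.
# Note that the reported refcount includes the extra references created by
# passing the value through this procedure.
proc repinfo {value} {
    set text [::tcl::unsupported::representation $value]
    set pattern {^value is a (.+) with a refcount of (\d+), object pointer at ([^,]+)}
    if {![regexp $pattern $text -> type refcount pointer]} {
        error "unrecognized representation output: $text"
    }
    return [dict create \
        type      $type \
        refcount  $refcount \
        pointer   $pointer \
        internal  [string match {*, internal representation *} $text] \
        hasstring [expr {![string match {*, no string representation} $text]}]]
}

puts [repinfo [string repeat abc 2]]
puts [repinfo [lrepeat 2 abc]]
```

The later sections only need one or two of these fields at a time, so a smaller pattern is usually enough in practice.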

Understanding internal representations

Although Tcl semantics are defined in terms of "everything is a string", for performance reasons Tcl maintains an internal representation more suitable for computation when necessary. Tcl stores values internally in a Tcl_Obj C structure. The internal representation is stored as two fields within this Tcl_Obj structure and may change based on the operations invoked on the value.

We illustrate this for integer values.

% set ival 100
→ 100
% representation $ival
→ value is a pure string with a refcount of 3, object pointer at 0000000002D556E0,
   → string representation "100"

Because we simply assigned a literal to ival, and have not invoked any operations on it, its type is shown as a pure string, the pure indicating that there is no other representation associated with it. Also note that there is no mention of an internal representation in the command's output.

Now we look at how that changes once we invoke an integer operation on it.

% incr ival ; representation $ival
→ value is a int with a refcount of 2, object pointer at 0000000002D556E0,
   → internal representation 0000000000000065:0000000002D573F0, no string
   → representation

The invocation of an integer operation creates an internal representation of type int held within the Tcl_Obj structure. Thus the next time an integer operation is to be performed on the value, there is no cost incurred converting a string to an integer.

Note the presence of an internal representation field where the first value 0x65 directly stores the integer value. The second field is not used for integers.
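As a quick check, the first field can be converted back to decimal. This small sketch uses the field text copied from the output above; the layout of the two fields is an implementation detail and may look different on other platforms or Tcl versions.

```tcl
# First field of the internal representation, copied from the output above.
set field 0000000000000065

# scan with %x reads the text as a hexadecimal integer.
scan $field %x decoded

puts $decoded   ;# prints 101, the value of ival after the incr
```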

Note furthermore that there is no string representation for the incremented value. Tcl will only generate one when required. An operation like string length or even I/O will force the string representation to be created. This is also why we placed the representation command on the same line as the incr. Otherwise, when the shell printed the result of incr, the string representation would have been generated and we would not have been able to show the intermediate step.
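The same effect can be observed from a script, where no shell prints intermediate results. This is a sketch under the assumption that the output text ends with "no string representation" when none exists, as in the dumps shown in this post; the procedure name lazy_string_demo is hypothetical.

```tcl
# Hypothetical demo procedure: capture the description of an integer value
# before and after an operation that needs its string form.
proc lazy_string_demo {} {
    set n 100
    incr n
    # Nothing has asked for the characters of the new value yet.
    set before [::tcl::unsupported::representation $n]
    # string length needs the characters, so the string form gets generated.
    string length $n
    set after [::tcl::unsupported::representation $n]
    return [list $before $after]
}

lassign [lazy_string_demo] before after
puts $before
puts $after
```

The internal type reported in the second line may also differ from the first, since a string operation is free to change the internal representation.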

% puts $ival ; representation $ival
→ 101
  value is a int with a refcount of 2, object pointer at 0000000002D556E0,
   → internal representation 0000000000000065:0000000002D573F0, string representation
   → "101"

Invoking a list operation will transform the internal representation yet again.

% llength $ival ; representation $ival
→ value is a list with a refcount of 2, object pointer at 0000000002D556E0,
   → internal representation 00000000030EE0A0:0000000000000000, string representation
   → "101"

Again note the internal representation field is present but has changed. It now holds a pointer to a list structure in memory.

Note that throughout the above sequence of operations the object pointer has stayed the same, and the reference count has not changed since the incr. Only the type of the internal representation has changed. This change of internal representation from one type to another is commonly referred to as shimmering.
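The observation that the Tcl_Obj stays put while its type changes can be checked in a script. The following is a minimal sketch; the names type_and_pointer and shimmer_demo are hypothetical, and the pattern assumes the output layout shown in this post.

```tcl
# Hypothetical helper: extract the type name and object pointer from the
# text returned by representation.
proc type_and_pointer {value} {
    set text [::tcl::unsupported::representation $value]
    set pattern {^value is a (.+) with a refcount of \d+, object pointer at ([^,]+)}
    if {![regexp $pattern $text -> type pointer]} {
        error "unrecognized representation output: $text"
    }
    return [list $type $pointer]
}

# Hypothetical demo: use one value first as an integer, then as a list.
proc shimmer_demo {} {
    set v 100
    incr v                             ;# integer operation
    set asInt [type_and_pointer $v]
    llength $v                         ;# list operation on the same value
    set asList [type_and_pointer $v]
    return [list $asInt $asList]
}

lassign [shimmer_demo] asInt asList
puts "after incr:    $asInt"
puts "after llength: $asList"
```

Both lines should report the same pointer, with only the type name differing.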

Here is a short example of how shimmering to an appropriate type on the fly leads to more efficient operation.

% proc rgb {color} {
    set colors {
        red   0xff0000
        green 0x00ff00
        blue  0x0000ff
    }
    puts [representation $colors]
    return [dict get $colors $color]
}
% rgb red
→ value is a pure string with a refcount of 4, object pointer at 0000000002D55890,
   → string representation "
          red ..."
  0xff0000
% rgb red
→ value is a dict with a refcount of 4, object pointer at 0000000002D55890,
   → internal representation 0000000003118540:0000000000000000, string representation
   → "
          red ..."
  0xff0000

When the procedure is compiled, the value assigned to colors is stored as a string as we see in the output from representation on the first call to rgb. The dict get operation shimmers the internal representation to a dictionary. Thus on subsequent calls, the command is spared the expense of converting the string to a dictionary before looking it up.

In the above example, the shimmering of the colors value happens just once, on the first call to rgb. Thereafter it is accessed as, and remains internally stored as, a dictionary. On the other hand, repeated shimmering of values between different types not only entails a performance hit but is often indicative of a design or conceptual flaw, for example using a string operation like append in place of the list operation lappend.
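The append versus lappend case can be made visible with representation. This is a sketch only: the names type_of and append_demo are hypothetical, and the exact type name reported after append is an implementation detail, so the example only relies on it no longer being a list.

```tcl
# Hypothetical helper: return just the type name reported by representation.
proc type_of {value} {
    set text [::tcl::unsupported::representation $value]
    if {![regexp {^value is a (.+) with a refcount of} $text -> type]} {
        error "unrecognized representation output: $text"
    }
    return $type
}

# Hypothetical demo: mix a list operation and a string operation on one variable.
proc append_demo {} {
    set items [list a b c]
    lappend items d            ;# list operation, the value stays a list
    set afterLappend [type_of $items]
    append items " e"          ;# string operation, the list form is discarded
    set afterAppend [type_of $items]
    llength $items             ;# the string has to be parsed into a list again
    set afterLlength [type_of $items]
    return [list $afterLappend $afterAppend $afterLlength]
}

puts [append_demo]
```

Using lappend throughout would avoid the round trip through the string form.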

Tcl uses internal representations for many types of objects. There is no means to enumerate them all since this is supposed to be an implementation detail and not intended to be visible to scripts at all. A small subset of these is shown below.

% representation  [dict create key val]  <1>
→ value is a dict with a refcount of 3, object pointer at 0000000002D58260, int...
% 
% representation [pwd]                   <2>
→ value is a path with a refcount of 6, object pointer at 00000000005C3050, int...
% 
% set pos end-1 ; lindex {1 2} $pos
→ 1
% representation $pos                    <3>
→ value is a end-offset with a refcount of 2, object pointer at 0000000002D5769...
% 
% set code "set x 1" ; eval $code
→ 1
% representation $code                   <4>
→ value is a bytecode with a refcount of 2, object pointer at 0000000002D57E70,...

<1> Dictionary
<2> File paths
<3> List indices
<4> Compiled bytecode

Having looked at using representation for the purpose of exploring the different internal types, we now use it to delve into Tcl's memory management and reference counting mechanisms.

Object storage and reference counting

Tcl uses reference counting to manage its values. When a variable is assigned to another, rather than making a copy of the value contained in it, the reference count for the Tcl_Obj holding the value is incremented and the same Tcl_Obj value is assigned to the target variable.

% set avar "some value"
→ some value
% set bvar $avar
→ some value

Now when we look at the representations for avar and bvar, we will see that both point to the same Tcl_Obj structure in memory.

% representation $bvar
→ value is a pure string with a refcount of 3, object pointer at 0000000002D57540,
   → string representation "some value"
% representation $avar
→ value is a pure string with a refcount of 3, object pointer at 0000000002D57540,
   → string representation "some value"

The reference count for the Tcl_Obj value includes references from each of the two variables. In addition, any time a value is passed as an argument to a command (including to representation), its reference count is incremented to reflect its presence on the call stack.

Correspondingly, when a reference to the value is removed, the value's reference count is decremented.

% set bvar "some other value"
→ some other value
% representation $avar
→ value is a pure string with a refcount of 2, object pointer at 0000000002D57540,
   → string representation "some value"
% representation $bvar
→ value is a pure string with a refcount of 2, object pointer at 0000000002D56820,
   → string representation "some other value"

Notice that the two variables now point to different Tcl_Obj locations in memory and reference counts have been adjusted accordingly.
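Rather than comparing addresses by eye, a script can compare the object pointers directly. The sketch below assumes the output layout shown in this post; the helper name objptr is hypothetical.

```tcl
# Hypothetical helper: return the object pointer reported by representation.
proc objptr {value} {
    set text [::tcl::unsupported::representation $value]
    if {![regexp {object pointer at ([^,]+)} $text -> pointer]} {
        error "unrecognized representation output: $text"
    }
    return $pointer
}

set avar "some value"
set bvar $avar
# Both variables refer to the same Tcl_Obj at this point.
set sharedBefore [expr {[objptr $avar] eq [objptr $bvar]}]

set bvar "some other value"
# bvar now refers to a different Tcl_Obj, avar is untouched.
set sharedAfter [expr {[objptr $avar] eq [objptr $bvar]}]

puts "shared after assignment:   $sharedBefore"
puts "shared after reassignment: $sharedAfter"
```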

The literal table

The representation command can also be used to look at another aspect of the current Tcl implementation - the literal table.

Tcl internally maintains a table of all literal values encountered. In an effort to save memory, when compiling a procedure Tcl checks if any literals it encounters are already in this table and if so simply references them instead of creating a new Tcl_Obj with the same literal value. The following snippet demonstrates this.

% proc p1 {} {representation "just a literal"}
% p1
→ value is a pure string with a refcount of 3, object pointer at 0000000002D580E0,
   → string representation "just a literal"
% proc p2 {} {representation "just a literal"}
% p2
→ value is a pure string with a refcount of 4, object pointer at 0000000002D580E0,
   → string representation "just a literal"

Notice that even though the two literals are defined in different procedures, they both point to the same Tcl_Obj in memory.
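The comparison can also be scripted. This is a sketch under the assumption that the output layout matches what is shown in this post; the names objptr, lit1 and lit2 are hypothetical. Whether literals are shared across procedures is a detail of the current implementation, so the result is printed rather than assumed.

```tcl
# Hypothetical helper: return the object pointer reported by representation.
proc objptr {value} {
    set text [::tcl::unsupported::representation $value]
    if {![regexp {object pointer at ([^,]+)} $text -> pointer]} {
        error "unrecognized representation output: $text"
    }
    return $pointer
}

# Two separate procedures that use a literal with identical content.
proc lit1 {} { objptr "a shared literal" }
proc lit2 {} { objptr "a shared literal" }

set sameLiteral [expr {[lit1] eq [lit2]}]
puts "both procedures use the same Tcl_Obj: $sameLiteral"
```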

Next time...

Having poked around in Tcl's internal structures for data, we will do the same for code in the next post.