The current implementation of the Tcl-byte-code engine is built around
a large switch statement (the technique is called switch-threading).
It is actually quite hard to understand and to de-mangle, since there
are many goto-operations for optimization, etc. There exists several
other implementation techniques for execution engines coming from the
field of direct threaded code. In particular i invested into
differences between switch-threading, call-threading and label
threading. implementation mechanisms around, namely call-threading and
label-threading (direct threading). 

First of all, threaded.c is a short introductory example, showing in a
small concise program the differences of these techniques.  On my
machine i got the following results:

1: 0 seconds, 94591 usec
2: 0 seconds, 84749 usec
3: 0 seconds, 59811 usec

[1] switch threading
[2] call threading
[3] label threading 

This experiment shows that the implementation of switch-threading 
is the slowest, followed by call threading and label threading. 
However, the implementation of label-threading depends on a 
non-standard C extension (label address operator) needed for 
computed gotos. However, the label address operator is supported 
by at least three different compilers (supported on GCC, 
clang and IBM's XL C/C++).

Based on this study i built an experimental new code-engine for tcl
(integrated in nsf/nx) based on the following ideas:

- Use Tcl's basic implementation elements as base elements of the
  execution-engine: Tcl_Obj, Tcl_ObjCmdProc (objProcs).

- Instead of defining a "byte code", implement the code essentially as
  an array of objProcs, were as well the argument vectors of the
  objProcs are built as far as possible at compile time. This makes it
  possible to implement quickly an execution engine for all of Tcl.

- By using objProcs as basic mechanism, every tcl-command
  implementation will become effectively an "instruction" of the code
  engine. By defining new objProcs, the command set of the engine will
  be extended automatically. Specialized instructions with the same
  format can be registered depending on syntactic properties of the
  source (e.g. not all argument test will be required, part of the
  syntax errors can be caught at compile time, on could implement
  specialized versions for Tcl's "set" for setting and reading
  variables, etc.).

- The code engine is designed for allowing different resolving
  strategies for cmds (objProcs): come commands could be resolved at
  compile-time, some commands must be resolved at runtime. Similar
  experiments can be made in the object oriented case for objects and
  methods. It is not completely clear, if the binding time flexibility
  should be an option for a production version of the execution
  engine, since this would cause differences in the semantics of Tcl
  (when commands are redefined). However, this option provides some
  good evidence for potential speedups.

- The current byte code engine performs double dispatches on object
  oriented calls (one dispatch on the object, a second dispatch on the
  method). The new execution engine allows direct dispatch on methods.

- Exploit typing information: NSF/NX provides "optionally typing"
  information (e.g. x:int). This optional typing information can be
  used at runtime as well for optimizing the code.

- Provide an intermediate representation ("Tcl assembler") instead of
  "hard wiring" instructions to Tcl commands as implemented in
  Tcl. The Tcl assembler makes it easier to implement high-level
  optimizations to provide different implementations for e.g. a
  sequence of tcl instructions. Note that a Tcl compiler (a program
  translating Tcl source into Tcl assembler) can be implemented in Tcl
  as well. Furthermore, for some high-performance requirements, it is
  possible to simply write Tcl assembly code by hand (this is at least
  an alternative to C-coding).  Note that the Tcl assembler is not
  machine dependent, the instructions are those of the "abstract Tcl
  machine". Currently, the syntax of the Tcl assembler is a nested Tcl
  list.

- As explained above, label-threading is the fastest implementation
  technique of the compared approaches, but it is not supported by all
  c-compilers. Implemented execution engines are sufficiently
  different to make maintenance of two (or more) execution engines
  hard. In order to address this issue, i implemented an assembly code
  generator and an execution code generator in NX from a common,
  declarative source. The assembly code and the execution engines can
  be included in the source. Depending on the C-compiler, the "right"
  include can be chosen. Note that it is certainly possible to
  implement some more different (optimized) execution engines as well.

- Replacing for Tcl the complete byte-code engine is quite a large
  task. In order to experiment with the new execution engines i
  implemented asm-proc and asm-method as counterparts of "proc" and
  "method", which will take in their body the tcl assembly
  language. When such a command is called, the appropriate execution
  engine is called. This way, the new execution engine can co-exist
  with Tcl's classical byte-code engine, it should not be as well
  quite easy to follow this approach in jimtcl.

- In cases where the new tcl-compilation and assembly generation
  fails, one could fall back to the basic implementation (e.g. tcl
  byte code engine).
  
Some preliminary results:

  - If one executes just the objProcs (e.g. for "set" Tcl_SetObjCmd),
    the old code engine is faster, but the new ones "work" as well. It
    is certainly possible to provide new optimized instructions for
    certain tasks (accessing/setting variables, increment operations
    etc.) using the same format.

  - Using such instructions, both new execution engines (for
    call-threading and label-threading) show to be much faster than
    Tcl's current bytecode engine.

  - The current implementation based on call-threading is for the
    example below up to about three times faster than the
    Tcl-implementation, the label-threading variant is about 5 times
    faster.

Current shortcomings / work in progress: 

  - The current implementation is not re-entrant (no stack
    integration, proc local variables are currently stored in the
    assembly proc structure, not on the stack).

  - For allowing "uplevels" and "upvars" and compiled locals, we will
    need in Tcl the proc call frames, which will certainly cost some
    performance. For high-performance situations, these frames could
    be made optional (these are not available if one would code the
    procs straightforward in C). 

    Note that the example below with the for loop and incr, reentrancy
    and stack frames are not required/needed at all (since the code
    does neither call a proc (or eval) and the code is not recursive -
    it is not clear, how often such optimization would be possible).

  - The instruction set is just sufficient for the about 30 examples
    in my current regression test set. The instruction set is just
    experimental, and not sufficiently thoughtfully designed.

  - The assembly language is very primitive: one has a set of "slot"
    declarations which are constants or variables. Later references
    address these slots via their indices (starting with
    0). Similarly, the instructions are numbered as well (the code is
    a C array of instructions). A "goto" simply addresses the next
    instruction via the array index. All "declarations" of local vars
    have to be written before the "instructions".

  - It is not clear, how much Tcl compatibility is needed or wanted
    (interface to compiled locals, variable-/call-traces, etc.)

  - No effort so far on compiling Tcl source into Tcl assembly.

Below is a short example in Tcl (and the Tcl byte code engine) and in
Tcl assembly in two implementation variants, one with Tcl_Objs, one
with integers (in case the typing information is available or
inferable from the Tcl source). 

The timing results based on clang under macOS:

Call Threading:
asm/t.001:	6.64
asm/t.002:	4.04
asm/t.003:	2.46

Label Threading:
asm/t.001:	6.58
asm/t.002:	2.74
asm/t.003:	1.33

I'll push the version over the holidays to our public git repository.

All the best
-gustaf


==================================================
package req nx::test
nx::Test parameter count 100000

proc sum10.tcl {} {
  set sum 0
  for {set i 0} {$i < 100} {incr i} {
    incr sum $i
  }
  return $sum
}

# Implementation in Tcl assembly, using tcl-objs for "sum", "i" and
# the constants

nsf::asm::proc sum10.asm1 {} {
  {obj sum}
  {obj i}
  {obj 0}
  {obj 1}
  {obj 100}
  {obj 0}
  {var obj 0} 
  {var obj 1}              
  {duplicateObj slot 6 obj 2}  
  {duplicateObj slot 7 obj 5}  
  {leIntObj slot 4 slot 7}
  {jumpTrue instruction 7}
  {incrObj slot 6 slot 7} 
  {incrObj slot 7 slot 3} 
  {jump instruction 2}
  {setResult slot 6}
}

# Implementation in Tcl assembly, using integers for "sum", "i" and
# the constants

nsf::asm::proc sum10.asm2 {} {
  {obj sum}
  {obj i}
  {integer int 1}
  {integer int 100}
  {integer int 0}
  {integer int 0}
  {setInt slot 4 int 0}
  {setInt slot 5 int 0}
  {leInt slot 3 slot 5}
  {jumpTrue instruction 7}
  {incrInt slot 4 slot 5} 
  {incrInt slot 5 slot 2} 
  {jump instruction 2}
  {setResultInt slot 4}
}

? {sum10.tcl} "4950"
? {sum10.asm1} "4950"
? {sum10.asm2} "4950"