Forth Recognizers in SwiftForth

This document describes the initial implementation of the Forth Recognizers extension in SwiftForth, beginning with the SwiftForth 4.0.0-RC75 beta release.

Background

The Forth interpreter doesn’t have a standard method for extending how it processes various kinds of text tokens in the input stream. SwiftForth and other systems have long had hooks that provided places to extend the processing of text tokens (e.g., optional floating-point support, parsing Windows system constants, OOP packages like SWOOP, local variables, etc.). The proposed Forth Recognizers wordset allows the system to be extended in a standard way. It also turns out to be a nice simplification that reduces the complexity of the SwiftForth interpret and compile loops.

Recognizers

The recognizer implementation divides the classic Forth interpreter into three blocks:

  • Interpreter. It maintains STATE and organizes the work.
  • Token recognizer. This is called from the interpreter and analyzes each text token to see if it matches the criteria for a certain data type.
  • Handler. The result of the parsing words is handed over to the interpreter with a pointer to the data-specific handling methods.

There are three methods for each data type:

  • Interpret
  • Compile
  • Postpone

The Interpret and Compile methods are called from within the interpreter loop based on STATE. The Postpone method is called directly from POSTPONE.

The combination of a parsing word and the set of data handling words is called a recognizer. There is no strict one-to-one relation between the parsing words and the data handling sets. For example, the data handling set for single-cell numbers can be used by different parsing words.

Interpreter Loop

The simplified (and extensible) interpreter loop looks like this:

   BEGIN  ?STACK  PARSE-NAME DUP WHILE
      RECSTACK RECOGNIZE
      STATE @ 2+ CELLS + @EXECUTE
   REPEAT 2DROP

  • After a quick check for stack underflow, we use the Standard word PARSE-NAME to return the address and length of the next token in the input stream. The loop terminates on zero length (i.e., no more tokens to parse).
  • For each token, we call RECOGNIZE with the system recognizer sequence RECSTACK.
  • The result of RECOGNIZE (the address of a methods list) is used as an execution vector indexed by STATE.
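The index arithmetic in the last step can be checked at the console. This is a minimal sketch that assumes STATE holds 0 while interpreting and -1 while compiling, which the loop above relies on; METHOD-INDEX is a hypothetical helper name, not a SwiftForth word.

```forth
\ Hypothetical helper: map a STATE value to a cell index in a methods list.
\ Assumes STATE is 0 when interpreting and -1 when compiling.
: METHOD-INDEX ( state -- index )  2 + ;

 0 METHOD-INDEX .   \ interpreting selects cell 2 (Interpret)
-1 METHOD-INDEX .   \ compiling selects cell 1 (Compile)
```

Cell 0 is never selected by the loop; it holds the Postpone method and is reached only from POSTPONE.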

Token Recognizers

Each token recognizer has this stack effect:

REC-SOMETYPE ( c-addr len -- i*x addr1 | addr2 )

It takes the text token address and length (c-addr len) as inputs. If it recognizes the token, it returns the address of the handler vector (addr1) along with any data required by the interpret, compile, or postpone behaviors in the handler vector. The recognizing word must not change the string. If the token is not recognized, it returns the address RECTYPE-NONE, the “unrecognized” handler vector.

Note that while the primary use in this implementation is in the interpreter loop, any string can be passed to a recognizer word. This is also how POSTPONE is implemented.

: POSTPONE ( "name" -- )
   PARSE-NAME  RECSTACK RECOGNIZE  @EXECUTE ;  IMMEDIATE

Handler Vectors

The defining word RECTYPE: compiles the three-element vector table that handles a specific recognizer type (“rectype”).

RECTYPE: ( xt1 xt2 xt3 "name" -- ) 

RECTYPE: defines a recognizer vector table and stores xt1, xt2, and xt3 in it at the following cell offsets, so the last argument (xt3) occupies the first cell:

Cell Offset   Vector   Action
0             xt3      Postpone
1             xt2      Compile
2             xt1      Interpret
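
As an illustration of that layout, a table with the same cell order can be built in standard Forth. MY-RECTYPE: and the three accessor words are hypothetical names for this sketch; it is not SwiftForth's definition of RECTYPE:.

```forth
\ Hypothetical model of RECTYPE: in standard Forth.
\ Each , stores the current top of stack, so xt3 lands in cell 0.
: MY-RECTYPE: ( xt1 xt2 xt3 "name" -- )
   CREATE  , , , ;

\ Hypothetical accessors matching the cell offsets listed above.
: >RT-POSTPONE  ( addr -- xt )  @ ;
: >RT-COMPILE   ( addr -- xt )  CELL+ @ ;
: >RT-INTERPRET ( addr -- xt )  2 CELLS + @ ;

\ Arbitrary xts, used only to show where each argument ends up.
' DUP ' DROP ' SWAP MY-RECTYPE: DEMO-RECTYPE
DEMO-RECTYPE >RT-INTERPRET ' DUP = .   \ true flag: xt1 is in cell 2
```

Storing the Postpone action first means POSTPONE can fetch it with a plain @EXECUTE on the table address, with no offset arithmetic.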

Recognizer Sequences

A recognizer sequence is a “stack” of token recognizers. The first cell is the number of recognizers and is followed by that many execution tokens.

The first element in the SwiftForth kernel’s default recognizer sequence is REC-FIND, which recognizes Forth words in the current dictionary search order. Note the separate rectypes for non-immediate and immediate words.

' EXECUTE ' COMPILE, ' POSTPONE, RECTYPE: RECTYPE-WORD
' EXECUTE ' EXECUTE ' COMPILE, RECTYPE: RECTYPE-IMM

: REC-FIND ( c-addr len -- xt addr1 | addr2 )
   (FIND) CASE
      -1 OF  RECTYPE-WORD  ENDOF
      1 OF  RECTYPE-IMM  ENDOF
      0 OF  RECTYPE-NONE  ENDOF
   ENDCASE ;

The second element in the sequence is REC-NUM, which recognizes single and double numbers.

' DROP ' EXECUTE ' NOPOST RECTYPE: RECTYPE-NUM

: REC-NUM ( c-addr len -- i*x xt addr1 | addr2 )
   ANY-NUMBER? CASE
      0 OF  RECTYPE-NONE  ENDOF
      1 OF  ['] LITERAL RECTYPE-NUM  ENDOF
      2 OF  ['] 2LITERAL RECTYPE-NUM  ENDOF
   ENDCASE ;
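
A user-defined recognizer can follow the same pattern as REC-NUM. The sketch below is an assumption-laden example, not part of SwiftForth: REC-0X is a hypothetical recognizer for tokens with a lowercase 0x prefix, and it reuses RECTYPE-NUM by returning the xt of LITERAL, as REC-NUM does for single-cell numbers. The guard at the top only supplies placeholder tables so the file also loads on a system without the recognizer words.

```forth
\ Placeholders, used only where the recognizer words are absent.
[UNDEFINED] RECTYPE-NONE [IF]
   CREATE RECTYPE-NONE  0 , 0 , 0 ,
   CREATE RECTYPE-NUM   0 , 0 , 0 ,
[THEN]

\ Hypothetical recognizer: 0x followed by one or more hex digits.
: REC-0X ( c-addr len -- u xt addr1 | addr2 )
   DUP 3 < IF  2DROP RECTYPE-NONE EXIT  THEN       \ too short for 0x + digit
   OVER 2 S" 0x" COMPARE IF  2DROP RECTYPE-NONE EXIT  THEN
   2 /STRING                                       \ skip the prefix
   BASE @ >R  HEX
   0 0 2SWAP >NUMBER                               ( ud c-addr' len' )
   R> BASE !                                       \ restore the caller's base
   NIP IF  2DROP RECTYPE-NONE EXIT  THEN           \ unconverted characters left
   DROP                                            \ keep the low cell only
   ['] LITERAL RECTYPE-NUM ;
```

The word reads the token without modifying it, as required of recognizers. On a system with the recognizer wordset it could be installed with ' REC-0X +RECOGNIZER; whether an earlier recognizer in the sequence already accepts such tokens is not stated here.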

Glossary

Recognizer Types

RECTYPE:
( xt1 xt2 xt3 "name" -- )

Define a named recognizer type (rectype) table given the interpret (xt1), compile (xt2), and postpone (xt3) actions.

RECTYPE-NONE
( -- addr )

Returns the address of the “not found” rectype.

RECTYPE-WORD
( -- addr )

Returns the address of the rectype for non-immediate Forth words.

RECTYPE-IMM
( -- addr )

Returns the address of the rectype for immediate Forth words.

RECTYPE-NUM
( -- addr )

Returns the address of the rectype for numeric values (single, double, and float).

Text Token Recognizers

REC-FIND
( c-addr len -- xt addr1 | addr2 )

Searches the Forth dictionary for the parsed token c-addr len and if found, returns its xt and the address addr1 of the rectype for subsequent processing. If not found, returns RECTYPE-NONE.

REC-NUM
( c-addr len -- i*x xt addr1 | addr2 )

Attempts to convert the parsed token c-addr len into a single or double number. If successful, returns the number i*x (single or double integer), the xt of the corresponding literal compiling word (LITERAL or 2LITERAL), and the address addr1 of the rectype for subsequent processing. If not successful, returns RECTYPE-NONE.

REC-CHAR
( c-addr len -- u xt addr1 | addr2 )

Attempts to convert the parsed token into a character literal (a single character set in single quotes). If successful, returns the character value u, the xt of LITERAL, and the address addr1 of the rectype for subsequent processing. If not successful, returns RECTYPE-NONE.

REC-LOCAL
( c-addr len -- xt addr1 | addr2 )

Searches the list of defined local variables within the current colon definition and if found, returns its xt and the address of the rectype for subsequent processing. If not found, returns RECTYPE-NONE.

REC-WINCON
( c-addr len -- x xt addr1 | addr2 )

Searches the list of Windows constants and if found, returns the named constant’s value x, the associated xt, and the address addr1 of the rectype for subsequent processing. If not found, returns RECTYPE-NONE.

REC-FNUM
( c-addr len -- xt addr1 | addr2 )
( F: -- r | )

Attempts to convert the parsed token c-addr len into a floating point value following the rules in Forth Standard 12.3.7 (Text interpreter input number conversion). If successful, returns the value on the floating-point stack and the xt of FLITERAL on the data stack along with the rectype addr1 for subsequent processing. If unsuccessful, returns RECTYPE-NONE.

Recognizer Sequences

SET-RECOGNIZERS
( xtn ... xt1 n -- )

Sets the system recognizer sequence to the n recognizers whose execution tokens are on the stack. The topmost execution token (xt1, just below the count n) becomes the first in the recognizer sequence.

GET-RECOGNIZERS
( -- xtn ... xt1 n )

Returns the current recognizer sequence with the number of members on top of the data stack.
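
GET-RECOGNIZERS and SET-RECOGNIZERS can be combined to edit the sequence. The following is a minimal sketch, assuming the ordering documented above (the xt just below the count is the first member) and assuming the system sequence has room for one more entry; RECOGNIZER-FIRST is a hypothetical helper name.

```forth
\ Hypothetical helper: make xt the first recognizer tried.
: RECOGNIZER-FIRST ( xt -- )
   >R GET-RECOGNIZERS        ( xtn ... xt1 n ) ( R: xt )
   R> SWAP 1+                ( xtn ... xt1 xt n+1 )
   SET-RECOGNIZERS ;
```

Placing a recognizer first lets it claim tokens before REC-FIND sees them, so it should reject anything it does not own quickly.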

+RECOGNIZER
( xt -- )

Append recognizer xt to the system recognizer sequence.

-RECOGNIZER
( -- )

Delete the last recognizer from the system recognizer sequence.

RECOGNIZE
( c-addr len addr1 -- i*x addr2 )

Takes the parsed token c-addr len and passes it to the recognizer sequence at addr1. Returns the optional parameters i*x along with the rectype addr2 of the first recognizer that returns a rectype other than RECTYPE-NONE. If the token cannot be recognized, returns the original string c-addr len and RECTYPE-NONE as addr2.
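
The behavior described for RECOGNIZE can be modeled in standard Forth. This is a sketch under stated assumptions, not SwiftForth's implementation: MY-RECOGNIZE is a hypothetical name, and the sequence at addr1 is assumed to be a count cell followed by that many execution tokens, as described under Recognizer Sequences. The guard only supplies a placeholder RECTYPE-NONE on systems that lack one.

```forth
[UNDEFINED] RECTYPE-NONE [IF]
   CREATE RECTYPE-NONE  0 , 0 , 0 ,    \ placeholder table
[THEN]

\ Hypothetical model: try each recognizer in order until one succeeds.
: MY-RECOGNIZE ( c-addr len addr1 -- i*x addr2 | c-addr len addr2 )
   DUP CELL+ SWAP @                     ( c-addr len a n )
   0 ?DO
      ROT ROT 2DUP 2>R                  \ keep a copy of the token
      ROT DUP >R                        ( c-addr len a ) ( R: c-addr len a )
      @ EXECUTE                         ( i*x rectype )
      DUP RECTYPE-NONE <> IF            \ recognized: discard saved items
         R> DROP  2R> 2DROP  UNLOOP EXIT
      THEN
      DROP  R> 2R> ROT CELL+            ( c-addr len a' )
   LOOP
   DROP RECTYPE-NONE ;                  \ nothing matched: string is returned
```

The token is saved on the return stack because a successful recognizer may leave any number of cells, so the string cannot be found again on the data stack afterward.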