Lexing and Parsing

Lex and YACC HOW-TO

Lex and YACC are tools that are used together to parse languages. This is useful for reading configuration files or for compiling languages into other languages (including machine code).

Lex and YACC (Yet Another Compiler Compiler) are available as the free software programs Flex and Bison. The actions taken are programmed in C. The programs ocamllex and ocamlyacc perform the same task, but the actions are coded in Ocaml. Ocaml is well suited for this purpose because of the ease of defining complex data types to fit the problem at hand.

Tying lex and yacc together

To put together a useful program the lexer must know about the token types that the parser interprets. This is accomplished by opening the parser module in the preamble of the lexer, so the parser must be built first (which makes sense, since you need to know the structure of the language to start with). For example, the parser may process REAL tokens that carry an associated real-number value. The lexer then identifies sequences of characters in the input stream that correspond to real numbers and outputs them as REAL tokens with the associated value.

More complex types are defined in the header of the parser (or opened from external modules); i.e. simple tokens can just be declared with %token INT etc., but tokens carrying more complex values (e.g. tree ADTs) must have their types declared in the header.
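As a minimal sketch of this interplay (the REAL token and the regular expression are hypothetical, purely to illustrate the mechanism): the parser file declares

%token <float> REAL

and the lexer, having opened the parser module in its preamble, can then build and emit that token:

| ['0'-'9']+ '.' ['0'-'9']* as num { REAL (float_of_string num) }  (* hypothetical rule *)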

Cookie jar parser

Before getting too ambitious (e.g. defining tree data types to store a full hierarchical structure) start with the very basics: lexing and parsing a [cookie jar format] file. This consists of entries that are strings of text separated by a delimiter on a separate line (e.g. % on a line by itself).

The test file I use is the first few lines of my fortune file, containing quotes from [The Evil Overlord list]:

My Legions of Terror will have helmets with clear plexiglass visors, not
face-concealing ones.
  -- The Evil Overlord Guide
%
My ventilation ducts will be too small to crawl through.
  -- The Evil Overlord Guide
%
My noble half-brother whose throne I usurped will be killed, not kept
anonymously imprisoned in a forgotten cell of my dungeon.
  -- The Evil Overlord Guide
%

The parser then just needs to identify the EOF, COOKIE and DELIMITER tokens. As each COOKIE is identified it is added to a list, which is returned when EOF is reached. The lexer is simple: look for the delimiter, and anything else is either a string that goes into a COOKIE or the eof token. At least that was the first approach, but due to greedy matching the rule "match any character" matched the entire file (see the sketch below). I replaced this rule with one that matches words, numbers, some punctuation and whitespace. This will need extending to match HTML tags etc., but leave it at this level for the simple case.
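The offending rule looked something like this (a hypothetical reconstruction, not part of the final lexer):

| _+ as str { COOKIE (str) }  (* _ matches any character, so the longest match is the whole file *)

Because ocamllex always takes the longest possible match, this rule swallows the delimiters along with everything else, and DELIMITER is never emitted.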

The file cookieParse.mly: this parses the grammar COOKIE DELIMITER COOKIE DELIMITER ... COOKIE DELIMITER EOF and returns a list of all the cookies.

%token EOF DELIMITER
%token <string> COOKIE
%start main
%type <string list> main
%%

main:
| COOKIE DELIMITER main {$1 :: $3}
| EOF {[]}

The file cookieLex.mll: this lexer takes the incoming character stream and emits a token each time it recognises one. The lexer does not recursively call itself (which is the behaviour of the lexers in SooHyoung Oh's Ocamllex tutorial); rather, the parser is expected to do this. An extra debugging variable is included to keep track of which cookie is being examined.

{ 
  open CookieParse
  let line=ref 1
}

let delim = "%"
let word = ['a'-'z''A'-'Z'] ['a'-'z''A'-'Z''0'-'9']*
let number = ['0'-'9']+
let punct = ['.' ',' '?' '-']
let ws = ['\n' '\t' ' ']+

rule token = parse
  | delim ws*                            {incr line; DELIMITER}
  | (word | punct | ws | number)+ as str {COOKIE (str)}
  | eof          {EOF}
  | _ {failwith((Lexing.lexeme lexbuf) ^ 
       ": mistake at cookie " ^ string_of_int !line)}
{
}

Compile things in the order:

ocamlyacc cookieParse.mly
ocamlc -c cookieParse.mli
ocamlc -c cookieParse.ml
ocamllex cookieLex.mll
ocamlc -c cookieLex.ml

and make a toplevel, if required, with

ocamlmktop cookieParse.cmo cookieLex.cmo -o cookie.top

Use a Makefile for this:

all: cookieParse.cmo cookieLex.cmo main.cmo
    ocamlmktop cookieParse.cmo cookieLex.cmo -o cookie.top

main.cmo: main.ml cookieParse.cmo cookieLex.cmo
    ocamlc -c main.ml

cookieLex.cmo: cookieLex.ml
    ocamlc -c cookieLex.ml

cookieParse.cmo: cookieParse.ml cookieParse.cmi
    ocamlc -c cookieParse.ml

cookieParse.cmi: cookieParse.mli
    ocamlc -c cookieParse.mli

cookieParse.mli cookieParse.ml: cookieParse.mly
    ocamlyacc cookieParse.mly

cookieLex.ml: cookieLex.mll cookieParse.cmi
    ocamllex cookieLex.mll

One annoying thing about make is that the actions (the indented lines) must be indented with a tab, NOT spaces. Since I usually have expandtab set in vim, this requires a quick :set noexpandtab when working with the Makefile.
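A vim modeline at the top of the Makefile saves remembering this (assuming modelines are enabled in your vim configuration):

# vim: noexpandtab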

Call the lexer and parser in the toplevel using:

let cookielist=
    let lexbuf = Lexing.from_channel (open_in "testcookie.txt") in
    CookieParse.main CookieLex.token lexbuf;;

Or write a main.ml file that calls the lexer and parser and then does something with the list of cookies that is returned.
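A minimal main.ml along these lines (a sketch; it reuses the testcookie.txt filename from the toplevel example and just prints each cookie):

let main () =
  (* Parse the test file and print one cookie per line *)
  let lexbuf = Lexing.from_channel (open_in "testcookie.txt") in
  let cookies = CookieParse.main CookieLex.token lexbuf in
  List.iter print_endline cookies;;

let _ = Printexc.print main ();;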

Thermostat Example

Bert Hubert's [Lex and YACC primer/HOWTO] has an example of how to use Lex and YACC to construct a parser that acts as the control program for a thermostat. The action of the parser is only to print out the response of the thermostat to the commands entered, but it would be trivial to modify this to output low-level commands to an actual device connected to the parser by a file handle.

The example has been extended with rudimentary error checking in the lexer and parser, to identify unknown words and to ignore parse errors respectively, rather than crashing.

TO DO: implement the thermostat control program as an Erlang application, i.e. have a gen_server open a port to the main.ml program (actually the compiled version) and pass requests back and forth from the user to the parser. In this case the error handling code should be removed from the lexer and parser, so that the supervisor in the Erlang application is relied on to restart the parser if it crashes.

The lexer thermLex.mll:

{open ThermParse}

let ws = ['\n''\t'' ']
let number = ['0'-'9']+
let alphanum = ['a'-'z''A'-'Z''0'-'9']

rule token = parse
  | "quit"|"q"|"Q"      {EXIT}
  | "heat"              {TOKHEAT}
  | "temp"              {TOKTEMP}
  | "target"            {TOKTARGET}
  | "on"|"off" as state {STATE state}
  | ws+                 {token lexbuf}
  | number as num       {NUMBER (int_of_string num)}
  | alphanum+ as str    
            {Printf.printf "Unknown token: %s\n" str;
             flush stdout;
             token lexbuf}
  | eof                 {EXIT}  (* treat end of input like "quit" *)

{}

The parser thermParse.mly:

%token TOKHEAT TOKTEMP TOKTARGET EXIT
%token <string> STATE
%token <int> NUMBER
%start commands
%type <unit> commands
%%

commands:            {}
  | commands command {}
  | commands error   {};

command:
    heat_switch      {}
  | target_set       {}
  | EXIT             {exit 0};

heat_switch:
  | TOKHEAT STATE {Printf.printf "\tHeat turned %s\n" $2; flush stdout};

target_set:
  | TOKTARGET TOKTEMP NUMBER 
      {Printf.printf "\tTemperature set to %d\n" $3; flush stdout};

The Makefile is the same as for the cookie parser example, with all instances of "cookie" replaced by "therm".

Write a main.ml program that uses the parser, or call from the therm.top toplevel:

  let lexbuf = Lexing.from_channel stdin in
  ThermParse.commands ThermLex.token lexbuf;;
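Or, as a main.ml sketch mirroring the toplevel call (type quit to exit):

let main () =
  (* Hand stdin to the parser; it loops until EXIT calls exit 0 *)
  let lexbuf = Lexing.from_channel stdin in
  ThermParse.commands ThermLex.token lexbuf;;

let _ = Printexc.print main ();;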

Key-Value Parser

Write a lexer/parser for records stored as keys and values, one record per line. Keys start at the beginning of a line and are alphanumeric, starting with a letter; the key field is terminated by a colon. The value is any text from the colon to the newline ending the line. At present this text cannot contain the colon character. I tried breaking the lexer up into two separate rules to deal with this problem, but this didn't interact with the parser correctly. The correct solution is probably to rewrite the parser to treat colon characters as separate tokens and modify the grammar to something like the following:

kvpair : WORD COLON values EOL
values : WORD values
       | COLON values
       | (* empty *)
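In ocamlyacc form the revised parser might look like this (an untested sketch; WORD and COLON are hypothetical tokens the lexer would have to emit, and in practice WORD would need to carry its trailing whitespace or the words get concatenated without spaces):

%token EOL EOF COLON
%token <string> WORD
%start main
%type <(string * string) list> main
%%

main:
  | kvpair main { $1 :: $2 }
  | EOF         { [] };

kvpair:
  | WORD COLON values EOL { ($1, $3) };

values:
  | WORD values  { $1 ^ $2 }
  | COLON values { ":" ^ $2 }
  |              { "" };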

Folding white space (FWS) is used to fold a line onto the next line and consists of LF ('\n') followed by a whitespace character. When lexing, any LF immediately followed by a whitespace character is ignored (as is the whitespace itself; this may count as a bug, which I'll fix when I figure out how to get the lexer to unread a character).
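Since the parser below already concatenates successive VALUE tokens, one possible fix (a sketch, not applied in the code below) is for the folding rule to emit a single space as a VALUE rather than silently discarding the whitespace:

| delim ws { incr line; VALUE " " }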

This is almost how folding whitespace is treated in RFC 2822, except that to be compliant with that RFC I would need to treat CRLF as the newline sequence. Since I work mainly on Unix I don't want to do this: Unix uses LF (0x0A) as the text file line separator, while DOS uses CRLF (0x0D 0x0A). See RFC 2822 (the successor to RFC 822, which describes a format for electronic mail).

kvLex.mll: The lexer, with added output to show what is going on.

{ 
  open KvParse
  let line=ref 1
}

let delim = '\n'
let word = ['a'-'z''A'-'Z'] ['a'-'z''A'-'Z''0'-'9']*
let number = ['0'-'9']+
let punct = ['.' ',' '?' '-']
let ws = ['\t' ' ']
let gentext = [^'\n'':']

rule token = parse
  | delim ws
      { (* folding whitespace: drop the LF and the following ws char *)
        incr line; token lexbuf }
  | delim                                
      { incr line; 
        Printf.printf "EOL token\n";
        EOL }
  | (word+ as key) ':'                   
      { Printf.printf "KEY token: %s\n" key;
        KEY (key) }
  | (gentext)+ as str 
      { Printf.printf "VALUE token: %s\n" str;
        VALUE (str) }
  | eof { EOF }
  | _ { failwith((Lexing.lexeme lexbuf) ^ 
        ": mistake at line " ^ string_of_int !line)}
{
}

kvParse.mly: The parser.

%token EOL EOF
%token <string> KEY
%token <string> VALUE
%start main
%type <(string * string) list> main
%%

main:            
  | kvpair main {$1 :: $2}
  | EOF  {[]};

kvpair:
  | KEY values EOL {($1, $2)}

values:
  | VALUE {$1}
  | values VALUE {$1 ^ $2}

kvmain.ml: An example application that uses the lexer and parser to parse a text file and print out the key, value pairs.

let main() = 
  let lexbuf = Lexing.from_channel stdin in
  let parselist = KvParse.main KvLex.token lexbuf in
  List.iter (fun (k,v) -> 
             Printf.printf "Key: %s Value: %s\n" k v) parselist;;

let _ = Printexc.print main ();;
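Build and run it following the same pattern as the cookie example, using the test data below:

ocamlyacc kvParse.mly
ocamlc -c kvParse.mli
ocamlc -c kvParse.ml
ocamllex kvLex.mll
ocamlc -c kvLex.ml
ocamlc -c kvmain.ml
ocamlc -o kvmain kvParse.cmo kvLex.cmo kvmain.cmo
./kvmain < kvtest.txt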

kvtest.txt: Some test data.

key1:This is the first key
key2: and this is the second key
key3:This is a multi-
 line folded comment. It treats
  a newline followed by a space or tab as a blank.
key4:The fourth and final line.

Pulse sequences for the Ion Trap experiment

Write a lexer/parser for specifying pulse sequences for the experimental control program in a higher-level scripting language. The output of the program should be a .SEQ file to be read by the existing experimental control program. The grammar is designed so that pulse sequences are made up of sets of pulses or of other pulse sequences. The present implementation allows for repetition of a pulse or sequence within a sequence; branching will be introduced in a later version.

Also on the to-do list is to allow pulses to take parameters, such as pulse duration, and then to supply the parameters within the pulse sequence. At present the duration is fixed in the pulse definition and a sequence can only use pulses exactly as defined, so two pulses with the same lasers on but different durations require two pulse definitions, rather than a single pulse definition called with the two different durations from within the sequence definition. A possible grammar extension is sketched below.
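This sketch (purely an illustration of the idea; none of it is implemented) would allow an optional duration at the point where a pulse is used, overriding the one stored in pulse_table:

pulsename : NAME SEMICOLON
          | NAME duration SEMICOLON

where duration is the existing nonterminal and the second form would substitute the new duration into the pulse before adding it to the sequence.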

pulseLex.mll: The lexer for the sequence definition file. Whitespace is ignored, and comments, which run from a hash symbol to the end of the line, are also ignored.

{ 
  open PulseParse
  let line=ref 1
}

let delim = '\n'
let word = ['a'-'z''A'-'Z'] ['a'-'z''A'-'Z''0'-'9''_']*
let number = ['0'-'9']+
let punct = ['.' ',' '?' '-']
let ws = ['\t'' ']
let gentext = [^'\n'':']

rule token = parse
  | '\n' { incr line; token lexbuf }  (* count lines for error messages *)
  | ws+  { token lexbuf }
  | '#'[^'\n']*'\n' { incr line; token lexbuf } 
  | '{'  { (* Printf.printf "LBRACE\n"; *) LBRACE }
  | '}'  { (* Printf.printf "RBRACE\n"; *) RBRACE }
  | '*'  { (* Printf.printf "TIMES\n"; *) TIMES }
  | ','  { (* Printf.printf "COMMA\n"; *) COMMA }
  | ';'  { (* Printf.printf "SEMICOLON\n"; *) SEMICOLON }
  | "us" { (* Printf.printf "US\n"; *) US }
  | "ms" { (* Printf.printf "MS\n"; *) MS }
  | '('  { (* Printf.printf "LBRACKET\n"; *) LBRACKET }
  | ')'  { (* Printf.printf "RBRACKET\n"; *) RBRACKET }
  | word as w 
         { (* Printf.printf "NAME %s\n" w; *) NAME (w) }
  | number as n 
         { (* Printf.printf "NUMBER %s \n" n; *) 
           NUMBER (int_of_string n) }
  | eof { EOF }
  | _ { failwith((Lexing.lexeme lexbuf) ^ 
        ": mistake at line " ^ string_of_int !line)}
{
}

pulseParse.mly: The parser inserts pulses and sequences into a pulse hash table and a sequence hash table respectively. On reaching end-of-file the pulse sequence named "main" is extracted from the sequence hash table and used to construct the output file. In the program below the output is a homemade format that is read by the Pascal program running on the experimental control computer.

%{
  (* Hash table to store pulses *)
  let pulse_table = Hashtbl.create 16
  (* Hash table to store sequences *)
  let seq_table   = Hashtbl.create 16
  (* Function to repeat a sequence *)
  let replist n lst = 
    let rec aux k acc = match k with
    | 0 -> acc
    | _ -> aux (k-1) (lst @ acc)
    in aux n [];;
  (* Function to convert dout strings to number *)
  let dacnum str = match str with
    | "master_397" -> 0
    | "doppler_397" -> 15
    | "sigma_397" -> 10
    | "probe_397" -> 9
    | "repump_866" -> 1
    | "readout_393" -> 8
    | "readout_850" -> 2
    | "deshelve_854" -> 14
    | "switch_to_carrier" -> 5
    | "switch_to_rsb" -> 6
    | "r_para" -> 12
    | "r_perp" -> 13
    | "count_a" -> 24
    | "count_b" -> 25
    | _ -> 0;;
  let calc_dout lst =
    List.fold_left (fun acc x -> acc +(1 lsl (dacnum x))) 0 lst;;
  let dac_str lst =
    (* Bytes is used because OCaml strings are immutable *)
    let outstr = Bytes.make 30 '.' in
    let dac_labels = "MrshEXYZSPåbRNdD--------AB.....V" in  
    List.iter (fun x -> Bytes.set outstr (dacnum x) dac_labels.[dacnum x]) lst;
    Bytes.to_string outstr;;
%}

%token LBRACE RBRACE SEMICOLON COMMA 
%token TIMES EOF LBRACKET RBRACKET
%token US MS 
%token <int> NUMBER
%token <string> NAME
%start pulsedefs 
%type <unit> pulsedefs
%%

pulsedefs:            
  | pulse pulsedefs { }
  | sequence pulsedefs { }
  | EOF             
    { let pulselist = Hashtbl.find seq_table "main" in
      List.iter (fun (name, (ti, tf), lst) -> 
                 begin
                   Printf.printf "%6d %6d %11d %2d %2d " ti tf
                   (calc_dout lst) 1 0;
                   Printf.printf " %30s" (dac_str lst);
                   Printf.printf " %s" name;
                   Printf.printf "\n"
                 end)
      pulselist};

pulse:
  | NAME duration LBRACE outputs RBRACE 
      { Hashtbl.add pulse_table $1 ($1, $2, $4) };

duration:
  | LBRACKET NUMBER time RBRACKET 
      { ($2 * $3, $2 * $3) }
  | LBRACKET NUMBER time COMMA NUMBER time RBRACKET 
      { ($2 * $3, $5 * $6) };

time:
  | US { 1 }
  | MS { 1000 };

outputs:
  | NAME COMMA outputs { $1 :: $3 } 
  | NAME               { [ $1 ] };

sequence: 
  | NAME LBRACE pulses RBRACE { Hashtbl.add seq_table $1 $3};

pulses:
  | pulsename pulses {$1 @ $2}
  | NUMBER TIMES pulsename pulses {(replist $1 $3) @ $4 }
  |                  { [] }

pulsename:
  | NAME SEMICOLON  
      { try [ Hashtbl.find pulse_table $1 ]
        with Not_found ->
          try Hashtbl.find seq_table $1 
          with Not_found -> failwith("Unknown pulse " ^ $1 ^ "\n") }

pulsemain.ml: Program to call the lexer and parser to analyse the testpulse.txt file.

let main () =
  let lexbuf = Lexing.from_channel (open_in "testpulse.txt") in
  PulseParse.pulsedefs PulseLex.token lexbuf;;

let _ = Printexc.print main();;
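Build it with the same recipe as the earlier examples:

ocamlyacc pulseParse.mly
ocamlc -c pulseParse.mli
ocamlc -c pulseParse.ml
ocamllex pulseLex.mll
ocamlc -c pulseLex.ml
ocamlc -c pulsemain.ml
ocamlc -o pulsemain pulseParse.cmo pulseLex.cmo pulsemain.cmo
./pulsemain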

testpulse.txt: Example pulse sequence file.

# Define pulses
# Pulses are of the form pulse_name (duration) {definition}
# commas in the definition specify pulses that happen concurrently
# Pulse sequences are specified by sequence_name {definition}, where
# each pulse in the sequence is specified by a pulse_name + semicolon

# DAC scans happen outside the pulse sequence program (controlled by
# the main experiment procedure), pulse duration scans are specified
# by a pair of durations in the pulse definition

# Doppler cooling pulse
doppler_cool (10 ms) { master_397, doppler_397, repump_866 }

# Optical pumping to initialise spin state
pump_spin (10 us) { master_397, sigma_397, repump_866 }

# pi pulse on red sideband
rsb_pi_pulse (20 us) { switch_to_rsb, r_perp, r_para }

# EIT cooling pulse
eit_cool (1 ms) { master_397, sigma_397, probe_397 }

# Carrier Rabi flop, scan time 0->40us
carrier_flop (0 us, 40 us) { switch_to_carrier, r_perp, r_para }

# Shelving pulse
shelve (1 ms) { readout_393, readout_850 }

# Fluorescence check that ion is still happy
count_a (5ms) {count_a, master_397, doppler_397, repump_866,
               deshelve_854 }

# Check for fluorescence after shelving
count_b (5ms) {count_b, master_397, doppler_397, repump_866 }

############################################################
#  Start of pulse sequences
############################################################

# Sideband cooling sequence
sideband_cool { pump_spin; rsb_pi_pulse; }

# Readout sequence
readout { shelve; count_b; count_a; }

# Define main sequence
main {
  doppler_cool;
  eit_cool;
  10 * sideband_cool; 
  pump_spin;
  carrier_flop;
  readout;
}