The first phase of the compiler, reading raw characters and grouping them into tokens (e.g. id, num, +, if). Regular expressions (regex) define what the tokens look like, and finite automata (NFAs, DFAs) are the machines that are able to recognize them.

Regular Expressions

Three fundamental operations in the order of precedence (tightest first):

  • Repetition (*): zero or more. a* matches , a, aa, aaa, โ€ฆ
  • Concatenation: ab matches only โ€œabโ€
  • Alternation (|): choice between a | b, which matches โ€œaโ€ or โ€œbโ€

Precedence matters. a | b* means a | (b*), either a single a, or zero or more bโ€™s. If we want zer oor more of (a or b), weโ€™d need (a | b)*.

Some shorthands:

  • a+ โ†’ a a*
  • a? โ†’ zero or one, or ฮต | a
  • [a-z] โ†’ character class, or a | b | c | ... | z

NFAs vs. DFAs

NFA (nondeterministic finite automaton) can have multiple transitions from the same state on the same input, plus -transitions that move without consuming any input. A DFA (deterministic) has exactly one transition per state per input symbol, without -transitions.

Every regex can be converted into an NFA via Thompsonโ€™s construction, every NFA can be converted into a DFA via subset construction, and DFAs can be minimized to a scanner.

Thompsonโ€™s Construction

  • For a single character a, start state โ†’ on a โ†’ accept state
  • For a concatenation , connect the accept state of โ€˜s NFA to the start state of โ€˜s NFA with an -transition
  • For an alternation , new start state with -transitions to both and โ€˜s start states, and both accept states get -transitions to a new shared accept state.
  • For a repetition , new start state with -transitions to Rโ€™s start. Rโ€™s accept state gets an -transition back to Rโ€™s start (the loop). New start state also has -transition directly to the new accept state (for the zero-times case).

Subset Construction

Each DFA represents a set of NFA states. We start with the -closure of the NFAโ€™s start state (all states reachable with -transitions). For each input symbol, we compute where the set of NFA states can go, take the -closure of that, and that becomes a new DFA state. A DFA state is accepting if any NFA state in its set is accepting.