The first phase of the compiler, reading raw characters and grouping them into tokens (e.g. id, num, +, if). Regular expressions (regex) define what the tokens look like, and finite automata (NFAs, DFAs) are the machines that are able to recognize them.
Regular Expressions
Three fundamental operations in the order of precedence (tightest first):
- Repetition (
*): zero or more.a*matches , a, aa, aaa, โฆ - Concatenation:
abmatches only โabโ - Alternation (
|): choice betweena | b, which matches โaโ or โbโ
Precedence matters. a | b* means a | (b*), either a single a, or zero or more bโs. If we want zer oor more of (a or b), weโd need (a | b)*.
Some shorthands:
a+โa a*a?โ zero or one, orฮต | a[a-z]โ character class, ora | b | c | ... | z
NFAs vs. DFAs
NFA (nondeterministic finite automaton) can have multiple transitions from the same state on the same input, plus -transitions that move without consuming any input. A DFA (deterministic) has exactly one transition per state per input symbol, without -transitions.
Every regex can be converted into an NFA via Thompsonโs construction, every NFA can be converted into a DFA via subset construction, and DFAs can be minimized to a scanner.
Thompsonโs Construction
- For a single character
a, start state โ on a โ accept state - For a concatenation , connect the accept state of โs NFA to the start state of โs NFA with an -transition
- For an alternation , new start state with -transitions to both and โs start states, and both accept states get -transitions to a new shared accept state.
- For a repetition , new start state with -transitions to Rโs start. Rโs accept state gets an -transition back to Rโs start (the loop). New start state also has -transition directly to the new accept state (for the zero-times case).
Subset Construction
Each DFA represents a set of NFA states. We start with the -closure of the NFAโs start state (all states reachable with -transitions). For each input symbol, we compute where the set of NFA states can go, take the -closure of that, and that becomes a new DFA state. A DFA state is accepting if any NFA state in its set is accepting.