Handling Character Classes
Deliberating on how to handle reading character classes in from CCT's regular-expression-to-DFA tool.
As it is now, one would specify a class as an edge label when explicitly defining the graph_hash. For example:
accept_types = {...,81=>"string",...} graph_hash = { ... # Naive String Literal sans \-escaping. 79 => { '"' => 80 }, 80 => { /[^"\n]/ => 80, '"' => 81 }, 81 => nil, # accept string ... }
Internally, this doesn't exactly map to 'everything except " or newline.' It represents any edge label that matches that class.
Consider a scanner recognizing only numbers and comments.
int:d+ float:d+.d*|d*.d+ comment:/\*[^*]\*\/
The DFA looks like this:
The only fixed edge labels in for that DFA are 'd', '.', '/', and '*'. The [^*] refers to an occurrence of a 'd', a '.', or a '/', but reading an 'a' would result in an error, rather than a transition.
Older Thoughts
Some keys use a regex to specify "edge labels" (or "any character") EXCEPT (some match). Example: [^\n\*\/] would be anything but newline, *, or /. [expand_regex_edges()] replaces those regex specifications with ALL of the other available edge labels that don't match the regex.
Specifying a scanner's construction rules using an NFA is less terse than using a set of regexes and accept notes. We'd prefer specifying our naive string like this:
string:"[^"]"
That could be a potential hurdle when implementing character class behavior down the line. I could write something into my Thompson-like Construction algorithm implementation to treat [^\]] as a concatenation of an edge with a regular expression. E.g., a[0-9]b would be a 4 state graph composed of 3 concatenations:
0 => {"a" =>1}, 1 => {/[0-9]/ => 2}, 2 => {"b" => 3}
Internally, it wouldn't feel especially elegant or true to the Thompson-McNaughton-Yamada spirit. Another option would be to do something like the following:
string:"Q" %% Q:[^"]
I'm slightly hesitant to do it this way. It's more work for the user and less intuitive, but it's less divergent from the TMY construction.