Handling Character Classes


Deliberating on how to handle reading character classes in from CCT's regular-expression-to-DFA tool.

As it is now, one would specify a class as an edge label when explicitly defining the graph_hash. For example:

accept_types = {...,81=>"string",...}
graph_hash = {
  ... 
  # Naive String Literal sans \-escaping.
  79 => { '"' => 80 },
  80 => { /[^"\n]/ => 80, '"' => 81 },
  81 => nil, # accept string
  ... 
}

Internally, this doesn't exactly map to 'everything except " or newline.' It represents any edge label that matches that class.

Consider a scanner recognizing only numbers and comments.

int:d+
float:d+.d*|d*.d+
comment:/\*[^*]\*\/

The DFA looks like this:

Numbers_and_comments

The only fixed edge labels in for that DFA are 'd', '.', '/', and '*'. The [^*] refers to an occurrence of a 'd', a '.', or a '/', but reading an 'a' would result in an error, rather than a transition.

Older Thoughts

Some keys use a regex to specify "edge labels" (or "any character") EXCEPT (some match). Example: [^\n\*\/] would be anything but newline, *, or /. [expand_regex_edges()] replaces those regex specifications with ALL of the other available edge labels that don't match the regex.

Specifying a scanner's construction rules using an NFA is less terse than using a set of regexes and accept notes. We'd prefer specifying our naive string like this:

string:"[^"]"

That could be a potential hurdle when implementing character class behavior down the line. I could write something into my Thompson-like Construction algorithm implementation to treat [^\]] as a concatenation of an edge with a regular expression. E.g., a[0-9]b would be a 4 state graph composed of 3 concatenations:

0 => {"a"     =>1},
1 => {/[0-9]/ => 2},
2 => {"b"     => 3}

Internally, it wouldn't feel especially elegant or true to the Thompson-McNaughton-Yamada spirit. Another option would be to do something like the following:

string:"Q"
%%
Q:[^"]

I'm slightly hesitant to do it this way. It's more work for the user and less intuitive, but it's less divergent from the TMY construction.