My adventure with supporting Unicode identifiers


William Lahti
For people who don't care about multilingual support in their grammars, a token type like this might be sufficient:

IDENTIFIER                   = <<[A-Za-z][A-Za-z0-9_]*>>

I wanted to expand this to support more than just the basic Latin alphabet, hoping it would be as easy as reformulating this expression into character classes, now that Grammatica 1.5 apparently supports Unicode regular expressions. First I tried something like:

IDENTIFIER                   = <<[[:alpha:]][[:alnum:]]*>>

However it appears that Grammatica entirely ignores this type of structure, instead treating it like a character set composed of :s and the letters a, l, p, h, a, etc. So I found out about the \p{Class} formulation for property classes...
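For comparison, java.util.regex treats this syntax the same way: it has no POSIX bracket expressions, so [[:alpha:]] is read as a nested character class holding the literal characters ':', 'a', 'l', 'p' and 'h'. A minimal sketch of that behavior follows; the class name PosixBracketDemo is made up for illustration, and whether Grammatica's own regex parser takes exactly this path is an assumption.

```java
import java.util.regex.Pattern;

public class PosixBracketDemo {
    // True if the whole input matches the POSIX-looking pattern.
    static boolean matches(String input) {
        return Pattern.matches("[[:alpha:]]", input);
    }

    public static void main(String[] args) {
        // 'p' and ':' are members of the literal set, 'z' is not,
        // even though 'z' is obviously an alphabetic character.
        System.out.println("p -> " + matches("p"));
        System.out.println(": -> " + matches(":"));
        System.out.println("z -> " + matches("z"));
    }
}
```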

IDENTIFIER                   = <<[\p{L&}][\p{L&}]*>>

The \p formulations like \p{L&} weren't working, so I consulted Java's own set of property classes, which lists Alpha and Alnum, and it turned into:
IDENTIFIER                   = <<[\p{Alpha}][\p{Alnum}]*>>
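One caveat worth checking on the Java side: in java.util.regex the POSIX-style classes such as \p{Alpha} cover only US-ASCII by default, so they would not add multilingual support on their own. The sketch below, with the made-up class name JavaPropertyDemo, shows this along with the rejected \p{L&} form. It assumes a current Java version, since the UNICODE_CHARACTER_CLASS flag it uses was only added in Java 7.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class JavaPropertyDemo {
    // True if java.util.regex accepts the expression at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // \p{Alpha} with default flags: US-ASCII letters only.
    static boolean isAlphaAscii(String s) {
        return Pattern.compile("\\p{Alpha}").matcher(s).matches();
    }

    // \p{Alpha} with UNICODE_CHARACTER_CLASS: the Unicode Alphabetic property.
    static boolean isAlphaUnicode(String s) {
        return Pattern.compile("\\p{Alpha}", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("\\p{L&} compiles:    " + compiles("\\p{L&}"));
        System.out.println("\\p{Alpha} compiles: " + compiles("\\p{Alpha}"));
        System.out.println("ASCII mode:   " + isAlphaAscii(eAcute));
        System.out.println("Unicode mode: " + isAlphaUnicode(eAcute));
    }
}
```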

This made it past the grammar build, targeting .NET for the tokenizer code. But when the compiler runs, from all appearances I gather that it is using .NET's regular expression library, which does not accept {Alpha} and {Alnum} but expects Unicode general category names like {Ll} instead.

As I wrote this email I came up with a solution (though it's not as pretty as the formulations above):

IDENTIFIER                   = <<[\p{Ll}\p{Lu}\p{Lt}][\p{Ll}\p{Lu}\p{Lt}\p{Nd}]*>>

This makes it through grammar compilation and parsing, matching the correct input. It matches any cased letter (upper, lower, or title case) as the first character, and any such letter or decimal digit for the rest.
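The same expression can be checked against java.util.regex, which accepts these general category names too. The sketch below uses the made-up class name IdentifierDemo and sample inputs of my own choosing. It also shows two limits of the pattern as written: the underscore from the ASCII-only version is no longer accepted, and letters from caseless scripts (category Lo) are not matched. Widening the classes to \p{L}, which both Java and .NET understand, might cover the latter, though I have not verified how Grammatica handles it.

```java
import java.util.regex.Pattern;

public class IdentifierDemo {
    // The category-based identifier pattern from the grammar above.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\p{Ll}\\p{Lu}\\p{Lt}][\\p{Ll}\\p{Lu}\\p{Lt}\\p{Nd}]*");

    // True if the whole string is a valid identifier under that pattern.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "caf\u00e91",   // Latin letters including e-acute, then a digit
            "\u03a9mega",   // starts with GREEK CAPITAL LETTER OMEGA
            "1abc",         // starts with a digit
            "my_var",       // contains an underscore
            "\u65e5\u672c"  // CJK ideographs, general category Lo
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```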

The inability to use the Java-style classes looks like a bug/oversight in the C# port of Grammatica, unless I'm missing a more elegant way to pull this off? More than likely a little fudge could be thrown in to translate the {Alpha} class into Ll + Lu + Lt and {Number} into Nd.
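To make the idea concrete, here is a rough sketch of such a translation step. It is written in Java only for illustration, it is not Grammatica code, and the names ClassTranslator and toCategoryStyle are hypothetical. It handles just \p{Alpha} and \p{Alnum}, and it assumes they appear inside a bracketed character class, since the replacement is a sequence of categories rather than a class of its own.

```java
public class ClassTranslator {
    // Category-style replacements for the two Java-style names.
    private static final String LETTERS = "\\p{Ll}\\p{Lu}\\p{Lt}";
    private static final String LETTERS_AND_DIGITS = LETTERS + "\\p{Nd}";

    // Literal text substitution; String.replace does not interpret
    // its arguments as regular expressions.
    static String toCategoryStyle(String regex) {
        return regex
            .replace("\\p{Alnum}", LETTERS_AND_DIGITS)
            .replace("\\p{Alpha}", LETTERS);
    }

    public static void main(String[] args) {
        System.out.println(toCategoryStyle("[\\p{Alpha}][\\p{Alnum}]*"));
    }
}
```

A real fix would need to cope with these names outside brackets and with negated forms such as \P{Alpha}, which this sketch ignores.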

But otherwise congrats on the new 1.5 release! I've been using 1.4 in my compiler project for quite a while, so it was nice to have a bit of freshness and new features :-D

-- 
rezonant

long name: William Lahti
handle :: rezonant
freenode :: xfury
blog :: http://xfurious.blogspot.com/
site :: http://komodocorp.com/~wilahti

_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users

Re: My adventure with supporting Unicode identifiers

Per Cederberg
Great writeup of the problem! Yes, this is indeed an oversight on my
part. I just threw in native regex support without realizing that the
regex syntax for .NET must also work for Java... Ouch.

Now, Grammatica has two internal built-in regex engines as well. Both
provide tokenizing speeds that are faster (or much faster for the NFA
version) than the native regex support in .NET and Java. So one option
would be to improve these engines to understand a bit more about
character classes.

Apart from that, I guess Grammatica should allow a pass-through mode
so that regex analysis for .NET is pushed to runtime. That way the
original construct should have been possible. Or, as you suggest, we
could add some code into the SystemRE class (in Tokenizer.cs) to make
simple translations from Java-style to .NET-style... It's open source,
so patches are very welcome.

Thanks again for a great analysis!

Cheers,

/Per

2009/3/20 William Lahti <[hidden email]>:

> [...]

