Regex token order

Drew Vogel
If I have two regex tokens A and B and A is a subset of B, how do I disambiguate them such that A will always be tried before B? The order they appear in the %tokens% section does not seem to affect this and I did not see an example of this in the documentation.

The parser I am trying to construct is for a template-like language with commands embedded in text. Thus I have a "text" token regex <<.+>> to match everything not otherwise matched as a command, but I only want to match it after all other token regex patterns have been tried.

Drew Vogel

_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users

Re: Regex token order

Oliver Bock
I had to do a similar thing, but putting the more specific tokens first in %tokens% worked for me.  From my grammar:

ON = "ON"
VARNAME = <<[A-Z@#]([A-Z0-9._$#@]*[A-Z0-9_$#@])?>>

The text "ON" could match both these tokens, but for me ON matches, not VARNAME.  I suggest you cut your example down into a very simple grammar (like the above).


  Oliver
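
A minimal sketch of why ON and VARNAME compete on the input "ON". It assumes Python's `re` module as a stand-in for Grammatica's own regex engine, and `match_lengths` is a hypothetical helper, not part of Grammatica. The patterns follow Oliver's grammar as it appears in the thread.

```python
import re

# Token definitions in grammar order (ON first, then VARNAME).
# Assumption: Python's regex syntax is close enough to Grammatica's for these patterns.
TOKENS = [
    ("ON", re.compile(r"ON")),
    ("VARNAME", re.compile(r"[A-Z@#]([A-Z0-9._$#@]*[A-Z0-9_$#@])?")),
]

def match_lengths(text):
    """Return (name, matched length) for every token that matches at position 0."""
    results = []
    for name, pattern in TOKENS:
        m = pattern.match(text)
        if m:
            results.append((name, m.end()))
    return results

print(match_lengths("ON"))   # [('ON', 2), ('VARNAME', 2)]
print(match_lengths("ONE"))  # [('ON', 2), ('VARNAME', 3)]
```

On "ON" both tokens consume the same two characters, so something other than match length has to pick the winner. On "ONE" the two matches differ in length, so the situation is no longer a plain tie.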

On 28/02/2011 4:37 PM, Drew Vogel wrote:
> If I have two regex tokens A and B and A is a subset of B, how do I disambiguate them such that A will always be tried before B?

Re: Regex token order

Drew Vogel
I would expect the token definition order to matter, based on my experience with similar tools like flex. I must be doing something wrong.

This is the test file I am trying to parse:
--------------------------------------------------
>email<
  Enter your email address:


This is my test grammar:
--------------------------------------------------
%header%
GRAMMARTYPE = "LL"

%tokens%
RCARET = ">"
LCARET = "<"
ITEM_NAME = <<[a-zA-Z][a-zA-Z0-9]+>>
TEXT = <<.+>>

%productions%
Item = ItemDecl TEXT;
ItemDecl = RCARET ITEM_NAME LCARET ;


This is the error I get from grammatica:
--------------------------------------------------
java -jar grammatica-1.5.jar Q.grammar --parse test.q
Parse tree from test.q:
Error: in test.q: line 1:
    unexpected token ">email<" <TEXT>, expected ">"


If I remove the TEXT token definition and its reference in the Item production, the remaining grammar properly matches the first line and I get a parse error at the newline character (as expected). Why does introducing my TEXT token override those previously matching tokens, even though it is listed last in the %tokens% section?



On Sun, Feb 27, 2011 at 11:49 PM, Oliver Bock <[hidden email]> wrote:
> I had to do a similar thing, but putting the more specific tokens first in %tokens% worked for me.  From my grammar:
>
> ON = "ON"
> VARNAME = <<[A-Z@#]([A-Z0-9._$#@]*[A-Z0-9_$#@])?>>

Re: Regex token order

Per Cederberg
It works like this (as flex):

1. Longest matching token first.

2. On equal length, use the token defined earlier in the grammar.

So, in your case, don't make your TEXT token a repetitive regexp:
<<.+>> matches the whole line ">email<", which is longer than ">",
so TEXT wins by rule 1. Match a single char only.

Cheers,

/Per
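
The two rules can be sketched as a small tokenizer. This is an illustration only, not Grammatica's implementation: it assumes Python's `re` in place of Grammatica's regex engine, and `tokenize`, `greedy` and `single` are hypothetical names. Token names and patterns are taken from the test grammar in the thread.

```python
import re

def tokenize(text, tokens):
    """Split text into (name, lexeme) pairs: longest match wins, first definition wins ties."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in tokens:  # iterate in definition order
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: a later token of equal length never replaces an earlier one.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

COMMON = [("RCARET", r">"), ("LCARET", r"<"), ("ITEM_NAME", r"[a-zA-Z][a-zA-Z0-9]+")]
greedy = COMMON + [("TEXT", r".+")]  # TEXT as defined in the thread's grammar
single = COMMON + [("TEXT", r".")]   # TEXT restricted to a single character

print(tokenize(">email<", greedy))
# [('TEXT', '>email<')]
print(tokenize(">email<", single))
# [('RCARET', '>'), ('ITEM_NAME', 'email'), ('LCARET', '<')]
```

With the repetitive pattern, TEXT consumes all of ">email<" and the more specific tokens never get a chance, which matches the reported error. With the single-character pattern, TEXT only wins where nothing else matches. Note that in this sketch a word inside running text, such as "Enter", still comes out as ITEM_NAME, so the productions would have to accept both token types wherever free text is allowed.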

On Monday, February 28, 2011, Drew Vogel <[hidden email]> wrote:
> I would expect the token definition order to matter, based on my experience with similar tools like flex. I must be doing something wrong.
