multiline in token definition


multiline in token definition

Steffen Gaede
I'm writing a grammar for a select-from-where string.
The main problem is that the regexp in a token definition knows
nothing about excluded words (like the command keywords):

e.g.:
        FUNCTION_U = <<(^((from|where|distinct|unique|or|and|not|group|order|by|having|as|asc|desc|on|like|between|in|count|min|max|avg|sum)[ ]*\())[a-z0-9]+[ ]*\(>>
doesn't work: it then only accepts a FUNCTION_U that starts with one of the
excluded words, such as "minxxxx" or "havingbbb", but not "murx".

OK, so I decided to build one combined regex that spells out all the literal
combinations. It works, but it is confusing as a single long line. Is there a
way to continue a token definition over multiple lines (like bash does with "\")?

FUNCTION_U = <<(
(a([a-mo-rtuw-z0-9_][a-z0-9_]*|n([a-ce-z0-9_][a-z0-9_]*|d[a-z0-9_]+)?|s([abd-z0-9_]+|c[a-z0-9_]+))?)|
(b([a-df-xz0-9_][a-z0-9_]*|e([a-su-z0-9_][a-z0-9_]*|t([a-vx-z0-9_][a-z0-9_]*|w([a-df-z0-9_]+|e([a-df-z0-9_][a-z0-9_]*|e([a-mo-z0-9_][a-z0-9_]*|n[a-z0-9_]+)?)?)?)?)?|y[a-z0-9_]+)?)|
(c([a-np-z0-9_][a-z0-9_]*|o([a-tv-z0-9_][a-z0-9_]*|u([a-mo-z0-9_][a-z0-9_]*|n([a-su-z0-9_][a-z0-9_]*|t[a-z0-9_]+)?)?)?)?)|
(d([a-df-hj-z0-9_][a-z0-9_]*|[ei]([a-rt-z0-9_][a-z0-9_]*|s([abd-su-z0-9_][a-z0-9_]*|c[a-z0-9_]+|t([a-hj-z0-9_][a-z0-9_]*|i([a-mo-z0-9_][a-z0-9_]*|n([abd-z0-9_]+|c([a-su-z0-9_][a-z0-9_]*|t[a-z0-9_]+)?)?)?)?)?)?)?)|
(f([a-qstv-z0-9_][a-z0-9_]*|r([a-np-z0-9_][a-z0-9_]*|o([a-ln-z0-9_][a-z0-9_]*|m[a-z0-9_]+)?)?|u([a-km-z0-9_][a-z0-9_]*|l([a-km-z0-9_][a-z0-9_]*|l[a-z0-9_]+)?)?)?)|
(g([a-qs-z0-9_][a-z0-9_]*|r([a-np-z0-9_][a-z0-9_]*|o([a-tv-z0-9_][a-z0-9_]*|u([a-oq-z0-9_][a-z0-9_]*|p[a-z0-9_]+)?)?)?)?)|
(h([b-z0-9_][a-z0-9_]*|a([a-uw-z0-9_][a-z0-9_]*|v([a-hj-z0-9_][a-z0-9_]*|i([a-mo-z0-9_][a-z0-9_]*|n([a-fh-z0-9_][a-z0-9_]*|g[a-z0-9_]+)?)?)?)?)?)|
(i([a-mo-z0-9_][a-z0-9_]*|n([a-mo-z0-9_][a-z0-9_]*|n([a-df-z0-9_][a-z0-9_]*|e([a-qs-z0-9_][a-z0-9_]*|r[a-z0-9_]+)?)?))?)|
(j([a-np-z0-9_][a-z0-9_]*|o([a-hj-z0-9_][a-z0-9_]*|i([a-mo-z0-9_][a-z0-9_]*|n[a-z0-9_]+)?)?)?)|
(l([a-df-hj-z0-9_][a-z0-9_]*|e([a-eg-z0-9_][a-z0-9_]*|f([a-su-z0-9_][a-z0-9_]*|t[a-z0-9_]+)?)?|i([a-jl-z0-9_][a-z0-9_]*|k([a-df-z0-9_][a-z0-9_]*|e[a-z0-9_]+)?)?)?)|
(m([b-hj-z0-9_][a-z0-9_]*|a([a-wyz0-9_][a-z0-9_]*|x[a-z0-9_]+)?|i([a-mo-z][a-z0-9_]*|n[a-z0-9_]+)?)?)|
(n([a-np-z0-9_][a-z0-9_]*|o([a-su-z0-9_][a-z0-9_]*|t[a-z0-9_]+)?)?)|
(o([a-mo-qstv-z0-9_][a-z0-9_]*|n[a-z0-9_]+|r([a-ce-z0-9_][a-z0-9_]*|d([a-df-z0-9_][a-z0-9_]*|e([a-qs-z0-9_][a-z0-9_]*|r[a-z0-9_]+)?)?)|u([a-su-z0-9_][a-z0-9_]*|t([a-df-z0-9_][a-z0-9_]*|e([a-qs-z0-9_][a-z0-9_]*|r[a-z0-9_]+)?)?)?)?)|
(r([a-hj-z0-9_][a-z0-9_]*|i([a-fh-z0-9_][a-z0-9_]*|g([a-gi-z0-9_][a-z0-9_]*|h([a-su-z0-9_][a-z0-9_]*|t[a-z0-9_]+)?)?)?)?)|
(s([a-df-tv-z0-9_][a-z0-9_]*|e([a-km-z0-9_][a-z0-9_]*|l([a-df-z0-9_][a-z0-9_]*|e([abd-z0-9_][a-z0-9_]*|c([a-su-z0-9_][a-z0-9_]*|t[a-z0-9_]+)?)?)?)?|u([a-ln-z0-9_][a-z0-9_]*|m[a-z0-9_]+)?)?)|
(u([a-mo-z0-9_][a-z0-9_]*|n([a-hj-z0-9_][a-z0-9_]*|i([a-pr-z0-9_][a-z0-9_]*|q([a-tv-z0-9_][a-z0-9_]*|u([a-df-z0-9_][a-z0-9_]*|e[a-z0-9_]+)?)?)?)?)?)|
(w([a-gi-z0-9_][a-z0-9_]*|h([a-df-z0-9_][a-z0-9_]*|e([a-qs-z0-9_][a-z0-9_]*|r([a-df-z0-9_][a-z0-9_]*|e[a-z0-9_]+)?)?)?)?)|
[ekpqtvxyz0-9_]+)[ ]*\(>>

Steffen Gaede.


Re: multiline in token definition

Oliver Gramberg-2
Hi Steffen,

> I'm writing a grammar for a select-from-where string.
> The main problem is that the regexp in a token definition knows
> nothing about excluded words (like the command keywords):
>
> e.g.:
> FUNCTION_U = <<(^((from|where|distinct|unique|or|and|not|group|order|by|having|as|asc|desc|on|like|between|in|count|min|max|avg|sum)[ ]*\())[a-z0-9]+[ ]*\(>>
> doesn't work: it then only accepts a FUNCTION_U that starts with one of the
> excluded words, such as "minxxxx" or "havingbbb", but not "murx".

(Note: The reason it fails is that "^" works like "NOT" only inside of "[...]".)

> OK, so I decided to build one combined regex that spells out all the
> literal combinations.

I believe that either 1.5 or 1.6 has been modified such that the order in which tokens are given represents priority. (To be exact, the first of all longest matches will be chosen.) I.e., if you define

        KW_from = "from"
        KW_where = "where"
        ...
        FUNCTION_U = <<[a-z0-9]+>>

then each keyword would be matched by both its specific definition as well as the general one, but since the specific comes first, that will be chosen, and you are fine.

(Also, please look at the documentation and examples for how whitespace should be handled, e.g., "WHITESPACE = <<[ \t\n\r]+>> %ignore%".)
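
Concretely, the token and production sections could then look roughly like this. This is only a sketch: the keyword selection, the general FUNCTION_U regex and the Function/FunctionCall productions are made up for illustration, not taken from your grammar.

        %tokens%

        WHITESPACE  = <<[ \t\n\r]+>> %ignore%
        KW_SELECT   = "select"
        KW_FROM     = "from"
        KW_WHERE    = "where"
        KW_MIN      = "min"
        KW_MAX      = "max"
        LEFT_PAREN  = "("
        RIGHT_PAREN = ")"
        DOT         = "."
        FUNCTION_U  = <<[a-z0-9_]+>>

        %productions%

        Function     = KW_MIN | KW_MAX ;
        FunctionCall = Function LEFT_PAREN FUNCTION_U [DOT FUNCTION_U] RIGHT_PAREN ;

Since the keyword strings are defined before the FUNCTION_U regex, "min" is reported as KW_MIN, while an unknown name like "murx" still falls through to FUNCTION_U.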

This is much shorter and clearer and easier to understand and maintain than trying to press everything into a big regex. I am afraid that there will be other locations in the grammar where you would have to repeat this, and then it would get really ugly...

Please check whether the above works. If you have 1.5 and I am wrong and the new priority determination is only included in 1.6, then the old rule should apply, which says that literals ("...") are preferred over regexes (<<...>>), so it should work in your case, too.

Regards
Oliver
--
This signature is made of 100% recycled pixels.




Re: multiline in token definition

Steffen Gaede

> Hi Steffen,
>
>> I'm writing a grammar for a select-from-where string.
>> The main problem is that the regexp in a token definition knows
>> nothing about excluded words (like the command keywords):
>>
>> e.g.:
>> FUNCTION_U = <<(^((from|where|distinct|unique|or|and|not|group|order|by|having|as|asc|desc|on|like|between|in|count|min|max|avg|sum)[ ]*\())[a-z0-9]+[ ]*\(>>
>> doesn't work: it then only accepts a FUNCTION_U that starts with one of the
>> excluded words, such as "minxxxx" or "havingbbb", but not "murx".
>
> (Note: The reason it fails is that "^" works like "NOT" only inside of "[...]".)

OK, that explains something.

>
>> OK, so I decided to build one combined regex that spells out all the
>> literal combinations.
>
> I believe that either 1.5 or 1.6 has been modified such that the order in which tokens are given represents priority. (To be exact, the first of all longest matches will be chosen.) I.e., if you define
>
>         KW_from = "from"
>         KW_where = "where"
>         ...
>         FUNCTION_U = <<[a-z0-9]+>>
>
> then each keyword would be matched by both its specific definition as well as the general one, but since the specific comes first, that will be chosen, and you are fine.
>
> (Also, please look at the documentation and examples for how whitespace should be handled, e.g., "WHITESPACE = <<[ \t\n\r]+>> %ignore%".)

The whitespace in the function token is necessary, because if I have an
expression defined as:
        EXPRESSION = <<[a-z0-9_]+>>
a function token
        MIN   = "min"
and the symbols
        LEFT_PAREN = "("
        RIGHT_PAREN = ")"
        DOT = "."

and my SFW string contains "select min(Person.Salary) ...",
the parser looks for "min(" instead of "min" followed by LEFT_PAREN, although I have defined:
        FUNCTION1 = MIN|MAX;
        S1 = FUNCTION1 "(" Expression ["." Expression] ")";

(The same happens with "min (Person.Salary) ...": it looks for "min (".)
So it only works with:
        MIN = <<min[ ]*\(>>
and
        S1 = FUNCTION1 Expression ["." Expression] ")";
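
With that, an input like "min (person.salary)" (written in lower case so it matches the EXPRESSION regex) comes out of the tokenizer roughly as:

        MIN          "min ("
        EXPRESSION   "person"
        DOT          "."
        EXPRESSION   "salary"
        RIGHT_PAREN  ")"

(assuming any remaining blanks are covered by an ignored whitespace token).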


> Please check whether the above works. If you have 1.5 and I am wrong and the new priority determination is only included in 1.6, then the old rule should apply, which says that literals ("...") are preferred over regexes (<<...>>), so it should work in your case, too.
That was exactly my first problem, before I tried the full-word exclusion.

Steffen.


Re: multiline in token definition

Per Cederberg
Understand that an LL parser like Grammatica is context-free, i.e. the tokenization is separated from the grammar rules. You can run Grammatica in tokenize-only mode to show just the stream of tokens read.
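
For example, roughly like this (the jar and file names are just examples, and the exact option spelling may differ between versions, so check the command-line help):

        java -jar grammatica-1.5.jar sfw.grammar --tokenize query.txt

This prints the tokens read from query.txt according to sfw.grammar, which makes it easy to see where a keyword is being swallowed by a regex token.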

The order of the token definitions is important, and has always been such:

1. The longest matching token
2. The first defined token (if both have same length)

Hence, do like this (a sketch applying both points to your min example follows below):

- Keep tokens short (without whitespace).
- Place string tokens above regex tokens.
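
Applied to the min example from your previous mail, that gives roughly the following. This is only a sketch reusing your token names, with the ignored whitespace token Oliver already suggested:

        WHITESPACE  = <<[ \t\n\r]+>> %ignore%
        MIN         = "min"
        MAX         = "max"
        LEFT_PAREN  = "("
        RIGHT_PAREN = ")"
        DOT         = "."
        EXPRESSION  = <<[a-z0-9_]+>>

        FUNCTION1 = MIN | MAX ;
        S1 = FUNCTION1 LEFT_PAREN EXPRESSION [DOT EXPRESSION] RIGHT_PAREN ;

With the string tokens placed first, "min" is reported as MIN rather than EXPRESSION (same length, earlier definition wins), and "min (person.salary)" comes out as MIN LEFT_PAREN EXPRESSION DOT EXPRESSION RIGHT_PAREN with the whitespace dropped.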

Cheers,

/Per


Re: multiline in token definition

Steffen Gaede

> Understand that an LL parser like Grammatica is context-free, i.e. the
> tokenization is separated from the grammar rules. You can run Grammatica
> in tokenize-only mode to show just the stream of tokens read.
>
> The order of the token definitions is important, and has always been such:
>
> 1. The longest matching token
> 2. The first defined token (if both have same length)
>
> Hence, do like this:
>
> - Keep tokens short (without whitespace).

ok, that's the hint:
> - Place string tokens above regex tokens.

If I place the regex first, I get the errors. After changing the order, it
works now.

I had built my grammar following your arithmetic-language example, so this
rule was not clearly visible from it.


Best regards,
Steffen.
