Trying to disambiguate a grammar

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Trying to disambiguate a grammar

Craig Ugoretz-2
Hello,
 
     I am new to grammatica (and parsers in general) and I have a grammar that I am trying to disambiguate.  Can anyone lend any advice?  Hopefully, this should get me on the right track with the rest of my work...  I apologize for the notation - it is EBNF, but nonstandard (and non-grammatica).
 
<int> ::= ['~'] <nzdigit> { <digit> }
         |  ['~'] O { <octdigit> }+
         |  ['~'] ('0x' | '0X') { <hexdigit> }+
         |  ['~'] ('0b' | '0B') { <bindigit> }+
<float> ::= ['~'] { <digit> }+ '.' { <digit> } { ('e' | 'E') ['~'] { <digit> }+
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<nzdigit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<octdigit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
<hexdigit> ::= <digit> | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
<bindigit> ::= 0 | 1
 
Can proper tokenization alone with regular expressions lend itself to disambiguating the grammar?  This was a tactic that I tried, but was not familar enough with regular expressions to make progress.

_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users
Reply | Threaded
Open this post in threaded view
|

Re: Trying to disambiguate a grammar

Per Cederberg
Hi Craig,

Most of the grammar below lends itself well to
tokenization with regular expressions. Consider
the following tokens:

BIN_NUMBER           = <<0(B|b)[0-1]+>>
OCT_NUMBER           = <<0[0-7]*>>
DEC_NUMBER           = <<[1-9][0-9]*>>
HEX_NUMBER           = <<0(x|X)[0-9A-Fa-f]+>>
MINUS                = "~"
DOT                  = "."
E                    = <<(e|E)>>

With the help of these you can rewrite the rest of
the grammar:

Int         = ["~"] NumberToken ;
NumberToken = BIN_NUMBER
             | OCT_NUMBER
             | DEC_NUMBER
             | HEX_NUMBER ;
Float       = ["~"] DEC_NUMBER "." [DEC_NUMBER] [Exponent] ;
Exponent    = E ["~"] DEC_NUMBER

If you are working to expand this into a full
programming language grammar, you'll run into
issues with the E token. As the tokenizer is
not context sensitive, it will always return
the longest matching token.

Also, I many grammars the definition of float
and integer decimal number are both built into
the same DEC_NUMBER token. For a full language
that is probably the better solution, leaving
some validation controls to the analyzer stage.
Here I opted for something more similar to your
original grammar.

Cheers,

/Per

Craig Ugoretz wrote:

> Hello,
>  
>      I am new to grammatica (and parsers in general) and I have a
> grammar that I am trying to disambiguate.  Can anyone lend any advice?  
> Hopefully, this should get me on the right track with the rest of my
> work...  I apologize for the notation - it is EBNF, but nonstandard (and
> non-grammatica).
>  
> <int> ::= ['~'] <nzdigit> { <digit> }
>          |  ['~'] O { <octdigit> }+
>          |  ['~'] ('0x' | '0X') { <hexdigit> }+
>          |  ['~'] ('0b' | '0B') { <bindigit> }+
> <float> ::= ['~'] { <digit> }+ '.' { <digit> } { ('e' | 'E') ['~'] {
> <digit> }+
> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
> <nzdigit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
> <octdigit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
> <hexdigit> ::= <digit> | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'A' | 'B' |
> 'C' | 'D' | 'E' | 'F'
> <bindigit> ::= 0 | 1
>  
> Can proper tokenization alone with regular expressions lend itself to
> disambiguating the grammar?  This was a tactic that I tried, but was not
> familar enough with regular expressions to make progress.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Grammatica-users mailing list
> [hidden email]
> http://lists.nongnu.org/mailman/listinfo/grammatica-users


_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users
Reply | Threaded
Open this post in threaded view
|

Re: Trying to disambiguate a grammar

Craig Ugoretz-2
Hi Per,
 
     Thank you for your quick and relevent response.  However, in all truthfulness, I tried a scheme about the same as you proposed, but ran into trouble.  I apologize for not being more explicit about the nature of my trouble.  For example, when I input 0b001101567, I get the following parse:

Parse tree from AbstractMachine3.txt:

Program(2001)

Int(2002)

NumberToken(2003)

BIN_NUMBER(1005): "0b001101", line: 1, col: 1

Int(2002)

NumberToken(2003)

DEC_NUMBER(1007): "567", line: 1, col: 9

     The problem is that the parse is breaking the inputted number into two parts, instead of leaving it as one and reporting an error, i.e. a binary number should not contain digits other than 0 or 1.  I speculate this problem is due to the way the regular expressions for the tokens are constructed, hence my question about disambiguating the grammar.  Again, I was not explicit in my orginal query.  Additionally, in OCT_NUMBER, the first "0" (zero) should be an "O" (oh).
 
                                                               Thanks,
                                                               Craig
 
On 2/18/07, Per Cederberg <[hidden email]> wrote:
Hi Craig,

Most of the grammar below lends itself well to
tokenization with regular expressions. Consider
the following tokens:

BIN_NUMBER           = <<0(B|b)[0-1]+>>
OCT_NUMBER           = <<0[0-7]*>>
DEC_NUMBER           = <<[1-9][0-9]*>>
HEX_NUMBER           = <<0(x|X)[0-9A-Fa-f]+>>
MINUS                = "~"
DOT                  = "."
E                    = <<(e|E)>>

With the help of these you can rewrite the rest of
the grammar:

Int         = ["~"] NumberToken ;
NumberToken = BIN_NUMBER
            | OCT_NUMBER
            | DEC_NUMBER
            | HEX_NUMBER ;
Float       = ["~"] DEC_NUMBER "." [DEC_NUMBER] [Exponent] ;
Exponent    = E ["~"] DEC_NUMBER

If you are working to expand this into a full
programming language grammar, you'll run into
issues with the E token. As the tokenizer is
not context sensitive, it will always return
the longest matching token.

Also, I many grammars the definition of float
and integer decimal number are both built into
the same DEC_NUMBER token. For a full language
that is probably the better solution, leaving
some validation controls to the analyzer stage.
Here I opted for something more similar to your
original grammar.

Cheers,

/Per

Craig Ugoretz wrote:

> Hello,
>
>      I am new to grammatica (and parsers in general) and I have a
> grammar that I am trying to disambiguate.  Can anyone lend any advice?
> Hopefully, this should get me on the right track with the rest of my
> work...  I apologize for the notation - it is EBNF, but nonstandard (and
> non-grammatica).
>
> <int> ::= ['~'] <nzdigit> { <digit> }
>          |  ['~'] O { <octdigit> }+
>          |  ['~'] ('0x' | '0X') { <hexdigit> }+
>          |  ['~'] ('0b' | '0B') { <bindigit> }+
> <float> ::= ['~'] { <digit> }+ '.' { <digit> } { ('e' | 'E') ['~'] {
> <digit> }+
> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
> <nzdigit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
> <octdigit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
> <hexdigit> ::= <digit> | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'A' | 'B' |
> 'C' | 'D' | 'E' | 'F'
> <bindigit> ::= 0 | 1
>
> Can proper tokenization alone with regular expressions lend itself to
> disambiguating the grammar?  This was a tactic that I tried, but was not
> familar enough with regular expressions to make progress.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Grammatica-users mailing list
> [hidden email]
> http://lists.nongnu.org/mailman/listinfo/grammatica-users


_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users


_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users
Reply | Threaded
Open this post in threaded view
|

Re: Trying to disambiguate a grammar

Craig Ugoretz-2
Hi again Per,
 
     I see my problem - I had written NUMBER+ instead of NUMBER thinking that I could test multiple numbers from the same file.  In an actual program, there would be some token between the numbers, so what you presented to me was correct. 
 
                                                                     Craig

 
On 2/18/07, Craig Ugoretz <[hidden email]> wrote:
Hi Per,
 
     Thank you for your quick and relevent response.  However, in all truthfulness, I tried a scheme about the same as you proposed, but ran into trouble.  I apologize for not being more explicit about the nature of my trouble.  For example, when I input 0b001101567, I get the following parse:

Parse tree from AbstractMachine3.txt:

Program(2001)

Int(2002)

NumberToken(2003)

BIN_NUMBER(1005): "0b001101", line: 1, col: 1

Int(2002)

NumberToken(2003)

DEC_NUMBER(1007): "567", line: 1, col: 9

     The problem is that the parse is breaking the inputted number into two parts, instead of leaving it as one and reporting an error, i.e. a binary number should not contain digits other than 0 or 1.  I speculate this problem is due to the way the regular expressions for the tokens are constructed, hence my question about disambiguating the grammar.  Again, I was not explicit in my orginal query.  Additionally, in OCT_NUMBER, the first "0" (zero) should be an "O" (oh).
 
                                                               Thanks,
                                                               Craig
 
On 2/18/07, Per Cederberg <[hidden email]> wrote:
Hi Craig,

Most of the grammar below lends itself well to
tokenization with regular expressions. Consider
the following tokens:

BIN_NUMBER           = <<0(B|b)[0-1]+>>
OCT_NUMBER           = <<0[0-7]*>>
DEC_NUMBER           = <<[1-9][0-9]*>>
HEX_NUMBER           = <<0(x|X)[0-9A-Fa-f]+>>
MINUS                = "~"
DOT                  = "."
E                    = <<(e|E)>>

With the help of these you can rewrite the rest of
the grammar:

Int         = ["~"] NumberToken ;
NumberToken = BIN_NUMBER
            | OCT_NUMBER
            | DEC_NUMBER
            | HEX_NUMBER ;
Float       = ["~"] DEC_NUMBER "." [DEC_NUMBER] [Exponent] ;
Exponent    = E ["~"] DEC_NUMBER

If you are working to expand this into a full
programming language grammar, you'll run into
issues with the E token. As the tokenizer is
not context sensitive, it will always return
the longest matching token.

Also, I many grammars the definition of float
and integer decimal number are both built into
the same DEC_NUMBER token. For a full language
that is probably the better solution, leaving
some validation controls to the analyzer stage.
Here I opted for something more similar to your
original grammar.

Cheers,

/Per

Craig Ugoretz wrote:

> Hello,
>
>      I am new to grammatica (and parsers in general) and I have a
> grammar that I am trying to disambiguate.  Can anyone lend any advice?
> Hopefully, this should get me on the right track with the rest of my
> work...  I apologize for the notation - it is EBNF, but nonstandard (and
> non-grammatica).
>
> <int> ::= ['~'] <nzdigit> { <digit> }
>          |  ['~'] O { <octdigit> }+
>          |  ['~'] ('0x' | '0X') { <hexdigit> }+
>          |  ['~'] ('0b' | '0B') { <bindigit> }+
> <float> ::= ['~'] { <digit> }+ '.' { <digit> } { ('e' | 'E') ['~'] {
> <digit> }+
> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
> <nzdigit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
> <octdigit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
> <hexdigit> ::= <digit> | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'A' | 'B' |
> 'C' | 'D' | 'E' | 'F'
> <bindigit> ::= 0 | 1
>
> Can proper tokenization alone with regular expressions lend itself to
> disambiguating the grammar?  This was a tactic that I tried, but was not
> familar enough with regular expressions to make progress.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Grammatica-users mailing list
> [hidden email]
> <a onclick="return top.js.OpenExtLink(window,event,this)" href="http://lists.nongnu.org/mailman/listinfo/grammatica-users" target="_blank"> http://lists.nongnu.org/mailman/listinfo/grammatica-users


_______________________________________________
Grammatica-users mailing list
[hidden email]
<a onclick="return top.js.OpenExtLink(window,event,this)" href="http://lists.nongnu.org/mailman/listinfo/grammatica-users" target="_blank">http://lists.nongnu.org/mailman/listinfo/grammatica-users



_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users