Re: HTML grammar??


Rodgers, Kevin
Re: [Grammatica-users] HTML grammar??

Per Cederberg writes:
> Well, I guess it would be possible to write an HTML
> grammar for Grammatica. But the question is more if
> it would really be a good fit. The thing with HTML
> is that *lots* of the real-world web pages are
> invalid (syntactically).
> So I think to write a good HTML-parser, one really
> needs to do it by hand. Adding special code
> everywhere to recover from common problems and
> issues.
> Also, HTML is a very unstrict syntax, allowing new
> unknown tags to be used, end tags to be omitted, etc,
> etc. So it is very hard to create a correct BNF
> grammar that covers all that still provides something
> more than a pure tokenizer.

HTML 4 and XHTML are formally specified languages: HTML 4 is an SGML
application and XHTML is an XML application.  So it should be feasible
to parse valid HTML/XHTML documents with Grammatica.

Handling the vast amount of ill-formed and invalid HTML published on the
web is a separate problem.  I would try to solve it by piping each
document first through Tidy (to generate valid XHTML) and then through
Grammatica (to process it).
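
A rough sketch of that two-stage pipeline, in Python for brevity. It
assumes the HTML Tidy command-line tool is installed as "tidy" on the
PATH. The function names are made up for illustration, and the second
stage uses Python's standard XML parser only as a stand-in for a
Grammatica-generated parser, which is not shown here.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET

# Namespace that XHTML documents declare on the root element.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def tidy_to_xhtml(html):
    """Stage 1: clean up arbitrary HTML with the Tidy command-line tool.

    Assumes an executable named 'tidy' is on the PATH.
    """
    if shutil.which("tidy") is None:
        raise RuntimeError("the 'tidy' executable was not found on PATH")
    result = subprocess.run(
        # -asxhtml: emit XHTML; -numeric: numeric rather than named
        # entities, so a plain XML parser can read the output.
        ["tidy", "-quiet", "-asxhtml", "-numeric", "--show-warnings", "no"],
        input=html,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout


def element_names(xhtml):
    """Stage 2 (stand-in): parse well-formed XHTML, list element names.

    A strict parser is enough here because stage 1 already repaired
    the markup. Ill-formed input raises xml.etree.ElementTree.ParseError.
    """
    root = ET.fromstring(xhtml)
    return [el.tag.replace(XHTML_NS, "") for el in root.iter()]


def process(html):
    """Pipe one document through both stages."""
    return element_names(tidy_to_xhtml(html))
```

The point of the split is that all the error recovery lives in Tidy, so
the grammar for the second stage only has to describe valid documents.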

