Per Cederberg writes:
> Well, I guess it would be possible to write an HTML
> grammar for Grammatica. But the question is more if
> it would really be a good fit. The thing with HTML
> is that *lots* of the real-world web pages are
> invalid (syntactically).
> So I think to write a good HTML-parser, one really
> needs to do it by hand. Adding special code
> everywhere to recover from common problems and
> Also, HTML is a very unstrict syntax, allowing new
> unknown tags to be used, end tags to be omitted, etc,
> etc. So it is very hard to create a correct BNF
> grammar that covers all that still provides something
> more than a pure tokenizer.

HTML 4 and XHTML are very formally specified languages: SGML and XML
applications, respectively. So it should be feasible to parse valid
HTML/XHTML documents with Grammatica.
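
As a rough illustration of what the formal specification buys you, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module, not Grammatica. It only checks XML well-formedness, which is weaker than validity against the XHTML DTD. The sample documents and the helper names is_well_formed and count_paragraphs are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace that XHTML elements live in, in ElementTree's {uri}tag notation.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def count_paragraphs(markup: str) -> int:
    """Count the XHTML <p> elements in a well-formed document."""
    root = ET.fromstring(markup)
    return sum(1 for _ in root.iter(XHTML_NS + "p"))


# Hypothetical sample: every element is closed and properly nested.
valid_xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body>"
    "</html>"
)

# Hypothetical sample of typical tag soup: unclosed <p> and <br> elements.
tag_soup = "<html><body><p>First paragraph.<p>Second paragraph.<br></body></html>"

print(is_well_formed(valid_xhtml))
print(is_well_formed(tag_soup))
print(count_paragraphs(valid_xhtml))
```

A strict parser accepts the first document and rejects the second outright, so the grammar for the valid case stays small and the recovery problem is left to a separate tool.
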
Handling the vast amount of ill-formed and invalid HTML published on the
web is a separate problem. I would try to solve it by piping each
document through Tidy (to generate valid XHTML) and Grammatica (to