Parsing data out of an html file

Parsing data out of an html file

Andrew Smellie-2

Hi

 

I have a long and complex html file that contains a small piece of well-formatted data inside it. I need to write a grammar of the following type:

 

Skip over the garbage

Read what I need

Skip the rest of the garbage

 

Here is an example:

 

        </div>

          <!--

          <div class="endOfDay"> END OF PLAY REPORTS

            <div class="endOfDayLinks">

              <div class="endOfDayLeft"><a href="#">England</a></div>

              <div class="endOfDayRight"><a href="#">County</a></div>

            </div>

          </div>

          -->

        </div>

 

I want to parse the line <div class="endOfDayLeft"><a href="#">England</a></div> and ignore everything else

 

I have tried defining a “skip everything” token and then special-casing what I want:

 

SKIP_EVERYTHING         = <<.*>> %ignore%

WHITESPACE              = <<[ \t\n\r\d]+>> %ignore%

 

But I keep getting a parsing exception at the end of the file.

 

Thanks for any help in advance

 

Andrew


_______________________________________________
Grammatica-users mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/grammatica-users
Re: Parsing data out of an html file

Oliver Gramberg

Hi, Andrew,


your token definition,

SKIP_EVERYTHING         = <<.*>> %ignore%

does exactly what its name says: it skips everything. The reason is that the tokenizer works "greedily," i.e., it eats as many characters as it can, once a valid match is found. This is called the "longest match principle." The reason this principle is employed, in turn, is efficiency: The tokenizer doesn't have to backtrack, and therefore effectively reads each character only once.
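
To make the effect concrete, here is a minimal sketch of longest-match token selection. It is written in Python, with Python's `re` module standing in for Grammatica's tokenizer, so it only illustrates the principle and is not Grammatica's actual implementation. The token name TARGET, the helper `longest_match`, and the choice to let `.` match newlines (`re.DOTALL`) are assumptions made for the illustration.

```python
import re

# Hypothetical token table; Python's `re` stands in for Grammatica's tokenizer.
TOKENS = [
    ("TARGET", re.compile(r'<div class="endOfDayLeft">')),
    # Assumption: `.` may also match newlines, hence re.DOTALL.
    ("SKIP_EVERYTHING", re.compile(r".*", re.DOTALL)),
]


def longest_match(text, pos=0):
    """Return (name, lexeme) for the token that matches the most characters at pos."""
    best = None
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        # A strictly longer match wins; on a tie the earlier definition is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


html = '<div class="endOfDayLeft"><a href="#">England</a></div>\n</div>\n'
name, lexeme = longest_match(html)

# TARGET matches at offset 0 too, but SKIP_EVERYTHING matches more characters.
print(name)                  # SKIP_EVERYTHING
print(lexeme == html)        # True: the whole input became one ignored token
```

Even though the specific token is declared first, the catch-all wins because its match is longer, so nothing is left for the parser to see.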

Let's assume you are actually interested in the "England" bit of the line you show to be your target. Grammatica's %ignore% is all-or-nothing, so it is not of much help here: The line is identified by the markup at the beginning of the line, so you cannot just throw away *all* markup; also, you want to throw away most of the content, but not *all*.

Fortunately, there's another way: To ignore something can also mean *not to do anything with it*, or, in Grammatica's terms: to do nothing in the method that is called when such a token is found.

So, the easiest solution to your problem might involve
(1) declaring a token that exactly matches HTML markup before the location where you want to extract data;
(2) declaring a token that matches all HTML markup, i.e., starts with "<";
(3) declaring a token that matches all HTML non-markup, i.e., starts with "[^<]";

- Token (1) must come first in your grammar; this way Grammatica chooses it over (2) when your identifying markup appears in the input.
- When (1) is found, you set (in the appropriate callback method) a flag that indicates that the next non-markup is the data you want to extract.
- Only when the flag is set, (3) is used as output.
- Don't forget to reset the flag.
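
The steps above can be sketched as follows. This is Python rather than a Grammatica grammar plus callback class, so it only shows the logic of the three tokens and the flag; the names TARGET, MARKUP, TEXT, `tokenize`, `extract` and `want_next_text` are made up for the illustration, and the loop in `extract` plays the role of the callback methods. It also assumes that every "<" in the input is eventually closed by a ">".

```python
import re

# Hypothetical token definitions, in grammar order.
TOKENS = [
    ("TARGET", re.compile(r'<div class="endOfDayLeft">')),  # (1) identifying markup
    ("MARKUP", re.compile(r"<[^>]*>")),                     # (2) any markup
    ("TEXT", re.compile(r"[^<]+")),                         # (3) any non-markup
]


def tokenize(text):
    """Yield (name, lexeme) pairs: longest match wins, ties go to the earlier token."""
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in TOKENS:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        yield best[0], best[1].group()
        pos = best[1].end()


def extract(text):
    """Collect the first non-markup text following each TARGET token."""
    results = []
    want_next_text = False
    for name, lexeme in tokenize(text):
        if name == "TARGET":
            want_next_text = True           # next non-markup is the data
        elif name == "TEXT" and want_next_text:
            results.append(lexeme.strip())
            want_next_text = False          # reset the flag
        # MARKUP and unflagged TEXT are "ignored" by doing nothing with them.
    return results


html = """        </div>
          <!--
          <div class="endOfDay"> END OF PLAY REPORTS
            <div class="endOfDayLinks">
              <div class="endOfDayLeft"><a href="#">England</a></div>
              <div class="endOfDayRight"><a href="#">County</a></div>
            </div>
          </div>
          -->
        </div>
"""
print(extract(html))  # ['England']
```

Note that TARGET and MARKUP match exactly the same characters at the identifying markup, so it is only the order of the definitions that makes TARGET win there.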


On the other hand, with such a small number of tokens, it might be even easier to handle this with a small script:

perl -n extract.pl output.html > extract.txt

with this line as the contents of extract.pl:

print "$1\n" if m|<div class="endOfDayLeft"><a href="#">([^<]+)</a></div>|;


Regards
Oliver Gramberg
