[nmh-workers] INCing of email archives


[nmh-workers] INCing of email archives

Bakul Shah
Once in a while I download email archives of some mailing list
and unpack them using "inc -file <archive-file>". But more
than once I have seen that inc gets confused and doesn't
unpack the whole thing. The cause seems to be a line starting
with From in some message body. Ideally inc should check that
a "From ..." line is immediately followed by header lines.
And if this is not the case, assume it is in the message body.

How do people deal with this?  I tried writing a quick hack
like the one below but surely there is a better way?

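fix() {
        # Build an ed script: one "Ns|^|>|" command per matching line number N.
        grep -n '^From .*[^0-9]$' "$1" | sed 's/:.*/s|^|>|/' > ",$1"
        # Apply the script only if something matched, then write and quit.
        if [ -s ",$1" ]; then echo wq >> ",$1"; ed -s "$1" < ",$1"; fi
        rm -f ",$1"
}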

This prepends a > to any line beginning with "From " and not
ending with a digit.

--
nmh-workers
https://lists.nongnu.org/mailman/listinfo/nmh-workers

Re: INCing of email archives

Ralph Corderoy
Hi Bakul,

> Once in a while I download email archives of some mailing list
> and unpack them using "inc -file <archive-file>". But more
> than once I have seen that inc gets confused and doesn't
> unpack the whole thing. The cause seems to be a line starting
> with From in some message body.

Then it isn't any of the four mbox formats described at
https://en.wikipedia.org/wiki/Mbox#Family ?

> Ideally inc should look that a "From ..." line is immediately followed
> by header lines.  And if this is not the case, assume it is in the
> message body.

I agree that would be one heuristic to help, but it would also have
problems:

    From the outset, was clear we failed 42
    times: the first on attempting to read faulty input...

> fix() {
> grep -n '^From .*[^0-9]$' $1 | sed 's/:.*/s|^|>|/' > ,$1
> if [ -s ,$1 ]; then echo wq >> ,$1; cat ,$1 | ed $1; fi
> rm ,$1
> }
>
> This prepends a > to any line beginning with "From "and not
> ending with a digit.

    sed -i '/^From .*[^0-9]$/s/^/> /' "${1?}"

--
Cheers, Ralph.


Re: INCing of email archives

Paul Fox-3
In reply to this post by Bakul Shah
bakul wrote:
 > Once in a while I download email archives of some mailing list
 > and unpack them using "inc -file <archive-file>". But more
 > than once I have seen that inc gets confused and doesn't
 > unpack the whole thing. The cause seems to be a line starting
 > with From in some message body. Ideally inc should look that
 > a "From ..." line is immediately followed by header lines.
 > And if this is not the case, assume it is in the message body.

I thought the trigger was "\n\nFrom: ", and that no more headers
were needed.

paul

=----------------------
paul fox, [hidden email] (arlington, ma, where it's 58.5 degrees)



Re: INCing of email archives

Ken Hornstein-2
In reply to this post by Bakul Shah
>Once in a while I download email archives of some mailing list
>and unpack them using "inc -file <archive-file>". But more
>than once I have seen that inc gets confused and doesn't
>unpack the whole thing. The cause seems to be a line starting
>with From in some message body. Ideally inc should look that
>a "From ..." line is immediately followed by header lines.
>And if this is not the case, assume it is in the message body.

Ralph answered this, but let me expand a bit.

The job of inc(1) is to incorporate messages from a 'mail drop' into your
MH mailbox.  Traditionally it handles mbox-style files and POP (it also
does MMDF, but let us not speak of that).

As you can see from the Wikipedia entry Ralph linked to, all of the
various mbox formats use the same scheme: a line beginning with "From
" is the mailbox delimiter (mboxcl and mboxcl2 use a Content-Length
header; I believe they are officially dead at this point).  The big
differences are in quoting rules.  Unfortunately since we're kind of
locked in to the mbox format in inc(1) at least, changing that would
have some nasty consequences (Ralph gave you an example of a message
that it would break on but I am sure there are others).  I think your
best bet is to preprocess these mailing list archives so they are valid
mbox files.

--Ken


Re: INCing of email archives

Bakul Shah
On Jul 25, 2019, at 4:25 PM, Ken Hornstein <[hidden email]> wrote:

>
>> Once in a while I download email archives of some mailing list
>> and unpack them using "inc -file <archive-file>". But more
>> than once I have seen that inc gets confused and doesn't
>> unpack the whole thing. The cause seems to be a line starting
>> with From in some message body. Ideally inc should look that
>> a "From ..." line is immediately followed by header lines.
>> And if this is not the case, assume it is in the message body.
>
> Ralph answered this, but let me expand a bit.
>
> The job of inc(1) is to incorporate messages from a 'mail drop' into your
> MH mailbox.  Traditionally it handles mbox-style files and POP (it also
> does MMDF, but let us not speak of that).
>
> As you can see from the Wikipedia entry Ralph linked to, all of the
> various mbox formats use the same scheme: a line beginning with "From
> " is the mailbox delimiter (mboxcl and mboxcl2 uses a Content-Length
> header; I believe they are officially dead at this point).  The big
> differences are in quoting rules.  Unfortunately since we're kind of
> locked in to the mbox format in inc(1) at least, changing that would
> have some nasty consequences (Ralph gave you an example of a message
> that it would break on but I am sure there are others).  I think your
> best bet is to preprocess these mailing list archives so they are valid
> mbox files.

Thanks, Ralph & Ken. The site from where I downloaded the latest
email archive uses mailman so I was a bit surprised. The method
I suggested would make inc able to handle a larger set of inputs.
While there can still be false positives, the number of messages
matching

From ... [0-9]$
<mail header>:

is likely to be much much smaller than a random line starting with
"From " and ending in a digit. Still, I can understand the reluctance
to add this logic to inc.
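
For illustration, here is a minimal sketch of that heuristic written as a
preprocessing filter instead of a change to inc. It is not something inc
does. The name quote_from is hypothetical, and the header test (a letter,
then letters, digits or hyphens, then a colon) is a simplifying assumption
about what a header line looks like. A "From " line is kept as a separator
only if it ends in a digit and the next line looks like a header; any other
"From " line gets a ">" prepended.

```shell
#!/bin/sh
# quote_from: hypothetical helper, not part of nmh.
# Reads an mbox-like archive on stdin and writes a copy on stdout in which
# every "From " line that fails the heuristic is prefixed with ">".
quote_from() {
    awk '
    # Decide the fate of a pending "From " line using the line after it.
    function flush(next_line) {
        if (have_held) {
            # Separator: ends in a digit AND is followed by a header-like line.
            if (held ~ /[0-9]$/ && next_line ~ /^[A-Za-z][A-Za-z0-9-]*:/)
                print held
            else
                print ">" held
            have_held = 0
        }
    }
    {
        flush($0)
        if ($0 ~ /^From /) {
            # Hold the line back until the following line has been seen.
            held = $0
            have_held = 1
        } else {
            print
        }
    }
    # A "From " line at the very end of the input has no header after it.
    END { flush("") }
    '
}
```

One way to use it would be "quote_from < archive > archive.fixed" and then
"inc -file archive.fixed". Ralph's example still fools it: that "From "
line ends in a digit and the next line, "times: the first ...", looks like
a header.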




Re: INCing of email archives

Ken Hornstein-2
In reply to this post by Paul Fox-3
> > Once in a while I download email archives of some mailing list
> > and unpack them using "inc -file <archive-file>". But more
> > than once I have seen that inc gets confused and doesn't
> > unpack the whole thing. The cause seems to be a line starting
> > with From in some message body. Ideally inc should look that
> > a "From ..." line is immediately followed by header lines.
> > And if this is not the case, assume it is in the message body.
>
>I thought the trigger was "\n\nFrom: ", and that no more headers
>were needed.

It seems that mbox format was officially standardized in RFC 4155
(although even that RFC acknowledges that there are a lot of variations).
I guess the best you can count on is:

From <unspecified stuff here>

Note no colon (that's a header field).

To make it even more confusing, technically \n\n is NOT part of the
separator because at the beginning of the mbox file you don't have a
blank line.  Every other message in the mbox file IS supposed to have
a blank line followed by a "From ".
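
To see how an archive measures up against that rule, one could count the
lines that qualify as separators: a "From " line that is either the first
line of the file or directly preceded by a blank line. This is only a rough
sketch. The name count_msgs is hypothetical, and it deliberately ignores
any further constraints on the rest of the "From " line.

```shell
#!/bin/sh
# count_msgs: hypothetical helper, not part of nmh.
# Reads an mbox-like archive on stdin and prints how many lines qualify as
# message separators under the blank-line rule described above.
count_msgs() {
    awk '
    # Candidate position: first line of the file, or right after a blank line.
    NR == 1 || prev == "" { if ($0 ~ /^From /) n++ }
    # Remember this line so the next one can check what preceded it.
    { prev = $0 }
    # n + 0 prints 0 instead of an empty string when nothing matched.
    END { print n + 0 }
    '
}
```

Comparing the output of "count_msgs < archive" with "grep -c '^From '
archive" gives a quick idea of how many "From " lines are not preceded by
a blank line.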

--Ken


Re: INCing of email archives

Steffen Nurpmeso
In reply to this post by Paul Fox-3
Paul Fox wrote in <[hidden email]>:
 |bakul wrote:
 |> Once in a while I download email archives of some mailing list
 |> and unpack them using "inc -file <archive-file>". But more
 |> than once I have seen that inc gets confused and doesn't
 |> unpack the whole thing. The cause seems to be a line starting
 |> with From in some message body. Ideally inc should look that
 |> a "From ..." line is immediately followed by header lines.
 |> And if this is not the case, assume it is in the message body.
 |
 |I thought the trigger was "\n\nFrom: ", and that no more headers
 |were needed.

The trigger is "From ", but real POSIX requires at least one valid
header line to follow (then an empty line again), so with that
automatic MBOX detection can be heavily improved.  (Even mine
does it that way now, thanks to Dr. Fink of SUSE.)  On top of
that RFC 4155 puts more constraints on the "From " / From_ line.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
