[nmh-workers] nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Paul Fox-3
ken wrote:
 > >The «"» around `Blind-Carbon-Copy' should be \(lq and \(rq, or the
 > >equivalent strings for consistency with the style used at start of the
 > >paragraph.
 >
 > So, in a mostly unrelated note ... I couldn't help noticing that Ralph
 > used guillemets («») in one of his messages on this thread (way to push
 > non-US-ASCII characters, Ralph!), and after a series of replies to his note
 > things devolved into classic mojibake.  And since hopefully most everyone
 > on this thread is an nmh user, I wanted to understand why, because really
 > that shouldn't have happened.


Mea Culpa.  I haven't fully worked through the bug or the fix, but
rest assured, the problem isn't with nmh.

My replies and forwarded message drafts are constructed by a script
that predates replyfilter.  It does things like add attribution ("ken
wrote:"), my .sig, and the bulk of the body with the " > " indents.
It includes the original headers if forwarding, but not when replying,
and also adjusts the current headers based on what folder I'm in, for
things like Reply-to: and Fcc:.

I haven't done full debugging yet, but looking quickly I see that the
body content is created by:
            mhshow -form mhl.null -type text/plain -file $original_text  |
                utf_clean |
                remove_part_markers_and_quote

where $original_text is the path to the message being replied to.

The function remove_part_markers_and_quote() runs sed to get rid of
the "part markers" that mhshow emits:
    remove_part_markers_and_quote()
    {
        # delete part markers entirely if they're the whole line,
        # otherwise just remove that part of the line.
        # and because we're already running sed, add the leading ' > '
        sed -e '/^\[\*@\(\[ part .* \]\)@\*\]$/d' \
            -e 's/\[\*@\(\[ part .* \]\)@\*\]//' \
            -e 's/^/ > /'
    }

But utf_clean() is the culprit, I believe -- it's there to remove a
few really annoying binary characters that my fonts don't display
correctly.  But it does so with a fairly large and indiscriminate
hammer, completely ignoring the current encoding.
    utf_clean()
    {
        #eliminate utf hard non-printing space:  <U+200B> or \u200B
        #also eliminate A0, which is non-breaking space in iso-8859
        sed -e 's/\xe2\x80\x8e/ /g' \
            -e 's/\xe2\x80\x8b//g' \
            -e 's/\xa0/ /g' \
            -e 's/\xc2/ /g'
    }
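
The effect of those substitutions on a UTF-8 guillemet can be reproduced
outside the shell.  The following is a minimal Python sketch, not the
script itself: `utf_clean_bytes` mirrors the four sed expressions on the
assumption that sed treats its input as raw bytes, and `utf_clean_text`
is a hypothetical alternative that decodes with a known charset before
removing characters.  Both function names are invented for illustration.

```python
def utf_clean_bytes(data: bytes) -> bytes:
    """Apply the same four substitutions as the sed script, on raw bytes."""
    return (
        data.replace(b"\xe2\x80\x8e", b" ")  # UTF-8 for U+200E -> space
        .replace(b"\xe2\x80\x8b", b"")       # UTF-8 for U+200B -> removed
        .replace(b"\xa0", b" ")              # any 0xa0 byte -> space
        .replace(b"\xc2", b" ")              # any 0xc2 byte -> space
    )


def utf_clean_text(data: bytes, charset: str) -> bytes:
    """Hypothetical alternative: decode first, then edit whole characters."""
    text = data.decode(charset)
    for unwanted, replacement in (("\u200e", " "), ("\u200b", ""), ("\u00a0", " ")):
        text = text.replace(unwanted, replacement)
    return text.encode("utf-8")


# The quoted line, with guillemets, as UTF-8 bytes.
original = "The \u00ab\"\u00bb around".encode("utf-8")

# Byte-level cleaning strips the 0xc2 lead bytes and leaves bare 0xab/0xbb.
mangled = utf_clean_bytes(original)
print(mangled)

# The result no longer decodes as UTF-8.
try:
    mangled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)

# Character-level cleaning leaves the guillemets alone.
print(utf_clean_text(original, "utf-8") == original)
```

The byte-level version turns each 0xc2 lead byte into a space, which would
account for both the extra space and the stray ISO-8859-1-looking bytes
in the quoted text.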

I'll work on this, and also take a look at replyfilter to see if
I can't get it to do more of the heavy lifting.

paul



 >
 > I went back to the raw archives (ftp://lists.gnu.org/nmh-workers/2019-02)
 > because the mailing list software will sometimes translate stuff into
 > base64 encoding when it sees non-ASCII characters.  And, well, I hate to
 > assign blame, but I think it's a bit unavoidable ... please, don't anyone
 > take this as a personal attack, I am just trying to understand how we
 > could do better.
 >
 > Ralph's original note containing the guillemets (Message-Id
 > <[hidden email]>) was text/plain, a
 > character set of utf-8, and encoded using quoted-printable.  The
 > characters were encoded properly using quoted-printable, specifically
 > they were listed as =C2=AB and =C2=BB.
 >
 > Valdis was the first reply to that (Message-ID
 > <[hidden email]>), and HIS email was text/plain,
 > character set iso-8859-1, and encoded using quoted-printable.  He quoted
 > Ralph's message, and the guillemets were encoded as =AB and =BB.  Which seems
 > correct to me.
 >
 > Paul Fox replied to Valdis's note (Message-Id
 > <[hidden email]>), and THAT note
 > was text/plain, character set UTF-8, encoded using quoted-printable ...
 > but it seems like this was the start of where things went off the rails.
 > The original line in Valdis's email was (in raw form):
 >
 >    > The =AB=22=BB around ...
 >
 > But in Paul's note it ended up as (extra > added in the reply)
 >
 >    > > The  =AB" =BB around
 >
 > This is NOT correct.  First, there is an extra space in front of
 > the encoded bytes.  Secondly, they're not valid UTF-8; they're the
 > ISO-8859-1 bytes.  So I am guessing whatever Paul used to quote the reply
 > didn't translate the ISO-8859-1 characters properly into UTF-8.
 >
 > However, whatever Mark Bergman uses for email actually made an intelligent
 > decision.  When he replied to Paul's note, those invalid UTF-8 characters
 > got converted to the Unicode Replacement Character (U+FFFD), which was
 > sent out as =EF=BF=BD (utf-8, quoted-printable).
 >
 > Further muddying the waters ... when Ralph replied to Mark's email,
 > those Unicode Replacement Characters somehow got converted back to
 > the correct guillemets (=C2=AB and =C2=BB).  Which means Ralph has
 > perhaps the most intelligent reply quoting program ever and he should
 > immediately share it as it would revolutionize AI, or he went back and
 > manually corrected it when he replied to Mark's note.  I'm 50/50 on
 > which one of those scenarios is more likely.
 >
 > If anyone involved with this email thread wants to pipe up with some
 > more explanation on what exactly they used to compose their email
 > replies, I would love to hear it.  No judgements; I just want to know
 > how nmh could help everyone do better.  Like, do we need to include
 > better tools for composing reply messages?  Well, duh, the answer to
 > that is "yes", and I think replyfilter does ok here but obviously we
 > need to do better.  But if we're SENDING something that is not valid
 > UTF-8, should we be smarter and flag it?  People were upset when we
 > refused to send out 8-bit characters when your locale was US-ASCII (I
 > mean, REALLY?  I couldn't believe it), so I don't know what makes sense.
 > Sending out invalid UTF-8 just seems wrong to me.
 >
 > --Ken
 >
 > --
 > nmh-workers
 > https://lists.nongnu.org/mailman/listinfo/nmh-workers
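
As a rough illustration of the "flag it before sending" idea in the quoted
message, here is a minimal Python sketch.  It is not nmh code; the helper
name `is_valid_utf8` is an assumption made for this example.  It also
shows how a decoder that substitutes U+FFFD for invalid bytes ends up
sending =EF=BF=BD once the text is re-encoded as quoted-printable.

```python
import quopri


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Bare ISO-8859-1 guillemet bytes inside a body labelled as UTF-8.
bad_body = b'The  \xab" \xbb around'
print(is_valid_utf8(bad_body))

# A lenient reader replaces each invalid byte with U+FFFD ...
repaired = bad_body.decode("utf-8", errors="replace")

# ... and re-encoding that as UTF-8 plus quoted-printable yields =EF=BF=BD.
requoted = quopri.encodestring(repaired.encode("utf-8"))
print(requoted)
```

A check like `is_valid_utf8` could only flag the problem; choosing between
refusing to send, re-labelling the charset, or converting is a separate
policy question.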


=----------------------
paul fox, [hidden email] (arlington, ma, where it's 33.6 degrees)




Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Ralph Corderoy
In reply to this post by Ken Hornstein-2
Hi Ken,

> > The «"» around `Blind-Carbon-Copy' should be \(lq and \(rq
>
> So, in a mostly unrelated note ... I couldn't help noticing that Ralph
> used guillemets («») in one of his messages on this thread (way to
> push non-US-ASCII characters, Ralph!)

I find they're useful because they're visually distinct from anything
that might look like it should be part of the pipeline, etc.,
I'm quoting.  In Vim, the digraphs, entered after Ctrl-K, for them are
`<<' and `>>', so they're readily to hand.  And misappropriating them
from the French pleases my Englishness.
https://en.wikipedia.org/wiki/Guillemet

> Ralph's original note containing the guillemets (Message-Id
> <[hidden email]>) was text/plain, a
> character set of utf-8, and encoded using quoted-printable.

The QP is Mailman's meddling.  I gave it
`Content-Transfer-Encoding: 8bit'.

> Valdis was the first reply to that (Message-ID
> <[hidden email]>), and HIS email was
> text/plain, character set iso-8859-1, and encoded using
> quoted-printable.  He quoted Ralph's message, and the guillemets were
> encoded as =AB and =BB.  Which seems correct to me.

Definitely.
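
For reference, the two encodings mentioned in the quoted text can be
checked with a short Python sketch using the standard `quopri` module.
It only illustrates the byte values; it says nothing about how any
particular mailer produced them.

```python
import quopri

guillemets = "\u00ab\u00bb"  # the two guillemet characters

# Same characters, two charsets, each then quoted-printable encoded.
as_utf8 = quopri.encodestring(guillemets.encode("utf-8"))
as_latin1 = quopri.encodestring(guillemets.encode("iso-8859-1"))
print(as_utf8)
print(as_latin1)

# Decoding the raw form quoted in the thread, using the charset it
# was labelled with, recovers the intended text.
raw = b"The =AB=22=BB around"
decoded = quopri.decodestring(raw).decode("iso-8859-1")
print(decoded)
```

The =AB and =BB form is only correct while the part is labelled
iso-8859-1; the same bytes under a utf-8 label are invalid.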

> Paul Fox replied to Valdis's note (Message-Id
> <[hidden email]>), and THAT
> note was text/plain, character set UTF-8, encoded using
> quoted-printable ...  but it seems like this was the start of where
> things went off the rails.

Yes.  I'm sure Paul won't mind being blamed.  :-)

> Further muddying the waters ... when Ralph replied to Mark's email,
> those Unicode Replacement Characters somehow got converted back to the
> correct guillemets (=C2=AB and =C2=BB).  Which means Ralph has perhaps
> the most intelligent reply quoting program ever and he should
> immediately share it as it would revolutionize AI, or he went back and
> manually corrected it when he replied to Mark's note.  I'm 50/50 on
> which one of those scenarios is more likely.

I fixed it manually when composing in vim.  Didn't see the point in
deciding to keep those lines as context and yet mislead by their
content.

--
Cheers, Ralph.


Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Ralph Corderoy
In reply to this post by Robert Elz
Hi kre,

Sending to you directly so you see a version that Mailman doesn't touch.

>   | >The «"» around `Blind-Carbon-Copy'
>
> I am leaving that there just so you can see what happens...   What I
> see when composing this is (ignoring my "|" quoting marker, ">The "
> (which I assume is fine for everyone),

Yes.

> capital A with a caret (I hope that is the right name, like the ^
> char, but smaller),

Yes, see `dict -d foldoc caret'.

> the opening guillemets (never heard that name before...) a normal
> double quote (ascii), another capital A-caret, and the closing
> guillemets (and then a space, and the rest of the text).

Yes.

> What I think is happening, is that everything I do is "un-localed",
> that is, I have no LC_* or LANG settings at all, which means that
> everything runs in the C (aka POSIX) locale (more or less US-ASCII).
>
> If I use nmh (ie: show) to look at your message, I see:
>
> >The ?"? around `Blind-Carbon-Copy'
>
> which is correct as I understand things.

I think that's because nmh knows the text has two bytes representing
each guillemet, and iconv(3) says it can't translate either of them,
Unicode U+00AB or U+00BB, to the C locale so nmh renders each two bytes
as a single `?' byte.
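
The difference between replacing per character and per byte can be
sketched in Python.  This uses Python's own codecs as a stand-in for
iconv(3) and is not nmh's code path; it only shows why a charset-aware
conversion gives one `?' per guillemet.

```python
# The quoted fragment as it arrives: UTF-8 bytes for guillemet, quote, guillemet.
data = b"\xc2\xab\"\xc2\xbb"

# Charset-aware: decode to characters, then substitute each character
# that the target charset cannot represent.
per_character = data.decode("utf-8").encode("ascii", errors="replace")
print(per_character)

# Charset-unaware: substitute every byte with the high bit set.
per_byte = bytes(b if b < 0x80 else ord("?") for b in data)
print(per_byte)
```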

> Then, what I expect happens, is that when the reply is composed, and
> the 2 byte UTF-8 character is read, it is instead interpreted as 2
> characters, one of which is the A-Caret, and the other is, probably
> not entirely by fluke, the opening « (which I just pasted from your
> message, no idea in what form it will be sent out).

Correct.  The UTF-8 encoding of U+00AB is 0xc2 0xab.

    $ printf '\uab' | hd
    00000000  c2 ab                                             |..|

This is because the 11 bits, p-z, of the Unicode runes [0x80, 0x800) are
mapped to two bytes.

    110p qrst
    10uv wxyz

u-z stay in their original place within the byte.  Their byte is headed
by 10.  If s-t's value is 10 then the second byte retains its original
value and so runes [0x80, 0xc0) are the ones that are simply prefixed by
0xc2 when they are UTF-8 encoded.

Byte 0xc2 is a `Â' in ISO 8859-1 so if you see a pair of runes starting
with `Â' then the second one is what was intended under a common
mis-encoding.
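
The bit layout above can be written out as a minimal Python sketch and
checked against Python's built-in encoder.  The function name
`encode_two_byte` is made up for this example.

```python
def encode_two_byte(rune: int) -> bytes:
    """Encode a rune in [0x80, 0x800) as two UTF-8 bytes."""
    if not 0x80 <= rune < 0x800:
        raise ValueError("not a two-byte rune")
    first = 0b11000000 | (rune >> 6)           # 110p qrst
    second = 0b10000000 | (rune & 0b00111111)  # 10uv wxyz
    return bytes([first, second])


print(encode_two_byte(0xAB).hex())

# Every rune in [0x80, 0xc0) is its own byte value prefixed by 0xc2.
prefixed = all(
    encode_two_byte(rune) == bytes([0xC2, rune]) for rune in range(0x80, 0xC0)
)
print(prefixed)

# Reading those two bytes as ISO 8859-1 gives A-circumflex plus the
# intended character.
print(encode_two_byte(0xAB).decode("iso-8859-1"))
```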

> \xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb5\xe0\xb8\xa2\xe0\xb8\x99
> \xe0\xb8\x84\xe0\xb8\x93\xe0\xb8\xb2\xe0\xb8\x88\xe0\xb8\xb2
> \xe0\xb8\xa3\xe0\xb8\xa2\xe0\xb9\x8c

    $ show | grep \\\\x |
    > tr -dc 0-9a-f | tr a-f A-F |
    > sed 's/.*/16i&0AP/' | dc
    เรียนคณาจารย์
    $
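
For readers without dc to hand, roughly the same decoding can be
sketched in Python.  The helper name `decode_hex_escapes` is invented
for this example, and it assumes the escapes spell out UTF-8 bytes, as
they do in the quoted text.

```python
import re


def decode_hex_escapes(text: str) -> str:
    """Collect every \\xNN escape in order and decode the bytes as UTF-8."""
    hex_digits = "".join(re.findall(r"\\x([0-9a-fA-F]{2})", text))
    return bytes.fromhex(hex_digits).decode("utf-8")


quoted = r"""
> \xe0\xb9\x80\xe0\xb8\xa3\xe0\xb8\xb5\xe0\xb8\xa2\xe0\xb8\x99
> \xe0\xb8\x84\xe0\xb8\x93\xe0\xb8\xb2\xe0\xb8\x88\xe0\xb8\xb2
> \xe0\xb8\xa3\xe0\xb8\xa2\xe0\xb9\x8c
"""

decoded = decode_hex_escapes(quoted)
print(decoded)
```

Unlike the `tr -dc 0-9a-f` step, the regular expression only picks up
hex digits that follow a `\x`, so stray letters elsewhere on a line
would not end up in the output.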

--
Cheers, Ralph.


Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Robert Elz
    Date:        Fri, 15 Feb 2019 15:48:48 +0000
    From:        Ralph Corderoy <[hidden email]>
    Message-ID:  <[hidden email]>

  | Sending to you directly so you see a version that Mailman doesn't touch.

That's not actually guaranteed, though it worked ...   I do mail
filtering that drops duplicates, so which I see would depend upon
which arrived first.

  | Yes, see `dict -d foldoc caret'.

Good to see I remember some things!    It is kind of interesting
(and I'm sure there's a reason) that the things I saw as A-caret
(and you explained why) appeared in the message I got back from the
list as A-umlaut instead!

  | I think that's because nmh knows the text has two bytes representing
  | each guillemet, and iconv(3) says it can't translate either of them,
  | Unicode U+00AB or U+00BB, to the C locale so nmh renders each two bytes
  | as a single `?' byte.

Yes, that is what I expected to have happened.

  | The UTF-8 encoding of U+00AB is 0xc2 0xab.
  | This is because [...]

Thanks, though I understand how to do UTF-8 (and undo it).   How
to use things that do it properly, to achieve the desired outcome,
is an entirely different matter.

  |     $ show | grep \\\\x |
  |     > tr -dc 0-9a-f | tr a-f A-F |
  |     > sed 's/.*/16i&0AP/' | dc

Produced exactly the right string.   (I can pattern match, I'd have
to get someone else to tell me what it says though!)

Before I deleted the output from this reply, all there was was another
mess...

kre




Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Ralph Corderoy
In reply to this post by Valdis Klētnieks
Hi Valdis,

> Am I the only guy who's been bitten by documentation that has single
> and double quotes that look cut-n-paste-able but actually aren't?

This is a combination of faulty man page source, a confusing area of
troff implementation evolution over its many decades, and some faults
being papered over by some systems to help folk like you.  Unfortunately, it
means upstream don't fix the problems because they don't see them.

For example, your /usr/share/groff/1.22.3/tmac/an-old.tmac or similar has

    .\" For UTF-8, map some characters conservatively for the sake
    .\" of easy cut and paste.
    .
    .if '\*[.T]'utf8' \{\
    .  rchar \- - ' `
    .
    .  char \- \N'45'
    .  char  - \N'45'
    .  char  ' \N'39'
    .  char  ` \N'96'
    .\}

--
Cheers, Ralph.


Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Paul Fox-3
In reply to this post by Valdis Klētnieks
[hidden email] wrote:
 > On Thu, 14 Feb 2019 22:51:43 -0500, Ken Hornstein said:
 > > >There's something else wonky about Paul's note. My preliminary guess is
 > > >that something got mangled when MailMan tacked the us-ascii footer onto
 > > >Paul's utf-8.  The end result is that exmh gets confoozled, even though
 > > >it usually gets this sort of thing right.
 > >
 > > FWIW, I saw the same exact thing.  I thought maybe it was because of the
 > > invalid encoding, but maybe that isn't right
 >
 > Found it.
 >
 > Content-ID: <6521.1550182708.1@grass>
 >
 > Removing it causes exmh to behave. Added to my bug list.
 >

Did you happen to see the same behavior with Mark Bergman's message,
in this same thread?  I ask because we both generated Content-ID headers
with no domain part (i.e., no dots after the '@').

paul
=----------------------
paul fox, [hidden email] (arlington, ma, where it's 42.3 degrees)



Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Ralph Corderoy
In reply to this post by Robert Elz
Hi kre,

> > Sending to you directly so you see a version that Mailman doesn't
> > touch.
>
> That's not actually guaranteed, though it worked ...   I do mail
> filtering that drops duplicates, so which I see would depend upon
> which arrived first.

Good point.  I thought the list removed addresses in the headers from
the subscribers and sent to the rest, but I see that's a per-subscriber,
not per-list setting.

> It is kind of interesting (and I'm sure there's a reason) that the
> things I saw as A-caret (and you explained why) appeared in the
> message I got back from the list as A-umlaut instead!

Yes, every round trip is interpreting a UTF-8 encoding as ISO 8859-1.

When I type the digraph A^ to get Â you receive its UTF-8 encoding which
is 0xc3 0x82.  The first of those bytes in ISO 8859-1 is Ã, an A with a
tilde.
The second isn't mapped to a rune in ISO 8859-1, but Windows 1252,
AKA cp1252(7), thinks it's a low-9 quote, looking similar to a comma,
so you might see that instead.

    $ printf \\x82 | iconv -f cp1252 -t ucs-4le |
    > hexdump -ve '/4 "U+%04x\n"'
    U+201a

In Unicode, it's U+201A so I can enter that to get [‚], but you'll see
that as the three-byte UTF-8 encoding 0xe2 0x80 0x9a, perhaps a^ C= sv.
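
Each step of that round trip can be traced with a minimal Python sketch.
Python's `iso-8859-1` and `cp1252` codecs stand in here for whatever the
mail software on each side actually does, which is an assumption.

```python
# Step 1: A-circumflex, sent as UTF-8.
sent = "\u00c2".encode("utf-8")
print(sent.hex())

# Step 2: the first byte, read as ISO 8859-1.
first_seen = sent[:1].decode("iso-8859-1")
print(first_seen)

# Step 3: the second byte, read as Windows-1252, is the low-9 quote.
second_seen = sent[1:].decode("cp1252")
print(f"U+{ord(second_seen):04X}")

# Step 4: sending that low-9 quote as UTF-8 takes three bytes ...
resent = second_seen.encode("utf-8")
print(resent.hex())

# ... which a Windows-1252 reader shows as three separate characters.
print(resent.decode("cp1252"))
```

Every hop that decodes UTF-8 with a single-byte charset and then
re-encodes as UTF-8 grows the text, so the damage compounds with each
reply.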

--
Cheers, Ralph.


Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Valdis Klētnieks
In reply to this post by Paul Fox-3
On Fri, 15 Feb 2019 18:11:12 -0500, Paul Fox said:

>  > Content-ID: <6521.1550182708.1@grass>
>  >
>  > Removing it causes exmh to behave. Added to my bug list.
>  >
>
> Did you happen to see the same behavior with Mark Bergman's message,
> in this same thread?  I ask because we both generated Content-ID headers
> with no domain part (i.e., no dots after the '@').

No, his doesn't trigger it, because although he had a Content-ID: tag, it was
in the main rfc822 headers of the mail (not sure what that is supposed to do).
The exmh bug has to do with displaying MIME body sub-parts, and Mark's was a
single part text/plain.


Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Ralph Corderoy
In reply to this post by Alexander Zangerl-4
Hi az,

> > I'd be tempted to make it an if-then with no else clause by hoisting
> > the "BCC:" prefix and "\n" suffix outside of the if-then.
>
> hmm. i see your point, but don't entirely agree. my aim here was to
> contain all the related logic within the smallest possible/sensible
> horizon.
>
> apart from the small ugliness of having the string "bcc:" hardcoded
> twice i prefer that 'if' block doing its thing (and all of its thing!)
> over strings conditionally accumulating across a pageful of code or
> more.

I don't quite get the comparison so think I may not have got my idea
over.  I don't know the context of the patch's chunk so considered it in
isolation.  The original,

    for (lp = localaddrs.m_next; lp; lp = lp->m_next)
      if (lp->m_bcc)
         allbcc = allbcc? add(concat(", ", lp->m_mbox, NULL), allbcc)
            : mh_xstrdup(lp->m_mbox);
    for (lp = netaddrs.m_next; lp; lp = lp->m_next)
      if (lp->m_bcc)
         allbcc = allbcc? add(
            concat(", ", lp->m_mbox, "@", lp->m_host, NULL),
            allbcc)
            : concat(lp->m_mbox, "@", lp->m_host, NULL);
    if (allbcc)
    {
      fprintf (out, "BCC: %s\n",allbcc);
      free(allbcc);
    }

seems simpler as

    for (lp = localaddrs.m_next; lp; lp = lp->m_next)
        if (lp->m_bcc)
            allbcc = add(concat(", ", lp->m_mbox, NULL), allbcc);

    for (lp = netaddrs.m_next; lp; lp = lp->m_next)
        if (lp->m_bcc)
            allbcc = add(concat(", ", lp->m_mbox, "@", lp->m_host, NULL), allbcc);

    if (allbcc) {
        fprintf(out, "BCC: %s\n", allbcc + 1);
        free(allbcc);
    }
 
though I haven't run it, let alone tested it, so it could be simpler but
wrong.  :-)
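
The idea of always prepending the separator and skipping it once at the
end can be sketched in Python.  This is not the nmh code: the `Addr`
tuple and `bcc_header` function are hypothetical, and the sketch assumes,
as the original loop suggests, that `add()` appends its first argument
to its second.

```python
from typing import List, NamedTuple, Optional

SEP = ", "


class Addr(NamedTuple):
    mbox: str
    host: Optional[str]  # None stands for a local address
    bcc: bool


def bcc_header(localaddrs: List[Addr], netaddrs: List[Addr]) -> Optional[str]:
    """Build the BCC line, or return None if no address is a bcc."""
    allbcc = ""
    for addr in localaddrs:
        if addr.bcc:
            allbcc += SEP + addr.mbox
    for addr in netaddrs:
        if addr.bcc:
            allbcc += SEP + addr.mbox + "@" + addr.host
    if not allbcc:
        return None
    # Skip the leading separator exactly once, whatever its length.
    return "BCC: " + allbcc[len(SEP):] + "\n"


local = [Addr("alice", None, True), Addr("bob", None, False)]
net = [Addr("carol", "example.org", True)]
print(bcc_header(local, net), end="")
```

In this sketch the separator is two characters long, so two characters
are skipped; skipping only one would leave an extra space after
"BCC: ".  If the assumption about `add()` holds, the C version would
need `allbcc + 2` to match the original output exactly.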
 
> anyway, these are my personal preferences; i hope we can agree to
> disagree and that somebody with commit rights will apply the patch or
> equivalent to the code.

Yep, absolutely.  I see it's already getting attention from David and
I expect Ken will look too.

--
Cheers, Ralph.


Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Robert Elz
In reply to this post by Paul Fox-3
    Date:        Fri, 15 Feb 2019 18:11:12 -0500
    From:        Paul Fox <[hidden email]>
    Message-ID:  <[hidden email]>

  | Did you happen to see the same behavior with Mark Bergman's message,
  | in this same thread?  I ask because we both generated Content-ID headers
  | with no domain part (i.e., no dots after the '@').

I think it is more the Content-Type header in the multipart/mixed body
that is the issue, your message was multipart/mixed, his was just a
simple text/plain

I see the mangling in your message, but his is fine.   I doubt the content
of the content-id is related.

kre



Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

Valdis Klētnieks
On Sat, 16 Feb 2019 07:35:22 +0700, Robert Elz said:
> I see the mangling in your message, but his is fine.   I doubt the content
> of the content-id is related.

I haven't dug into it deeply, but I suspect there's a hidden assumption that if
a body part has a content-ID: header, it's the target of some html that does an
<img src="cid://..." or a tarball or a pdf rather than being a text/plain to be
displayed inline.



Re: nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe

David Levine-3
In reply to this post by Alexander Zangerl-4
az wrote:

> that somebody with commit rights will apply the patch or equivalent
> to the code.

Done, sorry for the delay.  Thank you for investigating this bug and
providing your patches.

David
