scan - Funky output with utf-8

Valdis Klētnieks
%body in a scan format goes pear-shaped if -width splits a UTF character. Seems
that -width isn't clear on whether it's counting bytes or glyphs - 136 and 138
produce 1 UTF-8 character different.  And -width 137 splitting the UTF-8
char?  Yee-hah.  And I have no idea why -width 137 *stopped* where it did -
that one output 633 characters.
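
A minimal sketch of why the bytes-versus-glyphs question matters (illustrative Python, not nmh code; the sample string and function names are made up). Cutting UTF-8 at a byte count can land inside a multi-byte character, while cutting at a character count cannot:

```python
def cut_bytes(text: str, width: int) -> bytes:
    """Truncate the UTF-8 encoding of text to at most width bytes."""
    return text.encode("utf-8")[:width]


def cut_chars(text: str, width: int) -> bytes:
    """Truncate text to at most width characters, then encode."""
    return text[:width].encode("utf-8")


sample = "Beau£iful."          # '£' is two bytes in UTF-8: 0xC2 0xA3

print(cut_bytes(sample, 5))    # ends with a lone 0xC2 lead byte
print(cut_chars(sample, 5))    # ends with the complete '£'
```

The byte-cut result is not valid UTF-8 on its own, which is the kind of output a terminal renders as garbage. Neither function accounts for characters that occupy two terminal columns; that would need a separate width lookup.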

The tail end of the scan format is:

%(void{content-type})\
%<(match multipart)\
%?(match text/html)\
%|%<{body} <<%{body}%>\
%>

(Attaching a .png, cut-n-paste didn't quite get things right)

[~] scan -width 136 867285
867285      *  Fri 17Mar    9183 ??               Re: [PATCH v1 6/7] arm64: dts: rockchip: add dts file for RK3328 evaluation board <<

[~] scan -width 138 867285
867285      *  Fri 17Mar    9183 ??               Re: [PATCH v1 6/7] arm64: dts: rockchip: add dts file for RK3328 evaluation board <<?

[~] scan -width 137 867285
867285      *  Fri 17Mar    9183 ??               Re: [PATCH v1 6/7] arm64: dts: rockchip: add dts file for RK3328 evaluation board <<? 2017?03?17? 00:18, Andre Przywara ??: > Hi Chen, > > On 16/03/17 01:45, [hidden email] wrote: >> From: Chen Liang <[hidden email]> >> >> This patch add rk3328-evb.dts for RK3328 evaluation board. >> Tested on RK3328 evb. >> >> Signed-off-by: Chen Liang <[hidden email]> >> --- >> arch/arm64/boot/dts/rockchip/Makefile | 1 + >> arch/arm64/boot/dts/rockchip/rk3328-evb.dts | 57 +++++++++++++++++++++++++++++ >> 2 files changed, 58 insertions(+) >> create mode 100644 arch/a


_______________________________________________
Nmh-workers mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/nmh-workers

Attachments: nmh-scan.png (62K), attachment1 (495 bytes)

Re: scan - Funky output with utf-8

Ralph Corderoy
Hi Valdis,

> %body in a scan format goes pear-shaped if -width splits a UTF character.

May we please have the output of

    scan -version
    locale

> (Attaching a .png, cut-n-paste didn't quite get things right)

A simple test here doesn't show the fault.

    $ scan -version
    scan -- nmh-1.6 [compiled on orac at 2016-11-15 11:46:57 +0000 Tue]
    $
    $ for w in 7 8 9; do
    >     printf '%s\n' '' Beau£iful. |
    >     scan -forma '<<%{body}>>' -file /dev/stdin -width $w
    > done
    <<Beau
    <<Beau£
    <<Beau£i
    $

It would be handy to have just the start of the undecoded body so we can
reproduce the same UTF-8 runes, e.g. the quoted-printable or base64 of
867285's body.

--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

Re: scan - Funky output with utf-8

Valdis Klētnieks
On Tue, 21 Mar 2017 12:02:51 -0000, Ralph Corderoy said:
(I appear to have only replied to Ralph.  Trying again)

> May we please have the output of
>
>     scan -version

% scan -version
scan -- nmh-1.6+dev [compiled on turing-police.cc.vt.edu at Tue Mar 21 01:01:40 EDT 2017]

Did a 'git pull' before I built it..

>     locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=


> > (Attaching a .png, cut-n-paste didn't quite get things right)
>
> A simple test here doesn't show the fault.
>
>     $ scan -version
>     scan -- nmh-1.6 [compiled on orac at 2016-11-15 11:46:57 +0000 Tue]
>     $
>     $ for w in 7 8 9; do
>     >     printf '%s\n' '' Beau£iful. |
>     >     scan -forma '<<%{body}>>' -file /dev/stdin -width $w
>     > done
>     <<Beau
>     <<Beau£
>     <<Beau£i
>     $

It appears to be in the same class of faults as another bug I reported a while
back with handling of null bytes in mhfixmsg - the failure symptoms appear to
be dependent on the amount of input.  You don't see a problem; when I try with
the entire mail it overruns by 400 bytes; if I trim it down to just 40-50
chars past where the problem hits, it only overruns by 60 chars or so...

> It would be handy to have just the start of the undecoded body so we can
> reproduce the same UTF-8 runes, e.g. the quoted-printable or base64 of
> 867285's body.

Am attaching the form file and the problematic mail...

[~] scan -width 137 -form /tmp/timely -file /tmp/badmail
     1      *  Fri 17Mar       0 ??               Re: [PATCH v1 6/7] arm64: dts: rockchip: add dts file for RK3328 evaluation board <<? 2017?03?17? 00:18, Andre Przywara ??: > Hi Chen, > > On 16/03/17 01:45, [hidden email] wrote: >> From: Chen Liang <[hidden email]> >> >> This patch add rk3328-evb.dts for RK3328 evaluation board. >> Tested on RK3328 evb. >> >> Signed-off-by: Chen Liang <[hidden email]> >> --- >> arch/arm64/boot/dts/rockchip/Makefile | 1 + >> arch/arm64/boot/dts/rockchip/rk3328-evb.dts | 57 +++++++++++++++++++++++++++++ >> 2 files changed, 58 insertions(+) >> create mode 100644 arch/


Attachments: badmail (12K), timely (1K)

Re: scan - Funky output with utf-8

David Levine-3
Valdis wrote:

> It appears to be in the same class of faults as another bug I
> reported a while back with handling of null bytes in mhfixmsg -
> the failure symptoms appear to be dependent on the amount of
> input.

Just to note:  that problem was due to use of strlen(3) in the MIME
parser, which of course didn't do what we want with null bytes.  At
least I think it was, it looks like we never confirmed that it fixed
your particular problem.  I had recently run into what looked like
the same behavior and it fixed it for me.

scan doesn't use the MIME parser.

David

Re: scan - Funky output with utf-8

Ken Hornstein-2
>scan doesn't use the MIME parser.

Right; I expect things are getting confused in fmt_scan().  Specifically,
I bet cpstripped() is messing up; it's supposed to stop when it hits the
final character, but maybe when it's hitting that last multibyte character
it is getting confused.
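
A minimal sketch of what a width-limited copy is supposed to do at a multibyte boundary (illustrative Python, not the actual cpstripped() code; the function names are hypothetical, and it assumes one terminal column per character, which real code would have to look up). It copies whole UTF-8 sequences only, and stops either when the column budget is spent or when the input ends in the middle of a sequence:

```python
def utf8_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # stray continuation or invalid byte: treat as a single unit


def copy_width(src: bytes, width: int) -> bytes:
    """Copy whole characters from src until width columns are used.

    Assumes one column per character, which is a simplification.
    """
    out = bytearray()
    i = 0
    while i < len(src) and width > 0:   # stop once the column budget is spent
        n = utf8_len(src[i])
        if i + n > len(src):            # truncated sequence at end of input
            break
        out += src[i:i + n]             # never copy part of a character
        i += n
        width -= 1
    return bytes(out)
```

The point of the sketch is that the limit is checked before each character is copied and is counted in characters rather than bytes, so the last character is either copied in full or not at all.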

--Ken

Re: scan - Funky output with utf-8

Valdis Klētnieks
In reply to this post by David Levine-3
On Tue, 21 Mar 2017 20:26:15 -0400, David Levine said:

> Valdis wrote:
>
> > It appears to be in the same class of faults as another bug I
> > reported a while back with handling of null bytes in mhfixmsg -
> > the failure symptoms appear to be dependent on the amount of
> > input.
>
> Just to note:  that problem was due to use of strlen(3) in the MIME
> parser, which of course didn't do what we want with null bytes.  At
> least I think it was, it looks like we never confirmed that it fixed
> your particular problem.  I had recently run into what looked like
> the same behavior and it fixed it for me.

I meant that the exact failure mode depends on how much data follows
the null byte or split multi-byte character, because after that it's
off to the races looking for the *next* plausible stopping point (and
where things finally end is often *not* at the next null char or whatever).

Ralph wasn't able to replicate it with a 7-byte string; when I feed it
the entire problematic mail it spews about 400 extra bytes; if I give it
the headers and just the first line of the body it doesn't go anywhere near
as far, etc...

Attachment: attachment0 (495 bytes)

Re: scan - Funky output with utf-8

Ralph Corderoy
Hi Valdis,

> Ralph wasn't able to replicate it with a 7-byte string

With your provided email and format file I can re-create the `-width
137' spew with git, but not with 1.6.  That suggests a bisect might
help;  I may have time tomorrow.

--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

Re: scan - Funky output with utf-8

Ralph Corderoy
Hi,

> I can re-create the `-width 137' spew with git, but not with 1.6.
> That suggests a bisect might help

So git-bisect(1) blames

    commit e537b780f7aea8df01ddca1976f8c128d9c1fb55
    Date:   Fri Aug 29 08:50:51 2014 -0500

        fmt_scan() no longer subtracts 1 from the width.  This has the effect
        of no longer counting the trailing newline in the output of scan(1),
        inc(1), and the other programs that rely on it.

But I suspect it's wrong and that just pushes the 137 to trigger the
error off by one.  :-)  I'll have another go testing a run of widths
each time and saying it's bad if any fail.
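
A minimal sketch of that "run of widths" check (illustrative Python; the helper names are hypothetical, the scan command line is the one Valdis posted, and it assumes a good build never prints a line with more characters than the requested -width):

```python
import subprocess
from typing import Callable, Iterable


def run_scan(width: int) -> str:
    """Run scan at the given width on the posted form file and mail."""
    result = subprocess.run(
        ["scan", "-width", str(width),
         "-form", "/tmp/timely", "-file", "/tmp/badmail"],
        capture_output=True, check=True,
    )
    # A split character decodes to U+FFFD instead of raising an error.
    return result.stdout.decode("utf-8", errors="replace")


def any_width_overruns(widths: Iterable[int],
                       scan: Callable[[int], str] = run_scan) -> bool:
    """True if any output line has more characters than the requested width."""
    for w in widths:
        for line in scan(w).splitlines():
            if len(line) > w:
                return True
    return False
```

Used as a `git bisect run` script, it would exit with status 1 when `any_width_overruns(...)` is true for a chosen range of widths around 137 and with status 0 otherwise, so a commit that only shifts the failing width by one still counts as bad.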

--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

Re: scan - Funky output with utf-8

David Levine-3
Ralph wrote:

> > [Ralph:]
> > I can re-create the `-width 137' spew with git, but not with 1.6.
> > That suggests a bisect might help
>
> So git-bisect(1) blames
>
>     commit e537b780f7aea8df01ddca1976f8c128d9c1fb55

> But I suspect it's wrong and that just pushes the 137 to trigger the
> error off by one.

And as Ken suspected, the problem was in cpstripped().  I just committed
a fix.

David
