unicode support strategy


unicode support strategy

duplicity-talk mailing list
Hi,
going through the code, I see you've decided not to perform byte/unicode
conversion at I/O boundaries, but rather to work with byte filenames and
convert to unicode when needed. Is the absence of the surrogateescape codec
error handler in py2 the sole reason for this?

I believe there is a way to implement fsencode/fsdecode with
surrogateescape behavior even in py2; see the attachment.
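
For readers unfamiliar with the handler, here is a minimal sketch of what surrogateescape does, written for Python 3, where it is built in. It is not the attached sedemo.py, and the sample filename is made up for illustration.

```python
# Python 3 only: surrogateescape is a built-in error handler there (PEP 383).
raw = b"caf\xe9.txt"  # hypothetical Latin-1 filename; not valid UTF-8

# A strict decode rejects the stray byte ...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... while surrogateescape maps it to a lone surrogate (0xE9 -> U+DCE9).
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
```

The point of a py2 backport would be to reproduce this lossless round trip where the handler is not available natively.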

I believe it is now possible to convert from bytes as early as possible
and work internally with unicode. In my opinion, this leads to a tight and
elegant patch that will work well in both py2 and py3. I have even
stitched something together along those lines already. I'd love to complete it,
provided you find this approach acceptable.

Have a great day,
Radim.

_______________________________________________
Duplicity-talk mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/duplicity-talk

Attachment: sedemo.py (1K)

Re: unicode support strategy

duplicity-talk mailing list
Hello Radim,

Thank you for your help with Duplicity and your comments. This is a good question.

Please see the comments in duplicity/util.py for a discussion of unicode <--> bytes conversion in duplicity.

On Sep 30 2018, at 7:11 pm, Radim Tobolka via Duplicity-talk <[hidden email]> wrote:

> Hi,
> going through the code, I see you've decided not to perform byte/unicode
> conversion on I/O boundaries, but rather work with byte filenames and
> convert to unicode when needed. Is absence of surrogateescape codec
> error handler in py2 sole reason for this?


While duplicity does not solely use unicode paths internally, the intention is definitely to decode/encode all bytes to/from unicode at I/O boundaries. This is a relatively recent effort, though, and has not been completed. Nearly all of the code used to assume bytes, so you may come across some internal conversions to keep everything working while we "unicodeify" one part at a time -- please feel free to work on these!

As set out in the comments referred to above (and as you have alluded to yourself), Python 2 does not offer a way to losslessly translate *nix bytes paths (e.g. Linux filenames) to unicode and back again (cf. os.fsencode/os.fsdecode in Python 3). There are backports and workarounds, but they are not bulletproof. Note that I am talking about paths only; everything else (files etc.) should absolutely be pulled in as unicode at the I/O boundaries.

The approach for filenames in duplicity is therefore to do the decoding at I/O boundaries for internal use, but to keep a copy of the bytes version to use when interacting with that file. When a path is read/created, the bytes version of the filename is therefore stored in path.name (which also means all the old code that assumes bytes keeps using that and keeps working). This is then decoded to unicode (using util.fsdecode, which itself uses os.fsdecode on Python 3.2+) for all internal uses/filename matching (path.uc_name).
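
The two-attribute idea described above can be sketched roughly as follows. PathSketch and its matches method are hypothetical names for illustration, not duplicity's actual path class, and os.fsdecode stands in for util.fsdecode (so this sketch assumes Python 3).

```python
import os


class PathSketch(object):
    """Hypothetical stand-in for duplicity's path object (not the real class)."""

    def __init__(self, name):
        # name: the bytes exactly as received from the filesystem; this is
        # what gets handed back to the OS when the file is opened or stat'ed.
        self.name = name
        # uc_name: the unicode form, used for matching and display only.
        self.uc_name = os.fsdecode(name)

    def matches(self, fragment):
        # Internal logic compares unicode with unicode.
        return fragment in self.uc_name


p = PathSketch(b"backup/r\xc3\xa9sum\xc3\xa9.txt")
```

Keeping the original bytes alongside the decoded form means that a lossy decode on Python 2 can affect matching and display, but never which file is actually touched.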

I have taken this approach for selection.py and globmatch.py, and am fixing up other files as I get to them as part of the Python 3 prep, but it would be great to have it consistent across the codebase. Wherever possible:
  • things should be using .uc_name instead of .name;
  • strings should be unicode wherever possible; and
  • if, for some reason, things need to be converted, please use util.fsencode/fsdecode, as then we can transparently upgrade to using built-in os.fsencode/fsdecode as we move to Python 3.
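
A wrapper of the kind the last point describes might look roughly like the sketch below. It is an illustration, not the code in duplicity/util.py; in particular, the fallback branch with the "replace" error handler is an assumption about what a Python 2 path could do.

```python
import os
import sys


def fsdecode(bytes_filename):
    """Bytes -> unicode filename (hypothetical sketch, not duplicity's code)."""
    if hasattr(os, "fsdecode"):
        # Python 3.2+: defer to the built-in, which is lossless on POSIX.
        return os.fsdecode(bytes_filename)
    # Assumed Python 2 fallback: lossy, undecodable bytes are replaced.
    encoding = sys.getfilesystemencoding() or "utf-8"
    return bytes_filename.decode(encoding, "replace")


def fsencode(unicode_filename):
    """Unicode -> bytes filename (hypothetical sketch, not duplicity's code)."""
    if hasattr(os, "fsencode"):
        return os.fsencode(unicode_filename)
    # Assumed Python 2 fallback, mirroring fsdecode above.
    encoding = sys.getfilesystemencoding() or "utf-8"
    return unicode_filename.encode(encoding, "replace")
```

Callers that go through one pair of functions like this do not need to change when the fallback branch is eventually dropped.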
Any questions, please ask.

Kind regards,

Aaron


Re: unicode support strategy

duplicity-talk mailing list

Hi Aaron,
sorry for the late reply, I needed time to research it.
I'd like to submit two tests for unicode filenames that fail with the current codebase (see attached test_uni.patch). As for the fix, that will depend on how the discussion on fsencode/fsdecode turns out.

When did you last check the mentioned backport? Which input made it fail? I've been poking at it for some time now, and so far it seems to produce correct results. I've summarized my attempts in a testcase (see attached test_os_backport.py). I had to include the backport and modify it slightly to allow testing with different encodings; the actual tests are at the end of the file, after the "import pytest" statement.

More importantly, the backport successfully performs a full conversion cycle on Markus Kuhn's UTF-8 decoder capability and stress test from https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
This file contains numerous invalid UTF-8 sequences; if the backport copes with this exhaustive list, I'd say it's pretty stable.
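
A round-trip check of this kind can be sketched as below. This is not the attached test_os_backport.py: it uses Python 3's built-in surrogateescape handler rather than the backport, and the byte sequences are chosen for illustration, not copied from the linked file.

```python
def failed_roundtrips(samples, encoding="utf-8"):
    """Return the samples that do NOT survive decode -> encode unchanged."""
    failures = []
    for raw in samples:
        text = raw.decode(encoding, "surrogateescape")
        if text.encode(encoding, "surrogateescape") != raw:
            failures.append(raw)
    return failures


# A few malformed UTF-8 sequences of the kind such stress tests contain.
SAMPLES = [
    b"\x80",                  # lone continuation byte
    b"\xc0\xaf",              # overlong encoding of "/"
    b"\xed\xa0\x80",          # UTF-8-encoded surrogate
    b"\xf8\x88\x80\x80\x80",  # 5-byte sequence
    b"\xff\xfe",              # bytes that never occur in valid UTF-8
    b"ok-\xe2\x82",           # truncated 3-byte sequence
]

failures = failed_roundtrips(SAMPLES)
```

Running the same function against a backported handler, with each line of the stress-test file as a sample, would give a direct comparison with the native behavior.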

Even if there is some input that makes it fail, I think that can be helped. Maybe the entire encoding error handler logic will have to be bypassed and handled purely in Python. Still, I think it's worth the effort, as opposed to hunting down all the adorned strings that force implicit ascii decoding of byte filenames, which is bound to fail on any non-ASCII byte.

There is another concern. What will happen in Python 3 when a b""-adorned string is combined with what is, this time, a unicode filename? How do you intend to deal with that?
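
To make the concern concrete, here is a small sketch of how Python 3 treats such a combination. The names and the filename are made up, and UTF-8 is hard-coded for the sake of a deterministic example; os.fsencode/os.fsdecode would pick the encoding from the locale instead.

```python
uc_name = "r\u00e9sum\u00e9.txt"  # unicode filename (hypothetical)
prefix = b"backup/"               # a b"" adorned string

# Python 3 never coerces between bytes and str implicitly, so this raises
# TypeError instead of attempting an ascii decode as Python 2 would.
try:
    joined = prefix + uc_name
except TypeError:
    joined = None

# The conversion has to be spelled out, in one direction or the other.
as_text = prefix.decode("utf-8", "surrogateescape") + uc_name
as_bytes = prefix + uc_name.encode("utf-8", "surrogateescape")
```

So in Python 3 the mix fails immediately and regardless of content, whereas the Python 2 implicit decode fails only when a non-ASCII byte happens to be present.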

I will stop my train of thought here - looking forward to your comments.

Best regards,
Radim




Attachments: test_uni.patch (3K), test_os_backport.py (10K)