Python 3 migration: considering non-UTF-8 conform filenames

EricZolf
Hi,

while working on the Python 3 migration, one of the "fanciest" aspects
was the change of string types from str/unicode to bytes/str.

Without going into the technical details (Python-savvy people will know
what I mean), it means among other things that the encoding of file names
becomes relevant and must be UTF-8. Files whose names aren't valid UTF-8
aren't backed up.

The warnings look something like:

Sat Aug  3 10:51:51 2019  Warning: unable to read ACL from 'very
complicated filename': 'utf-8' codec can't encode character '\udcb1' in
position 54: surrogates not allowed
Sat Aug  3 10:51:51 2019  Warning: ignoring file 'very complicated
filename' with wrong encoding: 'utf-8' codec can't encode character
'\udcb1' in position 54: surrogates not allowed
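
The '\udcb1' in these warnings is most likely a byte that could not be
decoded as UTF-8. On POSIX systems Python 3 decodes file names with the
"surrogateescape" error handler, which maps such a byte to a lone
surrogate. A minimal sketch of that mechanism (the file name below is
made up for illustration, it is not the one from the log):

```python
# A file name containing the lone byte 0xB1, which is not valid UTF-8.
# (Hypothetical name, chosen only to reproduce the '\udcb1' character.)
raw_name = b"report-\xb1.txt"

# Decoding with "surrogateescape" turns the undecodable byte 0xB1
# into the lone surrogate U+DCB1 instead of raising an error.
as_str = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(as_str))

# A strict UTF-8 encode of that str fails with "surrogates not allowed",
# which is the same codec error as in the warnings above.
try:
    as_str.encode("utf-8")
except UnicodeEncodeError as exc:
    error_text = str(exc)
    print(error_text)

# Encoding with the same error handler restores the original bytes.
round_trip = as_str.encode("utf-8", errors="surrogateescape")
print(round_trip == raw_name)
```

So the original bytes are not lost in the str, they just cannot pass
through a strict UTF-8 encode.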

I don't see many options because, as far as I can see, only str (i.e.
encoding-aware) can be matched against regex, bytes can't (filenames
could still be read as bytes).

A few consequences:

1. Such files can't be backed up anymore.
2. Old backup repos which contain such files are seen as broken. As long
as only the increments contain such files and the latest version
doesn't, the repo remains usable, though.

That said, file systems with non-UTF-8 file names have been uncommon for
many years, so the impact should be very limited (in my case, old
Windows files lying around since 2010).

I'm mostly concerned about Asia, because I've heard (but have no
experience whatsoever) that rich encodings other than Unicode might be
in use there. The original code was IMHO already not very clean in this
regard, and the migration to UTF-8 hasn't improved things: strings are
sometimes encoded/decoded explicitly as UTF-8, sometimes without an
explicit encoding.

If the users on this list could comment on their experience and
expectations, that would be great. Testing my PR [1] against old backup
repos would be even better.

Don't expect miracles, though: currently I don't see any viable
alternative to the decision I've taken. I mostly wanted to make sure
it's taken transparently.

Thanks, Eric

[1] https://github.com/sol1/rdiff-backup/pull/40

_______________________________________________
rdiff-backup-users mailing list at [hidden email]
https://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki

Re: Python 3 migration: considering non-UTF-8 conform filenames

Patrik Dufresne
Hello Éric, I'm very concerned about this. I did not review all your
changes, and did not notice this. I back up a lot of various systems,
and their encodings are not all UTF-8. And invalid UTF-8 happens quite
often.

The way to work around this, in rdiffweb at least, is to handle paths as
bytes. That is how rdiffweb 1.2.8 works: paths are bytes. That is also
how most filesystems work. Paths are bytes, and they are decoded only
to be displayed to the user.
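
The "bytes inside, decode only for display" approach can be sketched
roughly as follows. This is not rdiffweb's actual code; display_name and
stored_path are hypothetical names, and the choice of "backslashreplace"
as error handler is an assumption:

```python
def display_name(raw_path: bytes) -> str:
    """Return a printable form of a bytes path, for display only."""
    # "backslashreplace" never raises: a byte that is not valid UTF-8
    # is shown as an escape such as \xe9 instead of aborting the decode.
    return raw_path.decode("utf-8", errors="backslashreplace")


# The path stays bytes everywhere else (storage, comparison, I/O).
stored_path = b"docs/caf\xe9.txt"  # Latin-1 encoded name, invalid as UTF-8

print(display_name(stored_path))
print(display_name("docs/café.txt".encode("utf-8")))
```

The decoded string is only meant for the screen. It should not be
encoded back to reach the file, the stored bytes remain the reference.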

Not supporting non-UTF-8 names is a deal breaker for me. What would I
say to my non-technical users? "Hum, sorry, you must rename your file to
get it backed up..."

Not sure


Re: Python 3 migration: considering non-UTF-8 conform filenames

Robert Nichols-2
On 8/3/19 6:49 AM, Patrik Dufresne wrote:
> Hello Éric, I'm very concerned about this. I did not review all your
> changes, and did not notice this. I back up a lot of various systems,
> and their encodings are not all UTF-8. And invalid UTF-8 happens quite
> often.
>
> The way to work around this, in rdiffweb at least, is to handle paths as
> bytes. That is how rdiffweb 1.2.8 works: paths are bytes. That is also
> how most filesystems work. Paths are bytes, and they are decoded only
> to be displayed to the user.

Indeed. Paths are bytes, and there is no requirement that they be printable
strings. The forward slash is the separator, and the individual path
components can contain any byte except 0x00 (ASCII NUL). rdiff-backup should
accept that.
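
As a rough illustration of that rule, here is a small sketch that splits
a POSIX-style bytes path on the forward slash and rejects only the NUL
byte. The helper split_components and the sample path are hypothetical,
they are not taken from rdiff-backup:

```python
def split_components(raw_path: bytes) -> list:
    """Split a POSIX-style bytes path into its non-empty components."""
    # NUL is the only byte that can never appear inside a path.
    if b"\x00" in raw_path:
        raise ValueError("a path cannot contain the NUL byte")
    # b"/" is the separator, so it cannot be part of a component either.
    return [part for part in raw_path.split(b"/") if part]


# Newlines, control bytes and invalid UTF-8 are all legal in a component.
odd_path = b"/backup/we\nird\x07/\xb1\xff.dat"
components = split_components(odd_path)
print(components)
```

Nothing here needs to know the encoding of the names; the split works on
raw bytes only.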

--
Bob Nichols     "NOSPAM" is really part of my email address.
                 Do NOT delete it.


Re: Python 3 migration: considering non-UTF-8 conform filenames

EricZolf
Hi,

On 03/08/2019 14:50, Robert Nichols wrote:

>> The way to work around this, in rdiffweb at least, is to handle paths as
>> bytes. That is how rdiffweb 1.2.8 works: paths are bytes. That is also
>> how most filesystems work. Paths are bytes, and they are decoded only
>> to be displayed to the user.
>
> Indeed. Paths are bytes, and there is no requirement that they be printable
> strings. The forward slash is the separator, and the individual path
> components can contain any byte except 0x00 (ASCII NUL). rdiff-backup
> should accept that.

I never disputed that; I was just highlighting the difficulties.

OK, I've created a sub-branch ericzolf-py2to3-bytes and will work on
this aspect, moving away from str for paths to bytes.
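
For reference, Python's re module also works on bytes, provided the
pattern and the subject are both bytes, so regex-based selection does
not by itself force paths to be str. A minimal sketch, assuming a simple
exclusion rule (exclude_tmp and the paths list are made up, this is not
rdiff-backup's selection code):

```python
import re

# A bytes pattern (rb"...") can only be applied to bytes subjects;
# applying it to a str raises TypeError.
# re.DOTALL lets "." match a newline byte too, since file names may
# contain one, and \Z anchors at the very end of the path.
exclude_tmp = re.compile(rb".*\.tmp\Z", re.DOTALL)

paths = [
    b"home/user/notes.txt",
    b"home/user/\xb1cache.tmp",  # not valid UTF-8, still matchable
    b"var/log/app.log",
]

# Keep every path that the exclusion pattern does not match.
selected = [p for p in paths if not exclude_tmp.match(p)]
print(selected)
```

In a bytes pattern "." matches a single byte, not a character, which is
fine for rules based on ASCII parts such as extensions or separators.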

KR, Eric

Re: Python 3 migration: considering non-UTF-8 conform filenames

EricZolf
Hi,

On 04/08/2019 08:59, Eric L. wrote:
> OK, I've created a sub-branch ericzolf-py2to3-bytes and will work on
> this aspect, moving away from str for paths to bytes.

If someone is interested, there is a working version at

https://github.com/ericzolf/rdiff-backup/tree/ericzolf-py2to3-bytes

Backups work, incremental ones too; I'm not sure about restoring, and
none of the tests are working. The repository looks good: rdiff-backup
1.2.8 accepts it without moaning.

So, all in all, we're back on track and Unicode isn't a prerequisite
anymore. Time to have dinner and go to bed...

You might want to decide whether you prefer to review the sub-branch
before I merge it into the main branch ericzolf-py2to3, or whether it's
preferable to merge it first and review everything together. Both are
possible; I don't really care.

KR, Eric
