PDF text copy and AGLFN

Steve White-12
Hi,

To investigate whether it is worthwhile to implement the AGLFN, I ran
several test cases, using different means of producing PDF files and
different fonts.

In the zip attachment are some PDF files, generated by three different
methods.  The exercise is to open these, say with Adobe Reader, copy
the text and paste it into a text processor.

The result should look like the input, which is in each case:

सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और
समानता प्राप्त है ।
उन्हें बुद्धि और अ तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे
के भाव से बर्ताव करना चाहिए ।

But it never does.  Each method fails in its own way.
It's impressive how different the three technologies I found on Linux are.

The test text is Hindi.  You may think that's weird.
I don't.  Sanskrit might be stretching it.

My conclusion is: the existing technology for extracting text from PDF
is limited.  It might be possible for a PDF generator to wrangle it in
ways it wasn't designed for so as to produce correct output, but
nothing I have seen succeeds with this simple example.

The AGLFN only imposes further restrictions on this already-limited
technology.  In particular, trying to cram the mapping into a
31-character OpenType glyph name would be painful.  The best of the
technologies constructs the ToUnicode structure directly from the
OpenType tables.  The worst relies on AGLFN.
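
To illustrate what building the mapping from the font tables can mean,
here is a minimal sketch.  It is not taken from any of the tools
tested.  The CMAP and LIGATURES dictionaries are hand-written stand-ins
for data that would be read from a font's cmap table and its GSUB
ligature substitutions, and the glyph names in them are invented.

```python
# Hypothetical stand-ins for font data (not read from a real font):
#   CMAP:      Unicode code point -> glyph name, as in a cmap table
#   LIGATURES: component glyph sequence -> ligature glyph, as in GSUB
CMAP = {0x0915: "ka", 0x094D: "virama", 0x0937: "ssa"}
LIGATURES = {("ka", "virama", "ssa"): "kssa"}


def build_to_unicode(cmap, ligatures):
    """Return a dict mapping each glyph name to the text it stands for."""
    to_unicode = {}
    # Glyphs reachable directly from the cmap stand for one character.
    for code_point, glyph in cmap.items():
        to_unicode.setdefault(glyph, chr(code_point))
    # A ligature stands for the concatenated text of its components.
    # Repeat until stable so ligatures of ligatures resolve as well.
    changed = True
    while changed:
        changed = False
        for components, ligature in ligatures.items():
            if ligature in to_unicode:
                continue
            if all(c in to_unicode for c in components):
                to_unicode[ligature] = "".join(to_unicode[c] for c in components)
                changed = True
    return to_unicode
```

With this approach the ligature glyph's name carries no information, so
its length does not matter: build_to_unicode(CMAP, LIGATURES)["kssa"]
is the three-character string U+0915 U+094D U+0937.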

A summary of the results from the tests follows.

Firefox/CUPS
=============
Use the generic Print to File printer.

This ignores all substitution information in all fonts, both AGLFN and
OpenType features.  Looking into the PDF file, there is a ToUnicode
table, but it's always the same table: it's boilerplate.

Result is unreadable.
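
For anyone who wants to look at such a table, the sketch below reads
the bfchar entries of a ToUnicode CMap into a dictionary.  It assumes
the CMap stream has already been extracted from the PDF and
decompressed, and it ignores bfrange entries.  SAMPLE_CMAP is an
invented excerpt, not taken from the test files.

```python
import re

# Hypothetical excerpt of a ToUnicode CMap (invented for illustration).
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0152> <0915094D0937>
endbfchar
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]*)>")


def parse_bfchar(cmap_text):
    """Map each glyph code to the text given by its bfchar entry."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in PAIR.findall(block):
            # The destination is a hex string of UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping
```

The second sample entry shows that one glyph code may map to several
characters, which is what a ligature needs.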

LOWriter 4.0
============
These people are on the right track.  Still a long way to go.

All glyphs converted to Unicode, glyph names mostly ignored.
Amazingly, it seems to try to re-order the characters, and often gets it right!

There are a couple of clear, font-independent bugs in the conversion:
some glyph pairs get duplicated, and at least one character is
inserted, apparently out of thin air.
In dealing with marks (reph and anusvara), I have evidence that it is
falling back to AGLFN.

Once these errors are corrected in the pasted text (just by deleting
letters), the text is nearly readable.
I wrote them a bug report.

XeLaTeX 3.1415926-2.4-0.9998
============================
Clearly does use glyph names, but only part of the AGLFN is supported:

Names of ligatures must be in the format
        uXXXX_uXXXX
as used in Lohit.  But
        glyphname1_glyphname2
as used in Gargi, fails.
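
For reference, here is a minimal sketch of how a glyph name is turned
into text under the Adobe glyph naming convention, as I understand it:
drop any suffix after the first period, split on underscores, then map
each component through the glyph list or the uniXXXX / uXXXX forms.
SAMPLE_AGL is a tiny hypothetical stand-in for the real glyph list, and
this is not the code XeLaTeX uses.

```python
import re

# Hypothetical stand-in for the Adobe Glyph List; a real implementation
# would load the full published list instead of these few entries.
SAMPLE_AGL = {"A": "A", "f": "f", "i": "i", "space": " "}

UNI_RE = re.compile(r"uni((?:[0-9A-F]{4})+)")   # uniXXXX, uniXXXXYYYY, ...
U_RE = re.compile(r"u([0-9A-F]{4,6})")          # uXXXX up to uXXXXXX


def _valid(code_point):
    """True for Unicode scalar values (surrogates are excluded)."""
    return code_point <= 0x10FFFF and not 0xD800 <= code_point <= 0xDFFF


def component_to_text(component, agl=SAMPLE_AGL):
    """Map one underscore-separated component to text ('' if unknown)."""
    if component in agl:
        return agl[component]
    m = UNI_RE.fullmatch(component)
    if m:
        digits = m.group(1)
        points = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, points)) if all(map(_valid, points)) else ""
    m = U_RE.fullmatch(component)
    if m:
        code_point = int(m.group(1), 16)
        return chr(code_point) if _valid(code_point) else ""
    return ""


def glyph_name_to_text(name, agl=SAMPLE_AGL):
    """Map a whole glyph name, such as a ligature name, to text."""
    base = name.split(".", 1)[0]          # drop a suffix such as ".liga"
    return "".join(component_to_text(c, agl) for c in base.split("_"))
```

Under this sketch both styles of ligature name resolve:
glyph_name_to_text("u0915_u094D_u0937") gives U+0915 U+094D U+0937, and
glyph_name_to_text("f_i.liga") gives "fi", provided the component names
are in the glyph list.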

Each of the fonts that attempts the AGLFN has mistakes that show up badly.
Apart from the conversion of ligatures, the conversion is catastrophic:
such a mess that I can't make out what's going on.

This is the worst of the three methods.

Attachment: text-copy-tests.zip (391K)