SetCurrentEncoder for OpenDocument
Posted: Sun Sep 01, 2019 8:18 pm
I would like to manually select the decoding type for an opened document, but SetCurrentEncoder works only for created documents, and I can't find anything like SetCurrentDecoder.
This is the reason why I need that:
PageStream on Amiga knows nothing about UTF-8, so we used to replace some Latin-1 characters in the font files to get our local characters.
These glyphs do not sit at the same character codes as in Latin-2 or ISO-8859-16, but this works well on our non-Unicode system (with some limitations, of course, as õ can be read as ő).
As PageStream is the de facto software for creating printable or PDF documents, this is how the characters end up in my PDFs. Printing or saving as PDF is not a problem, because PageStream embeds the glyphs from the hacked font files.
I have now written a script that extracts text from PDFs, and here comes the problem.
pdf.GetText() returns question marks for both ő and ű, so I have no chance to handle them manually.
If I change the system locale and input to Latin-1, it works and I get different characters, but I cannot make that change from inside my script.
If I could force the decoder used by GetText(), that would help me read those bastards.
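To show what I would do once the text comes back as Latin-1 instead of question marks, here is a rough sketch in Python (not my actual script). The name fix_hacked_text and the mapping table are made up for illustration: only the õ/ő pair is the one I mentioned above, the other pairs are assumptions and the real table depends on how each font file was hacked.

```python
# Hypothetical table: Latin-1 character that is returned -> character the
# hacked font actually draws. Only the first pair is taken from the post;
# the remaining pairs are assumed for illustration.
HACKED_FONT_MAP = str.maketrans({
    "õ": "ő",
    "Õ": "Ő",
    "û": "ű",
    "Û": "Ű",
})


def fix_hacked_text(raw: bytes) -> str:
    """Decode raw bytes as Latin-1, then swap in the intended characters."""
    # Latin-1 maps every byte value to a character, so decoding never fails
    # and never produces a question mark.
    text = raw.decode("latin-1")
    return text.translate(HACKED_FONT_MAP)


if __name__ == "__main__":
    # 0xF5 is õ and 0xFB is û in Latin-1; both get remapped by the table.
    print(fix_hacked_text(b"erd\xf5 \xe9s f\xfb"))
```

This only helps if the extraction step hands over the original byte values. Once they have been turned into question marks, the information is gone, which is why I need control over the decoder.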
