SetCurrentEncoder for OpenDocument

Discuss PDF file handling with the Polybios plugin here
lazi
Posts: 625
Joined: Thu Feb 24, 2011 11:08 pm

SetCurrentEncoder for OpenDocument

Post by lazi »

I would like to manually select the decoding type for an opened document, but SetCurrentEncoder() only works for created documents, and I can't find anything like a SetCurrentDecoder().

This is the reason why I need that:
PageStream on Amiga knows nothing about UTF-8, so we used to change some Latin-1 character glyphs in the font files to get our local characters.
These glyphs don't have the same character codes as in Latin-2 or ISO-8859-16, but they work well on our non-Unicode system (with some limitations of course, as õ can be read as ő).
As PageStream is the de facto software for creating printable or PDF documents, this is how those characters end up in my PDF creations. Printing or saving as PDF is not a problem, because PageStream embeds the glyphs from the hacked font files.
Now I wrote a script that extracts text from PDFs, and here comes the problem.
pdf.GetText() gives back question marks for both ő and ű, so I have no chance to handle them manually.
If I change the system locale and input to Latin-1, it works and I get different characters, but I cannot make that change from inside my script.
If I could force the decoder used by GetText(), that would help me read those bastards.
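[Editor's note: a minimal Python sketch, not Polybios or Hollywood API, illustrating what such a decoder switch amounts to. The PDF stores single-byte character codes; only the codepage chosen by the decoder determines which accented letter each byte becomes.]

```python
# The two problematic byte codes from the hacked PageStream fonts.
raw = bytes([0xF5, 0xFB])

# Latin-1 reading: the codes come out as the "wrong" Western glyphs.
print(raw.decode("iso-8859-1"))  # õû

# Latin-2 reading of the very same bytes: the desired Hungarian letters.
print(raw.decode("iso-8859-2"))  # őű
```

This is exactly the ambiguity in the thread: byte 0xF5 is õ in ISO-8859-1 but ő in ISO-8859-2, and 0xFB is û versus ű.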
airsoftsoftwair
Posts: 5425
Joined: Fri Feb 12, 2010 2:33 pm
Location: Germany
Contact:

Re: SetCurrentEncoder for OpenDocument

Post by airsoftsoftwair »

So it renders correctly and only GetText() returns wrong characters? Can you provide a test case (PDF & script)?
lazi

Re: SetCurrentEncoder for OpenDocument

Post by lazi »

Ok, forget about the question marks. That was the MUI Royale display output.
So yes, it renders correctly and GetText() even returns the correct character codes, but because the PageStream fonts are tweaked, some characters are rendered differently from what GetText() returns.
If I could change the decoder of GetText() from ISO-8859-1 to ISO-8859-2, the result would match what is rendered (with the hacked fonts).
Using unlocalised fonts was quite common in this country, and changing the two problematic character glyphs in the fonts was a popular solution.
Anyway, how does Polybios decide the character map of a PDF? The PDF generated by PageStream does not seem to contain any codepage information.

I made a simple test script and PDF file. You can download it from here: https://kezdobetu.hu/temp/Polytest.lha.
Here is how it looks:
[screenshot]

And the script:

Code:

@REQUIRE "polybios.hwp"
@OPTIONS {Encoding = #ENCODING_UTF8}

pdf.opendocument(1,"ps.pdf")
pdf.loadpage(1,1,True)
pdf.getbrushfrompage(1,1,1)

DisplayBrush(1,10,270,{width=4*90,height=4*50})

SetFont("DejaVu Sans",24)
NPrint("The desired unicode text:")
NPrint("Árvíztűrő tükörfúrógép\n")

NPrint("The text by codes are really in the pdf:")
NPrint("Árvíztûrõ tükörfúrógép\n")

NPrint("GetText() result:")
NPrint(pdf.gettext(1,1,0,-1))
NPrint("\nAnd the rendered pdf:")

waitleftmouse
lazi

Re: SetCurrentEncoder for OpenDocument

Post by lazi »

And just a bit of background info:
The two words "árvíztűrő tükörfúrógép" contain all of the Hungarian accented letters and translate to "floodproof mirrordriller". :)
Clyde
Posts: 348
Joined: Sun Feb 14, 2010 12:38 pm
Location: Dresden / Germany

Re: SetCurrentEncoder for OpenDocument

Post by Clyde »

lazi wrote: Tue Sep 03, 2019 10:14 pm And just a bit of background info:
The two words "árvíztűrő tükörfúrógép" contain all of the Hungarian accented letters and translate to "floodproof mirrordriller". :)
:lol: :lol: :lol:
Currently using: Hollywood 9 with Windows IDE and Hollywood 9 with Visual Studio Code and hw4vsc
airsoftsoftwair

Re: SetCurrentEncoder for OpenDocument

Post by airsoftsoftwair »

lazi wrote: Tue Sep 03, 2019 10:11 pm Anyway, how does Polybios decide the character map of a PDF? The PDF generated by PageStream does not seem to contain any codepage information.
This is probably unfixable because the PDF doesn't contain any codepage information. The reason it renders correctly is probably just that the font is embedded in the PDF.

I've just opened your PDF with Adobe Acrobat on Windows, marked all text, and pasted it into Microsoft Word. Guess what I got:

Code:

árvíztûrõ tükörfúrógép
ÁRVÍZTÛRÕ TÜKÖRFÚRÓGÉP
So it's the same problem as in Polybios, and I think this is unfixable because Polybios cannot know that those characters are ISO-8859-2...
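[Editor's note: since the decoder itself can't be switched, one workaround is to post-process the extracted text, reinterpreting the Latin-1-decoded string as Latin-2. A hedged Python sketch of that idea; `fix_codepage` is a hypothetical helper, not part of Polybios, and it only works when every character of the extracted text exists in Latin-1.]

```python
def fix_codepage(text: str) -> str:
    """Reinterpret a Latin-1-decoded string as Latin-2.

    Round-trips each character back to its original single byte via
    Latin-1, then decodes those bytes with the Latin-2 codepage.
    Raises UnicodeEncodeError if the text contains non-Latin-1 characters.
    """
    return text.encode("iso-8859-1").decode("iso-8859-2")

# The string GetText() effectively returns versus the intended Hungarian text:
print(fix_codepage("Árvíztûrõ tükörfúrógép"))  # Árvíztűrő tükörfúrógép
```

In a Hollywood script the same effect could be achieved by remapping the handful of affected characters in the string returned by pdf.GetText().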