pdf.GetText fails to get text with some PDFs

Discuss PDF file handling with the Polybios plugin here
jPV
Posts: 600
Joined: Sat Mar 26, 2016 10:44 am
Location: RNO

Post by jPV »

I found a PDF that has text on its pages, but pdf.GetText returns nothing when I try to extract all the text from a page. If I skip the last two characters and request the rest of the page, it works.

Here is a test file, and here is some example code:

Code:

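@REQUIRE "polybios"
f$="c-kasetti.pdf"
page=1
; Open the document and load the page before querying its text
pdf.OpenDocument(1, f$)
pdf.LoadPage(1, page, True)
; Number of characters reported for the page
len = pdf.GetPageLen(1, page)
DebugPrint("Page length:", len)
; Request the whole page (length -1): this returns an empty string
t$ = pdf.GetText(1, page, 0, -1)
DebugPrint("Extracted text length (all):", StrLen(t$))
; Request everything except the last two characters: this works
t$ = pdf.GetText(1, page, 0, len - 2)
DebugPrint("Extracted text length (len-2):", StrLen(t$))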
This results in:

Code:

Page length: 3339
Extracted text length (all): 0
Extracted text length (len-2): 3337
I'd like to know whether these trailing characters are something commonly found in PDFs. Maybe something UTF-8 related? If I work around this by skipping the last two characters, would that only work for this one PDF, or could it be a more general solution?
airsoftsoftwair
Posts: 5425
Joined: Fri Feb 12, 2010 2:33 pm
Location: Germany

Post by airsoftsoftwair »

Hmm, what are those last two characters? Are they non-ASCII ones by chance? It could also be that the length calculation is wrong. Have you checked what exactly is missing?

Post by jPV »

I have no idea what those two characters are, because I can't get them with Hollywood, and I don't know the PDF format well enough to dig them out with a hex editor, or to tell whether the text is compressed, etc.

To me it looks like I still get all the visible text from the page when I skip the two characters. At least that holds for the first page. OTOH, on the second page some image captions come out as the last text even though the image is at the top, so the order seems to vary depending on the elements on the page. But from a quick look I can't see any words with missing characters or the like...

I only found this workaround because I tried extracting just part of the text from the beginning, to see whether that would fail the way the full page does. It seemed to work, so I then went backwards to find out how many characters I have to skip to make it work.
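
The back-off described above can be sketched as a small helper. This is only a sketch under assumptions: p_GetPageText and maxskip are hypothetical names that are not part of Polybios, the calls and their argument order are taken from the example code earlier in this thread, and the page is assumed to be already loaded with pdf.LoadPage.

```hollywood
; Hypothetical helper, not part of Polybios: try the full page first, then
; retry while leaving out progressively more trailing characters.
; Assumes the page was already loaded with pdf.LoadPage().
Function p_GetPageText(id, page, maxskip)
    ; First attempt: the whole page (length -1, as in the example code)
    Local t$ = pdf.GetText(id, page, 0, -1)
    If StrLen(t$) > 0 Then Return(t$)

    ; Fallback: shorten the requested length one character at a time
    Local n = pdf.GetPageLen(id, page)
    For Local skip = 1 To maxskip
        If n - skip <= 0 Then Break
        t$ = pdf.GetText(id, page, 0, n - skip)
        If StrLen(t$) > 0 Then Return(t$)
    Next

    ; Nothing could be extracted (or the page really has no text)
    Return("")
EndFunction
```

It could be called as t$ = p_GetPageText(1, page, 4). The limit of 4 is arbitrary; for the test file above a skip of 2 was enough.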

Post by airsoftsoftwair »

The problem seems to be in PDFium. I've taken a note and will debug this when I have some time to spare.

Post by airsoftsoftwair »

This is the issue on the Chromium bugtracker: https://bugs.chromium.org/p/pdfium/issu ... il?id=1552

Post by airsoftsoftwair »

Well, who knows when somebody at Google will have time to fix this, so I've just fixed it on my own in PDFium now.

Code:

- Fix: pdf.GetPageLen() and pdf.GetText() didn't skip potential control characters in the text correctly