pdf.GetText fails to get text with some PDFs

Posted: Tue Aug 13, 2019 5:21 pm
by jPV
I found a PDF that has text on its pages, but pdf.GetText returns nothing when I try to extract all the text from a page. If I skip the last two characters and extract the rest of the page, it works.

Here is a test file, and here is some example code:

Code:

@REQUIRE "polybios"
f$ = "c-kasetti.pdf"
page = 1
pdf.OpenDocument(1, f$)
pdf.LoadPage(1, page, True)
len = pdf.GetPageLen(1, page)
DebugPrint("Page length:", len)
; Requesting the whole page (length -1) returns an empty string
t$ = pdf.GetText(1, page, 0, -1)
DebugPrint("Extracted text length (all):", StrLen(t$))
; Leaving out the last two characters returns the text
t$ = pdf.GetText(1, page, 0, len - 2)
DebugPrint("Extracted text length (len-2):", StrLen(t$))
This produces the following output:

Code:

Page length: 3339
Extracted text length (all): 0
Extracted text length (len-2): 3337
Are these last characters something that is commonly found in PDFs? Maybe something UTF-8 related? If I add a workaround that skips the last two characters, would it only work for this one PDF, or could it be a more general solution?

Re: pdf.GetText fails to get text with some PDFs

Posted: Fri Aug 16, 2019 11:37 pm
by airsoftsoftwair
Hmm, what are those last two characters? Are they non-ASCII ones by chance? It could also be that the length calculation is wrong. Have you checked what exactly is missing?

Re: pdf.GetText fails to get text with some PDFs

Posted: Sat Aug 17, 2019 8:49 am
by jPV
I have no idea what those two characters are, because I can't retrieve them with Hollywood, and I don't know the PDF format well enough to tell whether I could dig them out with a hex editor or whether the text is compressed, etc.

To me it looks like I still get all the visible text from the page when I skip the two characters. That holds at least for the first page. On the second page, some image captions come last in the extracted text even though the image is at the top, so the order seems to vary depending on the elements on the page. From a quick look, though, I can't see any words with missing characters.

I only found this workaround because I tried extracting just part of the text from the beginning, to see whether that would fail the way the full page does. It seemed to work, so I then went backwards to find out how many characters I have to skip to make it work.
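
The workaround described above could be wrapped in a small helper along these lines. This is only a sketch: p_GetPageText is a hypothetical name, the argument order of pdf.GetText (document id, page, start, length) is inferred from the example earlier in this thread, and the number of characters to skip (two) is what happened to work for this one test file, not a known constant.

```hollywood
; Hypothetical helper: try to extract the whole page first and fall
; back to skipping the trailing characters if nothing comes back.
; Assumes the page has already been loaded with pdf.LoadPage().
Function p_GetPageText(id, page)
    ; First attempt: whole page (length -1), as in the original report
    Local t$ = pdf.GetText(id, page, 0, -1)
    If StrLen(t$) > 0 Then Return(t$)

    ; Fallback: leave out the last two characters (assumption: two is
    ; enough, as observed with the test file from this thread)
    Local n = pdf.GetPageLen(id, page)
    If n <= 2 Then Return("")
    Return(pdf.GetText(id, page, 0, n - 2))
EndFunction
```

It would be called after pdf.OpenDocument() and pdf.LoadPage(), for example t$ = p_GetPageText(1, page). Because the fallback only runs when the full extraction returns an empty string, pages that extract normally are not truncated.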

Re: pdf.GetText fails to get text with some PDFs

Posted: Mon Sep 09, 2019 7:29 pm
by airsoftsoftwair
The problem seems to be in PDFium. I've taken a note and will debug this when I have some time to spare.

Re: pdf.GetText fails to get text with some PDFs

Posted: Sat Nov 28, 2020 9:29 pm
by airsoftsoftwair
This is the issue on the Chromium bugtracker: https://bugs.chromium.org/p/pdfium/issu ... il?id=1552

Re: pdf.GetText fails to get text with some PDFs

Posted: Sun Dec 06, 2020 11:48 am
by airsoftsoftwair
Well, who knows when somebody at Google will have time to fix this, so I've just fixed it on my own in PDFium now.

Code:

- Fix: pdf.GetPageLen() and pdf.GetText() didn't skip potential control characters in the text correctly