pdf.GetText fails to get text with some PDFs
Posted: Tue Aug 13, 2019 5:21 pm
I found a PDF that has text on its pages, but pdf.GetText returns nothing when I try to extract all the text from a page. If I skip the last two characters and request the rest of the page, it works.
I'd like to know whether these last characters are something commonly seen in PDFs. Maybe something UTF-8 related? If I add a workaround that skips the last two characters, would it only work for this one PDF, or could it be a more general solution?
Here is a test file, and here is the example code, followed by the output it produces:
Code:
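@REQUIRE "polybios"

f$ = "c-kasetti.pdf"
page = 1

; Open the document and load the page to be tested
pdf.OpenDocument(1, f$)
pdf.LoadPage(1, page, True)

; Number of characters reported for the page
len = pdf.GetPageLen(1, page)
DebugPrint("Page length:", len)

; Extract the whole page: returns an empty string for this PDF
t$ = pdf.GetText(1, page, 0, -1)
DebugPrint("Extracted text length (all):", StrLen(t$))

; Extract everything except the last two characters: this works
t$ = pdf.GetText(1, page, 0, len - 2)
DebugPrint("Extracted text length (len-2):", StrLen(t$))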
Output:
Page length: 3339
Extracted text length (all): 0
Extracted text length (len-2): 3337
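For reference, the workaround in question could be sketched as a small fallback helper. This is only a sketch under assumptions: p_GetPageText is a hypothetical name, the limit of two skipped characters comes from this one test file, and it is not known whether other PDFs behave the same way. It uses only the Polybios calls already shown above.

```
@REQUIRE "polybios"

; Hypothetical helper: returns the page text and the number of
; trailing characters that had to be skipped (-1 if nothing worked)
Function p_GetPageText(id, page)
	; First try the whole page, exactly as in the test above
	Local t$ = pdf.GetText(id, page, 0, -1)
	If StrLen(t$) > 0 Then Return(t$, 0)

	; Fall back to dropping one, then two, trailing characters
	Local n = pdf.GetPageLen(id, page)
	For Local skip = 1 To 2
		If n - skip <= 0 Then Break
		t$ = pdf.GetText(id, page, 0, n - skip)
		If StrLen(t$) > 0 Then Return(t$, skip)
	Next

	; Nothing could be extracted
	Return("", -1)
EndFunction

pdf.OpenDocument(1, "c-kasetti.pdf")
pdf.LoadPage(1, 1, True)
t$, skipped = p_GetPageText(1, 1)
DebugPrint("Characters skipped:", skipped, "Text length:", StrLen(t$))
```

Returning the skip count alongside the text makes it possible to log which files needed the fallback, which would help show whether the problem is limited to this PDF.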