Extracting text from PDF with Poppler (C++)
I'm trying to find my way through Poppler and its (lack of) documentation.
What I want to do is a simple thing: open a PDF file and read the text in it. I'm going to process the text afterwards, but that doesn't matter here.
So... I saw the poppler_page_get_text function, and it kind of works, but I have to specify a selection rectangle, which is not handy. Isn't there a simple function that would output the PDF text in order (maybe line by line)?
You should be able to set the selection rectangle to the page size/MediaBox of the page and get all the text.
I say "should" because, before you start wondering why you are surprised by the output of poppler_page_get_text, you should be aware of how text gets laid out on a page. All graphics are laid out on the page by a program expressed in postfix notation. To render the page, this program is executed on a blank page.
Operations in the program can include changing colors, position, or the current transformation matrix, and drawing lines, Bezier curves, and so on. Text is laid out by a series of text operators bracketed by BT (begin text) and ET (end text). How or where text is placed on the page is at the sole discretion of the software that generates the PDF. For example, in print drivers, the code responds to GDI calls such as DrawString and translates them into text drawing operations.
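To make the BT/ET bracketing concrete, here is a minimal sketch of a scanner that pulls out the strings shown by Tj operators inside text objects. It is an illustration only, not how Poppler or Acrobat works: the function name shown_strings is made up, the input is assumed to be an already-decompressed content stream, and only literal (...) strings with the Tj operator are handled (no TJ arrays, hex strings, or font encodings).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the literal strings shown by Tj operators
// that appear between BT and ET. Operands come before their operator
// (postfix), so the scanner remembers the last string until it sees Tj.
// Escapes are not decoded: the backslash is dropped and the next character
// is copied as-is, which is enough for \( \) and \\ but not for \n or octal.
std::vector<std::string> shown_strings(const std::string& stream) {
    std::vector<std::string> out;
    std::string last_string;
    bool have_string = false;
    bool in_text = false;
    std::size_t i = 0;
    while (i < stream.size()) {
        unsigned char c = static_cast<unsigned char>(stream[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (c == '(') {
            // Literal string operand; parentheses may nest.
            std::string s;
            int depth = 1;
            ++i;
            while (i < stream.size() && depth > 0) {
                char ch = stream[i++];
                if (ch == '\\' && i < stream.size()) s += stream[i++];
                else if (ch == '(') { ++depth; s += ch; }
                else if (ch == ')') { if (--depth > 0) s += ch; }
                else s += ch;
            }
            last_string = s;
            have_string = true;
            continue;
        }
        // Any other token (number, name, operator) runs to whitespace or '('.
        std::size_t start = i;
        while (i < stream.size() &&
               !std::isspace(static_cast<unsigned char>(stream[i])) &&
               stream[i] != '(') {
            ++i;
        }
        std::string tok = stream.substr(start, i - start);
        if (tok == "BT") in_text = true;
        else if (tok == "ET") in_text = false;
        else if (tok == "Tj") {
            if (in_text && have_string) out.push_back(last_string);
            have_string = false;
        }
    }
    return out;
}
```

Calling shown_strings on a stream such as "BT /F1 12 Tf 72 712 Td (Hello) Tj ET" yields the strings in the order the generating program emitted them, which is exactly why the result need not match reading order.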
If you are lucky, the text on the page is laid out in a sane order with sane font usage, but many programs that generate PDF aren't so kind. psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts may be re-encoded so that 'a' maps to '{' or whatever. You might also have ligatures, where multiple characters are replaced by single glyphs; the most common ones are ae, oe, fi, fl, and ffl.
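As a small illustration of the ligature problem, the sketch below expands ligature code points back into plain letters. It assumes the extractor has already produced UTF-8 text in which ligatures appear as their Unicode code points (U+FB00 to U+FB04, U+00E6, U+0153); with a re-encoded font that assumption may not hold. The name expand_ligatures is hypothetical, and expanding æ and œ is only appropriate for text where they are typographic ligatures rather than letters in their own right.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: replaces common ligature code points in UTF-8 text
// with their plain-letter spelling. All other bytes are copied unchanged.
std::string expand_ligatures(const std::string& text) {
    struct Ligature { const char* utf8; const char* plain; };
    static const Ligature kTable[] = {
        {"\xEF\xAC\x80", "ff"},   // U+FB00
        {"\xEF\xAC\x81", "fi"},   // U+FB01
        {"\xEF\xAC\x82", "fl"},   // U+FB02
        {"\xEF\xAC\x83", "ffi"},  // U+FB03
        {"\xEF\xAC\x84", "ffl"},  // U+FB04
        {"\xC3\xA6", "ae"},       // U+00E6 (a real letter in some languages)
        {"\xC5\x93", "oe"},       // U+0153 (a real letter in some languages)
    };
    std::string out;
    out.reserve(text.size());
    std::size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        for (const Ligature& lig : kTable) {
            std::size_t n = std::strlen(lig.utf8);
            if (text.compare(i, n, lig.utf8) == 0) {
                out += lig.plain;
                i += n;
                matched = true;
                break;
            }
        }
        if (!matched) out += text[i++];
    }
    return out;
}
```

Running this over extracted text before searching or indexing means a query for "office" can match a word that was typeset with the ffi ligature.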
With all of this in place, the process of extracting text is decidedly non-trivial, so don't be surprised if you see poor-quality results from text extraction.
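One piece of that work, putting out-of-order fragments back into reading order, can be sketched as follows. This is a deliberately naive heuristic and not what Poppler or Acrobat does: it assumes horizontal, single-column text, PDF-style coordinates with y increasing upward, and that each fragment already carries its position. The Fragment struct, the in_reading_order function, and the line_tolerance parameter are all made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical fragment: a piece of text with the position of its origin.
struct Fragment {
    double x;
    double y;
    std::string text;
};

// Groups fragments into lines (baselines within line_tolerance of the line's
// topmost fragment), orders lines top to bottom and fragments left to right,
// and joins them with spaces and newlines.
std::string in_reading_order(std::vector<Fragment> frags,
                             double line_tolerance = 2.0) {
    // Higher y first, because y grows upward in this coordinate system.
    std::stable_sort(frags.begin(), frags.end(),
                     [](const Fragment& a, const Fragment& b) {
                         return a.y > b.y;
                     });
    std::string out;
    std::size_t i = 0;
    while (i < frags.size()) {
        // Find the end of the current line.
        std::size_t j = i + 1;
        while (j < frags.size() && frags[i].y - frags[j].y <= line_tolerance) {
            ++j;
        }
        // Within the line, order by x.
        std::sort(frags.begin() + i, frags.begin() + j,
                  [](const Fragment& a, const Fragment& b) {
                      return a.x < b.x;
                  });
        for (std::size_t k = i; k < j; ++k) {
            if (k > i) out += ' ';
            out += frags[k].text;
        }
        out += '\n';
        i = j;
    }
    return out;
}
```

Multi-column layouts, rotated text, and tables all defeat this kind of sort, which is part of why real extraction results vary so much.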
I used to work on the text extraction tools in Acrobat 1.0 and 2.0; it's a real challenge to get right.