Extracting text from PDF with Poppler (C++)


I'm trying to find my way through Poppler and its (lack of) documentation.

What I want to do is a simple thing: open a PDF file and read the text in it. I'm then going to process the text, but that doesn't matter here.

So... I saw the poppler_page_get_text function, and it kind of works, but I have to specify a selection rectangle, which is not very handy. Isn't there a simple function that would just output the PDF text in order (maybe line by line)?

You should be able to set the selection rectangle to the page size (the MediaBox) of the page and get all the text.

I say "should" because, before you start wondering why you're surprised by the output of poppler_page_get_text, you should be aware of how text gets laid out on a page. All graphics are laid out on the page by a program expressed in postfix notation. To render the page, this program is executed on a blank page.

Operations in the program can include changing colors, position, and the current transformation matrix, drawing lines and Bezier curves, and so on. Text is laid out by a series of text operators bracketed by BT (begin text) and ET (end text). How or where text is placed on the page is at the sole discretion of the software that generates the PDF. For print drivers, for example, the code responds to GDI calls such as DrawString and translates them into text drawing operations.
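To make the postfix idea concrete, here is a minimal sketch of a toy interpreter for a heavily simplified content stream. It is not Poppler code and not a real PDF parser: the names `TextFragment`, `tokenize`, and `extractText` are made up for illustration, only the BT, ET, Td, and Tj operators are understood, and string escapes, nested parentheses, fonts, and glyph widths are all ignored.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// One piece of text shown by a Tj operator, with the position it was drawn at.
// (Hypothetical type, for illustration only.)
struct TextFragment {
    double x;
    double y;
    std::string text;
};

// Split a simplified content stream into tokens. A literal string "(...)"
// becomes one token that keeps its leading '(' as a marker. Escapes and
// nested parentheses, which real PDF strings allow, are not handled.
std::vector<std::string> tokenize(const std::string& stream) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < stream.size()) {
        if (std::isspace(static_cast<unsigned char>(stream[i]))) {
            ++i;
        } else if (stream[i] == '(') {
            std::size_t end = stream.find(')', i);
            if (end == std::string::npos) end = stream.size();
            tokens.push_back(stream.substr(i, end - i));  // '(' plus contents
            i = (end < stream.size()) ? end + 1 : end;
        } else {
            std::size_t start = i;
            while (i < stream.size() &&
                   !std::isspace(static_cast<unsigned char>(stream[i])) &&
                   stream[i] != '(') {
                ++i;
            }
            tokens.push_back(stream.substr(start, i - start));
        }
    }
    return tokens;
}

// Operands are strings, names (/F1), and numbers; anything else is an operator.
bool isOperand(const std::string& tok) {
    const char c = tok[0];
    return c == '(' || c == '/' || c == '-' || c == '+' || c == '.' ||
           std::isdigit(static_cast<unsigned char>(c));
}

// Run the stream: operands pile up on a stack until an operator consumes them.
std::vector<TextFragment> extractText(const std::string& stream) {
    std::vector<TextFragment> fragments;
    std::vector<std::string> operands;
    bool inText = false;
    double x = 0.0, y = 0.0;

    for (const std::string& tok : tokenize(stream)) {
        if (tok == "BT") {                 // begin text: reset the position
            inText = true;
            x = y = 0.0;
            operands.clear();
        } else if (tok == "ET") {          // end text
            inText = false;
            operands.clear();
        } else if (tok == "Td") {          // move by (tx, ty)
            if (inText && operands.size() >= 2) {
                x += std::strtod(operands[operands.size() - 2].c_str(), nullptr);
                y += std::strtod(operands[operands.size() - 1].c_str(), nullptr);
            }
            operands.clear();
        } else if (tok == "Tj") {          // show a string at the current position
            if (inText && !operands.empty() && operands.back()[0] == '(') {
                fragments.push_back({x, y, operands.back().substr(1)});
            }
            operands.clear();
        } else if (isOperand(tok)) {
            operands.push_back(tok);
        } else {
            operands.clear();              // unknown operator: drop its operands
        }
    }
    return fragments;
}
```

Calling `extractText("BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET")` yields two fragments, "Hello" and "World", in the order the generating program emitted them. That order is all the stream itself guarantees, which is the root of the problems described next.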

If you are lucky, the text on the page is laid out in a sane order with sane font usage, but many programs that generate PDF aren't so kind. Psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts may be re-encoded so that 'a' maps to '{' or whatever. Then you might have ligatures, where multiple characters are replaced by a single glyph; the most common ones are ae, oe, fi, fl, and ffl.
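Two of these problems, out-of-order fragments and ligatures, can be partly undone after extraction. The sketch below is one rough way to do that, not what Poppler or Acrobat actually does. The names `TextFragment`, `inReadingOrder`, and `expandLigatures` are hypothetical, the 2-unit line tolerance is an arbitrary assumption, text is assumed to be UTF-8 with the PDF origin at the bottom left, and columns, rotated text, and re-encoded fonts are not handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A piece of extracted text and the position it was drawn at.
// (Hypothetical type, for illustration only.)
struct TextFragment {
    double x;
    double y;
    std::string text;
};

// Rebuild a plausible reading order: top to bottom, then left to right.
// Fragments whose y lies within lineTolerance of the top of a line are
// treated as being on that line.
std::string inReadingOrder(std::vector<TextFragment> frags,
                           double lineTolerance = 2.0) {
    // PDF y grows upwards, so the largest y is the top of the page.
    std::stable_sort(frags.begin(), frags.end(),
                     [](const TextFragment& a, const TextFragment& b) {
                         return a.y > b.y;
                     });
    std::string out;
    std::size_t i = 0;
    while (i < frags.size()) {
        // Find the end of the current line.
        std::size_t j = i + 1;
        while (j < frags.size() && frags[i].y - frags[j].y <= lineTolerance) {
            ++j;
        }
        // Order the fragments of this line from left to right.
        std::stable_sort(frags.begin() + i, frags.begin() + j,
                         [](const TextFragment& a, const TextFragment& b) {
                             return a.x < b.x;
                         });
        if (!out.empty()) out += '\n';
        for (std::size_t k = i; k < j; ++k) {
            if (k > i) out += ' ';
            out += frags[k].text;
        }
        i = j;
    }
    return out;
}

// Replace single-glyph ligatures (given as UTF-8 bytes) with their letters.
// Expanding ae and oe is lossy for languages where they are real letters.
std::string expandLigatures(std::string text) {
    static const std::pair<const char*, const char*> table[] = {
        {"\xEF\xAC\x80", "ff"},   // U+FB00
        {"\xEF\xAC\x81", "fi"},   // U+FB01
        {"\xEF\xAC\x82", "fl"},   // U+FB02
        {"\xEF\xAC\x83", "ffi"},  // U+FB03
        {"\xEF\xAC\x84", "ffl"},  // U+FB04
        {"\xC3\xA6", "ae"},       // U+00E6
        {"\xC5\x93", "oe"},       // U+0153
    };
    for (const auto& entry : table) {
        const std::string from = entry.first;
        const std::string to = entry.second;
        std::size_t pos = 0;
        while ((pos = text.find(from, pos)) != std::string::npos) {
            text.replace(pos, from.size(), to);
            pos += to.size();
        }
    }
    return text;
}
```

Sorting by position helps with generators that emit text grouped by style, as Psroff did, but it will happily interleave the lines of a two-column page, so treat it as a starting point rather than a fix.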

With all of this in place, the process of extracting text is decidedly non-trivial, so don't be surprised if you see poor quality results from text extraction.

I used to work on the text extraction tools in Acrobat 1.0 and 2.0, and it's a real challenge to get right.

