DocHive liberates and explores PDF files, using Tesseract and ImageMagick to convert them into CSV files. ImageMagick crops highlighted regions of each page into sections, and Tesseract converts those sections into text.


The new program aims to pull the data from the documents and put it into a spreadsheet.

It’s called DocHive, and here’s how it works: the program uses XML, a markup language for describing structured data, to break a page up into smaller sections.
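As a rough illustration of how an XML template could describe those sections, the sketch below parses a small layout into named rectangles. The element and attribute names (`template`, `section`, `name`, `x`, `y`, `width`, `height`) and the coordinates are hypothetical, chosen only for this example; they are not taken from DocHive's actual template format.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Section(NamedTuple):
    """One rectangular region of a page, in pixels."""
    name: str
    x: int
    y: int
    width: int
    height: int


# Hypothetical template: one <section> per field on the form.
TEMPLATE_XML = """
<template>
  <section name="donor_name" x="40" y="120" width="300" height="30"/>
  <section name="occupation" x="350" y="120" width="200" height="30"/>
  <section name="amount" x="560" y="120" width="100" height="30"/>
</template>
"""


def parse_template(xml_text: str) -> List[Section]:
    """Read every <section> element and return it as a Section tuple."""
    root = ET.fromstring(xml_text)
    return [
        Section(
            name=el.attrib["name"],
            x=int(el.attrib["x"]),
            y=int(el.attrib["y"]),
            width=int(el.attrib["width"]),
            height=int(el.attrib["height"]),
        )
        for el in root.iter("section")
    ]


if __name__ == "__main__":
    for section in parse_template(TEMPLATE_XML):
        print(section)
```

Keeping the layout in a separate template means the same extraction code could be reused for a different form by swapping the XML.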

For example, in the campaign finance documents, it will make separate sections for donor name, occupation, donation amount and all the other fields. Then, it will take each of those sections and turn it into a separate image file. The software takes each small image and uses optical character recognition technology, known by the acronym OCR, to read the few words or numbers it contains and insert them into a text file.
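The crop-then-OCR step described above might look roughly like the following sketch. It is not DocHive's own code: it assumes the page has already been rendered to an image, that the ImageMagick `convert` command (named `magick` in ImageMagick 7) and a Tesseract 4 or later `tesseract` command are on the PATH, and the helper names, field names and coordinates are hypothetical.

```python
import csv
import io
import subprocess
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in pixels


def crop_command(page_image: str, box: Box, out_path: str) -> List[str]:
    """Build the ImageMagick call that cuts one region out of the page."""
    x, y, width, height = box
    geometry = f"{width}x{height}+{x}+{y}"
    return ["convert", page_image, "-crop", geometry, "+repage", out_path]


def ocr_command(image_path: str) -> List[str]:
    """Build the Tesseract call; --psm 7 treats the image as one text line."""
    return ["tesseract", image_path, "stdout", "--psm", "7"]


def extract_fields(page_image: str, boxes: Dict[str, Box]) -> Dict[str, str]:
    """Crop each named region, OCR it, and return {field name: text}."""
    row = {}
    for name, box in boxes.items():
        crop_path = f"{name}.png"  # hypothetical naming for the cropped image
        subprocess.run(crop_command(page_image, box, crop_path), check=True)
        result = subprocess.run(
            ocr_command(crop_path), check=True, capture_output=True, text=True
        )
        row[name] = result.stdout.strip()
    return row


def rows_to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize the extracted rows as CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A call such as `extract_fields("page1.png", {"donor_name": (40, 120, 300, 30)})` would produce one row per page, and `rows_to_csv` turns a list of such rows into text that a spreadsheet program can open.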

