Sunday, January 13, 2013

Scraping PDF files


While puttering around on the Internet, I recently stumbled upon the website of the National Printing House ("Εθνικό Τυπογραφείο" in Greek) which is the public service responsible for the dissemination of Greek law. It publishes and distributes the government gazette and its website provides free access to all series of the Official Journal of the Hellenic Republic (ΦΕΚ).


    So, on its search page I noticed a section listing the most popular issues. The most-viewed one, as of 13 Jan 2013, with 351,595 views, was this: ΦΕΚ A 226 - 27.10.2011. Out of curiosity I decided to download it and take a quick look at what it was all about. It was available in PDF format and turned out to be an issue about the Economic Adjustment Programme for Greece, aimed at reducing the country's macroeconomic and fiscal imbalances. However, I was quite surprised to find that the text was contained in images: you could neither perform a keyword search in it nor copy-paste its textual content, presumably because the document's pages had been scanned and converted to digital images.
    This instantly reminded me once again of the difficulties that PDF scraping involves. In our web scraping experience there are many cases where the data is "locked" in PDF files, e.g. in a .pdf brochure. Getting the data of interest out is not an easy task, but quite a few tools (pdftotext is one of them) have popped up over the years to ease the pain. One of the best tools I have encountered so far is Tesseract, a pretty accurate open source OCR engine currently maintained by Google.
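
    For PDFs that still contain a real text layer, a tool like pdftotext is often all you need. Here is a minimal sketch of calling it from Python; the file names are placeholders, not files from this post, and it assumes the pdftotext command-line tool (part of Poppler) is installed.

```python
import subprocess

# Hypothetical input/output files for illustration
pdf_file = "brochure.pdf"
txt_file = "brochure.txt"

# -layout asks pdftotext to preserve the original physical layout of the text.
# This only works when the PDF actually contains text, not scanned page images.
subprocess.run(["pdftotext", "-layout", pdf_file, txt_file], check=True)

with open(txt_file, encoding="utf-8") as f:
    print(f.read()[:500])  # peek at the first few hundred characters
```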


    So, I thought it would be nice to put Tesseract into action and check its efficiency against the PDF document (that so dramatically affects the lives of all Greeks..). It worked quite well, though not perfectly (probably because of the Greek language), and a few minutes later (after converting the PDF to a TIFF image through Ghostscript) I had the full text, or at least most of it, in my hands. The output text file generated can be found here. The truth is that I could not do much with it (and the austerity measures were harsh..), but at least I was happy that I was able to extract the largest part of the text.
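
    For the curious, the pipeline looks roughly like the sketch below: rasterise the PDF with Ghostscript, then OCR the result with Tesseract. This is an illustrative reconstruction, not my exact commands; the file names are placeholders and it assumes Ghostscript (gs) and Tesseract with the Greek ("ell") language data are installed.

```python
import subprocess

pdf_file = "fek_a_226_2011.pdf"   # hypothetical local copy of the issue
tiff_file = "fek_a_226_2011.tif"  # intermediate multi-page TIFF

# Step 1: rasterise the PDF into a 300 dpi grayscale TIFF with Ghostscript
subprocess.run([
    "gs", "-dNOPAUSE", "-dBATCH",
    "-sDEVICE=tiffgray", "-r300",
    f"-sOutputFile={tiff_file}", pdf_file,
], check=True)

# Step 2: OCR the TIFF with Tesseract; "-l ell" selects the Greek language pack.
# Tesseract writes its output to fek_a_226_2011_ocr.txt
subprocess.run(
    ["tesseract", tiff_file, "fek_a_226_2011_ocr", "-l", "ell"],
    check=True,
)
```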
    Of course, this is just an example; there are numerous PDF files out there containing rich, inaccessible data that could potentially be processed and further utilised, e.g. to build a full-text search index. DEiXTo, our beloved web data extraction tool, can scrape only HTML pages; it cannot deal with PDF files residing on the Web. However, we do have the tools and the knowledge to parse those as well, find the bits of interest and unleash their value!
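
    As a taste of the full-text search idea, here is a tiny sketch that drops the OCR output into a searchable index. It assumes a SQLite build with the FTS4 full-text extension (bundled with most builds); the database name, input file and query term are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect("fek_index.db")  # hypothetical index file
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts4(title, body)")

# Index the OCR'd text of one gazette issue
with open("fek_a_226_2011_ocr.txt", encoding="utf-8") as f:
    conn.execute(
        "INSERT INTO docs (title, body) VALUES (?, ?)",
        ("ΦΕΚ A 226 - 27.10.2011", f.read()),
    )
conn.commit()

# Keyword search over the indexed text (example query term)
for (title,) in conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ?", ("πρόγραμμα",)
):
    print(title)
```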
