Friday, May 3, 2013

Scraping the members of the Greek Parliament

The Hellenic Parliament is the supreme democratic institution that represents Greek citizens through an elected body of Members of Parliament (MPs). It is a legislature of 300 members, elected for a four-year term, that submits bills and amendments. Its website, www.hellenicparliament.gr, holds a lot of interesting data that could be useful to ordinary citizens, to professionals such as journalists and lawyers, to the media and to businesses.


    Inspired by existing scrapers for many Parliaments of the world, like these on ScraperWiki, an amazing web-based scraping platform, we decided to write a simple yet efficient DEiXToBot-based script. It gathers information (such as the full name, constituency and contact details) from the CV pages of Greek MPs and, after some post-processing (e.g. deducing the party an MP belongs to from the logo in the party column), exports it to a tab-delimited text file that can easily be imported into an ODF spreadsheet or a database. The script uses a tree pattern previously built with the GUI DEiXTo tool to identify the data of interest, and visits all 30 target pages (each containing ten records) by means of the pageNo URL parameter. It should also be noted that we used Selenium, our favorite browser automation tool, for this purpose. The results of running the script can be found in this .ods file. In case you would like to take a look at the Perl code that got the job done, you can download it here.
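The paging, post-processing and export steps described above can be sketched roughly as follows. This is not the Perl script linked in the post but a minimal Python illustration under stated assumptions: the listing path in BASE_URL, the logo file names in PARTY_BY_LOGO, the output column names, and the choice to number pages from 1 are all hypothetical. Only the pageNo parameter, the 30 pages and the tab-delimited output come from the description above. Fetching the pages and extracting the records (done with Selenium and a DEiXTo tree pattern in the original) is left out.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical listing path: the post does not give the real URL of the MPs listing.
BASE_URL = "https://www.hellenicparliament.gr/mps-listing-placeholder"
TOTAL_PAGES = 30  # 30 target pages of ten records each, as stated in the post

# Hypothetical logo file names: the real ones would have to be read from the site.
PARTY_BY_LOGO = {
    "logo_party_a.png": "Party A",
    "logo_party_b.png": "Party B",
}


def page_urls(base_url=BASE_URL, total_pages=TOTAL_PAGES):
    """Return one URL per listing page, varying only the pageNo parameter.

    Numbering pages from 1 is an assumption.
    """
    return [
        f"{base_url}?{urlencode({'pageNo': n})}"
        for n in range(1, total_pages + 1)
    ]


def party_from_logo(logo_src):
    """Deduce the party name from the file name at the end of a logo URL."""
    filename = logo_src.rsplit("/", 1)[-1]
    return PARTY_BY_LOGO.get(filename, "Unknown")


def to_tsv(records):
    """Serialize a list of record dicts to tab-delimited text with a header row."""
    fields = ["full_name", "constituency", "party", "contact"]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Each extracted record would pass its logo URL through party_from_logo before being handed to to_tsv, and the resulting text can be saved to a file and opened directly in a spreadsheet application.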


    Open data (data that is free for use, reuse, and redistribution) is a goldmine that can stimulate innovative ways to discover knowledge and analyze the rich data sets available on the World Wide Web. Scraping is an invaluable tool that can help in this direction and serve transparency and openness. There is currently a wide variety of remarkable web data extraction tools, quite a few of them free. Perhaps you would like to give DEiXTo a try and start building your own web robots to get the data you need and transform it into a suitable format for further use.
    In conclusion, scraping has numerous uses and applications and there is a high chance you could come up with an interesting and creative use case scenario tailored to your requirements. So, if you need any help with DEiXTo or have any inquiries, please do not hesitate to contact us!
