Saturday, January 28, 2012

Federated searching & dbWiz

Nowadays, university and college students, professors, and researchers increasingly seek information and answers on the open Web. Google has become the dominant search tool for almost everyone. Its popularity is enormous, and the reasons are not hard to see: it has a simple and effective interface, and it returns fast, accurate results.


    However, libraries, in their effort to win some patrons back, have tried to offer a decent searching alternative by developing a new model: federated search engines. Federated searching (also known as metasearch or cross searching) allows users to search multiple web resources and subscription-based bibliographic databases simultaneously from a single interface. To achieve that, parallel processes are executed in real time, each retrieving results from a separate source. Then, the returned results are grouped together and presented to the user in a unified way.
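    The fan-out-and-merge idea described above can be sketched in a few lines. The example is in Python rather than Perl, purely for illustration, and it is not how dbWiz does it: the three source functions, the `SOURCES` table and `federated_search` are hypothetical names that stand in for real search plugins.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for real search plugins (assumptions, not dbWiz code).
# Each one takes a query string and returns a list of result records.
def search_catalog(query):
    return [{"title": f"{query} in the library catalog", "source": "catalog"}]

def search_journals(query):
    return [{"title": f"{query}: a journal article", "source": "journals"}]

def search_broken(query):
    # Simulates a source that fails, which happens often in practice.
    raise TimeoutError("source did not respond")

SOURCES = {
    "catalog": search_catalog,
    "journals": search_journals,
    "broken": search_broken,
}

def federated_search(query, sources=SOURCES, timeout=10):
    """Query every source in parallel and group the results by source name."""
    grouped, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # Map each pending future back to the name of the source it queries.
        futures = {pool.submit(fn, query): name for name, fn in sources.items()}
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                grouped[name] = future.result()
            except Exception as exc:
                # One failing source should not sink the whole search.
                errors[name] = str(exc)
    return grouped, errors

if __name__ == "__main__":
    results, failures = federated_search("perl")
    print(results)
    print(failures)
```

    Keeping failures in a separate dictionary lets the interface show whatever results did arrive and still tell the user which sources could not be reached.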
    There are broadly two mechanisms for pulling data from the target sources: either through an Application Programming Interface (API) or by scraping the native web interface of each database. The first method is clearly preferable, but very often a search API is not available. In such cases, web robots (or agents) come into play and capture the information of interest, typically by simulating a human browsing through the target web pages. In academia especially, there are numerous online bibliographic databases. Some of them offer Z39.50 or API access. However, a large number still do not provide protocol-based search functionality. Thus, scraping techniques have to be deployed for those (unless the vendor disallows bots).
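    To make the scraping route a little more concrete, here is a minimal sketch, again in Python for illustration only. The page markup, the `result` class name and the `ResultScraper` and `scrape` names are all made up for this example; a real robot would first fetch the page over HTTP and would have to follow whatever markup the target database actually produces.

```python
from html.parser import HTMLParser

# Hypothetical results page (an assumption for this sketch).
SAMPLE_PAGE = """
<html><body>
  <div class="result"><a href="/record/1">Learning Perl</a></div>
  <div class="result"><a href="/record/2">Programming Perl</a></div>
  <div class="ad"><a href="/promo">Not a result</a></div>
</body></html>
"""

class ResultScraper(HTMLParser):
    """Collect (title, link) pairs from links inside <div class="result">.

    Deliberately simple: it does not handle nested <div> elements.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_result = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self._in_result = attrs.get("class") == "result"
        elif tag == "a" and self._in_result:
            # Start collecting the link text of a result entry.
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.records.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "div":
            self._in_result = False

def scrape(html):
    parser = ResultScraper()
    parser.feed(html)
    return parser.records

if __name__ == "__main__":
    for title, link in scrape(SAMPLE_PAGE):
        print(title, link)
```

    The weakness of this approach is that the extraction logic is tied to the page layout, so a redesign on the vendor's side can silently break it.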
    When I started my programming adventure with Perl back in 2006, in the context of my former full-time job at the Library of the University of Macedonia (Thessaloniki, Greece), I had the chance (and luck) to run across dbWiz, a remarkable open source federated search tool developed by the Simon Fraser University (SFU) Library in Canada. I was fascinated by Perl as well as by dbWiz's internal design and implementation. So, this is how I met and fell in love with Perl.
    dbWiz offered a friendly and usable admin interface that allowed you to create search categories and select, from a global list of resources, which databases would be active and searchable. If you had to add a new resource, though, you would have to write your own plugin (Perl knowledge and programming skills were required). Some of the dbWiz search plugins were based on Z39.50, whereas others (the majority) relied on regular expressions and WWW::Mechanize (a handy Perl module that acts as a programmable web browser).
    The federated search engine I developed while working at the University of Macedonia (2006-2008) was named "Pantou" and became a valuable everyday tool for students and professors of the University. The results of this work were presented at the 16th Panhellenic Academic Libraries Conference (Piraeus, 1-3 October 2007). Unfortunately, its maintenance stopped at the end of 2010 due to the economic crisis and severe cuts in funding. Consequently, a few months later some of its plugins started falling apart.
    Generally, delving into dbWiz taught me a lot about web development, Perl programming and GNU/Linux administration. I loved it! Meanwhile, in my effort to improve the relatively hard and tedious procedure of creating new dbWiz plugins, I put into practice an early version of GUI DEiXTo (which was my MSc thesis, carried out in the same period at the Aristotle University of Thessaloniki). The result was a new Perl module that allowed the execution of W3C DOM-based XML patterns (built with GUI DEiXTo) inside dbWiz and eliminated, at least to a large extent, the need for heavy use of regular expressions. That module, which was the first predecessor of today's DEiXToBot package, was included in the official dbWiz distribution after I contacted the dbWiz development team in 2007. Unfortunately, SFU Library ended the support and development of dbWiz in 2010.
    Looking back, I can now say with quite a bit of certainty that DEiXTo (more than ever before) can power federated search tools and help them extend their reach to previously inaccessible resources. As far as the search engine war is concerned, Google seems to be winning, but nobody can say for sure what is going to happen in the next few years. Time will tell.
