Sunday 2 October 2011

Tools for Web Crawling

PART 1: INTRODUCTION TO A WEB CRAWLER

A web crawler is a program that builds an index of web pages. It starts from a particular URL and validates the links found on that page. The links that validate are indexed, and the links present in those pages are stored in a queue known as the “crawl frontier”, which the crawler uses to decide which URLs to validate next.
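The loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not the crawler we built: the `PAGES` dictionary is a made-up, in-memory stand-in for the web (a real crawler would fetch each URL over HTTP), and the names `LinkParser`, `extract_links` and `crawl` are our own hypothetical helpers. "Validating" a link is simplified here to checking that the page exists.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML body (assumption, no network access).
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/missing">gone</a>',
    "http://example.com/b": '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seed, pages, max_pages=100):
    """Breadth-first crawl from seed; returns the indexed URLs in visit order."""
    frontier = deque([seed])  # the crawl frontier: URLs waiting to be validated
    seen = {seed}             # every URL ever added to the frontier
    index = []                # URLs that validated
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        html = pages.get(url)
        if html is None:      # the link did not validate, so skip it
            continue
        index.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


if __name__ == "__main__":
    for url in crawl("http://example.com/", PAGES):
        print(url)
```

Using a first-in, first-out queue for the frontier gives a breadth-first crawl. The `seen` set keeps the same URL from being queued twice, which matters because pages often link back to each other.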

Once the dynamic nature of the World Wide Web was recognized, a need to index web pages arose, giving rise to web crawlers. The following presentation describes how a web crawler works.

PART 2: TRYING OUT A FREELY AVAILABLE WEB CRAWLER

We tried out the WebSPHINX web crawler, primarily for two reasons:

  1. It is known for being interactive. Its Graph and Outline visualization modes also helped us understand the web crawling process better.

  2. Other commonly used web crawlers were already taken by the other groups. One of them took the “Win” web crawler, and the other took “Harvestman” for Linux and “Spider” for Windows.

WebSPHINX stands for Website-Specific Processors for HTML INformation eXtraction. We used WebSPHINX's Crawler Workbench, which lets a user crawl the web with a customizable web crawler.

The following presentation covers in detail everything we did with this crawler.

PART 3: CREATING OUR OWN WEB CRAWLER

Code and explanation:

Please see the following presentation.

Terminal outputs:

Please see the following presentation.

References:

  1. Web Crawler On Client Machine by Rajashree Shettar, Dr. Shobha G (first two pages only) http://www.slideshare.net/crazyprave12490/web-crawler-7156834

  2. Chapter 20 of Introduction to Information Retrieval, a draft © April 1, 2009 Cambridge University Press (Overview and Crawling sub-topics) nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

  3. WebSPHINX: A Personal, Customizable Web Crawler http://www.cs.cmu.edu/~rcm/websphinx/

  4. Manual entry of wget

Note:
We made all the presentations used here ourselves. They have just been embedded using Sanchit's Slideshare profile (http://www.slideshare.net/sanchitsaini/presentations).

Thank you for reading.

Rohit Atri, 2011091

Sanchit Saini, 2011097

