PART 1: INTRODUCTION TO A WEB CRAWLER
A Web Crawler is a piece of software that lets us build an index of web pages. It starts from a particular URL and validates the links found on that page. The validated links are indexed, and the links they contain in turn are stored in a queue known as the “Crawl Frontier”, which the crawler uses to decide which URLs to validate next.
Once the dynamic nature of the World Wide Web was recognized, a need to index web pages arose, which gave rise to web crawlers. The following presentation describes how a web crawler works.
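The loop described above can be sketched in a few lines of Python. This is only an illustration and not the crawler we wrote in Part 3: the names crawl, LinkExtractor, fetch and PAGES are our own assumptions, and the "web" here is a small in-memory dictionary so that the sketch runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl starting at seed_url.

    fetch is a hypothetical callable returning a page's HTML,
    or None when the link cannot be validated (e.g. a dead link).
    Returns a dict mapping each indexed URL to the links found on it.
    """
    frontier = deque([seed_url])  # the crawl frontier: URLs still to visit
    seen = {seed_url}             # URLs already queued, to avoid revisits
    index = {}

    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue              # link failed validation, do not index it

        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the page they were found on.
        links = [urljoin(url, href) for href in parser.links]
        index[url] = links

        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)

    return index


# A toy, made-up web of three pages standing in for real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/missing">broken</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

index = crawl("http://example.com/", PAGES.get)
for page, links in index.items():
    print(page, "->", links)
```

Using a queue for the frontier gives a breadth-first crawl. To crawl real pages, fetch could be replaced with a function that downloads the URL and returns None on failure.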
PART 2: TRYING OUT A FREELY AVAILABLE WEB CRAWLER
We tried out the WebSPHINX Web Crawler, primarily for two reasons:
It is renowned for being interactive. Also, its Graph and Outline visualization modes helped us understand the web crawling process better.
Other commonly used web crawlers had already been taken by the other groups. One of them took the “Win” web crawler, and the other took “Harvestman” for Linux and “Spider” for Windows.
WebSPHINX stands for Website-Specific Processors for HTML INformation eXtraction. We used WebSPHINX's Crawler Workbench, which lets a user crawl the web with a customizable web crawler.
The following presentation covers in detail everything we did on this crawler.
PART 3: CREATING OUR OWN WEB CRAWLER
Coding and understanding it:
Please see the following presentation
Terminal Outputs:
Please see the following presentation
References:
Web Crawler On Client Machine by Rajashree Shettar, Dr. Shobha G (first two pages only) http://www.slideshare.net/crazyprave12490/web-crawler-7156834
Chapter 20 of Introduction to Information Retrieval, a draft © April 1, 2009 Cambridge University Press (Overview and Crawling sub-topics) nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
WebSPHINX: A Personal, Customizable web Crawler http://www.cs.cmu.edu/~rcm/websphinx/
Manual entry of wget
Note:
We made all the presentations used here ourselves. They have simply been embedded through Sanchit's SlideShare profile. (http://www.slideshare.net/sanchitsaini/presentations)
Thank you for reading.
Rohit Atri, 2011091
Sanchit Saini, 2011097