IIITD: System Management 2011

TOOLS FOR WEB CRAWLING
1. How does a web crawler work?

A web crawler is an automated program or script that methodically scans ("crawls") the World Wide Web and builds an index of the data it is looking for.

Web crawlers start by parsing a specified web page and noting any hypertext links on that page that point to other pages. They then parse those pages for new links, and so on, recursively. A web crawler needs a starting point, which is a web address (URL). To browse the Internet we use the HTTP network protocol, which lets us talk to web servers and download data from or upload data to them. The crawler fetches the starting URL and then looks for hyperlinks in the page it gets back.

A web crawler doesn't actually move around the web or between computers retrieving information. Instead, it resides on a single computer and sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on a link. All the crawler really does is automate the process of following links. Web crawlers are bots that save a copy of every visited page in a database (after processing it) so that the information on websites can be indexed to provide fast searches on the web.
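The link-following loop described above can be sketched in a few lines of bash. This is only a toy, under some loud assumptions: three small local files stand in for web pages so that nothing touches the network, and the file names and the helper functions extract_links and crawl are made up for this illustration.

```shell
#!/bin/bash
# Toy crawl over local files standing in for web pages (no network access).
# The file names and contents are invented for illustration.
site=$(mktemp -d)
printf '<a href="b.html">B</a> <a href="c.html">C</a>\n' > "$site/a.html"
printf '<a href="c.html">C</a> <a href="a.html">A</a>\n' > "$site/b.html"
printf '<p>no links here</p>\n' > "$site/c.html"

# Print the href targets found in one page, one per line.
extract_links() {
    grep -Eo 'href="[^"]+"' "$1" | sed -e 's/^href="//' -e 's/"$//'
}

# Breadth-first crawl: take a page from the queue, print it,
# then queue every link on it that has not been seen before.
crawl() {
    local queue=("$1") seen=" $1 " page link
    while [ "${#queue[@]}" -gt 0 ]; do
        page=${queue[0]}
        queue=("${queue[@]:1}")
        echo "$page"
        for link in $(extract_links "$site/$page"); do
            case $seen in
                *" $link "*) ;;    # already queued or visited
                *) seen="$seen$link "; queue+=("$link") ;;
            esac
        done
    done
}

crawl a.html
```

Starting from a.html, the sketch visits a.html, b.html and c.html once each, even though the pages link back to one another. A real crawler would replace the local file read with an HTTP request.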

So there are basically three steps involved in the web crawling procedure. First, the search bot starts by crawling the pages of your site. Then it indexes the words and content of your site. Finally, it visits the links (web page addresses, or URLs) that it finds on your site.

Web crawlers, sometimes called scrapers, automatically scan the Internet and attempt to work out the meaning of the content they find. Web search wouldn't function without them. Crawlers are the backbone of search engines (for example, Google's crawler is Googlebot), which combine the crawled data with other algorithms to work out the relevance of your page to a given keyword context.

Uses of web crawlers:

Web crawlers are not used only by the search engines themselves but also by ordinary users. Linguists use them for textual analysis, combing the net to determine which words are commonly used today, and market researchers use them to assess the trends in a given market.

(Note: this is not wholly our own work; it is the result of our research on the net.)

There are many free and easily available utilities for crawling the web on Windows. We have covered only WebSPHINX, a very good tool, especially for advanced browsing.

SPHINX (A WINDOWS WEB CRAWLER):

(It is one of the free and easily available utilities on the net.)

WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers.

It is designed for advanced web users who want to crawl a small part of the web (such as a single website) automatically. It is not made for crawling the entire web.

WebSPHINX uses Java's built-in URL and URLConnection classes to fetch web pages.

1. Using WebSPHINX to crawl and show the output as a graph of the web pages crawled (here, crawling Google).

2. Using WebSPHINX to crawl google.com and save all the crawled pages in a user-specified directory on the desktop.

3. The crawler can also concatenate the visited pages into a single web page for printing.

4. Extracting the images from a set of pages. (Opening the output in Windows Explorer actually shows the HTML code.)

UBUNTU:

On Ubuntu, wget is a very powerful tool that can be used to crawl web pages. Apart from crawling, it can mirror a website so that it can be viewed offline, which is extremely useful on unstable network connections. It can also simply download a single file. Information about wget is easily available using the commands info wget and man wget. wget is also available for almost all platforms, including Windows, and is pre-installed on most Linux distributions.

The most useful option in wget is -r, which crawls websites recursively. Apart from wget, we also tried HarvestMan, another freely available crawler. HarvestMan is based on the principle of queues and can also be used from the Python prompt (refer to http://code.google.com/p/harvestman-crawler/wiki/WorldsSimplestCrawler).
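To make the -r option more concrete, here is a hedged sketch of a recursive wget command line. The depth of 2, the one-second wait, the example.com address and the helper name wget_crawl_args are all assumptions for illustration and were not part of our scripts. The script only prints the command (a dry run) instead of running it.

```shell
#!/bin/bash
# Build a polite recursive wget command line for a start URL.
# DEPTH and WAIT are illustrative values, not taken from the original scripts.
DEPTH=2
WAIT=1

wget_crawl_args() {
    # --recursive   follow links found in downloaded pages (long form of -r)
    # --level       limit how many links deep the recursion goes
    # --no-parent   never climb above the starting directory
    # --wait        pause between requests, in seconds
    printf '%s\n' --recursive "--level=$DEPTH" --no-parent "--wait=$WAIT" "$1"
}

# Dry run: show the command instead of executing it.
echo wget $(wget_crawl_args "http://example.com/docs/")
```

For the offline-mirror use mentioned above, wget also has --mirror, which is usually combined with --convert-links and --page-requisites so that the saved pages work when opened from disk.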

In this project we have focused on wget and written two shell scripts that perform the following tasks:

1) The first script asks the user for a website, crawls it, and then displays the links with the extension the user wants.

2) The second script asks the user for a song name and then downloads it in mp3 format.

WGET1.SH:

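#!/bin/bash

# Ask for the page to fetch and the link ending to look for.
echo "Enter the website you want to crawl for links"
read -r site
echo "Enter the link ending you want to match (e.g. png)"
read -r link

# Save the page to xyz.txt.
wget "$site" -O xyz.txt

# Pull out every href value, strip the leading href=" and keep only
# the links that end with the requested text.
links=$(grep -Eo 'href="?[^<>"\;}]{3,}' xyz.txt | sed -e 's/^href=//' -e 's/^"//' | grep -E -- "${link}\$")

for url in $links
do
    echo "$url"
done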
The wget command with -O xyz.txt downloads the page at the given address and stores it in xyz.txt. Adding -r would make wget crawl the site recursively; combined with -O, everything it downloads is then written into that one file.

The links= line reads the file and extracts the links using egrep (grep -E), relying on the fact that in basic HTML every link follows href= :). The characters in the square brackets were found by trial and error. sed strips the leading href= and the opening quote, and the final egrep keeps only the links that end with the text stored in $link.

Finally, the for loop prints all the relevant links.
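The same extract-and-filter idea can be tried without downloading anything. The sketch below is a simplified variant, not the script above: it reads HTML from standard input, the sample HTML and the function name links_with_extension are invented, and it requires a literal dot before the extension.

```shell
#!/bin/bash
# Filter href targets by file extension, reading HTML from standard input.
links_with_extension() {
    local ext=$1
    grep -Eo 'href="[^"<>]+' \
        | sed -e 's/^href="//' \
        | grep -E -- "\.${ext}\$"
}

# Invented sample page with two png links and one html link.
sample='<a href="logo.png">logo</a> <a href="about.html">about</a>
<a href="img/photo.png">photo</a>'

printf '%s\n' "$sample" | links_with_extension png
```

Requiring the dot is a small design choice: matching on png alone would also accept a link that merely ends in those letters.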

WGET2.sh:

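#!/bin/bash

echo "Please enter the song you want to download"
read -r song

# Replace spaces with %20 and lowercase the name so it can go into a URL.
song=$(echo "$song" | sed -e 's/\s/%20/g' | awk '{print tolower($0)}')
query="Index Of:mp3%20$song"

# Fetch the Google results page while presenting a browser user agent.
# The URL is quoted so the space in the query does not split it into two arguments.
wget "http://www.google.co.in/search?q=$query" -O results.txt --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (FM Scene 4.6.1)"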

link=`cat results.txt | egrep -o 'h3 class="r">

link=${link:22}

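# Fetch the page that should list the mp3 files.
wget "$link" -O results.txt

# Collect every quoted string ending in .mp3, with spaces encoded as %20.
download=$(grep -Eo '"[^<>="]+\.mp3' results.txt | sed -e 's/\s/%20/g' -e 's/^"//')

for mp3 in $download
do
    mp3dl="$link$mp3"
    # Compare the names with %20 removed on both sides.
    stripped=$(echo "$mp3dl" | sed -e 's/%20//g')
    original_s=$(echo "$song" | sed -e 's/%20//g')
    want=$(echo "$stripped" | awk '{ if (tolower($0) ~ /'"$original_s"'/) print $0 }' | grep -o ".*mp3" | head -n 1)
    # Download when the name matched; fetching $mp3dl (the unstripped address) is an assumption.
    if [ -n "$want" ]; then
        wget "$mp3dl"
    fi
done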

The song= line replaces every space in the song name with %20 using sed, and awk then converts the whole name to lower case.
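A quick way to see what this step produces is to wrap it in a small function and feed it a sample title. The function name encode_query and the song title are made up for this sketch, and tr is used here for the lowercasing instead of awk.

```shell
#!/bin/bash
# Lowercase a song title and replace each space with %20.
encode_query() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed -e 's/ /%20/g'
}

encode_query "Hey Jude"    # prints hey%20jude
```

Only spaces are encoded. A title containing characters such as & or ? would need full percent-encoding, which this sketch does not attempt.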

The query variable stores the search query, Index Of:mp3 followed by the encoded song name, which searches for pages that contain mp3 files (credit to Apoorv Singh).

Google can be queried in a number of ways. This is one way to go straight to the results page we would normally reach after typing our search on www.google.co.in (except that in this case the results are pages that list mp3 files).

wget "http://www.google.co.in/search?q=$query" -O results.txt --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (FM Scene 4.6.1)"

In this command wget fetches the results page and stores it in results.txt. The user agent is perhaps the most important option, because it tells Google that we are using Mozilla Firefox to search for the song (which we are not) and that this is not an automated bot (which it is :)).

link=`cat results.txt | egrep -o 'h3 class="r">

This command again extracts all the links and ignores sites whose addresses contain the words mp3, mediafire, angelfire or 4shared (which, thanks to IIIT, are blocked or do not work). For personal use the grep -v commands can be omitted. head -n 1 keeps only the first link obtained, which is stored in the variable link.

link=${link:22} deletes the first 22 characters of the link and stores the result back in the link variable.
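Here is a small sketch of what this substring expansion does. It assumes that each match starts with the text h3 class="r"><a href=" (which is 22 characters long); the rest of the matching command is cut off in this post, so treat that prefix and the example.com address as assumptions.

```shell
#!/bin/bash
# Assumed shape of one match from the results page (address invented).
match='h3 class="r"><a href="http://example.com/music/'
prefix='h3 class="r"><a href="'

echo "${#prefix}"            # prints 22, the length of the prefix
echo "${match:22}"           # drops the first 22 characters
echo "${match#"$prefix"}"    # same result without counting characters
```

The last form removes the prefix by pattern, so it keeps working if the prefix ever changes length.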

wget $link -O results.txt fetches the link, so that we reach the website that has the mp3 files (so much effort, when it is so easy in Mozilla).

The download= line then extracts the available download links using egrep, and sed replaces each blank space with %20.

mp3dl="$link$mp3" concatenates the link (the website) and mp3 (an entry from download) to get the full address of the file.

stripped=`echo $mp3dl | sed -e 's/%20//g'`

original_s=`echo $song | sed -e 's/%20//g'`

want=`echo $stripped | awk '{ if (tolower($0) ~ /'$original_s'/) print $0 }' | grep -o ".*mp3" | head -n 1`

In these commands the name of the song in the link is compared with the original song name, with %20 removed from both, using awk, which lowercases the link and prints it only if it contains the song name.

After that, grep keeps the text up to the mp3 extension.

Now, if $want is not empty, the song is downloaded using wget :)
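The matching and the final "download only if something matched" decision can be sketched offline as follows. The function pick_match and the candidate addresses are invented for this illustration, the comparison uses a shell case pattern rather than awk, and the download itself is replaced by an echo so that nothing is fetched.

```shell
#!/bin/bash
# Print the first candidate URL whose name contains the song title.
# %20 and spaces are ignored on both sides; matching is case-insensitive.
pick_match() {
    local song=$1 url stripped wanted
    wanted=$(printf '%s' "$song" | tr '[:upper:]' '[:lower:]' | sed -e 's/%20//g' -e 's/ //g')
    while read -r url; do
        stripped=$(printf '%s' "$url" | tr '[:upper:]' '[:lower:]' | sed -e 's/%20//g')
        case $stripped in
            *"$wanted"*.mp3) echo "$url"; return 0 ;;
        esac
    done
    return 1
}

# Invented candidate list standing in for the links found on the page.
candidates='http://example.com/music/Let%20It%20Be.mp3
http://example.com/music/Hey%20Jude.mp3'

if want=$(printf '%s\n' "$candidates" | pick_match "hey jude"); then
    echo "would download: $want"    # a real script would run: wget "$want"
else
    echo "no match found"
fi
```

Note that the sketch prints the original address, with its %20 still in place, because that is the form a web server expects.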

NOTE:

wget2.sh won't always work, because some sites use tricks to fake indexed pages. Such sites use a configuration file, read by the Apache web server, that holds settings such as allowing indexes and managing redirects. What happens in the background on those nasty sites is that the configuration contains a redirect routine for mp3 files, which sends you to another page.

PS: Please do not use this mp3 script excessively in IIIT. A similar script for downloading movies was made by Apoorv, Inshu and me (Tiru), and Apoorv was banned from using the net because we exceeded the bandwidth limit (we all used his account to download movies, although he does not have any of them; all of those films are with me and Inshu).

EXPERIENCE:

It was a great feeling working together for the blog.

We learnt a lot from the project.

Some of the links we referred to:

google.com (our best friend), Yahoo Answers, Wikipedia

Contributed by:

Tiru Sharma (2011116)

Kushagar lall(2011061)


Sunday, 2 October 2011
