A new approach to Web Crawling — DHEKTS Crawler in comparison with various Crawlers

Objectives: To propose a crawler that visits websites to collect information and builds a search engine index for reference; to compare the license, implementation language, and effectiveness of various crawlers with the proposed DHEKTS crawler; to compare their characteristics, tasks, and functions with those of the proposed DHEKTS crawler; and to identify the merits of the DHEKTS Crawler. Methods: A new crawler called DHEKTS is developed to filter and synchronize documents such as images, links, and HTML code from a given website. The crawler is unique in that it returns all the details of a particular website, including images, links, HTML code, and contents. It can crawl through the links of a specified website and continue to the further links found on that website. The DHEKTS Crawler is designed for depth and relevance crawling, and comprises several crawling mechanisms that support a variety of information. The requirements are: Operating System: Windows 7 or higher; Front End: PHP; Back End: MySQL; RAM: minimum 4 GB; Server: high-speed server with good storage capacity. Findings: The DHEKTS Crawler retrieves web-related links, images, HTML code, and information up to the fifth level of crawling, and its relevance search returns relevant information. Multiple crawlers each fulfill some of the major functions of crawling, but the DHEKTS Crawler is built to execute all of these functions in one crawler. Applications: The crawler is applied to crawling various websites and retrieving valuable data.


Introduction
A web crawler systematically browses the WWW for the purpose of indexing. Web search engines use crawlers to update their web content and to index other sites. Usually, a crawler begins with a popular site and indexes the words of its pages, following the links within the site. The WWW provides a great amount of useful information electronically in the form of hypertext, but because this information is unstructured and changes dynamically, it is difficult to find the requisite information. A web crawler automatically traverses the web by downloading documents page by page (1). Crawling is made more difficult by the large volume of dynamic pages on the WWW (2). According to Junghoo Cho et al. (3), a crawler is an Internet bot that systematically browses the WWW (4) for web indexing. Web search engines use crawling to index sites and keep web content up to date. A crawler copies pages for processing by a search engine. Roughly speaking, a crawler (5) starts by placing an initial set of URLs in a queue, in which all URLs to be retrieved are kept and prioritized. From this queue, the crawler takes a URL (in some order), downloads the page, extracts the URLs from the downloaded page, and puts them in the URL queue. The collected pages are later used by other applications such as a web search engine or a web cache.
In this study, various crawlers (6) such as JSpider, Google Bot (Google), Httrack, Methabot, WebSphinix, Gnu Wget, WIRE, Pavuk, Scrapy, Bing Bot (Microsoft), Heritrix, Slurp 3.0 (Yahoo), WebHTTrack, MSN Bot (Microsoft), and Web2disk are compared with the DHEKTS Crawler. The functionality (7), effectiveness (8), and tasks (9) of these crawlers are studied in detail. The features of one crawler are often absent from another, and no existing crawler implements all of these features together. This is the problem addressed by the study: to build a single crawler that systematically browses the WWW to index information and supports multiple crawling features, such as retrieving links, images, and HTML source, and depth crawling (10). The aim is thus to develop a crawler that combines the features of diverse crawlers (11), reducing the time spent consulting multiple crawlers (12) to complete a task. The proposed crawler also makes it easy to consolidate the outcomes of crawling.

DHEKTS Crawler
A new crawler called DHEKTS is developed to filter and synchronize documents such as images, links, and HTML code from a given website. The crawler is unique in that it returns all the details of a particular website, including its images, links, and files. It can crawl through the links of a specified website and continue to the further links found on that website. The DHEKTS Crawler is designed for depth and relevance crawling, and comprises several crawling mechanisms that support a variety of information.

Image Crawler
The DHEKTS Image Crawler browses all images (13) (jpg, gif, png, etc.) of a website recursively and collects them. The images are shown as thumbnails together with their URLs. The DHEKTS Image Crawler does not store the resulting images in a database; it displays the results directly on the screen.
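The image collection step for a single page might look roughly like the following Python sketch. It is an illustration only, not the DHEKTS implementation (which the paper describes as PHP-based): the names `ImageCollector`, `collect_images`, and `IMAGE_EXTENSIONS` are hypothetical, and the page HTML is assumed to have been downloaded already.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed extension list; the text names jpg, gif and png "etc."
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png")


class ImageCollector(HTMLParser):
    """Collects absolute image URLs from the <img> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Resolve relative paths against the page URL.
        absolute = urljoin(self.base_url, src)
        path = urlparse(absolute).path.lower()
        if path.endswith(IMAGE_EXTENSIONS) and absolute not in self.images:
            self.images.append(absolute)


def collect_images(html, base_url):
    """Return the image URLs of a page in order of first appearance."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.images
```

The result is returned as a list rather than written to storage, mirroring the description that results are displayed directly instead of being saved to a database.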

Link Crawler
The DHEKTS Link Crawler crawls all links of a website. It gathers all internal and external links and produces the page heading, URL, and hyperlinks of the website. The crawler thus acts like a site map provider for any website. It also displays its results without storing them.
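As a rough illustration of gathering internal and external links, the following Python sketch extracts the page title and every hyperlink from one page. The names `LinkCollector` and `collect_links` are hypothetical, and treating "same host as the page" as the definition of an internal link is an assumption, since the paper does not say how DHEKTS makes the distinction.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the page title and all <a href> links of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []  # list of (absolute_url, "internal" | "external")
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base_url, href)
                # Assumption: a link is internal when it stays on the same host.
                same_host = urlparse(absolute).netloc == urlparse(self.base_url).netloc
                self.links.append((absolute, "internal" if same_host else "external"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def collect_links(html, base_url):
    """Return the page heading and classified links of one page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return {"title": parser.title, "url": base_url, "links": parser.links}
```

Running this over every page of a site would yield the kind of heading, URL, and hyperlink listing that a site map needs.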

HTML Crawler
This crawler crawls a website and lists all HTML links and the HTML code (14) of the entire website. It is useful for analyzing the coding techniques and structure of a website. Even when a site restricts downloading or the right-click option, this crawler retrieves the HTML code (15).

Depth Crawler
The DHEKTS Depth Crawler (16) crawls the entire website and continues crawling other websites by following the links found on it. Crawl depth is the degree to which a web search engine goes into the interior of a website. The majority of sites contain multiple pages and subpages, which grow deeper in a manner similar to the way folders and subfolders (or directories and subdirectories) grow deeper in computer storage. By default, a home page has a crawl depth of 0. Pages linked from the home page have a crawl depth of 1, pages linked directly from a crawl-depth-1 page have a crawl depth of 2, and so on. The DHEKTS Depth Crawler is developed to reach a crawl depth of 5.
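The depth numbering described above corresponds to a breadth-first traversal with a depth limit. The following Python sketch shows one way to compute it; `crawl_depths` and `get_links` are hypothetical names, and `get_links` stands in for whatever routine downloads a page and returns the URLs it links to.

```python
from collections import deque

MAX_DEPTH = 5  # crawl depth limit stated in the text


def crawl_depths(seed, get_links, max_depth=MAX_DEPTH):
    """Breadth-first crawl that records the crawl depth of every page.

    The seed (home page) has depth 0, pages linked from it depth 1,
    and so on. Pages at max_depth are recorded but not expanded.
    """
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in depths:  # first visit gives the shortest depth
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths
```

Because the traversal is breadth-first, each page receives the smallest depth at which it is reachable, even when it is linked from several levels.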

Relevance Crawler
Finally, the component of the DHEKTS Crawler that brings relevant information from the WWW is called the Relevance Crawler. This crawler works from the search keywords, the number of keywords present in a particular website, and the user relevance rating given to the website. Today, extracting images, links, and HTML source from the web is difficult, and existing crawlers are not sufficient for answering certain queries. Every crawler has its own specialized functions, such as crawling an entire website and displaying the result, supporting multiple threads, supporting HTTP proxies and cookies, or partial local file system support. This paper presents a new approach to web crawling using the DHEKTS Crawler, which is quite different from the prominent crawlers.
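A keyword-count ranking of this kind could be sketched in Python as follows. The paper does not specify how DHEKTS combines keyword counts with the user relevance rating, so the ordering used here (keyword count first, user rating as a tie-breaker) is purely an assumption, and `keyword_count` and `rank_pages` are hypothetical names.

```python
import re


def keyword_count(text, keywords):
    """Count how many words of the text match one of the search keywords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    wanted = {k.lower() for k in keywords}
    return sum(1 for w in words if w in wanted)


def rank_pages(pages, keywords, ratings=None):
    """Rank pages by keyword count, then by user rating (assumed rule).

    pages:   dict mapping URL -> page text
    ratings: optional dict mapping URL -> user relevance rating
    Returns a list of (url, keyword_count, rating), most relevant first.
    """
    ratings = ratings or {}
    scored = [
        (url, keyword_count(text, keywords), ratings.get(url, 0))
        for url, text in pages.items()
    ]
    # Higher count first, then higher rating, then URL for a stable order.
    scored.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return scored
```

A weighted combination of the two signals would be an equally plausible reading of the description; the tie-breaker form is used only because it is the simplest to state.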

Architecture of DHEKTS Crawler
DHEKTS, the proposed crawler, is designed to overcome the difficulty of consulting multiple crawlers (17) for searching. It is a single system that can supply all the information of a website. The whole system is divided into components: image crawling, link crawling, HTML crawling, depth crawling, and the DHEKTS search engine. The system has a component for initializing the URL, a DOM-loading component that checks the existence of the URL, and a database for storing information. The DHEKTS system searches data on the WWW to filter the intended objects (18), and provides multiple crawling functions applicable to various objects. The crawler finds the details of any website and crawls the links of a particular website. It is developed for deep crawling and for implementing relevance (19) in results. The components of the proposed crawler are named after the kind of crawling they perform. The user crawls the WWW for multiple objectives, and the outcome can be stored in the database. The appropriate component (20) of the system must be chosen to match the purpose of crawling. The depth crawler is designed to crawl up to level 5 for interior crawling (21). Since the work consists of crawling (22) the WWW, the results obtained are the existing information of the website. In short, the DHEKTS crawler combines image, link, HTML, depth, and relevance crawling in a single piece of software.

Working of DHEKTS Web Crawler
1. Initiate crawling.
2. Input the seed URL and determine the IP address of the target server using the Domain Name Server.
3. Extract the robots.txt file from the server and verify permission.
4. Verify the protocol of the underlying host, such as http, ftp, or gopher.
5. Based on the protocol of the host, download the document.
6. Identify the document format, such as doc, xls, ppt, html, or pdf.
7. Extract the links or references of the website.
8. Store the document and URLs in the search engine buffer.
9. Repeat the steps above until the queue is empty.
Starting with the seed URL, the DHEKTS crawler crawls all links found in the HTML pages until the designated URL queue is empty.
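The core of this loop (permission check, download, link extraction, storage, repeat until the queue is empty) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the DHEKTS code: the `crawl` function, its `fetch` and `extract_links` callbacks, and the user agent string `dhekts-sketch` are hypothetical, the robots.txt content is passed in as text rather than downloaded, and the DNS lookup, protocol check, and format identification steps are omitted.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl(seed, fetch, extract_links, robots_txt, user_agent="dhekts-sketch"):
    """Queue-driven crawl that honours robots.txt rules.

    fetch(url)              -> document, or None if it cannot be downloaded
    extract_links(url, doc) -> iterable of URLs referenced by the document
    robots_txt              -> text of the server's robots.txt file
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())  # verify permission

    queue = deque([seed])  # URLs waiting to be crawled
    seen = {seed}          # avoids queueing a URL twice
    store = {}             # stands in for the search engine buffer

    while queue:           # repeat until the queue is empty
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue       # disallowed by robots.txt
        document = fetch(url)  # download the document
        if document is None:
            continue
        store[url] = document  # store the document
        for link in extract_links(url, document):  # extract links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store
```

Passing `fetch` and `extract_links` as parameters keeps the loop independent of the protocol and document format, which the numbered steps treat as separate concerns.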

Functions of Various Crawlers
The tasks of various crawlers are organized below.

Crawlers Platform
The first feature of DHEKTS is crawling the images of websites. It has crawled the images of 5games.com: the mining process of the DHEKTS crawler mined the WWW and displayed all images linked with the website concerned. The results are tabulated with images and their URLs. The second feature of the DHEKTS crawler is crawling the links of websites. It has crawled the links of icc-criket.com, and its mining process displayed all links associated with the crawled website. The results are tabulated with page heading, URL, and hyperlinks.
The performance of the DHEKTS Crawler on different websites is compared with respect to time. The precision of the DHEKTS Crawler is always one, since the number of relevant pages retrieved is the same as the number of pages retrieved. Multiple crawlers each fulfill some of the major functions of crawling, but the DHEKTS Crawler is built to execute all of these functions in one crawler.
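For reference, precision in the standard information-retrieval sense is the fraction of retrieved pages that are relevant. The short Python sketch below computes it; the function name `precision` and the choice to return 0.0 for an empty result set are assumptions made for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved pages that are also relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # assumed convention when nothing was retrieved
    return len(retrieved & set(relevant)) / len(retrieved)
```

When every retrieved page is relevant, the intersection equals the retrieved set and the value is 1.0, which is the situation the text describes.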

Merits of DHEKTS Crawler
requirement of the hour which can better understand the user requests. The work can be extended to build an effective content mining crawler to satisfy future trends.