Web Crawling (Date Crawling Using C#)

A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. Web crawling is an advanced topic for researchers in the field of data mining, and almost all modern languages provide easy-to-use methods for implementing it. A search engine is an information retrieval system that helps users find information stored on one or more computer systems.

Fig. 01: Web crawling (picture courtesy: www.sayonetech.com)

The search results, commonly known as “hits”, are presented to the user as a list. Current search engines such as Google, Yahoo and MSN return millions of records for a single query, and finding the relevant information among them is difficult and time-consuming. These search engines search on the keywords mentioned in the query. Date-sensitive search engines can additionally give priority to the dates mentioned in the query. Such an engine considers only the dates mentioned inside the text (the page contents), not the date on which the page was created, updated or published.
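To make the idea of "dates mentioned inside the text" concrete, here is a minimal sketch that pulls dates out of a piece of page text. It is an illustration only: the class name DateMentions, the sample sentence and the assumed date format (day, month name, four-digit year, e.g. "15 March 2021") are not from the original post, and a real date-sensitive engine would have to handle many more formats.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DateMentions
{
    // Assumed format: "15 March 2021" or "5 Apr 2021" (day, month name, year).
    private static readonly Regex DatePattern = new Regex(
        @"\b(\d{1,2})\s+([A-Za-z]{3,9})\s+(\d{4})\b");

    private static readonly string[] Formats = { "d MMMM yyyy", "d MMM yyyy" };

    // Returns every date found in the text that parses under the assumed formats.
    public static List<DateTime> Extract(string text)
    {
        var found = new List<DateTime>();
        foreach (Match m in DatePattern.Matches(text))
        {
            // Rebuild the candidate with single spaces so it fits the exact formats.
            string candidate = m.Groups[1].Value + " " + m.Groups[2].Value + " " + m.Groups[3].Value;
            DateTime parsed;
            if (DateTime.TryParseExact(candidate, Formats, CultureInfo.InvariantCulture,
                                       DateTimeStyles.None, out parsed))
            {
                found.Add(parsed);
            }
        }
        return found;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical page text, invented for this example.
        string pageText = "Mid-term exams start on 15 March 2021 and results follow on 5 Apr 2021.";
        foreach (DateTime d in DateMentions.Extract(pageText))
        {
            Console.WriteLine(d.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
        }
    }
}
```

The regular expression only finds candidates; DateTime.TryParseExact then rejects strings that look like dates but are not, such as "45 Foo 2021".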

Here we implement a web crawler that crawls dates, days and events from a web page. The link to the web page is given below:
http://vu.edu.pk/StudentServices/AcademicCalendar.aspx
The events listed against the dates are crawled and saved into a list. We followed these steps to complete the task.

  1. Create three lists, one for each crawled item: date, day and event.
  2. Write regular expressions that extract the desired items from the web page.
  3. Create a WebClient object named web and download the page's HTML.
  4. Create a MatchCollection object for each crawled item.
  5. For each MatchCollection, use a foreach loop to store the matches in the corresponding list.
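Code for crawler:

using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class DateCrawler
{
    static void Main()
    {
        // One list per crawled item.
        List<string> dates = new List<string>();
        List<string> days = new List<string>();
        List<string> eventList = new List<string>();

        // Download the raw HTML of the academic calendar page.
        string html;
        using (WebClient web = new WebClient())
        {
            html = web.DownloadString("http://vu.edu.pk/StudentServices/AcademicCalendar.aspx");
        }

        // Each pattern captures the text inside one kind of element. The patterns depend on
        // the page's exact markup, so check them against the page source before use.
        MatchCollection m1 = Regex.Matches(html, @"<span class='txtsmall'>\s*(.+?)\s*</span>", RegexOptions.Singleline);
        MatchCollection m2 = Regex.Matches(html, @"<span style='text-align:left;' >\s*(.+?)\s*</span>", RegexOptions.Singleline);
        // NOTE: this pattern has no tag name after '<'; it is kept as published.
        MatchCollection m3 = Regex.Matches(html, @"<class=txtsmall >\s*(.+?)\s*</span>", RegexOptions.Singleline);

        // Group 1 holds the captured text of each match.
        foreach (Match m in m1)
        {
            eventList.Add(m.Groups[1].Value);
        }
        foreach (Match m in m2)
        {
            days.Add(m.Groups[1].Value);
        }
        foreach (Match m in m3)
        {
            dates.Add(m.Groups[1].Value);
        }
    }
}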
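The same capture-group technique can be tried offline, without depending on the live page. The sketch below runs it against a small HTML string and then pairs the three lists into rows. Everything in it is an assumption made for illustration: the markup, the class names date, day and event, and the helper names CalendarRows, Capture and Combine do not come from the actual page or the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CalendarRows
{
    // Returns the first capture group of every match, trimmed of whitespace.
    public static List<string> Capture(string html, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            values.Add(m.Groups[1].Value.Trim());
        }
        return values;
    }

    // Pairs the i-th date, day and event; stops at the shortest list.
    public static List<string> Combine(List<string> dates, List<string> days, List<string> events)
    {
        int count = Math.Min(dates.Count, Math.Min(days.Count, events.Count));
        var rows = new List<string>();
        for (int i = 0; i < count; i++)
        {
            rows.Add(dates[i] + " | " + days[i] + " | " + events[i]);
        }
        return rows;
    }
}

public static class CalendarDemo
{
    public static void Main()
    {
        // Hypothetical markup, invented for this example only.
        string html =
            "<tr><td class='date'>01 Oct 2020</td><td class='day'>Thursday</td>" +
            "<td class='event'> Classes begin </td></tr>" +
            "<tr><td class='date'>25 Dec 2020</td><td class='day'>Friday</td>" +
            "<td class='event'>Holiday</td></tr>";

        List<string> dates = CalendarRows.Capture(html, @"<td class='date'>\s*(.+?)\s*</td>");
        List<string> days = CalendarRows.Capture(html, @"<td class='day'>\s*(.+?)\s*</td>");
        List<string> events = CalendarRows.Capture(html, @"<td class='event'>\s*(.+?)\s*</td>");

        foreach (string row in CalendarRows.Combine(dates, days, events))
        {
            Console.WriteLine(row);
        }
    }
}
```

Pairing by index assumes every row of the page yields exactly one date, one day and one event. If a pattern misses a cell, the lists fall out of step, so comparing the three counts before combining is a cheap sanity check.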
