Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Source: Wikipedia
In very simple terms, scraping means extracting data from a web page by processing its HTML, and it is extremely powerful. It is used for many purposes, such as analyzing web pages, aggregating data from multiple sources, researching trends, and more.
Web scraping using PHP
Before you start scraping with PHP, you should have a basic knowledge of DOMDocument and cURL.
Let’s begin with an example. For the website wiredskill.com, a tech news aggregator, we want to do the following:
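For readers new to DOMDocument, the following is a minimal sketch of parsing an HTML string and reading one element from it. The HTML snippet and the variable names are made up for illustration, and no network request is involved; cURL is only needed once the page has to be fetched.

```php
<?php
// Hypothetical HTML snippet, used only for illustration
$html = '<html><head><title> Hello World </title></head>'
      . '<body><h1>First heading</h1></body></html>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList; item(0) is the first match
$titles = $doc->getElementsByTagName('title');
$page_title = $titles->length ? trim($titles->item(0)->nodeValue) : '';

echo $page_title, "\n";
```

The same pattern (get a node list, check its length, read nodeValue or an attribute) is all the scraping script needs.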
- get title and title length
- get meta description
- get all h1 tags
- get all links from the page
Steps to follow
- get page content using cURL
- load content into a DOMDocument
- if title exists, get title and its length
- if meta description exists, get meta description
- get h1 tags and links
<?php
// URL to send the HTTP request to
$url = "http://wiredskill.com";

// initialize a cURL session
$ch = curl_init();

// set the URL for the HTTP request
curl_setopt($ch, CURLOPT_URL, $url);

// return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// execute the request
$output = curl_exec($ch);

// end of cURL request
curl_close($ch);

// curl_exec() returns false on failure, so stop before parsing
if ($output === false || $output === '') {
    exit("Could not fetch $url\n");
}

// initialize DOMDocument
$dom = new DOMDocument();

// load the cURL output into the DOM
// (@ suppresses the warnings that malformed HTML would trigger)
@$dom->loadHTML($output);

// always a good practice to initialize variables
$title = $meta_description = "";
$anchor_tags = $h1_headings = [];

// returns a node list of all elements with the given tag name
$title_node_list = $dom->getElementsByTagName('title');

// check if the node list has any items
if ($title_node_list->length) {
    $title = trim($title_node_list->item(0)->nodeValue);
}

// get title length (strlen() counts bytes; use mb_strlen() for multibyte titles)
$title_length = strlen($title);

var_dump($title);
// string(84) "Home | Tech - Web Design - UI&UX - Startups - Gadgets News at one place | WiredSkill"
var_dump($title_length);
// int(84)

$meta_tags = $dom->getElementsByTagName('meta');
foreach ($meta_tags as $meta_tag) {
    // look for the meta tag whose "name" attribute is "description"
    if ($meta_tag->hasAttribute('name')
        && $meta_tag->getAttribute('name') == 'description') {
        // check if the meta tag has a "content" attribute
        if ($meta_tag->hasAttribute('content')) {
            $meta_description = $meta_tag->getAttribute('content');
        }
    }
}
var_dump($meta_description);
// string(81) "WiredSkill provides tech, web design, ui-ux, startups, gadgets news at one place."

$headings_h1 = $dom->getElementsByTagName('h1');
foreach ($headings_h1 as $heading) {
    // get the text between <h1> and </h1>
    $h1_headings[] = trim($heading->nodeValue);
}
print_r($h1_headings);
/*
Array ( [0] => WiredSkill beta )
*/

$all_anchors_tags = $dom->getElementsByTagName('a');
foreach ($all_anchors_tags as $anchor) {
    // check if the anchor tag has an href attribute
    if ($anchor->hasAttribute('href')) {
        // get the "href" attribute of the anchor tag
        $anchor_tags[] = $anchor->getAttribute('href');
    }
}
print_r($anchor_tags);
/*
Array (
    [0] => http://wiredskill.com
    [1] => http://wiredskill.com/category/tech
    [2] => http://wiredskill.com/category/webdesign
    [3] => http://wiredskill.com/category/uiux
    .. many more
)
*/

// always a good practice to unset variables which are no longer in use
$output = $dom = null;
unset($output, $dom);
?>
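The same extraction logic can also be wrapped in a function that takes an HTML string, which makes it easy to try out without hitting a real website. This is only a sketch and not part of the original script: the function name extract_page_info and the sample HTML are hypothetical, and it assumes the string passed in is not empty.

```php
<?php
// Hypothetical helper: collects title, meta description, h1 texts and links.
// Assumes $html is a non-empty string.
function extract_page_info(string $html): array
{
    $info = [
        'title' => '', 'title_length' => 0, 'meta_description' => '',
        'h1' => [], 'links' => [],
    ];

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep malformed-HTML warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length) {
        $info['title'] = trim($titles->item(0)->nodeValue);
        $info['title_length'] = strlen($info['title']);
    }

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        // getAttribute() returns an empty string when the attribute is missing
        if ($meta->getAttribute('name') === 'description') {
            $info['meta_description'] = $meta->getAttribute('content');
        }
    }

    foreach ($dom->getElementsByTagName('h1') as $h1) {
        $info['h1'][] = trim($h1->nodeValue);
    }

    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $info['links'][] = $a->getAttribute('href');
        }
    }

    return $info;
}

// Made-up sample page for a quick local check
$sample = '<html><head><title>Demo</title>'
        . '<meta name="description" content="A demo page"></head>'
        . '<body><h1>Hello</h1><a href="/about">About</a><a>no href</a></body></html>';

$info = extract_page_info($sample);
print_r($info);
```

To use it on a live page, pass the string returned by curl_exec() to the function once you have checked that the request succeeded.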
All in all, web scraping with PHP is very easy. Keep in mind, though, that scraping breaches the policies of many websites, so treat it as something to do for fun and learning. I really recommend using it wisely, and only after reading all the terms and conditions of the website you intend to collect data from.
This article is only meant to share knowledge about scraping; we are not responsible for any harm caused to any website by anyone using the script above.