scraping Archives - Knowledge Nation

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Source: Wikipedia

In simple terms, scraping means extracting data from a web page by processing its HTML, and it is extremely powerful. It is used for purposes such as analyzing web pages, aggregating data from multiple sources, researching trends, and more.

Web scraping using PHP

Before starting to scrape with PHP, you should have a basic knowledge of DOMDocument and cURL.
Let’s begin with an example. For the website wiredskill.com, a tech news aggregator, we want to do the following:

get title and title length
get meta description
get all h1 tags
get all links from the page
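
Before involving any network request, the DOMDocument calls behind these goals can be tried on an inline HTML string. The snippet below is only a minimal sketch: the `$html` markup is a made-up sample, not content taken from wiredskill.com.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><head><title> Example page </title>'
      . '<meta name="description" content="A sample description.">'
      . '</head><body><h1>Hello</h1><a href="/about">About</a></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList; item(0) is the first match.
$titles = $dom->getElementsByTagName('title');
$title  = $titles->length ? trim($titles->item(0)->nodeValue) : '';

$h1   = $dom->getElementsByTagName('h1')->item(0);
$link = $dom->getElementsByTagName('a')->item(0);

echo $title, "\n";                        // Example page
echo strlen($title), "\n";                // 12
echo $h1->nodeValue, "\n";                // Hello
echo $link->getAttribute('href'), "\n";   // /about
```

The same calls are applied to the real page once its HTML has been fetched with cURL.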

Steps to follow

get the page content using cURL
load the content into a DOMDocument
if a title exists, get the title and its length
if a meta description exists, get the meta description
get the h1 tags and links

<?php
// url for sending http request
$url = "http://wiredskill.com";
// this will initialize a curl session
$ch = curl_init(); 
// set url for http request
curl_setopt($ch, CURLOPT_URL,$url); 
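// return the response as a string instead of printing it
curl_setopt($ch,CURLOPT_RETURNTRANSFER, true); 
// execute the request; curl_exec() returns false on failure
$output = curl_exec($ch); 
if( $output === false )
	exit('curl error: ' . curl_error($ch)); // nothing to parse, stop here
//end of curl request
curl_close($ch);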

// initialize dom document
$dom = new DOMDocument();
// load curl output in dom; @ suppresses the warnings raised for malformed HTML
@$dom->loadHTML($output);
// return a node list
// searches for all specified tag name in the dom
$title_node_list = $dom->getElementsByTagName('title');

// always a good practice to initialize variables
// ($meta_description stays "" if the page has no meta description)
$title = $meta_description = "";
$anchor_tags = $h1_headings = [];
// check if node list has length
if( $title_node_list->length )
	$title = trim($title_node_list->item(0)->nodeValue);
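// get title length; strlen() counts bytes, not characters, so a title with
// non-ASCII characters needs mb_strlen() from the mbstring extension instead
$title_length = strlen($title);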
var_dump($title); 
//string(84) "Home | Tech - Web Design - UI&UX - Startups - Gadgets News at one place | WiredSkill"

var_dump($title_length); // int(84)

$meta_tags = $dom->getElementsByTagName('meta');
foreach( $meta_tags as $meta_tag )
{
	// getAttribute() returns "" when the attribute is missing, so this
	// comparison also skips meta tags that have no 'name' attribute
	if( $meta_tag->getAttribute('name') === 'description' )
	{
		// check if meta tag has any attribute "content"
		if( $meta_tag->hasAttribute('content') )
			$meta_description = $meta_tag->getAttribute('content');
	}
}
var_dump($meta_description); 
//string(81) "WiredSkill provides tech, web design, ui-ux, startups, gadgets news at one place."

$headings_h1 = $dom->getElementsByTagName('h1');
foreach( $headings_h1 as $heading )
{
	// get text between <h1>some text</h1>
	$h1_headings[] = trim($heading->nodeValue);
	
}
print_r($h1_headings);
/*
Array
(
    [0] => WiredSkill beta
)
*/
$all_anchors_tags = $dom->getElementsByTagName('a');
foreach ($all_anchors_tags as $anchor )
{
	// check if anchor tag has attribute href
	if( $anchor->hasAttribute('href') )
	{
		// get "href" attribute of anchor tag
		$anchor_tags[] = $anchor->getAttribute('href');
	}
}

print_r($anchor_tags);
/*
Array
(
    [0] => http://wiredskill.com
    [1] => http://wiredskill.com/category/tech
    [2] => http://wiredskill.com/category/webdesign
    [3] => http://wiredskill.com/category/uiux
    .. many more 
)
*/
// free the variables that are no longer in use
unset($output,$dom);
?>
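
As an alternative to looping over `getElementsByTagName()` results, the same four values could be collected with XPath queries. The sketch below is not the script above rewritten by its author; it swaps in `DOMXPath`, and the function name `summarize_html` and the sample markup are hypothetical. One behavioral difference to be aware of: `string(...)` returns the first matching meta description, whereas the loop above keeps the last one.

```php
<?php
/**
 * Hypothetical helper: extracts title, meta description, h1 texts and
 * link targets from an HTML string using XPath.
 */
function summarize_html(string $html): array
{
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // string(...) evaluates to '' when nothing matches, so no existence check is needed.
    $title       = trim($xpath->evaluate('string(//title)'));
    $description = $xpath->evaluate('string(//meta[@name="description"]/@content)');

    $h1 = [];
    foreach ($xpath->query('//h1') as $node) {
        $h1[] = trim($node->nodeValue);
    }

    // Only anchors that actually carry an href attribute are selected.
    $links = [];
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[] = $node->getAttribute('href');
    }

    return [
        'title'            => $title,
        'title_length'     => strlen($title), // byte length, as in the script above
        'meta_description' => $description,
        'h1'               => $h1,
        'links'            => $links,
    ];
}

// Made-up sample markup; in practice $html would be the cURL output.
$sample = '<html><head><title>Demo</title>'
        . '<meta name="description" content="Short text."></head>'
        . '<body><h1> One </h1><a href="/a">A</a><a>no href</a></body></html>';

$summary = summarize_html($sample);
print_r($summary);
```

Keeping the parsing in a function that takes a string makes it possible to try it on saved or hand-written HTML without sending a request to any website.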

All in all, web scraping with PHP is very easy, but it breaches the policies of many websites, so treat it as a learning exercise. I therefore recommend using it wisely, and only after reading all the terms and conditions of the website you intend to collect data from.

This article is only meant to share knowledge about scraping. We are not responsible for any harm caused to any website by anyone using the above script.
