This tutorial shows how to build a scraping library [based on cURL] for the CodeIgniter [CI] MVC framework. I am using CI version 2.1.3 and SimpleHtmlDom version 1.5 for this tutorial.

In this tutorial, I am not going into detail about the CI framework itself; CI has a very good user guide if you need help.

Let's start the journey. First, make sure that the CI framework is ready to run. Place the SimpleHtmlDom [ http://sourceforge.net/projects/simplehtmldom/files/ ] file [simple_html_dom.php] in the application/libraries folder.

Create a new library file named Scraping.php [CI expects library file names to be capitalized]. Set proper permissions on the file so that the web server can read it [e.g. chmod 644].

Let's start building our scraping library by editing the Scraping.php file.
1. First, right after the PHP open tag, ensure that the script cannot be accessed directly:

<?php
if ( ! defined('BASEPATH')) exit('No direct script access allowed');


2. Let's include the SimpleHtmlDom parser:

include_once(dirname(__FILE__) . '/simple_html_dom.php');


3. Define the class [the properties and methods in the following steps all go inside this class body]:

class Scraping {}


4. Set up the temporary path variable and its helper methods:

private $tmp_path;

public function __construct($tmp = false) {
  if ($tmp) {
    // Use the caller-supplied temporary directory
    $this->tmp_path = rtrim($tmp, '/') . '/';
  } else {
    // Default to <CI root>/tmp/ (this file lives in application/libraries)
    $this->tmp_path = dirname(dirname(dirname(__FILE__))) . '/tmp/';
    $this->tmp_dir_check();
  }
  $this->clearTmp();
}

// Create the default tmp directory if it does not exist yet
private function tmp_dir_check()
{
  if (!is_dir($this->tmp_path)) {
    mkdir($this->tmp_path);
  }
}

// Delete leftover temporary files (e.g. cookie jars) from previous runs
private function clearTmp()
{
  foreach (glob($this->tmp_path . '*.txt') as $file) {
    unlink($file);
  }
}


Here the __construct method checks whether a temporary directory path was given when the class was initialized. If so, it is assigned to the tmp_path variable; otherwise, tmp_path defaults to the tmp folder of the CI root directory and the private method tmp_dir_check is invoked.

The private method tmp_dir_check checks whether the temporary directory exists and, if not, tries to create the tmp folder in the CI root directory.

The last method invoked from __construct is clearTmp. This method looks for existing temporary files and, if any are found, deletes them all.
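
For example, the library can take a custom temporary directory when it is instantiated. A minimal sketch [the /var/tmp/ci_scrape path is just a placeholder]:

// Direct instantiation with a custom tmp directory:
$scraper = new Scraping('/var/tmp/ci_scrape');

// Via CI's loader with no parameter, the default <CI root>/tmp/ is used:
$this->load->library('scraping');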

5. Now we will create a method that scrapes a page using the SimpleHtmlDom parser:

public function page($url)
{
  return file_get_html($url);
}


This method uses SimpleHtmlDom's built-in helper [file_get_html()]. It is generally suitable for pages that do not require POST data, GET parameters, or cookies.
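
Once the page is loaded, SimpleHtmlDom's CSS-like selectors are available on the returned object. A quick illustration [hypothetical usage from a controller; the URL is a placeholder]:

// Grab all links from a page and print their targets
$html = $this->scraping->page('http://example.com/');
foreach ($html->find('a') as $link) {
  echo $link->href . '<br />';
}
$html->clear(); // free the DOM; SimpleHtmlDom holds a lot of memory otherwise
unset($html);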

6. Now we will create a method that handles cookies via cURL. This method can be extended to accept a proxy, GET/POST parameters, and a lot more [a sketch of one possible extension follows the cURL notes below]. I am not making it complex here, as the goal is just to give you hints on how to build your own scraping library for CI.

private function scrapeIt($url, $redirect = 0, $cookieFile = '')
{
  // Use a fresh, timestamped cookie jar in the tmp dir unless one is supplied
  if ($cookieFile == '') {
    $cookieFile = $this->tmp_path . time() . '_tmp.txt';
  }
  $ch = curl_init();
  $headers = array("Expect:");  // suppress the "Expect: 100-continue" header
  curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
  // Skip SSL certificate checks (convenient for scraping, but insecure)
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
  curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
  curl_setopt($ch, CURLOPT_HEADER, 0);  // exclude response headers from the output
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
  curl_setopt($ch, CURLOPT_TIMEOUT, 30);
  if ($redirect) {
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow HTTP redirects
  }
  curl_setopt($ch, CURLOPT_URL, $url);
  // Write and read cookies from the same jar so the session persists
  curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
  curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
  curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6');
  $return = curl_exec($ch);  // false on failure, the page body on success
  curl_close($ch);

  return $return;
}


I hope you have a basic understanding of the cURL library, so I am not describing these cURL options in detail. If you need help with cURL, please visit http://php.net/manual/en/ref.curl.php.
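
As mentioned above, the method can be extended. Here is a minimal sketch of a POST-capable variant [the scrapePost name, its parameters, and the proxy address are my own illustration, not part of the original library]:

// Hypothetical extension: same idea as scrapeIt, plus POST fields and an optional proxy
private function scrapePost($url, array $post = array(), $proxy = '')
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_URL, $url);
  if (!empty($post)) {
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
  }
  if ($proxy != '') {
    curl_setopt($ch, CURLOPT_PROXY, $proxy); // e.g. '127.0.0.1:8080'
  }
  $return = curl_exec($ch);
  curl_close($ch);

  return $return;
}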

7. Now create a method that fetches the page with cURL and hands the resulting HTML to the SimpleHtmlDom parser, so we can use its fantastic selectors.

public function shDom($url)
{
  $page = $this->scrapeIt($url);
  if ($page === false) {
    return false; // the cURL request failed
  }
  $shdom = str_get_html($page);
  unset($page);

  return $shdom;
}
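
The returned object supports the same selectors as before. For instance [a hypothetical sketch; the URL is a placeholder]:

$dom = $this->scraping->shDom('http://example.com/news');
if ($dom) {
  // find() takes CSS-like selectors; the second argument picks the n-th match
  $title = $dom->find('title', 0);
  echo $title ? $title->plaintext : 'no title found';
  $dom->clear(); // release the memory held by the parser
}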


8. Now we create a CI controller to handle the scraping tasks:

<?php if ( ! defined('BASEPATH')) exit('No direct script access allowed');

class Scraper extends CI_Controller {

  public function __construct()
  {
    parent::__construct();
  }

  public function index()
  {
    $data = array();
    if (isset($_GET['url']))
    {
      $data['url'] = trim($_GET['url']);
      if (!empty($data['url']))
      {
        $this->load->library('scraping');
        // $data['page'] = $this->scraping->page($data['url']); // to scrape simple web pages
        $data['page'] = $this->scraping->shDom($data['url']); // to scrape complex web pages
      } else {
        $data['notice'] = 'URL is empty!!!';
      }
    }
    $this->load->view('dashboard', $data);
  }
}
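
With the default routes, this controller responds at a URL like the following [a hypothetical example; adjust the host and the index.php segment to your setup]:

http://localhost/index.php/scraper?url=http://example.com/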


As I assume you already have expertise in CI, I am not going into more detail about the controller.

9. Here is the view file, dashboard.php:

<!DOCTYPE html>
<html>
    <head>
        <title>Scraping library with SimpleHtmlDom for CodeIgniter 2.1</title>
    </head>
    <body>
        <div>
            <form action="" method="get">
                <span>URL:</span> <input type="text" name="url" id="url"/> <input type="submit" value="GO!">
            </form>
            <br />
            <?php
            if (isset($page)){
                echo 'Page url: ' . htmlspecialchars($url) . '<hr />'; // escape user input before echoing
                echo $page;
            }
            if (isset($notice)){
                echo "<h3>$notice</h3>";
            }
            ?>
        </div>
    </body>
</html>


Download the source code of this project
