my gift to the php multi-curl data scrapers


Hi,
my name is Joe (a.k.a. @rxhector on twitter/steemit)

and this is the result of 15 years of data-scraping using php w/multi-curl.

I have just recently solved one of the biggest problems I've had in scraping using php w/multi-curl:
the problem where you have to post to a first page and use that data to get to yet another page of results (without the never-ending, unwieldy, cascading if/then/else url-check bullshit).

It took 15 years because I started as a carpenter by trade (15 years) and just
kind of accidentally discovered php/mysql,
but we will save that story for another time.

I have a working example to check out so you won't be left flying blind like I was while I learned this shit.

https://github.com/rxhector/ultimate-multicurl

And I have tried to comment the code (it still needs a ton of better/prettier formatting).

The first trick from the code is this little xml beauty:
load any web page into xml without it breaking the shit out of php's simplexml_import_dom.

if (!function_exists('load_simplexml_page')) {
    /*

        this will 'force' xml to load a web page (pure html)
        sometimes simplexml_import_dom breaks when trying to import html with bad markup (i.e. old crappy coding/scripts)
        DOMDocument will auto-magically fix shitty html

        then we can simplexml-ize it !!!

        NOTE :
            when using php file_get_contents or DOMDocument or simplexml,
            those functions send the php.ini user_agent

            most web sites will not answer a request with an empty user_agent
            or the user_agent "PHP"
            I ALWAYS set php.ini user_agent to a valid browser string

            php.ini
            user_agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"

    */
    //load from url (default) or string
    function load_simplexml_page($page , $type='url'){
        /*
            DOMDocument::loadHTMLFile() throws warnings on malformed documents
            new DOMDocument('1.0', 'UTF-8') forces the doctype/utf-8 and quiets some of them
            the @ prefix tells php to ignore any remaining warnings/errors
        */
        $dom = new DOMDocument('1.0', 'UTF-8');
        $type === 'url'
            ? @$dom->loadHTMLFile(trim($page))
            : @$dom->loadHTML($page);
        return simplexml_import_dom($dom);
    }//end function
}//end function check

here's a quick and dirty script that lets you test which user_agent your php setup actually sends on a request (hit your own server and dump what it sees)


<?php
    // save this on your web server so you can hit it as a web page, e.g. /www/test_user_agent.php
    // then run it from the cmd line, e.g. php /www/test_user_agent.php

    //require( somefile with load_simplexml_page )

    if(isset($_SERVER['REQUEST_METHOD'])){
        //this is a web page hit
        $return = print_r($_SERVER , true);
        echo $return;   //return something to the browser/curl request - look for HTTP_USER_AGENT in the dump
    }
    else
    {
        //no server request - this is running as a cmd line script

        //shows whatever user_agent your php.ini sends (default is "PHP" or nothing at all)
        //$text = file_get_contents("http://localhost/test_user_agent.php");
        //echo $text;

        //same test through load_simplexml_page
        $xml = load_simplexml_page("http://localhost/test_user_agent.php");
        echo $xml->asXML();
    }

I tried regex and substr and the other normal php string-processing functions,
but I learned early on that that can become unwieldy and complex.
That's when I discovered simplexml - what an awesome tool for html !!!
It is super easy to use xpath expressions to get to any element on a web page and pull the data !!!
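
For example, here's a minimal sketch of pulling data out of a page with xpath - the url and the selectors are just placeholders for illustration, but the pattern is the same on a real job:

    // assumes load_simplexml_page() from above is already defined
    // the url and the xpath selectors here are made up - swap in your target
    $xml = load_simplexml_page('http://example.com/products');

    // grab every row of a (hypothetical) results table
    $rows = $xml->xpath("//table[@id='results']//tr");

    foreach ($rows as $row) {
        // each $row is a SimpleXMLElement - relative xpath works on it too
        $cells = $row->xpath('./td');
        if (count($cells) >= 2) {
            echo trim((string)$cells[0]) . ' => ' . trim((string)$cells[1]) . "\n";
        }
    }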

On my first few scraping jobs I was a NOOB and was using good old file_get_contents, getting one slow page at a time,
but that has its limitations...
when you get a little more advanced you have to learn to build a query string and post data... (pagination is a bitch)

That's when I discovered curl - wow, it took me about a week to learn how to set cookies so I could log in to sites and get backend data.
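
If cookies are the part holding you up, here's a minimal sketch of the idea - the login url and post fields are placeholders, but the trick is the cookie jar: curl writes the session cookie to a file on login and reads it back on every request after that.

    // log in once, save the session cookie, then reuse it for backend pages
    // (login url and post fields are hypothetical - adjust for your target)
    $cookie_file = sys_get_temp_dir() . '/scraper_cookies.txt';

    $ch = curl_init('http://example.com/login');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(['user' => 'me', 'pass' => 'secret']),
        CURLOPT_COOKIEJAR      => $cookie_file,   // write cookies here when the request finishes
        CURLOPT_COOKIEFILE     => $cookie_file,   // read cookies from here on later requests
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
    ]);
    curl_exec($ch);
    curl_close($ch);

    // the session cookie now lives in $cookie_file - backend pages will recognize us
    $ch = curl_init('http://example.com/backend/report');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => $cookie_file,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);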

So my load_simplexml_page and curl tools came in handy - but man, it was slow doing 'synchronous' page loading, one slow page at a time.
When you get a client that wants 10,000 pages at a time (instead of 100), you'd better figure out how to do it AS FAST AS POSSIBLE !!!

Then I stumbled into multi-curl - HOLY SHIT, talk about a learning curve...
I know there are a few 'wizards' out there who probably picked it up right away,
but you gotta remember - I was a 15-year carpenter/guitar player/stoner - it took me a bit
to grasp the 'asynchronous' concept: let's load 1000 pages all at once - and process them.
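
To give you the flavor of the 'asynchronous' idea before we get to my class, here's a bare-bones multi-curl sketch using just the raw php functions - the urls are placeholders and all the error handling is stripped out:

    // fire off a whole batch of requests at once, then collect the results
    $urls = [
        'http://example.com/page/1',
        'http://example.com/page/2',
        'http://example.com/page/3',
    ];

    $mh      = curl_multi_init();
    $handles = [];

    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // run all the handles until every transfer is done
    do {
        curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of burning cpu
        }
    } while ($running);

    // pull the content back out and clean up
    $pages = [];
    foreach ($handles as $i => $ch) {
        $pages[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

All the pages come back in roughly the time of the slowest one - that's the whole magic.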

So now we can move on to the main multi-curl object itself.
I won't post the code in its entirety here - you can find it in the github project: https://github.com/rxhector/ultimate-multicurl

Some of my favorite code tricks I've discovered along the way:


    //notice the &(reference) here
    if ($x = &$this->result_callback($this->rs[$i]['options'])) {

        //$x is now a string (the callback name) and can be called like a function
        $x($this->rs[$i], $i);

        //old school would be call_user_func_array(callable $callback, array $param_arr)
        //call_user_func_array($x, [$this->rs[$i], $i]);
    }
        

another pretty cool use for references


    public function start_callback(&$options, $set = false){
        //$options can now be changed from within the callback function
        //(trust me - it comes in handy for passing variables around in multi-curl)

        $options['result_callback'] = $set;     //now $options is changed in the main (calling) flow
    }
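If that's hard to picture outside of the class, here's a tiny standalone sketch of both tricks together - the names are made up for illustration, not lifted from the repo:

    // the & means the caller's array gets modified, not a copy
    function set_result_callback(&$options, $set = false){
        $options['result_callback'] = $set;
    }

    // a plain function whose NAME we will pass around as a string
    function handle_result($result, $index){
        echo "page $index done, got " . strlen($result) . " bytes\n";
    }

    $options = [];
    set_result_callback($options, 'handle_result');  // $options now carries the callback name

    // later, when a page comes back, the string gets called like a function
    if (is_callable($x = $options['result_callback'])) {
        $x('<html>fake page</html>', 0);  // same effect as call_user_func($x, ...)
    }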

The next really cool goodie you get is a proxy scraper - you know - grab a list of 300 proxies,
then do a quick check against the target domain to make sure you have a good proxy ;)
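
I won't paste the whole proxy scraper here, but the quick-check part boils down to something like this sketch - the proxy list and the target url are placeholders:

    // hit the target through every proxy in parallel and keep the ones that answer
    $proxies = ['1.2.3.4:8080', '5.6.7.8:3128'];
    $target  = 'http://example.com/';

    $mh      = curl_multi_init();
    $handles = [];

    foreach ($proxies as $proxy) {
        $ch = curl_init($target);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_PROXY          => $proxy,
            CURLOPT_CONNECTTIMEOUT => 5,
            CURLOPT_TIMEOUT        => 10,
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$proxy] = $ch;
    }

    do {
        curl_multi_exec($mh, $running);
        if ($running) curl_multi_select($mh);
    } while ($running);

    $good = [];
    foreach ($handles as $proxy => $ch) {
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if ($code >= 200 && $code < 400) {
            $good[] = $proxy; // this proxy actually reached the target
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);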

so - the whole point for me was to get a better understanding of how multi-curl works.
if you want to know how something works - you gotta break it and try to rebuild it.

looking forward
this code really needs some more cleanup and better comments/formatting.

I would really like to add some of the functionality from zebra-curl (I didn't need all the bells and whistles right away - so I just built this for quick get/post json requests).

I hope you guys like it

https://github.com/rxhector/ultimate-multicurl

this is running in production for a mid-market cap company - we are scraping about 30k records at 1000/hr (25 pages/second) - not bad !!!

tipping is allowed
the old slow way - paypal [email protected]
the super fast ~3 second way - twitter @xrptipbot https://twitter.com/@rxhector
twitter trx bot @goseedit
