Hi,
my name is Joe (a.k.a. @rxhector on twitter/steemit),
and this is the result of 15 years of data scraping using php w/multi-curl.
I have just recently solved one of the biggest problems I've had in scraping with php w/multi-curl:
the case where you have to post to a first page and use that data to get to yet another page of results, without the never-ending, unwieldy cascade of if/then/else url checks.
It took 15 years because I started as a carpenter by trade (15 years) and just
kind of accidentally discovered php/mysql,
but we will save that story for another time.
I have a working example to check out, so you won't be left flying blind like I was while learning this stuff.
https://github.com/rxhector/ultimate-multicurl
I have tried to comment the code (it still needs a ton of formatting cleanup).
The first goodie from the code is this little xml beauty:
load any web page into xml without breaking php's simplexml_import_dom.
if (!function_exists('load_simplexml_page')) {
    /*
    this will 'force' xml to load a web page (pure html)
    sometimes simplexml_import_dom breaks when trying to import html with bad markup (i.e. old crappy coding/scripts)
    DOMDocument will auto-magically fix broken html, then we can simplexml-ize it!

    NOTE:
    when using php file_get_contents, DOMDocument, or simplexml,
    those functions send the php.ini user_agent.
    most web sites will not answer a request with an empty user_agent or the user_agent "PHP",
    so I ALWAYS set the php.ini user_agent to a valid browser string:

    php.ini
    user_agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
    */

    //load from url (default) or string
    function load_simplexml_page($page, $type = 'url') {
        /*
        DOMDocument::loadHTMLFile() throws warnings on malformed documents,
        so we force the doctype/utf-8 in the constructor
        and prepend @ to tell php to ignore any warnings/errors while loading
        */
        $dom = new DOMDocument('1.0', 'UTF-8');
        $type === 'url'
            ? @$dom->loadHTMLFile(trim($page))
            : @$dom->loadHTML($page);
        return simplexml_import_dom($dom);
    } //end function
} //end function check
Here's a quick and dirty script that lets you test which user_agent your requests actually go out with - hit it with curl, or run it from the command line to see your php.ini user_agent:
<?php
//save this on your web server so you can hit it as a web page, e.g. /www/test_user_agent.php
//then run it from the command line, e.g. php /www/test_user_agent.php
//require( some file with load_simplexml_page );

if (isset($_SERVER['REQUEST_METHOD'])) {
    //this is a web page hit
    $return = print_r($_SERVER, true);
    echo $return; //return something to the browser/curl request
} else {
    //no server request - this is running as a command line script
    //these requests go out with whatever user_agent is set in php.ini (which is the point of this test)
    //$text = file_get_contents("http://localhost/test_user_agent.php");
    //echo $text;
    $xml = load_simplexml_page("http://localhost/test_user_agent.php");
    echo $xml->asXML();
}
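If you want to run the same check from curl itself, a minimal sketch looks like this (the localhost url is just the test page above; curl does not read php.ini, so the user agent has to be set on the handle):

<?php
//hit the test page with curl and an explicit browser-style user agent
//(assumes the test_user_agent.php page above is served from localhost)
$ch = curl_init("http://localhost/test_user_agent.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl does NOT use the php.ini user_agent - set it per handle
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0");
$body = curl_exec($ch);
curl_close($ch);
echo $body; //look for HTTP_USER_AGENT in the dumped $_SERVER array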
I tried regex and substr and the other normal php string-processing functions,
but I learned early on that this can become unwieldy and complex.
That's when I discovered simplexml - what an awesome tool for html!!!
It is super easy to use xpath expressions to get to any element on a web page and pull out data, for example:
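Just as a rough illustration (the url and the table id are made up - point it at whatever page you are scraping):

<?php
//pull every link out of a page with one xpath expression
//(uses the load_simplexml_page() helper from above)
$xml = load_simplexml_page("http://example.com/");
foreach ($xml->xpath('//a') as $a) {
    echo (string)$a['href'] . ' => ' . trim((string)$a) . "\n";
}

//or grab a specific cell, e.g. the second column of every row in a table with id="results"
foreach ($xml->xpath('//table[@id="results"]//tr/td[2]') as $td) {
    echo trim((string)$td) . "\n";
}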
On my first few scraping jobs I was a NOOB and was using good old file_get_contents, getting one slow page at a time,
but that has its limitations...
When you get a little more advanced you have to learn to build a query string and post data (pagination is a bitch).
That's when I discovered curl - wow, it took me about a week to learn how to set cookies so I could log in to sites and get backend data.
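The basic cookie-jar dance looks roughly like this (the urls and form fields are made up - this is just the shape of it, not code from the repo):

<?php
$jar = '/tmp/cookies.txt';

//1) post the login form - curl writes the session cookie to the jar
$ch = curl_init("https://example.com/login");
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(['user' => 'me', 'pass' => 'secret']),
    CURLOPT_COOKIEJAR      => $jar, //write cookies here
    CURLOPT_COOKIEFILE     => $jar, //read them back on later requests
    CURLOPT_FOLLOWLOCATION => true,
]);
curl_exec($ch);
curl_close($ch);

//2) the backend page now sees the logged-in session cookie
$ch = curl_init("https://example.com/backend/report");
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => $jar,
]);
$html = curl_exec($ch);
curl_close($ch);
$xml = load_simplexml_page($html, 'string'); //re-use the helper from above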
so my load_simplexml_page and curl tools came in handy - but man was it slow doing 'synchronous' page loading - one slow page at a time.
when you get a client that wants 10,000 pages at a time (instead of 100) - you better figure out how to do it AS FAST AS POSSIBLE !!!
Then I stumbled into multi-curl - HOLY SHIT, talk about a learning curve...
I know there are a few 'wizards' out there who probably picked it up right away,
but you gotta remember - I was a 15-year carpenter/guitar player/stoner - it took me a bit
to grasp the 'asynchronous' concept: load 1000 pages all at once, then process them.
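Before we get to the object, here is the bare-bones version of the idea - just the raw curl_multi loop, not the class from the repo (the urls are made up):

<?php
$urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "http://example.com/page3",
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0");
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

//pump the multi handle until every transfer is finished
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); //wait for activity instead of spinning the cpu
    }
} while ($running && $status === CURLM_OK);

//reap the results
foreach ($handles as $i => $ch) {
    $html = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    //process $html here, e.g. load_simplexml_page($html, 'string')
}
curl_multi_close($mh);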
So now we can move on to the main multi-curl object itself.
I won't post the code in its entirety here - you can find it at https://github.com/rxhector/ultimate-multicurl.
Some of my favorite code tricks I've discovered along the way:

//notice the &(reference) here
if ($x = &$this->result_callback($this->rs[$i]['options'])) {
    //$x is now a string and can be called as a function
    $x($this->rs[$i], $i);
    //the old school way would be mixed call_user_func_array(callable $callback, array $param_arr):
    //call_user_func_array($x, [$this->rs[$i], $i]);
}
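Stripped down to a standalone example of the same trick (the names here are made up, not from the repo):

<?php
//a plain string can be called as a function in php
function my_result_callback($row, $index) {
    echo "processing row {$index}\n";
}

$options = ['result_callback' => 'my_result_callback'];

//pull the callback name out of the options array
if ($fn = $options['result_callback']) {
    //$fn is just the string 'my_result_callback', but php lets you call it directly
    $fn(['some' => 'row data'], 0);
    //the long-hand equivalent:
    //call_user_func_array($fn, [['some' => 'row data'], 0]);
}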
Another pretty cool use for references:

public function start_callback(&$options, $set = false) {
    //$options can now be changed from within the callback function
    //(trust me - it comes in handy for passing variables around in multi-curl)
    $options['result_callback'] = $set; //now $options is changed in the main (calling) flow
}
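Again, boiled down to a standalone sketch (names are made up) so you can see the reference doing its job:

<?php
//changing the caller's array through a reference parameter
function start_callback(array &$options, $set = false) {
    //because $options came in by reference, this write sticks in the caller's scope
    $options['result_callback'] = $set;
}

$options = ['url' => 'http://example.com/page1'];
start_callback($options, 'my_result_callback');
print_r($options); //result_callback is now set in the calling scope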
The next really cool goodie you get is a proxy scraper - you know, grab a list of 300 proxies,
then do a quick check against the target domain to make sure each one is a good proxy ;)
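A minimal version of that check could look something like this (the proxy address and target url are placeholders):

<?php
//quick sanity check: can this proxy actually reach the target domain?
function proxy_works($proxy, $target = "http://example.com/") {
    $ch = curl_init($target);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxy, //e.g. "1.2.3.4:8080"
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $body !== false && $code == 200;
}

//keep only the proxies that pass
//$good = array_filter($proxies, 'proxy_works');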
So - the whole point for me was to get a better understanding of how multi-curl works.
If you want to know how something works, you gotta break it and try to rebuild it.
Looking forward:
this code really needs some more cleanup and better comments/formatting.
I would really like to add some of the functionality from zebra-curl (I didn't need all the bells and whistles right away, so I just built this for quick get/post json requests).
I hope you guys like it
https://github.com/rxhector/ultimate-multicurl
This is running in production for a mid-market cap company - we are scraping about 30k records @ 1000/hr (25 pages/second) - not bad!!!
Tipping is allowed:
the old slow way - paypal [email protected]
the super fast ~3 second way - twitter @xrptipbot https://twitter.com/@rxhector
twitter trx bot - @goseedit