Hello!
Today's story is about scraping dynamically generated frontends. Modern web technologies run part of the code on the client (browser) side. These technologies make websites more flexible, reduce server-side load, and allow content to be loaded dynamically.
If you have heard of ReactJS, Angular, Ember or Backbone, they are all about dynamically generated frontends. For example, steemit is written in ReactJS. But advantages for users are disadvantages for scrapers. In steemit's case, only a few articles of a user's blog are loaded initially; to see more articles you must scroll down the page. An action that is simple for a user is not so simple for a spider.
To deal with this problem, scraping frameworks interact with browser automation tools. The most famous tool of this kind is Selenium, whose primary task is automated testing of web pages. But this method needs an active browser. An alternative is Splash. Splash is a browser without a GUI, packaged in a Docker container and controlled through an HTTP API. One more important thing: Scrapy has a plugin for Splash. Here is an example from the Scrapy-Splash documentation that shows how to integrate Splash:
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        # Route every start URL through Splash and wait 0.5 s for scripts to run.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
        pass
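To see what "controlled through an HTTP API" means in practice, here is a minimal sketch that talks to Splash's render.html endpoint directly, without Scrapy. It assumes a Splash instance is listening on its default port 8050 on localhost; the helper names `build_render_url` and `fetch_rendered_html` are made up for this illustration and are not part of Splash or Scrapy-Splash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash runs locally on its default port 8050.
SPLASH_URL = "http://localhost:8050"


def build_render_url(page_url, wait=0.5, splash_url=SPLASH_URL):
    """Build a render.html URL (hypothetical helper for this sketch)."""
    # 'url' is the page to render; 'wait' is the number of seconds Splash
    # waits after loading so that client-side scripts can run.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5, timeout=30):
    """Ask Splash to render the page and return the resulting HTML."""
    with urlopen(build_render_url(page_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (needs a running Splash instance):
# html = fetch_rendered_html("http://example.com")
```

The `wait` value here plays the same role as `args={'wait': 0.5}` in the spider: the plugin passes these arguments on to Splash for you.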
Looks great.
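The spider alone is not enough: the plugin also has to be enabled in the project's settings.py. Below is a sketch of the settings as I recall them from the Scrapy-Splash README; the Splash address is an assumption (a local instance on port 8050), and the exact middleware list may differ between plugin versions, so check the README for the version you install.

```python
# settings.py fragment (sketch)

# Assumption: Splash is reachable on localhost, default port 8050.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that send requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```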
In the next article we will try to parse steemit using Scrapy-Splash.