Hi folks,
How's the weekend going?
Today let's learn how to get the title and URL of Stack Overflow questions.
For that we will use Scrapy, a web-scraping framework for Python, and export the extracted data to a JSON file.
Interested? Let's jump in.
Install Scrapy (pip install scrapy)
Create a Scrapy project named stack, since the spider imports its item from stack.items (scrapy startproject stack)
In items.py
    from scrapy.item import Item, Field


    class StackItem(Item):
        # Fields collected for each question
        title = Field()
        url = Field()
Create a file stack_spider.py inside spiders/
    from scrapy import Spider

    from stack.items import StackItem


    class StackSpider(Spider):
        name = "stack"
        allowed_domains = ["stackoverflow.com"]
        start_urls = [
            "https://stackoverflow.com/questions?pagesize=50&sort=newest",
        ]

        def parse(self, response):
            # Each question title sits in an <h3> inside a div with class "summary"
            questions = response.xpath('//div[@class="summary"]/h3')
            for question in questions:
                link = question.xpath('a[@class="question-hyperlink"]')
                if not link:
                    # Skip headings that carry no question link
                    continue
                item = StackItem()
                item['title'] = link.xpath('text()').get()
                # The href is relative, so resolve it against the page URL
                item['url'] = response.urljoin(link.xpath('@href').get())
                yield item
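If you want to see what those two XPath expressions pick out without hitting the site, here is a small sketch. It is a stand-in, not the spider itself: it uses the standard library's ElementTree (which understands this simple subset of XPath) instead of Scrapy's selectors, and the SAMPLE markup and the extract_questions helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup that mimics the structure the spider expects.
SAMPLE = """
<html><body>
  <div class="summary"><h3><a class="question-hyperlink" href="/questions/1/first">First question</a></h3></div>
  <div class="summary"><h3><a class="question-hyperlink" href="/questions/2/second">Second question</a></h3></div>
  <div class="sidebar"><h3><a class="question-hyperlink" href="/ignored">Ignored</a></h3></div>
</body></html>
"""


def extract_questions(markup):
    """Return title/url dicts for every question link under a "summary" div."""
    root = ET.fromstring(markup.strip())
    items = []
    # Same path idea as the spider: <h3> elements directly inside div.summary
    for h3 in root.findall(".//div[@class='summary']/h3"):
        link = h3.find("a[@class='question-hyperlink']")
        if link is not None:
            items.append({"title": link.text, "url": link.get("href")})
    return items


if __name__ == "__main__":
    for item in extract_questions(SAMPLE):
        print(item["title"], item["url"])
```

Only the two links under the "summary" divs come back; the one under "sidebar" is skipped because the class does not match.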
Now it's time to run the spider and get the questions in a JSON file.
Run the command below from inside the project directory (the one containing scrapy.cfg) to get the output:
scrapy crawl stack -o questions.json
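Once the crawl finishes, you may want a quick look at what landed in the file. Below is a minimal sketch using only the standard library; it assumes the feed is a JSON array of objects with "title" and "url" keys, and the helper names load_questions and format_question are my own, not part of Scrapy.

```python
import json
from pathlib import Path


def load_questions(path):
    """Return the list of scraped items stored in a JSON feed file."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def format_question(item):
    """Render one item as a single readable line."""
    return f"{item['title']} -> {item['url']}"


if __name__ == "__main__":
    feed = Path("questions.json")
    # Only read the file if the crawl has already produced it
    if feed.exists():
        for item in load_questions(feed):
            print(format_question(item))
```

One thing to watch for: -o appends to an existing file, so running the crawl twice into the same questions.json can leave a file that no longer parses as JSON. Deleting the old file before each run avoids that.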
That's all for scraping Stack Overflow. There is more we can do from here, such as storing the data in a database or extracting data page by page. Explore these, and let me know in the comments if you run into any problems.
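To give you a head start on the database idea, here is a rough sketch of an item pipeline that writes each question to SQLite. Treat it as one possible approach under assumptions: the class name SQLitePipeline, the questions.db file, and the table layout are all hypothetical, and it relies on Scrapy's usual pipeline hooks (open_spider, process_item, close_spider).

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline that stores each scraped question in SQLite."""

    def __init__(self, db_path="questions.db"):
        # db_path is an illustrative default; Scrapy creates the pipeline with no arguments
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS questions (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # The URL is the primary key, so a question seen again replaces its old row
        self.conn.execute(
            "INSERT OR REPLACE INTO questions (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()
```

To try it, you would put the class in the project's pipelines.py and enable it through the ITEM_PIPELINES setting in settings.py, using the module path that matches your own project name.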
Also, if you liked this article, please upvote it!
Thanks, see you all in the next post!