Hi folks,

How's the weekend going?

Today, let's learn how to get the title and URL of Stack Overflow questions.

For that, we will use Scrapy, a web scraping framework for Python, and export the extracted data to a JSON file.

Interested? Let's jump in.

Install Scrapy (pip install scrapy)

Create a Scrapy project (scrapy startproject scrapper)

In items.py

from scrapy.item import Item, Field


class StackItem(Item):
    # Fields captured for each Stack Overflow question.
    title = Field()
    url = Field()

Create a file stack_spider.py inside spiders/

from scrapy import Spider

# The package name matches the project created with "scrapy startproject scrapper".
from scrapper.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        # Each question title sits in an <h3> inside a "summary" div.
        # These selectors depend on the page markup and may need adjusting.
        questions = response.xpath('//div[@class="summary"]/h3')

        for question in questions:
            link = question.xpath('a[@class="question-hyperlink"]')
            item = StackItem()
            # .get() returns None instead of raising IndexError when nothing matches.
            item['title'] = link.xpath('text()').get()
            item['url'] = link.xpath('@href').get()
            yield item
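
The spider above reads only the first listing page. As a rough sketch of how more pages could be covered, the helper below builds several listing URLs up front. The function name build_start_urls is hypothetical, and the page query parameter is an assumption about how the site paginates, so check it in the browser before relying on it.

```python
from urllib.parse import urlencode

QUESTIONS_URL = "https://stackoverflow.com/questions"


def build_start_urls(pages=3, pagesize=50, sort="newest"):
    """Return listing URLs for pages 1..pages.

    The "page" parameter is an assumption, not something the spider above uses.
    """
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({"pagesize": pagesize, "sort": sort, "page": page})
        urls.append(f"{QUESTIONS_URL}?{query}")
    return urls
```

In the spider, this could replace the hard-coded list, for example start_urls = build_start_urls(pages=3).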

Now it's time to run the spider and save the questions to a JSON file.

Run the command below from inside the project directory to get the output:

scrapy crawl stack -o questions.json
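
Once the crawl finishes, the exported file can be inspected from Python. The snippet below is a minimal sketch and not part of the project itself: the helper names load_questions and absolute_url are hypothetical, and it assumes questions.json holds a JSON array of objects with the title and url fields from StackItem. The scraped links are usually relative, so it joins them against the site root, which is also an assumption.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://stackoverflow.com"  # assumed site root for relative links


def load_questions(path="questions.json"):
    """Load the list of items exported by "scrapy crawl stack -o questions.json"."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def absolute_url(item, base=BASE_URL):
    """Turn a scraped (possibly relative) link into a full URL."""
    return urljoin(base, item["url"])
```

For example, looping over load_questions() and printing item["title"] next to absolute_url(item) gives a quick check that the crawl picked up what you expected.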

That's all for scraping Stack Overflow. There is more we can do with this, such as storing the data in a database or extracting data page by page. Explore these, and let me know in the comments if you run into any problems.
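
As a starting point for the database idea, here is a minimal sketch of an item pipeline that writes each question to SQLite. It is an illustration under assumptions rather than part of the project above: the class name SQLitePipeline, the questions.db file, and the questions table are all hypothetical. Scrapy calls open_spider, process_item, and close_spider on pipeline classes when they are enabled through the ITEM_PIPELINES setting.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores scraped questions in SQLite."""

    def __init__(self, db_path="questions.db"):
        self.db_path = db_path  # assumed file name; ":memory:" works for quick tests
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS questions ("
            "url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps one row per URL across repeated crawls.
        self.conn.execute(
            "INSERT OR REPLACE INTO questions (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        self.conn.close()
```

If the class lives in the project's pipelines.py, an entry such as ITEM_PIPELINES = {"scrapper.pipelines.SQLitePipeline": 300} in settings.py should enable it.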

Also, if you liked this article, please upvote it!

Thanks, see you all in the next post!

References:

https://doc.scrapy.org/en/latest/intro/tutorial.html
