Web Scraping in Python

in python •  7 years ago 

I'm writing this post as a recap of what I uncovered while learning to scrape web pages for content on the Internet. I did a lot of research, and it all started here. Priceonomics sells data crawling as a service (DCAS). Not sure if DCAS is a thing yet, but I'm pretty sure some people will start calling the service that. I looked at what Priceonomics was doing and figured it shouldn't be hard to gain a basic understanding of web scraping.

My options

There are many open source libraries and tools available, and you can easily be overwhelmed just examining the landscape. My programming language of choice was Python, and the library I chose was lxml.

Why did I choose Python?

I believe in learning from others, so I figured a good place to start was Hacker News, where many innovative ideas and solutions are shared. A quick search for web scraping yielded some great results.

web-scraping-search-hacker-news.png

Here's the result of my search on Hacker News. The yellow highlights match the term 'web scraping' and the red underlines match 'Python'. The search results are really good, and many of them are based on Python. That was my reasoning.

Why did I choose to use LXML?

I came across a post titled Scraping with Urllib2 & LXML; a search on Google turned it up. The post was very similar to what I wanted to accomplish, so it felt like an easy win and I decided to give it a try. lxml is used by many other libraries and software packages; you can check some of its uses on the lxml FAQ page.

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

The problem I wanted to solve

I wanted to scrape Amazon search results. To compare prices, I would have had to do it visually or copy and paste results somewhere else for analysis. That's not very appealing to me.

amazon-search-lego-21115.png

Looking at the search results for 'lego 21115', you would see the following web page. I needed to find the div for each item; luckily, Chrome's developer tools made that easy. I just inspected the first item and, after walking up the tree, found the node. For me, it was a div tag with the class a-fixed-left-grid-col a-col-right.

chrome-inspect-amazon-search.png
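One thing worth noting about the XPath used below: `@class="..."` only matches when the attribute equals the whole class string exactly. A small sketch (using a stand-in snippet, not Amazon's real markup) shows the difference between the exact match and the looser contains():

```python
from lxml import html

# Stand-in for Amazon's markup: the full class string must match
# exactly when using @class="...", so an extra class would break it.
snippet = '<div><div class="a-fixed-left-grid-col a-col-right">item</div></div>'
tree = html.fromstring(snippet)

exact = tree.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')
# contains() is looser and survives additional classes on the node
loose = tree.xpath('//*[contains(@class, "a-col-right")]')
print(len(exact), len(loose))
```

Both queries find the node here, but only contains() would keep working if Amazon added another class to the div.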

My code

from lxml import html
import requests

url = "http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=lego+21115"

page = requests.get(url)
tree = html.fromstring(page.content)

# Each search result lives in a div with this exact class string
items = tree.xpath('//*[@class="a-fixed-left-grid-col a-col-right"]')

titles = []

for item in items:
    # The item's title is stored in the title attribute of an anchor tag
    for title in item.iter("a"):
        if title.get("title") is not None:
            titles.append(title.get("title"))

prices = []

for item in items:
    found = False
    # The price is the text of a span with this exact class string
    for price in item.iter("span"):
        if price.get("class") == "a-size-base a-color-price a-text-bold":
            prices.append(price.text)
            found = True
    if not found:
        prices.append("no price")

print(titles)

print(prices)

Results - Titles

['LEGO Minecraft 21115 The First Night', 'LEGO Minecraft 21114 The Farm', 'LEGO Minecraft 21116 Crafting Box', 'LEGO Minecraft 21121 the Desert Outpost Building Kit', 'LEGO Minecraft 21120 the Snow Hideout Building Kit', 'LEGO Minecraft 21117 The Ender Dragon', 'LEGO Minecraft 21118 The Mine', 'LEGO Minecraft Creative Adventures 21115 The First Night', '21115 Lego 408Pcs Minecraft The First Night Kids Building Playset', 'Lego Minecraft Toys Premium Educational Sets Creationary Game With Minifigures For 8 Year olds Childrens Farm Box', 'Minecraft The Farm Includes a Steve Minifigure with an Accessory, plus a Skeleton, Cow and a Sheep', 'LEGO Minecraft - The Creeper Minifigure from set 21115', 'Lego Minecraft Ultimate Collection (Cave 21113 ,Farm 21114, First Night 21115, Crafing Box 21116, Dragon 21117, Mine 21118 )', 'Bundle: LEGO Minecraft 21116 Crafting Box & LEGO Minecraft 21115 The First Night & LEGO Minecraft The Cave 21113 Playset', 'LEGO Minecraft The Cave 21113 Playset & LEGO Minecraft 21115 The First Night & LEGO Minecraft 21114 The Farm', u'Lego\xae Minecraft Terrain Ore Bundle "(1) Diamond" "(1) Emerald" "(1) Silver" "(1) Amethyst" "(1) Gold" "(1) Lapis" "(1) Redstone"']

Results - Prices

['$34.99', '$25.99', '$21.31', '$40.99', '$35.66', '$49.00', '$34.76', '$73.36', '$118.50', '$46.91', '$61.38', '$47.55', 'no price', '$5.40', '$568.99', 'no price', '$185.99', 'no price']
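Since the titles and prices are collected in two separate lists that walk the same items in the same order, one simple way to line them up (a sketch using a couple of the results above as sample data) is zip:

```python
# Sample data copied from the results above
titles = ["LEGO Minecraft 21115 The First Night", "LEGO Minecraft 21114 The Farm"]
prices = ["$34.99", "$25.99"]

# zip pairs each title with the price at the same index; this only
# works because both lists were built from the same items in order.
rows = list(zip(titles, prices))
for title, price in rows:
    print(f"{price:>8}  {title}")
```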

Results - Overall

Within a few hours of reading and coding, I was able to accomplish my original goal. I now have a base to grow from to do more complex scraping.

Disclaimer

Be careful not to abuse scraping. Most companies, like Amazon, have an API for you to use, which lets you bypass design and style changes made to the site. Yes, my scraper will break when Amazon changes its page; it's not a question of if, but when. Using an API puts you right next to the data. Always use an API when one is available.

This example is simple and pulls one page from Amazon's site, so Amazon would probably not block my IP. If my script started to crawl Amazon's site, that would be another story.

Learning the craft

Here are some additional resources that gave me insight into web scraping.

XPath and XSLT with lxml - This contains a great example of using XPath.

Elements and Element Trees - While traversing the tree, you will receive elements. Here's a good overview and examples.

Requests Python library - Retrieving web pages shouldn't be a chore, and the Requests library makes it super easy. There's also a great example in the post if you need to send a multipart POST request.

HTML Scraping — The Hitchhiker's Guide to Python - A good place to start if you want to get coding immediately and skip the stuff above.
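As an aside on the Requests bullet above, here's a minimal sketch of what a multipart POST looks like with Requests. The URL and field names are placeholders, and the request is only built here, not sent:

```python
import requests

# Build (but don't send) a multipart POST to inspect how Requests
# encodes it. The URL, field name, and file contents are placeholders.
req = requests.Request(
    "POST",
    "https://example.com/upload",
    files={"report": ("results.csv", "title,price\n", "text/csv")},
)
prepared = req.prepare()

# Requests sets a multipart/form-data Content-Type with a boundary
print(prepared.headers["Content-Type"].split(";")[0])
```

Passing a dict to `files=` is what switches Requests into multipart mode; to actually send it, you would call `requests.post(url, files=...)` instead.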

Note: This is a repost from my blog. I had to do minor reformatting because Jekyll's Markdown support is slightly different from steem.it's.
