Self-intro: I am a graduate student at an unnamed institution in China :) My main focus is computer vision using deep learning. I will post some of my notes about deep learning on Steemit. I hope like-minded friends will follow me so that we can discuss and support each other. Thanks for reading! @whytin
Is Python now the top-ranked programming language? (Who knows? I don't.)
So, let's explore it!
Finding the most popular category of books in O'Reilly
Overview
Following "Data Science from Scratch", I want to find out which category of O'Reilly books contains the most titles. If you are still unsure about your future direction, the result may suggest which kind of books to start studying.
You can fork the project on GitHub:
Github: http://github.com/whytin/book_scratch
Preparation
Tool
- BeautifulSoup4 (a Python library that turns a document into a parse tree, so we can easily extract what we need). Refer to: http://www.crummy.com/software/BeautifulSoup/
- html5lib (a popular Python parser for HTML)
- requests (makes HTTP requests)
Environment
- Linux Mint 18.1 (Unlimited)
- Python 2.7
- Sqlite3
Foundation
- Python
- HTML
- Matplotlib
- SQL
Start
Is scraping permitted?
Before you start the project, make sure your target site allows scraping.
For O'Reilly, see the terms: http://oreilly.com/terms/
Glancing over that page, I did not find anything that prohibits scraping.
Next, we look at the robots.txt file: http://shop.oreilly.com/robots.txt
We found that:
Crawl-delay: 30
Request-rate: 1/30
This means that we should wait 30 seconds between two requests.
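The same two values can also be read programmatically. The following is a minimal sketch, assuming Python 3's standard `urllib.robotparser` module (this post itself targets Python 2.7). The `User-agent: *` line is an assumption, added only so that the two directives quoted above form a record the parser can read; it is not copied from the real file.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above. The "User-agent: *" line is assumed here
# so that the parser has a record to attach them to.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
Request-rate: 1/30
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("*")    # seconds to wait between requests
rate = parser.request_rate("*")    # named tuple with .requests and .seconds

print("crawl delay:", delay)
print("request rate:", rate.requests, "request(s) per", rate.seconds, "seconds")
```

Against a live site you would call `set_url()` and `read()` instead of `parse()`, and the value of `delay` could then feed the `sleep()` call in the scraping loop.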
Parsing the page
If you know HTML well, it is easy to find the relevant tags.
First, select a category (here, data) through Browse Subjects.
Second, open the browser's developer tools.
It is wise to use the "Select an element in the page" button to inspect the page and find the tag you need.
We can extract the title, authors, date, ISBN, and price of each book.
Try it yourself and you will get curious.
Coding
from bs4 import BeautifulSoup
import requests
# Request the URL and hand the response text to BeautifulSoup, parsed with html5lib.
url = "http://shop.oreilly.com/category/browse-subjects/data.do?sortby=publicationDate&page=1"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')
tds = soup('td', 'thumbtext')
We found that the book's title sits in the a tag inside the div with class "thumbheader", so we extract it:
titles = [td.find("div", "thumbheader").a.text for td in tds]
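To see how these selectors map onto the markup, here is a small sketch that runs on an inline fragment instead of the live page. The HTML, titles, and links in it are made up and only mimic the class names used above. It also assumes Python's built-in "html.parser" backend rather than html5lib, so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the markup the selectors above expect.
SAMPLE_HTML = """
<table><tr>
  <td class="thumbtext">
    <div class="thumbheader"><a href="/product/1111111111111.do">First Sample Book</a></div>
  </td>
  <td class="thumbtext">
    <div class="thumbheader"><a href="/product/2222222222222.do">Second Sample Book</a></div>
  </td>
</tr></table>
"""

# "html.parser" ships with Python; the post itself uses html5lib.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# soup('td', 'thumbtext') is shorthand for soup.find_all('td', class_='thumbtext')
tds = soup("td", "thumbtext")
titles = [td.find("div", "thumbheader").a.text for td in tds]
print(titles)
```

Experimenting on a saved fragment like this also avoids sending repeated requests to the site while you work out the right tags.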
Next, we can build the function book_info():
import re

# Extract the book information (title, authors, ISBN, date, price) from one <td>. Returns a dict.
def book_info(td):
    title = td.find("div", "thumbheader").a.text
    authors = td.find('div', 'AuthorName').text
    # The ISBN is embedded in the product link: /product/<isbn>.do
    isbn_link = td.find("div", "thumbheader").a.get("href")
    isbn = re.match(r"/product/(.*)\.do", isbn_link).group(1)
    date = td.find("span", "directorydate").text.strip()
    # .text.strip() keeps the price as text rather than a Tag object
    price = td('span', 'pricelabel')[0].find('span', 'price').text.strip()
    return {
        "title": title,
        "authors": authors,
        "isbn": isbn,
        "date": date,
        "price": price,
    }
Scraping:
from bs4 import BeautifulSoup
import requests
import re
from time import sleep

base_url = "http://shop.oreilly.com/category/browse-subjects/data.do?sortby=publicationDate&page="
books = []
NUM_PAGES = 44
for page_num in range(1, NUM_PAGES + 1):
    url = base_url + str(page_num)
    soup = BeautifulSoup(requests.get(url).text, 'html5lib')
    for td in soup('td', 'thumbtext'):
        books.append(book_info(td))
    # Respect the Crawl-delay of 30 seconds from robots.txt
    sleep(30)
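Because the real loop takes more than 20 minutes (44 pages with 30 seconds between them), it can be useful to check the paging and delay logic without touching the network. The following is one possible sketch. The names `crawl` and `fake_fetch` and the example.com URL are hypothetical placeholders. Note one small difference from the loop above: it pauses between requests only, so there is no final wait after the last page.

```python
import time

def crawl(base_url, num_pages, fetch, delay=30, sleep=time.sleep):
    """Fetch pages 1..num_pages, pausing `delay` seconds between requests."""
    pages = []
    for page_num in range(1, num_pages + 1):
        if page_num > 1:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(base_url + str(page_num)))
    return pages

# Dry run with stand-ins, so nothing is downloaded and nothing actually sleeps.
requested = []
pauses = []

def fake_fetch(url):
    requested.append(url)
    return "<html></html>"

pages = crawl("http://example.com/list?page=", 3, fake_fetch, sleep=pauses.append)
print(requested)
print(pauses)
```

For a real run you could pass something like `lambda url: requests.get(url).text` as `fetch` and leave `sleep` at its default.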
Visualization:
import matplotlib.pyplot as plt
from collections import Counter

def get_year(book):
    # Take the second whitespace-separated word of the date as the year
    return int(book["date"].split()[1])

# Counter(): dict subclass for counting hashable objects
year_counts = Counter(get_year(book) for book in books if get_year(book) <= 2016)
years = sorted(year_counts)
book_counts = [year_counts[year] for year in years]
plt.plot(years, book_counts)
plt.show()
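The counting step can be checked on its own before plotting. Here is a small sketch with made-up records; it assumes the date strings look like "Month Year", which is what get_year() implies but which I have not verified against every page.

```python
from collections import Counter

# Made-up records; only the "date" key matters here.
sample_books = [
    {"date": "November 2014"},
    {"date": "March 2014"},
    {"date": "July 2015"},
    {"date": "January 2017"},  # dropped by the <= 2016 filter
]

def get_year(book):
    return int(book["date"].split()[1])

year_counts = Counter(get_year(b) for b in sample_books if get_year(b) <= 2016)
years = sorted(year_counts)
book_counts = [year_counts[y] for y in years]
print(years, book_counts)
```

The two lists `years` and `book_counts` line up index by index, which is the shape `plt.plot` expects for its x and y arguments.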
Summary
This is a brief introduction to web scraping with Python, using BeautifulSoup and Matplotlib. You can also scrape Amazon or any other website you want data from. Remember that scraping carries risk, so behave responsibly on the Internet.
Thanks for reading!