What will I Learn?
This tutorial covers these topics:
- You will learn about basics of Web Scrapping and python package used for it,
- You will learn to setup environment for this web scrapping project.
- You will learn different functions of BeautifulSoup and create a simple web scrapper.
- You will learn to write extracted data into CSV file.
Requirements:
- A PC/laptop with Windows / OSX/ Linux
- Python 2.7 or Python 3 installed
- Any Python IDE installed
- pip preinstalled
Note: This tutorial is performed in my Laptop with Ubuntu OS 17.1, 64 bit and I used Pycharm IDE.
Setting Up Environment
We will install Beautiful Soup and Requests library using pip. Go to Command line/ Terminal in your laptop/PC
pip install beautifulsoup4
pip install requests
Difficulty
Basic: Anyone with basic knowlwdge of Python can catch up this tutorial.
What is Web Scrapping?
Web Scrapping is the technique to extract the data from websites.It is also known as web harvesting or web data extraction. With web scrapping we can extract data which makes analysis easier. The two steps that are involved in web scrapping are fetching web page and extracting data from it.
What is BeautifulSoup?
BeautifulSoup is a python package which is used to parse HTML and XML documents. BeautifulSoup can't fetch the data, it can only parse the fetched data. To fetch the web page we use Requests package.
Starting the web scrapping tutorial:
To scrape a website first we need to fetch it and then parse it. To do so, we need to import libraries first.
from bs4 import BeautifulSoup
import requests
Now, the libraries are imported, we need to fetch the website using requests.get method. The fetched website is then passed to BeautifulSoup instance, created as soup and can be printed using print function. prettify() formats the fetched HTML nicely.
source = requests.get('http://techprotricks.com').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
Note: This domain techprotricks.com is owned by me and you can use to scrape for learning purpose only.
You can see the fetched html here
If we comment the print function in above code and add these
t = soup.title.text
print(t)
two lines of code then, it will print title of the website.
Now, we start with extracting data from a single article of the website and then we will extract from all articles in the home page.
We will comment the code which extracts title from the website and add other code to print out information of article from html. To find out the class of article in website, we inspect the element in browser and hover over it and we will select the class which covers the area of a single article from home page.
article = soup.find('article')
print(article.prettify())
To view printed article information click here It's a piece of HTML code ectracted from fetched.html which consists information related to first article in the website.
We will extract now headline of article, Category of article, Published date and Summary of the article.
find()
function takes two parameters touniquely identify the needed tag, tag_name as the first argument and the class_ argument which takes the class attribute of the tag.
.text
gives output as plain text
Commenting the print function of article we will add these codes.
headline = article.h2.a.text
cat = article.find('div', class_ = 'above-entry-meta').a.text
pubtime = article.find('div', class_ = 'below-entry-meta').time.text
summary = article.find('div', class_ = 'entry-content clearfix').p.text
print(headline)
print(cat)
print(pubtime)
print(summary)
We have extracted these data from a single post. Now we will use a loop and change few line of codes to do this for all post in a home page. Changes and result can be seen on the screenshot below.
Writing extracted data to CSV files.
CSV stands for Comma Separated Values. These file stores tabular data in plain text separated by commas.
We will create a csv file in the same directory as of our project and open it by adding this line of code:
csv_file = open('techprotricks.csv','w')
Running this line will automatically create techprotricks.csv file and opens it.
The name of the file is "techprotricks.csv" and 'w' in above code represents 'write' which allows our program to write on the CSV file we made. Now, we have to add data inside CSV so we use the writer function.
writer()
function will create an writing object.
writerow()
function will iterate data in rows.
Our final code looks like:
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get('http://techprotricks.com').text
soup = BeautifulSoup(source, 'html.parser')
csv_file = open('techprotricks.csv','w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline','Category','Published Date','Summary'])
for article in soup.find_all('article'):
headline = article.h2.a.text
print(headline)
cat = article.find('div', class_='above-entry-meta').a.text
print(cat)
pubtime = article.find('div', class_='below-entry-meta').time.text
print(pubtime)
summary = article.find('div', class_='entry-content clearfix').p.text
print(summary)
csv_writer.writerow([headline, cat, pubtime, summary ])
csv_file.close()
Check the CSV file here
We can view this CSV file in any spreadsheet applications like MS Excel, LibreOffice Calc, etc. in tabular form.
You can download the source code of this project from my github repo. Click here to download.
Posted on Utopian.io - Rewarding Open Source Contributors
Thank you for the contribution. It has been approved.
You can contact us on Discord.
[utopian-moderator]
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
@flauwy Thanks.
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
python lover
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Are you?
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
yeah :)
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Great to hear
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
While I only understood bits and pieces of this (and only because I started using computers back in the days that MBasic was a thing - LOL), I have to say that this tutorial seems very easy to follow, and I imagine the screenshots would be very helpful while figuring out how to set up a scrapping service for the first time. Nicely done, @nilesh.katuwal!
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
@traciyork Well all the things are explained as much as I cann. Feedsback are appreciated. I thought that setting up environment is much easy as apart from screenshot everything is explained.I will cover those all things in coming tutorials. Thank You.
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Oh, you did a wonderful job explaining things, but web scraping isn't something I know anything about. Kind of like a football fan reading about how to throw a pitch in baseball, not ever having played - I know enough about
the gameprogramming to appreciate how helpful your guide would be toother baseball playerspeople learning how to scrape the web. 😊Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
@traciyork come to the field of Baseball than, you'll know how beautiful it is to play. Hope I'll be of your help.
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Thanks!
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Welcome
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Good try bro ! Seems like you are good in programming !
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
@welcometonepal Thanks for appreciation.
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
You may want to participate in our next contest ! Only 1 day remaining !
https://steemit.com/nepal/@teamnepal/another-great-opportunity-for-our-followers-and-result-of-last-competition
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Lol
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit
Hey @nilesh.katuwal I am @utopian-io. I have just upvoted you!
Achievements
Suggestions
Get Noticed!
Community-Driven Witness!
I am the first and only Steem Community-Driven Witness. Participate on Discord. Lets GROW TOGETHER!
Up-vote this comment to grow my power and help Open Source contributions like this one. Want to chat? Join me on Discord https://discord.gg/Pc8HG9x
Downvoting a post can decrease pending rewards and make it less visible. Common reasons:
Submit