Web Scraping with BeautifulSoup


What will I Learn?
This tutorial covers these topics:

  • You will learn the basics of web scraping and the Python package used for it.
  • You will learn to set up the environment for this web scraping project.
  • You will learn different functions of BeautifulSoup and create a simple web scraper.
  • You will learn to write the extracted data to a CSV file.

Requirements:

  • A PC/laptop with Windows, OSX, or Linux
  • Python 2.7 or Python 3 installed
  • Any Python IDE installed
  • pip preinstalled

Note: This tutorial was done on my laptop running 64-bit Ubuntu 17.1, using the PyCharm IDE.

Setting Up Environment

We will install the Beautiful Soup and Requests libraries using pip. Open the command line/terminal on your laptop/PC and run:

pip install beautifulsoup4
pip install requests
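
Note: one of the examples below passes 'lxml' to BeautifulSoup as the parser. lxml is a separate package, so if you want to use it, install it the same way (otherwise Python's built-in 'html.parser' works without any extra install):

pip install lxml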

Difficulty
Basic: Anyone with a basic knowledge of Python can follow this tutorial.

What is Web Scraping?
Web scraping is a technique for extracting data from websites. It is also known as web harvesting or web data extraction. With web scraping we can extract data in a form that makes analysis easier. The two steps involved in web scraping are fetching the web page and extracting data from it.

What is BeautifulSoup?
BeautifulSoup is a Python package used to parse HTML and XML documents. BeautifulSoup can't fetch the data itself; it can only parse data that has already been fetched. To fetch the web page, we use the Requests package.

Starting the web scraping tutorial:
To scrape a website, we first need to fetch it and then parse it. To do so, we need to import the libraries first.

from bs4 import BeautifulSoup
import requests

Now that the libraries are imported, we need to fetch the website using the requests.get method. The fetched page is then passed to a BeautifulSoup instance, created as soup, which can be printed using the print function. prettify() formats the fetched HTML nicely.

source = requests.get('http://techprotricks.com').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())

Note: The domain techprotricks.com is owned by me, and you can scrape it for learning purposes only.
You can see the fetched HTML here
If we comment out the print call in the code above and add these two lines of code:

t = soup.title.text
print(t)

then it will print the title of the website.
Screenshot
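
If you want to experiment without fetching a live site, the same calls work on any HTML string. Here is a minimal standalone sketch (the markup is made up for illustration):

from bs4 import BeautifulSoup

html = '<html><head><title>My Page</title></head><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)               # the whole <title> tag
print(soup.title.text)          # just the text: My Page
print(soup.find('title').text)  # find() returns the same single tag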

Now, we start by extracting data from a single article of the website, and then we will extract data from all the articles on the home page.
We will comment out the code that extracts the title from the website and add code to print out the information of an article from the HTML. To find the class of an article on the website, we inspect the element in the browser, hover over it, and select the class that covers the area of a single article on the home page.
Screenshot

article = soup.find('article')
print(article.prettify())

To view the printed article information, click here. It is a piece of HTML extracted from the fetched page and contains the information related to the first article on the website.
Next, we will extract the headline, category, published date, and summary of the article.

find()

The find() function takes two parameters to uniquely identify the needed tag: the tag name as the first argument and the class_ keyword argument, which takes the class attribute of the tag.

.text

The .text attribute returns the content of a tag as plain text.
Commenting out the print call for article, we will add the following code.

headline = article.h2.a.text                                            # text of the link inside the <h2> heading
cat = article.find('div', class_='above-entry-meta').a.text             # category link text
pubtime = article.find('div', class_='below-entry-meta').time.text      # published date from the <time> tag
summary = article.find('div', class_='entry-content clearfix').p.text   # first paragraph of the entry content
print(headline)
print(cat)
print(pubtime)
print(summary)

Screenshot
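
If you want to see how find() with the class_ argument behaves without fetching the live site, here is a self-contained sketch on a simplified piece of markup. The class names mirror the ones used above, but the HTML itself is made up for illustration and is not the real theme markup:

from bs4 import BeautifulSoup

# Simplified markup imitating the structure targeted above (illustration only)
html = '''
<article>
  <div class="above-entry-meta"><a href="#">Tech</a></div>
  <h2><a href="#">Sample headline</a></h2>
  <div class="below-entry-meta"><time>January 1, 2018</time></div>
  <div class="entry-content clearfix"><p>A short summary paragraph.</p></div>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article')
print(article.h2.a.text)                                              # Sample headline
print(article.find('div', class_='above-entry-meta').a.text)          # Tech
print(article.find('div', class_='below-entry-meta').time.text)       # January 1, 2018
print(article.find('div', class_='entry-content clearfix').p.text)    # A short summary paragraph.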

We have extracted this data from a single post. Now we will use a loop and change a few lines of code to do this for all posts on the home page. The changes and the result can be seen in the screenshot below.
Screenshot
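
For reference, the change shown in the screenshot follows this pattern: replace soup.find('article') with a loop over soup.find_all('article') and move the extraction lines inside it. This is a sketch; the full version appears in the final code below.

for article in soup.find_all('article'):
    headline = article.h2.a.text
    cat = article.find('div', class_='above-entry-meta').a.text
    pubtime = article.find('div', class_='below-entry-meta').time.text
    summary = article.find('div', class_='entry-content clearfix').p.text
    print(headline)
    print(cat)
    print(pubtime)
    print(summary)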

Writing extracted data to CSV files.
CSV stands for Comma-Separated Values. These files store tabular data in plain text, separated by commas.
We will create a CSV file in the same directory as our project and open it by adding this line of code:

csv_file = open('techprotricks.csv','w')

Running this line will automatically create the techprotricks.csv file and open it.
The name of the file is "techprotricks.csv", and the 'w' in the code above stands for 'write', which allows our program to write to the CSV file we made. Now we have to add data to the CSV, so we use the csv module's writer function.

writer()

The writer() function creates a writer object for the open file.

writerow()

The writerow() method writes one row of data at a time.
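
As a quick standalone illustration of writer() and writerow() (the file name and values here are just examples):

import csv

# Minimal example of csv.writer / writerow
with open('example.csv', 'w', newline='') as f:   # newline='' avoids blank lines on Windows
    writer = csv.writer(f)
    writer.writerow(['Name', 'Score'])    # header row
    writer.writerow(['Alice', 90])        # one data row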

Our final code looks like this:

from bs4 import BeautifulSoup
import requests
import csv

# Fetch the home page and parse it
source = requests.get('http://techprotricks.com').text
soup = BeautifulSoup(source, 'html.parser')

# Create the CSV file; newline='' prevents blank lines between rows on Windows
csv_file = open('techprotricks.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Headline', 'Category', 'Published Date', 'Summary'])  # header row

# Extract the data from every article on the home page and write it to the CSV
for article in soup.find_all('article'):
    headline = article.h2.a.text
    print(headline)
    cat = article.find('div', class_='above-entry-meta').a.text
    print(cat)
    pubtime = article.find('div', class_='below-entry-meta').time.text
    print(pubtime)
    summary = article.find('div', class_='entry-content clearfix').p.text
    print(summary)
    csv_writer.writerow([headline, cat, pubtime, summary])

csv_file.close()

Check the CSV file here
We can view this CSV file in tabular form in any spreadsheet application, such as MS Excel or LibreOffice Calc.
You can download the source code of this project from my GitHub repo. Click here to download.



Posted on Utopian.io - Rewarding Open Source Contributors
