What Will I Learn?
Greetings! The goal of this tutorial is to use Jsoup to get data from a website and then process it according to your needs. Once you have learned this, you will be able to read the data of a website and present it in your own design: you can get information from your favorite page without visiting it, merge data from different sources, or even compare them. To do that, first download the Jsoup library for Java (a jar file) and add it to your Java project, for example by placing it in the project folder and adding it to the build path. You can then use it through import statements in Java.
Requirements
- An IDE to test the code (preferably Eclipse IDE for Java Developers)
- Basic knowledge of Java.
- Basic knowledge of the Jsoup library.
Difficulty
This tutorial is prepared for individuals who have prior knowledge of Java classes, libraries and programming languages.
- Intermediate
Tutorial Contents
In this tutorial we will pull our data from IMDb and process it according to our needs. There are many ways to read a web page in Java; the fastest and most accurate one is to use the API of the desired site, if one is available. When it is not, we can parse the HTML instead: first we go to the page that holds the data, then we find the div class that we want to pull, and after processing the data we get a plain-text list of movies.
We shall begin by importing the libraries that we want to use in this project.
The first class that we need is java.io.IOException. It is the exception thrown when an input/output (I/O) operation fails, for example when Jsoup cannot reach the website, so our code has to declare or handle it:
import java.io.IOException;
We can then proceed to add the Jsoup classes, which fetch and parse the HTML code of the desired site:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Now we add one last class, which reads values entered by the user. The code in this tutorial does not use it yet, so this import is optional:
import java.util.Scanner;
Then we can declare our class
public class imdb {
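You can pick the class name yourself; it just needs to match the name of the file in your workspace (imdb.java for a class named imdb). Next we define the main method. In public static void, public means the method is visible from everywhere, static means it belongs to the class itself rather than to an object, and void means it returns no value. The throws IOException clause is needed because Jsoup's get() call can fail with that exception.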
public static void main(String[] args) throws IOException
Then we can continue by calling the connect or parse function from the Jsoup library. To fetch a page, calling Jsoup.connect("yourwebsite") followed by get() is enough:
Jsoup.connect(yourwebsite).get();
In this tutorial we will pick data from IMDb and its lists. User-made lists have an export function, but the default search and site lists do not, so with this method we can gather any movie or TV series information from IMDb. To do that we first need the URL of the desired data source. Here we use the list of all German-language movies, found under Most Popular Feature Films With Primary Language German.
There are around 14 thousand movies recorded, so we need to gather data from page one to the last page. In your design you can use this method to gather other data from IMDb or another movie information site. The code below was written to step through all the pages:
int page = 1;
while (page>0)
This loop never ends on its own; it keeps running until we break out of it. This is quite useful when you want to gather data from multiple pages, and here we need to gather all the pages shown in the list. Next we define the link/URL of the desired page:
String link = "http://www.imdb.com/search/title?title_type=feature&primary_language=de&sort=moviemeter,asc&page=" +page+ "&ref_=adv_nxt" ;
Note that the page variable increases on each pass so that we cover the complete list on IMDb. Now we need to open a connection so that Jsoup can fetch the page:
Document doc = Jsoup.connect(link).get();
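The page-dependent link above can also be produced by a small helper, which keeps the string concatenation in one place. This is only a minimal sketch: the class name ImdbSearchUrl and the method forPage are made up for illustration and are not part of Jsoup or IMDb; the URL itself is the one used in this tutorial.

```java
// Sketch: build the search URL for a given result page.
// ImdbSearchUrl and forPage are illustrative names, not a Jsoup or IMDb API.
class ImdbSearchUrl {
    // Base URL copied from the tutorial; only the page number varies.
    private static final String BASE =
        "http://www.imdb.com/search/title?title_type=feature&primary_language=de&sort=moviemeter,asc";

    static String forPage(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1, got " + page);
        }
        return BASE + "&page=" + page + "&ref_=adv_nxt";
    }

    public static void main(String[] args) {
        // Print the URLs of the first three result pages.
        for (int page = 1; page <= 3; page++) {
            System.out.println(forPage(page));
        }
    }
}
```

Rejecting page numbers below 1 is a design choice of this sketch; the tutorial's loop always starts at 1, so it never hits that branch.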
Next we need to declare the element that we want to get data from. Here we could use several classes: header to get the movie name only, muted for the duration of the movie, ratings-bar for ratings and num-votes for the number of votes. Since we want all of these values, we can use the enclosing item-content element; the overall code selects each whole list item with doc.select("div.lister-item.mode-advanced") and stores the result in an Elements variable named initialtable. You can use a different selector to get different values in your design.
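To see how such a selection behaves without contacting IMDb, we can run it on a small HTML string. The sketch below assumes the Jsoup jar is on the classpath. The HTML snippet is a simplified, made-up stand-in for a result page (the inner class names and movie titles are assumptions, not copied from IMDb), and SelectDemo is just an illustrative name.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Sketch: select list items from an HTML string by CSS class.
class SelectDemo {
    // Made-up sample markup; the real page is far larger and may differ.
    static final String SAMPLE_HTML =
        "<div class=\"lister-item mode-advanced\">\n"
      + "  <h3 class=\"lister-item-header\"><a href=\"#\">Example Movie A</a></h3>\n"
      + "  <p class=\"text-muted\">120 min</p>\n"
      + "</div>\n"
      + "<div class=\"lister-item mode-advanced\">\n"
      + "  <h3 class=\"lister-item-header\"><a href=\"#\">Example Movie B</a></h3>\n"
      + "  <p class=\"text-muted\">95 min</p>\n"
      + "</div>\n"
      + "<div class=\"lister-item mode-simple\">not selected</div>\n";

    static Elements selectItems(String html) {
        Document doc = Jsoup.parse(html);
        // Matches only <div> elements that carry both classes.
        return doc.select("div.lister-item.mode-advanced");
    }

    public static void main(String[] args) {
        for (Element item : selectItems(SAMPLE_HTML)) {
            // selectFirst returns the first matching descendant, or null if none.
            Element title = item.selectFirst("h3.lister-item-header a");
            System.out.println(title != null ? title.text() : "(no title)");
        }
    }
}
```

Selecting the whole item first and then narrowing inside it with selectFirst keeps the fields of one movie together, which is handy when you later want to print them side by side.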
Now we can move on to printing the data that we obtained from IMDb. To do that we loop over the Elements collection one Element at a time and call text() on each, which removes all tags and HTML markup from the site's source code and leaves only the text.
for (Element d : initialtable) {
dr = d.text();
This converts each element of initialtable into the text string dr. With this string we can use several methods to produce more user-friendly output: indexOf to locate a specific piece of text, substring to cut the string, or replace to change part of the data. The code below removes the text of the rating panel. In your task you can add more complex or user-requested processing:
dr = dr.replaceAll("Rate this 1 2 3 4 5 6 7 8 9 10", "");
Finally, we can print the obtained output:
System.out.println(dr);
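The string methods mentioned above (indexOf, substring and replace) can be combined as in the following sketch. It uses only the Java standard library. The sample line is invented to resemble the text of one list item, and the class and method names are hypothetical.

```java
// Sketch: clean up the text of one list item with plain String methods.
class CleanItemText {
    static final String RATING_WIDGET = "Rate this 1 2 3 4 5 6 7 8 9 10";

    // Remove the rating-widget text and collapse the whitespace it leaves behind.
    static String stripRatingWidget(String text) {
        return text.replace(RATING_WIDGET, "").replaceAll("\\s+", " ").trim();
    }

    // Return everything before the first space, or the whole string if there is none.
    static String firstToken(String text) {
        int i = text.indexOf(' ');
        return i < 0 ? text : text.substring(0, i);
    }

    public static void main(String[] args) {
        // Invented sample resembling the text() of one item.
        String raw = "1. Example Movie (2017) Rate this 1 2 3 4 5 6 7 8 9 10 7.5";
        String cleaned = stripRatingWidget(raw);
        System.out.println(cleaned);             // prints "1. Example Movie (2017) 7.5"
        System.out.println(firstToken(cleaned)); // prints "1."
    }
}
```

Checking the result of indexOf before calling substring matters: indexOf returns -1 when nothing is found, and substring(0, -1) would throw an exception.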
To go through all pages in the list we also need to increase the page counter inside the while loop:
page++;
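Since the while loop has no end condition of its own, it helps to decide when to stop. One possible approach, sketched below, is to stop at the first page that returns no items and to cap the number of pages as a safety net. PageWalker, collectAll and fakePage are hypothetical names. In this tutorial's setting the fetchPage function would wrap Jsoup.connect(link).get() and doc.select(...); here a fake data source stands in for it so the sketch runs without a network connection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

// Sketch: walk numbered pages until one comes back empty.
class PageWalker {
    static List<String> collectAll(IntFunction<List<String>> fetchPage, int maxPages) {
        List<String> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<String> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break; // no more results: stop instead of looping forever
            }
            all.addAll(items);
        }
        return all;
    }

    // Fake data source: two pages with items, then empty pages.
    static List<String> fakePage(int page) {
        if (page == 1) {
            return Arrays.asList("a", "b");
        }
        if (page == 2) {
            return Arrays.asList("c");
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(collectAll(PageWalker::fakePage, 100)); // prints [a, b, c]
    }
}
```

Whether an empty result really marks the last page depends on how the site responds, so treat that check as an assumption to verify against the real pages.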
Now our code is ready to test. Below is the overall code and the output for the movie list gathered from IMDb. In the next tutorials we will focus on merging this data with office software and improving the output format.
Overall code
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.Scanner; // not used yet

public class imdb {
    public static void main(String[] args) throws IOException {
        int page = 1;
        // Keep requesting result pages until one contains no list items.
        while (page > 0) {
            String link = "http://www.imdb.com/search/title?title_type=feature&primary_language=de&sort=moviemeter,asc&page=" + page + "&ref_=adv_nxt";
            Document doc = Jsoup.connect(link).get();
            // Each movie sits in its own div that carries both of these classes.
            Elements initialtable = doc.select("div.lister-item.mode-advanced");
            if (initialtable.isEmpty()) {
                // Assumption: an empty result means we are past the last page.
                break;
            }
            for (Element d : initialtable) {
                // text() drops all HTML tags and keeps only the visible text.
                String dr = d.text();
                // Remove the text of the rating panel.
                dr = dr.replaceAll("Rate this 1 2 3 4 5 6 7 8 9 10", "");
                System.out.println(dr);
            }
            page++;
        }
    }
}
Sample outputs,
Curriculum
- Making and designing a translator with Jsoup
- Making your own currency tracker with Jsoup!
- Extracting data by using Jsoup
- Improving translators performance by using Jsoup
- Making country fact list by using Jsoup
Posted on Utopian.io - Rewarding Open Source Contributors