Data Science's Brain : Better Way to Perform EDA (Exploratory Data Analysis) in Python


Pandas-Profiling : Better Way to Perform EDA (Exploratory Data Analysis) in Python

EDA is a very important part of Data Analysis and Data Visualization. It gives a brief summary of the data and its main characteristics. According to a survey, Data Scientists spend most of their time performing EDA tasks.

EDA involves a lot of steps including some statistical tests, visualization of data using different kinds of plots, and many more. Some of the steps of EDA are discussed below:

Data Quality Check: It can be done using some Pandas library functions and attributes, i.e.

df.describe(), df.shape, df.info(), df.dtypes

These are generally used to inspect the features, data types, non-null counts, and a summary of the data. Missing and duplicate values can be counted with df.isna().sum() and df.duplicated().sum().
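As a minimal sketch of such a check, assuming a small made-up DataFrame (the column names and values are hypothetical and are not taken from the Titanic dataset):

```python
import pandas as pd

# Hypothetical toy data with one missing value and one duplicated row
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 35.0],
    "sex": ["male", "female", "female", "male", "male"],
    "fare": [7.25, 71.28, 7.92, 8.05, 8.05],
})

print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # dtypes is an attribute, so no parentheses
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics of the numerical columns

missing_per_column = df.isna().sum()         # missing values in each column
duplicate_rows = int(df.duplicated().sum())  # rows that repeat an earlier row
print(missing_per_column)
print(duplicate_rows)
```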

Statistical Test: Statistical tests such as Pearson correlation, Spearman correlation, and the Kendall test are used to measure the correlation between features. By correlation I mean how strongly two features vary together. It can be done in Python using the scipy.stats module.
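A short sketch of these three tests with scipy.stats, assuming two hypothetical numerical features x and y (the values are made up for illustration):

```python
from scipy import stats

# Hypothetical paired measurements of two numerical features
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Each function returns the correlation coefficient and a p-value
pearson_r, pearson_p = stats.pearsonr(x, y)        # linear relationship
spearman_rho, spearman_p = stats.spearmanr(x, y)   # rank-based, monotonic relationship
kendall_tau, kendall_p = stats.kendalltau(x, y)    # rank-based, concordant vs. discordant pairs

print(pearson_r, spearman_rho, kendall_tau)
```

All three coefficients lie between -1 and 1; values close to 0 suggest a weak relationship between the two features.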

Quantitative Test: Quantitative measures are used to find the spread of numerical features and the counts of categorical features. They can be computed in Python using the functions of the pandas library.
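For instance, a rough sketch with pandas, again on a hypothetical toy DataFrame (column names and values are assumptions):

```python
import pandas as pd

# Hypothetical toy data: one numerical and one categorical feature
df = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 8.05, 53.10],
    "sex": ["male", "female", "female", "male", "female"],
})

spread = df["fare"].describe()     # count, mean, std, min, quartiles, max
counts = df["sex"].value_counts()  # how often each category occurs

print(spread)
print(counts)
```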

Visualization: Feature visualization is essential to get an understanding of the data. Graphical techniques like bar plots and pie charts are used to understand categorical features, whereas scatter plots and histograms are used for numerical features.
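A minimal plotting sketch, assuming matplotlib is installed and using the same kind of hypothetical toy data (the output file name eda_plots.png is also just an example):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: one categorical and one numerical feature
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Bar plot of category counts for the categorical feature
df["sex"].value_counts().plot(kind="bar", ax=ax_bar, title="sex (categorical)")

# Histogram for the numerical feature
df["fare"].plot(kind="hist", bins=5, ax=ax_hist, title="fare (numerical)")

fig.tight_layout()
fig.savefig("eda_plots.png")
```

In a Jupyter notebook the matplotlib.use("Agg") line can be dropped, since the plots are shown inline.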

To perform the above-mentioned tasks we need to type several lines of code. Here the pandas-profiling open-source library comes into play, which can perform all these tasks using just 1 line of code.

Wow! Just one line of code!🤔

Yes, you read it correctly: only one line of code. It’s possible in Python using the pandas-profiling open-source library. Also, the result of EDA using pandas-profiling can be displayed in a Jupyter notebook or can be converted to an HTML page.

Now, without wasting any time let’s see how to do this😲

Installation:

There are many ways to install the Pandas-profiling library but we’ll use the simplest one using pip:

pip install pandas-profiling

Import libraries:

To use the pandas-profiling library for EDA, we need to import some required libraries:

import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

Now EDA using one line of code:

profile = ProfileReport(pd.read_csv('titanic.csv'), title='Pandas Profiling Report', html={'style': {'full_width': True}}, sort="None")

Yes, that’s it, we’ve completed the exploratory data analysis. Results can be observed in Jupyter Notebook or Google Colab itself, or the report can be saved in HTML format and opened in a web browser.

#to view result in jupyter notebook or google colab
profile.to_widgets()

#to save results of pandas-profiling to an HTML file
profile.to_file("EDA.html")

EDA for the Titanic Dataset:

The dataset used for exploratory data analysis with the pandas-profiling library was downloaded from Kaggle.

Here is a work sample of EDA for the Titanic Dataset:

Exploratory Data Analysis(EDA)
https://gist.github.com/TheSkyFox3006/4181ce62b1d41fb4cfc8d011945cea0e?ref=hackernoon.com

Output:

The output of EDA for the Titanic Dataset will look like this:

[GIF: pandas-profiling report for the Titanic Dataset]

Note:

If you are a beginner in Data Science, I wouldn’t suggest performing EDA using pandas-profiling. I prefer to do my EDA with self-defined functions using several Python libraries.

For beginners, it is good to start doing EDA using the pandas library and writing Python code before trying this library, as it is more important to be equipped with fundamental knowledge and programming practices.
