
Exploratory Data Analysis (EDA) Using Python

Basic Examples about exploratory data analysis and data visualization in Python

12 min read · Jan 22, 2024

Introduction

Exploratory data analysis (EDA) is an important step in the data analysis process that helps us understand a dataset better. Through EDA, we can understand the main features of the data, the relationships between variables, and which variables are relevant to our problem. EDA also helps us identify and handle missing or duplicate values, outliers, and errors in the data.

For this post, we're using Python to do the data analysis because it has many libraries that help with EDA, such as pandas, numpy, matplotlib, and seaborn. Pandas is a library for data manipulation and analysis, numpy is a library for numerical computing, and matplotlib and seaborn are libraries for data visualization.

For this project, we will analyze the Goodreads Choice Awards Best Books of 2023.

You can find this dataset on Kaggle to perform the analysis and start practicing.

Understanding the data

Import necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Load the dataset into a pandas DataFrame

#The data is in a CSV file, so we use pandas' read_csv function to load it
df = pd.read_csv('Good_Reads_Book_Awards_Crawl_2023_12_27_11_14.csv')

df.sample(5) #show 5 random rows of the data

Then we're going to remove some unnecessary columns from the dataset. This step is optional, but since we won't be using those columns, it's best to remove them to reduce the size of our DataFrame.

#The Unused columns are source_URL, Book Description, and About the Author
df.drop(['source_URL','Book Description','About the Author'],axis=1, inplace=True)

Checking the DataFrame

Now we’ll check the data types for each column and check the summary of the numerical columns so we can determine our next action.

df.info()
The DataFrame has 12 columns in total, with 299 rows

From .info(), the dataset looks good, without any missing values. It also gives us some information such as the shape of our dataset (12 columns and 299 rows) and the data type of each column.

df.describe()
Numerical statistics of our DataFrame

The .describe() method gives us summary statistics for the numerical columns in our DataFrame. It shows the count, mean, standard deviation, minimum, maximum, and quartile values (including the median) of each numerical column.

Downsizing the Int/Float and Assigning the Data Types

Once we have seen the shape and numerical composition of our data, we can decide on the subsequent steps in our analysis. From .info(), we know the size of our data, which is 28.2 KB, and the data types of each column. The .describe() method shows us the statistics of the numerical columns, such as the minimum, maximum, and average value of each column.

From these results, we can see that some columns are missing from the .describe() output, like Number of Ratings and Number of Reviews, which should be numerical columns. It turns out those columns use a comma "," as the thousands separator, so pandas read them as text. Other columns, like Readers Choice Votes, don't have thousands separators because they are stored as plain numbers. Storing plain numbers is generally the safest option, since a value might be an identifier or some multi-digit number for which it wouldn't be appropriate to separate the digits. So, to convert the text columns to numbers, we need to remove the commas.

numeric_columns = ['Number of Ratings','Number of Reviews']

#Remove the commas from those columns and convert to int32
for column in numeric_columns:
    df[column] = df[column].replace(',', '', regex=True).astype('int32')

This removes all the commas from those columns. You can omit the .astype('int32') part, because pandas will then automatically assign the int64 data type. The numbers 64 and 32 indicate how many bits of memory each value occupies.

You can keep it as it is, but to make our DataFrame more efficient we're downcasting those values to a smaller type, int32. Let's take a look at the range of each of the numerical columns below:

Numerical statistics of our latest DataFrame

It shows the smallest and largest value of each numerical column. Take the Readers Choice Votes column as an example: the smallest value is 935 and the largest is 397,565.

By knowing the range of values, we can determine how many bits are needed to store them. For reference:

  1. Int8 variables can hold values ranging from -128 to 127.
  2. Int16 variables can hold values ranging from -32,768 to 32,767.
  3. Int32 variables can hold values ranging from -2,147,483,648 to 2,147,483,647.
  4. Int64 variables can store values ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
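These ranges aren't something you need to memorize; numpy can report them directly. A quick sketch:

```python
import numpy as np

# Query the storable range of each integer dtype
for dtype in ['int8', 'int16', 'int32', 'int64']:
    info = np.iinfo(dtype)
    print(f"{dtype}: {info.min:,} to {info.max:,}")
```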

Int32 is the best option because it fits the value range. We could still use int64, but that would waste memory and make our DataFrame less efficient.

For floats, it's a bit different, because the choice affects how many decimal digits our data can reliably store. Float16 keeps roughly 3 significant decimal digits, float32 roughly 6, and float64 roughly 15. Here we'll use float16, since our values only need a couple of decimal places and we want to keep them close to the original, though be aware that float16 can introduce tiny rounding differences.
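The approximate precision of each float type can be checked with numpy's finfo; its `precision` attribute reports the number of decimal digits the type can reliably represent:

```python
import numpy as np

# Approximate count of reliable decimal digits for each float dtype
for dtype in ['float16', 'float32', 'float64']:
    info = np.finfo(dtype)
    print(f"{dtype}: ~{info.precision} decimal digits")
```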

Now that we know the value range of each column, we will assign each of those columns into the proper data types.

There are also some columns that store text values. We can assign those columns the string or category data type. According to the pandas documentation, the categorical data type is useful when a string variable consists of only a few different values, for example gender, social class, blood type, or country affiliation. By that definition, the Readers Choice Category column is the best fit for the categorical data type.
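To see why category saves memory, here's a minimal sketch on made-up labels (the actual category names in our dataset differ): pandas stores each unique label once and keeps only a small integer code per row.

```python
import pandas as pd

# A hypothetical low-cardinality column: 3 unique labels, many rows
s = pd.Series(['Fiction', 'Romance', 'Fantasy'] * 100, dtype='string')
cat = s.astype('category')

# The categorical version stores each label once plus small per-row codes
print(s.memory_usage(deep=True) > cat.memory_usage(deep=True))  # True
```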

#Convert the rest of the columns to the correct data types
convert_dict = {'Readers Choice Votes': 'int32',
                'Readers Choice Category': 'category',
                'Title': 'string',
                'Author': 'string',
                'Total Avg Rating': 'float16',
                'Number of Pages': 'int16',
                'Edition': 'category',
                'First Published date': 'datetime64[ns]',
                'Kindle Price': 'float16'}
df = df.astype(convert_dict)

For the 'Kindle Version and Price' column, we will remove the price, as we already have a separate 'Kindle Price' column that stores the price data.

#Extract the text part (the version) into a new column, leaving the price behind
df['Kindle Version'] = df['Kindle Version and Price'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()

#Change the column into correct data type
df['Kindle Version'] = df['Kindle Version'].astype('category')

#Remove the previous column
df = df.drop('Kindle Version and Price', axis=1)
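To see what the extraction does, here is a sketch on made-up values (the real 'Kindle Version and Price' entries may differ): the regex grabs the first run of letters and spaces, and .strip() drops any trailing whitespace left before the price.

```python
import pandas as pd

# Hypothetical examples of the combined text-and-price column
sample = pd.Series(['Kindle Edition $9.99', 'Kindle $12.49'])
version = sample.str.extract('([a-zA-Z ]+)', expand=False).str.strip()
print(version.tolist())  # ['Kindle Edition', 'Kindle']
```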

Now let's take another look at our DataFrame to see whether the data types have changed:

df.info()
The DataFrame after changing data types and removing some columns
df.describe()
Numerical statistics after changing data types
df.sample(10)
A sample of the data after data types changes

As we can see in the latest results, we greatly reduced the memory size of our DataFrame by changing the data types and downcasting integers and floats. Note that these steps are optional, but they can be very useful when working with a large dataset.
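The saving is easy to confirm on a toy column (a sketch with hypothetical data, not our dataset): memory_usage(deep=True) reports the bytes before and after the downcast.

```python
import numpy as np
import pandas as pd

# A toy column stored as int64, then downcast to int32
demo = pd.DataFrame({'votes': np.arange(300, dtype='int64')})
before = demo.memory_usage(deep=True).sum()
demo['votes'] = demo['votes'].astype('int32')
after = demo.memory_usage(deep=True).sum()
print(before, after)  # the int32 column takes half the bytes of the int64 one
```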

Analyzing and Visualizing the Data

Category Distribution

The first analysis is to find the distribution of books across the different categories in the dataset. Then we will visualize it using seaborn.

cat_counts = df['Readers Choice Category'].value_counts()
print(cat_counts)

plt.figure(figsize=(12, 6))
sns.barplot(x=cat_counts.index, y=cat_counts.values, palette='Blues_d')
plt.title('Distribution of Books Across Categories')
plt.xlabel('Category')
plt.ylabel('Number of Books')
plt.xticks(rotation=30, ha='right')
plt.show()

Our data is fairly evenly distributed across categories, with the exception of the Debut Novel category, which only has 19 books.

Next we will analyze the distribution of votes, ratings, reviews, pages, and price for each category, using boxplots to plot the distributions.

fig, axes = plt.subplots(3, 2, figsize=(16, 18), sharey=False, sharex=True)

# First plot Distributions of Readers Choice Votes
sns.boxplot(data=df, x='Readers Choice Category', y='Readers Choice Votes', palette='Set3', ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Readers Choice Votes for Each Category')
axes[0, 0].set_ylabel('Votes')

# Second plot Distribution of Average Ratings
sns.boxplot(data=df, x='Readers Choice Category', y='Total Avg Rating', palette='Set3', ax=axes[0, 1])
axes[0, 1].set_title('Distribution of Average Ratings for Each Category')
axes[0, 1].set_ylabel('Avg Ratings')

# Third plot Distribution of Number of Ratings
sns.boxplot(data=df, x='Readers Choice Category', y='Number of Ratings', palette='Set3', ax=axes[1, 0])
axes[1, 0].set_title('Distribution of Number of Ratings for Each Category')
axes[1, 0].set_ylabel('Ratings')

# Fourth plot Distribution of Number of Reviews
sns.boxplot(data=df, x='Readers Choice Category', y='Number of Reviews', palette='Set3', ax=axes[1, 1])
axes[1, 1].set_title('Distribution of Number of Reviews for Each Category')
axes[1, 1].set_ylabel('Reviews')

# Fifth plot Distribution of Number of Pages
sns.boxplot(data=df, x='Readers Choice Category', y='Number of Pages', palette='Set3', ax=axes[2, 0])
axes[2, 0].set_title('Distribution of Number of Pages for Each Category')
axes[2, 0].set_ylabel('Pages')

# Sixth plot Distribution of Kindle Price
sns.boxplot(data=df, x='Readers Choice Category', y='Kindle Price', palette='Set3', ax=axes[2, 1])
axes[2, 1].set_title('Distribution of Kindle Price ($) for Each Category')
axes[2, 1].set_ylabel('Kindle Price ($)')

for ax in axes[2, :]:
    ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha='right')

fig.tight_layout()
plt.show()
Category Distribution

As you can see, most of the distributions are skewed, with some extreme outliers in certain categories, except for the average ratings distribution, where the data is roughly symmetric. The best way to measure central tendency in skewed data is the median, because it is less sensitive to extreme values and outliers.
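A toy example (with made-up vote counts, not our dataset) shows why the median is preferred here: a single extreme value drags the mean far away from the bulk of the data, while the median barely moves.

```python
import pandas as pd

# Hypothetical skewed vote counts with one extreme outlier
votes = pd.Series([900, 1000, 1100, 1200, 397565])
print(votes.mean())    # 80353.0 - dragged up by the outlier
print(votes.median())  # 1100.0 - robust to the outlier
```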

Analyzing by Category

We'll now examine each category's votes, ratings, reviews, pages, and price, and find the Most Popular Category of 2023.

#Determine which columns we want to aggregate
aggregations = {'Readers Choice Votes': 'sum',
                'Total Avg Rating': 'mean',
                'Number of Ratings': 'sum',
                'Number of Reviews': 'sum',
                'Number of Pages': 'median',
                'Kindle Price': 'median',
                }

#Group by book category
category_vote = df.groupby('Readers Choice Category').agg(aggregations).sort_values('Readers Choice Votes', ascending=False)

# Calculate the percentage of total votes, total ratings, and total reviews for each category
total_votes = category_vote['Readers Choice Votes'].sum()
total_ratings = category_vote['Number of Ratings'].sum()
total_reviews = category_vote['Number of Reviews'].sum()
percent_of_total_votes = (category_vote['Readers Choice Votes'] / total_votes) * 100
percent_of_total_ratings = (category_vote['Number of Ratings'] / total_ratings) * 100
percent_of_total_reviews = (category_vote['Number of Reviews'] / total_reviews) * 100

# Create a new DataFrame of Votes, Ratings, and Reviews
result_df = pd.DataFrame({
    'Votes (sum)': category_vote['Readers Choice Votes'],
    '% Votes': percent_of_total_votes,
    'Avg Ratings': category_vote['Total Avg Rating'].round(2),
    'Number of Ratings': category_vote['Number of Ratings'],
    '% of Total Ratings': percent_of_total_ratings.round(2),
    'Number of Reviews': category_vote['Number of Reviews'],
    '% of Total Reviews': percent_of_total_reviews.round(2),
    'Median Pages': category_vote['Number of Pages'],
    'Median Kindle Price': category_vote['Kindle Price'].round(2)
})

#Find the most voted category
max_voted_cat = result_df['Votes (sum)'].idxmax()
max_votes = result_df['Votes (sum)'].max()

#Find the most rated category (and its average rating)
max_rated_cat = result_df['Number of Ratings'].idxmax()
max_rates = result_df['Number of Ratings'].max()
pct_max_rates = result_df['% of Total Ratings'].max()
avg_rat = result_df.loc[max_rated_cat, 'Avg Ratings']

#Find the most reviewed category
max_reviewed_cat = result_df['Number of Reviews'].idxmax()
max_reviews = result_df['Number of Reviews'].max()
pct_max_reviews = result_df['% of Total Reviews'].max()

#Print the results
print(f"The category '{max_voted_cat}' is The Most Voted Category of 2023, with {max_votes:,} votes")
print(f"The category '{max_rated_cat}' is The Most Rated Category of 2023, having an average rating of {avg_rat:.2f} and {max_rates:,} ratings, or {pct_max_rates:.2f}% of total ratings")
print(f"The category '{max_reviewed_cat}' is The Most Reviewed Category of 2023, with {max_reviews:,} reviews, or {pct_max_reviews:.2f}% of total reviews")

result_df
The Most Voted, Most Rated, and Most Reviewed Category in 2023

Next, we are going to plot that data to have a better understanding and visualization.

fig, axes = plt.subplots(3, 2, figsize=(16, 18), sharey=False)

# First plot
sns.barplot(x=result_df.index, y=result_df['Votes (sum)'], palette='Blues_d', order=result_df.index, ax=axes[0, 0])
axes[0, 0].set_title('Readers Choice Votes for Each Category')
axes[0, 0].set_ylabel('Votes')
axes[0, 0].set_xticklabels(labels=result_df.index, rotation=30, ha='right')

# Second plot
result_df_sorted = result_df.sort_values(by='Avg Ratings', ascending=False)
sns.barplot(x=result_df_sorted.index, y=result_df_sorted['Avg Ratings'], palette='Blues_d', order=result_df_sorted.index, ax=axes[0, 1])
axes[0, 1].set_title('Average Ratings for Each Category')
axes[0, 1].set_ylabel('Avg Ratings')
axes[0, 1].set_xticklabels(labels=result_df_sorted.index, rotation=30, ha='right')

# Third plot
result_df_sorted = result_df.sort_values(by='Number of Ratings', ascending=False)
sns.barplot(x=result_df_sorted.index, y=result_df_sorted['Number of Ratings'], palette='Blues_d', order=result_df_sorted.index, ax=axes[1, 0])
axes[1, 0].set_title('Number of Ratings for Each Category')
axes[1, 0].set_ylabel('Ratings')
axes[1, 0].set_xticklabels(labels=result_df_sorted.index, rotation=30, ha='right')

# Fourth plot
result_df_sorted = result_df.sort_values(by='Number of Reviews', ascending=False)
sns.barplot(x=result_df_sorted.index, y=result_df_sorted['Number of Reviews'], palette='Blues_d', order=result_df_sorted.index, ax=axes[1, 1])
axes[1, 1].set_title('Number of Reviews for Each Category')
axes[1, 1].set_ylabel('Reviews')
axes[1, 1].set_xticklabels(labels=result_df_sorted.index, rotation=30, ha='right')

# Fifth plot
result_df_sorted = result_df.sort_values(by='Median Pages', ascending=False)
sns.barplot(x=result_df_sorted.index, y=result_df_sorted['Median Pages'], palette='Blues_d', order=result_df_sorted.index, ax=axes[2, 0])
axes[2, 0].set_title('Median Pages for Each Category')
axes[2, 0].set_ylabel('Pages')
axes[2, 0].set_xticklabels(labels=result_df_sorted.index, rotation=30, ha='right')

# Sixth plot
result_df_sorted = result_df.sort_values(by='Median Kindle Price', ascending=False)
sns.barplot(x=result_df_sorted.index, y=result_df_sorted['Median Kindle Price'], palette='Blues_d', order=result_df_sorted.index, ax=axes[2, 1])
axes[2, 1].set_title('Median Kindle Price for Each Category')
axes[2, 1].set_ylabel('Kindle Price ($)')
axes[2, 1].set_xticklabels(labels=result_df_sorted.index, rotation=30, ha='right')

plt.tight_layout()
plt.show()

So, there you go. Despite not having the highest average rating, Romance takes the title of Most Popular Book Category of 2023, surpassing every other category in votes, ratings, and reviews. It has roughly twice as many ratings and reviews as the second-place category. Meanwhile, Humor and History & Biography rank as the two least popular book categories of 2023.

The price is very similar across categories, with the exceptions of Romance and Romantasy, which have the lowest median prices of all categories despite their high numbers of votes, ratings, and reviews.

Finding Correlations

Now comes the question: are there any correlations between votes, reviews, ratings, or even the number of pages and the price? Do more pages mean a higher rating? Does a lower price mean more reviews and higher ratings? Let's find out.

# Assign the columns
columns_of_interest = ['Number of Reviews', 'Number of Ratings', 'Number of Pages', 'Total Avg Rating', 'Readers Choice Votes', 'Kindle Price']

# Calculate the correlation matrix
correlation_matrix = df[columns_of_interest].corr()

# Display the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')
plt.xticks(rotation=30, ha='right')
plt.show()
Correlation Matrix

From this matrix, Readers Choice Votes has high correlations with Number of Reviews and Number of Ratings: more reviews and ratings go along with more votes. Pages and price don't have a strong connection with votes, ratings, or reviews, which means the price and thickness of a book don't really affect them.
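As a sanity check on how .corr() behaves, here's a toy sketch with made-up columns (not our dataset): two columns that rise together score close to 1, while an unrelated column scores much closer to 0.

```python
import pandas as pd

# Toy data: 'votes' and 'reviews' move together, 'pages' does not
demo = pd.DataFrame({
    'votes':   [100, 200, 300, 400, 500],
    'reviews': [10, 21, 29, 42, 50],
    'pages':   [320, 150, 410, 280, 190],
})
corr = demo.corr()
print(round(corr.loc['votes', 'reviews'], 2))  # 1.0 - strongly correlated
print(corr.loc['votes', 'pages'])              # much weaker
```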

Analyzing by Books

It's time to find out which book takes the title of Most Voted Book of 2023.

most_voted_books = df[['Title', 'Readers Choice Category', 'Readers Choice Votes', 'Total Avg Rating', 'Number of Ratings', 'Number of Reviews', 'Number of Pages']].sort_values(by=['Readers Choice Votes', 'Number of Ratings', 'Number of Reviews'], ascending=False).head(20)

plt.figure(figsize=(14, 6))
sns.barplot(x=most_voted_books['Title'], y=most_voted_books['Readers Choice Votes'], data=most_voted_books, palette='Blues_d')
plt.title('Most Voted Books in 2023')
plt.xlabel('Book Title')
plt.ylabel('Votes')
plt.xticks(rotation=30, ha='right')
plt.show()

most_voted_books

So, the winner has been determined. Fourth Wing dominates the 2023 Readers Choice Vote as the Most Popular Book, receiving almost twice as many votes as second-place Yellowface and close to a million ratings. It takes over half of the votes in the Romantasy category alone.

Fourth Wing by Rebecca Yarros

Now let’s take a look at the winners from every category.

max_votes_index = df.groupby('Readers Choice Category')['Readers Choice Votes'].idxmax()
titles_with_max_votes = df.loc[max_votes_index, ['Readers Choice Category', 'Title', 'Readers Choice Votes', 'Total Avg Rating', 'Number of Ratings', 'Number of Reviews', 'Number of Pages']].sort_values('Readers Choice Votes', ascending=False)
titles_with_max_votes

Next, we will analyze by month of publication, using a barplot to find out how many books were released each month in 2023.

import calendar
df['First Published date'] = pd.to_datetime(df['First Published date'])

#Get only the books from the year 2023
books_2023 = df[df['First Published date'].dt.year == 2023]

#Count how many books were released each month
books_per_month = books_2023.groupby(books_2023['First Published date'].dt.month)['Title'].count().reset_index()
books_per_month['Month'] = books_per_month['First Published date'].apply(lambda x: calendar.month_abbr[x])

plt.figure(figsize=(14, 8))
sns.barplot(data=books_per_month, x='Month', y='Title', palette='Blues_d')
plt.title('Distribution of Books Published by Date in 2023')
plt.xlabel('Month')
plt.ylabel('Number of Books Published')
plt.show()

books_per_month[['Month','Title']]
Number of books released every month in 2023

From this plot, we find that November had the fewest books released, while September and January had the most.

Conclusion

From our analysis, we determined which categories were the most and least popular in 2023. We also performed distribution and correlation analyses across pages, votes, ratings, and reviews to find out whether there is any connection between those values.

What I've shown you only scratches the surface of what Python can do as a tool for data analysis and visualization. In this article, we covered the basic but pivotal steps that give a deeper understanding of a dataset, using libraries such as pandas for analysis and matplotlib/seaborn for visualization.

Thank you for reading my article. I hope you enjoyed reading it and I hope I’ve helped you to understand Exploratory Data Analysis in Python.

You can find my full code here on my Github.


Written by Joseph M. Tandiallo

Data Analytic & Web Scraping Enthusiast
