DEV Community

Msaghu
Msaghu

Posted on • Edited on

Python for Data Analysis Cheatsheet

Pandas is a panda package for data manipulation and visualization. Pandas is built ontop of Numpy and Matlplotlib(data visualization). In pandas, data analysis is done on rectangular data which is represented as dataframes, in SQL they are referred to as tables.
We will be using the data from a csv of 'russian alcohol consumption'.

Exploring a Dataframe using the table

Returns first few rows of the dataframe

print(alcohol_df.head())
Enter fullscreen mode Exit fullscreen mode

To get the names of columns, and their data types and if they contain missing values

print(alcohol_df.info())
Enter fullscreen mode Exit fullscreen mode

To get the number of rows and columns of the datframe displayed as a tuple. This is an attribute, not a method.

print(alcohol_df.shape)
Enter fullscreen mode Exit fullscreen mode

To get the summary statistics of the dataframe;

print(alcohol_df.describe())
Enter fullscreen mode Exit fullscreen mode

To get the values in a 2D NumPy array

print(alcohol_df.values)
Enter fullscreen mode Exit fullscreen mode

To get the column names

print(alcohol_df.columns)
Enter fullscreen mode Exit fullscreen mode

To get row numbers/row names

print(alcohol_df.index)
Enter fullscreen mode Exit fullscreen mode

1. Dataframes

Change order of the rows by sorting them using a particular column. This automatically sorts be ascending order, from smallest to the largest.

sorted_drinks = alcohol_df.sort_values("region")
print(sorted_drinks)
Enter fullscreen mode Exit fullscreen mode

To sort by descending order

sorted_drinks = alcohol_df.sort_values("region", ascending=False)
print(sorted_drinks)
Enter fullscreen mode Exit fullscreen mode

Sort my multiple variables by passing a list to the method

sorted_drinks = alcohol_df.sort_values(["region", "year"], ascending=[True, False])
print(sorted_drinks)
Enter fullscreen mode Exit fullscreen mode

Subset columns to only see the data in a particular column

print(alcohol_df["region"])
Enter fullscreen mode Exit fullscreen mode

To subset multiple columns to only see the data in the particular columns

print(alcohol_df[["region", "year"]])
or 
column_info = ["region", "year"]
print(alcohol_df[column_info])
Enter fullscreen mode Exit fullscreen mode

Subset rows to only see the data in a particular column. It displays the answer in boolean True or False.;

print(alcohol_df["year"]>1999)
Enter fullscreen mode Exit fullscreen mode

To get the whole dataframe displayed with only the year above 2000;

print(alcohol_df[alcohol_df["year"]>"1999"])
Enter fullscreen mode Exit fullscreen mode

or
only where the region is "Chuvash Republic":

print(alcohol_df[alcohol_df["region"]=="Chuvash Republic"])
Enter fullscreen mode Exit fullscreen mode

Combining these conditions to get multiple insights at once;

c_republic = alcohol_df["region"] == "Chuvash Republic"
wine_percent = alcohol_df["wine"] > 2.0
print(alcohol_df[c_republic & wine_percent])
Enter fullscreen mode Exit fullscreen mode

Subset using .isin()

is_chuvash_or_yaroslavl = alcohol_df["region"].isin(["Chuvash Republic","Yaroslavl Oblast"])
print(alcohol_df[is_chuvash_or_yaroslavl])
Enter fullscreen mode Exit fullscreen mode

Adding new column

alcohol_df["wine_m"] = alcohol_df["wine"] / 100
print(alcohol_df.head())
Enter fullscreen mode Exit fullscreen mode

2. Aggregating Data

Calculate column mean

print(alcohol_df["wine"].mean())
Enter fullscreen mode Exit fullscreen mode

We use the same method to calculate .median(), .mode(), .min(), .max(), .var(), .std(), .sum()

To get the earliest year

print(alcohol_df["year"].min())
Enter fullscreen mode Exit fullscreen mode

To get the latest year

print(alcohol_df["year"].max())
Enter fullscreen mode Exit fullscreen mode

Use the .agg() method to compute custom summary statistics. To get the 30th percentile of the wines column

def pct30(column):
    return column.quantile(0.3)
print(alcohol_df["wine"].agg(pct30))
Enter fullscreen mode Exit fullscreen mode

To get the 30th percentile of multiple columns

def pct30(column):
    return column.quantile(0.3)
print(alcohol_df[["wine", "beer", "vodka"]].agg(pct30))
Enter fullscreen mode Exit fullscreen mode
def pct40(column):
    return column.quantile(0.4)
print(alcohol_df["wine"].agg([pct30, pct40]))
Enter fullscreen mode Exit fullscreen mode

Cumulative statistics

Calculating the cumulative sum of a column and return an entire column of the DataFrame
.cummax(), .cummin(), .cumprod()

Remove repeat values

print(alcohol_df.drop_duplicates(subset="region"))

unique = alcohol_df.drop_duplicates(subset=["year","region"])
print(unique)

print(unique["region"].value_counts())
print(unique["region"].value_counts(sort=True))
Enter fullscreen mode Exit fullscreen mode

To turn the counts into proportions of the total

print(unique["region"].value_counts(normalize=True))
Enter fullscreen mode Exit fullscreen mode

Summary statistics

print(alcohol_df.groupby(["year", "region"])["wine"].mean())
print(alcohol_df.groupby("year")["wine"].mean())

print(alcohol_df.groupby("year")["wine"].agg([min, max, sum]))
Enter fullscreen mode Exit fullscreen mode

Pivot Tables

Groups elements together where, values
arguments is the column we want to summarize and the index is the column we want to group by. The pivot will take the mean value of each group

print(alcohol_df.pivot_table(values="wine", index="year"))
Enter fullscreen mode Exit fullscreen mode

Also done as:

import numpy as np
year_percent = alcohol_df.pivot_table(values="wine", index="year", aggfunc=median)
print(year_percent)

import numpy as np
year_percent = alcohol_df.pivot_table(values="wine", index="year", aggfunc=[np.median, np.mean])
print(year_percent)
Enter fullscreen mode Exit fullscreen mode

To group by 2 variables

year_percent = alcohol_df.pivot_table(values="wine", index="year", columns="week")
print(year_percent)
Enter fullscreen mode Exit fullscreen mode

To fill empty values

year_percent = alcohol_df.pivot_table(values="wine", index="year", columns="week", fill_value=0, margins=True)
print(year_percent)
Enter fullscreen mode Exit fullscreen mode

3. Slicing and Indexing Data

Setting a column as the index

alcohol_ind = alcohol_df.set_index("year")
print(alcohol_ind)
Enter fullscreen mode Exit fullscreen mode

Removing an index

print(alcohol_ind.reset_index())
Enter fullscreen mode Exit fullscreen mode

We can drop the index, therefore fully removing it from our dataframe

print(alcohol_ind.reset_index(drop=True))
Enter fullscreen mode Exit fullscreen mode

Subset on index values using loc

print(alcohol_ind.loc[[2000,2003]])
Enter fullscreen mode Exit fullscreen mode

Multi-level indexes

alcohol_ind3 = alcohol_df.set_index(["region", "year"]).sort_index()
print(alcohol_ind3.loc[["Amur Oblast"]])

alcohol_ind4 = alcohol_df.set_index(["region", "year"])
print(alcohol_ind4.loc[["Amur Oblast", "Sakha (Yakutia) Republic"]])
Enter fullscreen mode Exit fullscreen mode

Subsetting with a list of tuples

alcohol_ind4 = alcohol_df.set_index(["region", "year"])
print(alcohol_ind4.loc[[("Amur Oblast", 2000), ("Sakha (Yakutia) Republic", 2000)]])
Enter fullscreen mode Exit fullscreen mode

Controlling the sort index

print(alcohol_ind4 = alcohol_df.sort_index(level=["region", "year"], ascending=[True, False]))
Enter fullscreen mode Exit fullscreen mode

Slicing and subsetting with .loc and .iloc
This works on only outer index levels, if we want to index inner levels, we need to pass tuples
Slicing the outer index level, but first we sort the index

alcohol_srt = alcohol_df.set_index(["year"]).sort_index()
print(alcohol_srt.loc[2000:2005])
Enter fullscreen mode Exit fullscreen mode

Slicing columns

alcohol_srt = alcohol_df.set_index(["year"]).sort_index()
print(alcohol_srt.loc[2000:2005])
print(alcohol_srt.loc[:, "region":"wine"])
Enter fullscreen mode Exit fullscreen mode

Slicing on rows and columns at the same time

#Wine consumption for the year 2000 for "Amur Oblast", "Astrakhan Oblast"

alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[[("Amur Oblast", 2000), ("Sakha (Yakutia) Republic", 2000)]]
print(years_ind.loc[:, "wine":"beer"])
Enter fullscreen mode Exit fullscreen mode
#Wine consumption for the year 2000 for "Amur Oblast", "Astrakhan Oblast"
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[[("Amur Oblast", 2000):("Sakha (Yakutia) Republic", 2000), "wine":"beer"]]
print(years_ind)
Enter fullscreen mode Exit fullscreen mode
# Wine and beer consumption for the year 2000-2010 from the regions "Amur Oblast"only
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[("Amur Oblast", 2000):("Amur Oblast", 2010), "wine":"beer"]
print(years_ind)
Enter fullscreen mode Exit fullscreen mode

Slicing row and column number using iloc method

years_ind = alcohol_df.iloc[2:5, 1:4]
print(years_ind)
Enter fullscreen mode Exit fullscreen mode

4. Creating and Visualizing data

Top comments (0)