Msaghu

Posted on May 10 • Edited on May 15

Python for Data Analysis Cheatsheet

#programming #python #pandas

Pandas is a panda package for data manipulation and visualization. Pandas is built ontop of Numpy and Matlplotlib(data visualization). In pandas, data analysis is done on rectangular data which is represented as dataframes, in SQL they are referred to as tables.
We will be using the data from a csv of 'russian alcohol consumption'.

Exploring a Dataframe using the table

Returns first few rows of the dataframe

print(alcohol_df.head())

To get the names of columns, and their data types and if they contain missing values

print(alcohol_df.info())

To get the number of rows and columns of the datframe displayed as a tuple. This is an attribute, not a method.

print(alcohol_df.shape)

To get the summary statistics of the dataframe;

print(alcohol_df.describe())

To get the values in a 2D NumPy array

print(alcohol_df.values)

To get the column names

print(alcohol_df.columns)

To get row numbers/row names

print(alcohol_df.index)

1. Dataframes

Change order of the rows by sorting them using a particular column. This automatically sorts be ascending order, from smallest to the largest.

sorted_drinks = alcohol_df.sort_values("region")
print(sorted_drinks)

To sort by descending order

sorted_drinks = alcohol_df.sort_values("region", ascending=False)
print(sorted_drinks)

Sort my multiple variables by passing a list to the method

sorted_drinks = alcohol_df.sort_values(["region", "year"], ascending=[True, False])
print(sorted_drinks)

Subset columns to only see the data in a particular column

print(alcohol_df["region"])

To subset multiple columns to only see the data in the particular columns

print(alcohol_df[["region", "year"]])
or 
column_info = ["region", "year"]
print(alcohol_df[column_info])

Subset rows to only see the data in a particular column. It displays the answer in boolean True or False.;

print(alcohol_df["year"]>1999)

To get the whole dataframe displayed with only the year above 2000;

print(alcohol_df[alcohol_df["year"]>"1999"])

or
only where the region is "Chuvash Republic":

print(alcohol_df[alcohol_df["region"]=="Chuvash Republic"])

Combining these conditions to get multiple insights at once;

c_republic = alcohol_df["region"] == "Chuvash Republic"
wine_percent = alcohol_df["wine"] > 2.0
print(alcohol_df[c_republic & wine_percent])

Subset using .isin()

is_chuvash_or_yaroslavl = alcohol_df["region"].isin(["Chuvash Republic","Yaroslavl Oblast"])
print(alcohol_df[is_chuvash_or_yaroslavl])

Adding new column

alcohol_df["wine_m"] = alcohol_df["wine"] / 100
print(alcohol_df.head())

2. Aggregating Data

Calculate column mean

print(alcohol_df["wine"].mean())

We use the same method to calculate .median(), .mode(), .min(), .max(), .var(), .std(), .sum()

To get the earliest year

print(alcohol_df["year"].min())

To get the latest year

print(alcohol_df["year"].max())

Use the .agg() method to compute custom summary statistics. To get the 30th percentile of the wines column

def pct30(column):
    return column.quantile(0.3)
print(alcohol_df["wine"].agg(pct30))

To get the 30th percentile of multiple columns

def pct30(column):
    return column.quantile(0.3)
print(alcohol_df[["wine", "beer", "vodka"]].agg(pct30))

def pct40(column):
    return column.quantile(0.4)
print(alcohol_df["wine"].agg([pct30, pct40]))

Cumulative statistics

Calculating the cumulative sum of a column and return an entire column of the DataFrame
.cummax(), .cummin(), .cumprod()

Remove repeat values

print(alcohol_df.drop_duplicates(subset="region"))

unique = alcohol_df.drop_duplicates(subset=["year","region"])
print(unique)

print(unique["region"].value_counts())
print(unique["region"].value_counts(sort=True))

To turn the counts into proportions of the total

print(unique["region"].value_counts(normalize=True))

Summary statistics

print(alcohol_df.groupby(["year", "region"])["wine"].mean())
print(alcohol_df.groupby("year")["wine"].mean())

print(alcohol_df.groupby("year")["wine"].agg([min, max, sum]))

Pivot Tables

Groups elements together where, values
arguments is the column we want to summarize and the index is the column we want to group by. The pivot will take the mean value of each group

print(alcohol_df.pivot_table(values="wine", index="year"))

Also done as:

import numpy as np
year_percent = alcohol_df.pivot_table(values="wine", index="year", aggfunc=median)
print(year_percent)

import numpy as np
year_percent = alcohol_df.pivot_table(values="wine", index="year", aggfunc=[np.median, np.mean])
print(year_percent)

To group by 2 variables

year_percent = alcohol_df.pivot_table(values="wine", index="year", columns="week")
print(year_percent)

To fill empty values

year_percent = alcohol_df.pivot_table(values="wine", index="year", columns="week", fill_value=0, margins=True)
print(year_percent)

3. Slicing and Indexing Data

Setting a column as the index

alcohol_ind = alcohol_df.set_index("year")
print(alcohol_ind)

Removing an index

print(alcohol_ind.reset_index())

We can drop the index, therefore fully removing it from our dataframe

print(alcohol_ind.reset_index(drop=True))

Subset on index values using loc

print(alcohol_ind.loc[[2000,2003]])

Multi-level indexes

alcohol_ind3 = alcohol_df.set_index(["region", "year"]).sort_index()
print(alcohol_ind3.loc[["Amur Oblast"]])

alcohol_ind4 = alcohol_df.set_index(["region", "year"])
print(alcohol_ind4.loc[["Amur Oblast", "Sakha (Yakutia) Republic"]])

Subsetting with a list of tuples

alcohol_ind4 = alcohol_df.set_index(["region", "year"])
print(alcohol_ind4.loc[[("Amur Oblast", 2000), ("Sakha (Yakutia) Republic", 2000)]])

Controlling the sort index

print(alcohol_ind4 = alcohol_df.sort_index(level=["region", "year"], ascending=[True, False]))

Slicing and subsetting with .loc and .iloc
This works on only outer index levels, if we want to index inner levels, we need to pass tuples
Slicing the outer index level, but first we sort the index

alcohol_srt = alcohol_df.set_index(["year"]).sort_index()
print(alcohol_srt.loc[2000:2005])

Slicing columns

alcohol_srt = alcohol_df.set_index(["year"]).sort_index()
print(alcohol_srt.loc[2000:2005])
print(alcohol_srt.loc[:, "region":"wine"])

Slicing on rows and columns at the same time

#Wine consumption for the year 2000 for "Amur Oblast", "Astrakhan Oblast"

alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[[("Amur Oblast", 2000), ("Sakha (Yakutia) Republic", 2000)]]
print(years_ind.loc[:, "wine":"beer"])

#Wine consumption for the year 2000 for "Amur Oblast", "Astrakhan Oblast"
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[[("Amur Oblast", 2000):("Sakha (Yakutia) Republic", 2000), "wine":"beer"]]
print(years_ind)

# Wine and beer consumption for the year 2000-2010 from the regions "Amur Oblast"only
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[("Amur Oblast", 2000):("Amur Oblast", 2010), "wine":"beer"]
print(years_ind)

Slicing row and column number using iloc method

years_ind = alcohol_df.iloc[2:5, 1:4]
print(years_ind)

4. Creating and Visualizing data

DEV Community