Pandas is a Python package for data manipulation and visualization. Pandas is built on top of NumPy and Matplotlib (which handles data visualization). In pandas, data analysis is done on rectangular data represented as DataFrames; in SQL, these are referred to as tables.
We will be using data from a CSV file on Russian alcohol consumption.
Exploring a DataFrame
To return the first few rows of the DataFrame
print(alcohol_df.head())
To get the column names, their data types, and whether they contain missing values. This method prints its report directly, so it does not need to be wrapped in print().
alcohol_df.info()
To get the number of rows and columns of the DataFrame as a tuple. This is an attribute, not a method, so it takes no parentheses.
print(alcohol_df.shape)
To get summary statistics for the numeric columns of the DataFrame
print(alcohol_df.describe())
To get the values in a 2D NumPy array
print(alcohol_df.values)
To get the column names
print(alcohol_df.columns)
To get row numbers/row names
print(alcohol_df.index)
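To try these calls without the CSV at hand, here is a minimal sketch that builds a tiny stand-in DataFrame. The values are invented for illustration, and only the column names "year", "region", "wine", and "beer" are taken from the examples in this post.

```python
import pandas as pd

# Made-up values for illustration only; this is not the real dataset.
alcohol_df = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "region": ["Amur Oblast", "Chuvash Republic",
               "Amur Oblast", "Chuvash Republic"],
    "wine": [2.0, 1.5, 2.5, 1.0],
    "beer": [20.0, 15.0, 25.0, 18.0],
})

print(alcohol_df.shape)          # (rows, columns) as a tuple
print(list(alcohol_df.columns))  # column labels as a plain list
print(alcohol_df.index)          # default row labels
alcohol_df.info()                # prints its own report and returns None
```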
1. Dataframes
Change the order of the rows by sorting them on a particular column. By default, this sorts in ascending order, from smallest to largest.
sorted_drinks = alcohol_df.sort_values("region")
print(sorted_drinks)
To sort in descending order
sorted_drinks = alcohol_df.sort_values("region", ascending=False)
print(sorted_drinks)
Sort by multiple variables by passing a list of columns to the method, with a matching list for ascending
sorted_drinks = alcohol_df.sort_values(["region", "year"], ascending=[True, False])
print(sorted_drinks)
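As a small worked example of sorting on two columns, the sketch below uses invented rows. It sorts regions alphabetically and, within each region, puts the latest year first. The original row labels are kept, which makes the reordering easy to see.

```python
import pandas as pd

# Made-up rows for illustration only.
alcohol_df = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "region": ["Amur Oblast", "Chuvash Republic",
               "Amur Oblast", "Chuvash Republic"],
})

# region ascending, then year descending within each region
sorted_drinks = alcohol_df.sort_values(["region", "year"],
                                       ascending=[True, False])
print(sorted_drinks)
```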
Subset columns to only see the data in a particular column
print(alcohol_df["region"])
To subset multiple columns to only see the data in the particular columns
print(alcohol_df[["region", "year"]])
or
column_info = ["region", "year"]
print(alcohol_df[column_info])
Subset rows by comparing a column against a value. The comparison on its own returns a boolean Series of True and False values
print(alcohol_df["year"] > 1999)
To get the rows of the DataFrame where the year is later than 1999, pass that comparison inside square brackets
print(alcohol_df[alcohol_df["year"] > 1999])
Or, to get only the rows where the region is "Chuvash Republic"
print(alcohol_df[alcohol_df["region"]=="Chuvash Republic"])
Combine conditions with & to keep only the rows that satisfy both
c_republic = alcohol_df["region"] == "Chuvash Republic"
wine_percent = alcohol_df["wine"] > 2.0
print(alcohol_df[c_republic & wine_percent])
Subset using .isin()
is_chuvash_or_yaroslavl = alcohol_df["region"].isin(["Chuvash Republic","Yaroslavl Oblast"])
print(alcohol_df[is_chuvash_or_yaroslavl])
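The conditions above can also be written inline. In that case each comparison needs its own parentheses, because & and | bind more tightly than == and >. The sketch below uses invented rows and also shows | (or) and ~ (not), which the examples above do not cover.

```python
import pandas as pd

# Made-up rows for illustration only.
alcohol_df = pd.DataFrame({
    "region": ["Amur Oblast", "Chuvash Republic",
               "Amur Oblast", "Chuvash Republic"],
    "wine": [2.0, 1.5, 2.5, 3.0],
})

# Both conditions must hold
both = alcohol_df[(alcohol_df["region"] == "Chuvash Republic")
                  & (alcohol_df["wine"] > 2.0)]

# At least one condition must hold
either = alcohol_df[(alcohol_df["region"] == "Chuvash Republic")
                    | (alcohol_df["wine"] > 2.0)]

# Negate a mask with ~ to keep everything outside the list
not_amur = alcohol_df[~alcohol_df["region"].isin(["Amur Oblast"])]

print(both)
print(either)
print(not_amur)
```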
Adding a new column
alcohol_df["wine_m"] = alcohol_df["wine"] / 100
print(alcohol_df.head())
2. Aggregating Data
Calculate column mean
print(alcohol_df["wine"].mean())
The same pattern works for the other summary methods: .median(), .mode(), .min(), .max(), .var(), .std(), and .sum()
To get the earliest year
print(alcohol_df["year"].min())
To get the latest year
print(alcohol_df["year"].max())
Use the .agg() method to compute custom summary statistics. To get the 30th percentile of the wine column
def pct30(column):
    return column.quantile(0.3)
print(alcohol_df["wine"].agg(pct30))
To get the 30th percentile of multiple columns
def pct30(column):
    return column.quantile(0.3)
print(alcohol_df[["wine", "beer", "vodka"]].agg(pct30))
To compute several custom statistics at once, pass a list of functions
def pct40(column):
    return column.quantile(0.4)
print(alcohol_df["wine"].agg([pct30, pct40]))
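To make the output of .agg() with a list of functions concrete, here is a sketch on invented numbers. The result has one row per function (named after the function) and one column per input column. The percentile values follow pandas' default linear interpolation.

```python
import pandas as pd

# Made-up numbers for illustration only.
alcohol_df = pd.DataFrame({
    "wine": [1.0, 2.0, 3.0, 4.0, 5.0],
    "beer": [10.0, 20.0, 30.0, 40.0, 50.0],
})

def pct30(column):
    return column.quantile(0.3)

def pct40(column):
    return column.quantile(0.4)

# Rows are labelled "pct30" and "pct40"; columns are "wine" and "beer"
summary = alcohol_df[["wine", "beer"]].agg([pct30, pct40])
print(summary)
```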
Cumulative statistics
The .cumsum() method calculates the cumulative sum of a column and returns a whole column rather than a single number
print(alcohol_df["wine"].cumsum())
The same pattern works for .cummax(), .cummin(), and .cumprod()
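A short sketch of cumulative statistics on invented numbers. Each result has the same length as the input column, so it can be stored as a new column. The names "wine_cumsum" and "wine_cummax" are made up for this example.

```python
import pandas as pd

# Made-up numbers for illustration only.
alcohol_df = pd.DataFrame({"wine": [2.0, 1.5, 2.5, 1.0]})

# Running total and running maximum, one value per row
alcohol_df["wine_cumsum"] = alcohol_df["wine"].cumsum()
alcohol_df["wine_cummax"] = alcohol_df["wine"].cummax()
print(alcohol_df)
```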
Remove duplicate values with .drop_duplicates(), then count what remains with .value_counts()
print(alcohol_df.drop_duplicates(subset="region"))
unique = alcohol_df.drop_duplicates(subset=["year","region"])
print(unique)
print(unique["region"].value_counts())
print(unique["region"].value_counts(sort=True))
To turn the counts into proportions of the total
print(unique["region"].value_counts(normalize=True))
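Here is a sketch of the same steps on invented rows, where one year and region pair appears twice. Dropping duplicates keeps the first occurrence of each pair, and the counts are then taken on the deduplicated data.

```python
import pandas as pd

# Made-up rows for illustration only; row 1 repeats the pair in row 0.
alcohol_df = pd.DataFrame({
    "year": [1998, 1998, 1998, 1999],
    "region": ["Amur Oblast", "Amur Oblast",
               "Chuvash Republic", "Amur Oblast"],
})

unique = alcohol_df.drop_duplicates(subset=["year", "region"])
counts = unique["region"].value_counts()
proportions = unique["region"].value_counts(normalize=True)

print(unique)
print(counts)
print(proportions)
```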
Grouped summary statistics
print(alcohol_df.groupby(["year", "region"])["wine"].mean())
print(alcohol_df.groupby("year")["wine"].mean())
print(alcohol_df.groupby("year")["wine"].agg(["min", "max", "sum"]))
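A sketch of a grouped aggregation on invented numbers. It passes the statistic names as strings, which recent pandas versions recommend over passing the Python built-ins min, max, and sum. The result has one row per year and one column per statistic.

```python
import pandas as pd

# Made-up numbers for illustration only.
alcohol_df = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "wine": [2.0, 1.5, 2.5, 1.0],
})

by_year = alcohol_df.groupby("year")["wine"].agg(["min", "max", "sum"])
print(by_year)
```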
Pivot Tables
A pivot table groups elements together: the values argument is the column we want to summarize, and the index argument is the column we want to group by. By default, the pivot table takes the mean value of each group
print(alcohol_df.pivot_table(values="wine", index="year"))
To use a different summary statistic, pass it to the aggfunc argument
import numpy as np
year_percent = alcohol_df.pivot_table(values="wine", index="year", aggfunc=np.median)
print(year_percent)
To compute several statistics at once, pass a list of functions
year_percent = alcohol_df.pivot_table(values="wine", index="year", aggfunc=[np.median, np.mean])
print(year_percent)
To group by two variables, pass the second one to the columns argument
year_percent = alcohol_df.pivot_table(values="wine", index="year", columns="week")
print(year_percent)
To fill empty values with 0 and add summary rows and columns (margins)
year_percent = alcohol_df.pivot_table(values="wine", index="year", columns="week", fill_value=0, margins=True)
print(year_percent)
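To show what fill_value and margins do, here is a sketch on invented numbers. It uses "region" as the columns argument, which is an assumption made for this example. The missing combination is filled with 0, and the "All" row and column hold the mean over the underlying rows.

```python
import pandas as pd

# Made-up numbers for illustration only; there is no 1999 row
# for "Chuvash Republic", so that cell would otherwise be empty.
alcohol_df = pd.DataFrame({
    "year": [1998, 1998, 1999],
    "region": ["Amur Oblast", "Chuvash Republic", "Amur Oblast"],
    "wine": [2.0, 1.0, 3.0],
})

table = alcohol_df.pivot_table(values="wine", index="year",
                               columns="region",
                               fill_value=0, margins=True)
print(table)
```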
3. Slicing and Indexing Data
Setting a column as the index
alcohol_ind = alcohol_df.set_index("year")
print(alcohol_ind)
Removing an index
print(alcohol_ind.reset_index())
We can drop the index, therefore fully removing it from our dataframe
print(alcohol_ind.reset_index(drop=True))
Subset on index values using loc
print(alcohol_ind.loc[[2000,2003]])
Multi-level indexes
alcohol_ind3 = alcohol_df.set_index(["region", "year"]).sort_index()
print(alcohol_ind3.loc[["Amur Oblast"]])
alcohol_ind4 = alcohol_df.set_index(["region", "year"])
print(alcohol_ind4.loc[["Amur Oblast", "Sakha (Yakutia) Republic"]])
Subsetting with a list of tuples
alcohol_ind4 = alcohol_df.set_index(["region", "year"])
print(alcohol_ind4.loc[[("Amur Oblast", 2000), ("Sakha (Yakutia) Republic", 2000)]])
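The two kinds of lookup on a multi-level index can be compared side by side in a small sketch with invented numbers. A list of outer labels returns every row for those regions, while a list of tuples returns only the exact region and year pairs.

```python
import pandas as pd

# Made-up numbers for illustration only.
alcohol_df = pd.DataFrame({
    "region": ["Amur Oblast", "Amur Oblast",
               "Sakha (Yakutia) Republic", "Sakha (Yakutia) Republic"],
    "year": [2000, 2001, 2000, 2001],
    "wine": [2.0, 2.5, 3.0, 3.5],
})
alcohol_ind = alcohol_df.set_index(["region", "year"]).sort_index()

# Outer level only: all years for one region
outer = alcohol_ind.loc[["Amur Oblast"]]

# Outer and inner level: exact (region, year) pairs
pairs = alcohol_ind.loc[[("Amur Oblast", 2000),
                         ("Sakha (Yakutia) Republic", 2000)]]
print(outer)
print(pairs)
```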
Controlling the sort order of the index
alcohol_ind4 = alcohol_ind4.sort_index(level=["region", "year"], ascending=[True, False])
print(alcohol_ind4)
Slicing and subsetting with .loc and .iloc
Slicing with plain labels works only on the outer index level; to slice on inner levels, we need to pass tuples
To slice the outer index level, we first sort the index
alcohol_srt = alcohol_df.set_index(["year"]).sort_index()
print(alcohol_srt.loc[2000:2005])
Slicing columns
alcohol_srt = alcohol_df.set_index(["year"]).sort_index()
print(alcohol_srt.loc[2000:2005])
print(alcohol_srt.loc[:, "region":"wine"])
Slicing on rows and columns at the same time
# Columns "wine" to "beer" for the year 2000 for "Amur Oblast" and "Sakha (Yakutia) Republic"
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[[("Amur Oblast", 2000), ("Sakha (Yakutia) Republic", 2000)]]
print(years_ind.loc[:, "wine":"beer"])
# Columns "wine" to "beer" for every row from ("Amur Oblast", 2000) through ("Sakha (Yakutia) Republic", 2000)
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[("Amur Oblast", 2000):("Sakha (Yakutia) Republic", 2000), "wine":"beer"]
print(years_ind)
# Wine and beer consumption for the years 2000-2010 for the region "Amur Oblast" only
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()
years_ind = alcohol_srt.loc[("Amur Oblast", 2000):("Amur Oblast", 2010), "wine":"beer"]
print(years_ind)
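A sketch of tuple slicing on a sorted multi-level index, using invented numbers. Note that .loc slices include both endpoints, and that a slice between two tuples returns every row that falls between them in the sorted index, not just the two endpoints.

```python
import pandas as pd

# Made-up numbers for illustration only.
alcohol_df = pd.DataFrame({
    "region": ["Amur Oblast", "Amur Oblast", "Amur Oblast",
               "Sakha (Yakutia) Republic"],
    "year": [2000, 2001, 2002, 2000],
    "wine": [2.0, 2.5, 3.0, 3.5],
    "beer": [20.0, 25.0, 30.0, 35.0],
})
# The index must be sorted before slicing with tuples
alcohol_srt = alcohol_df.set_index(["region", "year"]).sort_index()

# Within one region: 2000 through 2001, both ends included
one_region = alcohol_srt.loc[("Amur Oblast", 2000):("Amur Oblast", 2001),
                             "wine":"beer"]

# Across regions: every row between the two tuples
across = alcohol_srt.loc[("Amur Oblast", 2001):
                         ("Sakha (Yakutia) Republic", 2000),
                         "wine":"beer"]
print(one_region)
print(across)
```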
Slicing by row and column number using the .iloc method
years_ind = alcohol_df.iloc[2:5, 1:4]
print(years_ind)
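One difference worth seeing in a small example: .iloc slices by position and leaves out the end position, unlike .loc, which includes the end label. The sketch below uses invented numbers.

```python
import pandas as pd

# Made-up numbers for illustration only.
alcohol_df = pd.DataFrame({
    "year": [1998, 1999, 2000, 2001],
    "region": ["Amur Oblast"] * 4,
    "wine": [2.0, 2.5, 3.0, 3.5],
    "beer": [20.0, 25.0, 30.0, 35.0],
})

# Rows at positions 1 and 2, columns at positions 1 and 2
subset = alcohol_df.iloc[1:3, 1:3]
print(subset)
```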