#!/usr/bin/env python
# coding: utf-8

# *This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
#
# *The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
#
# *No changes were made to the contents of this notebook from the original.*

# # Aggregations: Min, Max, and Everything In Between

# Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.
# Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).
#
# NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.

# ## Summing the Values in an Array
#
# As a quick example, consider computing the sum of all values in an array.
# Python itself can do this using the built-in ``sum`` function:

# In[1]:

import numpy as np

# In[2]:

L = np.random.random(100)
sum(L)

# The syntax is quite similar to that of NumPy's ``sum`` function, and the result is the same (up to floating-point rounding) in the simplest case:

# In[3]:

np.sum(L)

# However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:

# In[4]:

big_array = np.random.rand(1000000)
get_ipython().run_line_magic('timeit', 'sum(big_array)')
get_ipython().run_line_magic('timeit', 'np.sum(big_array)')

# Be careful, though: the ``sum`` function and the ``np.sum`` function are not identical, which can sometimes lead to confusion!
# In particular, their optional arguments have different meanings (the second argument of ``sum`` is a start value, while the second argument of ``np.sum`` is an axis), and ``np.sum`` is aware of multiple array dimensions, as we will see in the following section.

# ## Minimum and Maximum
#
# Similarly, Python has built-in ``min`` and ``max`` functions, used to find the minimum value and maximum value of any given array:

# In[5]:

min(big_array), max(big_array)

# NumPy's corresponding functions have similar syntax, and again operate much more quickly:

# In[6]:

np.min(big_array), np.max(big_array)

# In[7]:

get_ipython().run_line_magic('timeit', 'min(big_array)')
get_ipython().run_line_magic('timeit', 'np.min(big_array)')

# For ``min``, ``max``, ``sum``, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:

# In[8]:

print(big_array.min(), big_array.max(), big_array.sum())

# Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!

# ### Multidimensional aggregates
#
# One common type of aggregation operation is an aggregate along a row or column.
# Say you have some data stored in a two-dimensional array:

# In[9]:

M = np.random.random((3, 4))
print(M)

# By default, each NumPy aggregation function will return the aggregate over the entire array:

# In[10]:

M.sum()

# Aggregation functions take an additional argument specifying the *axis* along which the aggregate is computed. For example, we can find the minimum value within each column by specifying ``axis=0``:

# In[11]:

M.min(axis=0)

# The function returns four values, corresponding to the four columns of numbers.
#
# Similarly, we can find the maximum value within each row:

# In[12]:

M.max(axis=1)

# The way the axis is specified here can be confusing to users coming from other languages.
# The ``axis`` keyword specifies the *dimension of the array that will be collapsed*, rather than the dimension that will be returned.
# So specifying ``axis=0`` means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.

# ### Other aggregation functions
#
# NumPy provides many other aggregation functions, but we won't discuss them in detail here.
# Additionally, most aggregates have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point ``NaN`` value (for a fuller discussion of missing data, see [Handling Missing Data](03.04-Missing-Values.ipynb)).
# Some of these ``NaN``-safe functions were not added until NumPy 1.8, so they will not be available in older NumPy versions.
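#
# As a minimal sketch of the ``NaN``-safe behavior described above, the following compares an ordinary aggregate with its ``NaN``-safe counterpart. The array ``values`` is a hypothetical example made up for illustration, not data from the book:

```python
import numpy as np

# Hypothetical example data (an assumption for illustration),
# with one missing entry marked by NaN
values = np.array([1.0, 2.0, np.nan, 4.0])

# An ordinary aggregate propagates NaN into the result
plain_sum = np.sum(values)

# The NaN-safe counterparts ignore the missing entry
safe_sum = np.nansum(values)
safe_mean = np.nanmean(values)  # mean of the three non-NaN values

print(plain_sum)  # nan
print(safe_sum)   # 7.0
print(safe_mean)
```

# Note that ``np.nanmean`` divides by the number of non-``NaN`` entries (three here), not by the full length of the array.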
#
# The following table provides a list of useful aggregation functions available in NumPy:
#
# |Function Name      | NaN-safe Version     | Description                                   |
# |-------------------|----------------------|-----------------------------------------------|
# | ``np.sum``        | ``np.nansum``        | Compute sum of elements                       |
# | ``np.prod``       | ``np.nanprod``       | Compute product of elements                   |
# | ``np.mean``       | ``np.nanmean``       | Compute mean of elements                      |
# | ``np.std``        | ``np.nanstd``        | Compute standard deviation                    |
# | ``np.var``        | ``np.nanvar``        | Compute variance                              |
# | ``np.min``        | ``np.nanmin``        | Find minimum value                            |
# | ``np.max``        | ``np.nanmax``        | Find maximum value                            |
# | ``np.argmin``     | ``np.nanargmin``     | Find index of minimum value                   |
# | ``np.argmax``     | ``np.nanargmax``     | Find index of maximum value                   |
# | ``np.median``     | ``np.nanmedian``     | Compute median of elements                    |
# | ``np.percentile`` | ``np.nanpercentile`` | Compute rank-based statistics of elements     |
# | ``np.any``        | N/A                  | Evaluate whether any elements are true        |
# | ``np.all``        | N/A                  | Evaluate whether all elements are true        |
#
# We will see these aggregates often throughout the rest of the book.

# ## Example: What is the Average Height of US Presidents?

# Aggregates available in NumPy can be extremely useful for summarizing a set of values.
# As a simple example, let's consider the heights of all US presidents.
# This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values:

# In[13]:

get_ipython().system('head -4 data/president_heights.csv')

# We'll use the Pandas package, which we'll explore more fully in [Chapter 3](03.00-Introduction-to-Pandas.ipynb), to read the file and extract this information (note that the heights are measured in centimeters).
# In[14]:

import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)

# Now that we have this data array, we can compute a variety of summary statistics:

# In[15]:

print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())

# Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values.
# We may also wish to compute quantiles:

# In[16]:

print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))

# We see that the median height of US presidents is 182 cm, or just shy of six feet.
#
# Of course, sometimes it's more useful to see a visual representation of this data, which we can accomplish using tools in Matplotlib (we'll discuss Matplotlib more fully in [Chapter 4](04.00-Introduction-To-Matplotlib.ipynb)). For example, this code generates the following chart:

# In[17]:

get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot style

# In[18]:

plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');

# These aggregates are some of the fundamental pieces of exploratory data analysis that we'll explore in more depth in later chapters of the book.