Kodeclik Blog

Python Pandas crosstab()

Recall that a Pandas dataframe is just like a table or spreadsheet, composed of columns and rows which contain data about some subject matter. Each column denotes typically one attribute (or variable) and each row contains one instance or observation of all variables.

Sometimes we are given “non-dataframe” data, i.e., data that is not in dataframe format and we wish to capture summary statistics from this data in the form of a dataframe. This is what the Pandas crosstab() function is for!

Consider for instance:

import pandas as pd

products = ["Apples","Bananas","Shampoo","Chips","Salsa"]
months = ["Jan", "Feb", "Mar", "Mar", "Apr"]
quantities = [5,10,2,3,2]

Here we are given three lists: products, months, and quantities. Different products are sold in different months in different quantities and this information is recorded in the above lists. Thus, we can see that 5 apples were sold in January, 2 (units of) Shampoo were sold in March, and so on. Note that there are two items sold in March: Shampoo and Chips. We wish to create summaries of sales across months, across products, and across both months and products. That is what pandas crosstab() is for!

Here is the most basic use of crosstab():

import pandas as pd

products = ["Apples","Bananas","Shampoo","Chips","Salsa"]
months = ["Jan", "Feb", "Mar", "Mar", "Apr"]
quantities = [5,10,2,3,2]

print(pd.crosstab(products, months))

The output is:

col_0    Apr  Feb  Jan  Mar
row_0                      
Apples     0    0    1    0
Bananas    0    1    0    0
Chips      0    0    0    1
Salsa      1    0    0    0
Shampoo    0    0    0    1

Note that the crosstab() function above is given only two inputs, namely the products and months (it doesn’t take quantities as input). The output is simply counting the frequencies of each (product, month) pair and outputting these frequencies. As you can see every row has only one “1”. Every column has only one “1” except for March which has two products sold in that month. Also note that the products and months are sorted.

If you flip the order in which you gave these inputs:

import pandas as pd

products = ["Apples","Bananas","Shampoo","Chips","Salsa"]
months = ["Jan", "Feb", "Mar", "Mar", "Apr"]
quantities = [5,10,2,3,2]

print(pd.crosstab(months, products))

The output is:

col_0  Apples  Bananas  Chips  Salsa  Shampoo
row_0                                        
Apr         0        0      0      1        0
Feb         0        1      0      0        0
Jan         1        0      0      0        0
Mar         0        0      1      0        1

Thus the rows have become columns and the columns have become rows.

In the above two examples, notice that pandas does not have a good idea of what to name the columns and rows so it has simply called them “col_0” and “row_0”. We can specify names in the following manner:

import pandas as pd

products = ["Apples","Bananas","Shampoo","Chips","Salsa"]
months = ["Jan", "Feb", "Mar", "Mar", "Apr"]
quantities = [5,10,2,3,2]

print(pd.crosstab(products, months, 
      rownames=['Groceries'],colnames=['Months']))

The output is:

Months     Apr  Feb  Jan  Mar
Groceries                    
Apples       0    0    1    0
Bananas      0    1    0    0
Chips        0    0    0    1
Salsa        1    0    0    0
Shampoo      0    0    0    1

Finally, we can specify that the table must contain not just frequencies but also quantities but you must also specify how these quantities should be aggregated (by default we usually intend for these quantities to be summed up, so the aggregation function is just “sum”):

import pandas as pd

products = ["Apples","Bananas","Shampoo","Chips","Salsa"]
months = ["Jan", "Feb", "Mar", "Mar", "Apr"]
quantities = [5,10,2,3,2]

print(pd.crosstab(products, months, quantities, 
      rownames=['Groceries'],colnames=['Months'], 
      aggfunc=sum))

The output is:

Months     Apr   Feb  Jan  Mar
Groceries                     
Apples     NaN   NaN  5.0  NaN
Bananas    NaN  10.0  NaN  NaN
Chips      NaN   NaN  NaN  3.0
Salsa      2.0   NaN  NaN  NaN
Shampoo    NaN   NaN  NaN  2.0

Note that most cells have “NaN” (not a number) because we had no data for those (product, month) combinations. For others we have added up the values to product the final table.

A last feature of pandas crosstab() is that we can extract row marginals, column marginals, and marginals across the whole table. We simply add “margins=True” to our crosstab invocation:

import pandas as pd

products = ["Apples","Bananas","Shampoo","Chips","Salsa"]
months = ["Jan", "Feb", "Mar", "Mar", "Apr"]
quantities = [5,10,2,3,2]

print(pd.crosstab(products, months, quantities, 
      rownames=['Groceries'],colnames=['Months'], 
      aggfunc=sum,margins=True))

The output is:

Months     Apr   Feb  Jan  Mar  All
Groceries                          
Apples     NaN   NaN  5.0  NaN    5
Bananas    NaN  10.0  NaN  NaN   10
Chips      NaN   NaN  NaN  3.0    3
Salsa      2.0   NaN  NaN  NaN    2
Shampoo    NaN   NaN  NaN  2.0    2
All        2.0  10.0  5.0  5.0   22

Note that there is now an “All” column (denoting all months), an “All” row (denoting all products). Each value here is simply an aggregation along a particular row or column. Further, note that there is an aggregation across both rows and columns, which yields 22.

If you liked this blogpost, checkout our blogpost on Pandas daterange. Also learn how to reorder columns in a Pandas dataframe.

Interested in more things Python? Checkout our post on Python queues. Also see our blogpost on Python's enumerate() capability. Also if you like Python+math content, see our blogpost on Magic Squares. Finally, master the Python print function!

Want to learn Python with us? Sign up for 1:1 or small group classes.

Python Pandas crosstab()

Join our mailing list