Foundation Data Sciences
Week 02: Introduction to Jupyter Notebooks and Pandas
Learning outcomes:
In this lab you will learn the very basics of the python library pandas, which is used for data management. By the end of the lab you should be able to:
- use jupyter notebook,
- load different data file types,
- display data,
- filter your data for specific values, and
- apply basic statistical computations on the data.
Prerequisites
- Basic knowledge of
python
is assumed for this course. If you haven’t used Python before or need a refresher, we can recommend the following python tutorial as a starting point. - Basic knowledge of
numpy
is assumed for this course. If you haven’t used numpy before or need a refresher, we can recommend the following numpy tutorial.
We will try to cover a different research question every week. This week we will take the position of a historian and try to answer the following question.
Research question: Which passenger group had the worst survival rate on the Titanic?
Data information: We will use a well-known dataset in the machine learning community, often referred to as the titanic dataset. It contains a list of passengers, information about whether they survived or not, and some extra information, such as age, fare, gender and the class they travelled in. On the website, you will find a testing and training dataset. What training data and testing data means will be covered later in this course. In this lab we will not be doing machine learning; therefore the data is combined into a single dataset.
1.A IPython / Jupyter environment
Jupyter Notebook is a web-based interactive computational environment, which enables code to be shared and documented. It supports Julia, Python and R (Ju-pyte-r). A notebook is a collection of code and Markdown (text) cells. We will only give a high-level introduction to Jupyter Notebooks, which will be enough to solve the labs for this course. If you are interested in creating your own notebooks, or you just generally want to get a better understanding, we recommend the following tutorial.
Each code cell can be run separately, and the output is given below the cell. A number appears at the side of the code cell to indicate the order in which the cells were run.
Remarks
- Code in one cell can run, even if there are errors in other cells.
- The order in which these cells are run is important, e.g. if you are calling a function in cell A, which is defined in cell B, cell B needs to be executed before cell A.
- This means that the state of the variables in the notebook can be difficult to see.
All objects created by running cells are stored in the kernel running in the background. You can restart the kernel by using the Kernel menu at the top of the notebook. Because of the issues with the state of the system noted in the remarks, we recommend that when developing notebooks, you periodically select Kernel->Restart & Run All to check that your code really is reproducible. If you run into problems, it can also be a good idea to select Kernel->Restart & Run All; the problem might be due to the state of the variables.
1.A.1 Basic operation and shortcuts
There are two modes of selection when inside a Jupyter Notebook:
- Command Mode - When you hit up/down arrows you select different cells. Hit
<enter>
to enter edit mode. - Edit Mode - You can edit the cell. Hit
<esc>
to enter Command Mode again.
In Command Mode (cell highlighted blue):
<h> - bring up help window (contains full list of shortcuts!)
<enter> - Enter Edit Mode
<a> - create new cell above selected
<b> - create cell below selected
<d> <d> - delete selected cell (pressing 'd' twice)
In Edit Mode (cell highlighted green):
<esc> - Enter Command Mode
<shift> + <enter> - Run cell and move to cell below in Command Mode
<ctrl> + <enter> - Run cell in place
Try running the following code cell:
a = 1
b = 2
a + b
You’ll notice that the notebook will try to display the last thing in the cell, even if you don’t use a print
statement. However, if you want to print multiple things from one cell, you need to use multiple print
statements (or multiple cells).
first_name = ‘Jane’
last_name = ‘Doe’
print(first_name)
print(last_name)
Good Practice
- It is good practice to separate code into different cells. One cell should correspond to one task, similarly to functions. For example, use a cell for the
import
statements, one for loading data, one for preprocessing data, and one for each different operation you carry out on the data. - It’s generally good practice to import all your packages at the top of a file. We will do so in future tutorials.
Before we start, we need to import the packages that we will be using later:
import os
import pandas as pd
import numpy as np
os
stands for the standard Python operating system module, which we will use to access files. pd
is an alias for the pandas
module, to save typing later. Likewise np
is an alias for numpy
module. pd
and np
are the standard aliases for pandas
and numpy
. Here is a more in-depth tutorial on installing, importing and using modules.
1.B Pandas
Pandas is a library for data manipulation and analysis. There are two fundamental data structures in pandas: the Series and DataFrame structures which are built on top of NumPy arrays. (Again, if you need a refresher, you can check out this numpy tutorial.)
Pandas is well documented and you will find good information about all methods and structures in the API reference.
1.B.1 Series
A Series is a one-dimensional object (similar to a list). Each element has a corresponding index. By default the indices range from 0
to N-1
, where N
is the length of the Series.
passenger = pd.Series([‘Mr. Owen Harris Braund’, 22.0, False])
passenger
If we want to specify meaningful labels for the index, we can do so with the index
parameter.
passenger = pd.Series([‘Mr. Owen Harris Braund’, 22.0, False], index=[‘Name’, ‘Age’, ‘Survived’])
passenger
You can access a Series entry the same way as you access list entries, either using the assigned index labels, such as 'Name'
, or by using the numeric index, i.e. 0:(N-1)
, where N
is the length of the Series.
print(passenger[1]) # Careful: indexing starts at 0.
print(passenger[‘Age’]) # Remember to use quotes
1.B.2 DataFrame
A DataFrame is a tabular data structure comprised of rows and columns. You can also think of the DataFrame as a collection of Series objects that share an index.
Creating DataFrame structures, adding rows, deleting rows and modifying entries
We can create an empty DataFrame by specifying the column names. Then we can insert data row by row.
passengers = pd.DataFrame(columns=[‘Gender’, ‘Age’, ‘Survived’])
passengers # Careful, the dataframe is called passengers, use meaningful variable names when coding
Now, let’s start filling the dataframe. To specify the row of a data frame, we use the .loc
attribute.
passengers.loc[0] = [‘Male’, 22.0, False] # Note how we used df.loc() to specify the index
passengers
Remember, we said that a DataFrame is a collection of Series. Let’s double check that.
type(passengers.loc[0])
Pandas DataFrames are quite flexible. Just as with Series, we can also use strings as index labels.
passengers.loc[‘Mrs. John Bradley Cumings’] = [‘Female’, 38.0, ‘Yes’]
passengers
Remark It is generally bad practice to mix different kinds of indices. So let’s remove the first entry.
cleaned_passengers = passengers.drop(0)
print(cleaned_passengers)
print(’\n’) # Empty line between DataFrames
print(passengers)
Remark: By default, df.drop(index)
creates a copy of the DataFrame without modifying the original DataFrame, which is why if you want to drop a row without creating a new DataFrame, you need to write df = df.drop(0)
, or set the optional inplace
argument to True
. You can see the difference between cleaned_passengers
and passengers
above.
passengers.drop(0, inplace=True) # Remove the 0th passenger without creating a copy
passengers
You can also populate a DataFrame using a dictionary which allows you to do things in a nonstandard order. Let’s get our first entry back.
passengers.loc[‘Mr. Owen Harris Braund’] = dict(Survived=False, Age=22.0, Gender=‘Male’) # Remark that the attributes are assigned in a different order
passengers
We just made a mess. In the first row, we used a string to denote whether the passenger survived, and in the second a Boolean value. Let’s clean this up.
passengers.loc[‘Mrs. John Bradley Cumings’, ‘Age’] = 39.0
passengers
Creating DataFrame from other structures
You can also create a DataFrame from:
- A dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
Let’s recreate the same DataFrame:
Create a DataFrame from a list
passengers_list = [[‘Male’, 22.0, False], [‘Female’, 38.0, True]]
passengers = pd.DataFrame(passengers_list, index=[‘Mr. Owen Harris Braund’, ‘Mrs. John Bradley Cumings’],
columns=[‘Gender’, ‘Age’, ‘Survived’])
passengers
Create a DataFrame from a dictionary where keys are column values
column_key_dict = {
‘Gender’: [‘Male’, ‘Female’],
‘Age’: [22.0, 38.0],
‘Survived’: [False, True]
}
passengers = pd.DataFrame.from_dict(column_key_dict, orient=‘columns’)
passengers.index = [‘Mr. Owen Harris Braund’, ‘Mrs. John Bradley Cumings’]
passengers
Create a DataFrame from a dictionary where keys are index values
index_key_dict = {‘Mr. Owen Harris Braund’:[‘Male’, 22.0, False],
‘Mrs. John Bradley Cumings’:[‘Female’, 38.0, True]}
passengers = pd.DataFrame.from_dict(index_key_dict, orient=‘index’)
passengers.columns = [‘Gender’, ‘Age’, ‘Survived’]
passengers
Using the DataFrame call, keys are assumed to be column headers
passengers = pd.DataFrame({‘Mr. Owen Harris Braund’:[‘Male’, 22.0, False],
‘Mrs. John Bradley Cumings’:[‘Female’, 38.0, True]},
index=[‘Gender’, ‘Age’, ‘Survived’])
passengers
However, now the rows have become columns and vice versa. We could rewrite the code above, assigning the passenger names to the index
argument, and using the Gender, Age and Survived attributes as the dict keys. However, there is a more elegant solution: we can use the transpose method df.T
passengers = passengers.T
passengers
Remark Again, the transpose method creates a copy. Thus, if you want to actually apply the changes to the DataFrame, you need to save it to the variable, as shown above.
Let’s combine a few things we have learned so far.
Exercise 01 a:
- Delete the markdown cell below, which says ‘We don’t need this cell.’.
- Insert a new code cell below this cell.
- Create a ‘passengers’ DataFrame with one of the options we have presented above, which contains the following entries:
- Miss. Laina Heikkinen, Female, aged 26, survived;
- Mrs. Jacques Heath Futrelle, Female, 35 years old, survived;
- Mr. William Henry Allen, male, 35.0, did not survive;
- Make sure that you use a consistent notation in your DataFrame.
We don’t need this cell. Delete it.
Exercise 01 b:
- Append the two passengers mentioned previously (Mr. Own Harris Braund and Mrs. John Bradley Cumings) to the DataFrame.
passengers_list = [[‘Female’, 26.0,True],[‘Female’,35.0,True],[‘Male’,35.0,False]]
passengers = pd.DataFrame(passengers_list,index = [‘Miss. Laina Heikkinen’,‘Mrs. Jacques Heath Futrelle’,‘Mr. William Henry Allen’],columns=[‘Gender’,‘Age’,‘Survived’])
passengers
1.B.3 Dataset operations in Pandas
Most commonly we create DataFrame structures by reading csv files. We store the datasets you need in the first lab in datasets
. The next labs will have the datasets stored accordingly.
passengers_filepath = os.path.join(os.getcwd(), ‘datasets’, ‘titanic.csv’)
passengers_filepath
passengers = pd.read_csv(passengers_filepath)
passengers.head() # Head shows the first five elements (unless specified otherwise) of the DataFrame
These are the five passengers you should have added to your dataframe in Exercise 01. Now, we have some extra information. Using df.head()
we can get an impression of how the data looks like in our dataset. Note that the information in the Survived column is stored as 0/1 instead of False/True
. Pandas doesn’t care how Boolean values are stored. If a column only contains 0s and 1s, Pandas treats the column as Boolean, and works with it in the same way as a column of False
and True
.
However, df.head()
only shows the first five entries. How many entries are there in total? You can use the python native len()
function.
len(passengers)
Tab completion
Tab completion is a powerful method for viewing object attributes and available methods.
We have just seen the df.head()
method above. Let’s see what other functions we can call on a DataFrame. You can see what methods are available by typing the DataFrame name followed by .
and then hitting the <tab>
key. Then you can access any method’s help documentation by hitting the method’s name followed by ?
; this opens a ‘pager’ at the bottom of the screen, you can hit <esc>
to exit it.
For example, to find the last few entries of the DataFrame, first type passengers.t
, then hit <tab>
and then choose df.tail()
.
passengers.tail
Let’s get the documentation
passengers.tail?
If you want to get more than five entries you can do so by specifying N
entries with df.head(N)
or df.tail(N)
. Give it a try above.
Row selection
As already mentioned, you can think of a DataFrame as a group of Series that share an index (either the column headers or the row id). This makes it easy to select specific observations (i.e. rows).
type(passengers.loc[0])
We have already talked about df.loc[label]
, which selects a row based on the label of the index, e.g. ‘Mr. Owen Harris Braund’. If we want the N-th row, we can use df.iloc[N]
. In the loaded Titanic dataset, it so happens that the indices also run from 0 to 886, and thus df.loc[N]
and df.iloc[N]
return the same observation.
Technically, there are three options to select rows:
df.loc[label]
: works on labels in the indexdf.iloc[N]
: works on the position in the index (so it only takes integers)df[N]
: works the same way asdf.iloc[N]
It is often safest to use the first two methods (rather than just using square brackets) to index into pandas DataFrames.
passengers.iloc[0]
We can also select several rows; the resulting structure is a DataFrame. This operation is called slicing.
Remark: Python slicing might seem a bit confusing. When you specify a range to slice with i:j
the slice returned runs from the i
-th entry to the j-1
-th entry. This can be very helpful at times, when you want the last few entries, and you don’t know how long the object is (e.g. a DataFrame or list). In that case, you can slice with k:len(object)
.
type(passengers.iloc[0:3])
passengers.iloc[0:3]
This is equivalent to using .iloc
passengers[0:3]
Column Selection
As already mentioned, you can think of a DataFrame as a group of Series that share an index (either the column headers or the row id). This makes it as easy to select specific columns as it is to select rows.
type(passengers[‘Name’])
passengers[‘Name’].head()
To select multiple columns we simply need to pass a list of column names. We said above that we were interested in who survived, so let’s check that:
#Remark the double brackets, because we passed a list of names
passengers[[‘Name’, ‘Survived’]].head(7)
Exercise 02: What do you expect what the type of passengers[['Name', 'Survived']].head(7)
is? Check it!
type(passengers[[‘Name’,‘Survived’]].head(7))
You can combine row and column selection, as we already did above, when selecting a specific entry.
passengers.iloc[0][‘Survived’]
Exercise 03: Look at the dataframe you obtained in Exercise 01. Reconstruct the same dataframe using the newly loaded dataframe, i.e. select the correct rows and columns, such that the output is the same as in Exercise 01.
passengers[[‘Name’,‘Age’,‘Sex’,‘Survived’]].head(5)
Filtering
Now suppose that you want to select all the observations of minors (i.e. people under the age of 18 in the UK). It is easy to do that:
passengers[passengers[‘Age’] <= 17]
Or equivalently:
passengers[passengers.Age <= 17]
This concept is called filtering, and passengers.Age <= 17
is called a mask, which hides/masks all entries that don’t fit the criteria; i.e. it only returns the observations for which the mask returns True
. You can also filter the data by using multiple attributes:
young_passengers = passengers[(passengers.Age <= 17) & (passengers.Survived)] #Remark that we were able to drop ‘Survived == 1’
young_passengers
Remark The first row has index 9. As mentioned earlier: the label of the index does not have to coincide with the position of the data entry. Here, df.iloc[9]
and df.loc[9]
return different values!.
Exercise 04: What will be the return values of young_passengers.iloc[8:10]
and young_passengers.loc[8:10]
. Check the answer.
young_passengers.iloc[8:10]
young_passengers.loc[8:10]
Note that we can also index into columns using loc
. We just have to specify the second dimension (much as we would do with numpy arrays). Let’s give it a try. We want to get the list of minors (children) aboard the Titanic:
young_passengers.loc[:, ‘Name’]
young_passengers.iloc[:, 2] # And now using column indexing Remark this would not work with .loc[:, 3]
If we try the following we will get an empty DataFrame because there are no rows with labels 0 and 1.
young_passengers.loc[0:2]
The result is still a DataFrame
type(young_passengers.loc[0:2])
For more, check out Advanced Indexing
Basic operations on datasets
We now know how to
- load a dataset from a csv file
pd.read_csv()
, - get the first few entries
df.head()
, - select certain rows or columns based on their position
df.iloc[]
or index labeldf.loc[]
, - and finally how to filter the dataset for entries that satisfy certain conditions
df[condition]
.
However, at the beginning of the lab, we wanted to figure out the passenger group worst hit. Using filtering, we can already split the dataset into different subsets. Now let’s apply some operations.
First let’s get the age spread, and figure out the oldest and youngest passenger. The passengers in the dataset are not sorted by age. Looking at each entry individually is not an option. Thankfully pandas dataframes have a method: df.sort_values(by='column', ascending={True,False})
.
passengers.sort_values(by=‘Age’).head()
passengers.sort_values(by=‘Age’, ascending=False).head() # Returns the data in descending order of the sorted value
The two results above already show that young passengers had a better survival rate than senior passengers. Let’s try, tentatively, to answer the research question.
Discussion: We will consider the different travel classes, genders, and age groups. Discuss with your lab partner who you would expect had the best and who had the worst survival rate.
Exercise 05:
- Compute the survival rate of: minors (0-17 years old), adults (18-65), and seniors (66+). (Hint: All you need is
len()
.) - Compute the survival rate of: women and men.
- Compute the survival rate of the travel classes (1, 2, and 3).
- Does the gender have an influence on the survival rates of minors? (What is the survival rate of girls vs boys?)
- Compute the survival rate of all combinations of class, age group, and gender, and print them out as “Age group, gender, class: percentage”, e.g. “Minors, Male, First Class: 0.3”. (Hint: These are quite a few computations, think about how to automate it, e.g.
conditions = condition1 & condition2
might help.).
survival_minors = passengers[(passengers.Age<=17.0) &(passengers.Survived) ]
all_minors = passengers[(passengers.Age<=17.0)& (passengers.Age>0)]
minors_rate = (len(survival_minors))/(len(all_minors))
minors_rate
survival_adults= passengers[(passengers.Age>=17.0)&(passengers.Age<=65)&(passengers.Survived)]
all_adults= passengers[(passengers.Age>=17.0)&(passengers.Age<=65)]
adults_rate = (len(survival_adults))/(len(all_adults))
adults_rate
survival_seniors = passengers[(66<passengers.Age) &(passengers.Survived) ]
all_seniors = passengers[(66<passengers.Age)& (passengers.Age>0)]
seniors_rate = (len(survival_seniors))/(len(all_seniors))
seniors_rate
survival_man = passengers[(passengers.Sex == ‘male’) &(passengers.Survived) ]
all_man = passengers[(passengers.Sex == ‘male’)]
man_rate = (len(survival_man))/(len(all_man ))
man_rate
survival_woman = passengers[(passengers.Sex == ‘female’) &(passengers.Survived) ]
all_woman = passengers[(passengers.Sex == ‘female’)]
woman_rate = (len(survival_woman))/(len(all_woman ))
woman_rate
survival_one = passengers[(passengers.Pclass == 1) &(passengers.Survived) ]
all_one = passengers[(passengers.Pclass == 1)]
one_rate = (len(survival_one))/(len(all_one ))
one_rate
survival_two = passengers[(passengers.Pclass == 2) &(passengers.Survived) ]
all_two = passengers[(passengers.Pclass == 2)]
two_rate = (len(survival_two))/(len(all_two ))
two_rate
survival_three = passengers[(passengers.Pclass == 3) &(passengers.Survived) ]
all_three = passengers[(passengers.Pclass == 3)]
two_three = (len(survival_three))/(len(all_three ))
two_three
#Exercise a
Exercise b
#Exercise c
#Exercise d
We need your help: This is a new course. In order for us to improve the labs for the next iterations, and to make sure that the next labs are better, we need your feedback. Please fill out the following form.