week02-Introduction

Foundation Data Sciences

Week 02: Introduction to Jupyter Notebooks and Pandas

Learning outcomes:
In this lab you will learn the very basics of the python library pandas, which is used for data management. By the end of the lab you should be able to:

  • use jupyter notebook,
  • load different data file types,
  • display data,
  • filter your data for specific values, and
  • apply basic statistical computations on the data.

Prerequisites

  • Basic knowledge of python is assumed for this course. If you haven’t used Python before or need a refresher, we can recommend the following python tutorial as a starting point.
  • Basic knowledge of numpy is assumed for this course. If you haven’t used numpy before or need a refresher, we can recommend the following numpy tutorial.

We will try to cover a different research question every week. This week we will take the position of a historian and try to answer the following question.

Research question: Which passenger group had the worst survival rate on the Titanic?

Data information: We will use a well-known dataset in the machine learning community, often referred to as the titanic dataset. It contains a list of passengers, information about whether they survived or not, and some extra information, such as age, fare, gender and the class they travelled in. On the website, you will find a testing and training dataset. What training data and testing data means will be covered later in this course. In this lab we will not be doing machine learning; therefore the data is combined into a single dataset.

1.A IPython / Jupyter environment

Jupyter Notebook is a web-based interactive computational environment, which enables code to be shared and documented. It supports Julia, Python and R (Ju-pyte-r). A notebook is a collection of code and Markdown (text) cells. We will only give a high-level introduction to Jupyter Notebooks, which will be enough to solve the labs for this course. If you are interested in creating your own notebooks, or you just generally want to get a better understanding, we recommend the following tutorial.

Each code cell can be run separately, and the output is given below the cell. A number appears at the side of the code cell to indicate the order in which the cells were run.

Remarks

  1. Code in one cell can run, even if there are errors in other cells.
  2. The order in which these cells are run is important, e.g. if you are calling a function in cell A, which is defined in cell B, cell B needs to be executed before cell A.
  3. This means that the state of the variables in the notebook can be difficult to see.

All objects created by running cells are stored in the kernel running in the background. You can restart the kernel by using the Kernel menu at the top of the notebook. Because of the issues with the state of the system noted in the remarks, we recommend that when developing notebooks, you periodically select Kernel->Restart & Run All to check that your code really is reproducible. If you run into problems, it can also be a good idea to select Kernel->Restart & Run All; the problem might be due to the state of the variables.

1.A.1 Basic operation and shortcuts

There are two modes of selection when inside a Jupyter Notebook:

  1. Command Mode - When you hit up/down arrows you select different cells. Hit <enter> to enter edit mode.
  2. Edit Mode - You can edit the cell. Hit <esc> to enter Command Mode again.

In Command Mode (cell highlighted blue):

              <h> - bring up help window (contains full list of shortcuts!)
          <enter> - Enter Edit Mode
              <a> - create new cell above selected
              <b> - create cell below selected
          <d> <d> - delete selected cell (pressing 'd' twice)

In Edit Mode (cell highlighted green):

            <esc> - Enter Command Mode
<shift> + <enter> - Run cell and move to cell below in Command Mode
 <ctrl> + <enter> - Run cell in place

Try running the following code cell:

a = 1
b = 2
a + b

You’ll notice that the notebook will try to display the last thing in the cell, even if you don’t use a print statement. However, if you want to print multiple things from one cell, you need to use multiple print statements (or multiple cells).

first_name = ‘Jane’
last_name = ‘Doe’
print(first_name)
print(last_name)

Good Practice

  • It is good practice to separate code into different cells. One cell should correspond to one task, similarly to functions. For example, use a cell for the import statements, one for loading data, one for preprocessing data, and one for each different operation you carry out on the data.
  • It’s generally good practice to import all your packages at the top of a file. We will do so in future tutorials.

Before we start, we need to import the packages that we will be using later:

import os
import pandas as pd
import numpy as np

os stands for the standard Python operating system module, which we will use to access files. pd is an alias for the pandas module, to save typing later. Likewise np is an alias for numpy module. pd and np are the standard aliases for pandas and numpy. Here is a more in-depth tutorial on installing, importing and using modules.

1.B Pandas

Pandas is a library for data manipulation and analysis. There are two fundamental data structures in pandas: the Series and DataFrame structures which are built on top of NumPy arrays. (Again, if you need a refresher, you can check out this numpy tutorial.)

Pandas is well documented and you will find good information about all methods and structures in the API reference.

1.B.1 Series

A Series is a one-dimensional object (similar to a list). Each element has a corresponding index. By default the indices range from 0 to N-1, where N is the length of the Series.

passenger = pd.Series([‘Mr. Owen Harris Braund’, 22.0, False])
passenger

If we want to specify meaningful labels for the index, we can do so with the index parameter.

passenger = pd.Series([‘Mr. Owen Harris Braund’, 22.0, False], index=[‘Name’, ‘Age’, ‘Survived’])
passenger

You can access a Series entry the same way as you access list entries, either using the assigned index labels, such as 'Name', or by using the numeric index, i.e. 0:(N-1), where N is the length of the Series.

print(passenger[1]) # Careful: indexing starts at 0.
print(passenger[‘Age’]) # Remember to use quotes

1.B.2 DataFrame

A DataFrame is a tabular data structure comprised of rows and columns. You can also think of the DataFrame as a collection of Series objects that share an index.

Creating DataFrame structures, adding rows, deleting rows and modifying entries

We can create an empty DataFrame by specifying the column names. Then we can insert data row by row.

passengers = pd.DataFrame(columns=[‘Gender’, ‘Age’, ‘Survived’])
passengers # Careful, the dataframe is called passengers, use meaningful variable names when coding

Now, let’s start filling the dataframe. To specify the row of a data frame, we use the .loc attribute.

passengers.loc[0] = [‘Male’, 22.0, False] # Note how we used df.loc() to specify the index
passengers

Remember, we said that a DataFrame is a collection of Series. Let’s double check that.

type(passengers.loc[0])

Pandas DataFrames are quite flexible. Just as with Series, we can also use strings as index labels.

passengers.loc[‘Mrs. John Bradley Cumings’] = [‘Female’, 38.0, ‘Yes’]
passengers

Remark It is generally bad practice to mix different kinds of indices. So let’s remove the first entry.

cleaned_passengers = passengers.drop(0)
print(cleaned_passengers)
print(’\n’) # Empty line between DataFrames
print(passengers)

Remark: By default, df.drop(index) creates a copy of the DataFrame without modifying the original DataFrame, which is why if you want to drop a row without creating a new DataFrame, you need to write df = df.drop(0), or set the optional inplace argument to True. You can see the difference between cleaned_passengers and passengers above.

passengers.drop(0, inplace=True) # Remove the 0th passenger without creating a copy
passengers

You can also populate a DataFrame using a dictionary which allows you to do things in a nonstandard order. Let’s get our first entry back.

passengers.loc[‘Mr. Owen Harris Braund’] = dict(Survived=False, Age=22.0, Gender=‘Male’) # Remark that the attributes are assigned in a different order
passengers

We just made a mess. In the first row, we used a string to denote whether the passenger survived, and in the second a Boolean value. Let’s clean this up.

passengers.loc[‘Mrs. John Bradley Cumings’, ‘Age’] = 39.0
passengers

Creating DataFrame from other structures

You can also create a DataFrame from:

  • A dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame

Let’s recreate the same DataFrame:

Create a DataFrame from a list

passengers_list = [[‘Male’, 22.0, False], [‘Female’, 38.0, True]]
passengers = pd.DataFrame(passengers_list, index=[‘Mr. Owen Harris Braund’, ‘Mrs. John Bradley Cumings’],
columns=[‘Gender’, ‘Age’, ‘Survived’])
passengers

Create a DataFrame from a dictionary where keys are column values

column_key_dict = {
‘Gender’: [‘Male’, ‘Female’],
‘Age’: [22.0, 38.0],
‘Survived’: [False, True]
}
passengers = pd.DataFrame.from_dict(column_key_dict, orient=‘columns’)
passengers.index = [‘Mr. Owen Harris Braund’, ‘Mrs. John Bradley Cumings’]
passengers

Create a DataFrame from a dictionary where keys are index values

index_key_dict = {‘Mr. Owen Harris Braund’:[‘Male’, 22.0, False],
‘Mrs. John Bradley Cumings’:[‘Female’, 38.0, True]}
passengers = pd.DataFrame.from_dict(index_key_dict, orient=‘index’)
passengers.columns = [‘Gender’, ‘Age’, ‘Survived’]
passengers

Using the DataFrame call, keys are assumed to be column headers

passengers = pd.DataFrame({‘Mr. Owen Harris Braund’:[‘Male’, 22.0, False],
‘Mrs. John Bradley Cumings’:[‘Female’, 38.0, True]},
index=[‘Gender’, ‘Age’, ‘Survived’])
passengers

However, now the rows have become columns and vice versa. We could rewrite the code above, assigning the passenger names to the index argument, and using the Gender, Age and Survived attributes as the dict keys. However, there is a more elegant solution: we can use the transpose method df.T

passengers = passengers.T
passengers

Remark Again, the transpose method creates a copy. Thus, if you want to actually apply the changes to the DataFrame, you need to save it to the variable, as shown above.

Let’s combine a few things we have learned so far.

Exercise 01 a:

  • Delete the markdown cell below, which says ‘We don’t need this cell.’.
  • Insert a new code cell below this cell.
  • Create a ‘passengers’ DataFrame with one of the options we have presented above, which contains the following entries:
    • Miss. Laina Heikkinen, Female, aged 26, survived;
    • Mrs. Jacques Heath Futrelle, Female, 35 years old, survived;
    • Mr. William Henry Allen, male, 35.0, did not survive;
  • Make sure that you use a consistent notation in your DataFrame.

We don’t need this cell. Delete it.

Exercise 01 b:

  • Append the two passengers mentioned previously (Mr. Own Harris Braund and Mrs. John Bradley Cumings) to the DataFrame.

passengers_list = [[‘Female’, 26.0,True],[‘Female’,35.0,True],[‘Male’,35.0,False]]
passengers = pd.DataFrame(passengers_list,index = [‘Miss. Laina Heikkinen’,‘Mrs. Jacques Heath Futrelle’,‘Mr. William Henry Allen’],columns=[‘Gender’,‘Age’,‘Survived’])
passengers

1.B.3 Dataset operations in Pandas

Most commonly we create DataFrame structures by reading csv files. We store the datasets you need in the first lab in datasets. The next labs will have the datasets stored accordingly.

passengers_filepath = os.path.join(os.getcwd(), ‘datasets’, ‘titanic.csv’)
passengers_filepath

passengers = pd.read_csv(passengers_filepath)
passengers.head() # Head shows the first five elements (unless specified otherwise) of the DataFrame

These are the five passengers you should have added to your dataframe in Exercise 01. Now, we have some extra information. Using df.head() we can get an impression of how the data looks like in our dataset. Note that the information in the Survived column is stored as 0/1 instead of False/True. Pandas doesn’t care how Boolean values are stored. If a column only contains 0s and 1s, Pandas treats the column as Boolean, and works with it in the same way as a column of False and True.

However, df.head() only shows the first five entries. How many entries are there in total? You can use the python native len() function.

len(passengers)

Tab completion

Tab completion is a powerful method for viewing object attributes and available methods.

We have just seen the df.head() method above. Let’s see what other functions we can call on a DataFrame. You can see what methods are available by typing the DataFrame name followed by . and then hitting the <tab> key. Then you can access any method’s help documentation by hitting the method’s name followed by ?; this opens a ‘pager’ at the bottom of the screen, you can hit <esc> to exit it.

For example, to find the last few entries of the DataFrame, first type passengers.t, then hit <tab> and then choose df.tail().

passengers.tail

Let’s get the documentation

passengers.tail?

If you want to get more than five entries you can do so by specifying N entries with df.head(N) or df.tail(N). Give it a try above.

Row selection

As already mentioned, you can think of a DataFrame as a group of Series that share an index (either the column headers or the row id). This makes it easy to select specific observations (i.e. rows).

type(passengers.loc[0])

We have already talked about df.loc[label], which selects a row based on the label of the index, e.g. ‘Mr. Owen Harris Braund’. If we want the N-th row, we can use df.iloc[N]. In the loaded Titanic dataset, it so happens that the indices also run from 0 to 886, and thus df.loc[N] and df.iloc[N] return the same observation.

Technically, there are three options to select rows:

  • df.loc[label]: works on labels in the index
  • df.iloc[N]: works on the position in the index (so it only takes integers)
  • df[N]: works the same way as df.iloc[N]

It is often safest to use the first two methods (rather than just using square brackets) to index into pandas DataFrames.

passengers.iloc[0]

We can also select several rows; the resulting structure is a DataFrame. This operation is called slicing.

Remark: Python slicing might seem a bit confusing. When you specify a range to slice with i:j the slice returned runs from the i-th entry to the j-1-th entry. This can be very helpful at times, when you want the last few entries, and you don’t know how long the object is (e.g. a DataFrame or list). In that case, you can slice with k:len(object).

type(passengers.iloc[0:3])

passengers.iloc[0:3]

This is equivalent to using .iloc

passengers[0:3]

Column Selection

As already mentioned, you can think of a DataFrame as a group of Series that share an index (either the column headers or the row id). This makes it as easy to select specific columns as it is to select rows.

type(passengers[‘Name’])

passengers[‘Name’].head()

To select multiple columns we simply need to pass a list of column names. We said above that we were interested in who survived, so let’s check that:

#Remark the double brackets, because we passed a list of names
passengers[[‘Name’, ‘Survived’]].head(7)

Exercise 02: What do you expect what the type of passengers[['Name', 'Survived']].head(7) is? Check it!

type(passengers[[‘Name’,‘Survived’]].head(7))

You can combine row and column selection, as we already did above, when selecting a specific entry.

passengers.iloc[0][‘Survived’]

Exercise 03: Look at the dataframe you obtained in Exercise 01. Reconstruct the same dataframe using the newly loaded dataframe, i.e. select the correct rows and columns, such that the output is the same as in Exercise 01.

passengers[[‘Name’,‘Age’,‘Sex’,‘Survived’]].head(5)

Filtering

Now suppose that you want to select all the observations of minors (i.e. people under the age of 18 in the UK). It is easy to do that:

passengers[passengers[‘Age’] <= 17]

Or equivalently:

passengers[passengers.Age <= 17]

This concept is called filtering, and passengers.Age <= 17 is called a mask, which hides/masks all entries that don’t fit the criteria; i.e. it only returns the observations for which the mask returns True. You can also filter the data by using multiple attributes:

young_passengers = passengers[(passengers.Age <= 17) & (passengers.Survived)] #Remark that we were able to drop ‘Survived == 1’
young_passengers

Remark The first row has index 9. As mentioned earlier: the label of the index does not have to coincide with the position of the data entry. Here, df.iloc[9] and df.loc[9] return different values!.

Exercise 04: What will be the return values of young_passengers.iloc[8:10] and young_passengers.loc[8:10]. Check the answer.

young_passengers.iloc[8:10]

young_passengers.loc[8:10]

Note that we can also index into columns using loc. We just have to specify the second dimension (much as we would do with numpy arrays). Let’s give it a try. We want to get the list of minors (children) aboard the Titanic:

young_passengers.loc[:, ‘Name’]

young_passengers.iloc[:, 2] # And now using column indexing Remark this would not work with .loc[:, 3]

If we try the following we will get an empty DataFrame because there are no rows with labels 0 and 1.

young_passengers.loc[0:2]

The result is still a DataFrame

type(young_passengers.loc[0:2])

For more, check out Advanced Indexing

Basic operations on datasets

We now know how to

  • load a dataset from a csv file pd.read_csv(),
  • get the first few entries df.head(),
  • select certain rows or columns based on their position df.iloc[] or index label df.loc[],
  • and finally how to filter the dataset for entries that satisfy certain conditions df[condition].

However, at the beginning of the lab, we wanted to figure out the passenger group worst hit. Using filtering, we can already split the dataset into different subsets. Now let’s apply some operations.

First let’s get the age spread, and figure out the oldest and youngest passenger. The passengers in the dataset are not sorted by age. Looking at each entry individually is not an option. Thankfully pandas dataframes have a method: df.sort_values(by='column', ascending={True,False}).

passengers.sort_values(by=‘Age’).head()

passengers.sort_values(by=‘Age’, ascending=False).head() # Returns the data in descending order of the sorted value

The two results above already show that young passengers had a better survival rate than senior passengers. Let’s try, tentatively, to answer the research question.

Discussion: We will consider the different travel classes, genders, and age groups. Discuss with your lab partner who you would expect had the best and who had the worst survival rate.

Exercise 05:

  • Compute the survival rate of: minors (0-17 years old), adults (18-65), and seniors (66+). (Hint: All you need is len().)
  • Compute the survival rate of: women and men.
  • Compute the survival rate of the travel classes (1, 2, and 3).
  • Does the gender have an influence on the survival rates of minors? (What is the survival rate of girls vs boys?)
  • Compute the survival rate of all combinations of class, age group, and gender, and print them out as “Age group, gender, class: percentage”, e.g. “Minors, Male, First Class: 0.3”. (Hint: These are quite a few computations, think about how to automate it, e.g. conditions = condition1 & condition2 might help.).

survival_minors = passengers[(passengers.Age<=17.0) &(passengers.Survived) ]
all_minors = passengers[(passengers.Age<=17.0)& (passengers.Age>0)]
minors_rate = (len(survival_minors))/(len(all_minors))
minors_rate

survival_adults= passengers[(passengers.Age>=17.0)&(passengers.Age<=65)&(passengers.Survived)]
all_adults= passengers[(passengers.Age>=17.0)&(passengers.Age<=65)]
adults_rate = (len(survival_adults))/(len(all_adults))
adults_rate

survival_seniors = passengers[(66<passengers.Age) &(passengers.Survived) ]
all_seniors = passengers[(66<passengers.Age)& (passengers.Age>0)]
seniors_rate = (len(survival_seniors))/(len(all_seniors))
seniors_rate

survival_man = passengers[(passengers.Sex == ‘male’) &(passengers.Survived) ]
all_man = passengers[(passengers.Sex == ‘male’)]
man_rate = (len(survival_man))/(len(all_man ))
man_rate

survival_woman = passengers[(passengers.Sex == ‘female’) &(passengers.Survived) ]
all_woman = passengers[(passengers.Sex == ‘female’)]
woman_rate = (len(survival_woman))/(len(all_woman ))
woman_rate

survival_one = passengers[(passengers.Pclass == 1) &(passengers.Survived) ]
all_one = passengers[(passengers.Pclass == 1)]
one_rate = (len(survival_one))/(len(all_one ))
one_rate

survival_two = passengers[(passengers.Pclass == 2) &(passengers.Survived) ]
all_two = passengers[(passengers.Pclass == 2)]
two_rate = (len(survival_two))/(len(all_two ))
two_rate

survival_three = passengers[(passengers.Pclass == 3) &(passengers.Survived) ]
all_three = passengers[(passengers.Pclass == 3)]
two_three = (len(survival_three))/(len(all_three ))
two_three

#Exercise a

Exercise b

#Exercise c

#Exercise d

We need your help: This is a new course. In order for us to improve the labs for the next iterations, and to make sure that the next labs are better, we need your feedback. Please fill out the following form.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值