week03-Data-wrangling-solution

Foundation Data Sciences

Week 03: Pandas - Data Wrangling

Learning outcomes:
In this lab you will learn to prepare your data for future use. By the end of the lab you should be able to:

  • get an overview of your data,
  • clean data, and
  • combine data from multiple datasets.

After analysing the survival rate of different passenger groups on the titanic last week, now imagine you are a political advisor for the UN this week. You will analyse data from the World Health Organization (WHO), in order to answer the following question:

Research question: Which medical, social and economic factors have the biggest impact on life expectancy, and which improvement would have the biggest impact on the three countries with the shortest life expectancy?

Data description: The data from the first dataset for this lab originates from the Global Health Observatory (GHO) data repository under the WHO. It contains health factors, as well as social and economic data of 193 countries between 2000 and 2015. The dataset was slightly preprocessed for this lab. The second dataset is based on data from an article in ‘Fivethirtyeight’, a website that focuses on political and economic articles and polls. It contains information about the alcohol consumption in 193 countries.

Remarks

  • Starting from this lab, we will not provide all the code, but expect you to be able to draw from previous labs to fill in the code.
  • Try not to use copy+paste when coding these labs, as typing will help you memorize the code better.
  • Try to understand each detail in the code we provide, and read the comments!
  • In the text df refers to a generic pandas DataFrame, which allows us to indicate which methods are applied to DataFrames. E.g. in df.head(), the head is a member function of a pandas DataFrame.
  • Finally, try to keep a clean structure when programming: As discussed in the previous lab, we will start by loading all libraries we need at the top of the lab.

import os
import pandas as pd
import numpy as np
import seaborn as sns

We can now start by loading the data we will use in this lab.

Exercise 01:

a) Load the life_expectancy.csv file which is in datasets, and store the variable as life_expectancy.

b) Print the first few entries of the dataset.

Your code

life_expectancy_loc = os.path.join(os.getcwd(), ‘datasets’, ‘life_expectancy.csv’)
life_expectancy = pd.read_csv(life_expectancy_loc)
life_expectancy.head()

In the last lab you learned to load data, and get an overview of the data using df.head(). Now, we will look at higher level information of the data. Whenever you work with data, it is important to get a good idea of how your data looks like:

  • Nature of data: Is the data from a population (e.g. medical data from patients) or rather high-level statistics (e.g. economic data comparison of different countries)?
  • Data format: Is the dataset a relational database, if so what columns does it have?
  • Missing data: Is the data complete (e.g. does every row contain an entry in every column, or are some entries missing)?
  • Consistent data: Is the notation used consistent throughout the database (e.g. Nan vs None vs Null vs __ vs 0 vs -1 …)?

All those points and many more should be answered before you start running any algorithms on your data. Clean data is key!

2.A Data exploration: High level information of DataFrames

In Exercise 01 you printed out the first few entries and also the column names. These entries might give you an idea of what type of entries you should expect in a specific columns. For example, in the column ‘Total expenditure’, which reflects the total general government expenditure on health, you should only expect numbers. However, this might not always be the case and the data might be stored in a different format from what you would expect. Columns with numbers could be floats, integers or even strings, e.g. 0 vs 0.0 vs ‘Zero’ vs ‘NIL’ vs ‘0’.

Exercise 02: Use filtering to get all the entries for ‘Zimbabwe’.

Your code

life_expectancy[life_expectancy[‘Country’] == ‘Zimbabwe’]

As you can see it contains a string ‘No data’ in the ‘Total expenditure’ column. If we print out the entry below, which looks like a float (6.44), we realize something unexpected:

life_expectancy.iloc[2923][‘Total expenditure’]

It’s printed out as ‘6.44’ instead of 6.44. This is because pandas expects all entries in a specific column to be of the same type. Thus, when pandas loads the table from a csv file and finds an entry that clearly isn’t a numeric value (‘No data’), it casts all other entries in the same column to a string.

To be certain of what types are used we can use df.info().

life_expectancy.info()

Remark The ‘Total expenditure’ column has dtype:object, but the entries we printed out above looked like strings. To understand why pandas stores strings as objects, read this answer on stackoverflow. Stackoverflow is an extremely helpful website for most questions you will encounter when coding!

To get an overview of the numerical data you have in your dataset .describe() is a good option. It gives you the most important statistical analysis of your columns with numerical data.

life_expectancy.describe()

Have a look at the output of df.describe().

Discussion 01: Discuss with your lab partner what the count row represents. Check the correct answer on stackoverflow or the official pandas documentation. How is this information helpful when analyzing your dataset? Use stackoverflow or the official documentation to work out how you could compute just one of the statistics for one specific column, e.g. how can we compute the mean of the ‘BMI’ column? Write down your answers:

Your answer:

  • count returns the number of non-NaN values in a column
  • Having the number of non-NaN values tells you how meaningful your results are and how helpful a specific column is. For example, imagine you are given a dataset, where only 10% of all entries in a column are non-NaN values. Your data may not be statistically significant.
  • You can select the results with df['column'].count(), .mean(), .std(), .min(), .max(), .quantile(0.25).
‘Total expenditure’ is not listed in the statistical analysis, which is because, as mentioned above, all values were cast to a string when loaded. This is very annoying, as most values are actually numeric, and the statistical measures of the numeric values, such as the median, are still meaningful, even though not all values are numeric. To compute these values, we will have to clean up the data first. As mentioned above: Clean data is key!

2.B Cleaning data

2.B.1 Detecting non-numerical values

A common problem in datasets is that data are not consistently of the same type. For example, a date is written in a non-standard way or a number is given as a string instead of a numeric value. Many data analysis algorithms are applied to numeric data. So let’s tackle the problem we encountered above in the ‘Total expenditure’ column.

To help us ensure that every entry in a column is numeric, pandas offers the function pd.to_numeric(series, errors={‘ignore’, ‘raise’, ‘coerce’}, downcast = {‘integer’, ‘signed’, ‘unsigned’, ‘float’}).

This function does three things:

  1. it checks whether every entry is a numeric value
  2. if a non-numeric value is encountered (error), either it is ignored, raised (see below) or replaced by a NaN value (error = 'coerce')
  3. if downcast is chosen (default is None), it casts all numeric values down to the specified parameter.

Remark: If the chosen dtype occupies more bits than the dtype of the value (e.g. chosen dtype is float, but the found value is int) pd.to_numeric doesn’t recast the value!

life_expectancy[‘Total expenditure’] = pd.to_numeric(life_expectancy[‘Total expenditure’],
errors = ‘coerce’)
life_expectancy[‘Total expenditure’]

Remark pd.numeric creates a copy and does not apply the changes directly to the series, hence the need to save the changes as shown above. If you just want to check whether or not all values are numeric, you can call pd.to_numeric(life_expectancy['Total expenditure'], errors='raise'), which will raise errors but not save any changes.

Why is it interesting to have NaN values instead of strings, such as ‘No data’? Let’s check the type.

type(np.NaN)

It’s a float, which means that you can apply all numeric operations to entire columns; NaN values are usually just ignored. Now we can compute the missing values from our statistical analysis above.

Exercise 03:

a) Start by saving the values from the ‘Total expenditure’ column in a new variable total_expenditure. What type does total_expenditure have? (Try to think of the answer first, then check with type().)

b) Compute the eight values (count, mean, …) using the methods you should have learned about in Discussion 01.

Your code

total_expenditure = life_expectancy[‘Total expenditure’]
print(type(total_expenditure))
print("Mean: "+ str(total_expenditure.mean()))
print("std: " + str(total_expenditure.std()))
print("Min: " + str(total_expenditure.min()))
print("Max: " + str(total_expenditure.max()))
print("25 quantile: " + str(total_expenditure.quantile(0.25)))
print("Median: " + str(total_expenditure.median()))
print("75 quantile: "+ str(total_expenditure.quantile(0.75)))

Remark: By default, when loading a csv file in pandas, empty values are usually filled with NaN values automatically.

However, even NaN values can be annoying. Where do you plot a value with x-coordinate 1 and y-coordinate NaN? There are different options for how to deal with NaN values.

First, let’s start by checking which values are NaN, using .isna(), which is a mask (we have discussed masks in lab01). It returns the same data structure as the one it is applied to, with True and False values instead of the values from the original data structure.

print(‘Series:\n’)
print(life_expectancy[‘Life expectancy’].isna())
print(’\n\nDataFrames:\n’)
print(life_expectancy.isna())

It looks like there are no NaN in the ‘Life expectancy’ column. Let’s double check, by summing all the values of each column. (Remember that False == 0, and True == 1).

life_expectancy[‘Life expectancy’].isna().sum()

It turns out that there are a few missing values, but none of them are at the beginning or the end, so we didn’t see them. This is an important lesson: using .head() gives you a good initial idea of how the data looks like, but it is by no means comprehensive!

2.B.2 Working with NaN values

Dropping NaN values

Now we know how to find NaN values, let’s fix those values. We have a couple of options. First, if you just want to get rid of data entries with NaN values, you can drop them using .dropna(axis={0 or'index', 1 or'columns}, how={‘any’, ‘all’}, thresh=N, subset=[col1, col2, ...]).

life_expectancy.dropna(axis = 0, how = ‘any’) # drops rows if any value is a NaN

Because we specified how='any', pandas drops a row if any column has a NaN value, so the DataFrame is reduced to 1649 rows from 2938 rows initially. However, we wanted to discuss more ways of handling NaN values, and now we have dropped them. The good news is that dropna() creates a copy, as usual with pandas DataFrame methods, and the original DataFrame is preserved. If you do want to apply those changes, you can use either df = df.dropna(axis = 0, how = 'any') or df.dropna(axis = 0, how = 'any', inplace = True). inplace=True can be used in many DataFrame methods to apply the operations to the DataFrame.

Now that we know that we didn’t apply the changes to the DataFrame, let’s discuss some more options for dealing with NaNs.

Instead of how = 'any', you can choose 'all', which drops rows only if all values are NaN. You can specify axis=1, to look along columns for NaN values and drop columns instead of rows. You can further set subset=[col1, col2] to search specific columns or rows for NaN values, and finally you can set a thresh=N to specify that at least N values need to be non-NaN to be kept. Give it a try!

Exercise 03: Drop columns with more than 20% NaN values. Save the result to a new DataFrame and print that DataFrame.

Your code

threshold = int(0.8* len(life_expectancy))
cleaned_life_expectancy = life_expectancy.dropna(axis=1, thresh=threshold)
cleaned_life_expectancy

You should have got the original DataFrame without the ‘Population’ column.

Overwriting NaN values

If you don’t want to drop your data entries you can replace the values. You have many options, including:

  • The mean, max or min value of the column
  • strings
  • 0

The function needed is df.fillna(value={scalar, dict}, method={‘bfill’, ‘ffill’}, axis={0, 1}, limit=N).

life_expectancy.fillna(0).loc[2922] # Get row 2922 of the DataFrame where NaN values are replaced with 0.

Above we have applied two operations at once to the DataFrame. It shows the row we initially inspected where ‘Total expenditure’ = ‘No data’. Now it is 0.

Remark: In some cases, it makes more sense to replace different NaN values according to the column they are in. You can do this by passing a dictionary instead of a scalar. df.fillna(value={'col1': value1, 'col2': value2}). For more information on fillna() you are invited to check the pandas documentation.

We will only be interested by entries with no NaN values for the rest of this lab, so we will drop all entries with NaN values:

life_expectancy.dropna(subset=[‘Life expectancy’], inplace=True)

2.C Unique values

So far, we have loaded the data, we have extracted some high-level information about the data set, and dealt with missing data entries. These steps are important initial steps that need to be taken whenever you deal with data, especially data you haven’t generated yourself.

Another important step is to check whether your data is free of duplicate observations. Let’s check how many unique values we have in each column of the DataFrame.

life_expectancy.nunique()

We see that, for example, in ‘Status’, we only have two unique values. Let’s check which ones they are.

pd.unique(life_expectancy[‘Status’])

Remark: df.nunique() is applied to the data structure (Series or DataFrames), whereas pd.unique() takes the data structure as a parameter.

The above query showed us that there are two categories in the ‘Status’ column: Developing and Developed. Let’s find out the impact of whether a country is considered to be developing or developed.

Exercise 04:

a) Split the dataset into two subsets with the names ‘developed’ and ‘developing’ according to the status of the countries.

b) Compute the mean life expectancy of each subset.

Your code

developed = life_expectancy[life_expectancy[‘Status’]‘Developed’]
developing = life_expectancy[life_expectancy[‘Status’]
‘Developing’]
print(developing[‘Life expectancy’].mean(), developed[‘Life expectancy’].mean())

Discussion: Is the difference of life expectancy between the two groups about what you expected? If not what statistical metric could help you understand why the life expectancy difference is not what you expected? If you can think of a specific metric, compute that metric and see whether this confirms your expectations. Write down your answers:

Your answer:

  • You might have expected that developing countries would have a shorter life expectancy, than developed countries. However, you might also have anticipated a greater difference.
  • Maybe the spread of life expectancy in the developing countries is much greater than developed ones and would explain, why even though we would expect much shorter life expectancy, the mean does not show it.
  • We can compute the variance, as:

Your code

print(developed[‘Life expectancy’].std(), developing[‘Life expectancy’].std())

We have found the unique values, but pd.unique searches unique values in each column. To search for entire rows of data that duplicate other rows, we need to use df.duplicated(subset=[col1, col2, ...], keep={'first', 'last', False}). This function is a mask; by now you should know how to use masks. It returns a Series with True and False entries, where True marks the rows that are duplicates of another row. The subset parameter allows you to specify which columns to consider when checking for duplicates. This can be very helpful when you have a column of unique identifiers, and you want to know if there are rows that are identical apart from the unique identifier. For example, different patients (with different IDs = unique identifier) might have the same symptoms. The keep parameter lets you specify which occurrence of a duplicate to mark as False (i.e. not as a duplicate):

  • 'first', which is the default value, marks the first occurrence as not a duplicate, but all other occurrences as duplicates of the first occurrence,
  • 'last' marks the last occurrence as not a duplicate, but all other occurrences as duplicates of the last occurrence, and
  • False marks all occurrences of duplicates as duplicates.

Let’s give it a try:

life_expectancy.duplicated()

As promised above, .duplicated() gives you a mask. It looks like no entry is a duplicate. Let’s check.

life_expectancy.duplicated().sum()

(Remember, False==0, True==1; thus summing the series tells you how many True entries there were in the Series). The above tells you that there are no duplicates. Thankfully the WHO, as we would hope, created a clean dataset.

Let’s check whether there are countries in which the life expectancy remained the same over the years.

life_expectancy.duplicated(subset=[‘Country’, ‘Life expectancy’]).sum()

There are 136 instances in which the life expectancy did not change. Let’s print them.

Exercise 05:

  • Print all occurrences of duplicates. Hint: think about the keep parameter.
  • Print all the countries that appear in the list. Each country should only appear once.
  • Create a new DataFrame from the life_expectancy DataFrame, with the name cleaned_life_expectancy, in which each country only appears once. Since the entries are sorted newest to oldest, only keep the newest observation of each country. Print out the first few lines to check your DataFrame looks as expected.

Hint: to negate a condition in Pandas, use the tilde ~ operator instead of not, e.g.: if people is a DataFrame with a column adult of boolean values: child = people[~people['adult']].

Your code

You should use keep=False to mark all duplicates as True

print(life_expectancy[life_expectancy.duplicated(subset=[‘Country’, ‘Life expectancy’], keep=False)])
print(pd.unique(life_expectancy[life_expectancy.duplicated(subset=[‘Country’, ‘Life expectancy’], keep=False)][‘Country’]))
cleaned_life_expectancy = life_expectancy[~life_expectancy.duplicated(subset=[‘Country’])]
cleaned_life_expectancy.head()

We have finished cleaning our dataset. Let’s plot some of the results. You can ignore the code below; we will learn about plotting in the next two labs and this is just to give you an idea of your data.

cleaned_life_expectancy.head()

sns.pairplot(cleaned_life_expectancy, x_vars = [‘Life expectancy’], y_vars = [‘percentage expenditure’, ‘Hepatitis B’, ‘Measles’,
‘BMI’, ‘under-five deaths’, ‘Polio’, ‘Total expenditure’,
‘Diphtheria’, ‘HIV/AIDS’, ‘GDP’, ‘Population’,
‘thinness 1-19 years’, ‘thinness 5-9 years’,
‘Income composition of resources’, ‘Schooling’], hue=‘Status’, palette=‘Dark2’)

Discussion: Discuss with your lab partner which factors seem to have a clear correlation with life expectancy.

Your answer:

  • Schooling, Income composition of resources, and adult mortality have the clearest correlation. However, BMI, health expenditure, and HIV also have a correlation with outliers.

Up to now, we have used single datasets and analyzed them. The WHO dataset contains medical and economic factors, and the plot above showed whether or not there was a correlation between those factors and life expectancy.

Now suppose that a politician approaches you who is interested in whether limiting alcohol consumption might have a positive effect on life expectancy. To analyze this you need to use a second dataset.

drinks_by_country_loc = os.path.join(os.getcwd(), ‘datasets’, ‘drinks_by_country.csv’)
drinks_by_country = pd.read_csv(drinks_by_country_loc)
drinks_by_country.head()

2.D Combining data from two datasets

You will rarely find all the information you need in one dataset. Being able to combine data from different datasets into one is very important. Practise it until you feel very comfortable; having just tried it once is not enough.

There a various ways of combining data from datasets: concatenation, merging and joining.

Concatenation

Let’s start with the arguably simplest pandas function: concat

When you concatenate two data structures, pandas DataFrames or pandas series, you simply append one to the other along one axis (along rows or along columns), i.e. you expect that the new data is “independent” of the existing data.

Let’s illustrate that on a simple toy example.

df1 = pd.DataFrame({‘A’: [‘A0’, ‘A1’, ‘A2’, ‘A3’],
‘B’: [‘B0’, ‘B1’, ‘B2’, ‘B3’],
‘C’: [‘C0’, ‘C1’, ‘C2’, ‘C3’],
‘D’: [‘D0’, ‘D1’, ‘D2’, ‘D3’]},
index=[0, 1, 2, 3])

df2 = pd.DataFrame({‘A’: [‘A4’, ‘A5’, ‘A6’, ‘A7’],
‘B’: [‘B4’, ‘B5’, ‘B6’, ‘B7’],
‘C’: [‘C4’, ‘C5’, ‘C6’, ‘C7’],
‘D’: [‘D4’, ‘D5’, ‘D6’, ‘D7’]},
index=[0, 1, 2, 3])

print(df1)
print(’\n\n’)
print(df2)

result = pd.concat([df1, df2])
result

Remarks:

  • Both DataFrames had entries with indices 0-3. However, using concat means that there is no relational connection between the rows, and thus df2 is just appended to df1.
  • pd.concat() takes as input a list of data structures and can concatenate several DataFrames at once.

As mentioned above, we can specify along which axis the data structures should be concatenated. The default is axis=0, as above. We hinted above that concat and join have two different meanings. In fact, join={'inner', 'outer'} is a parameter allowing you to specify the set logic of how data entries that only exist in one set but not the other should be treated. Using inner join means that you only want the intersection, while outer join means you want to keep the union. Sounds a bit abstract? Let’s try it out.

df3 = pd.DataFrame({‘A’: [‘A0’, ‘A1’, ‘A2’, ‘A3’],
‘B’: [‘B0’, ‘B1’, ‘B2’, ‘B3’],
‘D’: [‘D0’, ‘D1’, ‘D2’, ‘D3’]},
index=[0, 1, 2, 3])

df4 = pd.DataFrame({‘A’: [‘A4’, ‘A5’, ‘A6’, ‘A7’],
‘B’: [‘B4’, ‘B5’, ‘B6’, ‘B7’],
‘C’: [‘C4’, ‘C5’, ‘C6’, ‘C7’]},
index=[0, 1, 2, 3])

print(df3)
print(’\n\n’)
print(df4)

result = pd.concat([df3, df4], axis=0, join=‘outer’)
result

In the example above along axis=0, one dataset has the columns ‘A’, ‘B’, and ‘D’, while the other has the columns ‘A’, ‘B’, and ‘C’. Because we specified join='outer' it took the union of the two column sets, i.e. ‘A’, ‘B’, ‘C’, and ‘D’ and just filled out the missing information with NaN values. What would happen in the case of join='inner'? Give it a try. Is it what you would expect?

Your code

pd.concat([df3, df4], axis=0, join=“inner”)

Above we have seen, that indices 0-3 were repeated, because we just appended one dataset to the other and didn’t assume any relation. However, in that case the indices most probably don’t even have a meaning, and could be ignored. This can be achieved with, who could have guessed, ignore_index=True.

result = pd.concat([df1, df2], ignore_index=True)
result

Merging

What can we do, when we have a relation between two datasets? For example, in both the life expectancy dataset and the alcohol consumption dataset, we have the column ‘Country’. Combining both datasets by comparing those columns and combining the corresponding data entries would be very helpful.

Let’s consider a toy example again.

left = pd.DataFrame({‘key’: [‘K0’, ‘K1’, ‘K2’],
‘A’: [‘A0’, ‘A1’, ‘A2’],
‘B’: [‘B0’, ‘B1’, ‘B2’]})

right = pd.DataFrame({‘key’: [‘K1’, ‘K0’, ‘K3’],
‘C’: [‘C0’, ‘C1’, ‘C3’],
‘D’: [‘D0’, ‘D1’, ‘D3’]})
print(left)
print(’\n\n’)
print(right)

result = pd.merge(left, right, on=‘key’)
result

With on='column' we specify the column the two datasets will be compared on.

Remarks

  • In contrast to pd.concat(), pd.merge() only takes two DataFrames as input (left and right)
  • pd.merge() compares the rows across all data entries. For example, in the ‘right’ dataset the data entry ‘K0’ is in the second row though in the left data set it is in the first row.
  • As you can see above, by default, pd.merge() uses an inner join.
  • The column specified by the keyword ‘on’ needs to be written the same way in both datasets.

With pd.concat() the ‘join’ parameter specifies an inner join and an outer join. With pd.merge() we have the same option, only the keyword is how={'inner', 'outer', 'left', 'right'}, with the two extra options left and right. Technically left stands for ‘left outer join’, and means that:

  • we keep all rows that appear in the left dataset
  • merge the matching rows from the right dataset into the left dataset
  • drop all rows that only appear in the right dataset.

Let’s try it out.

result = pd.merge(left, right, on=‘key’, how=‘left’)
result

‘K0’, ‘K1’, and ‘K2’ are kept, as they all appear in the left dataset. However, ‘K3’, which only appears in the right dataset is dropped.

Note on performance: By default, merge will sort the output based on the values in the key column in lexicographic order, which can affect the performance substantially, if you do not care about the order, you can set sort=False.

There are a couple of more parameters you can set, in both concat and merge. However, the above are the most important. If you are interested, check out the pandas website.

Exercise 06: Use either pd.merge or pd.concat to combine the two datasets(cleaned_life_expectancy, drinks_by_country) so we can relate drinks consumption to life expectancy. We don’t want to have unnecessary NaN values, so only keep the entries that are found in both datasets. Save the result to final_life_expectancy.

Your code

final_life_expectancy = pd.merge(cleaned_life_expectancy, drinks_by_country, on=‘Country’, how=‘inner’)
final_life_expectancy

Again, ignore the code below. Just run it and compare the impact of different drinks on life expectancy.

sns.pairplot(final_life_expectancy, x_vars = [‘Life expectancy’], y_vars = [‘Beer Servings’, ‘Spirit Servings’, ‘Wine Servings’, ‘Total Litres of Pure Alcohol’], hue=‘Status’, palette=‘Dark2’)

Discussion:

  • Does the data above suggest that alcohol consumption as an average for a population has an impact on life expectancy?
  • Find the three countries with the lowest life expectancy from final_life_expectancy.
  • Have a look at the plots above, and discuss with your lab partner: Can we pinpoint the short life expectancy to specific factors? What changes would help those countries probably most to improve their life expectancy.

Your code

pd.set_option(‘display.max_columns’, None) # Helps you to see all columns
print(final_life_expectancy.sort_values(by=‘Life expectancy’, ascending=False).iloc[-3:])

Your answer:

  • The three countries are Central African Republic, Angola, and Sierra Leone
  • It seems like the immunization against Hepatitis B is quite low (47%) in the Central African Republic, the Mean BMI of each country is within the norm (compared to most rich countries, which are generally overweight), which suggests that many people in those three countries are underweight. This is also shown in the thinness column showing that 8% of children are underweight. In the Central African Republic and in Angola, the immunization against Polio is very low, as well as against Diphtheria.
©️2020 CSDN 皮肤主题: 1024 设计师:上身试试 返回首页