更多原始数据文档和JupyterNotebook
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python
Datacamp track: Data Scientist with Python - Course 22 (1)
Exercise
Loading the data
Now it’s time to check out the dataset! You’ll use pandas (which has been pre-imported as pd
) to load your data into a DataFrame and then do some Exploratory Data Analysis (EDA) of it.
The training data is available as TrainingData.csv
. Your first task is to load it into a DataFrame in the IPython Shell using pd.read_csv()
along with the keyword argument index_col=0
.
Use methods such as .info()
, .head()
, and .tail()
to explore the budget data and the properties of the features and labels.
Some of the column names correspond to features - descriptions of the budget items - such as the Job_Title_Description
column. The values in this column tell us if a budget item is for a teacher, custodian, or other employee.
Some columns correspond to the budget item labels you will be trying to predict with your model. For example, the Object_Type
column describes whether the budget item is related classroom supplies, salary, travel expenses, etc.
Use df.info()
in the IPython Shell to answer the following questions:
- How many rows are there in the training data?
- How many columns are there in the training data?
- How many non-null entries are in the
Job_Title_Description
column?
fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_2533/datasets/TrainingSetSample.csv'
from urllib.request import urlretrieve
urlretrieve(fn, 'TrainingData.csv')
import pandas as pd
df = pd.read_csv('TrainingData.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 26 columns):
Unnamed: 0 1560 non-null int64
Function 1560 non-null object
Use 1560 non-null object
Sharing 1560 non-null object
Reporting 1560 non-null object
Student_Type 1560 non-null object
Position_Type 1560 non-null object
Object_Type 1560 non-null object
Pre_K 1560 non-null object
Operating_Status 1560 non-null object
Object_Description 1461 non-null object
Text_2 382 non-null object
SubFund_Description 1183 non-null object
Job_Title_Description 1131 non-null object
Text_3 296 non-null object
Text_4 193 non-null object
Sub_Object_Description 364 non-null object
Location_Description 874 non-null object
FTE 449 non-null float64
Function_Description 1340 non-null object
Facility_or_Department 252 non-null object
Position_Extra 1026 non-null object
Total 1542 non-null float64
Program_Description 1192 non-null object
Fund_Description 819 non-null object
Text_1 1132 non-null object
dtypes: float64(2), int64(1), object(23)
memory usage: 317.0+ KB
Exercise
Summarizing the data
You’ll continue your EDA in this exercise by computing summary statistics for the numeric data in the dataset. The data has been pre-loaded into a DataFrame called df
.
You can use df.info()
in the IPython Shell to determine which columns of the data are numeric, specifically type float64
. You’ll notice that there are two numeric columns, called FTE
and Total
.
FTE
: Stands for “full-time equivalent”. If the budget item is associated to an employee, this number tells us the percentage of full-time that the employee works. A value of 1 means the associated employee works for the school full-time. A value close to 0 means the item is associated to a part-time or contracted employee.Total
: Stands for the total cost of the expenditure. This number tells us how much the budget item cost.
After printing summary statistics for the numeric data, your job is to plot a histogram of the non-null FTE
column to see the distribution of part-time and full-time employees in the dataset.
This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the Scikit-Learn Cheat Sheet and keep it handy!
Instruction
- Print summary statistics of the numeric columns in the DataFrame
df
using the.describe()
method. - Import
matplotlib.pyplot
asplt
. - Create a histogram of the non-null
'FTE'
column. You can do this by passingdf['FTE'].dropna()
toplt.hist()
. - The title has been specified and axes have been labeled, so hit ‘Submit Answer’ to see how often school employees work full-time!
# Import pandas
import pandas as pd
# Load file into DataFrame: df
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2533/datasets/TrainingSetSample.csv', index_col=0)
# Print the summary statistics
print(df.describe())
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Create the histogram
plt.hist(df['FTE'].dropna())
# Add title and labels
plt.title('Distribution of %full-time \n employee works')
plt.xlabel('% of full-time')
plt.ylabel('num employees')
# Display the histogram
plt.show()
FTE Total
count 449.000000 1.542000e+03
mean 0.493532 1.446867e+04
std 0.452844 7.916752e+04
min -0.002369 -1.044084e+06
25% 0.004310 1.108111e+02
50% 0.440000 7.060299e+02
75% 1.000000 5.347760e+03
max 1.047222 1.367500e+06
<Figure size 640x480 with 1 Axes>
Exercise
Exploring datatypes in pandas
It’s always good to know what datatypes you’re working with, especially when the inefficient pandas type object
may be involved. Towards that end, let’s explore what we have.
The data has been loaded into the workspace as df
. Your job is to look at the DataFrame attribute .dtypes
in the IPython Shell, and call its .value_counts()
method in order to answer the question below.
Make sure to call df.dtypes.value_counts()
, and not df.value_counts()
! Check out the difference in the Shell. df.value_counts()
will return an error, because it is a Series method, not a DataFrame method.
How many columns with dtype object
are in the data?
df.dtypes.value_counts()
object 23
float64 2
dtype: int64
Exercise
Encode the labels as categorical variables
Remember, your ultimate goal is to predict the probability that a certain label is attached to a budget line item. You just saw that many columns in your data are the inefficient object
type. Does this include the labels you’re trying to predict? Let’s find out!
There are 9 columns of labels in the dataset. Each of these columns is a category that has many possible values it can take. The 9 labels have been loaded into a list called LABELS
. In the Shell, check out the type for these labels using df[LABELS].dtypes
.
You will notice that every label is encoded as an object datatype. Because category
datatypes are much more efficient your task is to convert the labels to category types using the .astype()
method.
Note: .astype()
only works on a pandas Series. Since you are working with a pandas DataFrame, you’ll need to use the .apply()
method and provide a lambda
function called categorize_label
that applies .astype()
to each column, x
.
Instruction
- Define the lambda function
categorize_label
to convert columnx
intox.astype('category')
. - Use the
LABELS
list provided to convert the subset of datadf[LABELS]
to categorical types using the.apply()
method andcategorize_label
. Don’t forgetaxis=0
. - Print the converted
.dtypes
attribute ofdf[LABELS]
.
# Import pandas
import pandas as pd
# Load data
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2533/datasets/TrainingSetSample.csv', index_col=0)
# Load labels corresponding to object types
LABELS = ['Function',
'Use',
'Sharing',
'Reporting',
'Student_Type',
'Position_Type',
'Object_Type',
'Pre_K',
'Operating_Status']
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')
# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)
# Print the converted dtypes
print(df[LABELS].dtypes)
Function category
Use category
Sharing category
Reporting category
Student_Type category
Position_Type category
Object_Type category
Pre_K category
Operating_Status category
dtype: object
Exercise
Counting unique labels
As Peter mentioned in the video, there are over 100 unique labels. In this exercise, you will explore this fact by counting and plotting the number of unique values for each category of label.
The dataframe df
and the LABELS
list have been loaded into the workspace; the LABELS
columns of df
have been converted to category types.
pandas, which has been pre-imported as pd
, provides a pd.Series.nunique
method for counting the number of unique values in a Series.
Instruction
- Create the DataFrame
num_unique_labels
by using the.apply()
method ondf[LABELS]
withpd.Series.nunique
as the argument. - Create a bar plot of
num_unique_labels
using pandas’.plot(kind='bar')
method. - The axes have been labeled for you, so hit ‘Submit Answer’ to see the number of unique values for each label.
# Import pandas, numpy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_2533/datasets/TrainingSetSample.csv', index_col=0)
# Load labels corresponding to object types
LABELS = ['Function',
'Use',
'Sharing',
'Reporting',
'Student_Type',
'Position_Type',
'Object_Type',
'Pre_K',
'Operating_Status']
# Use the LABELS list to convert df[LABELS] to a categorical type.
df[LABELS] = df[LABELS].apply(lambda x: x.astype('category'), axis=0)# Import matplotlib.pyplot
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique, axis=0)
# Plot number of unique values for each label
num_unique_labels.plot(kind='bar')
# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')
# Display the plot
plt.show()
Exercise
Computing log loss with NumPy
To see how the log loss metric handles the trade-off between accuracy and confidence, we will use some sample data generated with NumPy and compute the log loss using the provided function compute_log_loss()
, which Peter showed you in the video.
5 one-dimensional numeric arrays simulating different types of predictions have been pre-loaded: actual_labels
, correct_confident
, correct_not_confident
, wrong_not_confident
, and wrong_confident
.
Your job is to compute the log loss for each sample set provided using the compute_log_loss(predicted_values, actual_values)
. It takes the predicted values as the first argument and the actual values as the second argument.
Instruction
- Using the
compute_log_loss()``actual_labels
correct_confident
.correct_not_confident
.wrong_not_confident
.wrong_confident
.actual_labels
.
# Import numpy
import numpy as np
# Define LogLoss function, account for clipping
def compute_log_loss(predicted, actual, eps=1e-14):
predicted = np.clip(predicted, eps, 1 - eps)
return -1 * np.mean(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))
# Create sample data
actual_labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=np.float64)
correct_confident = np.array([0.95, 0.95, 0.95, 0.95, 0.95, 0.05, 0.05, 0.05, 0.05, 0.05])
correct_not_confident = np.array([0.65, 0.65, 0.65, 0.65, 0.65, 0.35, 0.35, 0.35, 0.35, 0.35])
wrong_not_confident = np.array([0.35, 0.35, 0.35, 0.35, 0.35, 0.65, 0.65, 0.65, 0.65, 0.65])
wrong_confident = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.95, 0.95, 0.95, 0.95, 0.95])
# Compute and print log loss for 1st case
correct_confident_loss = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_loss))
# Compute log loss for 2nd case
correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_loss))
# Compute and print log loss for 3rd case
wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss))
# Compute and print log loss for 4th case
wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_loss))
# Compute and print log loss for actual labels
actual_labels_loss = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels_loss))
Log loss, correct and confident: 0.05129329438755058
Log loss, correct and not confident: 0.4307829160924542
Log loss, wrong and not confident: 1.049822124498678
Log loss, wrong and confident: 2.9957322735539904
Log loss, actual labels: 9.99200722162646e-15