In the previous lesson, you learned what a DataFrame is and how to select subsets of data from one.
In this lesson, you'll learn how to modify an existing DataFrame.Some of the skills you'll learn include:
- Adding columns to a DataFrame
- Using lambda functions to calculate complex quantities
- Renaming columns
Adding a Column I
Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.
One way that we can add a new column is by giving a list of the same length as the existing DataFrame.
Suppose we own a hardware store called The Handy Woman and have a DataFrame containing inventory information:
| Product ID | Product Description | Cost to Manufacture | Price |
|---|---|---|---|
| 1 | 3 inch screw | 0.50 | 0.75 |
| 2 | 2 inch nail | 0.10 | 0.25 |
| 3 | hammer | 3.00 | 5.50 |
| 4 | screwdriver | 2.50 | 3.00 |
It looks like the actual quantity of each product in our warehouse is missing!
Let’s use the following code to add that information to our DataFrame.
df['Quantity'] = [100, 150, 50, 35]
Our new DataFrame looks like this:
| Product ID | Product Description | Cost to Manufacture | Price | Quantity |
|---|---|---|---|---|
| 1 | 3 inch screw | 0.50 | 0.75 | 100 |
| 2 | 2 inch nail | 0.10 | 0.25 | 150 |
| 3 | hammer | 3.00 | 5.50 | 50 |
| 4 | screwdriver | 2.50 | 3.00 | 35 |
1.The DataFrame df contains information on products sold at a hardware store. Add a column to df called 'Sold in Bulk?', which indicates if the product is sold in bulk or individually. The final table should look like this:
| Product ID | Product Description | Cost to Manufacture | Price | Sold in Bulk? |
|---|---|---|---|---|
| 1 | 3 inch screw | 0.50 | 0.75 | Yes |
| 2 | 2 inch nail | 0.10 | 0.25 | Yes |
| 3 | hammer | 3.00 | 5.50 | No |
| 4 | screwdriver | 2.50 | 3.00 | No |
import codecademylib3
import pandas as pd
df = pd.DataFrame([
[1, '3 inch screw', 0.5, 0.75],
[2, '2 inch nail', 0.10, 0.25],
[3, 'hammer', 3.00, 5.50],
[4, 'screwdriver', 2.50, 3.00]
],
columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)
# Add columns here
df['Sold in Bulk?'] = ['Yes','Yes','No','No']
print(df)
Adding a Column II
We can also add a new column that is the same for all rows in the DataFrame. Let’s return to our inventory example:
| Product ID | Product Description | Cost to Manufacture | Price |
|---|---|---|---|
| 1 | 3 inch screw | 0.50 | 0.75 |
| 2 | 2 inch nail | 0.10 | 0.25 |
| 3 | hammer | 3.00 | 5.50 |
| 4 | screwdriver | 2.50 | 3.00 |
Suppose we know that all of our products are currently in-stock. We can add a column that says this:
df['In Stock?'] = True
Now all of the rows have a column called In Stock? with value True.
| Product ID | Product Description | Cost to Manufacture | Price | In Stock? |
|---|---|---|---|---|
| 1 | 3 inch screw | 0.50 | 0.75 | True |
| 2 | 2 inch nail | 0.10 | 0.25 | True |
| 3 | hammer | 3.00 | 5.50 | True |
| 4 | screwdriver | 2.50 | 3.00 | True |
1.Add a column to df called Is taxed?, which indicates whether or not to collect sales tax on the product. It should be 'Yes' for all rows.
import codecademylib3
import pandas as pd
df = pd.DataFrame([
[1, '3 inch screw', 0.5, 0.75],
[2, '2 inch nail', 0.10, 0.25],
[3, 'hammer', 3.00, 5.50],
[4, 'screwdriver', 2.50, 3.00]
],
columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)
# Add columns here
df['Is taxed?'] = ['Yes','Yes','Yes','Yes']
print(df)
Adding a Column III
Finally, you can add a new column by performing a function on the existing columns.
Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item. The following code multiplies each Price by 0.075, the sales tax for our state:
df['Sales Tax'] = df.Price * 0.075
Now our table has a column called Sales Tax:
| Product ID | Product Description | Cost to Manufacture | Price | Sales Tax |
|---|---|---|---|---|
| 1 | 3 inch screw | 0.50 | 0.75 | 0.06 |
| 2 | 2 inch nail | 0.10 | 0.25 | 0.02 |
| 3 | hammer | 3.00 | 5.50 | 0.41 |
| 4 | screwdriver | 2.50 | 3.00 | 0.22 |
1.Add a column to df called 'Margin', which is equal to the difference between the Price and the Cost to Manufacture.
import codecademylib3
import pandas as pd
df = pd.DataFrame([
[1, '3 inch screw', 0.5, 0.75],
[2, '2 inch nail', 0.10, 0.25],
[3, 'hammer', 3.00, 5.50],
[4, 'screwdriver', 2.50, 3.00]
],
columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)
# Add column here
df['Margin'] = df['Price'] - df['Cost to Manufacture']
print(df)
Performing Column Operations
In the previous exercise, we learned how to add columns to a DataFrame.
Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.
For example, imagine that we have the following table of customers.
| Name | |
|---|---|
| JOHN SMITH | john.smith@gmail.com |
| Jane Doe | jdoe@yahoo.com |
| joe schmo | joeschmo@hotmail.com |
It’s a little annoying that the capitalization is different for each row. Perhaps we’d like to make it more consistent by making all of the letters uppercase.
We can use the apply function to apply a function to every value in a particular column. For example, this code overwrites the existing 'Name' columns by applying the function upper to every row in 'Name'.
df['Name'] = df.Name.apply(str.upper)
The result:
| Name | |
|---|---|
| JOHN SMITH | john.smith@gmail.com |
| JANE DOE | jdoe@yahoo.com |
| JOE SCHMO | joeschmo@hotmail.com |
1.Apply the function lower to all names in column 'Name' in df. Assign these new names to a new column of df called 'Lowercase Name'. The final DataFrame should look like this:
| Name | Lowercase Name | |
|---|---|---|
| JOHN SMITH | john.smith@gmail.com | john smith |
| Jane Doe | jdoe@yahoo.com | jane doe |
| joe schmo | joeschmo@hotmail.com | joe schmo |
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['JOHN SMITH', 'john.smith@gmail.com'],
['Jane Doe', 'jdoe@yahoo.com'],
['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])
# Add columns here
df['Lowercase Name'] = df.Name.apply(str.lower)
print(df)
Reviewing Lambda Function
A lambda function is a way of defining a function in a single line of code. Usually, we would assign them to a variable.
For example, the following lambda function multiplies a number by 2 and then adds 3:
mylambda = lambda x: (x * 2) + 3
print(mylambda(5))
The output:
> 13
Lambda functions work with all types of variables, not just integers! Here is an example that takes in a string, assigns it to the temporary variable x, and then converts it into lowercase:
stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))
The output:
> "oh hi mark!"
1.Create a lambda function mylambda that returns the first and last letters of a string, assuming the string is at least 2 characters long. For example,
print(mylambda('This is a string'))
should produce:
'Tg'
mylambda = lambda x: x[0] + x[-1]
print(mylambda('This is a string'))
Reviewing Lambda Function: If Statements
We can make our lambdas more complex by using a modified form of an if statement.
Suppose we want to pay workers time-and-a-half for overtime (any work above 40 hours per week). The following function will convert the number of hours into time-and-a-half hours using an if statement:
def myfunction(x):
if x > 40:
return 40 + (x - 40) * 1.50
else:
return x
Below is a lambda function that does the same thing:
myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x
In general, the syntax for an if function in a lambda function is:
lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
1.You are managing the webpage of a somewhat violent video game and you want to check that each user’s age is 13 or greater when they visit the site.
Write a lambda function that takes an inputted age and either returns Welcome to BattleCity! if the user is 13 or older or You must be over 13 if they are younger than 13. Your lambda function should be called mylambda.
import codecademylib3
mylambda = lambda age: 'Welcome to BattleCity!' if age >= 13 else "You must be over 13"
Applying a Lambda to a Column
In Pandas, we often use lambda functions to perform complex operations on columns. For example, suppose that we want to create a column containing the email provider for each email address in the following table:
| Name | |
|---|---|
| JOHN SMITH | john.smith@gmail.com |
| Jane Doe | jdoe@yahoo.com |
| joe schmo | joeschmo@hotmail.com |
We could use the following code with a lambda function and the string method .split():
df['Email Provider'] = df.Email.apply(
lambda x: x.split('@')[-1]
)
The result would be:
| Name | Email Provider | |
|---|---|---|
| JOHN SMITH | john.smith@gmail.com | gmail.com |
| Jane Doe | jdoe@yahoo.com | yahoo.com |
| joe schmo | joeschmo@hotmail.com | hotmail.com |
1.Create a lambda function get_last_name which takes a string with someone’s first and last name (i.e., John Smith), and returns just the last name (i.e., Smith).
2.The DataFrame df represents the hours worked by different employees over the course of the week. It contains the following columns:
'name': The employee’s name'hourly_wage': The employee’s hourly wage'hours_worked': The number of hours worked this week
Use the lambda function get_last_name to create a new column last_name with only the employees’ last name.
import codecademylib3
import pandas as pd
df = pd.read_csv('employees.csv')
# Add columns here
get_last_name = lambda x: x.split()[-1]
df['last_name'] = df.name.apply(get_last_name)
print(df)
Applying a Lambda to a Row
We can also operate on multiple columns at once. If we use apply without specifying a single column and add the argument axis=1, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax row.column_name or row[‘column_name’].
Suppose we have a table representing a grocery list:
| Item | Price | Is taxed? |
|---|---|---|
| Apple | 1.00 | No |
| Milk | 4.20 | No |
| Paper Towels | 5.00 | Yes |
| Light Bulbs | 3.75 | Yes |
If we want to add in the price with tax for each line, we’ll need to look at two columns: Price and Is taxed?.
If Is taxed? is Yes, then we’ll want to multiply Price by 1.075 (for 7.5% sales tax).
If Is taxed? is No, we’ll just have Price without multiplying it.
We can create this column using a lambda function and the keyword axis=1:
df['Price with Tax'] = df.apply(lambda row:
row['Price'] * 1.075
if row['Is taxed?'] == 'Yes'
else row['Price'],
axis=1
)
1.If an employee worked for more than 40 hours, she needs to be paid overtime (1.5 times the normal hourly wage).
For instance, if an employee worked for 43 hours and made $10/hour, she would receive $400 for the first 40 hours that she worked, and an additional $45 for the 3 hours of overtime, for a total for $445.
Create a lambda function total_earned that accepts an input row with keys hours_worked and hourly_wage and uses an if statement to calculate the hourly wage.
import codecademylib3
import pandas as pd
df = pd.read_csv('employees.csv')
def total_earned(row):
if row['hours_worked'] <= 40:
return row['hours_worked'] * row['hourly_wage']
else:
return (40 * row['hourly_wage']) + (row['hours_worked'] - 40) * (row['hourly_wage'] * 1.50)
2.Use the lambda function total_earned and apply to add a column total_earned to df with the total amount earned by each employee.
import codecademylib3
import pandas as pd
df = pd.read_csv('employees.csv')
def total_earned(row):
if row['hours_worked'] <= 40:
return row['hours_worked'] * row['hourly_wage']
else:
return (40 * row['hourly_wage']) + (row['hours_worked'] - 40) * (row['hourly_wage'] * 1.50)
df['total_earned'] = df.apply(total_earned,axis=1)
Renaming Columns
When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use df.column_name (which tab-completes) rather than df['column_name'] (which takes up extra space).
You can change all of the column names at once by setting the .columns property to a different list. This is great when you need to change all of the column names at once, but be careful! You can easily mislabel columns if you get the ordering wrong. Here’s an example:
df = pd.DataFrame({
'name': ['John', 'Jane', 'Sue', 'Fred'],
'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']
1.The DataFrame df contains data about movies from IMDb.
We want to present this data to some film producers. Right now, our column names are in lower case, and are not very descriptive. Let’s modify df using the .columns attribute to make the following changes to the columns:
| Old | New |
|---|---|
| id | ID |
| name | Title |
| genre | Category |
| year | Year Released |
| imdb_rating | Rating |
1.The DataFrame df contains data about movies from IMDb.
We want to present this data to some film producers. Right now, our column names are in lower case, and are not very descriptive. Let’s modify df using the .columns attribute to make the following changes to the columns:
| Old | New |
|---|---|
| id | ID |
| name | Title |
| genre | Category |
| year | Year Released |
| imdb_rating | Rating |
import codecademylib3
import pandas as pd
df = pd.read_csv('imdb.csv')
# Rename columns here
df.columns = ['ID','Title','Category','Year Released','Rating']
print(df)
Renaming Columns II
You also can rename individual columns by using the .rename method. Pass a dictionary like the one below to the columns keyword argument:
{'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}
Here’s an example:
df = pd.DataFrame({
'name': ['John', 'Jane', 'Sue', 'Fred'],
'age': [23, 29, 21, 18]
})
df.rename(columns={
'name': 'First Name',
'age': 'Age'},
inplace=True)
The code above will rename name to First Name and age to Age.
Using rename with only the columns keyword will create a new DataFrame, leaving your original DataFrame unchanged. That’s why we also passed in the keyword argument inplace=True. Using inplace=True lets us edit the original DataFrame.
There are several reasons why .rename is preferable to .columns:
- You can rename just one column
- You can be specific about which column names are getting changed (with
.columnyou can accidentally switch column names if you’re not careful)
Note: If you misspell one of the original column names, this command won’t fail. It just won’t change anything.
1.If we didn’t know that df was a table of movie ratings, the column name might be confusing.
To clarify, let’s rename name to movie_title.
Use the keyword inplace=True so that you modify df rather than creating a new DataFrame!
import codecademylib3
import pandas as pd
df = pd.read_csv('imdb.csv')
# Rename columns here
df.rename(columns={
'name':'movie_title'
},inplace=True)
print(df)
1.Once more, you’ll be the data analyst for ShoeFly.com, a fictional online shoe store.
More messy order data has been loaded into the variable orders. Examine the first 5 rows of the data using print and .head().
2.Many of our customers want to buy vegan shoes (shoes made from materials that do not come from animals). Add a new column called shoe_source, which is vegan if the materials is not leather and animal otherwise.
3.Our marketing department wants to send out an email to each customer. Using the columns last_name and gender create a column called salutation which contains Dear Mr. <last_name> for men and Dear Ms. <last_name> for women.
Here are some examples:
| last_name | gender | salutation |
|---|---|---|
| Smith | Male | Dear Mr. Smith |
| Jones | Female | Dear Ms. Jones |
import codecademylib3
import pandas as pd
orders = pd.read_csv('shoefly.csv')
print(orders.head(5))
orders['shoe_source'] = orders.shoe_material.apply(lambda x: \
'animal' if x == 'leather'else 'vegan')
orders['salutation'] = orders.apply(lambda row: \
'Dear Mr. ' + row['last_name']
if row['gender'] == 'male'
else 'Dear Ms. ' + row['last_name'],
axis=1)
128

被折叠的 条评论
为什么被折叠?



