Modifying DataFrames in Pandas

In the previous lesson, you learned what a DataFrame is and how to select subsets of data from one.

In this lesson, you'll learn how to modify an existing DataFrame.Some of the skills you'll learn include:

  • Adding columns to a DataFrame
  • Using lambda functions to calculate complex quantities
  • Renaming columns

Adding a Column I

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

Suppose we own a hardware store called The Handy Woman and have a DataFrame containing inventory information:

Product IDProduct DescriptionCost to ManufacturePrice
13 inch screw0.500.75
22 inch nail0.100.25
3hammer3.005.50
4screwdriver2.503.00

It looks like the actual quantity of each product in our warehouse is missing!

Let’s use the following code to add that information to our DataFrame.

df['Quantity'] = [100, 150, 50, 35]

Our new DataFrame looks like this:

Product IDProduct DescriptionCost to ManufacturePriceQuantity
13 inch screw0.500.75100
22 inch nail0.100.25150
3hammer3.005.5050
4screwdriver2.503.0035

1.The DataFrame df contains information on products sold at a hardware store. Add a column to df called 'Sold in Bulk?', which indicates if the product is sold in bulk or individually. The final table should look like this:

Product IDProduct DescriptionCost to ManufacturePriceSold in Bulk?
13 inch screw0.500.75Yes
22 inch nail0.100.25Yes
3hammer3.005.50No
4screwdriver2.503.00No
import codecademylib3
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add columns here
df['Sold in Bulk?'] = ['Yes','Yes','No','No']
print(df)

Adding a Column II

We can also add a new column that is the same for all rows in the DataFrame. Let’s return to our inventory example:

Product IDProduct DescriptionCost to ManufacturePrice
13 inch screw0.500.75
22 inch nail0.100.25
3hammer3.005.50
4screwdriver2.503.00

Suppose we know that all of our products are currently in-stock. We can add a column that says this:

df['In Stock?'] = True

Now all of the rows have a column called In Stock? with value True.

Product IDProduct DescriptionCost to ManufacturePriceIn Stock?
13 inch screw0.500.75True
22 inch nail0.100.25True
3hammer3.005.50True
4screwdriver2.503.00True

1.Add a column to df called Is taxed?, which indicates whether or not to collect sales tax on the product. It should be 'Yes' for all rows.

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add columns here
df['Is taxed?'] = ['Yes','Yes','Yes','Yes']
print(df)

Adding a Column III

Finally, you can add a new column by performing a function on the existing columns.

Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item. The following code multiplies each Price by 0.075, the sales tax for our state:

df['Sales Tax'] = df.Price * 0.075

Now our table has a column called Sales Tax:

Product IDProduct DescriptionCost to ManufacturePriceSales Tax
13 inch screw0.500.750.06
22 inch nail0.100.250.02
3hammer3.005.500.41
4screwdriver2.503.000.22

1.Add a column to df called 'Margin', which is equal to the difference between the Price and the Cost to Manufacture.

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add column here
df['Margin'] = df['Price'] - df['Cost to Manufacture']
print(df)

Performing Column Operations

In the previous exercise, we learned how to add columns to a DataFrame.

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

For example, imagine that we have the following table of customers.

NameEmail
JOHN SMITHjohn.smith@gmail.com
Jane Doejdoe@yahoo.com
joe schmojoeschmo@hotmail.com

It’s a little annoying that the capitalization is different for each row. Perhaps we’d like to make it more consistent by making all of the letters uppercase.

We can use the apply function to apply a function to every value in a particular column. For example, this code overwrites the existing 'Name' columns by applying the function upper to every row in 'Name'.

df['Name'] = df.Name.apply(str.upper)

The result:

NameEmail
JOHN SMITHjohn.smith@gmail.com
JANE DOEjdoe@yahoo.com
JOE SCHMOjoeschmo@hotmail.com

1.Apply the function lower to all names in column 'Name' in df. Assign these new names to a new column of df called 'Lowercase Name'. The final DataFrame should look like this:

NameEmailLowercase Name
JOHN SMITHjohn.smith@gmail.comjohn smith
Jane Doejdoe@yahoo.comjane doe
joe schmojoeschmo@hotmail.comjoe schmo
import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  ['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])

# Add columns here
df['Lowercase Name'] = df.Name.apply(str.lower)
print(df)

Reviewing Lambda Function

lambda function is a way of defining a function in a single line of code. Usually, we would assign them to a variable.

For example, the following lambda function multiplies a number by 2 and then adds 3:

mylambda = lambda x: (x * 2) + 3
print(mylambda(5))

The output:

> 13

Lambda functions work with all types of variables, not just integers! Here is an example that takes in a string, assigns it to the temporary variable x, and then converts it into lowercase:

stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))

The output:

> "oh hi mark!"

1.Create a lambda function mylambda that returns the first and last letters of a string, assuming the string is at least 2 characters long. For example,

print(mylambda('This is a string'))

should produce:

'Tg'
mylambda = lambda x: x[0] + x[-1]
print(mylambda('This is a string'))

Reviewing Lambda Function: If Statements

We can make our lambdas more complex by using a modified form of an if statement.

Suppose we want to pay workers time-and-a-half for overtime (any work above 40 hours per week). The following function will convert the number of hours into time-and-a-half hours using an if statement:

def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

Below is a lambda function that does the same thing:

myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x

In general, the syntax for an if function in a lambda function is:

lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]

1.You are managing the webpage of a somewhat violent video game and you want to check that each user’s age is 13 or greater when they visit the site.

Write a lambda function that takes an inputted age and either returns Welcome to BattleCity! if the user is 13 or older or You must be over 13 if they are younger than 13. Your lambda function should be called mylambda.

import codecademylib3
mylambda = lambda age: 'Welcome to BattleCity!' if age >= 13 else "You must be over 13"

Applying a Lambda to a Column

In Pandas, we often use lambda functions to perform complex operations on columns. For example, suppose that we want to create a column containing the email provider for each email address in the following table:

NameEmail
JOHN SMITHjohn.smith@gmail.com
Jane Doejdoe@yahoo.com
joe schmojoeschmo@hotmail.com

We could use the following code with a lambda function and the string method .split():

df['Email Provider'] = df.Email.apply(
    lambda x: x.split('@')[-1]
    )

The result would be:

NameEmailEmail Provider
JOHN SMITHjohn.smith@gmail.comgmail.com
Jane Doejdoe@yahoo.comyahoo.com
joe schmojoeschmo@hotmail.comhotmail.com

1.Create a lambda function get_last_name which takes a string with someone’s first and last name (i.e., John Smith), and returns just the last name (i.e., Smith).

2.The DataFrame df represents the hours worked by different employees over the course of the week. It contains the following columns:

  • 'name': The employee’s name
  • 'hourly_wage': The employee’s hourly wage
  • 'hours_worked': The number of hours worked this week

Use the lambda function get_last_name to create a new column last_name with only the employees’ last name.

import codecademylib3
import pandas as pd

df = pd.read_csv('employees.csv')

# Add columns here
get_last_name = lambda x: x.split()[-1]
df['last_name'] = df.name.apply(get_last_name)
print(df)

Applying a Lambda to a Row

We can also operate on multiple columns at once. If we use apply without specifying a single column and add the argument axis=1, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax row.column_name or row[‘column_name’].

Suppose we have a table representing a grocery list:

ItemPriceIs taxed?
Apple1.00No
Milk4.20No
Paper Towels5.00Yes
Light Bulbs3.75Yes

If we want to add in the price with tax for each line, we’ll need to look at two columns: Price and Is taxed?.

If Is taxed? is Yes, then we’ll want to multiply Price by 1.075 (for 7.5% sales tax).

If Is taxed? is No, we’ll just have Price without multiplying it.

We can create this column using a lambda function and the keyword axis=1:

df['Price with Tax'] = df.apply(lambda row:
     row['Price'] * 1.075
     if row['Is taxed?'] == 'Yes'
     else row['Price'],
     axis=1
)

1.If an employee worked for more than 40 hours, she needs to be paid overtime (1.5 times the normal hourly wage).

For instance, if an employee worked for 43 hours and made $10/hour, she would receive $400 for the first 40 hours that she worked, and an additional $45 for the 3 hours of overtime, for a total for $445.

Create a lambda function total_earned that accepts an input row with keys hours_worked and hourly_wage and uses an if statement to calculate the hourly wage.

import codecademylib3
import pandas as pd

df = pd.read_csv('employees.csv')
def total_earned(row):
  if row['hours_worked'] <= 40:
    return row['hours_worked'] * row['hourly_wage']
  else:
    return (40 * row['hourly_wage']) + (row['hours_worked'] - 40) * (row['hourly_wage'] * 1.50)

2.Use the lambda function total_earned and apply to add a column total_earned to df with the total amount earned by each employee.

import codecademylib3
import pandas as pd

df = pd.read_csv('employees.csv')
def total_earned(row):
  if row['hours_worked'] <= 40:
    return row['hours_worked'] * row['hourly_wage']
  else:
    return (40 * row['hourly_wage']) + (row['hours_worked'] - 40) * (row['hourly_wage'] * 1.50)

df['total_earned'] = df.apply(total_earned,axis=1)

Renaming Columns

When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use df.column_name (which tab-completes) rather than df['column_name'] (which takes up extra space).

You can change all of the column names at once by setting the .columns property to a different list. This is great when you need to change all of the column names at once, but be careful! You can easily mislabel columns if you get the ordering wrong. Here’s an example:

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']

1.The DataFrame df contains data about movies from IMDb.

We want to present this data to some film producers. Right now, our column names are in lower case, and are not very descriptive. Let’s modify df using the .columns attribute to make the following changes to the columns:

OldNew
idID
nameTitle
genreCategory
yearYear Released
imdb_ratingRating

1.The DataFrame df contains data about movies from IMDb.

We want to present this data to some film producers. Right now, our column names are in lower case, and are not very descriptive. Let’s modify df using the .columns attribute to make the following changes to the columns:

OldNew
idID
nameTitle
genreCategory
yearYear Released
imdb_ratingRating

 

import codecademylib3
import pandas as pd

df = pd.read_csv('imdb.csv')

# Rename columns here
df.columns = ['ID','Title','Category','Year Released','Rating']
print(df)

Renaming Columns II

You also can rename individual columns by using the .rename method. Pass a dictionary like the one below to the columns keyword argument:

{'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}

Here’s an example:

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.rename(columns={
    'name': 'First Name',
    'age': 'Age'},
    inplace=True)

The code above will rename name to First Name and age to Age.

Using rename with only the columns keyword will create a new DataFrame, leaving your original DataFrame unchanged. That’s why we also passed in the keyword argument inplace=True. Using inplace=True lets us edit the original DataFrame.

There are several reasons why .rename is preferable to .columns:

  • You can rename just one column
  • You can be specific about which column names are getting changed (with .column you can accidentally switch column names if you’re not careful)

Note: If you misspell one of the original column names, this command won’t fail. It just won’t change anything.

1.If we didn’t know that df was a table of movie ratings, the column name might be confusing.

To clarify, let’s rename name to movie_title.

Use the keyword inplace=True so that you modify df rather than creating a new DataFrame!

import codecademylib3
import pandas as pd

df = pd.read_csv('imdb.csv')

# Rename columns here
df.rename(columns={
  'name':'movie_title'
},inplace=True)
print(df)

1.Once more, you’ll be the data analyst for ShoeFly.com, a fictional online shoe store.

More messy order data has been loaded into the variable orders. Examine the first 5 rows of the data using print and .head().

2.Many of our customers want to buy vegan shoes (shoes made from materials that do not come from animals). Add a new column called shoe_source, which is vegan if the materials is not leather and animal otherwise.

3.Our marketing department wants to send out an email to each customer. Using the columns last_name and gender create a column called salutation which contains Dear Mr. <last_name> for men and Dear Ms. <last_name> for women.

Here are some examples:

last_namegendersalutation
SmithMaleDear Mr. Smith
JonesFemaleDear Ms. Jones

import codecademylib3
import pandas as pd

orders = pd.read_csv('shoefly.csv')

print(orders.head(5))

orders['shoe_source'] = orders.shoe_material.apply(lambda x: \
                        	'animal' if x == 'leather'else 'vegan')

orders['salutation'] = orders.apply(lambda row: \
                                    'Dear Mr. ' + row['last_name']
                                    if row['gender'] == 'male'
                                    else 'Dear Ms. ' + row['last_name'],
                                    axis=1)

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值