Introduction to Pandas and NumPy
Pandas
Pandas is a very popular library for working with data (its goal is to be the most powerful and flexible open-source tool, and in our opinion,it has reached that goal).DataFrames are at the center of pandas.A DataFrame is structured like a table or spreadsheet. The rows and the columns both have indexes, and you can perform operations on rows or columns separately.
A pandas DataFrame can be easily changed and manipulated.Pandas has helpful functions for handling missing data,performing operations on columns and rows, and transforming data.if that wasn't enough, a lot of SQL functions have counterparts in pandas, such as join,merge,filter by,and group by.with all of these powerful tools, it should come as no surprise that pandas is very popular among data scientists.
NumPy
NumPy is an open-source Python library that facilitates efficient numerical operations on large quantities of data. There are a few functions that exist in NumPy that we use on pandas DataFrames. For us, the most important part about NumPy is that pandas is built on top of it. So, NumPy is a dependency of Pandas.
Installation
If you have Anaconda installed,NumPy and Pandas may have been auto-installed as well! if they haven't been, or if you want to update to the lastet versions, you can open a terminal window and run the following commands:
conda install numpy
conda install pandas
If you don’t have Anaconda installed, you can alternatively install the libraries using pip
by running the following commands from your terminal:
pip install numpy
pip install pandas
Once you’ve installed these libraries, you’re ready to open any Python coding environment (we recommend Jupyter Notebook). Before you can use these libraries, you’ll need to import them using the following lines of code. We’ll use the abbreviations np
and pd
, respectively, to simplify our function calls in the future.
import numpy as np
import pandas as pd
NumPy Arrays
NumPy arrays are unique in that they are more flexible than normal Python lists. They are called ndarrays since they can have any number (n) of dimensions (d). They hold a collection of items of any one data type and can be either a vector (one-dimensional) or a matrix (multi-dimensional). NumPy arrays allow for fast element access and efficient data manipulation.
The code below initializes a Python list named list1
:
list1 = [1,2,3,4]
To convert this to a one-dimensional ndarray with one row and four columns, we can use the np.array()
function:
array1 = np.array(list1)
print(array1)
[1 2 3 4]
To get a two-dimensional ndarray from a list, we must start with a Python list of lists:
list2 = [[1,2,3],[4,5,6]]
array2 = np.array(list2)
print(array2)
[[1 2 3]
[4 5 6]]
In the above output, you may notice that the NumPy array print-out is displayed in a way that clearly demonstrates its multi-dimensional structure: two rows and three columns.
Many operations can be performed on NumPy arrays which makes them very helpful for manipulating data:
-
Selecting array elements
-
Slicing arrays
-
Reshaping arrays
-
Splitting arrays
-
Combining arrays
-
Numerical operations (min, max, mean, etc)
Mathematical operations can be performed on all values in a ndarray at one time rather than having to loop through values, as is necessary with a Python list. This is very helpful in many scenarios. Say you own a toy store and decide to decrease the price of all toys by €2 for a weekend sale. With the toy prices stored in an ndarray, you can easily facilitate this operation.
toyPrices = np.array([5,8,3,6])
print(toyPrices - 2)
[3 6 1 4]
If, however, you had stored your toy prices in a Python list, you would have to manually loop through the entire list to decrease each toy price.
toyPrices = [5,8,3,6]
# print(toyPrices - 2) -- Not possible. Causes an error
for i in range(len(toyPrices)):
toyPrices[i] -= 2
print(toyPrices)
[3,6,1,4]
Pandas Series and Dataframes
Just as the ndarray is the foundation of the NumPy library, the Series is the core object of the pandas library. A pandas Series is very similar to a one-dimensional NumPy array, but it has additional functionality that allows values in the Series to be indexed using labels. A NumPy array does not have the flexibility to do this. This labeling is useful when you are storing pieces of data that have other data associated with them. Say you want to store the ages of students in an online course to eventually figure out the average student age. If stored in a NumPy array, you could only access these ages with the internal ndarray indices 0,1,2...
. With a Series object, the indices of values are set to 0,1,2...
by default, but you can customize the indices to be other values such as student names so an age can be accessed using a name. Customized indices of a Series are established by sending values into the Series constructor, as you will see below.
A Series holds items of any one data type and can be created by sending in a scalar value, Python list, dictionary, or ndarray as a parameter to the pandas Series constructor. If a dictionary is sent in, the keys may be used as the indices.
# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13,25,19])
series1 = pd.Series(ages)
print(series1)
0 | 13
1 | 25
2 | 19
dtype: int64
When printing a Series, the data type of its elements is also printed. To customize the indices of a Series object, use the index
argument of the Series
constructor.
# Create a Series using a NumPy array of ages but customize the indices to be the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)
Emma | 13
Swetha | 25
Serajh | 19
dtype: int64
Series objects provide more information than NumPy arrays do. Printing a NumPy array of ages does not print the indices or allow us to customize them.
ages = np.array([13,25,19])
print(ages)
[13 25 19]
Another important type of object in the pandas library is the DataFrame. This object is similar in form to a matrix as it consists of rows and columns. Both rows and columns can be indexed with integers or String names. One DataFrame can contain many different types of data types, but within a column, everything has to be the same data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).
There are different ways to fill a DataFrame such as with a CSV file, a SQL query, a Python list, or a dictionary. Here we have created a DataFrame using a Python list of lists. Each nested list represents the data in one row of the DataFrame. We use the keyword columns
to pass in the list of our custom column names.
dataf = pd.DataFrame([
['John Smith','123 Main St',34],
['Jane Doe', '456 Maple Ave',28],
['Joe Schmo', '789 Broadway',51]
],
columns=['name','address','age'])
This is how the DataFrame is displayed:
name | address | age
0 | John Smith | 123 Main St | 34
1 | Jane Doe | 456 Maple Ave | 28
2 | Joe Schmo | 789 Broadway | 51
The default row indices are 0,1,2...
, but these can be changed. For example, they can be set to be the elements in one of the columns of the DataFrame. To use the names
column as indices instead of the default numerical values, we can run the following command on our DataFrame:
dataf.set_index('name')
name | address | age
John Smith | 123 Main St | 34
Jane Doe | 456 Maple Ave | 28
Joe Schmo | 789 Broadway | 51
DataFrames are useful because they make it much easier to select, manipulate, and summarize data. Their tabular format (a table with rows and columns) also makes it easier to label, simpler to read, and easier to export data to and from a spreadsheet. Understanding the power of these new data structures is the key to unlocking many new avenues for data manipulation, exploration, and analysis!
10 minutes to pandas
This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.
Customarily, we import as follows:
Importing the Pandas Module
Pandas is a Python module for working with tabular data(data in a table with rows and columns).Tabular data has a lot of the same functionality as SQL or Excel, but Pandas adds the power of python.
The pandas
module is usually imported at the top of a Python file under the alias pd
.
import pandas as pd
If we need to access the pandas
module, we can do so by operating on pd
.
In this lesson, you’ll learn the basics of working with a single table in Pandas, such as:
- Create a table from scratch
- Loading data from another file
- Selecting certain rows or columns of a table
Note: In order for Codecademy to properly display data from Pandas, we need to import another special library:
import codecademylib3
Create a DataFrame I
A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.
DataFrames have rows and columns. Each column has a name, which is a string. Each row has an index, which is an integer. DataFrames can contain many different data types: strings, ints, floats, tuples, etc.
You can pass in a dictionary to pd.DataFrame()
. Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. Here’s an example:
df1 = pd.DataFrame({
'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
'age': [34, 28, 51]
})
This command creates a DataFrame called df1
that looks like this:
address | age | name |
---|---|---|
123 Main St. | 34 | John Smith |
456 Maple Ave. | 28 | Jane Doe |
789 Broadway | 51 | Joe Schmo |
1.You run an online clothing store called Panda’s Wardrobe. You need a DataFrame containing information about your products.
Create a DataFrame with the following data that your inventory manager sent you:
Product ID | Product Name | Color |
---|---|---|
1 | t-shirt | blue |
2 | t-shirt | green |
3 | skirt | red |
4 | skirt | black |
We have already filled in the information for Product ID
in df1
.
Add the code to create the columns Product Name
and Color
and their associated data.
import codecademylib3
import pandas as pd
df1 = pd.DataFrame({
'Product ID': [1, 2, 3, 4],
# add Product Name and Color here
'Product Name': ['t-shirt','t-shirt','skirt','skirt'],
'Color': ['blue','green','red','black']
})
print(df1)
Create a DataFrame II
You can also add data using lists.
For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword argument columns
to pass a list of column names.
df2 = pd.DataFrame([
['John Smith', '123 Main St.', 34],
['Jane Doe', '456 Maple Ave.', 28],
['Joe Schmo', '789 Broadway', 51]
],
columns=['name', 'address', 'age'])
This command produces a DataFrame df2
that looks like this:
name | address | age |
---|---|---|
John Smith | 123 Main St. | 34 |
Jane Doe | 456 Maple Ave. | 28 |
Joe Schmo | 789 Broadway | 51 |
In this example, we were able to control the ordering of the columns because we used lists.
1.
You’re running a chain of pita shops called Pita Power. You want to create a DataFrame with information on your different store locations.
Use a list of lists to create a DataFrame with the following data:
Store ID | Location | Number of Employees |
---|---|---|
1 | San Diego | 100 |
2 | Los Angeles | 120 |
3 | San Francisco | 90 |
4 | Sacramento | 115 |
We have filled in the information for the first two rows in df2
.
Add the code to create the 3rd and 4th rows, and the column names.
import codecademylib3
import pandas as pd
df2 = pd.DataFrame([
[1, 'San Diego', 100],
[2, 'Los Angeles', 120],
# Fill in rows 3 and 4
[3, 'San Francisco', 90],
[4, 'Sacramento', 115]
],
columns=[
#add column names here
'Store ID','Location','Number of Employees'
])
print(df2)
Comma Separated Variables (CSV)
We now know how to create our own DataFrame. However, most of the time, we’ll be working with datasets that already exist. One of the most common formats for big datasets is the CSV.
CSV (comma separated values) is a text-only spreadsheet format. You can find CSVs in lots of places:
- Online datasets (here’s an example from data.gov)
- Export from Excel or Google Sheets
- Export from SQL
column1,column2,column3
value1,value2,value3
That example CSV represents the following table:
column1 | column2 | column3 |
---|---|---|
value1 | value2 | value3 |
1.You run a cupcake store and want to create a record of all of the cupcakes that you offer.
Write the following data as a CSV in cupcakes.csv
.
name | cake_flavor | frosting_flavor | topping |
---|---|---|---|
Chocolate Cake | chocolate | chocolate | chocolate shavings |
Birthday Cake | vanilla | vanilla | rainbow sprinkles |
Carrot Cake | carrot | cream cheese | almonds |
cupcakes.csv
name,cake_flavor,frosting_flavor,topping
Chocolate Cake,chocolate,chocolate,chocolate shavings
Birthday Cake,vanilla,vanilla,rainbow sprinkles
Carrot Cake,carrot,cream,cheese almonds
Loading and Saving CSVs
When you have data in a CSV, you can load it into a DataFrame in Pandas using .read_csv()
:
pd.read_csv('my-csv-file.csv')
In the example above, the .read_csv()
method is called. The CSV file called my-csv-file
is passed in as an argument.
We can also save data to a CSV, using .to_csv()
.
df.to_csv('new-csv-file.csv')
In the example above, the .to_csv()
method is called on df
(which represents a DataFrame object). The name of the CSV file is passed in as an argument (new-csv-file.csv
). By default, this method will save the CSV file in your current directory.
1.You’re working for the County of Whoville and you just received a CSV of data about the different cities in your county. Read the CSV 'sample.csv'
into a variable called df
, so that you can learn more about the cities.
2.Let’s inspect the CSV.
Type print(df)
on the next line and then run your code. What sort of data were you sent?
import codecademylib3
import pandas as pd
df = pd.read_csv('sample.csv')
print(df)
Inspect a DataFrame
When we load a new DataFrame from a CSV, we want to know what it looks like.
If it’s a small DataFrame, you can display it by typing print(df)
.
If it’s a larger DataFrame, it’s helpful to be able to inspect a few items without having to look at the entire DataFrame.
The method .head()
gives the first 5 rows of a DataFrame. If you want to see more rows, you can pass in the positional argument n
. For example, df.head(10)
would show the first 10 rows.
The method df.info()
gives some statistics for each column.
1.You’re working for a Hollywood studio, trying to use data to predict the next big hit. Load the CSV imdb.csv
into a variable called df
, so that you can learn about popular movies from the past 90 years.
2.Let’s learn about these movies.
Paste the following code into script.py
:
print(df.head())
3.What exactly is in this dataset?
Paste the following code into script.py
to learn more about this data:
print(df.info())
Select Columns
Now we know how to create and load data. Let’s select parts of those datasets that are interesting or important to our analyses.
Suppose you have the DataFrame called customers
, which contains the ages of your customers:
name | age |
---|---|
Rebecca Erikson | 35 |
Thomas Roberson | 28 |
Diane Ochoa | 42 |
… | … |
Perhaps you want to take the average or plot a histogram of the ages. In order to do either of these tasks, you’d need to select the column.
There are two possible syntaxes for selecting all values from a column:
- Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type
customers['age']
to select the ages. - If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation:
df.MySecondColumn
. In our example, we would typecustomers.age
.
When we select a single column, the result is called a Series.
1.The DataFrame df
represents data collected by four health clinics run by the same organization. Each row represents a month from January through June and shows the number of appointments made at four different clinics.
You want to analyze what’s been happening at the North location. Create a variable called clinic_north
that contains ONLY the data from the column clinic_north
.
2.What exactly have you selected?
After you create the variable, enter the command:
print(type(clinic_north))
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west']
)
clinic_north = df['clinic_north']
print(type(clinic_north))
Selecting Multiple Columns
When you have a larger DataFrame, you might want to select just a few columns.
For instance, let’s return to a DataFrame of orders
from ShoeFly.com:
id | first_name | last_name | shoe_type | shoe_material | shoe_color | |
---|---|---|---|---|---|---|
54791 | Rebecca | Lindsay | RebeccaLindsay57@hotmail.com | clogs | faux-leather | black |
53450 | Emily | Joyce | EmilyJoyce25@gmail.com | ballet flats | faux-leather | navy |
91987 | Joyce | Waller | Joyce.Waller@gmail.com | sandals | fabric | black |
14437 | Justin | Erickson | Justin.Erickson@outlook.com | clogs | faux-leather | red |
We might just be interested in the customer’s last_name
and email
. We want a DataFrame like this:
last_name | |
---|---|
Lindsay | RebeccaLindsay57@hotmail.com |
Joyce | EmilyJoyce25@gmail.com |
Waller | Joyce.Waller@gmail.com |
Erickson | Justin.Erickson@outlook.com |
To select two or more columns from a DataFrame, we use a list of the column names. To create the DataFrame shown above, we would use:
new_df = orders[['last_name', 'email']]
Note: Make sure that you have a double set of brackets ([[]]
), or this command won’t work!
1.Now, you want to compare visits to the Northern and Southern clinics.
Create a variable called clinic_north_south
that contains ONLY the data from the columns clinic_north
and clinic_south
.
2.When we select multiple columns, do we get a Series or a DataFrame?
After you’ve created the variable, enter the command:
print(type(clinic_north_south))
to see what data type you’ve created.
How is this different from what happened in the previous exercise?
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west']
)
clinic_north_south = df[['clinic_north','clinic_south']]
print(type(clinic_north_south))
Select Rows
Let’s revisit our orders
from ShoeFly.com:
id | first_name | last_name | shoe_type | shoe_material | shoe_color | |
---|---|---|---|---|---|---|
54791 | Rebecca | Lindsay | RebeccaLindsay57@hotmail.com | clogs | faux-leather | black |
53450 | Emily | James | EmilyJames25@gmail.com | ballet flats | faux-leather | navy |
91987 | Joyce | Waller | Joyce.Waller@gmail.com | sandals | fabric | black |
14437 | Justin | Erickson | Justin.Erickson@outlook.com | clogs | faux-leather | red |
… |
Maybe our Customer Service department has just received a message from Joyce Waller, so we want to know exactly what she ordered. We want to select this single row of data.
DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. Joyce Waller’s order is the 2nd row.
We select it using the following command:
orders.iloc[2]
When we select a single row, the result is a Series (just like when we select a single column).
1.You’re getting ready to staff the clinic for March this year. You want to know how many visits took place in March last year, to help you prepare.
Write a command that will produce a Series made up of the March data from df
from all four clinic sites and save it to the variable march
.
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west'])
march=df.iloc[2]
Selecting Multiple Rows
You can also select multiple rows from a DataFrame.
Here are a few more rows from ShoeFly.com’s orders
DataFrame:
id | first_name | last_name | shoe_type | shoe_material | shoe_color | |
---|---|---|---|---|---|---|
54791 | Rebecca | Lindsay | RebeccaLindsay57@hotmail.com | clogs | faux-leather | black |
53450 | Emily | Joyce | EmilyJoyce25@gmail.com | ballet flats | faux-leather | navy |
91987 | Joyce | Waller | Joyce.Waller@gmail.com | sandals | fabric | black |
14437 | Justin | Erickson | Justin.Erickson@outlook.com | clogs | faux-leather | red |
79357 | Andrew | Banks | AB4318@gmail.com | boots | leather | brown |
52386 | Julie | Marsh | JulieMarsh59@gmail.com | sandals | fabric | black |
20487 | Thomas | Jensen | TJ5470@gmail.com | clogs | fabric | navy |
76971 | Janice | Hicks | Janice.Hicks@gmail.com | clogs | faux-leather | navy |
21586 | Gabriel | Porter | GabrielPorter24@gmail.com | clogs | leather | brown |
Here are some different ways of selecting multiple rows:
orders.iloc[3:7]
would select all rows starting at the 3rd row and up to but not including the 7th row (i.e., the 3rd row, 4th row, 5th row, and 6th row)
id | first_name | last_name | shoe_type | shoe_material | shoe_color | |
---|---|---|---|---|---|---|
14437 | Justin | Erickson | Justin.Erickson@outlook.com | clogs | faux-leather | red |
79357 | Andrew | Banks | AB4318@gmail.com | boots | leather | brown |
52386 | Julie | Marsh | JulieMarsh59@gmail.com | sandals | fabric | black |
20487 | Thomas | Jensen | TJ5470@gmail.com | clogs | fabric | navy |
orders.iloc[:4]
would select all rows up to, but not including the 4th row (i.e., the 0th, 1st, 2nd, and 3rd rows)
id | first_name | last_name | shoe_type | shoe_material | shoe_color | |
---|---|---|---|---|---|---|
54791 | Rebecca | Lindsay | RebeccaLindsay57@hotmail.com | clogs | faux-leather | black |
53450 | Emily | Joyce | EmilyJoyce25@gmail.com | ballet flats | faux-leather | navy |
91987 | Joyce | Waller | Joyce.Waller@gmail.com | sandals | fabric | black |
14437 | Justin | Erickson | Justin.Erickson@outlook.com | clogs | faux-leather | red |
orders.iloc[-3:]
would select the rows starting at the 3rd to last row and up to and including the final row
id | first_name | last_name | shoe_type | shoe_material | shoe_color | |
---|---|---|---|---|---|---|
20487 | Thomas | Jensen | TJ5470@gmail.com | clogs | fabric | navy |
76971 | Janice | Hicks | Janice.Hicks@gmail.com | clogs | faux-leather | navy |
21586 | Gabriel | Porter | GabrielPorter24@gmail.com | clogs | leather | brown |
1.One of your doctors thinks that there are more clinic visits in the late Spring.
Write a command that will produce a DataFrame made up of the data for April, May, and June from df
for all four sites (rows 3 through 6), and save it to april_may_june
.
2.Inspect april_may_june
using print
.
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west']
)
april_may_june = df.iloc[3:6]
print(april_may_june)
Select Rows with Logic I
You can select a subset of a DataFrame by using logical statements:
df[df.MyColumnName == desired_column_value]
We have a large DataFrame with information about our customers. A few of the many rows look like this:
name | address | phone | age |
---|---|---|---|
Martha Jones | 123 Main St. | 234-567-8910 | 28 |
Rose Tyler | 456 Maple Ave. | 212-867-5309 | 22 |
Donna Noble | 789 Broadway | 949-123-4567 | 35 |
Amy Pond | 98 West End Ave. | 646-555-1234 | 29 |
Clara Oswald | 54 Columbus Ave. | 714-225-1957 | 31 |
… | … | … | … |
Suppose we want to select all rows where the customer’s age is 30. We would use:
df[df.age == 30]
In Python, ==
is how we test if a value is exactly equal to another value.
We can use other logical statements, such as:
- Greater Than,
>
— Here, we select all rows where the customer’s age is greater than 30:
df[df.age > 30]
- Less Than,
<
— Here, we select all rows where the customer’s age is less than 30:
df[df.age < 30]
- Not Equal,
!=
— This snippet selects all rows where the customer’s name is notClara Oswald
:
df[df.name != 'Clara Oswald']
1.You’re going to staff the clinic for January of this year. You want to know how many visits took place in January of last year, to help you prepare.
Create variable january
using a logical statement that selects the row of df
where the 'month'
column is 'January'
.
2.Inspect january
using print
.
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west'])
january = df[df.month == 'January']
print(january)
Select Rows with Logic II
You can also combine multiple logical statements, as long as each statement is in parentheses.
For instance, suppose we wanted to select all rows where the customer’s age was under 30 or the customer’s name was “Martha Jones”:
name | address | phone | age |
---|---|---|---|
Martha Jones | 123 Main St. | 234-567-8910 | 28 |
Rose Tyler | 456 Maple Ave. | 212-867-5309 | 22 |
Donna Noble | 789 Broadway | 949-123-4567 | 35 |
Amy Pond | 98 West End Ave. | 646-555-1234 | 29 |
Clara Oswald | 54 Columbus Ave. | 714-225-1957 | 31 |
… |
We could use the following code:
df[(df.age < 30) |
(df.name == 'Martha Jones')]
In Python, |
means “or” and &
means “and”.
1.You want to see how the number of clinic visits changed between March and April.
Create the variable march_april
, which contains the data from March and April. Do this using two logical statements combined using |
, which means “or”.
2.Inspect march_april
using print
.
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west'])
march_april = df[(df.month == 'March') | (df.month == 'April')]
print(march_april)
Select Rows with Logic III
Suppose we want to select the rows where the customer’s name is either “Martha Jones”, “Rose Tyler” or “Amy Pond”.
name | address | phone | age |
---|---|---|---|
Martha Jones | 123 Main St. | 234-567-8910 | 28 |
Rose Tyler | 456 Maple Ave. | 212-867-5309 | 22 |
Donna Noble | 789 Broadway | 949-123-4567 | 35 |
Amy Pond | 98 West End Ave. | 646-555-1234 | 29 |
Clara Oswald | 54 Columbus Ave. | 714-225-1957 | 31 |
… | … | … | … |
We could use the isin
command to check that df.name
is one of a list of values:
df[df.name.isin(['Martha Jones',
'Rose Tyler',
'Amy Pond'])]
1.Another doctor thinks that you have a lot of clinic visits in the late Winter.
Create the variable january_february_march
, containing the data from January, February, and March. Do this using a single logical statement with the isin
command.
2.Inspect january_february_march
using print
.
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west'])
january_february_march = df[df.month.isin(['January', 'February', 'March'])]
print(january_february_march)
Setting indices
When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant and makes it hard to use .iloc()
.
We can fix this using the method .reset_index()
. For example, here is a DataFrame called df
with non-consecutive indices:
First Name | Last Name | |
---|---|---|
0 | John | Smith |
4 | Jane | Doe |
7 | Joe | Schmo |
If we use the command df.reset_index()
, we get a new DataFrame with a new set of indices:
index | First Name | Last Name | |
---|---|---|---|
0 | 0 | John | Smith |
1 | 4 | Jane | Doe |
2 | 7 | Joe | Schmo |
Note that the old indices have been moved into a new column called 'index'
. Unless you need those values for something special, it’s probably better to use the keyword drop=True
so that you don’t end up with that extra column. If we run the command df.reset_index(drop=True)
, we get a new DataFrame that looks like this:
First Name | Last Name | |
---|---|---|
0 | John | Smith |
1 | Jane | Doe |
2 | Joe | Schmo |
Using .reset_index()
will return a new DataFrame, but we usually just want to modify our existing DataFrame. If we use the keyword inplace=True
we can just modify our existing DataFrame.
1.Examine the code in the workspace. Note that df2
is a subset of rows from df
.
Type the following and press “Run”:
print(df2)
Note that the indices on df2
are not consecutive.
2.Create a new DataFrame called df3
by resetting the indices on df2
(don’t use inplace
or drop
). Did df2
change after you ran this command?
3.Reset the indices of df2
by using the keyword inplace=True
and drop=True
. Did the indices of df2
change? How is df2
different from df3
?
import codecademylib3
import pandas as pd
df = pd.DataFrame([
['January', 100, 100, 23, 100],
['February', 51, 45, 145, 45],
['March', 81, 96, 65, 96],
['April', 80, 80, 54, 180],
['May', 51, 54, 54, 154],
['June', 112, 109, 79, 129]],
columns=['month', 'clinic_east',
'clinic_north', 'clinic_south',
'clinic_west']
)
df2 = df.loc[[1, 3, 5]]
# print(df2)
df3 = df2.reset_index()
print(df3)
df2.reset_index(inplace = True, drop = True)
print(df2)
1.In this example, you’ll be the data analyst for ShoeFly.com, a fictional online shoe store. You’ve seen this data; now it’s your turn to work with it!
Load the data from shoefly.csv
into the variable orders
.
2.Inspect the first 5 lines of the data.
3.Your marketing department wants to send out an email blast to everyone who ordered shoes!
Select all of the email addresses from the column email
and save them to a variable called emails
.
4.Frances Palmer claims that her order was wrong. What did Frances Palmer order?
Use logic to select that row of orders
and save it to the variable frances_palmer
.
5.We need some customer reviews for our comfortable shoes. Select all orders for shoe_type
: clogs
, boots
, and ballet flats
and save them to the variable comfy_shoes
import codecademylib3
import pandas as pd
#Part 1: reading the csv
orders = pd.read_csv('shoefly.csv')
#Part 2: inspecting the first five lines of data
print(orders.head(5))
#Part 3: selecting the column 'email'
emails = orders.email
#Part 4: the Frances Palmer incident
frances_palmer = orders[(orders.first_name == 'Frances') & (orders.last_name == 'Palmer')]
#Part 5: Comfy feet means more time on the street
comfy_shoes = orders[orders.shoe_type.isin(['clogs', 'boots', 'ballet flats'])]