深入学习Python库中的Pandas 和NumPy

Introduction to Pandas and NumPy

Pandas

Pandas is a very popular library for working with data (its goal is to be the most powerful and flexible open-source tool, and in our opinion,it has reached that goal).DataFrames are at the center of pandas.A DataFrame is structured like a table or spreadsheet. The rows and the columns both have indexes, and you can perform operations on rows or columns separately.

A pandas DataFrame can be easily changed and manipulated.Pandas has helpful functions for handling missing data,performing operations on columns and rows, and transforming data.if that wasn't enough, a lot of SQL functions have counterparts in pandas, such as join,merge,filter by,and group by.with all of these powerful tools, it should come as no surprise that pandas is very popular among data scientists.

NumPy

NumPy is an open-source Python library that facilitates efficient numerical operations on large quantities of data. There are a few functions that exist in NumPy that we use on pandas DataFrames. For us, the most important part about NumPy is that pandas is built on top of it. So, NumPy is a dependency of Pandas.

Installation

If you have Anaconda installed,NumPy and Pandas may have been auto-installed as well! if they haven't been, or if you want to update to the lastet versions, you can open a terminal window and run the following commands:

conda install numpy
conda install pandas

If you don’t have Anaconda installed, you can alternatively install the libraries using pip by running the following commands from your terminal:

pip install numpy
pip install pandas

Once you’ve installed these libraries, you’re ready to open any Python coding environment (we recommend Jupyter Notebook). Before you can use these libraries, you’ll need to import them using the following lines of code. We’ll use the abbreviations np and pd, respectively, to simplify our function calls in the future.

import numpy as np
import pandas as pd

NumPy Arrays

NumPy arrays are unique in that they are more flexible than normal Python lists. They are called ndarrays since they can have any number (n) of dimensions (d). They hold a collection of items of any one data type and can be either a vector (one-dimensional) or a matrix (multi-dimensional). NumPy arrays allow for fast element access and efficient data manipulation.

The code below initializes a Python list named list1:

list1 = [1,2,3,4]

To convert this to a one-dimensional ndarray with one row and four columns, we can use the np.array() function:

array1 = np.array(list1)
print(array1)

[1 2 3 4]

 To get a two-dimensional ndarray from a list, we must start with a Python list of lists:

list2 = [[1,2,3],[4,5,6]]
array2 = np.array(list2)
print(array2)

[[1 2 3]

[4 5 6]]

In the above output, you may notice that the NumPy array print-out is displayed in a way that clearly demonstrates its multi-dimensional structure: two rows and three columns.

Many operations can be performed on NumPy arrays which makes them very helpful for manipulating data:

  • Selecting array elements

  • Slicing arrays

  • Reshaping arrays

  • Splitting arrays

  • Combining arrays

  • Numerical operations (min, max, mean, etc)

Mathematical operations can be performed on all values in a ndarray at one time rather than having to loop through values, as is necessary with a Python list. This is very helpful in many scenarios. Say you own a toy store and decide to decrease the price of all toys by €2 for a weekend sale. With the toy prices stored in an ndarray, you can easily facilitate this operation.

toyPrices = np.array([5,8,3,6])
print(toyPrices - 2)

[3 6 1 4]

If, however, you had stored your toy prices in a Python list, you would have to manually loop through the entire list to decrease each toy price.

toyPrices = [5,8,3,6]
# print(toyPrices - 2) -- Not possible. Causes an error
for i in range(len(toyPrices)):
    toyPrices[i] -= 2
print(toyPrices)

[3,6,1,4]

 Pandas Series and Dataframes

Just as the ndarray is the foundation of the NumPy library, the Series is the core object of the pandas library. A pandas Series is very similar to a one-dimensional NumPy array, but it has additional functionality that allows values in the Series to be indexed using labels. A NumPy array does not have the flexibility to do this. This labeling is useful when you are storing pieces of data that have other data associated with them. Say you want to store the ages of students in an online course to eventually figure out the average student age. If stored in a NumPy array, you could only access these ages with the internal ndarray indices 0,1,2.... With a Series object, the indices of values are set to 0,1,2... by default, but you can customize the indices to be other values such as student names so an age can be accessed using a name. Customized indices of a Series are established by sending values into the Series constructor, as you will see below.

A Series holds items of any one data type and can be created by sending in a scalar value, Python list, dictionary, or ndarray as a parameter to the pandas Series constructor. If a dictionary is sent in, the keys may be used as the indices.

# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13,25,19])
series1 = pd.Series(ages)
print(series1)
0  |  13
1  |  25
2  |  19
dtype: int64

When printing a Series, the data type of its elements is also printed. To customize the indices of a Series object, use the index argument of the Series constructor.

# Create a Series using a NumPy array of ages but customize the indices to be the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)
Emma    |  13
Swetha  |  25
Serajh  |  19
dtype: int64

Series objects provide more information than NumPy arrays do. Printing a NumPy array of ages does not print the indices or allow us to customize them.

ages = np.array([13,25,19])
print(ages)
[13 25 19]

 Another important type of object in the pandas library is the DataFrame. This object is similar in form to a matrix as it consists of rows and columns. Both rows and columns can be indexed with integers or String names. One DataFrame can contain many different types of data types, but within a column, everything has to be the same data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).

There are different ways to fill a DataFrame such as with a CSV file, a SQL query, a Python list, or a dictionary. Here we have created a DataFrame using a Python list of lists. Each nested list represents the data in one row of the DataFrame. We use the keyword columns to pass in the list of our custom column names.

dataf = pd.DataFrame([
    ['John Smith','123 Main St',34],
    ['Jane Doe', '456 Maple Ave',28],
    ['Joe Schmo', '789 Broadway',51]
    ],
    columns=['name','address','age'])

This is how the DataFrame is displayed:

          name      |   address     |   age
0    | John Smith   | 123 Main St   |   34
1    | Jane Doe     | 456 Maple Ave |   28
2    | Joe Schmo    | 789 Broadway  |   51

The default row indices are 0,1,2..., but these can be changed. For example, they can be set to be the elements in one of the columns of the DataFrame. To use the names column as indices instead of the default numerical values, we can run the following command on our DataFrame:

dataf.set_index('name')
   name      |   address     |  age
John Smith   | 123 Main St   |   34
Jane Doe     | 456 Maple Ave |   28
Joe Schmo    | 789 Broadway  |   51

DataFrames are useful because they make it much easier to select, manipulate, and summarize data. Their tabular format (a table with rows and columns) also makes it easier to label, simpler to read, and easier to export data to and from a spreadsheet. Understanding the power of these new data structures is the key to unlocking many new avenues for data manipulation, exploration, and analysis!

10 minutes to pandas

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.

Customarily, we import as follows:

Importing the Pandas Module

Pandas is a Python module for working with tabular data(data in a table with rows and columns).Tabular data has a lot of the same functionality as SQL or Excel, but Pandas adds the power of python.

The pandas module is usually imported at the top of a Python file under the alias pd.

import pandas as pd

If we need to access the pandas module, we can do so by operating on pd.

In this lesson, you’ll learn the basics of working with a single table in Pandas, such as:

  • Create a table from scratch
  • Loading data from another file
  • Selecting certain rows or columns of a table

Note: In order for Codecademy to properly display data from Pandas, we need to import another special library:

import codecademylib3

 Create a DataFrame I

A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

DataFrames have rows and columns. Each column has a name, which is a string. Each row has an index, which is an integer. DataFrames can contain many different data types: strings, ints, floats, tuples, etc.

You can pass in a dictionary to pd.DataFrame(). Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. Here’s an example:

df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})

This command creates a DataFrame called df1 that looks like this:

addressagename
123 Main St.34John Smith
456 Maple Ave.28Jane Doe
789 Broadway51Joe Schmo

 

1.You run an online clothing store called Panda’s Wardrobe. You need a DataFrame containing information about your products.

Create a DataFrame with the following data that your inventory manager sent you:

Product IDProduct NameColor
1t-shirtblue
2t-shirtgreen
3skirtred
4skirtblack

We have already filled in the information for Product ID in df1.

Add the code to create the columns Product Name and Color and their associated data.

import codecademylib3
import pandas as pd

df1 = pd.DataFrame({
  'Product ID': [1, 2, 3, 4],
  # add Product Name and Color here
  'Product Name': ['t-shirt','t-shirt','skirt','skirt'],
  'Color': ['blue','green','red','black']
})

print(df1)

Create a DataFrame II

You can also add data using lists.

For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword argument columns to pass a list of column names.

df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])

This command produces a DataFrame df2 that looks like this:

nameaddressage
John Smith123 Main St.34
Jane Doe456 Maple Ave.28
Joe Schmo789 Broadway51

 In this example, we were able to control the ordering of the columns because we used lists.

1.

You’re running a chain of pita shops called Pita Power. You want to create a DataFrame with information on your different store locations.

Use a list of lists to create a DataFrame with the following data:

Store IDLocationNumber of Employees
1San Diego100
2Los Angeles120
3San Francisco90
4Sacramento115

We have filled in the information for the first two rows in df2.

Add the code to create the 3rd and 4th rows, and the column names.

import codecademylib3
import pandas as pd

df2 = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  # Fill in rows 3 and 4
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
],
  columns=[
    #add column names here
    'Store ID','Location','Number of Employees'
  ])

print(df2)

Comma Separated Variables (CSV)

We now know how to create our own DataFrame. However, most of the time, we’ll be working with datasets that already exist. One of the most common formats for big datasets is the CSV.

CSV (comma separated values) is a text-only spreadsheet format. You can find CSVs in lots of places:

  • Online datasets (here’s an example from data.gov)
  • Export from Excel or Google Sheets
  • Export from SQL

column1,column2,column3
value1,value2,value3

That example CSV represents the following table:

column1column2column3
value1value2value3

1.You run a cupcake store and want to create a record of all of the cupcakes that you offer.

Write the following data as a CSV in cupcakes.csv.

namecake_flavorfrosting_flavortopping
Chocolate Cakechocolatechocolatechocolate shavings
Birthday Cakevanillavanillarainbow sprinkles
Carrot Cakecarrotcream cheesealmonds

cupcakes.csv

name,cake_flavor,frosting_flavor,topping
Chocolate Cake,chocolate,chocolate,chocolate shavings
Birthday Cake,vanilla,vanilla,rainbow sprinkles
Carrot Cake,carrot,cream,cheese	almonds

Loading and Saving CSVs

When you have data in a CSV, you can load it into a DataFrame in Pandas using .read_csv():

pd.read_csv('my-csv-file.csv')

In the example above, the .read_csv() method is called. The CSV file called my-csv-file is passed in as an argument.

We can also save data to a CSV, using .to_csv().

df.to_csv('new-csv-file.csv')

In the example above, the .to_csv() method is called on df (which represents a DataFrame object). The name of the CSV file is passed in as an argument (new-csv-file.csv). By default, this method will save the CSV file in your current directory.

1.You’re working for the County of Whoville and you just received a CSV of data about the different cities in your county. Read the CSV 'sample.csv' into a variable called df, so that you can learn more about the cities.

2.Let’s inspect the CSV.

Type print(df) on the next line and then run your code. What sort of data were you sent?

import codecademylib3
import pandas as pd

df = pd.read_csv('sample.csv')
print(df)


Inspect a DataFrame

When we load a new DataFrame from a CSV, we want to know what it looks like.

If it’s a small DataFrame, you can display it by typing print(df).

If it’s a larger DataFrame, it’s helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method .head() gives the first 5 rows of a DataFrame. If you want to see more rows, you can pass in the positional argument n. For example, df.head(10) would show the first 10 rows.

The method df.info() gives some statistics for each column.

1.You’re working for a Hollywood studio, trying to use data to predict the next big hit. Load the CSV imdb.csv into a variable called df, so that you can learn about popular movies from the past 90 years.

2.Let’s learn about these movies.

Paste the following code into script.py:

print(df.head())

3.What exactly is in this dataset?

Paste the following code into script.py to learn more about this data:

print(df.info())

Select Columns

Now we know how to create and load data. Let’s select parts of those datasets that are interesting or important to our analyses.

Suppose you have the DataFrame called customers, which contains the ages of your customers:

nameage
Rebecca Erikson35
Thomas Roberson28
Diane Ochoa42

Perhaps you want to take the average or plot a histogram of the ages. In order to do either of these tasks, you’d need to select the column.

There are two possible syntaxes for selecting all values from a column:

  1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type customers['age'] to select the ages.
  2. If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: df.MySecondColumn. In our example, we would type customers.age.

When we select a single column, the result is called a Series.

1.The DataFrame df represents data collected by four health clinics run by the same organization. Each row represents a month from January through June and shows the number of appointments made at four different clinics.

You want to analyze what’s been happening at the North location. Create a variable called clinic_north that contains ONLY the data from the column clinic_north.

2.What exactly have you selected?

After you create the variable, enter the command:

print(type(clinic_north))
import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north = df['clinic_north']

print(type(clinic_north))

Selecting Multiple Columns

When you have a larger DataFrame, you might want to select just a few columns.

For instance, let’s return to a DataFrame of orders from ShoeFly.com:

idfirst_namelast_nameemailshoe_typeshoe_materialshoe_color
54791RebeccaLindsayRebeccaLindsay57@hotmail.comclogsfaux-leatherblack
53450EmilyJoyceEmilyJoyce25@gmail.comballet flatsfaux-leathernavy
91987JoyceWallerJoyce.Waller@gmail.comsandalsfabricblack
14437JustinEricksonJustin.Erickson@outlook.comclogsfaux-leatherred

We might just be interested in the customer’s last_name and email. We want a DataFrame like this:

last_nameemail
LindsayRebeccaLindsay57@hotmail.com
JoyceEmilyJoyce25@gmail.com
WallerJoyce.Waller@gmail.com
EricksonJustin.Erickson@outlook.com

To select two or more columns from a DataFrame, we use a list of the column names. To create the DataFrame shown above, we would use:

new_df = orders[['last_name', 'email']]

Note: Make sure that you have a double set of brackets ([[]]), or this command won’t work!

1.Now, you want to compare visits to the Northern and Southern clinics.

Create a variable called clinic_north_south that contains ONLY the data from the columns clinic_north and clinic_south.

2.When we select multiple columns, do we get a Series or a DataFrame?

After you’ve created the variable, enter the command:

print(type(clinic_north_south))

to see what data type you’ve created.

How is this different from what happened in the previous exercise?

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north_south = df[['clinic_north','clinic_south']]

print(type(clinic_north_south))

Select Rows

Let’s revisit our orders from ShoeFly.com:

idfirst_namelast_nameemailshoe_typeshoe_materialshoe_color
54791RebeccaLindsayRebeccaLindsay57@hotmail.comclogsfaux-leatherblack
53450EmilyJamesEmilyJames25@gmail.comballet flatsfaux-leathernavy
91987JoyceWallerJoyce.Waller@gmail.comsandalsfabricblack
14437JustinEricksonJustin.Erickson@outlook.comclogsfaux-leatherred

Maybe our Customer Service department has just received a message from Joyce Waller, so we want to know exactly what she ordered. We want to select this single row of data.

DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. Joyce Waller’s order is the 2nd row.

We select it using the following command:

orders.iloc[2]

When we select a single row, the result is a Series (just like when we select a single column).

1.You’re getting ready to staff the clinic for March this year. You want to know how many visits took place in March last year, to help you prepare.

Write a command that will produce a Series made up of the March data from df from all four clinic sites and save it to the variable march.

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])

march=df.iloc[2]

Selecting Multiple Rows

You can also select multiple rows from a DataFrame.

Here are a few more rows from ShoeFly.com’s orders DataFrame:

idfirst_namelast_nameemailshoe_typeshoe_materialshoe_color
54791RebeccaLindsayRebeccaLindsay57@hotmail.comclogsfaux-leatherblack
53450EmilyJoyceEmilyJoyce25@gmail.comballet flatsfaux-leathernavy
91987JoyceWallerJoyce.Waller@gmail.comsandalsfabricblack
14437JustinEricksonJustin.Erickson@outlook.comclogsfaux-leatherred
79357AndrewBanksAB4318@gmail.combootsleatherbrown
52386JulieMarshJulieMarsh59@gmail.comsandalsfabricblack
20487ThomasJensenTJ5470@gmail.comclogsfabricnavy
76971JaniceHicksJanice.Hicks@gmail.comclogsfaux-leathernavy
21586GabrielPorterGabrielPorter24@gmail.comclogsleatherbrown

Here are some different ways of selecting multiple rows:

  • orders.iloc[3:7] would select all rows starting at the 3rd row and up to but not including the 7th row (i.e., the 3rd row, 4th row, 5th row, and 6th row)
idfirst_namelast_nameemailshoe_typeshoe_materialshoe_color
14437JustinEricksonJustin.Erickson@outlook.comclogsfaux-leatherred
79357AndrewBanksAB4318@gmail.combootsleatherbrown
52386JulieMarshJulieMarsh59@gmail.comsandalsfabricblack
20487ThomasJensenTJ5470@gmail.comclogsfabricnavy
  • orders.iloc[:4] would select all rows up to, but not including the 4th row (i.e., the 0th, 1st, 2nd, and 3rd rows)
idfirst_namelast_nameemailshoe_typeshoe_materialshoe_color
54791RebeccaLindsayRebeccaLindsay57@hotmail.comclogsfaux-leatherblack
53450EmilyJoyceEmilyJoyce25@gmail.comballet flatsfaux-leathernavy
91987JoyceWallerJoyce.Waller@gmail.comsandalsfabricblack
14437JustinEricksonJustin.Erickson@outlook.comclogsfaux-leatherred
  • orders.iloc[-3:] would select the rows starting at the 3rd to last row and up to and including the final row
idfirst_namelast_nameemailshoe_typeshoe_materialshoe_color
20487ThomasJensenTJ5470@gmail.comclogsfabricnavy
76971JaniceHicksJanice.Hicks@gmail.comclogsfaux-leathernavy
21586GabrielPorterGabrielPorter24@gmail.comclogsleatherbrown

1.One of your doctors thinks that there are more clinic visits in the late Spring.

Write a command that will produce a DataFrame made up of the data for April, May, and June from df for all four sites (rows 3 through 6), and save it to april_may_june.

 2.Inspect april_may_june using print.

 

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

april_may_june = df.iloc[3:6]
print(april_may_june)

Select Rows with Logic I

You can select a subset of a DataFrame by using logical statements:

df[df.MyColumnName == desired_column_value]

We have a large DataFrame with information about our customers. A few of the many rows look like this:

nameaddressphoneage
Martha Jones123 Main St.234-567-891028
Rose Tyler456 Maple Ave.212-867-530922
Donna Noble789 Broadway949-123-456735
Amy Pond98 West End Ave.646-555-123429
Clara Oswald54 Columbus Ave.714-225-195731

Suppose we want to select all rows where the customer’s age is 30. We would use:

df[df.age == 30]

In Python, == is how we test if a value is exactly equal to another value.

We can use other logical statements, such as:

  • Greater Than, > — Here, we select all rows where the customer’s age is greater than 30:
df[df.age > 30]
  • Less Than, < — Here, we select all rows where the customer’s age is less than 30:
df[df.age < 30]
  • Not Equal, != — This snippet selects all rows where the customer’s name is not Clara Oswald:
df[df.name != 'Clara Oswald']

1.You’re going to staff the clinic for January of this year. You want to know how many visits took place in January of last year, to help you prepare.

Create variable january using a logical statement that selects the row of df where the 'month' column is 'January'.

2.Inspect january using print.

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])

january = df[df.month == 'January']
print(january)

Select Rows with Logic II

You can also combine multiple logical statements, as long as each statement is in parentheses.

For instance, suppose we wanted to select all rows where the customer’s age was under 30 or the customer’s name was “Martha Jones”:

nameaddressphoneage
Martha Jones123 Main St.234-567-891028
Rose Tyler456 Maple Ave.212-867-530922
Donna Noble789 Broadway949-123-456735
Amy Pond98 West End Ave.646-555-123429
Clara Oswald54 Columbus Ave.714-225-195731

We could use the following code:

df[(df.age < 30) |
   (df.name == 'Martha Jones')]

In Python, | means “or” and & means “and”.

1.You want to see how the number of clinic visits changed between March and April.

Create the variable march_april, which contains the data from March and April. Do this using two logical statements combined using |, which means “or”.

2.Inspect march_april using print.

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])

march_april = df[(df.month == 'March') | (df.month == 'April')]
print(march_april)




Select Rows with Logic III

Suppose we want to select the rows where the customer’s name is either “Martha Jones”, “Rose Tyler” or “Amy Pond”.

nameaddressphoneage
Martha Jones123 Main St.234-567-891028
Rose Tyler456 Maple Ave.212-867-530922
Donna Noble789 Broadway949-123-456735
Amy Pond98 West End Ave.646-555-123429
Clara Oswald54 Columbus Ave.714-225-195731

We could use the isin command to check that df.name is one of a list of values:

df[df.name.isin(['Martha Jones',
     'Rose Tyler',
     'Amy Pond'])]

1.Another doctor thinks that you have a lot of clinic visits in the late Winter.

Create the variable january_february_march, containing the data from January, February, and March. Do this using a single logical statement with the isin command.

2.Inspect january_february_march using print.

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west'])

january_february_march = df[df.month.isin(['January', 'February', 'March'])]

print(january_february_march)

Setting indices

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant and makes it hard to use .iloc().

We can fix this using the method .reset_index(). For example, here is a DataFrame called df with non-consecutive indices:

First NameLast Name
0JohnSmith
4JaneDoe
7JoeSchmo

If we use the command df.reset_index(), we get a new DataFrame with a new set of indices:

indexFirst NameLast Name
00JohnSmith
14JaneDoe
27JoeSchmo

Note that the old indices have been moved into a new column called 'index'. Unless you need those values for something special, it’s probably better to use the keyword drop=True so that you don’t end up with that extra column. If we run the command df.reset_index(drop=True), we get a new DataFrame that looks like this:

First NameLast Name
0JohnSmith
1JaneDoe
2JoeSchmo

Using .reset_index() will return a new DataFrame, but we usually just want to modify our existing DataFrame. If we use the keyword inplace=True we can just modify our existing DataFrame.

1.Examine the code in the workspace. Note that df2 is a subset of rows from df.

Type the following and press “Run”:

print(df2)

Note that the indices on df2 are not consecutive.

2.Create a new DataFrame called df3 by resetting the indices on df2 (don’t use inplace or drop). Did df2 change after you ran this command?

3.Reset the indices of df2 by using the keyword inplace=True and drop=True. Did the indices of df2 change? How is df2 different from df3?

import codecademylib3
import pandas as pd

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

df2 = df.loc[[1, 3, 5]]

# print(df2)

df3 = df2.reset_index()

print(df3)

df2.reset_index(inplace = True, drop = True)

print(df2)

1.In this example, you’ll be the data analyst for ShoeFly.com, a fictional online shoe store. You’ve seen this data; now it’s your turn to work with it!

Load the data from shoefly.csv into the variable orders.

2.Inspect the first 5 lines of the data.

3.Your marketing department wants to send out an email blast to everyone who ordered shoes!

Select all of the email addresses from the column email and save them to a variable called emails.

4.Frances Palmer claims that her order was wrong. What did Frances Palmer order?

Use logic to select that row of orders and save it to the variable frances_palmer.

5.We need some customer reviews for our comfortable shoes. Select all orders for shoe_typeclogsboots, and ballet flats and save them to the variable comfy_shoes

import codecademylib3
import pandas as pd

#Part 1: reading the csv
orders = pd.read_csv('shoefly.csv')

#Part 2: inspecting the first five lines of data
print(orders.head(5))

#Part 3: selecting the column 'email'
emails = orders.email

#Part 4: the Frances Palmer incident
frances_palmer = orders[(orders.first_name == 'Frances') & (orders.last_name == 'Palmer')]

#Part 5: Comfy feet means more time on the street
comfy_shoes = orders[orders.shoe_type.isin(['clogs', 'boots', 'ballet flats'])]

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值