Data Analysis for Mathematical Modeling [Part 6]: Practical Pandas Operations for Reading Datasets

Article link: https://blog.csdn.net/lmx1458070445/article/details/141259639

Contents

1. Reading CSV Files with Pandas
2. Loading an Excel Spreadsheet as a pandas DataFrame
- 2.1 Creating a DataFrame from an Excel file

WeChat Official Account / Xiaohongshu: 快乐数模

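1. Reading CSV Files with Pandas

1.1 Syntax of Pandas read_csv

The syntax of Pandas read_csv and its most commonly used parameters:

pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)

filepath_or_buffer: location of the CSV file. Accepts a string path, a path object, a URL, or a file-like object.
sep: the delimiter; defaults to ','.
header: row number (an int or a list of ints) to use as the column names; the data starts after that row. With the default 'infer', the first line is used as the header unless names is passed. With header=None, no row is treated as a header and the columns are labelled 0, 1, 2, and so on.
usecols: read only the selected columns from the CSV file.
nrows: number of rows of the file to read.
index_col: column(s) to use as the row index. With None, no column is used and a default integer index (0, 1, 2, ...) is generated.
skiprows: 0-based line numbers to skip, or the number of lines to skip at the start of the file.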
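As a minimal sketch of how header, usecols and nrows interact, the snippet below reads a small in-memory CSV instead of a file. The data and the column names (name, age, city) are made up for illustration; io.StringIO simply stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV used only for illustration.
csv_text = "name,age,city\nAnn,30,Oslo\nBob,25,Rome\nCid,41,Lima\n"

# header='infer' (the default): the first line supplies the column names.
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None: no line is treated as a header, so the columns are numbered
# 0, 1, 2 and the original header line becomes the first data row.
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# usecols + nrows: keep two columns and read only the first two data rows.
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=["name", "age"], nrows=2)

print(df_default.columns.tolist())
print(df_no_header.columns.tolist())
print(df_subset)
```

Note that header=None turns the header line into data, so it is mainly useful for files that really have no header row.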

The following subsections give an example for each of them.

1.1.1 Reading a CSV file with Pandas read_csv

First import the Pandas library, then use it to load the CSV file:

# Import pandas
import pandas as pd

# reading csv file 
df = pd.read_csv("people.csv")
print(df.head())

1.1.2 Using sep in read_csv()

Take a CSV file that mixes several special characters as delimiters to see how the sep parameter works.

# sample = "totalbill_tip, sex:smoker, day_time, size
# 16.99, 1.01:Female|No, Sun, Dinner, 2
# 10.34, 1.66, Male, No|Sun:Dinner, 3
# 21.01:3.5_Male, No:Sun, Dinner, 3
# 23.68, 3.31, Male|No, Sun_Dinner, 2
# 24.59:3.61, Female_No, Sun, Dinner, 4
# 25.29, 4.71|Male, No:Sun, Dinner, 4"

# Importing pandas library
import pandas as pd

# Load the data of csv
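df = pd.read_csv('sample.csv',
                 sep='[:, |_]',  # regex character class: any one of ':', ',', ' ', '|' or '_' acts as a delimiter
                 engine='python')  # regex separators are only supported by the python engine; setting it explicitly avoids a ParserWarning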
# Print the Dataframe
print(df)
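Because sample.csv itself is not shown in full, here is a self-contained sketch of the same idea on made-up data. The text and column names are assumptions, and the character class deliberately leaves out the space so that each delimiter produces exactly one split.

```python
import io
import pandas as pd

# Hypothetical data that mixes four delimiters, with no spaces around them.
raw = "total:tip|sex_smoker\n16.99,1.01:Female|No\n10.34|1.66_Male,Yes\n"

# The character class matches any single ':', ',', '|' or '_'.
df = pd.read_csv(io.StringIO(raw), sep=r"[:,|_]", engine="python")

print(df.columns.tolist())
print(df)
```

If the file also has spaces after the delimiters, including the space in the class produces empty fields between a delimiter and the space that follows it; a pattern such as r"[:,|_]\s*" is one possible way to absorb them.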

1.1.3 Using usecols in read_csv()

Read only the "Sex", "Job Title" and "Email" columns with usecols, and use index_col to make "Sex" and "Job Title" the row index.

df = pd.read_csv('people.csv',
        header=0,  # the first line of the CSV file (index 0) holds the column names
        index_col=["Sex", "Job Title"],  # use these columns as the (multi-level) row index
        usecols=["Sex", "Job Title", "Email"])    # read only these three columns
print(df.head())
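To show what the two-column index is good for, the sketch below builds a tiny stand-in for people.csv in memory (the rows are invented) and then selects one group with DataFrame.xs.

```python
import io
import pandas as pd

# Hypothetical stand-in for people.csv; the rows are invented.
raw = (
    "Name,Sex,Job Title,Email\n"
    "Ann,Female,Engineer,ann@example.com\n"
    "Bob,Male,Analyst,bob@example.com\n"
    "Eve,Female,Analyst,eve@example.com\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    usecols=["Sex", "Job Title", "Email"],  # the Name column is never loaded
    index_col=["Sex", "Job Title"],         # both columns form a MultiIndex
)

# Select every row whose "Sex" index level equals "Female".
females = df.xs("Female", level="Sex")

print(df.index.names)
print(females)
```

Every column named in index_col must also appear in usecols, otherwise pandas cannot find it while parsing.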

1.1.4 Using nrows in read_csv()

Set the nrows parameter to read only the first three data rows.

df = pd.read_csv('people.csv',
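        header=0,  # use the first line as the column names
        index_col=["Sex", "Job Title"],  # index the data by two columns
        usecols=["Sex", "Job Title", "Email"],  # read only these three columns
        nrows=3)   # read only the first three data rows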
print(df)

1.1.5 Using skiprows in read_csv()

df= pd.read_csv("people.csv")
print("Previous Dataset: ")
print(df)
# using skiprows
df = pd.read_csv("people.csv", skiprows = [1,5])  # skip the 2nd and 6th lines of the file (0-based line indices 1 and 5; line 0 is the header)
print("Dataset After skipping rows: ")
print(df)
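skiprows behaves differently depending on whether it receives a list, an integer or a callable. The following sketch compares the three forms on a small invented dataset held in memory.

```python
import io
import pandas as pd

# Hypothetical data: line 0 is the header, lines 1-5 are data rows.
raw = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"

# A list skips exactly those 0-based file lines; the header (line 0) is kept.
by_list = pd.read_csv(io.StringIO(raw), skiprows=[1, 5])

# An integer skips that many lines from the top, so the header is skipped too
# and the line "2,b" ends up being used as the header.
by_int = pd.read_csv(io.StringIO(raw), skiprows=2)

# A callable receives each line index and returns True for lines to skip;
# here every even-numbered line except the header is dropped.
by_func = pd.read_csv(io.StringIO(raw), skiprows=lambda i: i > 0 and i % 2 == 0)

print(by_list)
print(by_int)
print(by_func)
```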

1.2 Saving a Pandas DataFrame as CSV

This section shows how to export a Pandas DataFrame to a CSV file with the to_csv() method. By default, to_csv() writes the row index as the first column and uses a comma as the delimiter.

1.2.1 Exporting a DataFrame to a CSV file with df.to_csv()

# importing pandas as pd
import pandas as pd
 
# list of name, degree, score
nme = ["aparna", "pankaj", "sudhir", "Geeku"]
deg = ["MBA", "BCA", "M.Tech", "MBA"]
scr = [90, 40, 80, 98]
 
# dictionary of lists (named data to avoid shadowing the built-in dict)
data = {'name': nme, 'degree': deg, 'score': scr}

df = pd.DataFrame(data)
print(df)

To save it as a .csv file, simply call df.to_csv().

# saving the dataframe
df.to_csv('file1.csv')

1.2.2 Saving a .csv file without header and index

# saving the dataframe
df.to_csv('file2.csv', header=False, index=False)    # write no header row and no index column
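The effect of the index argument is easiest to see by reading the output back. In this sketch to_csv() is called without a path, in which case it returns the CSV text as a string; the two-row DataFrame is an invented example.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["aparna", "pankaj"], "score": [90, 40]})

# With no path argument, to_csv() returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the default output back: the saved index shows up as an extra,
# unnamed first column.
back_with = pd.read_csv(io.StringIO(with_index))

# index_col=0 turns that first column back into the row index.
back_restored = pd.read_csv(io.StringIO(with_index), index_col=0)

# Output written with index=False round-trips without the extra column.
back_without = pd.read_csv(io.StringIO(without_index))

print(back_with.columns.tolist())
print(back_restored.columns.tolist())
print(back_without.columns.tolist())
```

Either index=False when writing or index_col=0 when reading keeps the stray column out of the result; which one fits depends on whether the index carries information worth saving.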

1.2.3 Saving the CSV to a specific location

# saving the dataframe
df.to_csv(r'C:\Users\Admin\Desktop\file3.csv')  # the r prefix makes this a raw string, so backslashes are not treated as escape sequences
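A platform-independent alternative to a hard-coded Windows path is pathlib. The sketch below writes into a temporary directory so that it can run anywhere; the reports subfolder and the file name are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"name": ["aparna", "pankaj"], "score": [90, 40]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical target location inside the temporary directory.
    target = Path(tmp) / "reports" / "file3.csv"

    # to_csv() does not create missing folders, so create them first.
    target.parent.mkdir(parents=True, exist_ok=True)

    df.to_csv(target, index=False)   # a Path object is accepted directly
    saved_ok = target.exists()
    round_trip = pd.read_csv(target)

print(saved_ok)
print(round_trip)
```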

1.2.4 Writing a DataFrame to a CSV file with a tab delimiter

import pandas as pd
 
users = {'Name': ['Amit', 'Cody', 'Drew'],  # a dictionary of column data
    'Age': [20,21,25]}
 
# create DataFrame
df = pd.DataFrame(users, columns=['Name','Age'])  # build a DataFrame from the Name and Age entries of users
 
print("Original DataFrame:")
print(df)  # print the original DataFrame
 
df.to_csv('Users.csv', sep='\t', index=False, header=True)  # keep the header, separate fields with \t, save to file
new_df = pd.read_csv('Users.csv', sep='\t')   # read the file back with the same tab delimiter
 
print('Data from Users.csv:')
print(new_df)  # print the contents read back from the file

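2. Loading an Excel Spreadsheet as a pandas DataFrame

Pandas is a powerful and extensible data analysis tool. Data can arrive in many formats, and Pandas supports a wide range of them, including Excel files. Reading .xlsx files requires an Excel engine such as openpyxl to be installed.

First import Pandas and load the Excel file, then parse one of its sheets into a Pandas DataFrame.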

import pandas as pd 
  
# Import the excel file and call it excel_file 
excel_file = pd.ExcelFile('pandasEx.xlsx')  # open the Excel file and keep it as an ExcelFile object 
  
# View the excel_file's sheet names 
print(excel_file.sheet_names)   # print the names of the sheets in the workbook
  
# Load the excel_file's Sheet1 as a dataframe 
df = excel_file.parse('Sheet1')     # parse the sheet named Sheet1 into a DataFrame
print(df)

Reading specific columns:

# import pandas lib as pd  
import pandas as pd  
    
require_cols = [0, 3]        # positions of the first and fourth columns
    
# only read specific columns from an excel file  
required_df = pd.read_excel('SampleWork2.xlsx', usecols = require_cols)  
    
print(required_df)

2.1 Creating a DataFrame from an Excel file

Use Pandas to read an Excel file into a Pandas DataFrame object.

2.1.1 Reading a file into a DataFrame

Use the pandas read_excel() method to read the Excel file.

# import pandas lib as pd
import pandas as pd
 
# read by default 1st sheet of an excel file
dataframe1 = pd.read_excel('SampleWork.xlsx')  # with no sheet_name given, the first sheet is read
 
print(dataframe1)

2.1.2 Reading a specific sheet with the 'sheet_name' parameter of read_excel()

# import pandas lib as pd
import pandas as pd
 
# read 2nd sheet of an excel file
dataframe2 = pd.read_excel('SampleWork.xlsx', sheet_name = 1)  # sheet positions start at 0, so this reads the 2nd sheet
 
print(dataframe2)

2.1.3 Reading specific columns with the 'usecols' parameter of read_excel()

# import pandas lib as pd
import pandas as pd
 
require_cols = [0, 3]
 
# only read specific columns from an excel file
required_df = pd.read_excel('SampleWork.xlsx', usecols = require_cols)   # read only the first and fourth columns
 
print(required_df)

2.1.4 Handling missing data with the 'na_values' parameter of read_excel()

# import pandas lib as pd
import pandas as pd
 
# Handling missing values of 3rd sheet of an excel file.
dataframe = pd.read_excel('SampleWork.xlsx', na_values = "Missing",  # treat cells containing "Missing" as missing values (NaN)
                                                    sheet_name = 2)
 
print(dataframe)
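read_csv accepts the same na_values parameter, so its effect can be sketched without an Excel file. The snippet below uses invented in-memory CSV data as a stand-in for the spreadsheet and compares reading with and without the parameter.

```python
import io
import pandas as pd

# Hypothetical data in which the text "Missing" marks an absent score.
raw = "name,score\nAnn,90\nBob,Missing\nCid,75\n"

# Without na_values, "Missing" is ordinary text and the column stays textual.
plain = pd.read_csv(io.StringIO(raw))

# With na_values, "Missing" becomes NaN and the column is parsed as numbers.
marked = pd.read_csv(io.StringIO(raw), na_values="Missing")

print(plain["score"].tolist())
print(marked["score"].tolist())
print(marked["score"].isna().sum())
```

Once the placeholder is converted to NaN, the usual tools such as isna(), dropna() and fillna() apply to the column.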

2.1.5 Skipping the starting rows of an Excel file with the 'skiprows' parameter of read_excel()

# import pandas lib as pd
import pandas as pd
 
# read 2nd sheet of an excel file after
# skipping starting two rows 
df = pd.read_excel('SampleWork.xlsx', sheet_name = 1, skiprows = 2)  # read the 2nd sheet, skipping its first two rows
 
print(df)

2.1.6 Using any row as the header, and reading from that row onward, with the 'header' parameter of read_excel()

# import pandas lib as pd
import pandas as pd
 
# setting the 3rd row as header.
df = pd.read_excel('SampleWork.xlsx', sheet_name = 1, header = 2)  # read the 2nd sheet, using its 3rd row (0-based index 2) as the header; data starts on the next row
 
print(df)

2.1.7 Reading multiple Excel sheets with the 'sheet_name' parameter of read_excel()

# import pandas lib as pd
import pandas as pd
 
# read both 1st and 2nd sheet.
df = pd.read_excel('SampleWork.xlsx', na_values = "Missing",  # treat "Missing" as a missing value
                                        sheet_name =[0, 1])     # read the 1st and 2nd sheets; the result is a dict of DataFrames keyed by 0 and 1
 
print(df)

2.1.8 Reading all sheets of an Excel file at once with the 'sheet_name' parameter of read_excel()

# import pandas lib as pd
import pandas as pd
 
# read all sheets together.
all_sheets_df = pd.read_excel('SampleWork.xlsx', na_values = "Missing",   
                                                     sheet_name = None)  # None reads every sheet; the result is a dict of DataFrames keyed by sheet name
 
print(all_sheets_df)