2021-05-31 pandas: Reading Files & Viewing and Manipulating Data with DataFrame

mlxgccc

Article link: https://blog.csdn.net/mlxgccc/article/details/117427665

1. Reading Data

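Use the read_csv function to read a CSV file into a pandas DataFrame object:

import pandas as pd
# "your_file.csv" is a placeholder; replace it with the path to your CSV file
df_rating = pd.read_csv("your_file.csv")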
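
To see what read_csv returns without needing a file on disk, here is a minimal sketch that parses a small in-memory CSV through io.StringIO. The column names (movie, rating) and the values are made up for illustration and are not the data used in this post.

```python
import io
import pandas as pd

# Hypothetical CSV content; the column names and values are illustrative only.
csv_text = "movie,rating\nA,8.5\nB,7.2\nC,9.1\n"

# read_csv accepts a path or any file-like object, so StringIO stands in for a file.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)          # (number of rows, number of columns)
print(list(df_demo.columns))  # column labels taken from the header line
```

The first line of the CSV becomes the column labels, and the rows get a default integer index.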

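Use the read_excel function to read an Excel file into a DataFrame. The sheet_name parameter selects which sheet to read, and the usecols parameter selects which columns to read:

# Read an Excel file (the first sheet by default)
df_info = pd.read_excel("info.xlsx")
# Read a different sheet from the same workbook
df_perf = pd.read_excel("info.xlsx", sheet_name="sheet2")
# Read only selected columns (Excel column letters A and B)
df_perf1 = pd.read_excel("info.xlsx", sheet_name="sheet2", usecols="A,B")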

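Use the read_html function to extract the tables in an HTML document as a list of DataFrames, then inspect them one by one to find the table you want.

import io
import urllib3
import pandas as pd

def download_content(url):
    # Fetch the page and return its body as a decoded string
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    response_data = response.data
    html_content = response_data.decode()
    return html_content

# This call must sit outside the function body; placed after return it would never run
html_content = download_content("http://fx.cmbchina.com/Hq/")

Read the data

# Wrap the string in StringIO: recent pandas versions deprecate passing literal HTML
cmb_table_list = pd.read_html(io.StringIO(html_content))
print(len(cmb_table_list))
cmb_table_list[0]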

2. Viewing and Manipulating Data
Common concepts: index, Series, and DataFrame

index

import pandas as pd
# Create a Series from a list
ser1 = pd.Series([1,3,5,7])
# Display ser1 in the notebook
ser1
print("values: ", ser1.values)
print("index: ", ser1.index)
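
The index does not have to be the automatically generated 0, 1, 2, ... sequence. As a small sketch, the Series below is given explicit labels; the labels "a" through "d" are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Default index: 0, 1, 2, ... is generated automatically.
ser_default = pd.Series([1, 3, 5, 7])

# Custom index: these labels are hypothetical, picked only for the example.
ser_labeled = pd.Series([1, 3, 5, 7], index=["a", "b", "c", "d"])

print(list(ser_default.index))  # the automatically generated labels
print(ser_labeled["c"])         # look up a value by its label
```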

DataFrame

# Get the rating column and store it in the ser_rating variable
ser_rating = df_rating["rating"]
# Print the ser_rating Series
print(ser_rating)
# Print a separator to make the output easier to read
print("------------separator------------")
# Check the type of the data
print(type(ser_rating))

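Comparing a row Series taken out by row index with a column Series taken out by column index, it is easy to see that:
the index of a column Series is the DataFrame's row labels,
while the index of a row Series is the DataFrame's column names.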
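
This relationship can be checked directly. The sketch below uses a tiny made-up DataFrame (the row labels r1/r2 and the columns name/score are assumptions for illustration) and pulls out one column Series and one row Series.

```python
import pandas as pd

# Hypothetical data with explicit row labels so the two indexes are easy to tell apart.
df_demo = pd.DataFrame(
    {"name": ["x", "y"], "score": [90, 80]},
    index=["r1", "r2"],
)

col_ser = df_demo["score"]   # column Series, selected by column name
row_ser = df_demo.loc["r1"]  # row Series, selected by row label

print(list(col_ser.index))  # the DataFrame's row labels
print(list(row_ser.index))  # the DataFrame's column names
```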

Creating a DataFrame
This is somewhat like INSERT INTO in SQL.

# Store the column labels in index_arr (name, age, hometown, department)
index_arr = ["姓名", "年龄", "籍贯", "部门"]
# Build one row Series each for 小明, 小亮, and 小E, using index_arr as the Series index
ser_xiaoming = pd.Series(["小明", 22, "河北", "IT部"], index=index_arr)
ser_xiaoliang = pd.Series(["小亮", 25, "广东", "IT部"], index=index_arr)
ser_xiaoe = pd.Series(["小E", 23, "四川", "财务部"], index=index_arr)
# Pass the three Series as a list to the DataFrame constructor
df_info = pd.DataFrame([ser_xiaoming, ser_xiaoliang, ser_xiaoe])
# Display the DataFrame in the notebook
df_info
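
A DataFrame can also be built column by column instead of row by row. As a sketch (a different technique from the row Series approach above), the same three records are passed as a dict that maps each column name to a list of values.

```python
import pandas as pd

# Same records as above, organized by column instead of by row.
df_from_dict = pd.DataFrame(
    {
        "姓名": ["小明", "小亮", "小E"],
        "年龄": [22, 25, 23],
        "籍贯": ["河北", "广东", "四川"],
        "部门": ["IT部", "IT部", "财务部"],
    }
)

print(df_from_dict.shape)        # (rows, columns)
print(df_from_dict["年龄"].sum())  # the age column holds numbers, so it can be summed
```

With this form the dict keys become the column names and the row index is generated automatically.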

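Methods:
Adding a row: use pd.concat (the DataFrame.append method did the same job but was removed in pandas 2.0)

# Create a new row Series and store it in the ser_xiaoh variable
ser_xiaoh = pd.Series(["小红", 28, "福建", "财务部"], index=index_arr)
# Convert the Series to a one-row DataFrame and concatenate it onto df_info
# ignore_index=True lets pandas regenerate the row index automatically
# concat returns a new DataFrame, which we store back in the df_info variable
df_info = pd.concat([df_info, ser_xiaoh.to_frame().T], ignore_index=True)
# View the DataFrame after adding the row
df_info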
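
Another way to add a single row, sketched here on a small made-up DataFrame, is to assign to a new row label through loc. This assumes the DataFrame has the default 0..n-1 index, so that len(df) is the next unused label; with any other index the assignment could overwrite an existing row.

```python
import pandas as pd

# Hypothetical data with a default integer index (0 and 1).
df_demo = pd.DataFrame({"name": ["x", "y"], "age": [22, 25]})

# len(df_demo) is 2, which is not yet a row label, so loc adds a new row in place.
df_demo.loc[len(df_demo)] = ["z", 28]

print(df_demo.shape)
print(df_demo.loc[2, "name"])
```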

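Adding a column: assign the new column Series to the DataFrame, using the new column name as the column index.

# Use the new column name (考核结果, "review result") as a column index and assign a single value; every row gets the same value
df_info["考核结果"] = "合格"
# View the result
df_info

**When the values in the column are not all the same**
# Assign a new Series to the column under the new column name; values are matched by row index
df_info["考核结果"] = pd.Series(["合格", "良好", "优秀", "良好"])
# View the result
df_info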
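
When a Series is assigned as a column, pandas matches values by index label rather than by position. The sketch below, on made-up data, assigns a Series that covers only some of the row labels; the rows without a match end up as missing values.

```python
import pandas as pd

# Hypothetical data; the default row labels are 0, 1, 2.
df_demo = pd.DataFrame({"name": ["x", "y", "z"]})

# The Series only has labels 0 and 2, so row 1 has no matching value.
df_demo["result"] = pd.Series(["pass", "good"], index=[0, 2])

print(df_demo["result"].isna().tolist())  # True marks a row that received no value
```

This is why the Series in the example above needs one value per row of df_info.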

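Deleting rows or columns: use the drop method; the axis parameter controls whether a row or a column is deleted.

# labels is the name of the column to delete
# axis=1 means a column is being deleted
# inplace=True means the deletion takes effect directly on df_info
df_info.drop(labels="考核结果", axis=1, inplace=True)
# View the result
df_info

For comparison, deleting a row (here the row labeled 2):
df_info.drop(labels=2, axis=0, inplace=True)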
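
drop also accepts columns= and index= keywords as an alternative to labels plus axis, and without inplace=True it returns a new DataFrame and leaves the original untouched. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data for illustration.
df_demo = pd.DataFrame(
    {"name": ["x", "y", "z"], "age": [22, 25, 23], "dept": ["IT", "IT", "Fin"]}
)

df_no_col = df_demo.drop(columns="dept")  # equivalent to labels="dept", axis=1
df_no_row = df_demo.drop(index=1)         # equivalent to labels=1, axis=0

print(df_demo.shape, df_no_col.shape, df_no_row.shape)
print(list(df_no_row.index))  # the remaining row labels are not renumbered
```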

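Viewing and modifying a single cell: use the loc attribute; inside the square brackets, the first element is the row index and the second is the column index.

# View, then modify (籍贯 is the hometown column)
df_info.loc[1, "籍贯"]
df_info.loc[1, "籍贯"] = "广西"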
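
Note that loc looks rows up by label, not by position, which matters once rows have been deleted. The sketch below uses made-up data and contrasts loc with the position-based iloc.

```python
import pandas as pd

# Hypothetical data; after the drop the remaining row labels are 0 and 2.
df_demo = pd.DataFrame({"name": ["x", "y", "z"], "city": ["A", "B", "C"]})
df_demo = df_demo.drop(index=1)

by_label = df_demo.loc[2, "city"]  # row labeled 2
by_position = df_demo.iloc[1, 1]   # second remaining row, second column

# Both expressions point at the same cell; assignment through loc updates it.
df_demo.loc[2, "city"] = "D"
print(by_label, by_position, df_demo.loc[2, "city"])
```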

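Sorting: use the sort_values method; the by parameter specifies which column to sort on, and the ascending parameter chooses ascending or descending order.

# by specifies the sort key (ascending order by default)
df_rating.sort_values(by="rating", inplace=True)
# View the sorted DataFrame
df_rating
# For descending order, use ascending=False
df_rating.sort_values(by="rating", inplace=True, ascending=False)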
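
by and ascending can also take lists, which sorts on several columns with a separate direction for each. A small sketch on made-up data (the title and votes columns are assumptions for illustration), this time without inplace so that the original order is kept:

```python
import pandas as pd

# Hypothetical data: two rows share the same rating.
df_demo = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d"],
        "rating": [8.0, 9.0, 8.0, 7.5],
        "votes": [10, 5, 30, 2],
    }
)

# Sort by rating descending, then break ties by votes ascending.
df_sorted = df_demo.sort_values(by=["rating", "votes"], ascending=[False, True])

print(df_sorted["title"].tolist())
print(df_demo["title"].tolist())  # the original DataFrame is unchanged
```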

Taking the first N and last N rows: use the head and tail methods, with N passed as the argument.

# First 20 rows
df_rating.head(20)
# Last 20 rows
df_rating.tail(20)
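
Combined with sort_values, head gives a simple "top N" query. The sketch below uses a made-up rating column and also shows that head with no argument returns five rows.

```python
import pandas as pd

# Hypothetical data: ratings 0 through 9.
df_demo = pd.DataFrame({"rating": range(10)})

default_head = df_demo.head()  # N defaults to 5 when omitted
top3 = df_demo.sort_values(by="rating", ascending=False).head(3)
last2 = df_demo.tail(2)

print(len(default_head))
print(top3["rating"].tolist())
print(last2["rating"].tolist())
```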

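Getting the number of rows and columns: use the shape attribute.

# shape returns a tuple: the first element is the row count, the second is the column count
shape = df_rating.shape
# Print the row and column counts
print("rows:", shape[0])
print("columns:", shape[1])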
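
Because shape is an ordinary tuple, it can be unpacked in one step. The sketch below, on made-up data, also shows two related values: len of the DataFrame (the row count) and the size attribute (the total number of cells).

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df_demo.shape  # tuple unpacking
print(n_rows, n_cols)
print(len(df_demo))   # row count
print(df_demo.size)   # rows * columns
```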
