pandas Study Notes: DataFrame Indexing

奔跑的乌班

Category: Data Analysis. Tags: data analysis, pandas, indexing, multiple indexing

Article link: https://blog.csdn.net/u010199356/article/details/85696687

    # A DataFrame is a table with a row index and a column index; it can be viewed as a dict of Series that share one index
    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(np.random.rand(12).reshape(3,4) * 100, index = ["one","two","three"], columns = list("abcd"))
    print(df)
    print("-----------------")
    
    # Select columns
    data1 = df["a"]  # a single column label selects that column and returns a Series
    print(data1)
    print()
    data2 = df[["a", "d"]]
    print(data2)
    print()
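    # [] takes column labels, so columns are usually given explicit names rather than left as default integers, which avoids confusion with the row index
    data3 = df.loc["one"]
    print(data3)
    print(type(data3))  # a single row label returns a Series
    print()
    
    data4 = df.loc[["one","three"]]
    print(data4)
    print(type(data4))  # a list of row labels returns a DataFrame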
    
    data5 = df[:1]
    print(data5)
    print(type(data5))
    print()
    # print(df[0]) # KeyError: 0

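    # Summary: a slice inside df[] selects rows by position, but a single integer is treated
    #          as a column label, so df[0] raises KeyError here.
    #          Even a one-row slice returns a DataFrame; use loc[label] to get a single row as a Series.
    
    # Key point: df[col] is normally used to select columns; put column labels inside [].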
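
The behavior summarized above can be checked with a small sketch. It assumes fixed values from `np.arange` in place of the random data, so that the results are reproducible; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption, replacing the random data) for reproducibility
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=["one", "two", "three"], columns=list("abcd"))

col = df["a"]            # one column label -> Series
cols = df[["a", "d"]]    # list of column labels -> DataFrame
first_row = df[:1]       # slice -> rows, still a DataFrame

# A bare integer is looked up among the column labels, so it fails here
try:
    df[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(type(col).__name__, type(cols).__name__, type(first_row).__name__)
print(first_row.shape, lookup_failed)
```

Because the columns are named with strings, `df[0]` has no matching label and raises `KeyError`, while `df[:1]` is a slice and therefore addresses rows.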

df.loc[]: select rows by index label

df = pd.DataFrame(np.random.rand(16).reshape(4,4) * 100, index = ["one","two","three", "four"], columns = list("abcd"))
print(df)
print()
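print(df["a"])
print(df["a"]["one"])
print(df["a"].iloc[0])  # use iloc for positional access; df["a"][0] relies on a deprecated integer fallback
# print(df.loc[0])  # KeyError: 0 is not a row label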

print()
# 
print(df.loc["one"])
print(df.loc[["one","three"]])

print()
print(df.loc["one":"three"])  # label-based slice on the index; the end label is included
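
As a minimal sketch of the `loc` rules above, the following uses fixed values from `np.arange` (an assumption, in place of the random data) to confirm that a label slice includes its end label and that an integer which is not a row label is rejected.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption) so the selected cells are predictable
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["one", "two", "three", "four"], columns=list("abcd"))

rows = df.loc["one":"three"]    # label slice: "three" is included
by_label = df["a"]["one"]       # column first, then row label
by_position = df["a"].iloc[0]   # same cell, addressed by position

# 0 is not one of the row labels, so loc cannot find it
try:
    df.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

print(list(rows.index))
print(by_label == by_position, missing_label)
```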

df.iloc[]: select rows by integer position (from 0 to length-1 along the axis)

This works like list indexing: the order is the DataFrame's integer position, counted from 0.

df = pd.DataFrame(np.random.rand(16).reshape(4,4) * 100, index = ["one","two","three", "four"], columns = list("abcd"))
print(df)

print(df.iloc[0])
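print(df.iloc[[0,1,2]])  # select multiple rows; positions can be given in any order
print(df.iloc[[2,0,1]])
# print(df.iloc[4])  # IndexError: single positional indexer is out-of-bounds
# A single integer position beyond the last row cannot be selected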

print()

print(df.iloc[1:3])
# Slice by position; the end position is excluded
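
The positional rules above can be sketched as follows, again assuming fixed `np.arange` values instead of random data. The out-of-range slice is an extra case not shown in the original: slice bounds past the end are clipped, whereas a single out-of-range position raises.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption) so the selected rows are predictable
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["one", "two", "three", "four"], columns=list("abcd"))

middle = df.iloc[1:3]            # positions 1 and 2; position 3 is excluded
reordered = df.iloc[[2, 0, 1]]   # rows come back in the requested order
clipped = df.iloc[2:10]          # slice bounds past the end are clipped

# A single position past the last row raises IndexError
try:
    df.iloc[4]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(list(middle.index), list(reordered.index), list(clipped.index))
print(out_of_bounds)
```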

Boolean indexing on a DataFrame

df = pd.DataFrame(np.random.rand(16).reshape(4,4) * 100, index = ["one","two","three", "four"], columns = list("abcd"))
print(df)
print("----------------------")

b1 = df < 20
print(b1, type(b1))
print(df[b1])
print()
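# Without selecting a subset first, the comparison is applied to every value
# The result keeps the full shape: True keeps the original value, False becomes NaN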

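print(df[df[["a"]]>50])
# Test a single column
# The result keeps the full shape: True keeps the original value, False becomes NaN (columns outside the mask are all NaN)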

print()
print(df[df[["a","b"]]>50])


print()
print(df[df.loc[["one","three"]]>50])
# Test multiple rows (selected by index label)
# The result keeps the full shape: True keeps the original value, False becomes NaN

print()
print(df[df.iloc[[1,3]]>50])

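Multiple indexing: for example, indexing rows and columns together

Select columns first, then rows. This amounts to picking the fields first and then picking the records.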

df = pd.DataFrame(np.random.rand(16).reshape(4,4) * 100, index = ["one","two","three", "four"], columns = list("abcd"))
print(df)
print("----------------------")
print(df["a"].loc[["one","four"]])   # returns a Series
print()
print(df[["a","d"]].loc[["one","four"]])  # returns a DataFrame
print()
print(df[df>50].iloc[::2])

print()
# print(df[["a","b","y"]]) # KeyError: "['y'] not in index"
# print(df.loc[["p"]])  # Both lines raise KeyError: the DataFrame already exists, and indexing cannot reference labels it does not contain

For multiple indexing, readers can experiment with further combinations; they are not covered here.
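
One such combination, offered here as a sketch and not as part of the original notes, is to pass the row and column selectors to `loc` or `iloc` in a single call. With fixed values (an assumption), it selects the same cells as the chained form used above.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption) so the three results can be compared exactly
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["one", "two", "three", "four"], columns=list("abcd"))

chained = df[["a", "d"]].loc[["one", "four"]]    # columns first, then rows
one_step = df.loc[["one", "four"], ["a", "d"]]   # rows and columns in one call
by_position = df.iloc[[0, 3], [0, 3]]            # same cells by integer position

print(chained.equals(one_step), one_step.equals(by_position))
```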