Pandas Data Structures: DataFrame

Latest recommended article published on 2023-06-08 22:50:51

weixin_39604819


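If you are new to pandas and want to understand its data structures, DataFrame is the one you have to know. Here is a detailed walkthrough.

A first look at DataFrame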

import numpy as np
import pandas as pd

data = {"name": ["Jack", "Tom", "LiSa"],
        "age": [20, 21, 18],
        "city": ["BeiJing", "TianJin", "ShenZhen"]}
print(data)
print("")
frame = pd.DataFrame(data)  # create the DataFrame
print(frame)
print("")
print(frame.index)  # row index
print("")
print(frame.columns)  # column index
print("")
print(frame.values)  # underlying values

{'name': ['Jack', 'Tom', 'LiSa'], 'age': [20, 21, 18], 'city': ['BeiJing', 'TianJin', 'ShenZhen']}

   age      city  name
0   20   BeiJing  Jack
1   21   TianJin   Tom
2   18  ShenZhen  LiSa

RangeIndex(start=0, stop=3, step=1)

Index(['age', 'city', 'name'], dtype='object')

[[20 'BeiJing' 'Jack']
 [21 'TianJin' 'Tom']
 [18 'ShenZhen' 'LiSa']]
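The output above lists the columns alphabetically, which is how older pandas releases handled dict input. As a minimal sketch, assuming a recent pandas release on Python 3.7 or later, the columns should instead follow the insertion order of the dict; passing `columns=` fixes the order explicitly in either case. The variable name `ordered` is only illustrative.

```python
import pandas as pd

data = {"name": ["Jack", "Tom", "LiSa"],
        "age": [20, 21, 18],
        "city": ["BeiJing", "TianJin", "ShenZhen"]}
frame = pd.DataFrame(data)

# Assumption: a recent pandas keeps the dict's insertion order for the columns
print(list(frame.columns))

# shape is (number of rows, number of columns)
print(frame.shape)

# Passing columns= sets the column order explicitly
ordered = pd.DataFrame(data, columns=["age", "city", "name"])
print(list(ordered.columns))
```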

Creating a DataFrame

Method 1: create from a dict. The dict keys become the column labels, and the values can be:

1. a list

2. an ndarray

3. a Series

# Values are ndarrays. Note: all arrays must have the same length, otherwise an error is raised
data2 = {"one": np.random.rand(3),
         "two": np.random.rand(3)
        }
print(data2)
print("")
print(pd.DataFrame(data2))

{'one': array([ 0.60720023,  0.30838024,  0.30678266]), 'two': array([ 0.21368784,  0.03797809,  0.41698718])}

        one       two
0  0.607200  0.213688
1  0.308380  0.037978
2  0.306783  0.416987
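To illustrate the length requirement for ndarray values, here is a small sketch that deliberately passes arrays of different lengths and catches the resulting error. The names `mismatched` and `raised` are made up for this example.

```python
import numpy as np
import pandas as pd

# The two arrays have different lengths on purpose
mismatched = {"one": np.random.rand(3),
              "two": np.random.rand(4)}

try:
    pd.DataFrame(mismatched)
    raised = False
except ValueError as exc:
    # pandas refuses to build a DataFrame from arrays of unequal length
    raised = True
    print("ValueError:", exc)
```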

# Values are Series (labeled one-dimensional arrays). Note: the Series may have different lengths; missing entries are filled with NaN
data3 = {"one": pd.Series(np.random.rand(4)),
         "two": pd.Series(np.random.rand(5))
        }
print(data3)
print("")
df3 = pd.DataFrame(data3)
print(df3)
print("")

{'one': 0    0.217639
1    0.921641
2    0.898810
3    0.933510
dtype: float64, 'two': 0    0.132789
1    0.099904
2    0.723495
3    0.719173
4    0.477456
dtype: float64}

        one       two
0  0.217639  0.132789
1  0.921641  0.099904
2  0.898810  0.723495
3  0.933510  0.719173
4       NaN  0.477456
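Series of different lengths work because pandas aligns them by index label rather than by position. The sketch below uses two Series with hand-picked labels (an assumption made for illustration) to show the alignment.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([3.0, 4.0], index=["b", "c"])

# Rows are the union of both indexes; labels missing from a Series become NaN
aligned = pd.DataFrame({"one": s1, "two": s2})
print(aligned)
```

Only label "b" exists in both Series, so it is the only row with no NaN.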


Method 2: create directly from a 2D array
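A minimal sketch of building a DataFrame from a 2D array. The array contents and the labels are chosen only for illustration.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)  # 2 rows, 3 columns

# Without labels, rows and columns are numbered from 0
df_default = pd.DataFrame(arr)
print(df_default)

# index= and columns= must match the array's shape (2 rows, 3 columns here)
df_labeled = pd.DataFrame(arr, index=["a", "b"], columns=["one", "two", "three"])
print(df_labeled)
```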

Method 3: create from a list of dicts

data = [{"one": 1, "two": 2}, {"one": 5, "two": 10, "three": 15}]  # each dict becomes one row of the DataFrame
print(data)
print("")
df1 = pd.DataFrame(data)
print(df1)
print("")
df2 = pd.DataFrame(data, index=list("ab"), columns=["one", "two", "three", "four"])
print(df2)

[{'one': 1, 'two': 2}, {'one': 5, 'two': 10, 'three': 15}]

   one  three  two
0    1    NaN    2
1    5   15.0   10

   one  two  three  four
a    1    2    NaN   NaN
b    5   10   15.0   NaN

Method 4: create from a dict of dicts

# The outer dict keys become the columns; the inner dict keys become the index
data = {"Jack": {"age":1, "country":"China", "sex":"man"},
        "LiSa": {"age":18, "country":"America", "sex":"women"},
        "Tom": {"age":20, "country":"English"}}
df1 = pd.DataFrame(data)
print(df1)
print("")
# Note: index= cannot rename the inner dict keys (row labels), but it can reorder them; a label missing from the data is filled with NaN
df2 = pd.DataFrame(data, index=["sex", "age", "country"])
print(df2)
print("")
df3 = pd.DataFrame(data, index=list("abc"))
print(df3)
print("")
# columns= reorders the column labels; a column missing from the data is filled with NaN
df4 = pd.DataFrame(data, columns=["Tom", "LiSa", "Jack", "TangMu"])
print(df4)

          Jack     LiSa      Tom
age          1       18       20
country  China  America  English
sex        man    women      NaN

          Jack     LiSa      Tom
sex        man    women      NaN
age          1       18       20
country  China  America  English

   Jack  LiSa  Tom
a   NaN   NaN  NaN
b   NaN   NaN  NaN
c   NaN   NaN  NaN

             Tom     LiSa   Jack TangMu
age           20       18      1    NaN
country  English  America  China    NaN
sex          NaN    women    man    NaN
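Since `index=` in the constructor only selects and reorders existing labels, renaming has to happen afterwards. One possible way, sketched here with a shortened version of the data, is `DataFrame.rename`; the new labels "a" and "b" are arbitrary.

```python
import pandas as pd

data = {"Jack": {"age": 1, "country": "China"},
        "Tom": {"age": 20, "country": "English"}}
df = pd.DataFrame(data)

# rename returns a new DataFrame; the original keeps its labels
renamed = df.rename(index={"age": "a", "country": "b"})
print(renamed)
print(df)
```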

DataFrame indexing

Selecting rows and columns

To select a column, use df["column label"] directly.

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ["one", "two", "three"], columns = ["a", "b", "c", "d"])
print(df)
print("")
print(df["a"], "  ", type(df["a"]))  # select one column
print("")
print(df[["a", "c"]], "  ", type(df[["a", "c"]]))  # select several columns

               a          b          c          d
one    92.905464  11.630358  19.518051  77.417377
two    91.107357   0.641600   4.913662  65.593182
three   3.152801  42.324671  14.030304  22.138608

one      92.905464
two      91.107357
three     3.152801
Name: a, dtype: float64    <class 'pandas.core.series.Series'>

               a          c
one    92.905464  19.518051
two    91.107357   4.913662
three   3.152801  14.030304    <class 'pandas.core.frame.DataFrame'>

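Rows cannot be selected with a plain label lookup such as df["one"], because that syntax looks up a column. Use df.loc["one"] instead: loc selects by label, with the row label first.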

print(df)
print("")
print(df.loc["one"], " ", type(df.loc["one"]))  # select one row
print("")
print(df.loc[["one", "three"]], " ", type(df.loc[["one", "three"]]))  # select several non-adjacent rows
print("")

               a          b          c          d
one    92.905464  11.630358  19.518051  77.417377
two    91.107357   0.641600   4.913662  65.593182
three   3.152801  42.324671  14.030304  22.138608

a    92.905464
b    11.630358
c    19.518051
d    77.417377
Name: one, dtype: float64   <class 'pandas.core.series.Series'>

               a          b          c          d
one    92.905464  11.630358  19.518051  77.417377
three   3.152801  42.324671  14.030304  22.138608   <class 'pandas.core.frame.DataFrame'>
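As a quick check of the rule that plain brackets look up columns, the sketch below tries a row label in `df[...]` and catches the resulting `KeyError`. The integer data and the flag `found` are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=["one", "two", "three"], columns=["a", "b", "c", "d"])

try:
    df["one"]          # "one" is a row label, not a column label
    found = True
except KeyError:
    found = False      # plain brackets only search the column labels

row = df.loc["one"]    # label-based row selection returns a Series
print(found)
print(row)
```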

loc supports slicing on row labels, and the end label is included: df.loc["one": "three"]

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100, index=["one", "two", "three", "four"],
                  columns=["a", "b", "c", "d"])
print(df)
print("")
print(df.loc["one": "three"])
print("")
print(df[: 3])  # a plain slice selects consecutive rows (best avoided, to prevent confusion)

               a          b          c          d
one    65.471894  19.137274  31.680635  41.659808
two    31.570587  45.575849  37.739644   5.140845
three  54.930986  68.232707  17.215544  70.765401
four   45.591798  63.274956  74.056045   2.466652

               a          b          c          d
one    65.471894  19.137274  31.680635  41.659808
two    31.570587  45.575849  37.739644   5.140845
three  54.930986  68.232707  17.215544  70.765401

               a          b          c          d
one    65.471894  19.137274  31.680635  41.659808
two    31.570587  45.575849  37.739644   5.140845
three  54.930986  68.232707  17.215544  70.765401

iloc also operates on rows, but it takes integer positions instead of row labels, and its slices exclude the end position.

print(df)
print("")
print(df.iloc[0])  # select one row
print("")
print(df.iloc[[0,2]])  # select several non-adjacent rows
print("")
print(df.iloc[0:3])  # end position excluded

               a          b          c          d
one    65.471894  19.137274  31.680635  41.659808
two    31.570587  45.575849  37.739644   5.140845
three  54.930986  68.232707  17.215544  70.765401
four   45.591798  63.274956  74.056045   2.466652

a    65.471894
b    19.137274
c    31.680635
d    41.659808
Name: one, dtype: float64

               a          b          c          d
one    65.471894  19.137274  31.680635  41.659808
three  54.930986  68.232707  17.215544  70.765401

               a          b          c          d
one    65.471894  19.137274  31.680635  41.659808
two    31.570587  45.575849  37.739644   5.140845
three  54.930986  68.232707  17.215544  70.765401
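The difference between the inclusive label slice of `loc` and the exclusive positional slice of `iloc` can be checked side by side. This sketch also shows that both accept a second argument for columns; the integer data is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["one", "two", "three", "four"],
                  columns=["a", "b", "c", "d"])

by_label = df.loc["one":"three"]   # label slice: "three" is included
by_position = df.iloc[0:3]         # position slice: position 3 is excluded

# Rows and columns can be selected in a single call
label_block = df.loc["one":"two", ["a", "c"]]
position_block = df.iloc[0:2, [0, 2]]

print(by_label.equals(by_position))
print(label_block.equals(position_block))
```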

Boolean indexing

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100, index=["one", "two", "three", "four"],
                  columns=["a", "b", "c", "d"])
print(df)
print("")
d1 = df >50  # d1 is a boolean mask
print(d1)
print("")
print(df[d1])  # keeps only the values where d1 is True; False positions become NaN
print("")

               a          b          c          d
one    91.503673  74.080822  85.274682  80.788609
two    49.670055  42.221393  36.674490  69.272958
three  78.349843  68.090150  22.326223  93.984369
four   79.057146  77.687246  32.304265   0.567816

           a      b      c      d
one     True   True   True   True
two    False  False  False   True
three   True   True  False   True
four    True   True  False  False

               a          b          c          d
one    91.503673  74.080822  85.274682  80.788609
two          NaN        NaN        NaN  69.272958
three  78.349843  68.090150        NaN  93.984369
four   79.057146  77.687246        NaN        NaN

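Using a single column as the boolean index returns every column of the rows where the condition is True. Note: a comparison on several columns cannot be used to filter rows this way, because it produces an element-wise mask instead.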

# Cast after construction: newer pandas versions may not truncate floats when dtype= is passed to the constructor
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100, index=["one", "two", "three", "four"],
                  columns=["a", "b", "c", "d"]).astype(np.int64)
print(df)
print("")
d2 = df["b"] > 50
print(d2)
print("")
print(df[d2])

        a   b   c   d
one    27  18  47  61
two    26  35  16  78
three  80  98  94  41
four   85   3  47  90

one      False
two      False
three     True
four     False
Name: b, dtype: bool

        a   b   c   d
three  80  98  94  41

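Using several columns as the boolean index returns the values where the mask is True. False positions become NaN, and columns that are not part of the mask are filled entirely with NaN.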

# Cast after construction: newer pandas versions may not truncate floats when dtype= is passed to the constructor
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100, index=["one", "two", "three", "four"],
                  columns=["a", "b", "c", "d"]).astype(np.int64)
print(df)
print("")
d3 = df[["a", "c"]] > 50
print(d3)
print("")
print(df[d3])

        a   b   c   d
one    49  82  32  39
two    78   2  24  84
three   6  84  84  69
four   21  89  16  77

           a      c
one    False  False
two     True  False
three  False   True
four   False  False

          a   b     c   d
one     NaN NaN   NaN NaN
two    78.0 NaN   NaN NaN
three   NaN NaN  84.0 NaN
four    NaN NaN   NaN NaN
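To filter whole rows on conditions involving more than one column, the per-column boolean Series can be combined with `&` (and) or `|` (or), or a multi-column mask can be reduced to one value per row with `.any(axis=1)`. The sketch below reuses the integer values from the output above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [49, 78, 6, 21],
                   "b": [82, 2, 84, 89],
                   "c": [32, 24, 84, 16],
                   "d": [39, 84, 69, 77]},
                  index=["one", "two", "three", "four"])

# Each comparison needs its own parentheses when combined with | or &
either = df[(df["a"] > 50) | (df["c"] > 50)]

# Same filter: collapse the two-column mask to one boolean per row
any_big = df[(df[["a", "c"]] > 50).any(axis=1)]

# Rows where both conditions hold
both = df[(df["b"] > 50) & (df["d"] > 50)]

print(either)
print(both)
```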

Chained indexing (combining several selections)

print(df)

        a   b   c   d
one    49  82  32  39
two    78   2  24  84
three   6  84  84  69
four   21  89  16  77

print(df["a"].loc[["one", "three"]])  # select the column, then the rows
print("")
print(df[["a", "c"]].iloc[0:3])

one      49
three     6
Name: a, dtype: int64

        a   c
one    49  32
two    78  24
three   6  84

print(df.loc[["one", "three"]][["a", "c"]])  # select the rows, then the columns

        a   c
one    49  32
three   6  84

print(df > 50)
print("")
print(df[df>50])
print("")
print(df[df>50][["a","b"]])

           a      b      c      d
one    False   True  False  False
two     True  False  False   True
three  False   True   True   True
four   False   True  False   True

          a     b     c     d
one     NaN  82.0   NaN   NaN
two    78.0   NaN   NaN  84.0
three   NaN  84.0  84.0  69.0
four    NaN  89.0   NaN  77.0

          a     b
one     NaN  82.0
two    78.0   NaN
three   NaN  84.0
four    NaN  89.0
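Chaining two selections works for reading, but the same result can usually be obtained with a single `loc` call that takes rows and columns together. A single call is also the safer choice when assigning, since an assignment through a chained selection may end up modifying a temporary copy. A sketch, reusing the values printed above:

```python
import pandas as pd

df = pd.DataFrame({"a": [49, 78, 6, 21],
                   "b": [82, 2, 84, 89],
                   "c": [32, 24, 84, 16],
                   "d": [39, 84, 69, 77]},
                  index=["one", "two", "three", "four"])

chained = df.loc[["one", "three"]][["a", "c"]]   # rows first, then columns
single = df.loc[["one", "three"], ["a", "c"]]    # rows and columns in one call
print(chained.equals(single))

# Assignment through one loc call: set column "a" to 0 where it exceeds 50
df.loc[df["a"] > 50, "a"] = 0
print(df)
```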

Basic DataFrame techniques

import numpy as np
import pandas as pd

arr = np.random.rand(16).reshape(8, 2)*10
# print(arr)
print("")
print(len(arr))
print("")
df = pd.DataFrame(arr, index=[chr(i) for i in range(97, 97+len(arr))], columns=["one", "two"])
print(df)

8

        one       two
a  2.129959  1.827002
b  8.631212  0.423903
c  6.262012  3.851107
d  6.890305  9.543065
e  6.883742  3.643955
f  2.740878  6.851490
g  6.242513  7.402237
h  9.226572  3.179664

Viewing data

print(df)
print("")
print(df.head(2))  # first rows; 5 by default
print("")
print(df.tail(3))  # last rows; 5 by default

        one       two
a  2.129959  1.827002
b  8.631212  0.423903
c  6.262012  3.851107
d  6.890305  9.543065
e  6.883742  3.643955
f  2.740878  6.851490
g  6.242513  7.402237
h  9.226572  3.179664

        one       two
a  2.129959  1.827002
b  8.631212  0.423903

        one       two
f  2.740878  6.851490
g  6.242513  7.402237
h  9.226572  3.179664

Transpose

print(df)

        one       two
a  2.129959  1.827002
b  8.631212  0.423903
c  6.262012  3.851107
d  6.890305  9.543065
e  6.883742  3.643955
f  2.740878  6.851490
g  6.242513  7.402237
h  9.226572  3.179664

print(df.T)

            a         b         c         d         e         f         g  \
one  2.129959  8.631212  6.262012  6.890305  6.883742  2.740878  6.242513
two  1.827002  0.423903  3.851107  9.543065  3.643955  6.851490  7.402237

            h
one  9.226572
two  3.179664

Adding and modifying

df = pd.DataFrame(np.random.rand(16).reshape(4,4),index=["one", "two", "three", "four"], columns=["a", "b", "c", "d"])
print(df)
print("")
df.loc["five"] = 100  # add a row
print(df)
print("")
df["e"] = 10  # add a column
print(df)
print("")
df["e"] = 101  # modify a column
print(df)
print("")
df.loc["five"] = 111  # modify a row
print(df)
print("")

              a         b         c         d
one    0.708481  0.285426  0.355058  0.990070
two    0.199559  0.733047  0.322982  0.791169
three  0.198043  0.801163  0.356082  0.857501
four   0.430182  0.020549  0.896011  0.503088

                a           b           c           d
one      0.708481    0.285426    0.355058    0.990070
two      0.199559    0.733047    0.322982    0.791169
three    0.198043    0.801163    0.356082    0.857501
four     0.430182    0.020549    0.896011    0.503088
five   100.000000  100.000000  100.000000  100.000000

                a           b           c           d   e
one      0.708481    0.285426    0.355058    0.990070  10
two      0.199559    0.733047    0.322982    0.791169  10
three    0.198043    0.801163    0.356082    0.857501  10
four     0.430182    0.020549    0.896011    0.503088  10
five   100.000000  100.000000  100.000000  100.000000  10

                a           b           c           d    e
one      0.708481    0.285426    0.355058    0.990070  101
two      0.199559    0.733047    0.322982    0.791169  101
three    0.198043    0.801163    0.356082    0.857501  101
four     0.430182    0.020549    0.896011    0.503088  101
five   100.000000  100.000000  100.000000  100.000000  101

                a           b           c           d    e
one      0.708481    0.285426    0.355058    0.990070  101
two      0.199559    0.733047    0.322982    0.791169  101
three    0.198043    0.801163    0.356082    0.857501  101
four     0.430182    0.020549    0.896011    0.503088  101
five   111.000000  111.000000  111.000000  111.000000  111
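The examples above assign one scalar to a whole row or column. The same syntax also accepts per-cell values. Here is a small sketch with integer data chosen so the results are easy to verify by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  index=["one", "two"], columns=["a", "b", "c", "d"])

# New column computed element-wise from existing columns
df["e"] = df["a"] + df["b"]

# New row with one value per column (the list length must match the column count)
df.loc["three"] = [10, 20, 30, 40, 50]

print(df)
```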

Deleting: del (deletes a column in place) / drop (deletes rows by default; pass axis=1 to delete columns)

df = pd.DataFrame(np.random.rand(16).reshape(4,4),index=["one", "two", "three", "four"], columns=["a", "b", "c", "d"])
print(df)
print("")
del df["a"]  # delete a column; modifies the original DataFrame
print(df)

              a         b         c         d
one    0.339979  0.577661  0.108308  0.482164
two    0.374043  0.102067  0.660970  0.786986
three  0.384832  0.076563  0.529472  0.358780
four   0.938592  0.852895  0.466709  0.938307

              b         c         d
one    0.577661  0.108308  0.482164
two    0.102067  0.660970  0.786986
three  0.076563  0.529472  0.358780
four   0.852895  0.466709  0.938307

df = pd.DataFrame(np.random.rand(16).reshape(4,4),index=["one", "two", "three", "four"], columns=["a", "b", "c", "d"])
print(df)
print("")
d1 = df.drop("one")  # delete a row; returns a new DataFrame and leaves the original unchanged
print(d1)
print("")
print(df)

              a         b         c         d
one    0.205438  0.324132  0.401131  0.368300
two    0.471426  0.671785  0.837956  0.097416
three  0.888816  0.451950  0.137032  0.568844
four   0.524813  0.448306  0.875787  0.479477

              a         b         c         d
two    0.471426  0.671785  0.837956  0.097416
three  0.888816  0.451950  0.137032  0.568844
four   0.524813  0.448306  0.875787  0.479477

              a         b         c         d
one    0.205438  0.324132  0.401131  0.368300
two    0.471426  0.671785  0.837956  0.097416
three  0.888816  0.451950  0.137032  0.568844
four   0.524813  0.448306  0.875787  0.479477

df = pd.DataFrame(np.random.rand(16).reshape(4,4),index=["one", "two", "three", "four"], columns=["a", "b", "c", "d"])
print(df)
print("")
d2 = df.drop("a", axis=1)  # delete a column; returns a new DataFrame and leaves the original unchanged
print(d2)
print("")
print(df)

              a         b         c         d
one    0.939552  0.613218  0.357056  0.534264
two    0.110583  0.602123  0.990186  0.149132
three  0.756016  0.897848  0.176100  0.204789
four   0.655573  0.819009  0.094322  0.656406

              b         c         d
one    0.613218  0.357056  0.534264
two    0.602123  0.990186  0.149132
three  0.897848  0.176100  0.204789
four   0.819009  0.094322  0.656406

              a         b         c         d
one    0.939552  0.613218  0.357056  0.534264
two    0.110583  0.602123  0.990186  0.149132
three  0.756016  0.897848  0.176100  0.204789
four   0.655573  0.819009  0.094322  0.656406
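As a complement to the `axis=1` form, `drop` also accepts the keyword arguments `index=` and `columns=`, which some readers find easier to follow, and it takes lists of labels. The sketch below also shows `inplace=True`, which modifies the original instead of returning a copy. The data is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["one", "two", "three", "four"],
                  columns=["a", "b", "c", "d"])

fewer_rows = df.drop(index=["one", "two"])   # drop several rows by label
fewer_cols = df.drop(columns=["a", "d"])     # drop several columns by label

# inplace=True changes df itself and returns None
df.drop(columns="b", inplace=True)

print(fewer_rows)
print(fewer_cols)
print(df)
```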

Sorting

Sort by the values of a given column; each row moves as a whole along with its value: .sort_values(['column'])

# Single column
df = pd.DataFrame(np.random.rand(16).reshape(4,4), columns=["a", "b", "c", "d"])
print(df)
print("")
print(df.sort_values(['a']))  # ascending by default
print("")
print(df.sort_values(['a'], ascending=False))  # descending

          a         b         c         d
0  0.616386  0.416094  0.072445  0.140167
1  0.263227  0.079205  0.520708  0.866316
2  0.665673  0.836688  0.733966  0.310229
3  0.405777  0.090530  0.991211  0.712312

          a         b         c         d
1  0.263227  0.079205  0.520708  0.866316
3  0.405777  0.090530  0.991211  0.712312
0  0.616386  0.416094  0.072445  0.140167
2  0.665673  0.836688  0.733966  0.310229

          a         b         c         d
2  0.665673  0.836688  0.733966  0.310229
0  0.616386  0.416094  0.072445  0.140167
3  0.405777  0.090530  0.991211  0.712312
1  0.263227  0.079205  0.520708  0.866316
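Sorting by more than one column follows the same pattern: pass a list of column names, and optionally a list of booleans to `ascending` with one entry per column. The sketch below uses small hand-written data with ties in column "a" so the effect of the second key is visible.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 1, 2, 1],
                   "b": [0.3, 0.9, 0.7, 0.1]})

# Sort by "a" ascending; break ties by "b" descending
result = df.sort_values(["a", "b"], ascending=[True, False])
print(result)
```

The original row labels travel with their rows, so the index of `result` shows the new order.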

Sort by index: .sort_index()

df = pd.DataFrame(np.random.rand(16).reshape(4,4), index=[2,1,3,0], columns=["a", "b", "c", "d"])
print(df)
print("")
print(df.sort_index())  # ascending by default
print("")
print(df.sort_index(ascending=False))  # descending

          a         b         c         d
2  0.669311  0.118176  0.635512  0.248388
1  0.752321  0.935779  0.572554  0.274019
3  0.701334  0.354684  0.592998  0.402686
0  0.548317  0.966295  0.191219  0.307908

          a         b         c         d
0  0.548317  0.966295  0.191219  0.307908
1  0.752321  0.935779  0.572554  0.274019
2  0.669311  0.118176  0.635512  0.248388
3  0.701334  0.354684  0.592998  0.402686

          a         b         c         d
3  0.701334  0.354684  0.592998  0.402686
2  0.669311  0.118176  0.635512  0.248388
1  0.752321  0.935779  0.572554  0.274019
0  0.548317  0.966295  0.191219  0.307908

df = pd.DataFrame(np.random.rand(16).reshape(4,4), index=["x", "z", "y", "t"], columns=["a", "b", "c", "d"])
print(df)
print("")
print(df.sort_index())  # sorted alphabetically

          a         b         c         d
x  0.717421  0.206383  0.757656  0.720580
z  0.969988  0.551812  0.210200  0.083031
y  0.956637  0.759216  0.350744  0.335287
t  0.846718  0.207411  0.936231  0.891330

          a         b         c         d
t  0.846718  0.207411  0.936231  0.891330
x  0.717421  0.206383  0.757656  0.720580
y  0.956637  0.759216  0.350744  0.335287
z  0.969988  0.551812  0.210200  0.083031

df = pd.DataFrame(np.random.rand(16).reshape(4,4), index=["three", "one", "four", "two"], columns=["a", "b", "c", "d"])
print(df)
print("")
print(df.sort_index())  # string labels are sorted alphabetically, not by the numbers they spell

              a         b         c         d
three  0.173818  0.902347  0.106037  0.303450
one    0.591793  0.526785  0.101916  0.884698
four   0.685250  0.364044  0.932338  0.668774
two    0.240763  0.260322  0.722891  0.634825

              a         b         c         d
four   0.685250  0.364044  0.932338  0.668774
one    0.591793  0.526785  0.101916  0.884698
three  0.173818  0.902347  0.106037  0.303450
two    0.240763  0.260322  0.722891  0.634825
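Two related points can be sketched briefly: `sort_index(axis=1)` sorts the column labels instead of the row labels, and string labels that look like numbers are still compared character by character. The labels below are chosen to make that visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=["9", "10"], columns=["c", "a", "b"])

by_cols = df.sort_index(axis=1)   # sort the column labels
by_rows = df.sort_index()         # sort the row labels (strings here)

print(by_cols)
print(by_rows)  # "10" sorts before "9" because "1" < "9" as characters
```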
