Pandas-task02

Latest recommended article published 2024-05-19 23:31:32

Python有温度

Category: pandas  Tags: python

Article link: https://blog.csdn.net/weixin_46095673/article/details/111410085

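I. Reading Files

pandas can read many file formats, but three readers are worth knowing well:

import pandas as pd
df_csv = pd.read_csv('data.csv')
df_table = pd.read_table('table.txt')
df_excel = pd.read_excel('excel.xlsx')

Note: depending on how the index and column names are laid out in a file, you will often need a few parameters when reading. header=None means the first row is not used as column names, index_col chooses which column(s) become the index, and nrows limits how many rows are read. The examples below also use usecols to read only some columns and parse_dates to parse a column as dates.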

In [10]: pd.read_table('data/my_table.txt', header=None)
Out[10]: 
      0     1     2                3
0  col1  col2  col3             col4
1     2     a   1.4   apple 2020/1/1
2     3     b   3.4  banana 2020/1/2
3     6     c   2.5  orange 2020/1/5
4     5     d   3.2   lemon 2020/1/7

In [11]: pd.read_csv('data/my_csv.csv', index_col=['col1', 'col2'])
Out[11]: 
           col3    col4      col5
col1 col2                        
2    a      1.4   apple  2020/1/1
3    b      3.4  banana  2020/1/2
6    c      2.5  orange  2020/1/5
5    d      3.2   lemon  2020/1/7

In [12]: pd.read_table('data/my_table.txt', usecols=['col1', 'col2'])
Out[12]: 
   col1 col2
0     2    a
1     3    b
2     6    c
3     5    d

In [13]: pd.read_csv('data/my_csv.csv', parse_dates=['col5'])
Out[13]: 
   col1 col2  col3    col4       col5
0     2    a   1.4   apple 2020-01-01
1     3    b   3.4  banana 2020-01-02
2     6    c   2.5  orange 2020-01-05
3     5    d   3.2   lemon 2020-01-07

In [14]: pd.read_excel('data/my_excel.xlsx', nrows=2)
Out[14]: 
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
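The parameters above can be tried without any files on disk. The following is a minimal sketch that assumes a small in-memory CSV (the `csv_text` string is made up for illustration and is not the `data/my_csv.csv` file used above) and reads it through `io.StringIO`:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV, used only for illustration
csv_text = "col1,col2,col3\n2,a,1.4\n3,b,3.4\n6,c,2.5\n"

# header=None: the first row is treated as data, columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# index_col: use col1 as the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col="col1")

# nrows: read only the first two data rows
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# usecols: read only the listed columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=["col1", "col2"])

print(no_header.shape)
print(indexed.index.name)
print(len(first_two))
print(list(subset.columns))
```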

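In addition, txt files often use a delimiter other than the default (read_table splits on tabs by default). In that case, set the sep parameter of read_table(). Without it, the file is read like this: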

In [15]: pd.read_table('data/my_table_special_sep.txt')
Out[15]: 
              col1 |||| col2
0  TS |||| This is an apple.
1  GQ |||| My name is Bob.
2  WT |||| Well done!
3   PT |||| May I help you?

With the separator specified, the result is:

In [16]: pd.read_table('data/my_table_special_sep.txt',
   ....:               sep=r' \|\|\|\| ', engine='python')
   ....: 
Out[16]: 
  col1               col2
0   TS  This is an apple.
1   GQ    My name is Bob.
2   WT         Well done!
3   PT    May I help you?
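As a self-contained sketch of the same idea, the snippet below reads a small in-memory text (made up for illustration) with a regular-expression separator. A raw string is used so that the backslashes reach the regex engine unchanged, and `engine='python'` is passed because multi-character regex separators are handled by the Python parser:

```python
import io

import pandas as pd

# Hypothetical text using " |||| " as the column delimiter
text = (
    "col1 |||| col2\n"
    "TS |||| This is an apple.\n"
    "GQ |||| My name is Bob.\n"
)

# Each "|" is escaped because it is a special character in regular expressions
df_sep = pd.read_table(io.StringIO(text), sep=r" \|\|\|\| ", engine="python")

print(df_sep)
```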

II. Saving Files

When writing data to a table file, you usually do not want to save the index. That is done as follows:

In [17]: df_csv.to_csv('data/my_csv_saved.csv', index=False)

In [18]: df_excel.to_excel('data/my_excel_saved.xlsx', index=False)

Here index=False tells pandas not to write the index to the file.

pandas has no to_table function. To save data as a txt file, use to_csv() with a custom separator, for example sep='\t'.
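A minimal sketch of saving tab-separated text, assuming a small made-up DataFrame and writing to an in-memory buffer instead of a real file path:

```python
import io

import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"col1": [2, 3], "col2": ["a", "b"]})

# Write tab-separated text without the index column
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)
text = buf.getvalue()

# read_table splits on tabs by default, so it can read the text back
restored = pd.read_table(io.StringIO(text))

print(text)
print(restored)
```

Passing a file name such as `'data/my_table_saved.txt'` (a hypothetical path) instead of `buf` would write the same content to disk.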

III. The Two Core pandas Data Structures: Series and DataFrame

1. A Series consists of four parts: data, index, dtype, and name.

In [22]: s = pd.Series(data = [100, 'a', {'dic1':5}],
   ....:               index = pd.Index(['id1', 20, 'third'], name='my_idx'),
   ....:               dtype = 'object',
   ....:               name = 'my_name')
   ....: 

In [23]: s
Out[23]: 
my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object

In [24]: s.values
Out[24]: array([100, 'a', {'dic1': 5}], dtype=object)

In [25]: s.index
Out[25]: Index(['id1', 20, 'third'], dtype='object', name='my_idx')

In [26]: s.dtype
Out[26]: dtype('O')

In [27]: s.name
Out[27]: 'my_name'

2. A DataFrame differs from a Series in that it adds a column index, so the data goes from one dimension to two.

In [30]: data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]

In [31]: df = pd.DataFrame(data = data,
   ....:                   index = ['row_%d'%i for i in range(3)],
   ....:                   columns=['col_0', 'col_1', 'col_2'])
   ....: 

In [32]: df
Out[32]: 
       col_0 col_1  col_2
row_0      1     a    1.2
row_1      2     b    2.2
row_2      3     c    3.2
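To make the one-dimensional versus two-dimensional distinction concrete, here is a short sketch that rebuilds the same table and inspects its shape. The variable names are chosen for illustration:

```python
import pandas as pd

data = [[1, "a", 1.2], [2, "b", 2.2], [3, "c", 3.2]]
df = pd.DataFrame(
    data,
    index=[f"row_{i}" for i in range(3)],
    columns=["col_0", "col_1", "col_2"],
)

# A DataFrame has both a row index and a column index
print(df.shape)
print(df.ndim)

# Selecting a single column gives back a one-dimensional Series
col = df["col_0"]
print(type(col).__name__, col.ndim)
```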

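IV. Basic Functions

1. Summary functions

head and tail return the first and last n rows of a table (5 by default):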

In [46]: df.head(2)
Out[46]: 
                          School     Grade            Name  Gender  Height  Weight Transfer
0  Shanghai Jiao Tong University  Freshman    Gaopeng Yang  Female   158.9    46.0        N
1              Peking University  Freshman  Changqiang You    Male   166.5    70.0        N

In [47]: df.tail(3)
Out[47]: 
                            School      Grade            Name  Gender  Height  Weight Transfer
197  Shanghai Jiao Tong University     Senior  Chengqiang Chu  Female   153.9    45.0        N
198  Shanghai Jiao Tong University     Senior   Chengmei Shen    Male   175.3    71.0        N
199            Tsinghua University  Sophomore     Chunpeng Lv    Male   155.7    51.0        N

info and describe return an overview of the table and the main statistics of its numeric columns, respectively:

In [48]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB

In [49]: df.describe()
Out[49]: 
           Height      Weight
count  183.000000  189.000000
mean   163.218033   55.015873
std      8.608879   12.824294
min    145.400000   34.000000
25%    157.150000   46.000000
50%    161.900000   51.000000
75%    167.500000   65.000000
max    193.900000   89.000000
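The student table used above is not included in this post, so the sketch below assumes a tiny made-up table with a few missing values. It shows that describe only covers numeric columns and that its count row excludes missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values, used only for illustration
df_small = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Height": [158.9, 166.5, np.nan, 175.3],
        "Weight": [46.0, 70.0, 51.0, np.nan],
    }
)

first_rows = df_small.head(2)   # first two rows
last_row = df_small.tail(1)     # last row
stats = df_small.describe()     # statistics for numeric columns only

print(stats)
```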

2. The most common feature statistics are sum, mean, median, var, std, max, and min.

In [50]: df_demo = df[['Height', 'Weight']]

In [51]: df_demo.mean()
Out[51]: 
Height    163.218033
Weight     55.015873
dtype: float64

In [52]: df_demo.max()
Out[52]: 
Height    193.9
Weight     89.0
dtype: float64

In [56]: df_demo.mean(axis=1).head() # averaging height and weight is not meaningful for this dataset
Out[56]: 
0    102.45
1    118.25
2    138.95
3     41.00
4    124.00
dtype: float64
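The effect of the axis argument can be checked by hand on a small example. This sketch assumes a made-up two-row table: by default the statistic is computed down each column, and axis=1 computes it across each row:

```python
import pandas as pd

# Hypothetical data, used only for illustration
df_demo = pd.DataFrame({"Height": [160.0, 170.0], "Weight": [50.0, 60.0]})

col_means = df_demo.mean()        # one value per column
row_means = df_demo.mean(axis=1)  # one value per row

print(col_means)
print(row_means)
```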

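3. Replacement functions

Replacement is generally applied to a single column, so the examples below all use a Series. Replacement functions in pandas fall into three groups: mapping replacement, logical replacement, and numeric replacement. Mapping replacement includes the replace method.

With replace, the mapping can be given as a dictionary or as two lists: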

In [67]: df['Gender'].replace({'Female':0, 'Male':1}).head()
Out[67]: 
0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

In [68]: df['Gender'].replace(['Female', 'Male'], [0, 1]).head()
Out[68]: 
0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

Both calls replace Female and Male with 0 and 1.
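A small self-contained sketch of the two calling styles, assuming a made-up Series. String replacements are used here so the column dtype stays the same:

```python
import pandas as pd

# Hypothetical data, used only for illustration
gender = pd.Series(["Female", "Male", "Male"])

# Dictionary form: {old_value: new_value}
by_dict = gender.replace({"Female": "F", "Male": "M"})

# Two-list form: old values first, new values second, matched by position
by_lists = gender.replace(["Female", "Male"], ["F", "M"])

print(by_dict.tolist())
print(by_lists.tolist())
```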

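replace also supports a special directional replacement. With method set to ffill, each matched value is replaced by the nearest preceding value that was not replaced. With bfill, it is replaced by the nearest following value that was not replaced. The examples below show that the two give different results (note that the method parameter of replace is deprecated as of pandas 2.1):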

In [69]: s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])

In [70]: s.replace([1, 2], method='ffill')
Out[70]: 
0    a
1    a
2    b
3    b
4    b
5    b
6    a
dtype: object

In [71]: s.replace([1, 2], method='bfill')
Out[71]: 
0    a
1    b
2    b
3    a
4    a
5    a
6    a
dtype: object
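Directional replacement can also be written without the method argument of replace. The sketch below is a substitute technique, not the call shown above: it first blanks out the matched values with mask and isin, then fills the gaps with ffill or bfill:

```python
import pandas as pd

s = pd.Series(["a", 1, "b", 2, 1, 1, "a"])

# Turn every value equal to 1 or 2 into a missing value
blanked = s.mask(s.isin([1, 2]))

# Fill each gap from the nearest value before it, or after it
forward = blanked.ffill()
backward = blanked.bfill()

print(forward.tolist())
print(backward.tolist())
```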

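Logical replacement covers where and mask, which are exact mirrors of each other: where replaces the rows where the condition is False, while mask replaces the rows where the condition is True. If no replacement value is given, missing values (NaN) are used.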

In [72]: s = pd.Series([-1, 1.2345, 100, -50])

In [73]: s.where(s<0)
Out[73]: 
0    -1.0
1     NaN
2     NaN
3   -50.0
dtype: float64

In [74]: s.where(s<0, 100)
Out[74]: 
0     -1.0
1    100.0
2    100.0
3    -50.0
dtype: float64

In [75]: s.mask(s<0)
Out[75]: 
0         NaN
1      1.2345
2    100.0000
3         NaN
dtype: float64

In [76]: s.mask(s<0, -50)
Out[76]: 
0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64
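The symmetry between the two can be checked directly: calling where with a condition gives the same result as calling mask with the negated condition. A minimal sketch using the same Series:

```python
import pandas as pd

s = pd.Series([-1, 1.2345, 100, -50])
cond = s < 0

# where keeps values where cond is True and replaces the rest
via_where = s.where(cond, 100)

# mask replaces values where its condition is True, so negate cond
via_mask = s.mask(~cond, 100)

print(via_where.tolist())
print(via_where.equals(via_mask))
```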

Numeric replacement covers round, abs, and clip, which round to a given precision, take the absolute value, and clip to given bounds, respectively:

In [79]: s = pd.Series([-1, 1.2345, 100, -50])

In [80]: s.round(2)
Out[80]: 
0     -1.00
1      1.23
2    100.00
3    -50.00
dtype: float64

In [81]: s.abs()
Out[81]: 
0      1.0000
1      1.2345
2    100.0000
3     50.0000
dtype: float64

In [82]: s.clip(0, 2) # the two arguments are the lower and upper clipping bounds
Out[82]: 
0    0.0000
1    1.2345
2    2.0000
3    0.0000
dtype: float64
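clip can also be given only one bound through the lower and upper keyword arguments. A short sketch on the same Series:

```python
import pandas as pd

s = pd.Series([-1, 1.2345, 100, -50])

both_bounds = s.clip(0, 2)    # values below 0 become 0, values above 2 become 2
only_lower = s.clip(lower=0)  # only raise values that are below 0
only_upper = s.clip(upper=2)  # only cap values that are above 2

print(both_bounds.tolist())
print(only_lower.tolist())
print(only_upper.tolist())
```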

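4. Sorting functions

There are two ways to sort: by value and by index. The corresponding functions are sort_values and sort_index. In the examples below, df_demo has Grade and Name as a two-level index, as the output shows.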

In [84]: df_demo.sort_values('Height').head()
Out[84]: 
                         Height  Weight
Grade     Name                         
Junior    Xiaoli Chu      145.4    34.0
Senior    Gaomei Lv       147.3    34.0
Sophomore Peng Han        147.8    34.0
Senior    Changli Lv      148.7    41.0
Sophomore Changjuan You   150.5    40.0

In [85]: df_demo.sort_values('Height', ascending=False).head()
Out[85]: 
                        Height  Weight
Grade    Name                         
Senior   Xiaoqiang Qin   193.9    79.0
         Mei Sun         188.9    89.0
         Gaoli Zhao      186.5    83.0
Freshman Qiang Han       185.3    87.0
Senior   Qiang Zheng     183.9    87.0

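Sorting by several columns is a common need. For example, to sort by height among rows with the same weight, with weight in ascending order and height in descending order: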

In [86]: df_demo.sort_values(['Weight','Height'],ascending=[True,False]).head()
Out[86]: 
                       Height  Weight
Grade     Name                       
Sophomore Peng Han      147.8    34.0
Senior    Gaomei Lv     147.3    34.0
Junior    Xiaoli Chu    145.4    34.0
Sophomore Qiang Zhou    150.5    36.0
Freshman  Yanqiang Xu   152.4    38.0
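The same multi-column sort, together with an index sort, can be tried on a small made-up table. The row labels here are arbitrary letters chosen for illustration:

```python
import pandas as pd

# Hypothetical data, used only for illustration
df_demo = pd.DataFrame(
    {
        "Height": [145.4, 150.5, 147.8, 147.3],
        "Weight": [34.0, 36.0, 34.0, 34.0],
    },
    index=["d", "b", "c", "a"],
)

# Weight ascending first, then Height descending within equal weights
by_both = df_demo.sort_values(["Weight", "Height"], ascending=[True, False])

# Sort by the row labels instead of by values
by_index = df_demo.sort_index()

print(by_both)
print(by_index)
```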
