Python Data Analysis Learning Series, Part 1: Getting Started with Pandas

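  • Pandas is a powerful Python toolkit for data analysis, built on top of NumPy
  • Main features of Pandas
    • Data structures with built-in label alignment: DataFrame and Series
    • Integrated time series functionality
    • A rich set of mathematical operations
    • Flexible handling of missing data
  • Installation: pip install pandas
  • Import convention: import pandas as pd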

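1 Series - One-dimensional data object

  • A Series is an object similar to a one-dimensional array, made up of a one-dimensional array of data and an associated array of data labels (the index).
  • Creation: pd.Series([1,2,3])
  • Getting the value array and the index array: the values and index attributes
  • A Series behaves like a cross between a list (array) and a dictionary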

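1.1 Series - Usage features

  • A Series supports array features (positions):
    • Create a Series from an array: Series(array)
    • Arithmetic with a scalar: sr*2
    • Arithmetic between two Series of the same length: sr1+sr2
    • Indexing: sr[0], sr[[1,2,4]]
    • Slicing: sr[0:2]
    • Universal functions (ufuncs): np.abs(sr)
    • Boolean indexing: sr[sr>5]
  • A Series supports dictionary features (labels):
    • Create a Series from a dictionary: Series(dict)
    • The in operator: 'a' in sr
    • Key indexing: sr['a'], sr[['a','b','d']]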
import pandas as pd
pd.Series([2,3,4,5])
0    2
1    3
2    4
3    5
dtype: int64
pd.Series([2,3,4,5], index=['a','b','c','d'])
a    2
b    3
c    4
d    5
dtype: int64
import numpy as np
pd.Series(np.arange(5))
0    0
1    1
2    2
3    3
4    4
dtype: int32
pd.Series({'a':1,
          'b':2,
          'c':3,
          'd':4})
a    1
b    2
c    3
d    4
dtype: int64
sr = pd.Series([2,3,4,5], index=['a','b','c','d'])
sr
a    2
b    3
c    4
d    5
dtype: int64
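sr[0] # lookup by position (newer pandas versions deprecate this on a labelled index; sr.iloc[0] is the explicit form)
2
sr['a'] # lookup by index label also works
2
sr+2 # like an ndarray, a Series supports +, -, *, / with a scalar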
a    4
b    5
c    6
d    7
dtype: int64
sr+sr # like ndarrays, two Series of the same length can be combined element-wise
a     4
b     6
c     8
d    10
dtype: int64
sr[0:2] # slicing works too
a    2
b    3
dtype: int64
np.sum(sr) # NumPy functions work on a Series
14
sr[sr>3] # boolean indexing is supported
c    4
d    5
dtype: int64
sr = pd.Series({'a':1,'b':2})
sr
a    1
b    2
dtype: int64
'a' in sr # like a dictionary, "in" checks the index (the keys)
True
sr.index # get the index
Index(['a', 'b'], dtype='object')
sr.values # get the values
array([1, 2], dtype=int64)
sr = pd.Series([2,3,4,5], index=['a','b','c','d'])
sr
a    2
b    3
c    4
d    5
dtype: int64
sr[['a','c']]
a    2
c    4
dtype: int64
sr['a':'c'] # label-based slices include both endpoints
a    2
b    3
c    4
dtype: int64

1.2 Series - Integer index

sr = pd.Series(np.arange(20))
sr
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int32
sr2 = sr[10:].copy()
sr2
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
dtype: int32
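sr2[10] # an integer key could mean either a label or a position; with an integer index, pandas always treats it as a label
10
sr2.loc[10] # explicitly label-based
10
sr2.iloc[9] # explicitly position-based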
19
sr2.iloc[0:3]
10    10
11    11
12    12
dtype: int32
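
The label/position distinction above can be checked directly. The following is a minimal sketch that rebuilds the same `sr2` and compares the two access styles; the helper variable names are only for illustration.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(20))
sr2 = sr[10:].copy()            # labels 10..19, positions 0..9

by_label = sr2.loc[10]          # label 10 is the first element
by_position = sr2.iloc[0]       # position 0 is the same element

# A plain integer key is looked up as a label, so 0 is not found here.
try:
    sr2[0]
    found = True
except KeyError:
    found = False

# Label slices include the end point; position slices exclude it.
label_slice = sr2.loc[10:12]    # labels 10, 11, 12
position_slice = sr2.iloc[0:2]  # positions 0 and 1
```

Using `loc` and `iloc` everywhere keeps the intent visible to the reader, which is why they are usually preferred over bare brackets when the index holds integers.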

1.3 Series - Data alignment

sr1 = pd.Series([12,23,34],index=['c','a','d']) # values are added label by label: a with a, c with c, ...
sr2 = pd.Series([11,20,10],index=['d','c','a'])
sr1+sr2
a    33
c    32
d    45
dtype: int64

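When Pandas performs an operation on two Series objects, it first aligns them by index (label) and then computes the result. A label that appears in only one of the two Series yields NaN.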

sr1 = pd.Series([12,23,34],index=['c','a','d'])
sr2 = pd.Series([11,20,10,21],index=['d','c','a','b'])
sr1+sr2
a    33.0
b     NaN
c    32.0
d    45.0
dtype: float64
sr1 = pd.Series([12,23,34],index=['c','a','d'])
sr2 = pd.Series([11,20,10],index=['b','c','a'])
sr1+sr2
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
sr1 = pd.Series([12,23,34],index=['c','a','d'])
sr2 = pd.Series([11,20,10],index=['b','c','a'])
sr1.add(sr2, fill_value=0)
# To avoid NaN for labels missing from one side, replace "+" with add() and pass fill_value; sub (-), div (/) and mul (*) work the same way.
a    33.0
b    11.0
c    32.0
d    34.0
dtype: float64
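
The comment above mentions sub, div and mul without showing them. As a small sketch with the same two Series, the fill value is chosen per operation here (0 for subtraction, 1 for multiplication); that choice is an assumption made for illustration, not something the original prescribes.

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['b', 'c', 'a'])

# A label missing on one side is replaced by fill_value before computing.
diff = sr1.sub(sr2, fill_value=0)   # b: 0 - 11, d: 34 - 0
prod = sr1.mul(sr2, fill_value=1)   # b: 1 * 11, d: 34 * 1
```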

1.4 Series - Missing data

# Option 1: detect and drop missing data with isnull() and dropna()
sr = sr1+sr2
sr.isnull() # notnull() is the inverse
a    False
b     True
c    False
d     True
dtype: bool
sr[sr.notnull()]
a    33.0
c    32.0
dtype: float64
sr[~sr.isnull()]
a    33.0
c    32.0
dtype: float64
sr.dropna()
a    33.0
c    32.0
dtype: float64
# Option 2: fill missing values with fillna()
sr.fillna(0)
a    33.0
b     0.0
c    32.0
d     0.0
dtype: float64
sr # fillna() returns a new object and leaves sr unchanged; assign the result to keep it: sr = sr.fillna(0)
a    33.0
b     NaN
c    32.0
d     NaN
dtype: float64
sr.fillna(sr.mean())
a    33.0
b    32.5
c    32.0
d    32.5
dtype: float64

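2 DataFrame - Two-dimensional data object

  • A DataFrame is a tabular data structure containing an ordered set of columns. It can be viewed as a dictionary of Series that share one row index.
  • Creation:
    • pd.DataFrame({'one':[1,3,5,7],'two':[2,4,6,8]})
  • Reading and writing csv files:
    • pd.read_csv('filename.csv')
    • df.to_csv('filename.csv')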
d1 = pd.DataFrame({'one':[1,3,5,7],'two':[2,4,6,8]}) # when building a DataFrame from a dictionary of lists, every column must have the same number of rows
d1
   one  two
0    1    2
1    3    4
2    5    6
3    7    8
d1 = pd.DataFrame({'one':[1,3,5,7],'two':[2,4,6,8]},index=['a','b','c','d'])
d1
   one  two
a    1    2
b    3    4
c    5    6
d    7    8
d1 = pd.DataFrame({'one':pd.Series([1,3,5],index=['a','b','c']),'two':pd.Series([2,4,6,8],index=['a','b','c','d'])})
d1
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  NaN    8
d1.dtypes
one    float64
two      int64
dtype: object

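Notes:

  • 1. When a DataFrame is built from Series, the Series are aligned by label;
  • 2. Because NaN is a floating-point value, the whole "one" column is automatically converted to float.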
d2 = pd.read_csv('test.csv')
d2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
d1.to_csv('test2.csv')
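
A round trip through a csv file shows what happens to the row index. This is a rough sketch that writes to a temporary directory; the file name `roundtrip.csv` is made up for the example.

```python
import os
import tempfile
import pandas as pd

d1 = pd.DataFrame({'one': [1, 3, 5, 7], 'two': [2, 4, 6, 8]},
                  index=['a', 'b', 'c', 'd'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'roundtrip.csv')    # hypothetical file name
    d1.to_csv(path)                              # the index is written as the first column
    plain = pd.read_csv(path)                    # the index comes back as an ordinary column
    restored = pd.read_csv(path, index_col=0)    # the first column becomes the index again
```

Without `index_col`, the saved index reappears as an extra unnamed column, so `plain` has three columns while `restored` has two.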

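2.1 DataFrame - Common attributes

  • index: the row index
  • T: transpose
  • columns: the column index
  • values: the underlying array of values
  • describe(): quick summary statistics (this one is a method, not an attribute)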
d1 = pd.DataFrame({'one':pd.Series([1,3,5],index=['a','b','c']),'two':pd.Series([2,4,6,8],index=['a','b','c','d'])})
d1
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  NaN    8
d1.values
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.],
       [nan,  8.]])
d1.T
       a    b    c    d
one  1.0  3.0  5.0  NaN
two  2.0  4.0  6.0  8.0
d1.dtypes
one    float64
two      int64
dtype: object
d1.describe()
       one       two
count  3.0  4.000000
mean   3.0  5.000000
std    2.0  2.581989
min    1.0  2.000000
25%    2.0  3.500000
50%    3.0  5.000000
75%    4.0  6.500000
max    5.0  8.000000
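
The index and columns attributes from the list above are not exercised in the session, so here is a short sketch on the same `d1`. The variable names are illustrative only.

```python
import pandas as pd

d1 = pd.DataFrame({'one': pd.Series([1, 3, 5], index=['a', 'b', 'c']),
                   'two': pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])})

row_labels = list(d1.index)        # row index labels
col_labels = list(d1.columns)      # column index labels
shape = d1.values.shape            # shape of the underlying 2-D array
t_shape = d1.T.shape               # transposing swaps rows and columns
count_one = d1.describe().loc['count', 'one']   # describe() ignores NaN when counting
```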

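2.2 DataFrame - Indexing and slicing

  • A DataFrame is two-dimensional, so it has both a row index and a column index.
  • Like a Series, a DataFrame can be indexed and sliced either by label or by position.
  • The loc and iloc attributes
    • Usage: separate the two parts with a comma; the row indexer comes first, the column indexer second
    • Each of the row and column parts can be a plain index, a slice, a boolean mask, or a fancy index (a list of labels or positions), in any combination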
d1 = pd.DataFrame({'one':pd.Series([1,3,5],index=['a','b','c']),'two':pd.Series([2,4,6,8],index=['a','b','c','d'])})
d1
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  NaN    8
d1['one']['a']
1.0
d1.loc['a','one']
1.0
d1.loc['a',:]
one    1.0
two    2.0
Name: a, dtype: float64
d1.loc['d','one']
nan
d1.loc[d1.one.isnull(),'one']
d   NaN
Name: one, dtype: float64
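
The session above only uses loc. The sketch below, on the same `d1`, adds iloc and mixes the indexer kinds listed earlier (plain index, slice, boolean mask, fancy index); it is one possible illustration, not an exhaustive tour.

```python
import pandas as pd

d1 = pd.DataFrame({'one': pd.Series([1, 3, 5], index=['a', 'b', 'c']),
                   'two': pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])})

first_cell = d1.iloc[0, 0]                       # position-based: row 0, column 0
top_rows = d1.iloc[0:2, :]                       # position slice: end point excluded
picked = d1.loc[['a', 'c'], 'two']               # fancy row index plus a single column
big = d1.loc[d1['two'] > 4, ['one', 'two']]      # boolean row mask plus a column list
label_slice = d1.loc['a':'c', 'one']             # label slice: end point included
```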

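2.3 DataFrame - Data alignment and missing values

  • DataFrame arithmetic also aligns data; row labels and column labels are aligned separately.
  • Methods for handling missing data in a DataFrame:
    • dropna(axis=0,how='any')
    • fillna()
    • isnull()
    • notnull()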
d2 = pd.DataFrame({'two':[1,2,3,4],'one':[4,5,6,7]},index=['c','d','b','a'])
d2
   two  one
c    1    4
d    2    5
b    3    6
a    4    7
d1+d2 # both rows and columns are aligned
   one  two
a  8.0    6
b  9.0    7
c  9.0    7
d  NaN   10
d1.fillna(0)
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
d  0.0    8
d1.dropna() # by default, a row is dropped as soon as it contains a single missing value
   one  two
a  1.0    2
b  3.0    4
c  5.0    6
import numpy as np
d1.loc['d','two'] = np.nan
d1.loc['c','two'] = np.nan
d1
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
d1.dropna(how='all')
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d2 = d1.dropna(how='all')
d2.dropna(axis=1)
   one
a  1.0
b  3.0
c  5.0

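2.4 pandas - Other common methods

  • mean(axis=0,skipna=True): mean of each column (axis=0) or each row (axis=1)
  • sum(axis=1): sum of each row (axis=1) or each column (axis=0)
  • sort_index(axis,…,ascending=True): sort by row (or column) labels
  • sort_values(by,axis,ascending): sort by the values of a column (or row)
  • NumPy universal functions also work on pandas objects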
d1
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
d1.mean() # returns a Series holding the mean of each column (or of each row with axis=1)
one    3.0
two    3.0
dtype: float64
d1.mean(axis=1)
a    1.5
b    3.5
c    5.0
d    NaN
dtype: float64
d1.mean(axis='columns')
a    1.5
b    3.5
c    5.0
d    NaN
dtype: float64
d1.sort_values(by='two')
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
d1.sort_values(by='two', ascending=False) # missing values take no part in the ordering and are always placed last
   one  two
b  3.0  4.0
a  1.0  2.0
c  5.0  NaN
d  NaN  NaN
d1.sort_values(by='a', axis=1,ascending=False)
   two  one
a  2.0  1.0
b  4.0  3.0
c  NaN  5.0
d  NaN  NaN
d1.sort_index()
   one  two
a  1.0  2.0
b  3.0  4.0
c  5.0  NaN
d  NaN  NaN
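
The session above covers mean and the sort methods but not sum, skipna, or NumPy ufuncs on a DataFrame. A brief sketch on an equivalent `d1` (rebuilt here from literal values so the block runs on its own):

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({'one': [1.0, 3.0, 5.0, np.nan],
                   'two': [2.0, 4.0, np.nan, np.nan]},
                  index=['a', 'b', 'c', 'd'])

col_sums = d1.sum()                    # one sum per column; NaN is skipped
row_sums = d1.sum(axis=1)              # one sum per row; an all-NaN row sums to 0
strict_mean = d1.mean(skipna=False)    # any NaN in a column makes its mean NaN
reversed_rows = d1.sort_index(ascending=False)   # row labels in descending order
roots = np.sqrt(d1)                    # a NumPy ufunc returns a DataFrame
```

Passing `skipna=False` is the way to make missing values propagate instead of being silently ignored.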

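3 pandas - Time objects

3.1 pandas - Handling time objects

  • Kinds of time series data
    • Timestamp: a specific moment
    • Fixed period: e.g. December 2020
    • Time interval: start time to end time
  • Time objects in the Python standard library: datetime
  • Flexible parsing of time objects: dateutil
    • dateutil.parser.parse()
  • Handling time objects in bulk: pandas
    • pd.to_datetime()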
import datetime
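datetime.datetime.strptime('2020-01-01','%Y-%m-%d') # the p in strptime stands for parse, the f in strftime for format
datetime.datetime(2020, 1, 1, 0, 0)
import dateutil.parser # import the submodule explicitly; a bare "import dateutil" does not guarantee that dateutil.parser is loaded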
dateutil.parser.parse('2020-01-01')
datetime.datetime(2020, 1, 1, 0, 0)
dateutil.parser.parse('02/03/2020')
datetime.datetime(2020, 2, 3, 0, 0)
dateutil.parser.parse('20200203')
datetime.datetime(2020, 2, 3, 0, 0)
pd.to_datetime(['2001-01-01','2002/01/01']) # two different formats in one call; pandas 2.x may require format='mixed' for this
DatetimeIndex(['2001-01-01', '2002-01-01'], dtype='datetime64[ns]', freq=None)
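
Since strftime is only mentioned in a comment and '02/03/2020' is ambiguous between day-first and month-first, the sketch below formats a date back to text and parses the ambiguous string with an explicit convention. The day-first reading is an assumption chosen for the example.

```python
import datetime

import dateutil.parser
import pandas as pd

parsed = datetime.datetime.strptime('2020-01-01', '%Y-%m-%d')   # text -> datetime
text = parsed.strftime('%d/%m/%Y')                              # datetime -> text

# dateutil reads '02/03/2020' month-first by default; dayfirst=True flips that.
day_first = dateutil.parser.parse('02/03/2020', dayfirst=True)

# With pandas, an explicit format removes the ambiguity for the whole batch.
stamps = pd.to_datetime(['02/03/2020', '15/03/2020'], format='%d/%m/%Y')
```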

3.2 pandas - Generating time objects

# In IPython/Jupyter, a trailing "?" displays the function's documentation
pd.date_range?
pd.date_range('2010-01-01','2010-05-01')
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10',
               ...
               '2010-04-22', '2010-04-23', '2010-04-24', '2010-04-25',
               '2010-04-26', '2010-04-27', '2010-04-28', '2010-04-29',
               '2010-04-30', '2010-05-01'],
              dtype='datetime64[ns]', length=121, freq='D')
pd.date_range('2010-01-01',periods=10)
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10'],
              dtype='datetime64[ns]', freq='D')
pd.date_range('2010-01-01',periods=10,freq='W') # every Sunday
DatetimeIndex(['2010-01-03', '2010-01-10', '2010-01-17', '2010-01-24',
               '2010-01-31', '2010-02-07', '2010-02-14', '2010-02-21',
               '2010-02-28', '2010-03-07'],
              dtype='datetime64[ns]', freq='W-SUN')
pd.date_range('2010-01-01',periods=10,freq='W-MON') # every Monday
DatetimeIndex(['2010-01-04', '2010-01-11', '2010-01-18', '2010-01-25',
               '2010-02-01', '2010-02-08', '2010-02-15', '2010-02-22',
               '2010-03-01', '2010-03-08'],
              dtype='datetime64[ns]', freq='W-MON')
pd.date_range('2010-01-01',periods=10,freq='B') # every business day
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14'],
              dtype='datetime64[ns]', freq='B')
dt = pd.date_range('2010-01-01',periods=10,freq='B')
dt[0]
Timestamp('2010-01-01 00:00:00', freq='B')
dt[0].to_pydatetime()
datetime.datetime(2010, 1, 1, 0, 0)
pd.date_range('2010-01-01',periods=10,freq='1h20min') # an interval of 1 hour 20 minutes also works
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:20:00',
               '2010-01-01 02:40:00', '2010-01-01 04:00:00',
               '2010-01-01 05:20:00', '2010-01-01 06:40:00',
               '2010-01-01 08:00:00', '2010-01-01 09:20:00',
               '2010-01-01 10:40:00', '2010-01-01 12:00:00'],
              dtype='datetime64[ns]', freq='80T')

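4 pandas - Time series

  • A time series is a Series or DataFrame whose index consists of time objects.
  • When datetime objects are used as an index, they are stored in a DatetimeIndex object.
  • Special features of time series:
    • Slicing by passing a "year" or a "month"
    • Slicing by passing a date range
    • Rich function support: resample(), truncate(), …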
pd.date_range('2010-01-01','2010-05-01')
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10',
               ...
               '2010-04-22', '2010-04-23', '2010-04-24', '2010-04-25',
               '2010-04-26', '2010-04-27', '2010-04-28', '2010-04-29',
               '2010-04-30', '2010-05-01'],
              dtype='datetime64[ns]', length=121, freq='D')
sr = pd.Series(np.arange(1000), index=pd.date_range('2020-01-01', periods=1000))
sr
2020-01-01      0
2020-01-02      1
2020-01-03      2
2020-01-04      3
2020-01-05      4
             ... 
2022-09-22    995
2022-09-23    996
2022-09-24    997
2022-09-25    998
2022-09-26    999
Freq: D, Length: 1000, dtype: int32
sr.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2022-09-17', '2022-09-18', '2022-09-19', '2022-09-20',
               '2022-09-21', '2022-09-22', '2022-09-23', '2022-09-24',
               '2022-09-25', '2022-09-26'],
              dtype='datetime64[ns]', length=1000, freq='D')
sr['2020-03'] # select the data of a given year or month
2020-03-01    60
2020-03-02    61
2020-03-03    62
2020-03-04    63
2020-03-05    64
2020-03-06    65
2020-03-07    66
2020-03-08    67
2020-03-09    68
2020-03-10    69
2020-03-11    70
2020-03-12    71
2020-03-13    72
2020-03-14    73
2020-03-15    74
2020-03-16    75
2020-03-17    76
2020-03-18    77
2020-03-19    78
2020-03-20    79
2020-03-21    80
2020-03-22    81
2020-03-23    82
2020-03-24    83
2020-03-25    84
2020-03-26    85
2020-03-27    86
2020-03-28    87
2020-03-29    88
2020-03-30    89
2020-03-31    90
Freq: D, dtype: int32
sr['2020']
2020-01-01      0
2020-01-02      1
2020-01-03      2
2020-01-04      3
2020-01-05      4
             ... 
2020-12-27    361
2020-12-28    362
2020-12-29    363
2020-12-30    364
2020-12-31    365
Freq: D, Length: 366, dtype: int32
sr['2020-05':'2020-10']
2020-05-01    121
2020-05-02    122
2020-05-03    123
2020-05-04    124
2020-05-05    125
             ... 
2020-10-27    300
2020-10-28    301
2020-10-29    302
2020-10-30    303
2020-10-31    304
Freq: D, Length: 184, dtype: int32
sr['2020-05-01':'2020-10-31']
2020-05-01    121
2020-05-02    122
2020-05-03    123
2020-05-04    124
2020-05-05    125
             ... 
2020-10-27    300
2020-10-28    301
2020-10-29    302
2020-10-30    303
2020-10-31    304
Freq: D, Length: 184, dtype: int32
sr.resample('W').sum() # weekly sums
2020-01-05      10
2020-01-12      56
2020-01-19     105
2020-01-26     154
2020-02-02     203
              ... 
2022-09-04    6818
2022-09-11    6867
2022-09-18    6916
2022-09-25    6965
2022-10-02     999
Freq: W-SUN, Length: 144, dtype: int32
sr.resample('M').mean() # monthly means (recent pandas versions spell this alias 'ME')
2020-01-31     15.0
2020-02-29     45.0
2020-03-31     75.0
2020-04-30    105.5
2020-05-31    136.0
2020-06-30    166.5
2020-07-31    197.0
2020-08-31    228.0
2020-09-30    258.5
2020-10-31    289.0
2020-11-30    319.5
2020-12-31    350.0
2021-01-31    381.0
2021-02-28    410.5
2021-03-31    440.0
2021-04-30    470.5
2021-05-31    501.0
2021-06-30    531.5
2021-07-31    562.0
2021-08-31    593.0
2021-09-30    623.5
2021-10-31    654.0
2021-11-30    684.5
2021-12-31    715.0
2022-01-31    746.0
2022-02-28    775.5
2022-03-31    805.0
2022-04-30    835.5
2022-05-31    866.0
2022-06-30    896.5
2022-07-31    927.0
2022-08-31    958.0
2022-09-30    986.5
Freq: M, dtype: float64
sr.truncate(before='2020-05-01',after='2020-10-01')
2020-05-01    121
2020-05-02    122
2020-05-03    123
2020-05-04    124
2020-05-05    125
             ... 
2020-09-27    270
2020-09-28    271
2020-09-29    272
2020-09-30    273
2020-10-01    274
Freq: D, Length: 154, dtype: int32
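
Date-range slicing and truncate() overlap in what they do. As a quick check on the same kind of series, the sketch below compares the two and also sums one month selected by a partial date string.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(1000), index=pd.date_range('2020-01-01', periods=1000))

# Both forms keep the end date, so they select the same rows here.
by_slice = sr['2020-05-01':'2020-10-01']
by_truncate = sr.truncate(before='2020-05-01', after='2020-10-01')
same = by_slice.equals(by_truncate)

# A partial date string selects a whole month.
march = sr.loc['2020-03']
march_total = march.sum()      # values 60 through 90
```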

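5 pandas - File handling

5.1 pandas - Reading files

  • Common data file format: csv (values separated by a delimiter)

  • Reading files with pandas: data can be loaded from a file name, a URL, or a file object

    • read_csv: the default delimiter is a comma
    • read_table: the default delimiter is a tab
  • Main parameters of read_csv and read_table:

    • sep: the delimiter; a regular expression such as '\s+' is allowed
    • header=None: the file has no header row
    • names: the column names
    • index_col: the column to use as the index
    • skiprows: rows to skip
    • na_values: strings to be treated as missing values
    • parse_dates: which columns to parse as dates; accepts a boolean or a list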
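
The parameters above can be tried without any file on disk by reading from an in-memory text buffer. This is a minimal sketch; the three-column layout, the column names, and the "-" missing-value marker are invented for the example and are not the layout of the 399300.csv file used below.

```python
import io

import pandas as pd

# Whitespace-separated text with a junk first line and no header row.
raw = io.StringIO(
    "exported by some tool\n"
    "2021/1/29   5351.96   -\n"
    "2021/1/28   5377.14   17048\n"
)

df = pd.read_csv(
    raw,
    sep=r'\s+',                          # regular-expression delimiter
    header=None,                         # the data has no header row
    names=['date', 'close', 'volume'],   # hypothetical column names
    index_col='date',                    # use the date column as the index
    skiprows=[0],                        # skip the junk first line
    na_values=['-'],                     # treat "-" as a missing value
    parse_dates=['date'],                # parse the date column into timestamps
)
```

Because a file object is accepted wherever a file name is, the same call works unchanged on a real file path.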
pd.read_csv('399300.csv')
             日期     股票代码     名称        收盘价        最高价        最低价        开盘价        前收盘        涨跌额      涨跌幅          成交量                成交金额
0     2021/1/29  '399300  沪深300  5351.9646  5430.2015  5288.0955  5413.9684  5377.1427   -25.1781  -0.4682  18217878400  390,287,690,019.00
1     2021/1/28  '399300  沪深300  5377.1427  5462.2352  5360.3766  5450.3695  5528.0034  -150.8607   -2.729  17048558500  376,166,523,178.00
2     2021/1/27  '399300  沪深300  5528.0034  5534.9928  5449.6385  5505.7708  5512.9678    15.0356   0.2727  16019084100  376,892,605,839.00
3     2021/1/26  '399300  沪深300  5512.9678  5600.9017  5505.9962  5600.9017  5625.9232  -112.9554  -2.0078  17190459000  415,008,069,865.00
4     2021/1/25  '399300  沪深300  5625.9232  5655.4795  5543.2663  5564.1237   5569.776    56.1472   1.0081  19704701900  508,166,980,802.00
...         ...      ...    ...        ...        ...        ...        ...        ...        ...      ...          ...                 ...
4625  2002/1/10  '399300  沪深300  1281.2600  1281.2600  1281.2600  1281.2600    1272.65       8.61   0.6765            0                   -
4626   2002/1/9  '399300  沪深300  1272.6500  1272.6500  1272.6500  1272.6500    1292.71     -20.06  -1.5518            0                   -
4627   2002/1/8  '399300  沪深300  1292.7100  1292.7100  1292.7100  1292.7100    1302.08      -9.37  -0.7196            0                   -
4628   2002/1/7  '399300  沪深300  1302.0800  1302.0800  1302.0800  1302.0800    1316.46     -14.38  -1.0923            0                   -
4629   2002/1/4  '399300  沪深300  1316.4600  1316.4600  1316.4600  1316.4600       None       None     None            0                   -

4630 rows × 12 columns

pd.read_csv('399300.csv', index_col='日期')
              股票代码     名称        收盘价        最高价        最低价        开盘价        前收盘        涨跌额      涨跌幅          成交量                成交金额
日期
2021/1/29  '399300  沪深300  5351.9646  5430.2015  5288.0955  5413.9684  5377.1427   -25.1781  -0.4682  18217878400  390,287,690,019.00
2021/1/28  '399300  沪深300  5377.1427  5462.2352  5360.3766  5450.3695  5528.0034  -150.8607   -2.729  17048558500  376,166,523,178.00
2021/1/27  '399300  沪深300  5528.0034  5534.9928  5449.6385  5505.7708  5512.9678    15.0356   0.2727  16019084100  376,892,605,839.00
2021/1/26  '399300  沪深300  5512.9678  5600.9017  5505.9962  5600.9017  5625.9232  -112.9554  -2.0078  17190459000  415,008,069,865.00
2021/1/25  '399300  沪深300  5625.9232  5655.4795  5543.2663  5564.1237   5569.776    56.1472   1.0081  19704701900  508,166,980,802.00
...            ...    ...        ...        ...        ...        ...        ...        ...      ...          ...                 ...
2002/1/10  '399300  沪深300  1281.2600  1281.2600  1281.2600  1281.2600    1272.65       8.61   0.6765            0                   -
2002/1/9   '399300  沪深300  1272.6500  1272.6500  1272.6500  1272.6500    1292.71     -20.06  -1.5518            0                   -
2002/1/8   '399300  沪深300  1292.7100  1292.7100  1292.7100  1292.7100    1302.08      -9.37  -0.7196            0                   -
2002/1/7   '399300  沪深300  1302.0800  1302.0800  1302.0800  1302.0800    1316.46     -14.38  -1.0923            0                   -
2002/1/4   '399300  沪深300  1316.4600  1316.4600  1316.4600  1316.4600       None       None     None            0                   -

4630 rows × 11 columns

df = _ # in IPython/Jupyter, "_" holds the previous output
df.index # the dates were read in as plain strings
Index(['2021/1/29', '2021/1/28', '2021/1/27', '2021/1/26', '2021/1/25',
       '2021/1/22', '2021/1/21', '2021/1/20', '2021/1/19', '2021/1/18',
       ...
       '2002/1/17', '2002/1/16', '2002/1/15', '2002/1/14', '2002/1/11',
       '2002/1/10', '2002/1/9', '2002/1/8', '2002/1/7', '2002/1/4'],
      dtype='object', name='日期', length=4630)
# df = pd.read_csv('399300.csv',index_col='日期',parse_dates=True) # parse_dates=True tries to parse the index as dates
df = pd.read_csv('399300.csv',index_col='日期',parse_dates=['日期']) # parse the listed columns as dates
df
               股票代码     名称        收盘价        最高价        最低价        开盘价        前收盘        涨跌额      涨跌幅          成交量                成交金额
日期
2021-01-29  '399300  沪深300  5351.9646  5430.2015  5288.0955  5413.9684  5377.1427   -25.1781  -0.4682  18217878400  390,287,690,019.00
2021-01-28  '399300  沪深300  5377.1427  5462.2352  5360.3766  5450.3695  5528.0034  -150.8607   -2.729  17048558500  376,166,523,178.00
2021-01-27  '399300  沪深300  5528.0034  5534.9928  5449.6385  5505.7708  5512.9678    15.0356   0.2727  16019084100  376,892,605,839.00
2021-01-26  '399300  沪深300  5512.9678  5600.9017  5505.9962  5600.9017  5625.9232  -112.9554  -2.0078  17190459000  415,008,069,865.00
2021-01-25  '399300  沪深300  5625.9232  5655.4795  5543.2663  5564.1237   5569.776    56.1472   1.0081  19704701900  508,166,980,802.00
...             ...    ...        ...        ...        ...        ...        ...        ...      ...          ...                 ...
2002-01-10  '399300  沪深300  1281.2600  1281.2600  1281.2600  1281.2600    1272.65       8.61   0.6765            0                   -
2002-01-09  '399300  沪深300  1272.6500  1272.6500  1272.6500  1272.6500    1292.71     -20.06  -1.5518            0                   -
2002-01-08  '399300  沪深300  1292.7100  1292.7100  1292.7100  1292.7100    1302.08      -9.37  -0.7196            0                   -
2002-01-07  '399300  沪深300  1302.0800  1302.0800  1302.0800  1302.0800    1316.46     -14.38  -1.0923            0                   -
2002-01-04  '399300  沪深300  1316.4600  1316.4600  1316.4600  1316.4600       None       None     None            0                   -

4630 rows × 11 columns

df.index
DatetimeIndex(['2021-01-29', '2021-01-28', '2021-01-27', '2021-01-26',
               '2021-01-25', '2021-01-22', '2021-01-21', '2021-01-20',
               '2021-01-19', '2021-01-18',
               ...
               '2002-01-17', '2002-01-16', '2002-01-15', '2002-01-14',
               '2002-01-11', '2002-01-10', '2002-01-09', '2002-01-08',
               '2002-01-07', '2002-01-04'],
              dtype='datetime64[ns]', name='日期', length=4630, freq=None)
df = pd.read_csv('399300-2.csv',header=None) # if the file has no header row, pass header=None and integer column names are generated automatically
df
              0        1      2          3          4          5          6          7          8        9           10                  11
0     2021/1/29  '399300  沪深300  5351.9646  5430.2015  5288.0955  5413.9684  5377.1427   -25.1781  -0.4682  18217878400  390,287,690,019.00
1     2021/1/28  '399300  沪深300  5377.1427  5462.2352  5360.3766  5450.3695  5528.0034  -150.8607   -2.729  17048558500  376,166,523,178.00
2     2021/1/27  '399300  沪深300  5528.0034  5534.9928  5449.6385  5505.7708  5512.9678    15.0356   0.2727  16019084100  376,892,605,839.00
3     2021/1/26  '399300  沪深300  5512.9678  5600.9017  5505.9962  5600.9017  5625.9232  -112.9554  -2.0078  17190459000  415,008,069,865.00
4     2021/1/25  '399300  沪深300  5625.9232  5655.4795  5543.2663  5564.1237   5569.776    56.1472   1.0081  19704701900  508,166,980,802.00
...         ...      ...    ...        ...        ...        ...        ...        ...        ...      ...          ...                 ...
4625  2002/1/10  '399300  沪深300  1281.2600  1281.2600  1281.2600  1281.2600    1272.65       8.61   0.6765            0                   -
4626   2002/1/9  '399300  沪深300  1272.6500  1272.6500  1272.6500  1272.6500    1292.71     -20.06  -1.5518            0                   -
4627   2002/1/8  '399300  沪深300  1292.7100  1292.7100  1292.7100  1292.7100    1302.08      -9.37  -0.7196            0                   -
4628   2002/1/7  '399300  沪深300  1302.0800  1302.0800  1302.0800  1302.0800    1316.46     -14.38  -1.0923            0                   -
4629   2002/1/4  '399300  沪深300  1316.4600  1316.4600  1316.4600  1316.4600       None       None     None            0                   -

4630 rows × 12 columns

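# If the file has no header row, pass header=None and supply your own column names.
# Only 11 names are given for 12 columns here, so pandas uses the first column (the dates) as the index.
df = pd.read_csv('399300-2.csv',header=None, names=['股票代码', '名称', '收盘价', '最高价', '最低价',
                                                    '开盘价', '前收盘', '涨跌额', '涨跌幅', '成交量','成交金额'])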
df
              股票代码     名称        收盘价        最高价        最低价        开盘价        前收盘        涨跌额      涨跌幅          成交量                成交金额
2021/1/29  '399300  沪深300  5351.9646  5430.2015  5288.0955  5413.9684  5377.1427   -25.1781  -0.4682  18217878400  390,287,690,019.00
2021/1/28  '399300  沪深300  5377.1427  5462.2352  5360.3766  5450.3695  5528.0034  -150.8607   -2.729  17048558500  376,166,523,178.00
2021/1/27  '399300  沪深300  5528.0034  5534.9928  5449.6385  5505.7708  5512.9678    15.0356   0.2727  16019084100  376,892,605,839.00
2021/1/26  '399300  沪深300  5512.9678  5600.9017  5505.9962  5600.9017  5625.9232  -112.9554  -2.0078  17190459000  415,008,069,865.00
2021/1/25  '399300  沪深300  5625.9232  5655.4795  5543.2663  5564.1237   5569.776    56.1472   1.0081  19704701900  508,166,980,802.00
...            ...    ...        ...        ...        ...        ...        ...        ...      ...          ...                 ...
2002/1/10  '399300  沪深300  1281.2600  1281.2600  1281.2600  1281.2600    1272.65       8.61   0.6765            0                   -
2002/1/9   '399300  沪深300  1272.6500  1272.6500  1272.6500  1272.6500    1292.71     -20.06  -1.5518            0                   -
2002/1/8   '399300  沪深300  1292.7100  1292.7100  1292.7100  1292.7100    1302.08      -9.37  -0.7196            0                   -
2002/1/7   '399300  沪深300  1302.0800  1302.0800  1302.0800  1302.0800    1316.46     -14.38  -1.0923            0                   -
2002/1/4   '399300  沪深300  1316.4600  1316.4600  1316.4600  1316.4600       None       None     None            0                   -

4630 rows × 11 columns

df = pd.read_csv('399300.csv',index_col='日期',parse_dates=['日期'], skiprows=[1,2,3])
df
               股票代码     名称        收盘价        最高价        最低价        开盘价        前收盘        涨跌额      涨跌幅          成交量                成交金额
日期
2021-01-26  '399300  沪深300  5512.9678  5600.9017  5505.9962  5600.9017  5625.9232  -112.9554  -2.0078  17190459000  415,008,069,865.00
2021-01-25  '399300  沪深300  5625.9232  5655.4795  5543.2663  5564.1237   5569.776    56.1472   1.0081  19704701900  508,166,980,802.00
2021-01-22  '399300  沪深300  5569.7760  5573.6594  5513.8769  5562.3790  5564.9693     4.8067   0.0864  19930002000  456,622,193,436.00
2021-01-21  '399300  沪深300  5564.9693  5593.1058  5490.5626  5492.9587  5476.4336    88.5357   1.6167  20995019700  453,183,684,479.00
2021-01-20  '399300  沪深300  5476.4336  5496.0493  5426.5357  5439.9111  5437.5234    38.9102   0.7156  17091326000  373,770,384,496.00
...             ...    ...        ...        ...        ...        ...        ...        ...      ...          ...                 ...
2002-01-10  '399300  沪深300  1281.2600  1281.2600  1281.2600  1281.2600    1272.65       8.61   0.6765            0                   -
2002-01-09  '399300  沪深300  1272.6500  1272.6500  1272.6500  1272.6500    1292.71     -20.06  -1.5518            0                   -
2002-01-08  '399300  沪深300  1292.7100  1292.7100  1292.7100  1292.7100    1302.08      -9.37  -0.7196            0                   -
2002-01-07  '399300  沪深300  1302.0800  1302.0800  1302.0800  1302.0800    1316.46     -14.38  -1.0923            0                   -
2002-01-04  '399300  沪深300  1316.4600  1316.4600  1316.4600  1316.4600       None       None     None            0                   -

4627 rows × 11 columns

df = pd.read_csv('399300.csv',index_col='日期',parse_dates=['日期'], na_values=['None','NA','nan']) # strings to be treated as missing values

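5.2 pandas - Writing files

  • Writing to a csv file: the to_csv method
  • Main parameters of the writing function
    • sep: the delimiter
    • na_rep: the string written for missing values; defaults to an empty string
    • header=False: do not write the header row
    • index=False: do not write the index column
    • columns: the columns to write, given as a list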
df.iloc[1,1] = np.nan
df.to_csv('test.csv', header=False, index=False, na_rep='None',encoding='ANSI') # 'ANSI' is a Windows-only codec alias; use e.g. 'utf-8' on other platforms
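
To see the effect of these parameters without creating a file, to_csv can be called with no path, in which case it returns the csv text as a string. The small frame below is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'close': [5351.96, np.nan], 'volume': [100, 200]},
                  index=['2021/1/29', '2021/1/28'])

text = df.to_csv(
    sep=';',                       # use a semicolon as the delimiter
    na_rep='None',                 # write missing values as the string "None"
    header=False,                  # leave out the header row
    index=False,                   # leave out the index column
    columns=['volume', 'close'],   # choose and order the columns to write
)
lines = text.splitlines()
```

Here `lines` holds one string per data row, with the columns in the requested order and the missing close price written as "None".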