Python XML processing with pandas: Learning Python, the pandas series

Latest recommended article published on 2024-08-02 15:56:57

weixin_39984098


Article link: https://blog.csdn.net/weixin_39984098/article/details/113641333


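This article walks through working with data in Python's pandas library: creating a DataFrame, viewing it, selecting values, assigning values, and handling missing values. Worked examples show how flexible and convenient the DataFrame is for data analysis.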


Python: learning pandas, the DataFrame

import numpy as np

import pandas as pd

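A DataFrame is a two-dimensional labeled data structure whose columns may hold different types, similar to an Excel sheet, a SQL table, or a dict of Series objects. Although a DataFrame is more general-purpose, it is still built from a collection of Series, so most of its methods and attributes mirror those of Series.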

# A DataFrame has two parts: the data and the index (row index and column index)
data = [[1,2,3], [2,3,4]]
s = pd.DataFrame(data, index=['first', 'second'])
print(s)

        0  1  2
first   1  2  3
second  2  3  4
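The two parts can be inspected directly. The following is a minimal sketch that rebuilds the same small table; the variable names `frame`, `row_labels`, `col_labels`, and `first_col` are chosen here for illustration only.

```python
import pandas as pd

# Same data as the example above: two rows, three columns
frame = pd.DataFrame([[1, 2, 3], [2, 3, 4]], index=['first', 'second'])

row_labels = list(frame.index)      # row index
col_labels = list(frame.columns)    # column index (default integer labels)
first_col = frame[0]                # a single column comes back as a Series

print(row_labels, col_labels)
print(type(first_col).__name__, first_col.tolist())
```

Selecting one column returns a Series that shares the row index of the DataFrame, which is why most Series methods carry over.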

1. Creating a DataFrame

# Create from a multi-dimensional array (nested lists)
# -- without specifying index

arr = pd.DataFrame([[1,2,3],[2.0,3.3,4.5,6.3],['a','b','c']])

print(arr)

print('%%%%%%%%%%%%%%%%%%')

# -- with index specified; note that index must match the number of rows and columns must match the length of the longest row

arr = pd.DataFrame([[1,2,3],[2.0,3.3,4.5,6.3],['a','b','c']],index=['a', 'b', 'c'], columns=[1,2,3,4])

print(arr)

   0    1    2    3
0  1    2    3  NaN
1  2  3.3  4.5  6.3
2  a    b    c  NaN
%%%%%%%%%%%%%%%%%%
   1    2    3    4
a  1    2    3  NaN
b  2  3.3  4.5  6.3
c  a    b    c  NaN

# A DataFrame can be built from dicts in several flexible ways
# 1. Dict of lists: every list must have exactly the same length
# -- without specifying index
df = pd.DataFrame({'first':[1,2,3], 'second':['2','c', 'd'], 'third':['3',4,5]})
print(df)
print('%%%%%%%%%%%%%')
# -- with index and columns specified: each name in columns is looked up among the dict keys;
#    a name with no matching key ('2' here) yields an all-NaN column
df = pd.DataFrame({'first':[1,2,3], 'second':['2','c', 'd'], 'third':['3',4,5]}, index=['a','b','d'], columns=['first','2', 'third'])
print(df)
print('%%%%%%%%%%%%%')
# 2. List of dicts: more flexible than a dict of lists; note the order given in columns
df = pd.DataFrame([{'a':1,'b':2,'c':3},{1:'a','a':4,'b':5, 'c':6}], index=['hello', 'dataframe'], columns=['c',1, 'a'])
print(df)

   first second third
0      1      2     3
1      2      c     4
2      3      d     5
%%%%%%%%%%%%%
   first    2 third
a      1  NaN     3
b      2  NaN     4
d      3  NaN     5
%%%%%%%%%%%%%
           c    1  a
hello      3  NaN  1
dataframe  6    a  4

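# A DataFrame can be built from dicts in several flexible ways (continued)
# 3. Dict of Series: rows are aligned on the index labels of the Series
# -- without specifying index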

df = pd.DataFrame({'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two': pd.Series([1., 2., 3., 4.],index=['a', 'b', 'c', 'd'])})

print(df)

print('%%%%%%%%%%%%%')

# -- with index specified: each label is looked up in the Series; a label with no match (1 here) yields a row of NaN

df = pd.DataFrame({'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])},index=[1, 'a','b', 'c', 'd'])

print(df)

one two

a 1.0 1.0

b 2.0 2.0

c 3.0 3.0

d NaN 4.0

%%%%%%%%%%%%%

one two

1 NaN NaN

a 1.0 1.0

b 2.0 2.0

c 3.0 3.0

d NaN 4.0
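The alignment in the dict-of-Series case happens by label rather than by position. A small sketch of this, using made-up values and the illustrative names `one`, `two`, and `aligned`:

```python
import pandas as pd

# The labels of `one` are deliberately shuffled relative to `two`
one = pd.Series([1.0, 2.0, 3.0], index=['c', 'a', 'b'])
two = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# Each row of the result pairs the values that share a label
aligned = pd.DataFrame({'one': one, 'two': two})

print(aligned.loc['a'].tolist())   # values stored under label 'a'
print(aligned.loc['c'].tolist())   # values stored under label 'c'
```

Because matching is done on labels, the order in which each Series lists its entries does not affect which values end up in the same row.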

2. Viewing DataFrame data

# First, build a DataFrame of fund data to work with
# Column names: 单元净值 = unit net value, 累计净值 = accumulated net value
def get_fund_series():
    import requests
    from dateutil import parser
    import datetime
    # Browser-like headers sent with the request
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'Sec-Fetch-User': '?1',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-Mode': 'navigate',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    response = requests.get('https://glink.genius.com.cn/base/V_JRJ_FUND_NET_HISTORY/full=2&sort=TRADEDATE%20desc&filter-FUND_CODE-str=002001&limit=200', headers=headers)
    res = response.json()
    data = []
    index = []
    for item in res['rows']:
        # Shift MTIME by 8 hours and keep only the date part
        temp = parser.parse(item['MTIME']) + datetime.timedelta(hours=8)
        temp = temp.strftime('%Y-%m-%d')
        # Keep only the first record seen for each date
        if temp not in index:
            index.append(temp)
            data.append({'单元净值':item['UNIT_NET'],'累计净值':item['ACCUM_NET']})
    return index, data

index, data = get_fund_series()
fund = pd.DataFrame(data, index=index)

# View the first and last rows of the DataFrame
print(fund.head())
print(fund.tail(3))

            单元净值  累计净值
2019-12-21  1.503  4.532
2019-12-20  1.506  4.535
2019-12-19  1.511  4.540
2019-12-18  1.512  4.541
2019-12-17  1.503  4.532
            单元净值  累计净值
2019-03-07  1.385  4.369
2019-03-06  1.387  4.371
2019-03-05  1.382  4.366

# View the index

print(fund.index)

Index(['2019-12-21', '2019-12-20', '2019-12-19', '2019-12-18', '2019-12-17',

'2019-12-14', '2019-12-13', '2019-12-12', '2019-12-11', '2019-12-10',

...

'2019-03-16', '2019-03-15', '2019-03-14', '2019-03-13', '2019-03-12',

'2019-03-09', '2019-03-08', '2019-03-07', '2019-03-06', '2019-03-05'],

dtype='object', length=198)

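# The head, the tail, and the index can all be converted to a numpy.array or a list (note that list() on a DataFrame returns its column labels)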

print(fund.head().values, fund.head().to_numpy(), np.asarray(fund.head()), list(fund.head()))

[[1.503 4.532]

[1.506 4.535]

[1.511 4.54 ]

[1.512 4.541]

[1.503 4.532]] [[1.503 4.532]

[1.506 4.535]

[1.511 4.54 ]

[1.512 4.541]

[1.503 4.532]] [[1.503 4.532]

[1.506 4.535]

[1.511 4.54 ]

[1.512 4.541]

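[1.503 4.532]] ['单元净值', '累计净值']

When a DataFrame holds several data types, DataFrame.values copies the data and coerces every value to one common type, which is a relatively expensive operation. DataFrame.to_numpy() also returns a single NumPy array with one common dtype, but it is the clearer, recommended spelling and accepts a dtype argument for controlling the result.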
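The effect of mixed column types on the resulting array can be checked on a tiny example. This is a sketch with made-up data; the names `numeric`, `mixed`, and `column_labels` are illustrative.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'x': [1, 2], 'y': [1.5, 2.5]})     # int and float columns
mixed = pd.DataFrame({'x': [1, 2], 'label': ['a', 'b']})   # int and string columns

numeric_arr = numeric.to_numpy()   # a common numeric dtype can hold both columns
mixed_arr = mixed.to_numpy()       # only the generic object dtype can hold both

column_labels = list(numeric)      # list() on a DataFrame yields the column labels

print(numeric_arr.dtype, mixed_arr.dtype, column_labels)
```

An object array stores Python objects element by element, which is why conversions of mixed-type frames are comparatively costly.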

describe() gives a quick statistical summary of the data

print(fund.describe())

       单元净值        累计净值

count 198.000000 198.000000

mean 1.482091 4.479803

std 0.055413 0.066070

min 1.341000 4.325000

25% 1.441250 4.425250

50% 1.487500 4.483500

75% 1.531750 4.541000

max 1.565000 4.586000

# Transpose the data

print(fund.T)

2019-12-21 2019-12-20 2019-12-19 2019-12-18 2019-12-17 2019-12-14 \

单元净值 1.503 1.506 1.511 1.512 1.503 1.514

累计净值 4.532 4.535 4.540 4.541 4.532 4.543

2019-12-13 2019-12-12 2019-12-11 2019-12-10 ... 2019-03-16 \

单元净值 1.496 1.498 1.502 1.497 ... 1.380

累计净值 4.525 4.527 4.531 4.526 ... 4.364

2019-03-15 2019-03-14 2019-03-13 2019-03-12 2019-03-09 2019-03-08 \

单元净值 1.364 1.364 1.365 1.365 1.341 1.366

累计净值 4.348 4.348 4.349 4.349 4.325 4.350

2019-03-07 2019-03-06 2019-03-05

单元净值 1.385 1.387 1.382

累计净值 4.369 4.371 4.366

[2 rows x 198 columns]

# Sorting; ascending=True sorts in ascending order
# -- sort by index (axis=1 sorts the column labels)
print(fund.sort_index(axis=1, ascending=False))
print('##################')
# -- sort by values
print(fund.sort_values(ascending=False, by='累计净值'))

            累计净值  单元净值

2019-12-21 4.532 1.503

2019-12-20 4.535 1.506

2019-12-19 4.540 1.511

2019-12-18 4.541 1.512

2019-12-17 4.532 1.503

... ... ...

2019-03-09 4.325 1.341

2019-03-08 4.350 1.366

2019-03-07 4.369 1.385

2019-03-06 4.371 1.387

2019-03-05 4.366 1.382

[198 rows x 2 columns]

##################

            单元净值  累计净值

2019-11-20 1.557 4.586

2019-11-21 1.552 4.581

2019-11-06 1.565 4.579

2019-11-05 1.563 4.577

2019-11-15 1.548 4.577

... ... ...

2019-03-13 1.365 4.349

2019-03-12 1.365 4.349

2019-03-15 1.364 4.348

2019-03-14 1.364 4.348

2019-03-09 1.341 4.325

[198 rows x 2 columns]

3. Selecting values

# Get a specific value
# By label (like a dict)
# Get a column
print(fund['累计净值'])
# A row cannot be fetched with fund['2019-12-18']; use loc
print(fund.loc['2019-12-18'])
print(fund['累计净值']['2019-12-18'], fund.loc['2019-12-18', '累计净值'],fund.at['2019-12-18', '累计净值'])
# By position (like a list)
# A column cannot be fetched with fund[0]; use iloc
print(fund.iloc[:,0])
# Get a row
print(fund.iloc[0])
# Get a single value
print(fund.iloc[0, 0],fund.iat[0, 0])

2019-12-21 4.532

2019-12-20 4.535

2019-12-19 4.540

2019-12-18 4.541

2019-12-17 4.532

...

2019-03-09 4.325

2019-03-08 4.350

2019-03-07 4.369

2019-03-06 4.371

2019-03-05 4.366

Name: 累计净值, Length: 198, dtype: float64

单元净值 1.512

累计净值 4.541

Name: 2019-12-18, dtype: float64

4.541 4.541 4.541

2019-12-21 1.503

2019-12-20 1.506

2019-12-19 1.511

2019-12-18 1.512

2019-12-17 1.503

...

2019-03-09 1.341

2019-03-08 1.366

2019-03-07 1.385

2019-03-06 1.387

2019-03-05 1.382

Name: 单元净值, Length: 198, dtype: float64

单元净值 1.503

累计净值 4.532

Name: 2019-12-21, dtype: float64

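1.503 1.503

Note: every pandas data structure has an explicit index and an implicit index. For a Series, the explicit index is its index, which is used like a dict key; loc and at use the explicit index. The implicit index is the positional index, like a list subscript; iloc and iat use the implicit index.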

loc and iloc can fetch either multiple values or a single value, whereas at and iat can fetch only a single value.
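The difference between the two kinds of index is easiest to see when the labels are themselves integers. A minimal sketch with made-up data (the names `letters`, `by_label`, and `by_position` are illustrative):

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2
letters = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = letters.loc[0]        # explicit index: the entry labeled 0
by_position = letters.iloc[0]    # implicit index: the first entry

single_label = letters.at[1]     # at: one value, by label
single_pos = letters.iat[1]      # iat: one value, by position
several = letters.loc[[2, 1]]    # loc also accepts a list of labels

print(by_label, by_position, single_label, single_pos, several.tolist())
```

Here `loc[0]` and `iloc[0]` return different entries, which is the situation where choosing the accessor explicitly matters most.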

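4. Slicing

# To slice by column, use loc or iloc
# Slice by label; note that the slice must run from the earlier position to the later one in the index
print(fund['2019-12-18':'2019-12-17'])
# Slice by position (like a list)
print(fund[:5])
# The pandas accessors loc and iloc
print(fund.loc['2019-12-18':'2019-12-17'])
print(fund.loc['2019-12-18':'2019-12-17', '单元净值':'累计净值'])
print('***************')
print(fund.iloc[-5:-1, 1])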

            单元净值  累计净值
2019-12-18  1.512  4.541
2019-12-17  1.503  4.532
            单元净值  累计净值
2019-12-21  1.503  4.532
2019-12-20  1.506  4.535
2019-12-19  1.511  4.540
2019-12-18  1.512  4.541
2019-12-17  1.503  4.532
            单元净值  累计净值
2019-12-18  1.512  4.541
2019-12-17  1.503  4.532
            单元净值  累计净值
2019-12-18  1.512  4.541
2019-12-17  1.503  4.532

***************

2019-03-09 4.325

2019-03-08 4.350

2019-03-07 4.369

2019-03-06 4.371

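Name: 累计净值, dtype: float64

Standard Python / NumPy expressions for selecting and setting values are already intuitive, but the optimized pandas data access methods .at, .iat, .loc, and .iloc are still recommended.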

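5. Boolean indexing

Boolean indexing works like the filter function: it returns the rows for which the condition is True

# Find the rows whose unit net value is greater than 1.5
print(fund[fund['单元净值']>1.5])

            单元净值  累计净值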

2019-12-21 1.503 4.532

2019-12-20 1.506 4.535

2019-12-19 1.511 4.540

2019-12-18 1.512 4.541

2019-12-17 1.503 4.532

... ... ...

2019-07-06 1.520 4.504

2019-07-05 1.502 4.486

2019-07-04 1.518 4.502

2019-07-03 1.536 4.520

2019-07-02 1.532 4.516

[84 rows x 2 columns]
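Several conditions can be combined with `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses. The sketch below reuses five rows copied from the output shown earlier; the thresholds 1.505 and 4.541 and the names `sample`, `mask`, and `selected` are picked purely for illustration.

```python
import pandas as pd

# Five rows copied from the fund output earlier in the article
sample = pd.DataFrame(
    {'单元净值': [1.503, 1.506, 1.511, 1.512, 1.382],
     '累计净值': [4.532, 4.535, 4.540, 4.541, 4.366]},
    index=['2019-12-21', '2019-12-20', '2019-12-19', '2019-12-18', '2019-03-05'],
)

# Each comparison yields a boolean Series; & combines them element by element
mask = (sample['单元净值'] > 1.505) & (sample['累计净值'] < 4.541)
selected = sample[mask]    # rows where both conditions hold
rejected = sample[~mask]   # ~ inverts the mask

print(selected)
```

The parentheses are needed because `&` binds more tightly than the comparison operators in Python.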

6. Assignment

# Assigning to the index. Note that the data is fetched live, so the index range differs from day to day

fund_index = pd.to_datetime(fund.index, format="%Y-%m-%d")

print(fund_index)

DatetimeIndex(['2019-12-21', '2019-12-20', '2019-12-19', '2019-12-18',

'2019-12-17', '2019-12-14', '2019-12-13', '2019-12-12',

'2019-12-11', '2019-12-10',

...

'2019-03-16', '2019-03-15', '2019-03-14', '2019-03-13',

'2019-03-12', '2019-03-09', '2019-03-08', '2019-03-07',

'2019-03-06', '2019-03-05'],

dtype='datetime64[ns]', length=198, freq=None)

fund.index = fund_index

print(fund)

print(fund.index)

            单元净值  累计净值

2019-12-21 1.503 4.532

2019-12-20 1.506 4.535

2019-12-19 1.511 4.540

2019-12-18 1.512 4.541

2019-12-17 1.503 4.532

... ... ...

2019-03-09 1.341 4.325

2019-03-08 1.366 4.350

2019-03-07 1.385 4.369

2019-03-06 1.387 4.371

2019-03-05 1.382 4.366

[198 rows x 2 columns]

DatetimeIndex(['2019-12-21', '2019-12-20', '2019-12-19', '2019-12-18',

'2019-12-17', '2019-12-14', '2019-12-13', '2019-12-12',

'2019-12-11', '2019-12-10',

...

'2019-03-16', '2019-03-15', '2019-03-14', '2019-03-13',

'2019-03-12', '2019-03-09', '2019-03-08', '2019-03-07',

'2019-03-06', '2019-03-05'],

dtype='datetime64[ns]', length=198, freq=None)

# Assign using the index
s1 = pd.DataFrame({'单元净值':np.nan, '累计净值':np.nan}, index=pd.date_range('2019-03-05', '2019-12-21'))
s1.loc[fund.index] = fund
print(s1)

            单元净值  累计净值

2019-03-05 1.382 4.366

2019-03-06 1.387 4.371

2019-03-07 1.385 4.369

2019-03-08 1.366 4.350

2019-03-09 1.341 4.325

... ... ...

2019-12-17 1.503 4.532

2019-12-18 1.512 4.541

2019-12-19 1.511 4.540

2019-12-20 1.506 4.535

2019-12-21 1.503 4.532

[292 rows x 2 columns]

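# Assign a single cell (use .loc rather than chained indexing, which may write to a temporary copy)
s1.loc['2019-03-05', '单元净值'] = s1.loc['2019-03-06', '单元净值']
# Assign a slice of rows
s1.loc['2019-12-15':'2019-12-16'] = 1.514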
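Single-cell and slice assignment can be tried on a small date-indexed frame. This is a sketch only: the frame `demo` is hypothetical, 1.382 is taken from the output above, and 4.37 is a placeholder value.

```python
import numpy as np
import pandas as pd

# A small all-NaN frame with a daily DatetimeIndex (four days)
demo = pd.DataFrame({'单元净值': np.nan, '累计净值': np.nan},
                    index=pd.date_range('2019-03-05', periods=4))

# Single cell: at takes exactly one row label and one column label
demo.at[pd.Timestamp('2019-03-05'), '单元净值'] = 1.382

# Label slice on rows plus one column; both ends of a label slice are included
demo.loc['2019-03-06':'2019-03-07', '累计净值'] = 4.37

print(demo)
```

Writing through a single `.loc` or `.at` call targets the original object, whereas a chained form such as `demo['col']['row'] = value` may operate on an intermediate copy.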

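7. Handling missing values

# 1. Drop missing values
fund = s1.dropna(how='any')
print(fund)
# Common dropna parameters (by default the original is left unchanged and a new object is returned):
# 1. axis: 0 or 'index' drops rows that contain missing values; 1 or 'columns' drops columns that contain missing values.
# 2. how: 'any' drops the row or column if any value is NA; 'all' drops it only if all values are NA.
# 3. thresh: int, the minimum number of non-NaN values a row or column needs in order to be kept.
# 4. inplace: True modifies the original object directly.

            单元净值  累计净值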

2019-03-05 1.387 4.366

2019-03-06 1.387 4.371

2019-03-07 1.385 4.369

2019-03-08 1.366 4.350

2019-03-09 1.341 4.325

... ... ...

2019-12-17 1.503 4.532

2019-12-18 1.512 4.541

2019-12-19 1.511 4.540

2019-12-20 1.506 4.535

2019-12-21 1.503 4.532

[200 rows x 2 columns]
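The how and thresh parameters can be compared side by side on a small frame. The data below is made up so that the four rows contain 3, 2, 1, and 0 valid values respectively; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
gaps = pd.DataFrame({
    'a': [1.0, nan, nan, nan],
    'b': [2.0, 5.0, nan, nan],
    'c': [3.0, 6.0, 9.0, nan],
})

any_dropped = gaps.dropna(how='any')   # keep rows with no NaN at all
all_dropped = gaps.dropna(how='all')   # drop only rows that are entirely NaN
thresh_kept = gaps.dropna(thresh=2)    # keep rows with at least 2 non-NaN values

print(list(any_dropped.index), list(all_dropped.index), list(thresh_kept.index))
```

None of the calls modifies `gaps` itself, since inplace is left at its default.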

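# 2. Fill missing values
print(s1.fillna(value=5))
# Common fillna parameters (by default the original is left unchanged and a new object is returned):
# 1. value: a scalar, dict, Series, or DataFrame. A dict / Series / DataFrame specifies which value to use for each index (for a Series) or each column (for a DataFrame); entries not present in it are left unfilled. The value cannot be a list.
# 2. method: one of {'backfill', 'bfill', 'pad', 'ffill', None}. 'backfill' / 'bfill' fill a gap with the next valid value; 'pad' / 'ffill' carry the last valid value forward; None fills with value.
# 3. limit: if method is given, the maximum number of consecutive NaN values to fill forward / backward.
# 4. axis: the axis along which missing values are filled.
# 5. inplace: True modifies the original object directly.

            单元净值  累计净值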

2019-03-05 1.387 4.366

2019-03-06 1.387 4.371

2019-03-07 1.385 4.369

2019-03-08 1.366 4.350

2019-03-09 1.341 4.325

... ... ...

2019-12-17 1.503 4.532

2019-12-18 1.512 4.541

2019-12-19 1.511 4.540

2019-12-20 1.506 4.535

2019-12-21 1.503 4.532

[292 rows x 2 columns]

# Forward fill; equivalent to fillna(method='pad'), which recent pandas releases deprecate
s1.ffill(inplace=True)
print(s1)

            单元净值  累计净值

2019-03-05 1.387 4.366

2019-03-06 1.387 4.371

2019-03-07 1.385 4.369

2019-03-08 1.366 4.350

2019-03-09 1.341 4.325

... ... ...

2019-12-17 1.503 4.532

2019-12-18 1.512 4.541

2019-12-19 1.511 4.540

2019-12-20 1.506 4.535

2019-12-21 1.503 4.532

[292 rows x 2 columns]
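The fill directions, the limit parameter, and a per-column dict value can be sketched on made-up data. The names `gaps`, `forward`, `backward`, `frame`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = gaps.ffill(limit=2)   # carry 1.0 forward, but over at most 2 consecutive NaNs
backward = gaps.bfill()         # fill every gap with the next valid value, 5.0

# A dict value fills only the columns it names
frame = pd.DataFrame({'x': [np.nan, 1.0], 'y': [np.nan, 2.0]})
filled = frame.fillna({'x': 0.0})   # column 'y' is not in the dict and keeps its NaN

print(forward.tolist(), backward.tolist())
print(filled)
```

With limit=2 the third missing value stays NaN, which can be useful when a long gap should not be bridged by stale data.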

8. Methods and functions

The apply function

# Each column is passed to the function as a Series

s1.apply(lambda x: x.max() - x.min())

单元净值 0.224

累计净值 3.072

dtype: float64

Concatenation (concat)

s = pd.Series(['A', 'B', 'C'])

s2 = pd.Series(['d', 'e', 'f'])

pd.concat([s,s2])

0 A

1 B

2 C

0 d

1 e

2 f

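dtype: object

Appending

# Series.append was removed in pandas 2.0; pd.concat with ignore_index=True produces the same result
pd.concat([s, s2], ignore_index=True)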

0 A

1 B

2 C

3 d

4 e

5 f

dtype: object

Join (SQL style)

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

print(left)

print(pd.merge(left, right, on='key'))

   key  lval
0  foo     1
1  bar     2
   key  lval  rval
0  foo     1     4
1  bar     2     5

Grouping

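"group by" refers to a process involving one or more of the following steps:

Splitting: dividing the data into groups based on some criteria;

Applying: applying a function to each group independently;

Combining: combining the results into a data structure.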
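The three steps above can be written out by hand and compared with the built-in groupby. This is a sketch on a made-up four-row table; `toy`, `pieces`, `sums`, `manual`, and `builtin` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0]})

# Split: one sub-frame per distinct key in column A
pieces = {key: toy[toy['A'] == key] for key in toy['A'].unique()}

# Apply: sum column C within each piece
sums = {key: piece['C'].sum() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series, ordered by key
manual = pd.Series(sums, name='C').sort_index()

# The same computation with groupby
builtin = toy.groupby('A')['C'].sum()

print(manual.tolist(), builtin.tolist())
```

groupby performs the same split, apply, and combine sequence internally, without materializing each sub-frame in user code.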

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],

'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],

'C': np.random.randn(8),

'D': np.random.randn(8)})

print(df)

A B C D

0 foo one -0.736775 -1.032603

1 bar one 0.161155 0.312022

2 foo two -0.398455 -1.259967

3 bar three 0.265293 0.315548

4 foo two -0.411767 0.946389

5 bar two -1.642788 0.544317

6 foo one -0.209906 -2.848426

7 foo three 1.674149 0.153529

# Select the numeric columns explicitly so that the string column B is not summed
df.groupby('A')[['C', 'D']].sum()

            C         D
A
bar -1.216341  1.171887
foo -0.082753 -4.041078

df.groupby(['A', 'B']).sum()

                  C         D
A   B
bar one    0.161155  0.312022
    three  0.265293  0.315548
    two   -1.642788  0.544317
foo one   -0.946680 -3.881029
    three  1.674149  0.153529
    two   -0.810222 -0.313578