Python for Data Analysis: Reading Notes

shuai_wen

Article link: https://blog.csdn.net/u011279649/article/details/103615992

Problem: analyzing the accelerometer sample rate

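I recently needed to analyze a sensor's sample rate. The first column of the log is the timer tick count, and the sample rate has to be computed from it. I had done a similar analysis before, but after a long gap I had forgotten almost all of it, and my earlier notes could not be reproduced without running into all kinds of errors. So this time I am writing everything down, including the raw data and each step, for future reference.

The IPython sessions below assume the usual imports: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt, and, for the plotting section, from pandas import Series, DataFrame.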

In [103]: !cat test.txt
1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
1072193170, 0.135865, -0.144819, 9.872621
1072932037, 0.145442, -0.140030, 9.872621
1073670319, 0.145442, -0.130454, 9.877409
1074408600, 0.140654, -0.130454, 9.872621
1075146881, 0.145442, -0.154395, 9.867833
1075885748, 0.159806, -0.130454, 9.872621
1076625201, 0.135865, -0.144819, 9.877409

numpy

Why ndarray

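pandas is built on NumPy, so even if you work with pandas directly it is worth understanding NumPy. It also helps when trying to understand tensors in deep learning. NumPy's core data structure is the ndarray, for example: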

In [104]: nda1 = np.arange(18).reshape(3,6)

In [105]: nda1
Out[105]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

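The purpose of ndarray is vectorization: an operation is written once for the whole array and executed in NumPy's compiled loops, replacing an explicit Python-level for loop over the elements. This is not the same as multi-threaded parallelism, although the effect is a large speedup over the loop.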
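
To make the contrast concrete, here is a minimal sketch that computes the same result with an explicit loop and with a vectorized expression. The variable names are illustrative only.

```python
import numpy as np

values = np.arange(18).reshape(3, 6)

# Explicit Python loop: one interpreter-level step per element.
looped = np.empty_like(values)
for i in range(values.shape[0]):
    for j in range(values.shape[1]):
        looped[i, j] = values[i, j] * 2 + 1

# Vectorized form: the same arithmetic written once for the whole array.
vectorized = values * 2 + 1

# Both approaches produce identical arrays.
print(np.array_equal(looped, vectorized))
```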

ndarray attributes

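The attributes of an ndarray include ndim, shape, dtype, and so on. Pay particular attention to the astype method, which returns a new array converted to the data type you want.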
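
A short sketch of these attributes on the 3x6 array used earlier. The default integer dtype depends on the platform, so it is printed rather than assumed.

```python
import numpy as np

nda1 = np.arange(18).reshape(3, 6)

print(nda1.ndim)   # number of dimensions
print(nda1.shape)  # size along each dimension
print(nda1.dtype)  # an integer type; the exact width is platform-dependent

# astype returns a converted copy; nda1 itself keeps its original dtype.
as_float = nda1.astype(np.float64)
print(as_float.dtype)
```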

ndarray behavior

Common operations here include selecting rows, columns, and blocks, and arithmetic between an ndarray and a scalar (broadcasting).
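
These operations can be sketched on the same 3x6 array as follows (names are illustrative):

```python
import numpy as np

nda1 = np.arange(18).reshape(3, 6)

row = nda1[1]             # second row
col = nda1[:, 2]          # third column
block = nda1[0:2, 1:3]    # rows 0-1, columns 1-2

# Broadcasting: the scalar is applied to every element.
scaled = nda1 * 10

print(row)
print(col)
print(block)
```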

DataFrame

pandas is built on NumPy and makes NumPy-centric applications more convenient.

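The main data structures in pandas are Series and DataFrame. Every item in a Series has a corresponding index label, while a DataFrame has both a row index and a column index, that is, it holds tabular data.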
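
A minimal sketch of the two structures, using a few accelerometer values from test.txt and made-up row labels 'a', 'b', 'c':

```python
import pandas as pd

# A Series: one column of values, each paired with an index label.
s = pd.Series([0.140654, 0.150230, 0.140654], index=['a', 'b', 'c'])

# A DataFrame: a table with a row index and a column index.
frame = pd.DataFrame(
    {'x': [0.140654, 0.150230, 0.140654],
     'y': [-0.144819, -0.154395, -0.154395]},
    index=['a', 'b', 'c'],
)

print(s['b'])
print(frame.shape)
print(list(frame.columns))
```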

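pandas operations

Selecting rows, columns, and blocks

Pay particular attention to loc and iloc for row selection: loc selects by label (name), and iloc selects by integer position. When the rows carry labels in addition to their positions, use loc to select by label.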
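
The difference can be sketched on a small labeled DataFrame (the data and labels are made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]},
    index=['a', 'b', 'c'],
)

by_label = frame.loc['b']      # row whose label is 'b'
by_position = frame.iloc[1]    # row at integer position 1 (the same row)

# Label slices include the end label; position slices exclude the end.
block_label = frame.loc['a':'b', ['x']]
block_position = frame.iloc[0:2, 0:1]

print(by_label)
print(block_label)
```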

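File-based operations

Read data from a file, analyze it with DataFrame and Series, and then save the result back to a file. Most practical uses are file-based. The file "test.txt" is used as the example.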

In [48]: !cat test.txt
1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
1072193170, 0.135865, -0.144819, 9.872621
1072932037, 0.145442, -0.140030, 9.872621
1073670319, 0.145442, -0.130454, 9.877409
1074408600, 0.140654, -0.130454, 9.872621
1075146881, 0.145442, -0.154395, 9.867833
1075885748, 0.159806, -0.130454, 9.872621
1076625201, 0.135865, -0.144819, 9.877409

Reading the file (by default the first line is used as the column names)

In [49]: df = pd.read_csv('test.txt')

In [50]: df
Out[50]:
1069968951 0.140654 -0.144819 9.891772
0 1070711920 0.150230 -0.154395 9.891772
1 1071452545 0.140654 -0.154395 9.882196
2 1072193170 0.135865 -0.144819 9.872621
3 1072932037 0.145442 -0.140030 9.872621
4 1073670319 0.145442 -0.130454 9.877409
5 1074408600 0.140654 -0.130454 9.872621
6 1075146881 0.145442 -0.154395 9.867833
7 1075885748 0.159806 -0.130454 9.872621
8 1076625201 0.135865 -0.144819 9.877409

Reading the file with header=None: column names are generated automatically

In [51]: df = pd.read_csv('test.txt', header=None)

In [52]: df
Out[52]:
0 1 2 3
0 1069968951 0.140654 -0.144819 9.891772
1 1070711920 0.150230 -0.154395 9.891772
2 1071452545 0.140654 -0.154395 9.882196
3 1072193170 0.135865 -0.144819 9.872621
4 1072932037 0.145442 -0.140030 9.872621
5 1073670319 0.145442 -0.130454 9.877409
6 1074408600 0.140654 -0.130454 9.872621
7 1075146881 0.145442 -0.154395 9.867833
8 1075885748 0.159806 -0.130454 9.872621
9 1076625201 0.135865 -0.144819 9.877409

Reading the file with the names parameter

In [55]: names = ['time', 'x', 'y', 'z']
In [53]: df = pd.read_csv('test.txt', names=names)

In [54]: df
Out[54]:
time x y z
0 1069968951 0.140654 -0.144819 9.891772
1 1070711920 0.150230 -0.154395 9.891772
2 1071452545 0.140654 -0.154395 9.882196
3 1072193170 0.135865 -0.144819 9.872621
4 1072932037 0.145442 -0.140030 9.872621
5 1073670319 0.145442 -0.130454 9.877409
6 1074408600 0.140654 -0.130454 9.872621
7 1075146881 0.145442 -0.154395 9.867833
8 1075885748 0.159806 -0.130454 9.872621
9 1076625201 0.135865 -0.144819 9.877409
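
The three ways of calling read_csv can be compared side by side without a file on disk. This sketch feeds the first three lines of test.txt through io.StringIO. The skipinitialspace=True argument is an addition not used in the original session; it strips the space after each comma.

```python
import io

import pandas as pd

raw = """1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
"""

# Default: the first line is consumed as the header, leaving two data rows.
default = pd.read_csv(io.StringIO(raw))

# header=None: all three lines are data; columns are numbered 0..3.
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=...: all three lines are data; columns get the given names.
named = pd.read_csv(io.StringIO(raw), names=['time', 'x', 'y', 'z'],
                    skipinitialspace=True)

print(len(default), len(no_header), len(named))
print(list(no_header.columns))
print(list(named.columns))
```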

Subtracting each row from the next row in a time-ordered column (and adding the result as a new column)

In [55]: df['interval'] = df['time'].shift(-1) -df['time']

In [56]: df
Out[56]:
time x y z interval
0 1069968951 0.140654 -0.144819 9.891772 742969.0
1 1070711920 0.150230 -0.154395 9.891772 740625.0
2 1071452545 0.140654 -0.154395 9.882196 740625.0
3 1072193170 0.135865 -0.144819 9.872621 738867.0
4 1072932037 0.145442 -0.140030 9.872621 738282.0
5 1073670319 0.145442 -0.130454 9.877409 738281.0
6 1074408600 0.140654 -0.130454 9.872621 738281.0
7 1075146881 0.145442 -0.154395 9.867833 738867.0
8 1075885748 0.159806 -0.130454 9.872621 739453.0
9 1076625201 0.135865 -0.144819 9.877409 NaN
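
As a cross-check, the same forward difference can be written with Series.diff(). This is a sketch on the first four tick values. Note that diff() alone gives the backward difference, with the NaN in the first row instead of the last.

```python
import pandas as pd

df = pd.DataFrame({'time': [1069968951, 1070711920, 1071452545, 1072193170]})

# Forward difference as in the session: next row minus current row.
forward = df['time'].shift(-1) - df['time']

# Equivalent: backward difference, then shifted up by one row.
via_diff = df['time'].diff().shift(-1)

# Plain diff(): current row minus previous row (NaN in the first row).
backward = df['time'].diff()

print(forward.tolist())
print(backward.tolist())
```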

Scalar arithmetic on a column (assigning the result back changes df itself, not a copy)

In [57]: df['interval'] = df['interval'] * ( 250.0/ 48) / 100000

In [58]: df
Out[58]:
time x y z interval
0 1069968951 0.140654 -0.144819 9.891772 38.696302
1 1070711920 0.150230 -0.154395 9.891772 38.574219
2 1071452545 0.140654 -0.154395 9.882196 38.574219
3 1072193170 0.135865 -0.144819 9.872621 38.482656
4 1072932037 0.145442 -0.140030 9.872621 38.452188
5 1073670319 0.145442 -0.130454 9.877409 38.452135
6 1074408600 0.140654 -0.130454 9.872621 38.452135
7 1075146881 0.145442 -0.154395 9.867833 38.482656
8 1075885748 0.159806 -0.130454 9.872621 38.513177
9 1076625201 0.135865 -0.144819 9.877409 NaN
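
The source does not say where the factor (250.0 / 48) / 100000 comes from. Numerically it equals the period in milliseconds of a 19.2 MHz clock, so one plausible reading is that the timer ticks at 19.2 MHz. That frequency is an inference, not something the source states. The sketch below only checks that the two expressions agree for the first interval.

```python
import math

# ASSUMPTION: a 19.2 MHz tick clock, inferred from the conversion factor.
TICK_HZ = 19.2e6

ticks = 742969  # first interval from the table above

# Conversion exactly as written in the session.
interval_ms_original = ticks * (250.0 / 48) / 100000

# The same conversion expressed through the assumed clock frequency.
interval_ms_from_clock = ticks / TICK_HZ * 1000

print(interval_ms_original)
print(interval_ms_from_clock)
print(math.isclose(interval_ms_original, interval_ms_from_clock))
```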

Creating a new column from an existing column

In [59]: df['hz'] = 1000 / df['interval']

In [60]: df
Out[60]:
time x y z interval hz
0 1069968951 0.140654 -0.144819 9.891772 38.696302 25.842263
1 1070711920 0.150230 -0.154395 9.891772 38.574219 25.924051
2 1071452545 0.140654 -0.154395 9.882196 38.574219 25.924051
3 1072193170 0.135865 -0.144819 9.872621 38.482656 25.985732
4 1072932037 0.145442 -0.140030 9.872621 38.452188 26.006323
5 1073670319 0.145442 -0.130454 9.877409 38.452135 26.006358
6 1074408600 0.140654 -0.130454 9.872621 38.452135 26.006358
7 1075146881 0.145442 -0.154395 9.867833 38.482656 25.985732
8 1075885748 0.159806 -0.130454 9.872621 38.513177 25.965139
9 1076625201 0.135865 -0.144819 9.877409 NaN NaN

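Analyzing a column: mean, maximum, minimum, and the index of the extreme values

Note that these are all methods, not attributes, so the parentheses are required.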

In [73]: df['hz'].mean()
Out[73]: 25.960667354462956

In [74]: df['hz'].max()
Out[74]: 26.00635801273499

In [75]: df['hz'].min()
Out[75]: 25.84226259776653

In [81]: df['hz'].count()
Out[81]: 9

In [108]: df['z'].idxmax()
Out[108]: 0

In [109]: df['z'].idxmin()
Out[109]: 7
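
A small sketch of these methods on a Series with a trailing NaN, using the first four hz values from the table above. NaN is skipped by default, which is why count() is one less than the number of rows.

```python
import pandas as pd

hz = pd.Series([25.842263, 25.924051, 25.924051, 25.985732, float('nan')])

print(hz.count())    # number of non-NaN values
print(hz.mean())     # NaN is skipped by default
print(hz.idxmax())   # index label of the largest value
print(hz.idxmin())   # index label of the smallest value

# Without parentheses you get the bound method, not the result.
print(callable(hz.mean))
```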

Reading the first few rows and skipping rows

In [111]: pd.read_csv('test.txt', nrows=5)
Out[111]:
1069968951 0.140654 -0.144819 9.891772
0 1070711920 0.150230 -0.154395 9.891772
1 1071452545 0.140654 -0.154395 9.882196
2 1072193170 0.135865 -0.144819 9.872621
3 1072932037 0.145442 -0.140030 9.872621
4 1073670319 0.145442 -0.130454 9.877409

In [112]: pd.read_csv('test.txt', skiprows=[2, 4])
Out[112]:
1069968951 0.140654 -0.144819 9.891772
0 1070711920 0.150230 -0.154395 9.891772
1 1072193170 0.135865 -0.144819 9.872621
2 1073670319 0.145442 -0.130454 9.877409
3 1074408600 0.140654 -0.130454 9.872621
4 1075146881 0.145442 -0.154395 9.867833
5 1075885748 0.159806 -0.130454 9.872621
6 1076625201 0.135865 -0.144819 9.877409
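
The same two parameters, sketched on the first five lines of test.txt with header=None so that no data row is consumed as a header. skiprows takes 0-based line numbers in the file.

```python
import io

import pandas as pd

raw = """1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
1072193170, 0.135865, -0.144819, 9.872621
1072932037, 0.145442, -0.140030, 9.872621
"""

# nrows: read only the first two data rows.
first_two = pd.read_csv(io.StringIO(raw), header=None, nrows=2)

# skiprows: drop file lines 2 and 4 (0-based), keeping lines 0, 1, 3.
skipped = pd.read_csv(io.StringIO(raw), header=None, skiprows=[2, 4])

print(first_two[0].tolist())
print(skipped[0].tolist())
```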

Writing the result to a file (to_csv)

pd.read_csv('test.txt', skiprows=[2, 4]).to_csv('1111')
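
By default to_csv also writes the row index as the first column. A minimal sketch, writing to an in-memory buffer instead of a file, with index=False to leave the index out. The hz values here are placeholders for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({'time': [1069968951, 1070711920],
                   'hz': [25.5, 26.0]})  # placeholder hz values

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False omits the row-index column
text = buffer.getvalue()

print(text)
```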

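Plotting Series and DataFrame

Plotting a Series or DataFrame is fairly simple: calling plot() directly produces a line chart, and common settings can be adjusted through the parameters of plot().

Calling plot() on a Series draws a line chart with the Series index on the x-axis and its values on the y-axis.

The chart type is chosen with the kind parameter, for example line, bar, barh, or kde. kde is a kernel density estimate, a smoothed curve showing how the values are distributed.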

In [43]: data = Series(np.random.rand(16), index=list('abcdefghijklmnop'))

In [46]: fig, axes = plt.subplots(2, 1)

In [49]: data.plot(kind='bar', ax=axes[0])
Out[49]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4119dc890>

In [50]: data.plot(kind='barh', ax=axes[1], alpha=0.7)
Out[50]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc408d8dc90>

In [51]: fig.savefig('222')

DataFrame

In [52]: data = DataFrame(np.random.rand(16).reshape(4,4), index=list('abcd'))

In [53]: data
Out[53]:
0 1 2 3
a 0.454907 0.373731 0.175647 0.928707
b 0.679350 0.946455 0.530186 0.279625
c 0.090639 0.260855 0.333095 0.670392
d 0.365898 0.227276 0.158680 0.919795

In [54]: data.plot()
Out[54]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4117b1610>

In [55]: data.plot(kind='bar')
Out[55]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc41171cfd0>

In [56]: data = DataFrame(np.random.rand(16).reshape(4,4), index=list('abcd'))

In [57]: fig, axes = plt.subplots(2, 1)

In [58]: data.plot(kind='bar', ax=axes[0])
Out[58]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4115d0610>

In [59]: data.plot(kind='line', ax=axes[1], alpha=0.7)
Out[59]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4115526d0>

Plotting each DataFrame column in its own subplot

In [62]: data.plot(kind='line', alpha=0.7, subplots=True)
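
A minimal sketch of subplots=True, letting pandas create one axes per column. No ax argument is passed, since a single Axes cannot hold several subplots. The non-interactive Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; an assumption for scripted runs

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(16).reshape(4, 4), index=list('abcd'))

# With subplots=True, plot() returns an array with one Axes per column.
axes = data.plot(kind='line', alpha=0.7, subplots=True)

print(len(axes))
```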

Plotting Series and DataFrame (reference link)

http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/how-pandas-uses-matplotlib-plus-figures-axes-and-subplots/

A DataFrame column can be specified as the x or y coordinate of the plot

plot(x='year', y='unemployment', ax=ax, legend=False)
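
A self-contained sketch of that call. The DataFrame and its values are made up purely for illustration; only the column names year and unemployment come from the line above.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; an assumption for scripted runs

import matplotlib.pyplot as plt
import pandas as pd

# Made-up illustrative values, not real statistics.
df = pd.DataFrame({'year': [2015, 2016, 2017],
                   'unemployment': [5.3, 4.9, 4.4]})

fig, ax = plt.subplots()
df.plot(x='year', y='unemployment', ax=ax, legend=False)

line = ax.get_lines()[0]
print(list(line.get_xdata()))
print(list(line.get_ydata()))
```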
