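Problem: analyzing the accelerometer sample rate
I recently needed to analyze a sensor's sample rate. The first column of the log is the timer tick count, and the sample rate has to be computed from it. I have done similar analyses before, but after a long break I had forgotten almost everything, and my earlier notes could not be reproduced and were full of errors. So this time I am recording everything, including the raw data and each step, for future reference.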
In [103]: !cat test.txt
1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
1072193170, 0.135865, -0.144819, 9.872621
1072932037, 0.145442, -0.140030, 9.872621
1073670319, 0.145442, -0.130454, 9.877409
1074408600, 0.140654, -0.130454, 9.872621
1075146881, 0.145442, -0.154395, 9.867833
1075885748, 0.159806, -0.130454, 9.872621
1076625201, 0.135865, -0.144819, 9.877409
numpy
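Why introduce ndarray
pandas is built on numpy, so even if you use pandas directly it is worth understanding numpy; it also helps with understanding tensors in deep learning. The core data structure of numpy is the ndarray, for example: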
In [104]: nda1 = np.arange(18).reshape(3,6)
In [105]: nda1
Out[105]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17]])
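The purpose of ndarray is vectorization: an operation is written once over the whole array and executed in numpy's compiled code, replacing an explicit Python for loop over the elements.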
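As a minimal sketch of what vectorization replaces, the snippet below computes the tick differences between consecutive samples twice: once with an explicit loop and once with a single array expression. The variable names are illustrative, and the four tick values are taken from the first rows of test.txt.

```python
import numpy as np

ticks = np.array([1069968951, 1070711920, 1071452545, 1072193170], dtype=np.int64)

# Explicit Python loop: one subtraction per iteration.
loop_deltas = []
for i in range(len(ticks) - 1):
    loop_deltas.append(int(ticks[i + 1] - ticks[i]))

# Vectorized form: one expression over whole slices, no Python-level loop.
vec_deltas = ticks[1:] - ticks[:-1]

print(loop_deltas)
print(vec_deltas.tolist())
```

Both forms give the same numbers; the vectorized one is shorter and leaves the element-by-element work to numpy.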
ndarray attributes
The attributes of an ndarray include ndim, shape, and dtype. Note the astype method in particular: it converts the array to the data type you want.
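A short sketch of these attributes, reusing the nda1 array from the example above. The choice of float64 as the target type is only an illustration.

```python
import numpy as np

nda1 = np.arange(18).reshape(3, 6)

print(nda1.ndim)   # number of dimensions
print(nda1.shape)  # size along each dimension
print(nda1.dtype)  # element type (the default integer type depends on the platform)

# astype returns a new array with the requested dtype; nda1 itself is unchanged.
as_float = nda1.astype(np.float64)
print(as_float.dtype)
```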
ndarray operations
Common operations include selecting rows, columns, and blocks, and arithmetic between an ndarray and a scalar (broadcasting).
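The operations listed above can be sketched as follows, again on the nda1 array. The variable names are illustrative.

```python
import numpy as np

nda1 = np.arange(18).reshape(3, 6)

row = nda1[1]            # second row, shape (6,)
col = nda1[:, 2]         # third column, shape (3,)
block = nda1[0:2, 1:4]   # rows 0-1 and columns 1-3, shape (2, 3)

# Broadcasting with a scalar: the scalar is applied to every element.
scaled = nda1 * 10

# Broadcasting a 1-D array across rows: shape (6,) against shape (3, 6).
shifted = nda1 - nda1[0]

print(row, col, block, sep='\n')
```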
DataFrame
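pandas is built on numpy and makes numpy-centered applications more convenient.
The main data structures in pandas are Series and DataFrame. Every item in a Series has a corresponding index label, while a DataFrame has both a row index and a column index, i.e. it holds tabular data.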
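A minimal sketch of the two structures. The index labels 'a', 'b', 'c' are made up for illustration; the values come from test.txt.

```python
import pandas as pd

# A Series: values plus an index (the labels here are illustrative).
s = pd.Series([0.140654, 0.150230, 0.140654], index=['a', 'b', 'c'])

# A DataFrame: a row index plus column labels.
df = pd.DataFrame({'time': [1069968951, 1070711920], 'z': [9.891772, 9.891772]})

print(s['b'])
print(df.index.tolist(), df.columns.tolist())
```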
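pandas operations
Selecting rows, columns, and blocks
Pay particular attention to loc and iloc for row selection: loc selects by label, iloc selects by integer position. When the rows have both a name and a positional index, use loc to select by name.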
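The difference between loc and iloc is easiest to see when the row labels are not plain integers. The sketch below assumes hypothetical row labels 's0', 's1', 's2'; they do not appear in the original data.

```python
import pandas as pd

df = pd.DataFrame(
    {'x': [0.140654, 0.150230, 0.140654], 'z': [9.891772, 9.891772, 9.882196]},
    index=['s0', 's1', 's2'],  # hypothetical row labels
)

col = df['z']                 # one column, returned as a Series
by_label = df.loc['s1']       # row selected by label
by_pos = df.iloc[1]           # row selected by integer position
block = df.iloc[0:2, 0:1]     # positional slice: the end is exclusive
label_block = df.loc['s0':'s1', 'x':'z']  # label slice: the end is inclusive

print(by_label)
print(label_block)
```

Note that a label slice with loc includes its end point, while a positional slice with iloc does not.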
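File-based operations
Read data from a file, analyze it with DataFrame and Series, then save the result back to a file. Most practical work is file-based. Take the file "test.txt" as an example.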
In [48]: !cat test.txt
1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
1072193170, 0.135865, -0.144819, 9.872621
1072932037, 0.145442, -0.140030, 9.872621
1073670319, 0.145442, -0.130454, 9.877409
1074408600, 0.140654, -0.130454, 9.872621
1075146881, 0.145442, -0.154395, 9.867833
1075885748, 0.159806, -0.130454, 9.872621
1076625201, 0.135865, -0.144819, 9.877409
Reading the file (by default the first line is used as the column names)
In [49]: df = pd.read_csv('test.txt')
In [50]: df
Out[50]:
1069968951 0.140654 -0.144819 9.891772
0 1070711920 0.150230 -0.154395 9.891772
1 1071452545 0.140654 -0.154395 9.882196
2 1072193170 0.135865 -0.144819 9.872621
3 1072932037 0.145442 -0.140030 9.872621
4 1073670319 0.145442 -0.130454 9.877409
5 1074408600 0.140654 -0.130454 9.872621
6 1075146881 0.145442 -0.154395 9.867833
7 1075885748 0.159806 -0.130454 9.872621
8 1076625201 0.135865 -0.144819 9.877409
Reading the file with header=None: the column names are generated automatically
In [51]: df = pd.read_csv('test.txt', header=None)
In [52]: df
Out[52]:
0 1 2 3
0 1069968951 0.140654 -0.144819 9.891772
1 1070711920 0.150230 -0.154395 9.891772
2 1071452545 0.140654 -0.154395 9.882196
3 1072193170 0.135865 -0.144819 9.872621
4 1072932037 0.145442 -0.140030 9.872621
5 1073670319 0.145442 -0.130454 9.877409
6 1074408600 0.140654 -0.130454 9.872621
7 1075146881 0.145442 -0.154395 9.867833
8 1075885748 0.159806 -0.130454 9.872621
9 1076625201 0.135865 -0.144819 9.877409
Reading the file with the names parameter
In [55]: names = ['time', 'x', 'y', 'z']
In [53]: df = pd.read_csv('test.txt', names=names)
In [54]: df
Out[54]:
time x y z
0 1069968951 0.140654 -0.144819 9.891772
1 1070711920 0.150230 -0.154395 9.891772
2 1071452545 0.140654 -0.154395 9.882196
3 1072193170 0.135865 -0.144819 9.872621
4 1072932037 0.145442 -0.140030 9.872621
5 1073670319 0.145442 -0.130454 9.877409
6 1074408600 0.140654 -0.130454 9.872621
7 1075146881 0.145442 -0.154395 9.867833
8 1075885748 0.159806 -0.130454 9.872621
9 1076625201 0.135865 -0.144819 9.877409
Subtracting each row from the next within a column of the time series (and adding the result as a new column)
In [55]: df['interval'] = df['time'].shift(-1) -df['time']
In [56]: df
Out[56]:
time x y z interval
0 1069968951 0.140654 -0.144819 9.891772 742969.0
1 1070711920 0.150230 -0.154395 9.891772 740625.0
2 1071452545 0.140654 -0.154395 9.882196 740625.0
3 1072193170 0.135865 -0.144819 9.872621 738867.0
4 1072932037 0.145442 -0.140030 9.872621 738282.0
5 1073670319 0.145442 -0.130454 9.877409 738281.0
6 1074408600 0.140654 -0.130454 9.872621 738281.0
7 1075146881 0.145442 -0.154395 9.867833 738867.0
8 1075885748 0.159806 -0.130454 9.872621 739453.0
9 1076625201 0.135865 -0.144819 9.877409 NaN
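As a side note, pandas also offers Series.diff() for this kind of row-to-row difference. The sketch below compares it with the shift(-1) form used above; the only difference is which row receives the NaN.

```python
import pandas as pd

df = pd.DataFrame({'time': [1069968951, 1070711920, 1071452545, 1072193170]})

# shift(-1) pairs each row with the next one, so the NaN lands on the last row.
forward = df['time'].shift(-1) - df['time']

# diff() pairs each row with the previous one, so the NaN lands on the first row.
backward = df['time'].diff()

print(forward.tolist())
print(backward.tolist())
```

Either form works for computing intervals; the shift(-1) version keeps each interval on the row where it starts.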
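Scalar arithmetic on a column (assigning the result back to the same column updates df directly, no separate copy of the DataFrame is made); here the factor converts the tick counts to milliseconds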
In [57]: df['interval'] = df['interval'] * ( 250.0/ 48) / 100000
In [58]: df
Out[58]:
time x y z interval
0 1069968951 0.140654 -0.144819 9.891772 38.696302
1 1070711920 0.150230 -0.154395 9.891772 38.574219
2 1071452545 0.140654 -0.154395 9.882196 38.574219
3 1072193170 0.135865 -0.144819 9.872621 38.482656
4 1072932037 0.145442 -0.140030 9.872621 38.452188
5 1073670319 0.145442 -0.130454 9.877409 38.452135
6 1074408600 0.140654 -0.130454 9.872621 38.452135
7 1075146881 0.145442 -0.154395 9.867833 38.482656
8 1075885748 0.159806 -0.130454 9.872621 38.513177
9 1076625201 0.135865 -0.144819 9.877409 NaN
Deriving a new column from an existing one
In [59]: df['hz'] = 1000 / df['interval']
In [60]: df
Out[60]:
time x y z interval hz
0 1069968951 0.140654 -0.144819 9.891772 38.696302 25.842263
1 1070711920 0.150230 -0.154395 9.891772 38.574219 25.924051
2 1071452545 0.140654 -0.154395 9.882196 38.574219 25.924051
3 1072193170 0.135865 -0.144819 9.872621 38.482656 25.985732
4 1072932037 0.145442 -0.140030 9.872621 38.452188 26.006323
5 1073670319 0.145442 -0.130454 9.877409 38.452135 26.006358
6 1074408600 0.140654 -0.130454 9.872621 38.452135 26.006358
7 1075146881 0.145442 -0.154395 9.867833 38.482656 25.985732
8 1075885748 0.159806 -0.130454 9.872621 38.513177 25.965139
9 1076625201 0.135865 -0.144819 9.877409 NaN NaN
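The three steps above (difference, scaling, inversion) can be collected into one helper. This is only a sketch: the function name add_sample_rate and the constant TICKS_TO_MS are made up here, the conversion factor is copied from the session above, and io.StringIO stands in for the file so the snippet runs on its own.

```python
import io
import pandas as pd

RAW = """1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
1072193170, 0.135865, -0.144819, 9.872621
"""

# Tick-to-millisecond factor copied from the session above.
TICKS_TO_MS = (250.0 / 48) / 100000


def add_sample_rate(df):
    """Return a copy of df with 'interval' (ms) and 'hz' columns added."""
    out = df.copy()
    out['interval'] = (out['time'].shift(-1) - out['time']) * TICKS_TO_MS
    out['hz'] = 1000 / out['interval']
    return out


df = pd.read_csv(io.StringIO(RAW), names=['time', 'x', 'y', 'z'])
result = add_sample_rate(df)
print(result[['time', 'interval', 'hz']])
```

Working on a copy keeps the DataFrame that was read from the file unchanged, which makes it easier to rerun the analysis.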
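Analyzing a column: mean, maximum, minimum, and the index of the extreme values
Note that these are all methods, not attributes, so they need ().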
In [73]: df['hz'].mean()
Out[73]: 25.960667354462956
In [74]: df['hz'].max()
Out[74]: 26.00635801273499
In [75]: df['hz'].min()
Out[75]: 25.84226259776653
In [81]: df['hz'].count()
Out[81]: 9
In [108]: df['z'].idxmax()
Out[108]: 0
In [109]: df['z'].idxmin()
Out[109]: 7
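count() returns 9 above even though df has 10 rows, because the last hz value is NaN. The small sketch below, on a made-up four-element Series, shows how these methods treat NaN by default.

```python
import numpy as np
import pandas as pd

hz = pd.Series([25.842263, 25.924051, 26.006358, np.nan])

print(len(hz))      # counts every row, including the NaN
print(hz.count())   # counts only the non-NaN values
print(hz.mean())    # NaN is skipped by default (skipna=True)
print(hz.idxmax())  # index label of the largest value
print(hz.idxmin())  # index label of the smallest value
```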
Reading the first few rows and skipping rows
In [111]: pd.read_csv('test.txt', nrows=5)
Out[111]:
1069968951 0.140654 -0.144819 9.891772
0 1070711920 0.150230 -0.154395 9.891772
1 1071452545 0.140654 -0.154395 9.882196
2 1072193170 0.135865 -0.144819 9.872621
3 1072932037 0.145442 -0.140030 9.872621
4 1073670319 0.145442 -0.130454 9.877409
In [112]: pd.read_csv('test.txt', skiprows=[2, 4])
Out[112]:
1069968951 0.140654 -0.144819 9.891772
0 1070711920 0.150230 -0.154395 9.891772
1 1072193170 0.135865 -0.144819 9.872621
2 1073670319 0.145442 -0.130454 9.877409
3 1074408600 0.140654 -0.130454 9.872621
4 1075146881 0.145442 -0.154395 9.867833
5 1075885748 0.159806 -0.130454 9.872621
6 1076625201 0.135865 -0.144819 9.877409
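In the two calls above the first data line is consumed as the header. A sketch of the same options combined with names=, so that no sample is lost to the header; io.StringIO stands in for test.txt here.

```python
import io
import pandas as pd

RAW = """1069968951, 0.140654, -0.144819, 9.891772
1070711920, 0.150230, -0.154395, 9.891772
1071452545, 0.140654, -0.154395, 9.882196
1072193170, 0.135865, -0.144819, 9.872621
1072932037, 0.145442, -0.140030, 9.872621
"""
names = ['time', 'x', 'y', 'z']

# With names= given, no line is used as a header, so nrows=2 yields the first two samples.
head = pd.read_csv(io.StringIO(RAW), names=names, nrows=2)

# skiprows takes 0-based line numbers in the file: drop the 3rd and 5th lines.
skipped = pd.read_csv(io.StringIO(RAW), names=names, skiprows=[2, 4])

print(head)
print(skipped)
```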
Writing the result to a file (to_csv)
pd.read_csv('test.txt', skiprows=[2, 4]).to_csv('1111')
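By default to_csv also writes the row index as the first column. If that is not wanted, index=False leaves it out. The sketch below writes to an in-memory buffer instead of a real file so that it is self-contained.

```python
import io
import pandas as pd

df = pd.DataFrame({'time': [1069968951, 1070711920], 'z': [9.891772, 9.891772]})

buf = io.StringIO()          # stands in for a file path
df.to_csv(buf, index=False)  # index=False leaves out the row-index column
text = buf.getvalue()

# Reading the text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(text))
print(text)
```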
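Plotting Series and DataFrame
Plotting a Series or DataFrame is fairly simple: calling plot() directly produces a line chart, and common settings can be adjusted through the arguments of plot().
Calling plot() on a Series object draws a line chart with the Series index on the x-axis and its values on the y-axis.
The chart type is chosen with the kind parameter: line, bar, barh, or kde (a kernel density estimate plot).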
In [43]: data = Series(np.random.rand(16), index=list('abcdefghijklmnop'))
In [46]: fig, axes = plt.subplots(2, 1)
In [49]: data.plot(kind='bar', ax=axes[0])
Out[49]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4119dc890>
In [50]: data.plot(kind='barh', ax=axes[1], alpha=0.7)
Out[50]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc408d8dc90>
In [51]: fig.savefig('222')
DataFrame
In [52]: data = DataFrame(np.random.rand(16).reshape(4,4), index=list('abcd'))
In [53]: data
Out[53]:
0 1 2 3
a 0.454907 0.373731 0.175647 0.928707
b 0.679350 0.946455 0.530186 0.279625
c 0.090639 0.260855 0.333095 0.670392
d 0.365898 0.227276 0.158680 0.919795
In [54]: data.plot()
Out[54]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4117b1610>
In [55]: data.plot(kind='bar')
Out[55]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc41171cfd0>
In [56]: data = DataFrame(np.random.rand(16).reshape(4,4), index=list('abcd'))
In [57]: fig, axes = plt.subplots(2, 1)
In [58]: data.plot(kind='bar', ax=axes[0])
Out[58]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4115d0610>
In [59]: data.plot(kind='line', ax=axes[1], alpha=0.7)
Out[59]: <matplotlib.axes._subplots.AxesSubplot at 0x7fc4115526d0>
Drawing each DataFrame column in its own subplot
In [62]: data.plot(kind='line', alpha=0.7, subplots=True)
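A self-contained sketch of the per-column layout. It assumes matplotlib is installed and uses the non-interactive Agg backend so that no window is needed; the random data mirrors the session above.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(16).reshape(4, 4), index=list('abcd'))

# With subplots=True and no ax argument, pandas creates one axes per column
# and returns them as an array.
axes = data.plot(kind='line', alpha=0.7, subplots=True)
print(len(axes))
```

The figure can then be saved through any of the returned axes, for example with axes[0].get_figure().savefig(...).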
Plotting Series and DataFrame
Columns of the DataFrame can be specified as the x and y coordinates of the chart.
plot(x='year', y='unemployment', ax=ax, legend=False)
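The same x= and y= arguments can be applied to the accelerometer data, for instance plotting z against the tick count. This is a sketch on the first four rows of test.txt, assuming matplotlib is available; the Agg backend avoids opening a window.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'time': [1069968951, 1070711920, 1071452545, 1072193170],
    'z': [9.891772, 9.891772, 9.882196, 9.872621],
})

fig, ax = plt.subplots()
# x= and y= name the columns used for the horizontal and vertical axes.
df.plot(x='time', y='z', ax=ax, legend=False)
line = ax.get_lines()[0]
print(list(line.get_ydata()))
```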