Reading Notes on 《NLTK基础教程》, Part 008

bright_silmarillion

Article link: https://blog.csdn.net/bright_silmarillion/article/details/81016376

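Is this chapter mainly about machine learning?
Well, it is billed as machine learning, but it really just covers how to use numpy, pandas, scipy and matplotlib; there is no guidance at all on higher-level libraries such as tensorflow, caffe or keras.

The result of np.logspace(0,1) is not just the two lines shown; it is obvious at a glance that a lot was elided. The actual result is: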

array([ 1.        ,  1.04811313,  1.09854114,  1.1513954 ,  1.20679264,
        1.26485522,  1.32571137,  1.38949549,  1.45634848,  1.52641797,
        1.59985872,  1.67683294,  1.75751062,  1.84206997,  1.93069773,
        2.02358965,  2.12095089,  2.22299648,  2.32995181,  2.44205309,
        2.55954792,  2.6826958 ,  2.8117687 ,  2.9470517 ,  3.0888436 ,
        3.23745754,  3.39322177,  3.55648031,  3.72759372,  3.90693994,
        4.09491506,  4.29193426,  4.49843267,  4.71486636,  4.94171336,
        5.17947468,  5.42867544,  5.68986603,  5.96362332,  6.25055193,
        6.55128557,  6.86648845,  7.19685673,  7.54312006,  7.90604321,
        8.28642773,  8.68511374,  9.10298178,  9.54095476, 10.        ])
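The length of that output follows from the default sample count of np.logspace. A minimal sketch, where the variable names `values` and `short` are my own and not from the book:

```python
import numpy as np

# np.logspace(start, stop) returns 50 points by default (num=50),
# spaced evenly on a log scale between 10**start and 10**stop.
values = np.logspace(0, 1)
print(values.shape)            # number of samples
print(values[0], values[-1])   # endpoints: 10**0 and 10**1

# Passing num explicitly gives a shorter array that is easier to read.
short = np.logspace(0, 1, num=5)
print(short)
```

With num left at its default, the array has 50 entries, which matches the ten rows of five values above.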

8.1.3
The assignment to B there is clearly wrong; it should obviously be: B = np.array([n for n in range(4)])

Compare the following two:

np.repeat(A,2)
array([1, 1, 2, 2, 3, 3, 4, 4])

np.tile(A, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
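The two outputs above can be reproduced with the sketch below. It assumes A is the 2x2 array [[1, 2], [3, 4]], which is inferred from the printed results rather than stated in these notes:

```python
import numpy as np

# Assumed input, inferred from the outputs shown in the notes.
A = np.array([[1, 2], [3, 4]])

# With no axis given, repeat flattens A and repeats each element in place.
repeated = np.repeat(A, 2)
print(repeated)

# tile repeats the whole array as a block, here twice along the last axis.
tiled = np.tile(A, 2)
print(tiled)
```

So repeat duplicates individual elements, while tile duplicates the array as a unit and keeps its 2-D shape.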

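According to https://docs.scipy.org/doc/scipy/reference/linalg.html, linalg does not provide a dot method, so there must be a typo here. The author seems to have written this part in a hurry and the code is a mess: a single equals sign would have been enough, and the initial line that calls solve directly for X is bound to fail because solve is never defined. Enough grumbling; moving on.
Also, because no random seed is set, the results cannot be checked for correctness.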
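One way to make this example reproducible and checkable is sketched below. This is not the book's code: the seed, the matrix size and the added diagonal term (which keeps the random matrix safely invertible) are all my own assumptions.

```python
import numpy as np
from scipy import linalg

# Fixed seed so every run produces the same matrix and vector.
rng = np.random.default_rng(0)

# Random 3x3 system; adding 3*I makes it strictly diagonally dominant,
# hence invertible (an assumption made for this sketch only).
A = rng.random((3, 3)) + 3 * np.eye(3)
b = rng.random(3)

# Solve A x = b with scipy, then verify with numpy's dot product.
x = linalg.solve(A, b)
residual_ok = np.allclose(np.dot(A, x), b)
print(x)
print(residual_ok)
```

Here solve comes from scipy.linalg and the product from np.dot, which sidesteps the missing linalg.dot.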

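In the sparse matrix part, there was presumably a from numpy import * somewhere earlier, because some of the functions used clearly come from numpy yet carry no library prefix. Just keep that in mind.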

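from scipy import sparse as sp
import numpy as np

# Dense diagonal matrix and its compressed sparse row (CSR) form
A = np.array([[1,0,0],[0,2,0],[0,0,3]])
C = sp.csr_matrix(A)

print(A)
print(C)                    # shows only the stored non-zero entries
print(C.toarray())          # back to a dense ndarray
print(C * C.todense())      # sparse times dense: matrix product
print(C.dot(C).todense())   # sparse times sparse: use C.dot rather than np.dot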

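As for the scipy optimization methods that come next, I should study them properly later; quite a few of them are the same algorithms found in MATLAB. Something to practice.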
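As a starting point for that practice, here is a minimal sketch of scipy.optimize.minimize. The objective function and starting point are made up for illustration and do not come from the book:

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    """Hypothetical quadratic bowl with its minimum at (3, -1)."""
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

# Start from the origin and let scipy choose its default method
# for an unconstrained problem.
result = minimize(objective, x0=np.zeros(2))
print(result.x)    # location of the minimum found
print(result.fun)  # objective value at that point
```

The returned object also carries fields such as result.success and result.message, which help when a run does not converge.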

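pandas: a serious error in the data reference!
The example starts out using iris.data, yet no link to the file is given anywhere. Here is the actual link:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
How careless were the original author, the translator and the proofreader?
The result of data.describe():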

describe:        sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

It is precisely because of this result that sepal_len_cnt, printed later, gives the following output:

5.0    10
6.3     9
5.1     9
6.7     8
5.7     8
5.5     7
5.8     7
6.4     7
6.0     6
4.9     6
6.1     6
5.4     6
5.6     6
6.5     5
4.8     5
7.7     4
6.9     4
5.2     4
6.2     4
4.6     4
7.2     3
6.8     3
4.4     3
5.9     3
6.6     2
4.7     2
7.6     1
7.4     1
4.3     1
7.9     1
7.3     1
7.0     1
4.5     1
5.3     1
7.1     1
Name: sepal length, dtype: int64

The next one is obviously wrong as well: it should not be data['Iris-setosa'] but data['Cat'].
The result is:

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Cat, dtype: int64
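The two value counts above depend on the columns being named when the file is loaded. A minimal sketch follows, assuming the column names that appear in the outputs here. To keep it runnable offline it reads four sample rows in the file's header-less CSV layout instead of the real URL:

```python
import io
import pandas as pd

# A few rows in the same comma-separated, header-less layout as iris.data.
# To use the real file, pass the iris.data URL to read_csv instead.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Column names taken from the outputs in these notes.
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'Cat']
data = pd.read_csv(sample, header=None, names=columns)

# value_counts tallies how often each distinct value occurs in a column.
sepal_len_cnt = data['sepal length'].value_counts()
cat_cnt = data['Cat'].value_counts()
print(sepal_len_cnt)
print(cat_cnt)
```

Indexing with data['Iris-setosa'] would raise a KeyError, since that string is a value in the Cat column and not a column name.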

The result of sntsosa[:5] is:

   sepal length  sepal width  petal length  petal width          Cat
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
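A subset like sntsosa is typically built with a boolean mask on the Cat column. The sketch below assumes that is how it was produced; the three-row frame is made up for illustration:

```python
import pandas as pd

# Tiny stand-in for the iris frame (illustrative values only).
data = pd.DataFrame({
    'sepal length': [5.1, 7.0, 4.9],
    'Cat': ['Iris-setosa', 'Iris-versicolor', 'Iris-setosa'],
})

# The comparison yields a boolean Series; indexing with it keeps matching rows.
sntsosa = data[data['Cat'] == 'Iris-setosa']
print(sntsosa[:5])  # slicing a DataFrame with [:5] returns its first rows
```

The filtered frame keeps the original row labels, which is why the index can have gaps after filtering.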

Below are the results of the Dow Jones example, excluding the head and resample outputs:

1453438639
7.621739999999999
DatetimeIndex(['2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28'],
              dtype='datetime64[ns]', name='date', freq=None)
Int64Index([ 7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4,
            11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4, 11, 18, 25,  4, 11,
            18, 25,  7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21,
            28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4, 11, 18, 25,
             4, 11, 18, 25,  7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7,
            14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28],
           dtype='int64', name='date')
Int64Index([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
            3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2,
            3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2,
            2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,
            2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1],
           dtype='int64', name='date')
Int64Index([2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011],
           dtype='int64', name='date')
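The three integer indexes above look like the day, month and year attributes of the DatetimeIndex. A minimal sketch of those accessors on a three-date index of my own choosing:

```python
import pandas as pd

# Small date index named 'date', mirroring the index name in the output above.
idx = pd.DatetimeIndex(['2011-01-07', '2011-01-14', '2011-02-04'], name='date')

# Each attribute returns an integer index aligned with the dates.
days = idx.day
months = idx.month
years = idx.year
print(days)
print(months)
print(years)
```

The exact class name printed for these integer indexes depends on the pandas version, so it may not read Int64Index on a newer install.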

When running resample, a deprecation warning appears: how in .resample() is deprecated
Reference: http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#resample-api
Simply change the code to: print(stockdata.resample('M').sum())
This gives the following result:

            quarter      volume  percent_change_price              ...               percent_change_next_weeks_price  days_to_next_dividend  percent_return_next_dividend
date                                                               ...
2011-01-31       36  6779916771             19.637287              ...                                     34.302458                   2618                     18.519712
2011-02-28       32  5713027799             28.553732              ...                                     -4.583387                   1637                     13.819996
2011-03-31       32  5535580114             -7.317345              ...                                      3.263918                   1560                     13.930990

[3 rows x 8 columns]
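The method-chaining form of resample can be tried on a toy frame as sketched below. The data is invented. I use the 'MS' (month start) alias here because recent pandas releases warn about 'M' and prefer 'ME', so the bin labels differ from the month-end dates shown above:

```python
import pandas as pd

# Invented weekly volumes on a DatetimeIndex named 'date'.
stock = pd.DataFrame(
    {'volume': [1, 2, 3, 4]},
    index=pd.DatetimeIndex(
        ['2011-01-07', '2011-01-14', '2011-02-04', '2011-02-11'], name='date'
    ),
)

# Old style was resample('M', how='sum'); the newer API chains the aggregation.
monthly = stock.resample('MS').sum()
print(monthly)
```

Each row of monthly is the sum over one calendar month, labelled with the first day of that month.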

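The later steps trigger a deprecation warning as well: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects()

Change the three lines of code here to the following. Passing regex=False makes '$' a literal character, not the regex end-of-string anchor:

stockdata_new['open'] = pd.to_numeric(stockdata_new['open'].str.replace('$', '', regex=False))
stockdata_new['close'] = pd.to_numeric(stockdata_new['close'].str.replace('$', '', regex=False))
(stockdata_new['close'] - stockdata_new['open']).infer_objects()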
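The same conversion can be checked on a small frame. In this sketch the frame name `prices` and its two rows are illustrative; the values echo the open and close columns printed in these notes:

```python
import pandas as pd

# Price columns stored as strings with a leading dollar sign.
prices = pd.DataFrame({
    'open': ['$15.82', '$16.71'],
    'close': ['$16.42', '$15.97'],
})

for col in ('open', 'close'):
    # regex=False treats '$' literally, not as the end-of-string anchor.
    cleaned = prices[col].str.replace('$', '', regex=False)
    prices[col] = pd.to_numeric(cleaned)

# With numeric dtypes in place, column arithmetic works directly.
change = prices['close'] - prices['open']
print(prices.dtypes)
print(change)
```

After the conversion both columns are float64, so no further type inference is needed before subtracting them.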

The last two head calls in this part of the code give:

date
2011-01-07    12.656
2011-01-14    13.368
2011-01-21    12.952
2011-01-28    12.696
2011-02-04    12.944
Name: newopen, dtype: float64

           stock   open    high     low  close     volume  newopen
date
2011-01-07    AA  15.82  $16.72  $15.78  16.42  239655616   12.656
2011-01-14    AA  16.71  $16.71  $15.64  15.97  242963398   13.368
2011-01-21    AA  16.19  $16.38  $15.60  15.79  138428495   12.952
2011-01-28    AA  15.87  $16.63  $15.82  16.13  151379173   12.696
2011-02-04    AA  16.18  $17.39  $16.18  17.14  154387761   12.944
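The newopen values above are consistent with 0.8 times the open column (for example 0.8 x 15.82 = 12.656). The sketch below assumes that this is how the column was derived; the exact expression used in the book is not shown in these notes:

```python
import pandas as pd

# First three open prices from the table above.
stockdata_new = pd.DataFrame({'open': [15.82, 16.71, 16.19]})

# Assumed derivation: scale the open price by 0.8.
stockdata_new['newopen'] = stockdata_new['open'] * 0.8
print(stockdata_new.head())
```

The printed newopen column should match 12.656, 13.368 and 12.952 up to floating-point rounding.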

I have no idea why the following four lines are there; just delete them:

plt.subplot(2, 2, 3)
plt.plot(x, y, 'g--')
plt.subplot(2, 2, 4)
plt.plot(x, y, 'r-*')
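For reference, plt.subplot(2, 2, k) makes cell k of a 2x2 grid the current axes, so calls like these draw into the third and fourth cells. A minimal sketch of a full 2x2 figure follows, in which x, y and the first two line styles are made up and the Agg backend is selected only so that it runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data, not taken from the book.
x = np.linspace(0, 5, 10)
y = x ** 2

fig = plt.figure()
# One format string per cell of the 2x2 grid, filled row by row.
styles = ['r', 'b*-', 'g--', 'r-*']
for cell, style in enumerate(styles, start=1):
    plt.subplot(2, 2, cell)
    plt.plot(x, y, style)

fig.savefig('subplots_demo.png')  # hypothetical output file name
```

Whether the last two cells belong in the book's figure is a separate question; the sketch only shows what the calls do.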

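This chapter produces four figures:
[Figure: image not available]

There are two extra pieces of code afterwards; their plots are the following two:
[Figure: image not available]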
