《NLTK基础教程》 Reading Notes, Part 008

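Is this chapter mainly about machine learning?
Well, it is billed as machine learning, but it only covers how to use numpy, pandas, scipy, and matplotlib; there is no guidance on higher-level libraries such as tensorflow, caffe, or keras.

The output of np.logspace(0,1) is not just two lines; the book's listing is visibly truncated. The real result is: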

array([ 1.        ,  1.04811313,  1.09854114,  1.1513954 ,  1.20679264,
        1.26485522,  1.32571137,  1.38949549,  1.45634848,  1.52641797,
        1.59985872,  1.67683294,  1.75751062,  1.84206997,  1.93069773,
        2.02358965,  2.12095089,  2.22299648,  2.32995181,  2.44205309,
        2.55954792,  2.6826958 ,  2.8117687 ,  2.9470517 ,  3.0888436 ,
        3.23745754,  3.39322177,  3.55648031,  3.72759372,  3.90693994,
        4.09491506,  4.29193426,  4.49843267,  4.71486636,  4.94171336,
        5.17947468,  5.42867544,  5.68986603,  5.96362332,  6.25055193,
        6.55128557,  6.86648845,  7.19685673,  7.54312006,  7.90604321,
        8.28642773,  8.68511374,  9.10298178,  9.54095476, 10.        ])
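
To check the length of that output, here is a minimal sketch. It relies on np.logspace defaulting to num=50 and base=10, so np.logspace(0,1) returns 50 points from 10**0 to 10**1; the variable names are mine, not the book's.

```python
import numpy as np

# Default arguments: 50 samples, base 10, endpoint included.
values = np.logspace(0, 1)

# The same grid written out explicitly: 10 raised to 50 evenly spaced exponents.
explicit = 10 ** np.linspace(0, 1, 50)

print(len(values))               # number of samples
print(values[0], values[-1])     # first and last sample
print(np.allclose(values, explicit))
```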

8.1.3
The assignment to B is clearly wrong; it should be: B = np.array([n for n in range(4)])

Compare:

np.repeat(A,2)
array([1, 1, 2, 2, 3, 3, 4, 4])

np.tile(A, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
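
The two outputs above are consistent with A being the 2x2 array [[1, 2], [3, 4]]; that value is my assumption, inferred from the results. Under that assumption, a small sketch of the difference: np.repeat without an axis flattens the array and repeats each element, while np.tile repeats the whole array as a block.

```python
import numpy as np

# Assumed value of A, inferred from the outputs shown in these notes.
A = np.array([[1, 2], [3, 4]])

repeated = np.repeat(A, 2)              # flattens, then repeats each element
repeated_cols = np.repeat(A, 2, axis=1) # repeats each element along the columns
tiled = np.tile(A, 2)                   # repeats the whole array side by side

print(repeated)
print(repeated_cols)
print(tiled)
```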

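See https://docs.scipy.org/doc/scipy/reference/linalg.html
linalg provides no dot method, so this must be a typo. The author seems to have written this part in a hurry and the code is a mess: one equals sign is enough, and the first assignment to X calls solve directly, which must fail because solve is never defined. A quick gripe; moving on.
Also, no random seed is set, so the correctness of the printed results cannot be verified.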
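
A minimal sketch of what a reproducible version might look like. This is not the book's listing: the matrix size, the seed, and the variable names are my assumptions, and it uses numpy's np.dot and np.linalg.solve (scipy.linalg.solve takes the same two basic arguments).

```python
import numpy as np

# Fixed seed so the random system is the same on every run (seed value is arbitrary).
rng = np.random.RandomState(0)
A = rng.rand(3, 3)   # coefficient matrix
b = rng.rand(3)      # right-hand side

# Solve A x = b; solve must be called through its module, not as a bare name.
x = np.linalg.solve(A, b)

# dot is a top-level numpy function, not a linalg function.
residual = np.dot(A, x) - b
print(x)
print(np.allclose(residual, 0))
```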


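In the sparse matrix section, the book apparently assumes an earlier from numpy import *: some of the functions used are clearly numpy functions, yet the listing never imports the library. Just keep that in mind.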

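from scipy import sparse as sp
import numpy as np

# Dense diagonal matrix and its compressed sparse row (CSR) form
A = np.array([[1,0,0],[0,2,0],[0,0,3]])
C = sp.csr_matrix(A)

print(A)
print(C)                    # the stored non-zero entries
print(C.toarray())          # back to a dense ndarray
print(C * C.todense())      # sparse times dense matrix product
print(C.dot(C).todense())   # np.dot is not sparse-aware; use the matrix's own dot()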

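As for the scipy optimization methods that follow, I should study them properly later; quite a few are the same algorithms found in MATLAB. Worth practicing.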


pandas: a major error in the data reference!
The book uses iris.data from the very start but never gives a link. Here is the real one:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
How careless were the original author, the translator, and the proofreader?
Result of data.describe():

       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

It is because of this data that the sepal_len_cnt printed later produces the following output:

5.0    10
6.3     9
5.1     9
6.7     8
5.7     8
5.5     7
5.8     7
6.4     7
6.0     6
4.9     6
6.1     6
5.4     6
5.6     6
6.5     5
4.8     5
7.7     4
6.9     4
5.2     4
6.2     4
4.6     4
7.2     3
6.8     3
4.4     3
5.9     3
6.6     2
4.7     2
7.6     1
7.4     1
4.3     1
7.9     1
7.3     1
7.0     1
4.5     1
5.3     1
7.1     1
Name: sepal length, dtype: int64

The next one is clearly wrong too: it should be data['Cat'], not data['Iris-setosa'].
The result is:

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Cat, dtype: int64

Result of sntsosa[:5]:

   sepal length  sepal width  petal length  petal width          Cat
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
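
For reference, a sketch of how these iris steps might fit together. iris.data has no header row, so the column names have to be supplied; the names below match the output in these notes. To keep the sketch self-contained it reads a handful of inline rows in the file's format instead of downloading the file, and the variable names cat_counts and setosa are mine, not the book's.

```python
import io
import pandas as pd

# A few rows in the format of iris.data (no header), standing in for the full file.
sample = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""
columns = ["sepal length", "sepal width", "petal length", "petal width", "Cat"]
data = pd.read_csv(io.StringIO(sample), header=None, names=columns)

# Count rows per category: the column is 'Cat', not 'Iris-setosa'.
cat_counts = data["Cat"].value_counts()

# Keep only the setosa rows with a boolean mask.
setosa = data[data["Cat"] == "Iris-setosa"]

print(cat_counts)
print(setosa[:5])
```

With the real file, passing the URL given earlier to pd.read_csv in place of the StringIO object should load all 150 rows.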

Below is the output of the Dow Jones example, excluding the head and resample calls:

1453438639
7.621739999999999
DatetimeIndex(['2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28',
               '2011-02-04', '2011-02-11', '2011-02-18', '2011-02-25',
               '2011-03-04', '2011-03-11', '2011-03-18', '2011-03-25',
               '2011-01-07', '2011-01-14', '2011-01-21', '2011-01-28'],
              dtype='datetime64[ns]', name='date', freq=None)
Int64Index([ 7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4,
            11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4, 11, 18, 25,  4, 11,
            18, 25,  7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21,
            28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28,  4, 11, 18, 25,
             4, 11, 18, 25,  7, 14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7,
            14, 21, 28,  4, 11, 18, 25,  4, 11, 18, 25,  7, 14, 21, 28],
           dtype='int64', name='date')
Int64Index([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3,
            3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2,
            3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2,
            2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,
            2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1],
           dtype='int64', name='date')
Int64Index([2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
            2011],
           dtype='int64', name='date')
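
The three integer indexes above are the day, month, and year of the date index. A small sketch of how such values can be obtained, using a tiny inline table instead of the real file; the column set is an assumption, and newer pandas versions print a plain Index rather than Int64Index.

```python
import io
import pandas as pd

# Three weekly rows standing in for the Dow Jones file (illustrative subset).
csv_text = """date,stock,volume
2011-01-07,AA,239655616
2011-01-14,AA,242963398
2011-02-04,AA,154387761
"""

# Parse the date column and use it as the index, giving a DatetimeIndex.
stockdata = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

days = stockdata.index.day
months = stockdata.index.month
years = stockdata.index.year

print(stockdata.index)
print(days)
print(months)
print(years)
```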

Running resample triggers a deprecation warning: how in .resample() is deprecated
See: http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#resample-api
Change the code to: print(stockdata.resample('M').sum())
This gives the following result:

            quarter      volume  percent_change_price              ...               percent_change_next_weeks_price  days_to_next_dividend  percent_return_next_dividend
date                                                               ...
2011-01-31       36  6779916771             19.637287              ...                                     34.302458                   2618                     18.519712
2011-02-28       32  5713027799             28.553732              ...                                     -4.583387                   1637                     13.819996
2011-03-31       32  5535580114             -7.317345              ...                                      3.263918                   1560                     13.930990

[3 rows x 8 columns]
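
A minimal sketch of the method-chaining style on a few of the AA volume figures from these notes. One assumption to flag: the month-end alias has changed spelling across pandas versions ('M' in older releases, 'ME' preferred from 2.2 on), so this sketch uses 'MS', which labels each bucket by the first day of the month but sums the same rows.

```python
import pandas as pd

# Weekly AA volumes taken from the output shown in these notes.
idx = pd.to_datetime(
    ["2011-01-07", "2011-01-14", "2011-01-21", "2011-01-28", "2011-02-04"]
)
volume = pd.Series(
    [239655616, 242963398, 138428495, 151379173, 154387761],
    index=idx,
    name="volume",
)

# Deprecated style: volume.resample('M', how='sum')
# Current style: resample returns a grouper-like object; call the aggregation on it.
monthly = volume.resample("MS").sum()
print(monthly)
```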

The code after that raises another deprecation warning: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects()

Change those three lines to the following (regex=False makes '$' a literal character rather than the regex end-of-string anchor):

stockdata_new.open = pd.to_numeric(stockdata_new.open.str.replace('$', "", regex=False))
stockdata_new.close = pd.to_numeric(stockdata_new.close.str.replace('$', "", regex=False))
(stockdata_new.close - stockdata_new.open).infer_objects()
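
To see the cleanup on its own, here is a sketch on a two-row frame whose values come from the output in these notes. The frame name prices is mine; regex=False is passed explicitly because the default for that parameter has differed between pandas versions.

```python
import pandas as pd

# Price columns as they arrive from the CSV: strings with a leading dollar sign.
prices = pd.DataFrame(
    {"open": ["$15.82", "$16.71"], "close": ["$16.42", "$15.97"]}
)

for col in ("open", "close"):
    # Strip the literal '$', then convert the remaining text to floats.
    prices[col] = pd.to_numeric(prices[col].str.replace("$", "", regex=False))

change = prices["close"] - prices["open"]
print(prices.dtypes)
print(change)
```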

The last two head calls in this part of the code print:

date
2011-01-07    12.656
2011-01-14    13.368
2011-01-21    12.952
2011-01-28    12.696
2011-02-04    12.944
Name: newopen, dtype: float64

           stock   open    high     low  close     volume  newopen
date
2011-01-07    AA  15.82  $16.72  $15.78  16.42  239655616   12.656
2011-01-14    AA  16.71  $16.71  $15.64  15.97  242963398   13.368
2011-01-21    AA  16.19  $16.38  $15.60  15.79  138428495   12.952
2011-01-28    AA  15.87  $16.63  $15.82  16.13  151379173   12.696
2011-02-04    AA  16.18  $17.39  $16.18  17.14  154387761   12.944
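
In the output above, each newopen value is 0.8 times the open value (for example 15.82 x 0.8 = 12.656). A sketch of one way to produce such a column; the use of apply with a lambda is my reconstruction from the numbers, not a quote from the book.

```python
import pandas as pd

# Opening prices from the table shown in these notes.
opens = pd.Series([15.82, 16.71, 16.19, 15.87, 16.18], name="open")

# Hypothetical reconstruction: scale every opening price by 0.8.
newopen = opens.apply(lambda x: 0.8 * x)

print(newopen.round(3).tolist())
```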

I have no idea why the following four lines are there; just delete them.

plt.subplot(2, 2, 3)
plt.plot(x, y, 'g--')
plt.subplot(2, 2, 4)
plt.plot(x, y, 'r-*')

This chapter produces four figures:
[four figures; images not included]

The two extra code snippets at the end produce these two figures:
[two figures; images not included]
