April 2020_木尧大兄弟


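[Original] Scrapy crawlers: downloader middleware (anti-blocking: random request headers, IP proxy pool)

1. Configuring random request headers in a downloader middleware. A downloader middleware implements two methods: process_request and process_response. A site that returns the request header of the current browser: http://httpbin.org/user-agent. Request headers for every browser in the world: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser...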

2020-04-30 15:34:22 581 1
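
The summary of the entry above is truncated, so the author's own middleware is not visible here. As a minimal sketch of the idea it describes, a downloader middleware can be a plain class whose process_request sets a random User-Agent on each outgoing request. The class name RandomUserAgentMiddleware, the USER_AGENTS pool, and the SimpleNamespace stand-in for a Scrapy request are all assumptions made for illustration.

```python
import random
from types import SimpleNamespace

# Hypothetical sample pool; a real project would likely load a longer list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
]


class RandomUserAgentMiddleware:
    """Sketch of a downloader middleware that randomizes the User-Agent."""

    def process_request(self, request, spider):
        # Called for each outgoing request; overwrite the User-Agent header.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue handling the request normally.
        return None

    def process_response(self, request, response, spider):
        # Pass the response through unchanged.
        return response


if __name__ == "__main__":
    # Stand-in for a Scrapy request, used only to exercise the class here.
    fake_request = SimpleNamespace(headers={})
    RandomUserAgentMiddleware().process_request(fake_request, spider=None)
    print(fake_request.headers["User-Agent"])
```

In a real project the class would be registered in the DOWNLOADER_MIDDLEWARES setting in settings.py, with a module path that depends on the project layout. Requesting http://httpbin.org/user-agent is one way to check which header the site actually received.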

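[Original] Python: using os.path to get the current absolute path and the parent path, check whether a folder exists, and create a folder

import os
# Absolute path of the folder containing the current file
my_path = os.path.dirname(__file__)
print(my_path) # D:/pycharm_profession/Projects-professional/PaperScrapy/wxapp
# Absolute path of the parent of the folder containing the current file
my_path = os.path.dirname(os.path...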

2020-04-30 09:36:12 3594
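
The snippet in the entry above is cut off before the existence check and folder creation named in its title. The following is a rough sketch of all four operations, not the author's code. It wraps __file__ in os.path.abspath (a small variation on the original) so the result is absolute regardless of how the script is launched; the helper name ensure_dir and the folder name "output" are hypothetical.

```python
import os
import tempfile

# Absolute path of the folder containing this file.
# __file__ is undefined in an interactive session, so fall back to the cwd.
try:
    current_dir = os.path.dirname(os.path.abspath(__file__))
except NameError:
    current_dir = os.getcwd()

# Absolute path of the folder one level up.
parent_dir = os.path.dirname(current_dir)


def ensure_dir(path):
    """Create the folder if it is missing; return True if it was created."""
    if os.path.exists(path):
        return False
    os.makedirs(path)
    return True


if __name__ == "__main__":
    # Work inside a temporary folder so the demo leaves the project untouched.
    demo_dir = os.path.join(tempfile.mkdtemp(), "output")
    print(current_dir)
    print(parent_dir)
    print(ensure_dir(demo_dir))  # folder did not exist, so it is created
    print(ensure_dir(demo_dir))  # second call finds the existing folder
```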

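[Original] Python: using map and a lambda expression to operate on every element of a list at once

The following code prepends "https://" to each url in urls, returning a map object that is then converted to a list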

2020-04-30 09:17:57 2041
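
The code that the entry above refers to is not included in this listing. A minimal sketch of the technique it describes could look like this; the sample hostnames are placeholders.

```python
# Placeholder inputs; any list of scheme-less URLs works the same way.
urls = ["example.com", "example.org/page"]

# map applies the lambda to every element lazily; list() materializes it.
full_urls = list(map(lambda url: "https://" + url, urls))

print(full_urls)  # ['https://example.com', 'https://example.org/page']
```

A list comprehension, ["https://" + url for url in urls], gives the same result and is often considered more readable in Python.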

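[Original] Scrapy crawlers: scrapy shell and the Request and Response objects

From inside the crawler project, run scrapy shell followed by a URL (running it outside the project also works, but the project's settings will not be loaded), then try out response.xpath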

2020-04-29 21:15:56 414

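[Original] Scrapy crawlers: CrawlSpider (a spider that inherits from CrawlSpider can discover links automatically)

After creating the project, create the spider class with the following command: scrapy genspider -t crawl wxapp-union wxapp-union.com. The spider inherits from CrawlSpider; compared with the basic class, it adds rules and LinkExtractor. [Tip] To enable pipelines, uncomment the relevant block in settings.py (the one that sets the pipeline priority). from scrapy.linke...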

2020-04-29 20:56:02 414

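[Original] Scrapy crawlers: items

Previously a dict was used to pass data from the spider to the pipelines, which looks unprofessional, so items are used instead (similar to Django, where the data fields are defined up front). First, define the fields in items.py:
import scrapy
class XicispiderItem(scrapy.Item):
    # The data model; somewhat like defining a database model in Django
    # define the fields for your item her...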

2020-04-29 16:27:38 2398

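[Original] Scrapy crawlers: pipelines and exporting to a JSON file

The spider packages the data as a dict and yields it:
# -*- coding: utf-8 -*-
import scrapy
# Create the spider class, inheriting from scrapy.Spider, the most basic spider class; basic, crawl, csvfeed, and xmlfeed all inherit from it
class XicidailiSpider(scrapy.Spider):
    name = 'xicidaili' # ...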

2020-04-29 16:13:10 888
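
The pipeline half of the entry above is cut off in this listing. As a sketch under stated assumptions, an item pipeline that writes one JSON object per line can be a plain class with the open_spider, process_item, and close_spider hooks that Scrapy calls. The class name JsonLinesPipeline, the default file name items.json, and the sample fields ip and port are hypothetical and not taken from the post.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Sketch of an item pipeline that appends each item as a JSON line."""

    def __init__(self, path="items.json"):
        # Hypothetical default output path; Scrapy passes no arguments here.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item instances.
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so later pipelines can still process it.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


if __name__ == "__main__":
    # Exercise the pipeline by hand, outside of Scrapy.
    out_path = os.path.join(tempfile.mkdtemp(), "items.json")
    pipeline = JsonLinesPipeline(out_path)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"ip": "127.0.0.1", "port": "8080"}, spider=None)
    pipeline.close_spider(spider=None)
    with open(out_path, encoding="utf-8") as f:
        print(f.read())
```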

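[Original] Scrapy crawlers: how it works and a simple hands-on example

Installing scrapy
pip install scrapy
Run scrapy and scrapy bench in cmd to verify the installation
How it works
engine: the engine, the core brain
spiders: where the crawl logic is written; they extract data (items) or requests; requests are handed to the scheduler and data is handed to the pipelines
scheduler: the scheduler (a priority queue of URLs that can deduplicate)
downloader: downloads the web pages
item pipelines: process the crawled items and persist the data
a...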

2020-04-28 18:52:04 288
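
The entry above describes the scheduler as a priority queue of URLs with deduplication. The toy class below illustrates only that idea, using heapq and a set. It is not Scrapy's actual scheduler; the names ToyScheduler, enqueue, and next_url are made up for this sketch.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy priority queue of URLs that drops duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        # Monotonic counter so equal priorities come out in insertion order.
        self._counter = itertools.count()

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate priority to pop higher values first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_url(self):
        """Return the highest-priority URL, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    scheduler = ToyScheduler()
    scheduler.enqueue("https://example.com/list")
    scheduler.enqueue("https://example.com/detail", priority=5)
    scheduler.enqueue("https://example.com/list")  # duplicate, ignored
    print(scheduler.next_url())  # https://example.com/detail
    print(scheduler.next_url())  # https://example.com/list
    print(scheduler.next_url())  # None
```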

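Text summarization: the original CNN/DailyMail dataset. The archive contains cnn_stories.tgz and dailymail_stories.tgz. It can be used for extractive summarization and abstractive summarization tasks, and is shared to make the dataset easier for researchers in China to obtain. For technical details, see the blog post: https://blog.csdn.net/muyao987/article/details/104949367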

2022-04-15

[PDF] Neural Network Methods in Natural Language Processing (original English edition)

Neural networks are a family of powerful machine learning models. This book focuses on the application of neural network models to natural language data. The first half of the book (Parts I and II) covers the basics of supervised machine learning and feed-forward neural networks, the basics of working with machine learning over language data, and the use of vector-based rather than symbolic representations for words. It also covers the computation-graph abstraction, which allows to easily define and train arbitrary neural networks, and is the basis behind the design of contemporary neural network software libraries. The second part of the book (Parts III and IV) introduces more specialized neural network architectures, including 1D convolutional neural networks, recurrent neural networks, conditioned-generation models, and attention-based models. These architectures and techniques are the driving force behind state-of-the-art algorithms for machine translation, syntactic parsing, and many other applications. Finally, we also discuss tree-shaped networks, structured prediction, and the prospects of multi-task learning.

2018-11-23

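Hillary Clinton's Emails for natural language processing

Hillary Clinton's emails: nearly 7,000 pages of Clinton's emails, compiled as a corpus for machine learning and natural language processing.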

2018-07-19

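MFC类库详解.chm (a detailed guide to the MFC class library)

A detailed guide to the MFC class library that I used often when building an aircraft shooter game project. It is quite good and helpful for MFC programming in VS.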

2015-08-02
