December 2017_高翔Sean

Original: Crawling Zhihu Daily with Scrapy and saving the articles as PDF

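Goal: D:/知乎日报 contains two folders. latest holds the most recently crawled articles, and past holds articles crawled earlier. On the next crawl, an article that has already been crawled is not downloaded again; a new article is stored in latest, and the articles that were previously in latest are moved to past.
Libraries used: scrapy (required), pdfkit (for HTML-to-PDF conversion), os and shutil (file handling).
First, at http://daily.zh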
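The latest/past bookkeeping described above can be sketched with os and shutil alone. This is a minimal sketch, not the post's actual pipeline: the function name `handle_article`, the `save` callback, and the return labels are hypothetical. In a real crawl the callback would do the actual download, for example with pdfkit.

```python
import os
import shutil


def handle_article(root, filename, save):
    """Decide what to do with one article file.

    `root` is the base folder (D:/知乎日报 in the post), `filename` is the
    article's file name, and `save` is a hypothetical callback that writes
    the article to the path it receives.
    """
    latest_path = os.path.join(root, "latest", filename)
    past_path = os.path.join(root, "past", filename)
    os.makedirs(os.path.dirname(latest_path), exist_ok=True)
    os.makedirs(os.path.dirname(past_path), exist_ok=True)

    if os.path.exists(latest_path):
        # Crawled on an earlier run and still in latest: archive it.
        shutil.move(latest_path, past_path)
        return "archived"
    if os.path.exists(past_path):
        # Already archived: nothing to download.
        return "skipped"
    # Never seen before: store it in latest.
    save(latest_path)
    return "saved"
```

Calling `handle_article` three times for the same file name would, under these assumptions, save it, then archive it, then skip it.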

2017-12-24 17:13:15 2289

Original: Crawling Zhihu Daily with scrapy -- pipelines

When the article does not exist, save it to D:/知乎/latest; when the file already exists and is in latest, move it to past.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.s

2017-12-24 16:45:37 365

Original: Moving files with Python shutil.move

https://docs.python.org/3.6/library/shutil.html
shutil can copy and move files.
# Copy a file:
shutil.copyfile("oldfile", "newfile")  # oldfile and newfile must both be files
shutil.copy("oldfile", "newfile")  # oldfile must be a file; newfile can be a file, or
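As a rough illustration of the three calls, the sketch below copies and moves a small file inside a working directory. The function name `demo_shutil` and the file and folder names are made up for the example.

```python
import os
import shutil


def demo_shutil(workdir):
    """Copy and move a small text file inside `workdir`."""
    source = os.path.join(workdir, "oldfile.txt")
    with open(source, "w") as f:
        f.write("hello")

    # copyfile: source and destination must both be file paths.
    plain_copy = os.path.join(workdir, "newfile.txt")
    shutil.copyfile(source, plain_copy)

    # copy: the destination may be a directory; the file keeps its name.
    backup_dir = os.path.join(workdir, "backup")
    os.makedirs(backup_dir)
    copied_into_dir = shutil.copy(source, backup_dir)

    # move: the file disappears from its old location.
    moved = shutil.move(plain_copy, os.path.join(backup_dir, "moved.txt"))
    return copied_into_dir, moved
```

Both `shutil.copy` and `shutil.move` return the destination path, which is why the sketch can hand the results straight back to the caller.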

2017-12-24 15:52:48 141326 4

Repost: pdfkit error: Exit with code 1 due to network error: ContentNotFoundError

try:
    if not os.path.exists(item["filename"]):
        pdfkit.from_url(item["url"], item["filename"])
    else:
        print("File already exists")
except:
    # an Exit with code 1 here

2017-12-24 14:26:32 13264 5

Original: scrapy gets a 500 response when crawling Zhihu

When crawling Zhihu Daily with scrapy, the server always returns 500.
# -*- coding: utf-8 -*-
import scrapy
#import pdfkit
from zhihudaily.items import ZhihudailyItem

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['da

2017-12-24 10:41:47 2421

Original: CSS study notes 1

/* CSS syntax */
/* Element selector */
selector {property1: value1; property2: value2}
/* Set the h1 text to red and the font size to 14px */
h1 {color: red; font-size: 14px}
/* Improve readability */
body {
  color: #000;
  background: #fff;
  margin: 0;
  padd

2017-12-23 15:17:38 215

Original: Writing a JSON file in a pipeline

# write items to json file
import json

class JsonwritePipeline(object):
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        with open("filename.json", "a") as f:
            f.write(l
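A fuller version of the same idea, sketched under a few assumptions: the class name `JsonLinesPipeline` and the default path `items.jl` are hypothetical, and the file is opened once per crawl through Scrapy's `open_spider` and `close_spider` pipeline hooks instead of once per item. The class has no Scrapy import because a pipeline is a plain Python class.

```python
import json


class JsonLinesPipeline:
    """Write each item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # `path` is an assumed default; Scrapy builds the pipeline
        # without arguments unless from_crawler is defined.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        # Return the item so later pipelines still receive it.
        return item
```

To try it in a project, the class would still need an entry in the ITEM_PIPELINES setting, as the generated pipelines.py comment reminds.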

2017-12-19 14:51:19 2927

Original: Making scrapy crawl the same page more than once

Request(url, dont_filter = True)

2017-12-19 14:44:01 6255

Original: scrapy shell can be used to test XPath expressions against a response

1. Install IPython
2. Set SCRAPY_PYTHON_SHELL in scrapy.cfg:
[settings]
shell = ipython
3. scrapy shell
4. The shell can also be used on local files: scrapy shell ./path/to/file.html
5. scrapy shell "http://scrapy.org" --nolog
6.

2017-12-19 14:43:04 667

Original: Getting the current time in Python

import datetime as d
print(d.datetime.now().strftime("%Y.%m.%d-%H:%M:%S"))
# 2017.12.17-09:51:07
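The same format string also works in the other direction. The sketch below wraps it in two helpers; the names `FORMAT`, `format_timestamp`, and `parse_timestamp` are illustrative and do not come from the post.

```python
import datetime as d

# Same pattern as in the post: year.month.day-hour:minute:second
FORMAT = "%Y.%m.%d-%H:%M:%S"


def format_timestamp(moment):
    """Render a datetime using the post's format string."""
    return moment.strftime(FORMAT)


def parse_timestamp(text):
    """Turn a string in that format back into a datetime."""
    return d.datetime.strptime(text, FORMAT)


fixed = d.datetime(2017, 12, 17, 9, 51, 7)
print(format_timestamp(fixed))  # 2017.12.17-09:51:07
```

Using a fixed datetime instead of `now()` keeps the printed value reproducible.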

2017-12-17 09:52:02 8357

Original: response.xpath("//li[@class='next']/a/@href") is not None

if response.xpath("//li[@class='next']/a/@href") is not None:
    next_page = response.xpath("//li[@class='next']/a/@href").extract()[0]
    yield scrapy.Request('http://quotes.toscr
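One thing to watch with this check: response.xpath() returns a selector list, and a list is never None even when nothing matched, so the `is not None` test always passes and `extract()[0]` can then raise IndexError on the last page. The sketch below shows the same pitfall with the standard library's ElementTree as a stand-in for Scrapy (an analogy only; the markup and the function name `next_page_href` are made up). In Scrapy itself, `extract_first()` returns None when there is no match, which makes that comparison meaningful.

```python
import xml.etree.ElementTree as ET

# Two tiny hypothetical pages: one with a "next" link, one without.
PAGE_WITH_NEXT = "<ul><li class='next'><a href='/page/2/'>Next</a></li></ul>"
PAGE_WITHOUT_NEXT = "<ul><li class='prev'><a href='/page/1/'>Prev</a></li></ul>"


def next_page_href(markup):
    """Return the next-page href, or None when there is no next link."""
    root = ET.fromstring(markup)
    links = root.findall(".//li[@class='next']/a")
    # `links is not None` would be True even for an empty list,
    # so test for emptiness instead.
    if not links:
        return None
    return links[0].get("href")
```

With these inputs, `next_page_href(PAGE_WITH_NEXT)` gives `/page/2/` and `next_page_href(PAGE_WITHOUT_NEXT)` gives None.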

2017-12-15 20:59:55 4023

Original: XPath study notes

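1. The Chrome extension xpath helper helps with inspecting XPath expressions
2. // selects matching nodes anywhere below the current node
3. //book[1] the first book element
//book[last()-1] the second-to-last book element
//book[position()
4. //a[@id] a elements that have an id attribute
//a[@id="next" and @class="bank"] id attribute is next and cla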
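Several of these expressions can be tried without a browser, using the standard library's ElementTree. This is only a sketch: the sample document and the helper `select_texts` are invented for illustration, paths start with `.//` because ElementTree searches relative to an element, and ElementTree implements only a subset of XPath, so the combined attribute test is written as two chained predicates here.

```python
import xml.etree.ElementTree as ET

# A small made-up document to run the expressions against.
CATALOG = """
<catalog>
  <book>A</book>
  <book>B</book>
  <book>C</book>
  <a id="next" class="bank" href="/2">next</a>
  <a href="/about">about</a>
</catalog>
""".strip()


def select_texts(path):
    """Return the text of every element matched by `path`."""
    root = ET.fromstring(CATALOG)
    return [element.text for element in root.findall(path)]


print(select_texts(".//book[1]"))         # first book
print(select_texts(".//book[last()-1]"))  # second-to-last book
print(select_texts(".//a[@id]"))          # a elements that have an id
print(select_texts(".//a[@id='next'][@class='bank']"))  # both attributes
```

A full XPath engine such as the one behind Scrapy's selectors accepts the `and` form from the notes directly.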

2017-12-12 20:05:34 494

Sean's blog