For a basic introduction to Scrapy, see here.
I. XPath

1. XPath basics

- XPath uses path expressions to navigate XML and HTML documents
- XPath includes a standard function library
- XPath is a W3C standard

XPath node relationships: parent, child, sibling, ancestor, and descendant nodes.
2. XPath syntax
Expression | Description |
---|---|
article | Selects all child nodes of all article elements |
/article | Selects the root element article |
article/a | Selects all a elements that are children of article |
//div | Selects all div elements, wherever they appear in the document |
article//div | Selects all div elements that are descendants of article, at any depth below it |
//@class | Selects all attributes named class |
/article/div[1] | Selects the first div child of article |
/article/div[last()] | Selects the last div child of article |
/article/div[last()-1] | Selects the second-to-last div child of article |
//div[@lang] | Selects all div elements that have a lang attribute |
//div[@lang='eng'] | Selects all div elements whose lang attribute is eng |
/div/* | Selects all child nodes of the div element |
//* | Selects all elements |
//div[@*] | Selects all div elements that have any attribute |
/div/a \| //div/p | Selects all a and all p elements under div elements |
//span \| //ul | Selects all span and all ul elements in the document |
article/div/p \| //span | Selects all p elements inside div children of article, plus all span elements |
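As a quick check of the syntax above, the expressions can be tried against an inline snippet with Scrapy's `Selector` (a minimal sketch; the HTML here is made up for illustration):

```python
from scrapy.selector import Selector

html = """
<article>
  <div lang="eng"><a href="/post/1">first</a></div>
  <div><p>second</p></div>
</article>
"""
sel = Selector(text=html)

# //div[@lang='eng'] -> all div elements with lang="eng"
print(sel.xpath("//div[@lang='eng']/a/text()").extract())     # ['first']
# div[last()] -> the last div child of article
print(sel.xpath("//article/div[last()]/p/text()").extract())  # ['second']
```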
II. CSS selectors
Expression | Description |
---|---|
* | Selects all nodes |
#container | Selects the node with id container |
.container | Selects all nodes whose class contains container |
li a | Selects all a nodes inside li nodes |
ul + p | Selects the first p element immediately following a ul (sibling) |
div#container > ul | Selects ul elements that are direct children of the div with id container |
ul ~ p | Selects all p elements that are siblings following a ul |
a[title] | Selects all a elements that have a title attribute |
a[href="http://jobbole.com"] | Selects all a elements whose href equals http://jobbole.com |
a[href*="jobbole"] | Selects all a elements whose href contains jobbole |
a[href^="http"] | Selects all a elements whose href starts with http |
a[href$=".jpg"] | Selects all a elements whose href ends with .jpg |
input[type=radio]:checked | Selects the radio inputs that are checked |
div:not(#container) | Selects all div elements whose id is not container |
li:nth-child(3) | Selects every li element that is the third child of its parent |
tr:nth-child(2n) | Selects every even-numbered tr |
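The CSS selectors can be tested the same way (again a sketch with made-up HTML):

```python
from scrapy.selector import Selector

html = """
<div id="container">
  <ul><li><a href="pic.jpg" title="t">link</a></li></ul>
  <p>after</p>
</div>
"""
sel = Selector(text=html)

print(sel.css("#container ul li a::attr(href)").extract_first())  # 'pic.jpg'
print(sel.css("a[href$='.jpg']::text").extract())                 # ['link']
print(sel.css("ul + p::text").extract())                          # ['after']
```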
III. Crawling Jobbole: the basics
The usual steps for a crawler:

- Create a project (`scrapy startproject xxx`): start a new crawler project
- Define the targets (edit `items.py`): define the structured data to extract
- Build the spider (`spiders/xxspider.py`): write the spider that crawls the pages and extracts the structured data
- Store the content (`pipelines.py`): design pipelines to store the scraped content
Target task: crawl all technical posts on 伯乐在线 (jobbole). The fields to extract are: title, creation time, URL, URL id, cover image URL, cover image path, bookmark count, upvote count, comment count, full text, and tags.
1. Create the Scrapy project
```
scrapy startproject Article
cd Article
```
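These commands generate the standard Scrapy project skeleton:

```
Article/
├── scrapy.cfg          # deployment configuration
└── Article/
    ├── __init__.py
    ├── items.py        # item definitions
    ├── middlewares.py
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/
        └── __init__.py
```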
2. Write items.py

Define the fields according to what needs to be scraped: title, creation time, URL, URL id, cover image URL, cover image path, bookmark count, upvote count, comment count, full text, and tags.
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticleItem(scrapy.Item):
    title = scrapy.Field()             # title
    time = scrapy.Field()              # creation time
    url = scrapy.Field()               # URL
    url_object_id = scrapy.Field()     # URL id (MD5 of the URL)
    front_image_url = scrapy.Field()   # cover image URL
    front_image_path = scrapy.Field()  # cover image path
    coll_nums = scrapy.Field()         # bookmark count
    comment_nums = scrapy.Field()      # comment count
    fav_nums = scrapy.Field()          # upvote count
    content = scrapy.Field()           # full text
    tags = scrapy.Field()              # tags
```
3. Write the spider

Create a basic spider with the genspider command:

```
scrapy genspider jobbole "python.jobbole.com"
```

Here `jobbole` is the spider name and `python.jobbole.com` is the domain the spider is restricted to. The command creates a `jobbole.py` file in the `Article/spiders` folder. It is edited below; the detail page is parsed twice, once with the XPath API and once with the CSS-selector API.
```python
# -*- coding: utf-8 -*-
import datetime
import re
from urllib import parse

import scrapy
from scrapy.http import Request

from Article.items import ArticleItem
from Article.utils.common import get_md5


class JobboleSpider(scrapy.Spider):
    name = "jobbole"
    allowed_domains = ["python.jobbole.com"]
    start_urls = ['http://python.jobbole.com/all-posts/']

    def parse(self, response):
        """
        1. Extract the article URLs on the list page and hand them to scrapy for download and parsing.
        2. Extract the next-page URL and hand it to scrapy; once downloaded it is parsed by parse again.
        """
        # Extract every article URL on the list page
        post_nodes = response.css("#archive .floated-thumb .post-thumb a")
        for post_node in post_nodes:
            image_url = post_node.css("img::attr(src)").extract_first("")
            post_url = post_node.css("::attr(href)").extract_first("")
            yield Request(url=parse.urljoin(response.url, post_url),
                          meta={"front_image_url": image_url},
                          callback=self.parse_detail_xpath)
        # Extract the next page and hand it to scrapy
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail_xpath(self, response):
        article_item = ArticleItem()
        # Extract the individual fields with XPath
        front_image_url = response.meta.get("front_image_url", "")
        title = response.xpath('//div[@class="entry-header"]/h1/text()').extract()[0]
        time = response.xpath('//div[@class="entry-meta"]/p/text()').extract()[0].strip().replace("·", "").strip()
        fav_nums = response.xpath('//div[@class="post-adds"]/span[1]/h10/text()').extract()[0]
        coll_nums = response.xpath('//div[@class="post-adds"]/span[2]/text()').extract()[0]
        match_re = re.match(r".*?(\d+).*", coll_nums)
        if match_re:
            coll_nums = match_re.group(1)
        else:
            coll_nums = 0
        comment_nums = response.xpath('//div[@class="post-adds"]/a[@href="#article-comment"]/span/text()').extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = match_re.group(1)
        else:
            comment_nums = 0
        content = response.xpath('//div[@class="entry"]').extract()[0]
        tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
        # drop the "N 评论" (comments) link that is extracted along with the tags
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
        tags = ",".join(tag_list)

        article_item['title'] = title
        article_item['url'] = response.url
        article_item['url_object_id'] = get_md5(response.url)
        try:
            time = datetime.datetime.strptime(time, '%Y/%m/%d').date()
        except Exception:
            time = datetime.datetime.now().date()
        article_item['time'] = time
        article_item['front_image_url'] = [front_image_url]
        article_item['fav_nums'] = fav_nums
        article_item['coll_nums'] = coll_nums
        article_item['comment_nums'] = comment_nums
        article_item['tags'] = tags
        article_item['content'] = content
        yield article_item

    def parse_detail_css(self, response):
        article_item = ArticleItem()
        # Extract the same fields with CSS selectors
        front_image_url = response.meta.get("front_image_url", "")  # cover image
        title = response.css(".entry-header h1::text").extract()[0]
        time = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·", "").strip()
        fav_nums = response.css(".vote-post-up h10::text").extract()[0]
        coll_nums = response.css(".bookmark-btn::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", coll_nums)
        if match_re:
            coll_nums = int(match_re.group(1))
        else:
            coll_nums = 0
        comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        content = response.css("div.entry").extract()[0]
        tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
        tags = ",".join(tag_list)

        article_item["url_object_id"] = get_md5(response.url)
        article_item["title"] = title
        article_item["url"] = response.url
        try:
            time = datetime.datetime.strptime(time, "%Y/%m/%d").date()
        except Exception:
            time = datetime.datetime.now().date()
        article_item["time"] = time
        article_item["front_image_url"] = [front_image_url]
        article_item["coll_nums"] = coll_nums
        article_item["comment_nums"] = comment_nums
        article_item["fav_nums"] = fav_nums
        article_item["tags"] = tags
        article_item["content"] = content
        yield article_item
```
Create `utils/common.py` under the `Article` package directory (with an empty `__init__.py` in `utils` so it can be imported) to hold shared helper functions.
```python
# -*- coding: utf-8 -*-
import hashlib


def get_md5(url):
    """Return the 32-character hex MD5 digest of a URL."""
    if isinstance(url, str):
        url = url.encode('utf-8')
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()
```
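A quick self-test can be appended to common.py (run `python common.py` directly) to confirm that str and bytes input produce the same digest:

```python
if __name__ == '__main__':
    # str and bytes input both work and yield the same 32-character hex digest
    print(get_md5("http://python.jobbole.com/all-posts/"))
    print(get_md5(b"http://python.jobbole.com/all-posts/"))
```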
4. Write pipelines.py: saving to a JSON file

There are two ways to save the items as JSON:

- with the `json` module
- with Scrapy's built-in `JsonItemExporter`
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import codecs
import json

from scrapy.exporters import JsonItemExporter


class ArticlePipeline(object):
    def process_item(self, item, spider):
        return item


# Save to a JSON file with the json module
class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Avoid `TypeError: Object of type 'date' is not JSON serializable`
        item["time"] = str(item["time"])
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()


# Save to a JSON file with Scrapy's built-in exporter: JsonItemExporter
class JsonExporterPipeline(object):
    def __init__(self):
        self.file = open('articleExport.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```
5. Configure settings.py

`ITEM_PIPELINES` sets the priority of each pipeline class in pipelines.py; the smaller the number, the higher the priority. Comment out `'Article.pipelines.JsonWithEncodingPipeline'` or `'Article.pipelines.JsonExporterPipeline'` to switch between the two JSON-saving approaches.
```python
# Set the default request headers
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Enable the item pipelines
ITEM_PIPELINES = {
    # 'Article.pipelines.ArticlePipeline': 300,
    'Article.pipelines.JsonWithEncodingPipeline': 2,
    # 'Article.pipelines.JsonExporterPipeline': 2,
}
```
6. Run the crawler

```
scrapy crawl jobbole
```

Error: `TypeError: Object of type 'date' is not JSON serializable`

Fix: `item["time"]` is a `date` object and must be converted to a string before JSON serialization, as done in the pipeline above: `item["time"] = str(item["time"])`
IV. Crawling Jobbole: advanced
1. The Item Loader mechanism

In the previous section, the parsing of the fields defined in items.py was written directly in the spider, which does not port well. The Item Loader mechanism provides a convenient way to populate scraped Items: although Items can be filled through their built-in dict-like API, Item Loaders offer a more convenient API that parses the raw data and assigns it to the Item.
(1) Approach

Reference: 爬虫 Scrapy 学习系列之七:Item Loaders

- Load the Item through an Item Loader (in the spider file). The three main Item Loader methods are `add_css()`, `add_xpath()`, and `add_value()`:

```python
from scrapy.loader import ItemLoader

# ArticleItem() is the item class declared in items.py; response is the response being parsed.
item_loader = ItemLoader(item=ArticleItem(), response=response)
item_loader.add_css("title", ".entry-header h1::text")
item_loader.add_value("url", response.url)
...

# Resolve the collected values; each field is gathered as a list and stored into article_item.
article_item = item_loader.load_item()
```

- Process the data in items.py: import processors such as `MapCompose`, `TakeFirst`, and `Join` from `scrapy.loader.processors`. Processor functions can be attached inside `scrapy.Field`, and custom processors can be defined as well (a short demo follows this list).
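The semantics of these processors are easy to verify outside Scrapy (a minimal sketch: MapCompose applies each function in turn to every value in the list, TakeFirst returns the first non-empty value, Join concatenates):

```python
from scrapy.loader.processors import MapCompose, TakeFirst, Join

values = [" python ", " scrapy "]
print(MapCompose(str.strip, str.upper)(values))  # ['PYTHON', 'SCRAPY']
print(TakeFirst()(["", "first", "second"]))      # 'first' (empty strings are skipped)
print(Join(",")(["a", "b", "c"]))                # 'a,b,c'
```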
(2) spider.py

The changed part of the spider.py file:
```python
from scrapy.loader import ItemLoader

from Article.items import ArticleItem, ArticleItemLoader


class JobboleSpider(scrapy.Spider):
    """
    Only the changed parts are shown; the unchanged parts are omitted.
    """
    def parse_detail(self, response):
        front_image_url = response.meta.get("front_image_url", "")  # cover image
        # ArticleItemLoader is the custom loader defined in items.py below
        item_loader = ArticleItemLoader(item=ArticleItem(), response=response)
        item_loader.add_css("title", ".entry-header h1::text")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("time", "p.entry-meta-hide-on-mobile::text")
        item_loader.add_value("front_image_url", [front_image_url])
        item_loader.add_css("fav_nums", ".vote-post-up h10::text")
        item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
        item_loader.add_css("coll_nums", ".bookmark-btn::text")
        item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
        item_loader.add_css("content", "div.entry")

        article_item = item_loader.load_item()
        yield article_item
```
(3) items.py

Define the processing functions, then use the `input_processor` and `output_processor` parameters to process field values on the way in and on the way out.
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import datetime
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join


def date_convert(value):
    # Parse the creation date; fall back to today on failure
    try:
        time = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception:
        time = datetime.datetime.now().date()
    return time


def get_nums(value):
    # Extract the first integer from strings such as "2 收藏"
    match_re = re.match(r".*?(\d+).*", value)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0
    return nums


def return_value(value):
    # Identity processor: overrides the default TakeFirst so the value stays a list
    return value


def remove_comment_tags(value):
    # Drop the "N 评论" (comments) pseudo-tag extracted alongside the real tags
    if "评论" in value:
        return ""
    else:
        return value


class ArticleItemLoader(ItemLoader):
    # Custom ItemLoader: by default keep only the first extracted value
    default_output_processor = TakeFirst()


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    time = scrapy.Field(input_processor=MapCompose(date_convert))
    url = scrapy.Field()
    url_object_id = scrapy.Field()  # MD5 of the url
    front_image_url = scrapy.Field(output_processor=MapCompose(return_value))
    front_image_path = scrapy.Field()
    fav_nums = scrapy.Field(input_processor=MapCompose(get_nums))
    coll_nums = scrapy.Field(input_processor=MapCompose(get_nums))
    comment_nums = scrapy.Field(input_processor=MapCompose(get_nums))
    content = scrapy.Field()
    tags = scrapy.Field(input_processor=MapCompose(remove_comment_tags),
                        output_processor=Join(","))

    def get_insert_sql(self):
        insert_sql = """
            insert into article(title, time, url, url_object_id, front_image_url,
                front_image_path, coll_nums, comment_nums, fav_nums, content, tags)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE content=VALUES(content), coll_nums=VALUES(coll_nums),
                comment_nums=VALUES(comment_nums), fav_nums=VALUES(fav_nums)
        """
        front_image_url = ""
        if self["front_image_url"]:
            front_image_url = self["front_image_url"][0]
        params = (self["title"], self["time"], self["url"], self["url_object_id"],
                  front_image_url, self["front_image_path"], self["coll_nums"],
                  self["comment_nums"], self["fav_nums"], self["content"], self["tags"])
        return insert_sql, params
```
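With the SQL now living on the item, a pipeline's insert step can stay generic. A sketch of a `do_insert` that delegates to `get_insert_sql` (the hard-coded variant appears in the pipelines section below):

```python
def do_insert(self, cursor, item):
    # Build the statement from the item itself instead of hard-coding it,
    # so one pipeline can handle several item types.
    insert_sql, params = item.get_insert_sql()
    cursor.execute(insert_sql, params)
```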
2. The pipelines file

(1) Environment setup (MySQL, Navicat)

Install the environment first: MySQL and Navicat on Ubuntu 18.04.

```
# the pipelines below use the pymysql driver
pip install pymysql
```

The table definition is shown below:
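The original post showed the schema as a screenshot; a definition consistent with the INSERT statements in this section would look roughly like this (a sketch, with assumed column types and sizes):

```sql
CREATE TABLE article (
    url_object_id    VARCHAR(32)  NOT NULL PRIMARY KEY,  -- MD5 of the url
    title            VARCHAR(200) NOT NULL,
    time             DATE,
    url              VARCHAR(300),
    front_image_url  VARCHAR(300),
    front_image_path VARCHAR(200),
    coll_nums        INT NOT NULL DEFAULT 0,
    comment_nums     INT NOT NULL DEFAULT 0,
    fav_nums         INT NOT NULL DEFAULT 0,
    content          LONGTEXT,
    tags             VARCHAR(200)
) DEFAULT CHARSET=utf8mb4;
```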
(2) Save to MySQL (synchronous)
```python
import pymysql
import pymysql.cursors


class MysqlPipeline(object):
    # Write to MySQL synchronously
    def __init__(self):
        # self.conn = pymysql.connect('host', 'user', 'password', 'dbname', charset='utf8', use_unicode=True)
        self.conn = pymysql.connect(host='localhost', user='root', password='asdfjkl;',
                                    db='article', charset="utf8mb4", use_unicode=True)
        self.cursor = self.conn.cursor()
        # Converting the table to utf8mb4 only needs to happen once, not per item
        self.cursor.execute("alter table article convert to character set utf8mb4;")

    def process_item(self, item, spider):
        insert_sql = """
            insert into article(title, url, url_object_id, time, coll_nums, comment_nums, fav_nums, content)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql,
                            (pymysql.escape_string(item["title"]),
                             item["url"],
                             item["url_object_id"],
                             item["time"],
                             item["coll_nums"],
                             item["comment_nums"],
                             item["fav_nums"],
                             pymysql.escape_string(item["content"])))
        self.conn.commit()
        return item
```
(3) Save to MySQL (asynchronous)

With large crawls, pages are scraped faster than the database can write them, so large projects generally store the data asynchronously.
```python
import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    # Write to MySQL asynchronously through Twisted's connection pool
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # Read the connection parameters from settings.py
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset="utf8mb4",
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparams)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted to turn the MySQL insert into an asynchronous operation
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert; different items could build
        # different SQL statements here
        insert_sql = """
            insert into article(title, url, url_object_id, time, coll_nums, comment_nums, fav_nums, content)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        cursor.execute(insert_sql,
                       (pymysql.escape_string(item["title"]),
                        item["url"],
                        item["url_object_id"],
                        item["time"],
                        item["coll_nums"],
                        item["comment_nums"],
                        item["fav_nums"],
                        pymysql.escape_string(item["content"])))
```
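`from_settings` reads the connection parameters from settings.py, so these entries need to exist there, and the pipeline has to be enabled in `ITEM_PIPELINES` (the values below are placeholders to adapt):

```python
# settings.py
MYSQL_HOST = "localhost"
MYSQL_DB = "article"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"

ITEM_PIPELINES = {
    'Article.pipelines.MysqlTwistedPipeline': 2,
}
```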
The saved results are shown in the figure below.