Web Scraping Course Notes
Scrapy
Why learn Scrapy?
Scrapy is an application framework written for crawling websites and extracting structured data; we only need to implement a small amount of code to start crawling quickly.
Scrapy is built on the Twisted ['twɪstɪd] asynchronous networking framework, which speeds up our downloads.
http://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/overview.html
Difference between asynchronous and non-blocking
Asynchronous: once the call is issued, it returns immediately, regardless of whether a result is available.
Non-blocking: concerns the state of the program while it waits for the result of a call (a message or return value); until the result is available, the call does not block the current thread.
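A minimal sketch of the non-blocking idea (plain sockets, nothing Scrapy-specific): a non-blocking recv() returns or raises immediately instead of parking the thread until data arrives.

import socket

sock = socket.socket()
sock.connect(("example.com", 80))   # blocking connect, for simplicity
sock.setblocking(False)             # switch the socket to non-blocking mode
try:
    data = sock.recv(1024)          # returns immediately, never waits
except BlockingIOError:
    data = b""                      # nothing arrived yet; the thread stays free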
Crawler workflow
The Scrapy crawler workflow
Getting started
Create a Scrapy project
Command: scrapy startproject <project_name>
Example:
scrapy startproject myspider
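The command generates a project skeleton roughly like the following (the exact file list varies slightly across Scrapy versions):

myspider/
    scrapy.cfg            # deployment configuration
    myspider/
        __init__.py
        items.py          # Item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # spiders generated by genspider live here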
Generate a spider
Command: scrapy genspider <spider_name> <allowed_domain>
Examples:
scrapy genspider baidu "baidu.com"
scrapy genspider itcast "itcast.cn"
Extract data
Flesh out the spider, extracting with XPath and similar methods
Save data
Save the data in a pipeline
Enable the pipeline by un-commenting its entry in ITEM_PIPELINES in settings.py (see the sketch below)
Items yielded by the spider are passed into the pipeline
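A minimal settings.py sketch, assuming the project is named myspider and uses the two pipeline classes from the demo below; the numbers (0-1000) set the order in which pipelines run, lower first:

ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
    'myspider.pipelines.MyspiderPipeline1': 400,
}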
demo
# -*- coding: utf-8 -*-
import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'                  # spider name
    allowed_domains = ['itcast.cn']  # domains the spider may crawl
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']  # initial URLs to request

    def parse(self, response):
        # handle the response for the start_urls address
        # ret1 = response.xpath("//div[@class='tea_con']//h3/text()").extract()
        # print(ret1)

        # group the results
        li_list = response.xpath("//div[@class='tea_con']//li")
        for li in li_list:
            item = {}
            item["name"] = li.xpath(".//h3/text()").extract_first()
            item["title"] = li.xpath(".//h4/text()").extract_first()
            # print(item)
            # parse() may yield a Request, BaseItem, dict or None
            yield item
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # TODO
        item["hello"] = "world"
        # print(item)
        return item  # return the item so later pipelines receive it


class MyspiderPipeline1(object):
    def process_item(self, item, spider):
        print(item)
        return item
logging
# -*- coding: utf-8 -*-
import scrapy
import logging

logger = logging.getLogger(__name__)


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/']

    def parse(self, response):
        for i in range(10):
            item = {}
            item["come_from"] = "itcast"
            logger.warning(item)
            yield item
# coding=utf-8
import logging

# configure the output style of the log
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s [%(filename)s:%(lineno)d] '
           ': %(message)s'
           ' - %(asctime)s',
    datefmt='[%d/%b/%Y %H:%M:%S]',
)
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    logger.info("this is an info log")
    logger.info("this is an info log 1")
Making pagination requests
# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['http://hr.tencent.com/position.php']

    def parse(self, response):
        # skip the header row and the trailing pagination row
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = TencentItem()
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
            yield item

        # find the URL of the next page; on the last page it is "javascript:;"
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        if next_url != "javascript:;":
            next_url = "http://hr.tencent.com/" + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse,
                # meta={"item": item}
            )

    # def parse1(self, response):
    #     response.meta["item"]
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient
from tencent.items import TencentItem

client = MongoClient()
collection = client["tencent"]["hr"]


class TencentPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, TencentItem):
            print(item)
            # insert_one() replaces the insert() method removed in pymongo 4.x
            collection.insert_one(dict(item))
        return item
Run the spider
scrapy crawl <spider_name>
Scrapy in depth
Defining an Item
# -*- coding: utf-8 -*-
import scrapy
from yangguang.items import YangguangItem


class YgSpider(scrapy.Spider):
    name = 'yg'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']

    def parse(self, response):
        # settings are available on the spider:
        # self.settings["MONGO_HOST"]
        # self.settings.get("MONGO_HOST", "")
        # print(self.hello)  # only works if a pipeline sets spider.hello in open_spider()

        # group the rows
        tr_list = response.xpath("//div[@class='greyframe']/table[2]/tr/td/table/tr")
        for tr in tr_list:
            item = YangguangItem()
            item["title"] = tr.xpath("./td[2]/a[@class='news14']/@title").extract_first()
            item["href"] = tr.xpath("./td[2]/a[@class='news14']/@href").extract_first()
            item["publish_date"] = tr.xpath("./td[last()]/text()").extract_first()
            yield scrapy.Request(
                item["href"],
                callback=self.parse_detail,
                meta={"item": item}
            )

        # pagination
        next_url = response.xpath("//a[text()='>']/@href").extract_first()
        if next_url is not None:
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

    def parse_detail(self, response):
        # handle the detail page
        item = response.meta["item"]
        item["content"] = response.xpath("//div[@class='c1 text14_2']//text()").extract()
        item["content_img"] = response.xpath("//div[@class='c1 text14_2']//img/@src").extract()
        item["content_img"] = ["http://wz.sun0769.com" + i for i in item["content_img"]]
        # print(item)
        yield item
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import re
from pymongo import MongoClient


class YangguangPipeline(object):
    def open_spider(self, spider):
        # spider.hello = "world"  # attributes set here become visible on the spider
        client = MongoClient()
        self.collection = client["test"]["test"]

    def process_item(self, item, spider):
        spider.settings.get("MONGO_HOST")  # settings are reachable through the spider
        item["content"] = self.process_content(item["content"])
        print(item)
        self.collection.insert_one(dict(item))  # insert() was removed in pymongo 4.x
        return item

    def process_content(self, content):
        content = [re.sub(r"\xa0|\s", "", i) for i in content]
        content = [i for i in content if len(i) > 0]  # drop empty strings from the list
        return content
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class YangguangItem(scrapy.Item):
    # define the fields for your item here
    title = scrapy.Field()
    href = scrapy.Field()
    publish_date = scrapy.Field()
    content_img = scrapy.Field()
    content = scrapy.Field()
Passing data between different parse functions
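A minimal sketch of the technique, already used by YgSpider above: attach data to a Request through its meta dict and read it back in the next callback. The URLs and fields here are hypothetical.

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com/list']      # hypothetical list page

    def parse(self, response):
        item = {"title": "..."}                   # partially filled on the list page
        yield scrapy.Request(
            "http://example.com/detail",          # hypothetical detail page
            callback=self.parse_detail,
            meta={"item": item},                  # attach the item to the request
        )

    def parse_detail(self, response):
        item = response.meta["item"]              # read it back in the next callback
        item["content"] = response.xpath("//p/text()").extract()
        yield item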
Debug information printed by the program
scrapy shell
The Scrapy shell is an interactive console where code, including XPath expressions, can be tried out and debugged without starting a spider.
Usage:
scrapy shell http://www.itcast.cn/channel/teacher.shtml
response.url: URL of the current response
response.request.url: URL of the request that produced the current response
response.headers: the response headers
response.body: the response body, i.e. the HTML, as bytes by default
response.request.headers: the headers of the request that produced the current response
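A hypothetical session; whether the XPath (taken from the itcast demo above) still matches depends on the live page:

$ scrapy shell http://www.itcast.cn/channel/teacher.shtml
>>> response.url
'http://www.itcast.cn/channel/teacher.shtml'
>>> response.xpath("//div[@class='tea_con']//li//h3/text()").extract_first()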
The settings file
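Besides ITEM_PIPELINES (shown earlier), settings.py holds built-in options as well as any custom keys; spiders and pipelines read them with self.settings.get(...) / spider.settings.get(...). A minimal sketch; MONGO_HOST is the custom key used in the yangguang demo and its value here is an assumption:

LOG_LEVEL = "WARNING"      # log only warnings and above
LOG_FILE = "./a.log"       # write the log to a file instead of the terminal
MONGO_HOST = "localhost"   # custom key, read via self.settings.get("MONGO_HOST")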
Key points
### Using the logging module
- Scrapy
  - set LOG_LEVEL = "WARNING" in settings
  - set LOG_FILE = "./a.log" in settings  # where the log is saved; once set, the log no longer shows in the terminal
  - import logging and instantiate a logger; that logger can emit output from any file (see the sketch after this list)
- Ordinary (non-Scrapy) projects
  - import logging
  - logging.basicConfig(...)  # configure the log output style/format
  - instantiate `logger = logging.getLogger(__name__)`
  - call the logger from any .py file
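A sketch of that last Scrapy point with a hypothetical pipeline: a module-level logger picks up whatever LOG_LEVEL/LOG_FILE are set in settings.

import logging

logger = logging.getLogger(__name__)  # logger named after this module

class LoggingPipeline(object):
    def process_item(self, item, spider):
        logger.warning(item)  # routed through Scrapy's logging configuration
        return item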