A summary of today's learning
1. Learned the basic Scrapy concepts: Requests and Responses, logging, and sending e-mail, along with usage examples, and read through the official Scrapy documentation:
http://scrapy.readthedocs.io/en/latest/index.html
2. Used Scrapy to create two projects: one for the Xici proxy site (xicidaili.com) and one for the Stack Overflow site.
cd /d E:\Python\scrapydata\spider
scrapy startproject collectips
cd collectips
scrapy genspider xici xicidaili.com
Write the item.
Write the spider.
Write the pipelines.
Enable the pipeline in settings so it takes effect:
The ITEM_PIPELINES key is '<project>.pipelines.<Project>Pipeline' (module name in lowercase, class name with both parts capitalized); a sketch follows below.
Run the crawl: scrapy crawl xici
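As a minimal sketch of that naming rule (the class name CollectipsPipeline is an assumption based on the default template naming), the settings.py entry for the collectips project would look roughly like:
# settings.py of the collectips project (sketch)
ITEM_PIPELINES = {
    'collectips.pipelines.CollectipsPipeline': 300,
}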
Note: the pipeline code has to use the Python 3 exception and print syntax (except Exception as e, print("insert error:", e)), otherwise it raises a SyntaxError. The full Python 2.x vs 3.x comparison is in section 4 below.
Database configuration in settings:
# databases con
DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
Unfortunately it did not work out: the first crawl did fetch data, but the database was not set up correctly (the original config passed passwd as the number 123 rather than the string '123'), and on the second attempt the site blacklisted me and I could not access it any more. A quick connection check like the one sketched below would have caught this earlier.
Spent half the day on it... so frustrating.
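A minimal sketch of such a check, run separately before the crawl, assuming a local MySQL server with the test database already created:
# sanity check of the DBKWARGS connection (not part of the project code)
import MySQLdb

DBKWARGS = {'db': 'test', 'user': 'root', 'passwd': '123', 'host': 'localhost',
            'use_unicode': True, 'charset': 'utf8'}
try:
    con = MySQLdb.connect(**DBKWARGS)
    cur = con.cursor()
    cur.execute("SELECT VERSION()")
    print("connected, MySQL version:", cur.fetchone()[0])
    cur.close()
    con.close()
except Exception as e:
    print("connection failed:", e)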
3. Crawling the Stack Overflow site
3.1 Create the Scrapy project and spider
cd /d E:\Python\scrapydata\spider
scrapy startproject stackoverflow
cd stackoverflow
scrapy genspider sflow stackoverflow.com/questions?sort=votes
3.2 Write the items (the field names are carried over unchanged from the proxy project):
import scrapy

class StackoverflowItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    IP = scrapy.Field()
    PORT = scrapy.Field()
    POSITION = scrapy.Field()
    TYPE = scrapy.Field()
    SPEED = scrapy.Field()
3.3 Write the spider:
Points to note:
Import the item class from the items module.
Issue a follow-up request for each question and hand it to parse_question (callback=self.parse_question).
Pass the item along with the request and pick it up again in the callback:
meta={'pre_item':pre_item}
pre_item = response.meta['pre_item']
yield scrapy.Request(full_url,meta={'pre_item':pre_item},callback=self.parse_question,dont_filter=True)
parse_question finally returns the populated item with yield pre_item.
The resulting spider:
import scrapy
from stackoverflow.items import StackoverflowItem

class SflowSpider(scrapy.Spider):
    name = 'sflow'
    # allowed_domains takes bare domains, not full URLs
    allowed_domains = ['stackoverflow.com']
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # one request per question on the listing page, the item handed over via meta
        for href in response.css('.question-summary h3 a::attr(href)'):
            pre_item = StackoverflowItem()
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, meta={'pre_item': pre_item},
                                 callback=self.parse_question, dont_filter=True)

    def parse_question(self, response):
        # pick the item back up from the request meta and fill it in
        pre_item = response.meta['pre_item']
        pre_item['IP'] = response.css('h1 a::text').extract()[0]
        pre_item['PORT'] = response.css('.question .vote-count-post::text').extract()[0]
        pre_item['POSITION'] = response.css('.question .post-text').extract()[0]
        pre_item['TYPE'] = "33333"        # placeholder value
        pre_item['SPEED'] = response.url  # originally the literal string "response.url"; the URL itself is meant
        print(pre_item)
        yield pre_item
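Running it should follow the same pattern as the xici project, i.e. from inside the project directory:
scrapy crawl sflow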
3.4 Write the pipelines:
Insert the scraped data into the database. Note how the connection parameters are read from the settings and used to open the connection:
DBKWARGS = spider.settings.get('DBKWARGS')
con = MySQLdb.connect(**DBKWARGS)
import MySQLdb

class StackoverflowPipeline(object):
    def process_item(self, item, spider):
        # connection parameters come from DBKWARGS in settings.py
        DBKWARGS = spider.settings.get('DBKWARGS')
        con = MySQLdb.connect(**DBKWARGS)
        cur = con.cursor()
        sql = ("insert into proxys(IP,PORT,TYPE,POSITION,SPEED) values (%s,%s,%s,%s,%s)")
        lis = (item['IP'], item['PORT'], item['TYPE'], item['POSITION'], item['SPEED'])
        print(lis)
        try:
            cur.execute(sql, lis)
        except Exception as e:
            print("insert error:", e)
            con.rollback()
        else:
            con.commit()
        cur.close()
        con.close()
        return item
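The pipeline assumes a proxys table already exists in the test database. A minimal sketch of creating it (the column types are my own assumption, not taken from the original project):
# one-off helper: create the proxys table the pipeline inserts into (sketch)
import MySQLdb

DBKWARGS = {'db': 'test', 'user': 'root', 'passwd': '123', 'host': 'localhost',
            'use_unicode': True, 'charset': 'utf8'}
con = MySQLdb.connect(**DBKWARGS)
cur = con.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS proxys (
        IP       VARCHAR(255),
        PORT     VARCHAR(64),
        TYPE     VARCHAR(64),
        POSITION TEXT,
        SPEED    VARCHAR(255)
    ) DEFAULT CHARSET=utf8
""")
con.commit()
cur.close()
con.close()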
3.5 Configure settings
Source-file encoding: # -*- coding: utf-8 -*-
Database parameters: DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
Log file: LOG_FILE = "stackoverflow.log"
User agent: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
Disable cookies: COOKIES_ENABLED = False
Plus various other measures for getting around anti-crawling, still to be studied.
The full settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for stackoverflow project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'stackoverflow'
SPIDER_MODULES = ['stackoverflow.spiders']
NEWSPIDER_MODULE = 'stackoverflow.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stackoverflow (+http://www.yourdomain.com)'
# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
# databases con
DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
ITEM_PIPELINES = {
'stackoverflow.pipelines.StackoverflowPipeline': 300,
}
#Configure log file name
LOG_FILE = "stackoverflow.log"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'collectips (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
3.6 Experiment results
4. Differences between Python 3.x and Python 2.x encountered today
1. Exception handling
In Python 2.x:
try:
    fp = urllib.request.urlopen(blogurl)
except Exception, e:
    print (e)
    print('download exception %s' % blogurl)
    return 0
In Python 3.x:
try:
    fp = urllib.request.urlopen(blogurl)
except Exception as e:
    print (e)
    print('download exception %s' % blogurl)
    return 0
Using the old syntax under Python 3 raises a SyntaxError. (Strictly speaking, Python 2 would use urllib2.urlopen rather than urllib.request.urlopen; the point here is the except clause.)
2. print
Python 2.x, no parentheses:
print "insert error:", e
Python 3.x, print is a function:
print ("insert error:", e)
5. How Scrapy crawling works and the role of each component
6. Common strategies for keeping a crawler from being blocked:
- Set the User-Agent dynamically (rotate User-Agent strings at random to imitate different users' browsers); a minimal rotation sketch follows after this list.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
- Disable cookies (i.e. turn off the cookies middleware so no cookies are sent to the server; some sites detect crawlers through their cookie usage). The CookiesMiddleware is switched on or off via COOKIES_ENABLED:
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
- Add a download delay (avoid requesting too frequently; set it to 2 seconds or more).
- To make the crawler behave more like a human visitor, lower the request rate (in principle with help from the AutoThrottle extension) by adding to settings.py:
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 5
- Google Cache and Baidu Cache: where possible, fetch page data from the search-engine caches instead of the site itself.
- Use an IP pool: VPNs and proxy IPs, since most sites nowadays ban by IP.
- Use Crawlera (a proxy service built specifically for crawlers); once the downloader middleware is configured correctly, every request in the project goes out through Crawlera.
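A minimal sketch of the random User-Agent idea from the first bullet, written as a downloader middleware (the class name, the USER_AGENT_LIST setting and the middleware priority are assumptions, not part of the original project):
# middlewares.py (sketch): pick a random User-Agent for every request
import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting holding several UA strings
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)

# settings.py (sketch)
# USER_AGENT_LIST = ['Mozilla/5.0 ...', 'Mozilla/5.0 ...']
# DOWNLOADER_MIDDLEWARES = {'stackoverflow.middlewares.RandomUserAgentMiddleware': 400}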
7. Learned about crawling multi-level pages and about downloading and storing images, but have not tried them hands-on yet. The Crawlera settings would be: CRAWLERA_ENABLED = True, CRAWLERA_USER = 'the UserKey you registered/purchased', CRAWLERA_PASS = 'the corresponding Password'.
A good day. Keep it up tomorrow!