Learning Python Web Scraping with Scrapy (Part 2)

A summary of today's learning.

1. Reviewed Scrapy's basic concepts: Requests and Responses, logging, and sending e-mail, together with usage examples, and consulted the official Scrapy documentation:

       http://scrapy.readthedocs.io/en/latest/index.html

2. Created two Scrapy projects: one for the Xici proxy site (xicidaili.com) and one for Stack Overflow.

cd /d E:\Python\scrapydata\spider
scrapy startproject collectips
cd collectips
scrapy genspider xici xicidaili.com
Write the item:
Write the spider:
Write the pipelines:
(a sketch of these appears after the crawl command below)
Configure the pipeline in settings so that it takes effect:
ITEM_PIPELINES value: <project name>.pipelines.<ProjectName>Pipeline (module name lowercase, class name capitalized), e.g. 'collectips.pipelines.CollectipsPipeline': 300
Run the crawl: scrapy crawl xici
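The item, spider, and pipeline code for collectips is not reproduced above; here is a minimal sketch of the items.py and the pipeline registration, assuming the same IP/PORT/POSITION/TYPE/SPEED fields that the Stack Overflow project reuses below (the CollectipsItem and CollectipsPipeline names follow the template that scrapy startproject generates):

# collectips/items.py -- sketch, fields assumed from the later project
import scrapy

class CollectipsItem(scrapy.Item):
    IP = scrapy.Field()
    PORT = scrapy.Field()
    POSITION = scrapy.Field()
    TYPE = scrapy.Field()
    SPEED = scrapy.Field()

# collectips/settings.py -- enable the pipeline
ITEM_PIPELINES = {
    'collectips.pipelines.CollectipsPipeline': 300,
}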

Note: Python 3.x writes the exception clause as except Exception as e: instead of Python 2.x's except Exception, e:, and print is a function that needs parentheses (print("insert error:", e) rather than print "insert error:", e); using the old forms raises a SyntaxError. The full before/after snippets are in section 4 below.
Database configuration in settings:

# databases con
DBKWARGS={'db':'test','user':'root','passwd':123,'host':'localhost','use_unicode':True,'charset':'utf8'}

Unfortunately this one did not succeed: the first crawl did fetch data, but the database settings were wrong (most likely because passwd was given as the integer 123 rather than the string '123', as in the corrected DBKWARGS further down), and by the second attempt the site had blacklisted my IP, so I could no longer access it.

Wasted half the day on it... so frustrating.
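For next time, a minimal sketch (not from the original post) of verifying the corrected connection parameters and creating the proxys table that the pipeline below inserts into; the column types and sizes are assumptions:

import MySQLdb

# corrected parameters: passwd must be the string '123', not the integer 123
DBKWARGS = {'db': 'test', 'user': 'root', 'passwd': '123',
            'host': 'localhost', 'use_unicode': True, 'charset': 'utf8'}

con = MySQLdb.connect(**DBKWARGS)
cur = con.cursor()
# the pipeline inserts IP, PORT, TYPE, POSITION and SPEED; VARCHAR/TEXT types are guesses
cur.execute("""CREATE TABLE IF NOT EXISTS proxys (
                   IP VARCHAR(255),
                   PORT VARCHAR(255),
                   TYPE VARCHAR(255),
                   POSITION TEXT,
                   SPEED VARCHAR(255)
               )""")
con.commit()
cur.close()
con.close()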

3. Crawling the Stack Overflow site

3.1 Create the Scrapy project and spider

       cd /d  E:\Python\scrapydata\spider
       scrapy startproject stackoverflow
       cd stackoverflow
       scrapy genspider sflow stackoverflow.com/questions?sort=votes
3.2 Write the items:
import scrapy


class StackoverflowItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    IP = scrapy.Field()
    PORT = scrapy.Field()
    POSITION = scrapy.Field()
    TYPE = scrapy.Field()
    SPEED = scrapy.Field()
3.3 Write the spider:
Note: import the item class from the project's items module.
A follow-up Request is issued for each question page and handled in parse_question (callback=self.parse_question),
and the item is passed along with the request and picked up again in the callback:
meta={'pre_item':pre_item}
pre_item = response.meta['pre_item']
yield scrapy.Request(full_url,meta={'pre_item':pre_item},callback=self.parse_question,dont_filter=True)
parse_question then returns the item with yield pre_item.
import scrapy
from stackoverflow.items import StackoverflowItem

class SflowSpider(scrapy.Spider):
    name = 'sflow'
    allowed_domains = ['stackoverflow.com']  # domains only, not full URLs
    start_urls = ['http://stackoverflow.com/questions?sort=votes']
    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            pre_item= StackoverflowItem()
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url,meta={'pre_item':pre_item},callback=self.parse_question,dont_filter=True)
        
    def parse_question(self,response):
        pre_item = response.meta['pre_item']
        pre_item['IP'] = response.css('h1 a::text').extract()[0]                          # question title
        pre_item['PORT'] = response.css('.question .vote-count-post::text').extract()[0]  # vote count
        pre_item['POSITION'] = response.css('.question .post-text').extract()[0]          # question body
        pre_item['TYPE'] = "33333"        # placeholder value
        pre_item['SPEED'] = response.url  # record the question URL
        print (pre_item)
        yield pre_item
        
        
3.4 Write the pipelines:
Insert the scraped data into the database. Note how the connection parameters are pulled from the settings and the connection is obtained:
DBKWARGS = spider.settings.get('DBKWARGS')
con = MySQLdb.connect(**DBKWARGS)
import MySQLdb

class StackoverflowPipeline(object):
    def process_item(self, item, spider):
        DBKWARGS = spider.settings.get('DBKWARGS')
        con = MySQLdb.connect(**DBKWARGS)
        cur = con.cursor()
        
        sql = ("insert into proxys(IP,PORT,TYPE,POSITION,SPEED) values (%s,%s,%s,%s,%s)")
        lis = (item['IP'],item['PORT'],item['TYPE'],item['POSITION'],item['SPEED'])
        print (lis)
        try:
            cur.execute(sql,lis)
        except Exception as e:
            print ("insert error:",e)
            con.rollback()
        else:
            con.commit()
        cur.close()
        con.close()
        
        
        return item
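The pipeline above opens and closes a new MySQL connection for every single item. A possible refinement (not in the original post, just a sketch using Scrapy's open_spider/close_spider pipeline hooks) is to keep one connection for the whole crawl:

import MySQLdb

class StackoverflowPipeline(object):

    def open_spider(self, spider):
        # one connection for the entire crawl instead of one per item
        self.con = MySQLdb.connect(**spider.settings.get('DBKWARGS'))
        self.cur = self.con.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.con.close()

    def process_item(self, item, spider):
        sql = ("insert into proxys(IP,PORT,TYPE,POSITION,SPEED) "
               "values (%s,%s,%s,%s,%s)")
        lis = (item['IP'], item['PORT'], item['TYPE'],
               item['POSITION'], item['SPEED'])
        try:
            self.cur.execute(sql, lis)
        except Exception as e:
            print("insert error:", e)
            self.con.rollback()
        else:
            self.con.commit()
        return item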
3.5 Configure settings
File encoding: # -*- coding: utf-8 -*-
Database parameters: DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
Log file: LOG_FILE = "stackoverflow.log"
User agent: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
Disable cookies: COOKIES_ENABLED = False

There are various other anti-anti-crawling measures as well, still to be studied (see section 6 below). The full settings file:
# -*- coding: utf-8 -*-

# Scrapy settings for stackoverflow project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'stackoverflow'

SPIDER_MODULES = ['stackoverflow.spiders']
NEWSPIDER_MODULE = 'stackoverflow.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stackoverflow (+http://www.yourdomain.com)'

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
# databases con
DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
ITEM_PIPELINES = {
    'stackoverflow.pipelines.StackoverflowPipeline': 300,
}
#Configure log file name
LOG_FILE = "stackoverflow.log"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'collectips (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
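With these settings in place, the spider can be run the same way as the first project, using the spider name generated by genspider above:

scrapy crawl sflow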

3.6 Results



4. Differences between Python 3.x and Python 2.x encountered today

1. Exception handling

Under Python 2.x:

try:
    fp = urllib2.urlopen(blogurl)
except Exception, e:
    print(e)
    print('download exception %s' % blogurl)
    return 0

Under Python 3.x:

try:
    fp = urllib.request.urlopen(blogurl)
except Exception as e:
    print(e)
    print('download exception %s' % blogurl)
    return 0

Using the old except Exception, e form under Python 3 raises a SyntaxError.

The same goes for print. Python 2.x treats it as a statement, no parentheses:
print "insert error:", e

Python 3.x treats it as a function:
print("insert error:", e)

5. How Scrapy works and what each of its components does

In brief: the Engine drives the crawl, the Scheduler queues requests, the Downloader fetches pages, the Spiders parse responses into items and further requests, and the Item Pipelines post-process the items, with downloader and spider middlewares hooked in between.
6. The main strategies for keeping a crawler from being blocked:

  • Dynamically set the User-Agent (rotate User-Agent strings to mimic different browsers); a rotation sketch follows this list.
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
  • Disable cookies (i.e. leave the cookies middleware off so no cookies are sent to the server; some sites detect crawlers through their cookie usage).
    • COOKIES_ENABLED controls whether CookiesMiddleware is on or off:
      # Disable cookies (enabled by default)
      COOKIES_ENABLED = False
  • Set a download delay (to avoid requesting too frequently; 2 seconds or more). To make the crawler behave more like a human (building on the AutoThrottle extension), add to settings.py:
      CONCURRENT_REQUESTS = 1
      DOWNLOAD_DELAY = 5
  • Google Cache and Baidu Cache: where possible, fetch page data from search-engine caches rather than the site itself.
  • Use an IP pool (VPNs and proxy IPs): most sites nowadays ban by IP address.
  • Use Crawlera (a proxy component built for crawlers); once the downloader middleware is correctly configured, every request in the project goes out through Crawlera:
      CRAWLERA_ENABLED = True
      CRAWLERA_USER = 'your registered/purchased user key'
      CRAWLERA_PASS = 'your registered/purchased password'
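A minimal sketch of the User-Agent rotation mentioned in the first bullet, using a custom downloader middleware; the USER_AGENTS list, the RandomUserAgentMiddleware class, and the stackoverflow.middlewares module path are illustrative assumptions, not code from the original projects:

import random

USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
]

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # pick a different User-Agent header for every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

and register it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'stackoverflow.middlewares.RandomUserAgentMiddleware': 400,
}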

7. Also studied multi-level page crawling and how to crawl and store images, but haven't put them into practice yet.
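For when I do try image downloading, a minimal sketch of configuring Scrapy's built-in ImagesPipeline (standard Scrapy setup rather than code from today's projects; the IMAGES_STORE path and the ImageItem name are assumptions):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'E:/Python/scrapydata/images'   # any local directory

# items.py -- the item needs these two standard fields
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()   # list of image URLs to download
    images = scrapy.Field()       # filled in by the pipeline with download results

The images pipeline also requires Pillow to be installed.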


A good day. Keep it up tomorrow!


