A summary of today's learning
1. Learned the basic Scrapy concepts: Requests and Responses, logging, and sending e-mail, along with usage examples, and read through the official Scrapy documentation:
http://scrapy.readthedocs.io/en/latest/index.html
2. Used Scrapy to create two projects: one for the Xici proxy site (xicidaili.com) and one for the Stack Overflow site.
cd /d E:\Python\scrapydata\spider
scrapy startproject collectips
cd collectips
scrapy genspider xici xicidaili.com
Write the item.
Write the spider.
Write the pipelines.
Enable the pipeline in settings so it takes effect:
The ITEM_PIPELINES key is '<project>.pipelines.<Project>Pipeline' (module name in lowercase, class name with both parts capitalized); a sketch follows below.
Run the crawl: scrapy crawl xici
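As a minimal sketch of that naming rule (the class name CollectipsPipeline is an assumption based on the default template naming), the settings.py entry for the collectips project would look roughly like:
# settings.py of the collectips project (sketch)
ITEM_PIPELINES = {
    'collectips.pipelines.CollectipsPipeline': 300,
}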
Note: the pipeline code has to use the Python 3 exception and print syntax (except Exception as e, print("insert error:", e)), otherwise it raises a SyntaxError. The full Python 2.x vs 3.x comparison is in section 4 below.
Database configuration in settings:
# databases con
DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
Unfortunately it did not work out: the first crawl did fetch data, but the database was not set up correctly (the original config passed passwd as the number 123 rather than the string '123'), and on the second attempt the site blacklisted me and I could not access it any more. A quick connection check like the one sketched below would have caught this earlier.
Spent half the day on it... so frustrating.
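A minimal sketch of such a check, run separately before the crawl, assuming a local MySQL server with the test database already created:
# sanity check of the DBKWARGS connection (not part of the project code)
import MySQLdb

DBKWARGS = {'db': 'test', 'user': 'root', 'passwd': '123', 'host': 'localhost',
            'use_unicode': True, 'charset': 'utf8'}
try:
    con = MySQLdb.connect(**DBKWARGS)
    cur = con.cursor()
    cur.execute("SELECT VERSION()")
    print("connected, MySQL version:", cur.fetchone()[0])
    cur.close()
    con.close()
except Exception as e:
    print("connection failed:", e)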
3. Crawling the Stack Overflow site
3.1 Create the Scrapy project and spider
cd /d E:\Python\scrapydata\spider
scrapy startproject stackoverflow
cd stackoverflow
scrapy genspider sflow stackoverflow.com/questions?sort=votes
3.2 Write the items (the field names are carried over unchanged from the proxy project):
import scrapy

class StackoverflowItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    IP = scrapy.Field()
    PORT = scrapy.Field()
    POSITION = scrapy.Field()
    TYPE = scrapy.Field()
    SPEED = scrapy.Field()
3.3 Write the spider:
Points to note:
Import the item class from the items module.
Issue a follow-up request for each question and hand it to parse_question (callback=self.parse_question).
Pass the item along with the request and pick it up again in the callback:
meta={'pre_item':pre_item}
pre_item = response.meta['pre_item']
yield scrapy.Request(full_url,meta={'pre_item':pre_item},callback=self.parse_question,dont_filter=True)
parse_question finally returns the populated item with yield pre_item.
The resulting spider:
import scrapy
from stackoverflow.items import StackoverflowItem

class SflowSpider(scrapy.Spider):
    name = 'sflow'
    # allowed_domains takes bare domains, not full URLs
    allowed_domains = ['stackoverflow.com']
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # one request per question on the listing page, the item handed over via meta
        for href in response.css('.question-summary h3 a::attr(href)'):
            pre_item = StackoverflowItem()
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, meta={'pre_item': pre_item},
                                 callback=self.parse_question, dont_filter=True)

    def parse_question(self, response):
        # pick the item back up from the request meta and fill it in
        pre_item = response.meta['pre_item']
        pre_item['IP'] = response.css('h1 a::text').extract()[0]
        pre_item['PORT'] = response.css('.question .vote-count-post::text').extract()[0]
        pre_item['POSITION'] = response.css('.question .post-text').extract()[0]
        pre_item['TYPE'] = "33333"        # placeholder value
        pre_item['SPEED'] = response.url  # originally the literal string "response.url"; the URL itself is meant
        print(pre_item)
        yield pre_item
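Running it should follow the same pattern as the xici project, i.e. from inside the project directory:
scrapy crawl sflow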
3.4 Write the pipelines:
Insert the scraped data into the database. Note how the connection parameters are read from the settings and used to open the connection:
DBKWARGS = spider.settings.get('DBKWARGS')
con = MySQLdb.connect(**DBKWARGS)
import MySQLdb

class StackoverflowPipeline(object):
    def process_item(self, item, spider):
        # connection parameters come from DBKWARGS in settings.py
        DBKWARGS = spider.settings.get('DBKWARGS')
        con = MySQLdb.connect(**DBKWARGS)
        cur = con.cursor()
        sql = ("insert into proxys(IP,PORT,TYPE,POSITION,SPEED) values (%s,%s,%s,%s,%s)")
        lis = (item['IP'], item['PORT'], item['TYPE'], item['POSITION'], item['SPEED'])
        print(lis)
        try:
            cur.execute(sql, lis)
        except Exception as e:
            print("insert error:", e)
            con.rollback()
        else:
            con.commit()
        cur.close()
        con.close()
        return item
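The pipeline assumes a proxys table already exists in the test database. A minimal sketch of creating it (the column types are my own assumption, not taken from the original project):
# one-off helper: create the proxys table the pipeline inserts into (sketch)
import MySQLdb

DBKWARGS = {'db': 'test', 'user': 'root', 'passwd': '123', 'host': 'localhost',
            'use_unicode': True, 'charset': 'utf8'}
con = MySQLdb.connect(**DBKWARGS)
cur = con.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS proxys (
        IP       VARCHAR(255),
        PORT     VARCHAR(64),
        TYPE     VARCHAR(64),
        POSITION TEXT,
        SPEED    VARCHAR(255)
    ) DEFAULT CHARSET=utf8
""")
con.commit()
cur.close()
con.close()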
3.5 Configure settings
Source-file encoding: # -*- coding: utf-8 -*-
Database parameters: DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
Log file: LOG_FILE = "stackoverflow.log"
User agent: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
Disable cookies: COOKIES_ENABLED = False
Plus various other measures for getting around anti-crawling, still to be studied.
The full settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for stackoverflow project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'stackoverflow'
SPIDER_MODULES = ['stackoverflow.spiders']
NEWSPIDER_MODULE = 'stackoverflow.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stackoverflow (+http://www.yourdomain.com)'
# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
# databases con
DBKWARGS={'db':'test','user':'root','passwd':'123','host':'localhost','use_unicode':True,'charset':'utf8'}
ITEM_PIPELINES = {
'stackoverflow.pipelines.StackoverflowPipeline': 300,
}
#Configure log file name
LOG_FILE = "stackoverflow.log"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'collectips (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
3.6 Experiment results
4. Differences between Python 3.x and Python 2.x encountered today
1. Exception handling
In Python 2.x:
try:
    fp = urllib.request.urlopen(blogurl)
except Exception, e:
    print (e)
    print('download exception %s' % blogurl)
    return 0
In Python 3.x:
try:
    fp = urllib.request.urlopen(blogurl)
except Exception as e:
    print (e)
    print('download exception %s' % blogurl)
    return 0
Using the old syntax under Python 3 raises a SyntaxError. (Strictly speaking, Python 2 would use urllib2.urlopen rather than urllib.request.urlopen; the point here is the except clause.)
2. print
Python 2.x, no parentheses:
print "insert error:", e
Python 3.x, print is a function:
print ("insert error:", e)
5. How Scrapy crawling works and the role of each component
6. Common strategies for keeping a crawler from being blocked:
- Set the User-Agent dynamically (rotate User-Agent strings at random to imitate different users' browsers); a minimal rotation sketch follows after this list.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
- Disable cookies (i.e. turn off the cookies middleware so no cookies are sent to the server; some sites detect crawlers through their cookie usage). The CookiesMiddleware is switched on or off via COOKIES_ENABLED:
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
- Add a download delay (avoid requesting too frequently; set it to 2 seconds or more).
- To make the crawler behave more like a human visitor, lower the request rate (in principle with help from the AutoThrottle extension) by adding to settings.py:
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 5
- Google Cache and Baidu Cache: where possible, fetch page data from the search-engine caches instead of the site itself.
- Use an IP pool: VPNs and proxy IPs, since most sites nowadays ban by IP.
- Use Crawlera (a proxy service built specifically for crawlers); once the downloader middleware is configured correctly, every request in the project goes out through Crawlera.
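A minimal sketch of the random User-Agent idea from the first bullet, written as a downloader middleware (the class name, the USER_AGENT_LIST setting and the middleware priority are assumptions, not part of the original project):
# middlewares.py (sketch): pick a random User-Agent for every request
import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting holding several UA strings
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)

# settings.py (sketch)
# USER_AGENT_LIST = ['Mozilla/5.0 ...', 'Mozilla/5.0 ...']
# DOWNLOADER_MIDDLEWARES = {'stackoverflow.middlewares.RandomUserAgentMiddleware': 400}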
7. Learned about crawling multi-level pages and about downloading and storing images, but have not tried them hands-on yet. The Crawlera settings would be: CRAWLERA_ENABLED = True, CRAWLERA_USER = 'the UserKey you registered/purchased', CRAWLERA_PASS = 'the corresponding Password'.
A good day. Keep it up tomorrow!