Web Crawler Study (2): The Scrapy Framework

Author: 又见智能商业

Category: web crawlers. Tags: crawler

Article link: https://blog.csdn.net/livan1234/article/details/80850926

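I am a learner fascinated by digging the value out of data. In my everyday work and study I hope to uncover that value and find the secrets hidden in data. I believe the value of data shows up not only in companies; individuals can feel its appeal too. Let's use technology to decode behavior and let big data give everyone a boost. Readers are welcome to follow my official account so we can discuss the interesting things in data together.

My official account is: livandata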

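1. The Scrapy framework:

In cmd, enter the following to create a project: scrapy startproject my_crawler

Different commands create different types of spiders:

First, create a basic spider:

scrapy genspider -t basic qsbk qiushibaike.com

The example is:

(1) The test file: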

# -*- coding: utf-8 -*-
import scrapy
from my_crawler.items import MyCrawlerItem
from scrapy.http import Request


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["qiushibaike.com"]
    # start_urls = ['http://qiushibaike.com/']

    # Customise the initial request (here: send a browser-like User-Agent).
    def start_requests(self):
        # Headers are a dict that maps each header name to its value.
        ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"}
        yield Request('http://www.qiushibaike.com/', headers=ua)

    def parse(self, response):
        it = MyCrawlerItem()
        # Text of every post on the page.
        it["content"] = response.xpath("//div[@class='content']/span/text()").extract()
        # href of every post link on the page.
        it["link"] = response.xpath("//a[@class='contentHerf']/@href").extract()
        yield it
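
Request headers are passed as a mapping from header name to value, and the header name is User-Agent, written with a hyphen. The sketch below shows the same idea with the standard library's urllib.request instead of Scrapy (an assumption, so that it runs without a Scrapy project). It only builds the request object; nothing is sent over the network.

```python
import urllib.request

UA = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"

# A dict maps each header name to its value.
headers = {"User-Agent": UA}

# Building the Request object does not contact the server.
req = urllib.request.Request("http://www.qiushibaike.com/", headers=headers)

# urllib stores header names via str.capitalize(), hence "User-agent".
sent_ua = req.get_header("User-agent")
print(sent_ua)
```

In Scrapy the same dict is what goes into the headers argument of Request.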

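(2) The item file:

import scrapy


class MyCrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()  # post text, filled in by the spider
    link = scrapy.Field()     # post link, filled in by the spider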

(3) The pipeline file:

class MyCrawlerPipeline(object):
    def process_item(self, item, spider):
        # Print each post with its link; zip stops at the shorter list.
        for content, link in zip(item["content"], item["link"]):
            print(content)
            print(link)

        return item

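2. Common Scrapy commands:

Scrapy has two kinds of commands:

The first kind is global commands:

fetch: fetches a single web page. Main options: -h; --nolog.

scrapy fetch http://www.baidu.com: fetches one page and prints the crawl log along with it.

scrapy fetch https://www.baidu.com --nolog: fetches the page without printing the log.

runspider: runs a spider file on its own, without a Scrapy project.

scrapy runspider test.py: runs a standalone spider file that does not belong to a project.

scrapy shell http://www.baidu.com --nolog: fetches Baidu and opens the interactive shell.

The second kind is project commands:

Go into the project directory first:

scrapy bench: tests how fast the local hardware can crawl.

scrapy genspider -t basic weisun baidu.com

-l: lists the spider templates available in the current project;

-t: creates a spider from the given template;

basic: the basic template.

scrapy check weisun: checks whether the spider weisun can run;

scrapy crawl weisun: runs the spider weisun.

scrapy list: lists the spiders available in the current project.

scrapy edit weisun: edits the spider weisun (on Linux).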

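3. What the XPath notation means:

/: selects a tag by walking down from the root:

/html/head/title

text(): gets the content under a tag, that is, its text:

/html/head/title/text()

@: selects an attribute of a tag.

//: finds all matching tags anywhere in the document:

//li: finds all li tags.

tag[@attribute]: //li[@class="hiddenxl"]/a/@href gives the href attribute values of the a tags under li tags whose class is hiddenxl.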
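
The path notation above can be tried out without running a crawler. The sketch below is only an illustration: it uses Python's standard-library xml.etree.ElementTree, which supports just a subset of XPath (no text() or @attribute steps inside the path), rather than Scrapy's own selectors, and the small page is made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page used only for illustration.
PAGE = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <ul>
      <li class="hiddenxl"><a href="/first">First</a></li>
      <li class="hiddenxl"><a href="/second">Second</a></li>
      <li class="other"><a href="/third">Third</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Equivalent of /html/head/title/text(): walk down and read the text.
title = root.findtext("head/title")

# Equivalent of //li: every li element anywhere in the document.
all_li = root.findall(".//li")

# Equivalent of //li[@class="hiddenxl"]/a/@href:
# filter li by attribute, step to a, then read the href attribute.
links = [a.get("href") for a in root.findall(".//li[@class='hiddenxl']/a")]

print(title)
print(len(all_li))
print(links)
```

Inside a Scrapy spider the full expressions can be passed directly to response.xpath(...).extract(), as the spiders in this article do.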

For everything else about Scrapy, see the official Scrapy tutorial: http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

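4. Scrapy crawlers in practice:

1) A crawler for 天善智能 (hellobi.com): it mainly collects the project content (see the ts example for details):

import scrapy


class MyCrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    stu = scrapy.Field()

The body of the spider in the spider class: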

# -*- coding: utf-8 -*-
import scrapy
from my_crawler.items import MyCrawlerItem
from scrapy.http import Request


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['hellobi.com']
    start_urls = ['http://edu.hellobi.com/course/125']

    def parse(self, response):
        item = MyCrawlerItem()
        item["title"] = response.xpath("//ol[@class='breadcrumb']/li[@class='active']/text()").extract()
        item["link"] = response.xpath("//ul[@class='nav nav-tabs']/li[@class='active']/a/@href").extract()
        item["stu"] = response.xpath("//span[@class='course-view']/text()").extract()
        yield item
        # Queue the other course pages; Scrapy filters out duplicate requests.
        for i in range(1, 125):
            url = "http://edu.hellobi.com/course/" + str(i)
            yield Request(url, callback=self.parse)

Before a pipeline can be used, it has to be enabled:

Add the following to the settings file:

# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'my_crawler.pipelines.MyCrawlerPipeline': 300,
}

Once the settings are done, add the following to the pipeline file:

# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class MyCrawlerPipeline(object):
    def __init__(self):
        # Open the output file once, in append mode.
        self.fh = open("F:/python_workspace/my_crawler/files/1.txt", "a", encoding="utf-8")

    def process_item(self, item, spider):
        print(item["title"])
        print(item["link"])
        print(item["stu"])
        print("------------------------")
        self.fh.write(item["title"][0] + "\n" + item["link"][0] + "\n" + item["stu"][0] + "\n" + "-----------------------" + "\n")
        return item

    def close_spider(self, spider):
        # Scrapy passes the spider as an argument when it closes.
        self.fh.close()
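
Scrapy drives a pipeline through a simple life cycle: open_spider is called once at the start, process_item once per item, and close_spider once at the end. The sketch below walks through that flow by hand, outside Scrapy. The class name FileWriterPipeline, the temporary output file and the sample item are assumptions made for illustration.

```python
import os
import tempfile


class FileWriterPipeline:
    """Hypothetical pipeline that appends a few lines of text per item."""

    def __init__(self, path):
        self.path = path
        self.fh = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.fh = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; returns the item for later pipelines.
        self.fh.write(item["title"][0] + "\n" + item["link"][0] + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.fh.close()


# Drive the pipeline by hand, in the order Scrapy would during a crawl.
out_path = os.path.join(tempfile.mkdtemp(), "1.txt")
pipeline = FileWriterPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": ["Course A"], "link": ["/course/1"]}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as fh:
    written = fh.read()
print(written)
```

Opening the file in open_spider rather than in __init__ is a design choice: the file is then only created when a crawl actually starts.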

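2) A crawler that logs in automatically:

redir: controls which page is shown after the redirect, <input name="redir" type="hidden" value="https://www.douban.com/"/>; if the login fails, the site redirects to the home page automatically.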

# -*- coding: utf-8 -*-

import scrapy
from scrapy.http import Request
from scrapy.http import FormRequest
import urllib.request


class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['douban.com']
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0"}

    def start_requests(self):
        # Fetch the login page first; meta={"cookiejar": 1} turns on cookie tracking.
        return [Request("https://www.douban.com/accounts/login", callback=self.parse, meta={"cookiejar": 1})]

    def parse(self, response):

        captcha = response.xpath("//img[@id='captcha_image']/@src").extract()

        if len(captcha) > 0:
            # Solving captchas fully automatically needs machine learning; this is a semi-automatic approach.
            # A captcha-solving service with an API (such as 打码兔) could be used instead.
            print("A captcha is present.")
            localpath = "F:/python_workspace/my_crawler/pic/captcha.png"
            urllib.request.urlretrieve(captcha[0], filename=localpath)
            print("Open the local captcha image and type the captcha:")
            captcha_value = input()

            data = {
                "form_email": "your_account",       # placeholder: use your own account
                "form_password": "your_password",   # placeholder: use your own password
                "captcha-solution": captcha_value,
                "redir": "https://www.douban.com/people/126344945/",
            }

        else:
            print("No captcha this time.")
            data = {
                "form_email": "your_account",       # placeholder: use your own account
                "form_password": "your_password",   # placeholder: use your own password
                "redir": "https://www.douban.com/people/126344945/",
            }
        print("Logging in...")
        # The form is submitted through FormRequest, which has to be imported; it sends a POST request.
        # The request is handed back with return; from_response submits the form found in the response.
        # Its first argument is the response that contains the form.
        return [FormRequest.from_response(response,
                                          # carry the cookie jar along
                                          meta={"cookiejar": response.meta["cookiejar"]},
                                          # imitate a browser
                                          headers=self.header,
                                          # data for the POST form
                                          formdata=data,
                                          # callback that handles the next response
                                          callback=self.next,
                                          )]

    def next(self, response):
        print("Logged in; the profile page has been fetched.")
        title = response.xpath("/html/head/title").extract()
        note = response.xpath("//div[@class='note']").extract()
        print(title[0])
        print(note[0])

3) A crawler for the Dangdang store that writes the results to a database:

http://category.dangdang.com/cp01.54.04.00.00.00.html

http://category.dangdang.com/pg2-cp01.54.04.00.00.00.html
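
The two URLs above differ only in the pgN- prefix in front of the category code, which is what allows a crawler to build the listing URLs in a loop. A small sketch of that pattern follows; the helper name page_url is hypothetical.

```python
BASE = "http://category.dangdang.com/"
CATEGORY = "cp01.54.04.00.00.00.html"


def page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page == 1:
        # The first page is the bare category URL.
        return BASE + CATEGORY
    # Later pages carry a "pgN-" prefix before the category code.
    return BASE + "pg" + str(page) + "-" + CATEGORY


urls = [page_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```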

# -*- coding: utf-8 -*-

import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        item = DangdangItem()
        item["title"] = response.xpath("//a[@class='pic']/@title").extract()
        item["link"] = response.xpath("//a[@class='pic']/@href").extract()
        item["comment"] = response.xpath("//a[@class='P_pl']/text()").extract()
        yield item
        for i in range(1, 100):
            url = "http://category.dangdang.com/pg" + str(i) + "-cp01.54.04.00.00.00.html"
            yield Request(url, callback=self.parse)

In the items file:

import scrapy
class DangdangItem(scrapy.Item):
    title=scrapy.Field()
    link=scrapy.Field()
    comment=scrapy.Field()

In settings, lift the robots.txt restriction and enable the pipeline:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

In the pipeline file:

import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect(host="127.0.0.1", user="root", password="123456", database="livan")
        # Parameterised query: the driver escapes quotes in the scraped text.
        sql = "insert into goods(title, link, comment) values(%s, %s, %s)"
        with conn.cursor() as cursor:
            for i in range(0, len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                comment = item["comment"][i]
                print(title)
                print(link)
                print(comment)
                cursor.execute(sql, (title, link, comment))
        # pymysql does not autocommit by default, so commit before closing.
        conn.commit()
        conn.close()
        return item
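
Scraped titles often contain quotes, and SQL assembled by string concatenation breaks on them, so passing the values as query parameters is the safer pattern. The sketch below illustrates it with the standard-library sqlite3 module standing in for MySQL (an assumption, so that the example runs without a database server). The goods table and the sample item are made up. With pymysql the placeholder is %s instead of ?.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE goods (title TEXT, link TEXT, comment TEXT)")

# A made-up item shaped like the one the spider yields: parallel lists.
item = {
    "title": ["Python Basics", "It's a 'quoted' title"],
    "link": ["http://example.com/1", "http://example.com/2"],
    "comment": ["120 reviews", "8 reviews"],
}

# Turn the parallel lists into one row per product.
rows = list(zip(item["title"], item["link"], item["comment"]))

# Placeholders let the driver handle quotes in the scraped text.
conn.executemany("INSERT INTO goods (title, link, comment) VALUES (?, ?, ?)", rows)
conn.commit()  # make the inserts permanent

stored = conn.execute("SELECT title, comment FROM goods ORDER BY rowid").fetchall()
print(stored)
conn.close()
```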

5. Handling JSON data:

Python has a json module:

>>> import json
>>> data = '{"id":"2342424324"}'
>>> jdata = json.loads(data)
>>> jdata.keys()
dict_keys(['id'])
>>> jdata['id']
'2342424324'
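
Responses from crawled APIs are usually nested rather than flat. The short sketch below reads nested fields and turns Python data back into a JSON string; the payload is made up for illustration.

```python
import json

# Made-up payload, similar in shape to what a JSON API might return.
raw = '{"id": "2342424324", "comments": [{"user": "a", "score": 5}, {"user": "b", "score": 3}]}'

data = json.loads(raw)  # JSON text -> Python dict

# Nested lists and objects become Python lists and dicts.
scores = [c["score"] for c in data["comments"]]

# ensure_ascii=False keeps non-ASCII text (such as Chinese) readable.
summary = {"id": data["id"], "avg": sum(scores) / len(scores)}
text = json.dumps(summary, ensure_ascii=False)

print(scores)
print(text)
```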

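6. Building a distributed crawler:

How Scrapy supports distributed crawling:

Running Scrapy in a distributed way needs a few tools first: 1) scrapy; 2) scrapy-redis; 3) redis.

Install these three tools and then work through each of them.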
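
The idea behind combining these tools is that several crawler processes share one request queue and one record of URLs already seen, both kept in Redis. The sketch below imitates that idea with an in-process deque and set. It does not use scrapy-redis or Redis at all, and the names SharedScheduler, enqueue and next_url are hypothetical.

```python
import hashlib
from collections import deque


class SharedScheduler:
    """Toy stand-in for a Redis-backed queue plus duplicate filter."""

    def __init__(self):
        self.queue = deque()  # in a real setup this would live in Redis
        self.seen = set()     # fingerprints of URLs already queued

    def enqueue(self, url):
        # Fingerprint the URL so duplicates are cheap to detect.
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_url(self):
        # Hand out the oldest pending URL, or None when the queue is empty.
        return self.queue.popleft() if self.queue else None


scheduler = SharedScheduler()
for url in ["http://example.com/1", "http://example.com/2", "http://example.com/1"]:
    scheduler.enqueue(url)

# Two "workers" take turns pulling from the same queue.
workers = ["worker-1", "worker-2"]
handled = {name: [] for name in workers}
turn = 0
while (url := scheduler.next_url()) is not None:
    handled[workers[turn % 2]].append(url)
    turn += 1
print(handled)
```

Because the queue and the seen set are shared, no URL is fetched twice no matter which worker picks it up.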
