1. Scrapy Crawlers
1. Installing the Scrapy framework:
(1) What is Scrapy: Scrapy is a Python web-crawling framework
(2) Low-pitfall installation: pip install scrapy (on Windows, if the build step fails, install a prebuilt Twisted wheel first and rerun pip)
2. Common Scrapy commands in practice:
Global commands (scrapy -h): fetch (download a URL with the Scrapy downloader); runspider (run a self-contained spider); also shell, startproject, genspider, version, view...
Project commands (only available inside a project): crawl, check, list, edit, parse...
3. Scrapy spiders:
First Scrapy spider: crawling Qiushibaike as an example
scrapy startproject name (create a new project)
scrapy crawl name (run a spider)
4. Automatic crawling with CrawlSpider in practice:
(1) Qiushibaike automatic crawler (CrawlSpider):
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from qsauto.items import QsautoItem

class Qiushi1Spider(CrawlSpider):
    name = 'qiushi1'
    allowed_domains = ['qiushibaike.com']
    '''
    start_urls = ['http://qiushibaike.com/']
    '''
    rules = (
        # 'acticle' was a typo -- the article links contain 'article'
        Rule(LinkExtractor(allow=r'article'), callback='parse_item', follow=True),
    )

    # Must be start_requests (plural); a method named start_request is never called by Scrapy
    def start_requests(self):
        ua = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 SE 2.X MetaSr 1.0'}
        yield Request('http://www.qiushibaike.com/', headers=ua)

    def parse_item(self, response):
        i = QsautoItem()  # instantiate the item class, don't assign the class itself
        i["content"] = response.xpath("//div[@class='content']/span/text()").extract()
        i["link"] = response.xpath("//a[@class='contentHerf']/@href").extract()  # '/herf' was a typo for '/@href'
        print(i["content"])
        print(i["link"])
        return i
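The two XPath expressions in parse_item can be tried outside Scrapy. A minimal sketch using the standard library's xml.etree.ElementTree (which only supports a small XPath subset; Scrapy itself uses the richer parsel/lxml engine) on a made-up fragment shaped like the markup the spider expects:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the structure the spider's selectors target
html = """
<root>
  <div class="content"><span>First joke text</span></div>
  <a class="contentHerf" href="/article/123">link</a>
</root>
"""

tree = ET.fromstring(html)

# Rough equivalent of //div[@class='content']/span/text()
contents = [span.text for span in tree.findall(".//div[@class='content']/span")]

# Rough equivalent of //a[@class='contentHerf']/@href
links = [a.get("href") for a in tree.findall(".//a[@class='contentHerf']")]

print(contents)  # ['First joke text']
print(links)     # ['/article/123']
```

Note that 'contentHerf' really is the class name used on the site (the misspelling is theirs), so it must be matched verbatim.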
2. Simulated-login crawler in practice
(1) Simulated-login crawler (Douban):
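Inside Scrapy this is usually done by posting the login form with scrapy.FormRequest. As a framework-free sketch of the same idea using only the standard library -- the endpoint URL and the field names form_email / form_password are placeholders, the real Douban form must be inspected in the browser:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Placeholder form fields -- read the real names from the login page's HTML
form = {
    "form_email": "user@example.com",
    "form_password": "secret",
}

# A cookie-aware opener so the session cookie survives across requests
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

data = urllib.parse.urlencode(form).encode("utf-8")
req = urllib.request.Request(
    "https://www.douban.com/login",         # assumed endpoint
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},  # many sites reject the default UA
)
# opener.open(req) would submit the form and store the session cookie in jar;
# it is not called here so the sketch stays network-free.
print(req.get_method())  # POST, because data is set
```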
3. Dangdang crawler in practice
(1) Dangdang store crawler (how to write the scraped content into a database):
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request

class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        item = DangdangItem()
        # 'tital' / '@tital' were typos for 'title' / '@title'
        item["title"] = response.xpath("//a[@class='pic']/@title").extract()
        item["link"] = response.xpath("//a[@class='pic']/@href").extract()
        item["comment"] = response.xpath("//a[@name='_1_p']/text()").extract()
        yield item
        # The original range() was left empty; pages 1-80 here, adjust as needed
        for i in range(1, 81):
            url = "http://category.dangdang.com/pg" + str(i) + "-cp01.54.06.00.00.00.html"
            yield Request(url, callback=self.parse)
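The pagination loop just substitutes the page number into Dangdang's category-URL pattern; the construction can be checked on its own, without running the spider:

```python
# Build the first few category-page URLs the spider would enqueue
base = "http://category.dangdang.com/pg{}-cp01.54.06.00.00.00.html"
urls = [base.format(i) for i in range(1, 4)]
for u in urls:
    print(u)
# first line: http://category.dangdang.com/pg1-cp01.54.06.00.00.00.html
```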
pipelines:
class DangdangPipeline:
    def process_item(self, item, spider):
        # The item fields are parallel lists; index them together
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            print(title)
            print(link)
            print(comment)
        return item
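The heading promises writing the scraped content into a database, but the pipeline above only prints it. A minimal sketch of a SQLite-backed variant using only the standard library -- the database file, table, and column names are made up; Scrapy calls open_spider/close_spider automatically once the pipeline is enabled:

```python
import sqlite3

class DangdangSqlitePipeline:
    """Stores each item's parallel title/link/comment lists as table rows."""

    def open_spider(self, spider):
        # Assumed file name; any path works
        self.conn = sqlite3.connect("dangdang.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS goods "
            "(title TEXT, link TEXT, comment TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # zip lines up the three parallel lists row by row
        rows = zip(item["title"], item["link"], item["comment"])
        self.conn.executemany(
            "INSERT INTO goods (title, link, comment) VALUES (?, ?, ?)", rows
        )
        return item
```

To activate it, register the class in the project's settings.py, e.g. ITEM_PIPELINES = {'dangdang.pipelines.DangdangSqlitePipeline': 300}.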
items:
import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()