Python 3 Scrapy Usage Guide 02

六·柒

Article link: https://blog.csdn.net/qq_43000917/article/details/100164955

Example

import scrapy
from myspider.items import MyspiderItem


class ItcastSpider(scrapy.Spider):
    # Spider name; needed when running the spider, and must be unique
    name = 'six_seven'
    # Domains the spider may crawl; keeps it from wandering onto other sites
    allowed_domains = ['six_seven.cn']
    # List of starting URLs; the spider begins crawling from these
    start_urls = ['http://www.six_seven.cn/channel/teacher.shtml']

    def parse(self, response):
        # Data can be extracted directly from response with its xpath method
        # names = response.xpath('//div[@class="li_txt"]/h3/text()')
        # print(names)

        # Group first: get the list of divs that contain teacher information
        divs = response.xpath('//div[@class="li_txt"]')
        # Iterate over the divs; each one holds one lecturer's information
        for div in divs:
            # Create a MyspiderItem object
            item = MyspiderItem()
            item['name'] = div.xpath('./h3/text()').extract_first()
            item['title'] = div.xpath('./h4/text()').extract_first()
            item['desc'] = div.xpath('./p/text()').extract_first()
            # Print the extracted data
            # print(item)
            # Hand the extracted data to the engine
            yield item

Notes

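Question: why use yield?
What is the benefit of turning the whole function into a generator?
When the function's return value is iterated over, items are read into memory one at a time, so memory usage does not spike.
The same idea applies to range in Python 3, xrange in Python 2, and cursor objects.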
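
The memory argument can be illustrated with a small standalone sketch that does not involve Scrapy. The function names squares_list and squares_gen are hypothetical and exist only for this comparison.

```python
import sys


def squares_list(n):
    """Compute every result up front and return them all in one list."""
    return [i * i for i in range(n)]


def squares_gen(n):
    """Yield results one at a time; nothing is computed until iteration."""
    for i in range(n):
        yield i * i


eager = squares_list(100_000)
lazy = squares_gen(100_000)

# The list stores references to all 100,000 results at once, while the
# generator object only stores its current execution state.
print(sys.getsizeof(eager), sys.getsizeof(lazy))

# Consuming either one yields the same values.
print(sum(eager) == sum(lazy))
```

A parse method that yields items works the same way: the engine pulls one item at a time instead of waiting for a complete list.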
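Note:
response.xpath returns a list-like object containing Selector objects. It can be used like a list, but it also has some extra methods:
extract() returns a list of strings
extract_first() returns the first string in the list, or None if the list is empty
The spider must define a parse method; it is the default callback for the responses to start_urls
URLs to be crawled must belong to allowed_domains, but the URLs in start_urls are not subject to this restriction
Pay attention to where the spider is started: it must be launched from the project directory
Use yield to hand the extracted data to the engine for processing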
