Scrapy spider (Qiushibaike): Step 1
-
Preparation
-
Fields to scrape
-
name, age, content
-
Create the spider project on drive H:
# create the project (project name: qiu_bai)
scrapy startproject qiu_bai
- This automatically generates the following directories and files.
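The generated layout is the standard Scrapy project skeleton (exact file names may vary slightly between Scrapy versions):

```
qiu_bai/
├── scrapy.cfg          # deploy configuration
└── qiu_bai/
    ├── __init__.py
    ├── items.py        # item definitions (edited in Step 3)
    ├── middlewares.py
    ├── pipelines.py    # item pipelines (edited in Step 5)
    ├── settings.py     # project settings (edited in Step 6)
    └── spiders/
        └── __init__.py
```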
Step 2: switch to the spiders directory
# generate the spider file
scrapy genspider qiubai www.qiushibaike.com
- This creates a qiubai.py file in the spiders directory.
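For reference, the generated qiubai.py starts out roughly like the skeleton below (the exact template depends on your Scrapy version); Step 4 replaces it:

```python
# -*- coding: utf-8 -*-
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        pass
```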
Step 3: open items.py
- Paste in the following code:
import scrapy


# fields to scrape
class QiuBaiItem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    content = scrapy.Field()
Step 4: open qiubai.py
- Replace its contents with the following code:
# -*- coding: utf-8 -*-
import re

import scrapy

from qiu_bai.items import QiuBaiItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/8hr/page/1/']

    def parse(self, response):
        for each in response.xpath('//div[@id="content-left"]/div'):
            item = QiuBaiItem()
            try:
                # extract_first() returns None for anonymous posts,
                # so .strip() raises AttributeError
                name = each.xpath('div/a[2]/h2/text()').extract_first().strip('\n')
            except AttributeError:
                name = '匿名用户'  # "anonymous user"
            age = each.xpath('div[1]/div/text()').extract_first()
            if age is None:  # extract_first() returns None instead of raising
                age = '没有年龄'  # "no age"
            content = each.xpath('a[1]/div/span/text()').extract_first(default='').strip('\n')
            item['name'] = name
            item['age'] = age
            item['content'] = content
            yield item
        # pagination: follow pages 2 through 13
        s = response.url
        now_page = int(re.search(r'(\d+)/$', s).group(1))
        if now_page < 13:
            # keep the trailing slash so the pattern still matches on the next page
            url = re.sub(r'(\d+)/$', str(now_page + 1) + '/', s)
            print("this is next page url:", url)
            print('*' * 100)
            yield scrapy.Request(url, callback=self.parse)
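The pagination step at the end of parse() can be sketched in isolation without Scrapy. This is a minimal stand-alone version of the regex logic; the `next_page_url` helper is introduced here for illustration only:

```python
import re

MAX_PAGE = 13  # the spider stops after page 13


def next_page_url(url):
    """Return the URL of the next listing page, or None on the last page."""
    now_page = int(re.search(r'(\d+)/$', url).group(1))
    if now_page >= MAX_PAGE:
        return None
    # Keep the trailing slash so the pattern still matches on the next call.
    return re.sub(r'(\d+)/$', str(now_page + 1) + '/', url)


print(next_page_url('https://www.qiushibaike.com/8hr/page/1/'))
# https://www.qiushibaike.com/8hr/page/2/
```

Note that without the appended `'/'` the substitution would produce `.../page/2` (no trailing slash), and `re.search(r'(\d+)/$', ...)` would fail to match on the following page.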
Step 5: open pipelines.py (saves the scraped data to a JSON file)
import json


class QiuBaiPipeline(object):
    def __init__(self):
        self.file = open('qiubai.json', 'wb')

    def process_item(self, item, spider):
        # one JSON object per line; keep Chinese characters readable
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content.encode('utf8'))
        return item

    def close_spider(self, spider):
        self.file.close()
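What process_item writes can be reproduced without Scrapy. This sketch uses a hypothetical `items` list standing in for the items the pipeline receives, and produces the same JSON-lines format in qiubai.json:

```python
import json

# hypothetical items, standing in for the dicts the pipeline receives
items = [
    {'name': '匿名用户', 'age': '没有年龄', 'content': 'first joke text'},
    {'name': 'some_user', 'age': '18', 'content': 'second joke text'},
]

with open('qiubai.json', 'wb') as f:
    for item in items:
        line = json.dumps(item, ensure_ascii=False) + '\n'
        f.write(line.encode('utf8'))  # one JSON object per line
```

Each line of the resulting file is an independent JSON object, so the file can be consumed line by line with `json.loads`.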
Step 6: open settings.py
- Make sure it contains the following settings:
BOT_NAME = 'qiu_bai'
SPIDER_MODULES = ['qiu_bai.spiders']
NEWSPIDER_MODULE = 'qiu_bai.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'qiu_bai.pipelines.QiuBaiPipeline': 300,
}
Step 7: run the spider
- Go into the spiders directory
# run in a terminal
scrapy crawl qiubai
- Enter the command and press Enter to start crawling.