Web Scraping with Scrapy (Python)

Latest recommended article published 2024-10-08 12:37:10

AI in Bio

Category: python. Tags: python, scrapy, web crawler

Article link: https://blog.csdn.net/liuwei6843/article/details/141063360


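1. Create a Scrapy project
First locate the directory where Scrapy is installed: open a cmd prompt, use cd to change into venv\Scripts\ under that directory, then run the command: scrapy startproject test1. This creates a new project named test1.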

2. In the test1 project, create a file begin.py at the same directory level as scrapy.cfg, with the following content:

from scrapy import cmdline

# "bupt" is the spider's name, defined in spider.py
cmdline.execute("scrapy crawl bupt".split())
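
As a small illustration of what begin.py hands to cmdline.execute: the call receives an argument list, the same shape as sys.argv. The sketch below uses only the standard library and does not start a crawl. The output file name "my data.json" is a hypothetical value chosen to show where a plain str.split() stops being enough; shlex.split is offered here as an alternative, not something the original begin.py uses.

```python
import shlex

# The command used in begin.py: a plain split is enough when no argument contains spaces.
simple = "scrapy crawl bupt".split()

# Hypothetical variant with a quoted output path (-o is the crawl command's output option).
command = 'scrapy crawl bupt -o "my data.json"'
naive = command.split()        # breaks the quoted path into two pieces
argv = shlex.split(command)    # honours shell-style quoting

print(simple)
print(naive)
print(argv)
```

If you ever pass arguments containing spaces, building the list with shlex.split (or writing the list by hand) avoids the broken path shown by the naive split.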

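3. Open the test1 project in PyCharm
Files that need to be modified: items.py (item definitions), pipelines.py (item pipelines), settings.py (project settings)
Files that do not need to be modified: __init__.py (package initializer), middlewares.py (middlewares), scrapy.cfg (project configuration file)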

4. Edit items.py

import scrapy


class MyItem(scrapy.Item):
    # define the fields for your item here like:
    school = scrapy.Field()  # school name
    link = scrapy.Field()  # link to the school's page

5. Create a new file spider.py in the spiders folder

import scrapy
from test1.items import MyItem  # import the MyItem class defined in items.py


class mySpider(scrapy.spiders.Spider):
    name = "bupt"  # the spider's name is bupt
    allowed_domains = ["bupt.edu.cn"]  # domains the spider may crawl (domain only, no trailing slash)
    start_urls = ["https://www.bupt.edu.cn/yxjg1.htm"]  # initial URL, the first page the spider requests

    def parse(self, response):  # parse the downloaded page
        # Use XPath to walk the HTML; the data under this div is what we need
        for each in response.xpath('/html/body/div/div[2]/div[2]/div/ul/li[4]/div/ul/*'):
            item = MyItem()  # a fresh MyItem (defined in items.py) per entry, to hold the scraped data
            item['school'] = each.xpath("a/text()").extract()  # the school name is in the link text
            item['link'] = each.xpath("a/@href").extract()  # the school link is in href
            if item['school'] and item['link']:  # skip entries with empty values
                yield item  # pass the item on to the pipelines
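
To make the two relative lookups in parse easier to picture, here is a minimal standalone sketch. It is a stand-in, not Scrapy's selector engine: it uses the standard library's xml.etree.ElementTree on a small, well-formed fragment. The fragment, the school names, and the hrefs are all hypothetical; only the start URL comes from the spider above. It also resolves relative hrefs against the page URL with urljoin, which the spider itself does not do.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical fragment shaped like the list the spider iterates over.
SAMPLE = """
<ul>
  <li><a href="xy/a.htm">School A</a></li>
  <li><a href="xy/b.htm">School B</a></li>
  <li></li>
</ul>
"""

BASE_URL = "https://www.bupt.edu.cn/yxjg1.htm"  # start URL from the spider


def extract_rows(markup, base_url):
    """Return one dict per child that has both link text and an href."""
    rows = []
    root = ET.fromstring(markup.strip())
    for child in root:              # plays the role of the trailing /* in the XPath
        a = child.find("a")         # relative lookup, like "a/..." in the spider
        if a is None or not a.text or not a.get("href"):
            continue                # skip empty entries, as the spider's if-check does
        rows.append({
            "school": a.text.strip(),
            "link": urljoin(base_url, a.get("href")),  # make the href absolute
        })
    return rows


rows = extract_rows(SAMPLE, BASE_URL)
for row in rows:
    print(row)
```

In the spider, extract() returns a list of strings per field, whereas this sketch stores single strings; which shape is more convenient depends on how the data is consumed later.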

6. Edit pipelines.py

import json


class MyPipeline(object):
    def open_spider(self, spider):
        try:  # open the output file when the spider starts
            self.file = open('MyData.json', "w", encoding="utf-8")
        except Exception as err:
            print(err)
            raise  # without the file, process_item cannot work, so do not continue silently

    def process_item(self, item, spider):
        dict_item = dict(item)  # convert the item to a dict
        json_str = json.dumps(dict_item, ensure_ascii=False) + "\n"  # serialize to one JSON line
        self.file.write(json_str)  # write the JSON line to the file
        return item

    def close_spider(self, spider):
        self.file.close()  # close the file when the spider finishes
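
Because the pipeline writes one JSON object per line, MyData.json is a JSON Lines file rather than a single JSON document, so json.load on the whole file would fail once there is more than one item. The following sketch shows the same write pattern and a matching line-by-line reader using only the standard library. The helper names write_items and read_items and the sample record are hypothetical, and the file is written to a temporary directory so nothing in the project is touched.

```python
import json
import os
import tempfile


def write_items(path, items):
    """Write one JSON object per line, mirroring what the pipeline does per item."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_items(path):
    """Read the file back, parsing each non-empty line separately."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record in the same shape the spider yields (lists from extract()).
items = [{"school": ["计算机学院"], "link": ["xy/jsj.htm"]}]

path = os.path.join(tempfile.mkdtemp(), "MyData.json")
write_items(path, items)
loaded = read_items(path)
print(loaded)
```

With ensure_ascii=False the Chinese text is stored as readable characters instead of \u escapes, which is why the file is opened with encoding="utf-8".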

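7. Edit settings.py

ITEM_PIPELINES = {'test1.pipelines.MyPipeline': 300}

The integer assigned to each pipeline class determines the order in which the pipelines run: each item passes through them from the lowest number to the highest.
By convention these numbers are chosen in the 0-1000 range.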
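
As a rough illustration of the ordering rule, the sketch below sorts a settings-style dictionary by its integer values. It only mimics the rule with plain Python and is not Scrapy's own loading code. CleanPipeline and ExportPipeline are hypothetical class paths added to show more than one entry; only MyPipeline exists in this project.

```python
# Settings-style mapping of pipeline class path to priority.
ITEM_PIPELINES = {
    "test1.pipelines.MyPipeline": 300,
    "test1.pipelines.CleanPipeline": 100,    # hypothetical: would run first
    "test1.pipelines.ExportPipeline": 800,   # hypothetical: would run last
}

# Lower numbers run earlier, so sort the class paths by their assigned value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for position, path in enumerate(run_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```

Leaving gaps between the numbers (100, 300, 800 rather than 1, 2, 3) makes it easy to slot another pipeline in between later without renumbering.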

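8. Run spider.py, with the Script path of its PyCharm run configuration changed to begin.py, so that the run actually launches begin.py and starts the crawl.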
