Contents and Requirements
Master the basic concepts of the Scrapy crawler framework and become familiar with its basic usage.
(1) Deploy the Scrapy crawler framework on your own computer.
(2) Use the Scrapy crawler framework to perform a simple data crawl.
(3) Describe the experimental steps and methods, and submit a lab report.
- Experimental Methods and Steps
1. Install the Scrapy library: it can be installed into an Anaconda environment (for example, with conda install scrapy or pip install scrapy).
2. Create a Scrapy project with the command scrapy startproject NewSmallexpression.
3. Open D:\Python Workspace\Test01\HTTP\NewSmallexpression; the project structure as shown in PyCharm is as follows.
4. Modify items.py and the pipeline/configuration files.
Make the following changes in items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class NewsmallexpressionItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
    time = scrapy.Field()
    view_count = scrapy.Field()
Make the following changes in pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pandas as pd
from sqlalchemy import create_engine

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class NewsmallexpressionPipeline:
    def __init__(self):
        # Connect to the local MySQL database tipdm via SQLAlchemy + pymysql.
        self.engine = create_engine('mysql+pymysql://root:jian@127.0.0.1:3306/tipdm')

    def process_item(self, item, spider):
        # Wrap the item dict in a list so pandas builds a one-row DataFrame.
        data = pd.DataFrame([dict(item)])
        data.to_sql('tipdm_data', self.engine, if_exists='append', index=False)
        data.to_csv('TipDM_data.csv', mode='a+', index=False, sep='|', header=False)
        return item
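As the generated comment notes, the pipeline only runs if it is registered in the project's settings.py. A minimal fragment (300 is an arbitrary priority between 0 and 1000; lower numbers run first):

```python
# settings.py (fragment) — register the pipeline so process_item is called
# for every item the spider yields.
ITEM_PIPELINES = {
    "NewSmallexpression.pipelines.NewsmallexpressionPipeline": 300,
}
```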
Create the spider script template (for example, with scrapy genspider tipdm www.tipdm.com).
Modify the tipdm.py file:
import scrapy


class TipdmSpider(scrapy.Spider):
    name = "tipdm"
    allowed_domains = ["www.tipdm.com"]
    start_urls = ["https://www.tipdm.com"]

    def parse(self, response):
        pass
Run the crawl with the command scrapy crawl tipdm and inspect the results.