I'm just starting to learn web scraping; the following is a test case.
**Environment:
(1. PyCharm 2019.3
(2. Python 3.8 (hit a pip upgrade problem when moving toward 3.9 — version upgrade)
[
# Open a command prompt (Win+R, or search for "cmd")
The command that solved it for me:
python -m pip install --upgrade pip -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
When some module refuses to install, you can try:
pip install <module-name> -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
]
(3. Database: MySQL 8.0 (Workbench 8.0)
[
# In cmd:
Ran into a time-zone offset problem when connecting to the database —
mysql -u <username> -p<password>
show databases;
set global time_zone='+8:00';
]
(4. Crawl URL: http://www.ceic.ac.cn/speedsearch
[
The China Earthquake Networks Center site loads its pages via AJAX. Since this is only a test case, I just scrape the default page's data; handling the other pages would require inspecting the actual requests case by case.
]
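As a rough illustration of what "handling the other pages" might look like: once the real AJAX endpoint is found in the browser's network tab, follow-up page URLs could be generated and fed to the spider. The `?page=` query parameter below is a hypothetical placeholder, not the site's verified API.

```python
# Sketch only: candidate URLs for an AJAX-paginated list.
# "?page=" is an assumed parameter name, not confirmed for this site.
BASE_URL = "http://www.ceic.ac.cn/speedsearch"

def page_urls(num_pages):
    """Build URLs for pages 1..num_pages of the (assumed) paginated endpoint."""
    return ["%s?page=%d" % (BASE_URL, i) for i in range(1, num_pages + 1)]

print(page_urls(2))
```

In a Scrapy spider these would typically be yielded as `scrapy.Request` objects from `parse()`, with the same callback handling each page.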
(Database-connection code adapted from this blog post: https://blog.csdn.net/just_so_so_fnc/article/details/72995731)
**Scrapy project overview: (the naming is not rigorous) + database creation
Scrapy project files:
Database creation:
Commands:
create database earthdb;
create table `earthdata`(
`id` int(10) NOT NULL AUTO_INCREMENT,
`level` varchar(100) DEFAULT NULL,
`time` varchar(100) DEFAULT NULL,
`latitude` varchar(100) DEFAULT NULL,
`longitude` varchar(100) DEFAULT NULL,
`depth` varchar(100) DEFAULT NULL,
`address` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
)ENGINE=InnoDB AUTO_INCREMENT=1181 DEFAULT CHARSET=utf8;
select count(*) from earthdata;
**Code:
Spider file:
import scrapy
from ScrapyPro2.items import Scrapypro2Item


class EarthdataSpider(scrapy.Spider):
    name = 'earthData'
    # allowed_domains takes bare domain names, not URLs with paths
    allowed_domains = ['ceic.ac.cn']
    start_urls = ['http://www.ceic.ac.cn/speedsearch']

    def parse(self, response):
        # skip the header row of the results table
        dataList = response.xpath("//table[@class='speed-table1']/tr")[1:]
        for data in dataList:
            # build a fresh item per row; reusing one item object across
            # yields would let later rows overwrite earlier ones
            item = Scrapypro2Item()
            # extract_first() gives a single string (or None) instead of a list
            item["level"] = data.xpath("./td[1]/text()").extract_first()
            item["time"] = data.xpath("./td[2]/text()").extract_first()
            item["latitude"] = data.xpath("./td[3]/text()").extract_first()
            item["longitude"] = data.xpath("./td[4]/text()").extract_first()
            item["depth"] = data.xpath("./td[5]/text()").extract_first()
            item["address"] = data.xpath("./td[6]/a/text()").extract_first()
            yield item
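Scraped text often carries stray whitespace, and depending on whether `.extract()` (list of strings) or `.extract_first()` (string or None) is used, the value's shape differs. A small helper (my own addition, not from the original post) can normalize either form into one clean string before it reaches the database:

```python
def clean_text(value):
    """Normalize a scraped value to a single stripped string, or None.

    Accepts either a list of strings (Selector.extract()) or a bare
    string / None (Selector.extract_first()).
    """
    if isinstance(value, list):
        value = value[0] if value else None
    return value.strip() if isinstance(value, str) else None

print(clean_text(["  5.2\n"]))  # -> 5.2
print(clean_text(None))         # -> None
```

In the spider, each `item["..."] = ...` assignment could be wrapped with this helper so the database never stores list reprs or trailing newlines.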
items file:
import scrapy


class Scrapypro2Item(scrapy.Item):
    # data model: one field per column to scrape
    # define the fields for your item here like:
    level = scrapy.Field()
    time = scrapy.Field()
    latitude = scrapy.Field()
    longitude = scrapy.Field()
    depth = scrapy.Field()
    address = scrapy.Field()
pipelines file:
import pymysql


class Scrapypro2Pipeline(object):
    def open_spider(self, spider):
        # open one connection for the whole crawl instead of once per item
        self.conn = pymysql.Connect(
            host='localhost',
            port=3306,
            user='root',
            passwd='root',
            db='earthdb',
            charset='utf8'
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        """
        # After the crawl you can comment out the insert below and run this
        # query instead to check the results:
        sql = "SELECT * FROM earthdata"
        self.cursor.execute(sql)
        print(self.cursor.rowcount)
        for row in self.cursor.fetchall():
            print("Id:=%s, level:=%s, time:=%s, latitude:=%s, "
                  "longitude:=%s, depth:=%s, address:=%s" % row)
        """
        sql_insert = ("INSERT INTO earthdata(level,time,latitude,longitude,depth,address) "
                      "VALUES(%s,%s,%s,%s,%s,%s)")
        self.cursor.execute(sql_insert, (item['level'], item['time'], item['latitude'],
                                         item['longitude'], item['depth'], item['address']))
        # commit the transaction, otherwise the database is never updated
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # close the cursor before the connection (the original order was reversed)
        self.cursor.close()
        self.conn.close()
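The pipeline's core pattern (parameterized INSERT followed by an explicit commit) can be tried locally without a MySQL server. The sketch below swaps pymysql for the stdlib sqlite3 module purely for illustration; the table mirrors the earthdb schema above, with sqlite's `?` placeholder standing in for pymysql's `%s`.

```python
# Runnable stand-in for the pipeline's insert/commit flow, using sqlite3
# instead of pymysql + MySQL so it needs no running database server.
import sqlite3

SCHEMA = """CREATE TABLE earthdata (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    level TEXT, time TEXT, latitude TEXT,
    longitude TEXT, depth TEXT, address TEXT)"""

def insert_item(conn, item):
    """Mirror of process_item: parameterized insert, then commit."""
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO earthdata(level,time,latitude,longitude,depth,address) "
        "VALUES(?,?,?,?,?,?)",
        (item["level"], item["time"], item["latitude"],
         item["longitude"], item["depth"], item["address"]))
    conn.commit()  # without commit, the database never sees the row
    cursor.close()
    return item

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
insert_item(conn, {"level": "5.2", "time": "2020-01-01 12:00",
                   "latitude": "30.1", "longitude": "103.2",
                   "depth": "10", "address": "Sichuan"})
print(conn.execute("SELECT COUNT(*) FROM earthdata").fetchone()[0])  # -> 1
```

Using placeholders rather than string formatting is what keeps the insert safe against quoting problems in scraped text; the same applies to the pymysql version above.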
**Results:
(I had already closed the run's output before taking a screenshot, so no console screenshot here; below is the corresponding database state after execution.)
A small disclaimer:
I'm a beginner still learning and borrowed from many bloggers' code — ORZ, corrections welcome.