Since this is a government website there are quite a few restrictions: a User-Agent header has to be added first, and then the data is pulled out of the page with a regular expression. It had been a long time since I last used regex, so I asked my teacher for help with that part.
import requests
import pandas as pd
import re
import json

num_mag = []     # magnitude
orig_time = []   # origin time
latitudes = []
longitudes = []
depth = []
epicenter = []

for page in range(47, 49):
    # '¤tPage' in the page source is the HTML-entity mangling of '&currentPage'
    start_url = 'https://www.cea.gov.cn/eportal/ui?pageId=366509&currentPage={}'.format(page)
    print('Saving page {0}'.format(page))
    header = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    response = requests.get(start_url, headers=header).content.decode()
    pattern = re.compile(r'\[.*?\]', re.S)   # grab the embedded JSON array
    pic_list = re.findall(pattern, response)
    diss_dict = json.loads(pic_list[0])
    for quake in diss_dict:                  # renamed so it no longer shadows the page counter
        num_mag.append(quake['num_mag'])         # magnitude (震级)
        orig_time.append(quake['orig_time'])     # origin time (发震时刻)
        latitudes.append(quake['latitudes'])     # latitude (纬度)
        longitudes.append(quake['longitudes'])   # longitude (经度)
        # .get() appends None when depth is absent, so the lists stay the same length
        depth.append(quake.get('depth'))
        epicenter.append(quake['epicenter'])     # reference location (参考位置)

data = {
    '震级': num_mag,
    '发震时刻': orig_time,
    '纬度': latitudes,
    '经度': longitudes,
    '深度': depth,
    '位置': epicenter,
}
df = pd.DataFrame(data)
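The loop prints "Saving page …" but nothing is ever written to disk. One line finishes the job; a minimal sketch with a tiny stand-in DataFrame (the file name `earthquakes.csv` is my own choice):

```python
import pandas as pd

# stand-in for the scraped DataFrame
df = pd.DataFrame({'震级': ['4.2'], '发震时刻': ['2024-01-01 08:00:00']})

# utf-8-sig writes a BOM so Excel renders the Chinese headers correctly
df.to_csv('earthquakes.csv', index=False, encoding='utf-8-sig')
```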
The code still has problems: as written, not everything gets saved in the end. I'm still working on a fix — if any expert sees this article, I'd appreciate some help.
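The save failure comes from appending into six parallel lists: whenever one field is missing, that list ends up shorter and `pd.DataFrame(data)` raises a length-mismatch error. A more robust pattern is to build one dict per earthquake with `.get()` defaults and hand the list of dicts to pandas. A sketch, with the field names assumed from the site's JSON and a simulated payload in place of the live request:

```python
import json
import pandas as pd

# field names assumed from the site's JSON payload
FIELDS = ['num_mag', 'orig_time', 'latitudes', 'longitudes', 'depth', 'epicenter']

def records_from_json(raw):
    """Turn the scraped JSON array into one dict per earthquake.
    .get() returns None for any missing field, so every record has
    the same keys and the DataFrame columns always line up."""
    return [{field: item.get(field) for field in FIELDS}
            for item in json.loads(raw)]

# simulated payload: the second record is missing 'depth'
sample = json.dumps([
    {'num_mag': '4.2', 'orig_time': '2024-01-01 08:00:00',
     'latitudes': '30.10', 'longitudes': '101.20',
     'depth': '10', 'epicenter': 'Sichuan'},
    {'num_mag': '3.1', 'orig_time': '2024-01-02 09:30:00',
     'latitudes': '39.90', 'longitudes': '116.40',
     'epicenter': 'Beijing'},
])

df = pd.DataFrame(records_from_json(sample))
print(df)
```

With this shape a missing depth simply shows up as an empty cell instead of breaking the whole DataFrame, and appending each page's records to one list keeps every page aligned.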