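The following is my own understanding. If you spot any mistakes, please leave a comment and I will fix them as soon as possible. Thanks!
The data scraping process
Data sources: http://www.8pu.com/gdp/ranking_2020.html, http://www.8pu.com/gdp/ranking_2019.html, …
Inspecting the page source shows that the data we want is not rendered afterwards by JavaScript, so fetching the page source is enough to get it.
1. Scraping a single page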
import requests
url = 'http://www.8pu.com/gdp/ranking_2020.html'
resp = requests.get(url= url)
resp.encoding = 'utf-8'
print(resp.text)
Part of the scraped output:
The output does contain the data we want.
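A quick way to back up this observation is to check whether values you can see in the browser also appear in the raw response text. The sketch below is only an illustration: the helper name appears_in_source and the sample markup are made up for this example and are not part of the original script.

```python
def appears_in_source(html_text, expected_values):
    """Map each expected value to whether it occurs in the raw HTML text."""
    return {value: value in html_text for value in expected_values}


# Hypothetical stand-in for resp.text; the real page is much larger.
sample_source = '<table><tr><td>1</td><td>Country A</td></tr></table>'
print(appears_in_source(sample_source, ['Country A', 'Country Z']))
```

With the real response you would pass resp.text and a few values copied from the rendered page. If they are all found, the data is probably in the static source and no browser automation is needed.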
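2. Extracting data from a single page
The parsing method used here is XPath. Import the parsing module with: from lxml import etree
Inspecting the source shows that the data we want sits in a <table> element whose XPath is /html/body/div[4]/div[1]/div[4]/table. The column headings are stored in <thead>, and each row of data is stored in a <tr>. For XPath syntax and rules, see: https://www.w3school.com.cn/xpath/index.asp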
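To make the <table>, <thead> and <tr> layout concrete, here is a minimal sketch that runs the same kind of XPath queries on a small inline snippet. The markup, the country names and the function name parse_sample are all invented for illustration; the real page nests its cells differently, so the paths used on it will not match these.

```python
from lxml import etree

# Hypothetical sample markup; the real page's structure differs.
SAMPLE_HTML = """
<html><body>
<table>
  <thead><tr><th>Rank</th><th>Country</th><th>GDP</th></tr></thead>
  <tbody>
    <tr><td>1</td><td><a href="#">Country A</a></td><td>100</td></tr>
    <tr><td>2</td><td><a href="#">Country B</a></td><td>50</td></tr>
  </tbody>
</table>
</body></html>
"""


def parse_sample(html):
    """Return the header texts and a list of (rank, name, gdp) tuples."""
    tree = etree.HTML(html)
    # Column headings live in <thead>.
    headers = tree.xpath('//table/thead/tr/th/text()')
    rows = []
    # Each data row is one <tr>; paths starting with ./ are relative to it.
    for tr in tree.xpath('//table/tbody/tr'):
        rank = tr.xpath('./td[1]/text()')[0]
        name = tr.xpath('./td[2]/a/text()')[0]
        gdp = tr.xpath('./td[3]/text()')[0]
        rows.append((rank, name, gdp))
    return headers, rows


headers, rows = parse_sample(SAMPLE_HTML)
print(headers)
print(rows)
```

Note that xpath() always returns a list, which is why each field is taken with [0].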
tree = etree.HTML(resp.text)  # parse the response as HTML
countries = tree.xpath('/html/body/div[4]/div[1]/div[4]/table/tbody/tr')  # a list with one element per table row
for country in countries:  # iterate over the rows
    rank = country.xpath('.//td[1]/font/font/text()')[0]
    country_name = country.xpath('./td[2]/a/font/text()')[0]
    GDP = country.xpath('./td[3]/font/text()')[0]
    total_GDP = country.xpath('./td[4]/font/text()')[0]
    area = country.xpath('./td[6]/font/font/text()')[0]  # extract each field
The result is as follows:
Because the data is not perfectly regular, some problems came up during scraping:
Handle the exception:
try:
    area = country.xpath('./td[6]/font/font/text()')[0]
except IndexError:  # the xpath matched nothing, so the result list has no first item
    area = 'NAN'
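Since xpath() returns a list, the same fallback can also be written without exceptions by checking whether the list is empty. This is only an alternative sketch, not the approach used in the script; first_or_default is a hypothetical helper name introduced here.

```python
def first_or_default(results, default='NAN'):
    """Return the first item of an XPath result list, or `default` if it is empty."""
    return results[0] if results else default


print(first_or_default(['9,600,000']))  # a row where the cell exists
print(first_or_default([]))             # a row where the xpath matched nothing
```

In the loop this would read area = first_or_default(country.xpath('./td[6]/font/font/text()')), and the same helper could guard the other columns if they turn out to be irregular too.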
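Once a single page can be scraped, wrap the logic in a function, build the url for each year, and call the function in a loop to scrape all the data.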
def get_oneyear(url, year):
    ...
for year in range(2020, 1979, -1):  # loop over the data for 1980 to 2020
    url = 'http://www.8pu.com/gdp/ranking_' + str(year) + '.html'  # build the url
    get_oneyear(url, year)  # call the function with its arguments
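The year range is easy to get wrong because range() excludes its stop value. As a small sketch, the URL construction can be pulled into its own function and checked without touching the network. BASE_URL and build_urls are names made up for this illustration, and the 1980 to 2020 span is taken from the comment in the loop.

```python
BASE_URL = 'http://www.8pu.com/gdp/ranking_{}.html'


def build_urls(first_year=1980, last_year=2020):
    """Return (year, url) pairs from last_year down to first_year, inclusive."""
    # range() stops before its second argument, hence first_year - 1.
    return [(year, BASE_URL.format(year))
            for year in range(last_year, first_year - 1, -1)]


pairs = build_urls()
print(len(pairs))   # number of yearly pages
print(pairs[0])     # most recent year
print(pairs[-1])    # earliest year
```

Each pair can then be passed on as get_oneyear(url, year).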
Complete scraper code
# Scrape each country's GDP data for every year
import os

import requests
from lxml import etree
import pandas as pd


def get_oneyear(url, year):
    resp = requests.get(url=url)
    resp.encoding = 'utf-8'
    print('----------------------------------------')  # separator line
    print('Data for {}:'.format(year))
    tree = etree.HTML(resp.text)
    countries = tree.xpath('/html/body/div[4]/div[1]/div[4]/table/tbody/tr')
    # Empty lists that collect one column each, to be passed to the DataFrame
    ranks = []
    countries_name = []
    GDPs = []
    total_GDPs = []
    areas = []
    for country in countries:
        # /html/body/div[4]/div[1]/div[4]/table/tbody/tr[1]/td[1]/font/font
        rank = country.xpath('.//td[1]/font/font/text()')[0]
        ranks.append(rank)
        # /html/body/div[4]/div[1]/div[4]/table/tbody/tr[1]/td[2]/a/font
        country_name = country.xpath('./td[2]/a/font/text()')[0]
        countries_name.append(country_name)
        # /html/body/div[4]/div[1]/div[4]/table/tbody/tr[1]/td[3]/font/text()
        GDP = country.xpath('./td[3]/font/text()')[0]
        GDPs.append(GDP)
        # /html/body/div[4]/div[1]/div[4]/table/tbody/tr[1]/td[4]/font
        total_GDP = country.xpath('./td[4]/font/text()')[0]
        total_GDPs.append(total_GDP)
        # /html/body/div[4]/div[1]/div[4]/table/tbody/tr[1]/td[6]/font/font
        try:  # some rows have no value in this cell
            area = country.xpath('./td[6]/font/font/text()')[0]
        except IndexError:
            area = 'NAN'
        areas.append(area)
    dic = {'rank': ranks, 'GDP': GDPs, 'total_GDP': total_GDPs, 'area': areas}  # one key per column
    frame = pd.DataFrame(dic, index=countries_name)  # build the table, indexed by country name
    os.makedirs('./data', exist_ok=True)  # to_csv does not create missing directories
    frame.to_csv("./data/{}_year's_GDP.csv".format(year), index=True, header=True)  # save to a csv file
    print(frame)


for year in range(2020, 1979, -1):  # loop over the data for 1980 to 2020
    url = 'http://www.8pu.com/gdp/ranking_' + str(year) + '.html'  # build the url
    get_oneyear(url, year)  # call the function with its arguments
Results: