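Answering a question from a fellow Q&A user:
Fetch the China university ranking data (ShanghaiRanking data)
Link: https://www.shanghairanking.cn/rankings/bcur/2020
- Tools:
Python, requests, XPath, lxml
Step 1: Import the packages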
import requests
from lxml import etree
Step 2: Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
    # 'br' is left out: requests only decodes Brotli when a brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
Step 3: Request the page
html = requests.get('https://www.shanghairanking.cn/rankings/bcur/2020', headers=headers)
print(html)  # print the response status
html.encoding = 'utf-8'  # decode the body as UTF-8 so the Chinese text is not garbled
Step 4: Parse the data
etree_html = etree.HTML(html.text)
table_list = etree_html.xpath('//*[@id="content-box"]/div[2]/table/tbody/tr')
for row in table_list:
    # ''.join(s.split()) drops every tab, newline and space
    rank = ''.join(row.xpath('td[1]/text()')[0].split())
    name = row.xpath('td[2]/a/text()')[0].strip()
    area = ''.join(row.xpath('td[3]/text()')[0].split())
    # school_type instead of type, to avoid shadowing the built-in
    school_type = ''.join(row.xpath('td[4]/text()')[0].split())
    score = ''.join(row.xpath('td[5]/text()')[0].split())
    print(rank, name, area, school_type, score)
Result: one line is printed per university, with its rank, name, region, type, and score.
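To try the parsing step without hitting the site, here is a minimal sketch that runs the same XPath expressions against a small inline table. The markup in SAMPLE_HTML, the university names and scores, and the helper names clean and parse_rows are all made up for illustration; the real page may be structured differently.

```python
from lxml import etree

# Hypothetical markup that mimics the structure the XPath expects.
# The rows are placeholder data, not real ranking results.
SAMPLE_HTML = """
<html><body>
<div id="content-box">
  <div>filters</div>
  <div>
    <table>
      <tbody>
        <tr>
          <td>
            1
          </td>
          <td><a href="#">Example University A</a></td>
          <td>
            North
          </td>
          <td>
            Comprehensive
          </td>
          <td>
            100.0
          </td>
        </tr>
        <tr>
          <td>
            2
          </td>
          <td><a href="#">Example University B</a></td>
          <td>
            South
          </td>
          <td>
            Engineering
          </td>
          <td>
            98.5
          </td>
        </tr>
      </tbody>
    </table>
  </div>
</div>
</body></html>
"""


def clean(text):
    """Remove every whitespace character (tabs, newlines, spaces)."""
    return "".join(text.split())


def parse_rows(html_text):
    """Return one (rank, name, area, school_type, score) tuple per table row."""
    tree = etree.HTML(html_text)
    rows = tree.xpath('//*[@id="content-box"]/div[2]/table/tbody/tr')
    result = []
    for row in rows:
        rank = clean(row.xpath('td[1]/text()')[0])
        name = row.xpath('td[2]/a/text()')[0].strip()  # keep spaces inside the name
        area = clean(row.xpath('td[3]/text()')[0])
        school_type = clean(row.xpath('td[4]/text()')[0])
        score = clean(row.xpath('td[5]/text()')[0])
        result.append((rank, name, area, school_type, score))
    return result


for record in parse_rows(SAMPLE_HTML):
    print(*record)
```

If the XPath matches nothing on the live page, table_list is simply empty and the loop prints nothing, so checking len(table_list) first is a quick sanity test.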
Complete code
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
    # 'br' is left out: requests only decodes Brotli when a brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
html = requests.get('https://www.shanghairanking.cn/rankings/bcur/2020', headers=headers)
print(html)  # print the response status
html.encoding = 'utf-8'  # decode the body as UTF-8 so the Chinese text is not garbled
etree_html = etree.HTML(html.text)
table_list = etree_html.xpath('//*[@id="content-box"]/div[2]/table/tbody/tr')
for row in table_list:
    # ''.join(s.split()) drops every tab, newline and space
    rank = ''.join(row.xpath('td[1]/text()')[0].split())
    name = row.xpath('td[2]/a/text()')[0].strip()
    area = ''.join(row.xpath('td[3]/text()')[0].split())
    # school_type instead of type, to avoid shadowing the built-in
    school_type = ''.join(row.xpath('td[4]/text()')[0].split())
    score = ''.join(row.xpath('td[5]/text()')[0].split())
    print(rank, name, area, school_type, score)
- Note: the Chinese text only displays correctly when the response body is decoded as UTF-8.
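The encoding issue can be reproduced offline. The sketch below assumes the garbling comes from UTF-8 bytes being decoded as ISO-8859-1, which is what requests falls back to when a text response declares no charset. The function names simulate_wrong_decoding and repair are made up for this illustration.

```python
def simulate_wrong_decoding(text):
    """Decode UTF-8 bytes as ISO-8859-1, which yields garbled text."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled):
    """Turn the garbled characters back into bytes, then decode them as UTF-8."""
    return garbled.encode("raw_unicode_escape").decode("utf-8")


original = "中国大学"
garbled = simulate_wrong_decoding(original)

print(repr(garbled))       # unreadable Latin-1 characters
print(repair(garbled))     # the original text again

# Applying the same repair to text that was already decoded correctly
# does damage: the characters come back as literal backslash-u escapes.
print(repair(original))
```

This is why setting the response encoding before reading the text is the sturdier choice: the encode/decode round trip only helps when the text really was decoded as ISO-8859-1.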
Hope this helps, folks. I'm off, bye-bye ~~~~