Practice 1: Scraping University Rankings and Other Data with Basic Selenium Usage

Latest recommended article published on 2024-05-14 16:45:48

zhu小白~

Views: 221

Tags: selenium, web scraping, python

Article link: https://blog.csdn.net/m0_51034309/article/details/120319251

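This article shows how to use Selenium's webdriver module to simulate a browser, log in, and scrape university ranking data that can only be viewed after logging in. It locates elements, extracts their contents, and writes them to a file, covering both data capture and storage. Planned follow-ups include extending the scraper to fetch multiple pages and learning more Python and scraping topics, such as regular expressions and data processing.

Summary generated by CSDN using AI technology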

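While learning web scraping, I followed the MOOC scraping course taught by Song Tian of Beijing Institute of Technology and studied on my own, and completed the most basic exercise: scraping data from a university ranking site that does not require login. In that course the instructor used BeautifulSoup from the bs4 library together with regular expressions to scrape the rankings. In this article I instead use webdriver from selenium to simulate a browser, open the site, and scrape the required data, then store the scraped data and read it back.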

The complete code is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--start-maximized')

# Launch the simulated browser (Selenium 4 style) and open the target page
driver = webdriver.Chrome(service=Service(r"E:\chromedriver.exe"), options=options)
driver.get('https://www.shanghairanking.cn/rankings/bcur/2021')

# Locate the data with XPath
names_tags = driver.find_elements(By.XPATH, "//a[@class='name-cn']")  # match by tag name and attribute
# Location and score cannot be found that way, so use the XPath following-sibling axis instead
locations_tags = driver.find_elements(By.XPATH, "//td[@class='align-left']/following-sibling::td[1]")
scores_tags = driver.find_elements(By.XPATH, "//td[@class='align-left']/following-sibling::td[3]")
info_list = []
for name, location, score in zip(names_tags, locations_tags, scores_tags):
    row = [tag.get_attribute('textContent').strip() for tag in (name, location, score)]
    info_list.append(row)
    print(row)  # print each scraped record
driver.quit()

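# Store the data
path = 'E:/python学习文件/py_爬虫/projects/infor_of_unives.txt'
# 'a' appends, so rerunning the script adds the rows again;
# the with block closes the file, so no explicit f.close() is needed
with open(path, 'a', encoding='utf-8') as f:
    for univer in info_list:
        f.write(str(univer) + '\n')

# Read the data back
with open(path, 'r', encoding='utf-8') as f:
    print(f.read())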
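
Because each record is written with `str(univer)`, reading the file back only gives plain text. The following is a minimal sketch, not part of the original script, of one way to turn those lines back into Python lists with `ast.literal_eval`. The helper name `parse_saved_rows`, the temporary file, the UTF-8 encoding, and the placeholder rows are all assumptions made for illustration; it also assumes every non-blank line is a valid list literal.

```python
import ast
import os
import tempfile


def parse_saved_rows(path):
    """Read a file holding one str(list) per line and return a list of lists.

    Hypothetical helper: assumes every non-blank line is a valid Python
    list literal, as produced by f.write(str(univer) + '\n').
    """
    rows = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(ast.literal_eval(line))
    return rows


# Placeholder rows standing in for scraped [name, location, score] records;
# these are made-up values, not real ranking data.
sample_rows = [
    ['University A', 'City A', '100.0'],
    ['University B', 'City B', '95.5'],
]

# Write the rows the same way the script does, but into a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'infor_of_unives.txt')
    with open(demo_path, 'a', encoding='utf-8') as f:
        for univer in sample_rows:
            f.write(str(univer) + '\n')
    loaded_rows = parse_saved_rows(demo_path)

print(loaded_rows[0][0])  # prints "University A"
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for this. If the data is meant for other tools, writing it with the standard `csv` module would likely be a simpler format to read back.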

Scraped data (stored here as lists):