I. Installing selenium, a WebDriver, and the other required packages in an Anaconda virtual environment
1. Open the Anaconda Prompt
Check whether any virtual environments already exist:
conda env list
If there is none, create one:
conda create -n env_name python=3.6
Packages can also be installed at creation time:
conda create -n env_name numpy matplotlib python=3.6
Then activate the virtual environment:
conda activate your_env_name   (your_env_name is the name of your virtual environment; on older conda versions on Windows, plain "activate your_env_name" also works)
2. Install the packages needed for this experiment
selenium
pip install selenium
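As an optional sanity check, you can confirm the install worked by running the following inside the activated environment:
python -c "import selenium; print(selenium.__version__)"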
webdriver
To drive a browser with selenium you also need a driver, and each browser's WebDriver must be installed separately.
There is a link here to download it.
After downloading, add the driver's location to PATH:
Right-click This PC and choose Properties.
From there, open the advanced system settings, find PATH among the environment variables, and click Edit.
Click New and add the directory containing the driver.
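To check that the driver is now discoverable, open a new terminal and run (this assumes you downloaded ChromeDriver; other browsers' drivers have their own executables):
chromedriver --version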
II. Automated testing of Baidu
Code
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
s = Service("D:\chromedriver.exe")# 驱动path
driver = webdriver.Chrome(service=s)
driver.get("https://www.baidu.com/")
# search = driver.find_element_by_id("kw")  # this style is deprecated in Selenium 4
search = driver.find_element(By.ID, "kw")
search.send_keys("重庆交通大学")  # type the search keyword into the box
# send_button = driver.find_element_by_id("su")
send_button = driver.find_element(By.ID, "su")
send_button.click()
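The script above assumes the elements are already present when find_element runs. A slightly more robust variant (a sketch using Selenium's explicit waits; the 10-second timeout is an arbitrary choice) would wait for the search box first:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the search box to become clickable
search = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "kw"))
)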
Run result
III. Scraping quotes
Code
# Scrape famous quotes from http://quotes.toscrape.com/js/
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
s = Service(r"D:\chromedriver.exe")  # path to the ChromeDriver executable
driver = webdriver.Chrome(service=s)
# the site hosting the quotes
driver.get("http://quotes.toscrape.com/js/")
# CSV header
csvHeaders = ['author', 'quote']
# all rows
subjects = []
# one row
subject = []
# get every element with the class "quote"
# res_list = driver.find_elements_by_class_name("quote")  # removed in newer Selenium releases; use the By API below instead
res_list = driver.find_elements(By.CLASS_NAME, "quote")
# pull the author and the quote text out of each block
for tmp in res_list:
    subject.append(tmp.find_element(By.CLASS_NAME, "author").text)
    subject.append(tmp.find_element(By.CLASS_NAME, "text").text)
    print(subject)
    subjects.append(subject)
    subject = []
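The script defines csvHeaders but never writes the collected rows out. A minimal way to persist the results (a sketch; the file name quotes.csv is an arbitrary choice) is:

import csv

with open('quotes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(csvHeaders)
    writer.writerows(subjects)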
Results
IV. Scraping JD.com books
Inspect the source of the JD.com homepage, analyze it, and locate the tags we need.
Find the id of the search box and the class of the search button.
Code
# Scrape books from https://www.jd.com/
import csv
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
s = Service(r"D:\chromedriver.exe")  # path to the ChromeDriver executable
driver = webdriver.Chrome(service=s)
driver.set_window_size(1920, 1080)
# the JD.com site
driver.get("https://www.jd.com/")
# type the search keyword into the search box
p_input = driver.find_element(By.ID, 'key')
p_input.send_keys('python编程')
time.sleep(1)
# click the search button
driver.find_element(By.CLASS_NAME, "button").click()
time.sleep(1)
all_book_info = []
num = 200  # total number of books to collect
head = ['title', 'price']
# path and name of the CSV file (the ./file directory must already exist)
path = './file/book.csv'
def write_csv(head, all_book_info, path):
    with open(path, 'w', newline='', encoding='utf-8') as file:
        fileWriter = csv.writer(file)
        fileWriter.writerow(head)
        fileWriter.writerows(all_book_info)
# scrape one page
def get_onePage_info(num):
    # scroll to the bottom so lazily loaded items are rendered
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
    # the list of books
    J_goodsList = driver.find_element(By.ID, "J_goodsList")
    listbook = J_goodsList.find_elements(By.TAG_NAME, "li")
    for res in listbook:
        num = num - 1
        book_info = []
        name = res.find_element(By.CLASS_NAME, "p-name").find_element(By.TAG_NAME, "em").text
        price = res.find_element(By.CLASS_NAME, "p-price").find_element(By.TAG_NAME, "i").text
        book_info.append(name)
        book_info.append(price)
        # optionally also capture the author and the store:
        # bookdetail = res.find_element(By.CLASS_NAME, "p-bookdetails")
        # author = bookdetail.find_element(By.CLASS_NAME, "p-bi-name").find_element(By.TAG_NAME, "a").text
        # store = bookdetail.find_element(By.CLASS_NAME, "p-bi-store").find_element(By.TAG_NAME, "a").text
        # book_info.append(author)
        # book_info.append(store)
        all_book_info.append(book_info)
        if num == 0:
            break
    return num
while num != 0:
    num = get_onePage_info(num)
    driver.find_element(By.CLASS_NAME, 'pn-next').click()  # click "next page"
    time.sleep(2)
write_csv(head, all_book_info, path)
driver.close()
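The script opens a visible browser window. If you prefer to run it unattended, Chrome's headless mode works with the same logic (a sketch; only the driver construction changes):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
options.add_argument('--window-size=1920,1080')
driver = webdriver.Chrome(service=s, options=options)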
Run result
Summary
This experiment scraped information from dynamic web pages. As with static pages, the workflow is to inspect the page structure, locate elements by id (or with the other locator functions), and then extract and store the information.