Scraping ZOJ problem content with selenium: how to pull problems from ZOJ

Article link: https://blog.csdn.net/weixin_43119449/article/details/103675200

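Introduction to selenium

selenium was originally a tool for automated testing. In web scraping it is used mainly because requests cannot execute JavaScript.

What selenium is used for

selenium can drive a browser to run whatever logic you script, so code can fully imitate a person visiting and operating the target site in a browser. That makes it usable for scraping as well.
In essence, selenium drives a real browser and reproduces user actions such as navigating, typing, clicking, and scrolling, so that it can obtain the page as it looks after rendering. It supports multiple browsers.
For details on setting up selenium in Python, see: https://blog.csdn.net/sinat_37967865/article/details/79343668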

This article shows how to scrape the details of the problems hosted on ZOJ:

Viewing the page source shows that the page is generated by JavaScript, so the scraping is automated with Python and selenium.
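One way to confirm that a page is JavaScript-rendered is to look at the HTML the server returns before any script runs and check whether the data is already there. The sketch below only illustrates that check and is not part of the original code: `TagCounter` and `count_tags` are hypothetical helpers built on the standard library's `html.parser`, and the two HTML strings are made-up samples, not real ZOJ responses.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times one tag name opens in an HTML document."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        if tag == self.tag_name:
            self.count += 1


def count_tags(html, tag_name):
    """Return the number of <tag_name> start tags found in html."""
    counter = TagCounter(tag_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Made-up samples: an empty application shell vs. a rendered table
static_shell = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered_page = '<html><body><table><tr><td>1001</td></tr><tr><td>1002</td></tr></table></body></html>'

print(count_tags(static_shell, 'tr'))   # 0: no table rows before scripts run
print(count_tags(rendered_page, 'tr'))  # 2: rows exist once the page is rendered
```

If the raw response contains no table rows while the browser shows a full table, the rows are being added by a script, which is the situation described above.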

The full code follows.

Crawl procedure:

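Site to scrape: https://zoj.pintia.cn/
First pass: scrape each problem's title, problem number, link, accepted-submission count and similar fields, and save them to a CSV file.
Second pass: scrape the content of each problem. Using the links collected in the first pass, fetch every problem page to get the problem title, the time and memory limits, and the full problem statement, then save the results to a file.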
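The two passes above rely on two small URL manipulations: generating the listing-page URLs, and turning the relative links collected in the first pass into absolute ones for the second pass. A minimal sketch using only the standard library follows. `listing_urls` and `absolute_url` are hypothetical helper names, and the problem-set id is the one that appears in the listing URL used later in this article.

```python
from urllib.parse import urljoin

BASE_URL = 'https://zoj.pintia.cn'
LISTING_TEMPLATE = BASE_URL + '/problem-sets/91827364500/problems?page={}'


def listing_urls(page_count):
    """Return the listing-page URLs for pages 0 .. page_count - 1."""
    return [LISTING_TEMPLATE.format(page) for page in range(page_count)]


def absolute_url(href):
    """Resolve a relative link from the listing page against the site root."""
    return urljoin(BASE_URL, href)


print(listing_urls(2))
print(absolute_url('/problem-sets/91827364500/problems/91827364500'))
```

`urljoin` also leaves an already absolute link unchanged, which plain string concatenation would not.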

Import the libraries and print their versions:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random
from urllib.parse import urlparse
from lxml import etree
import requests
import csv
import sys
from selenium import webdriver
import time
import pandas as pd

print(sys.version_info)
# Not every module is guaranteed to define __version__, so fall back to a placeholder
for module in re, requests, pd, webdriver:
    print(module.__name__, "  ", getattr(module, '__version__', 'unknown'))
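Reading `__version__` depends on each module choosing to define that attribute. An alternative, sketched here as one possible approach and not as the article's method, is to ask the installed package metadata through the standard library's `importlib.metadata` (Python 3.8+). `installed_version` is a hypothetical helper name; the names passed to it are PyPI distribution names, which can differ from import names (for example `beautifulsoup4` for `bs4`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


for name in ('requests', 'pandas', 'selenium', 'beautifulsoup4', 'lxml'):
    print(name, installed_version(name) or 'not installed')
```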

First pass (title, problem number, problem link, etc.)

# coding=utf-8
# Build the URLs of the 30 listing pages to scrape
urls = ('https://zoj.pintia.cn/problem-sets/91827364500/problems?page={}'.format(i) for i in range(30))
driver = webdriver.Chrome()  # drive Chrome to visit the pages
# Maximize the window. The author found this necessary: without it the page
# did not refresh and the steps below produced no output.
driver.maximize_window()
problems_num = []   # problem numbers
problems_url = []   # problem links (relative to the site root)
problems_name = []  # problem titles
problems_info = []  # full text of each table row
for url in urls:
    driver.get(url)
    time.sleep(10)  # wait 10 s so the JavaScript-generated page can finish loading
    data = driver.page_source
    soup = BeautifulSoup(data, 'lxml')
    rows = soup.find_all('tr')
    for row in rows:
        links = row.select('a')
        # A problem row holds two links, for example:
        # [<a href="/problem-sets/91827364500/problems/91827364500">1001</a>,
        #  <a href="/problem-sets/91827364500/problems/91827364500">A + B Problem</a>]
        if len(links) < 2 or row.find('td') is None:
            continue  # skip the header row and anything that is not a problem row
        # Append to all four lists together so the columns stay the same length
        problems_num.append(links[0].get_text())
        problems_url.append(links[0]['href'])
        problems_name.append(links[1].get_text())
        problems_info.append(row.get_text())
driver.quit()  # close the browser once every page has been read
# Save to file
problems_file_data = pd.DataFrame(columns=['problem number', 'name', 'url', 'other'])
problems_file_data['problem number'] = problems_num
problems_file_data['name'] = problems_name
problems_file_data['url'] = problems_url
problems_file_data['other'] = problems_info
print(problems_file_data.shape)  # check the shape
# The second pass reads this file
problems_file_data.to_csv('./zoj_problem.csv', index=False)
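A fixed `time.sleep(10)` per page means 30 pages take at least five minutes, and it can still be too short if a page loads slowly. An alternative is to poll until the table rows appear. The sketch below shows the idea with a small generic helper. `wait_until` is a hypothetical name and not a selenium function (selenium ships its own `WebDriverWait` class for this purpose), and its use with `driver` appears only in a comment because it needs a live browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(poll_interval)


# Intended use in the crawler (assumes `driver` is a selenium WebDriver):
#   driver.get(url)
#   wait_until(lambda: '<td' in driver.page_source, timeout=15)
#   data = driver.page_source

# Stand-alone demonstration with a condition that succeeds on the third call
calls = []


def ready_on_third_call():
    calls.append(1)
    return len(calls) >= 3


wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01)
print(len(calls))  # 3
```

With polling, a fast page costs only as long as it takes to render, while the timeout still bounds the wait for a slow one.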


Second pass (scraping the details of each problem):

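import pandas as pd
start_url = 'https://zoj.pintia.cn'
driver = webdriver.Chrome()
driver.maximize_window()
titles = []
limits = []
informations = []
problems = pd.read_csv('./zoj_problem.csv')  # links collected in the first pass
problems_url = problems.url
for url in problems_url[:100]:  # only the first 100 problems
    url = start_url + url  # the stored links are relative to the site root
    driver.get(url)
    time.sleep(3)  # give the JavaScript-rendered page time to load
    page = driver.page_source
    b = BeautifulSoup(page, 'html.parser')
    # The class names below are the ones the author saw on the page; they
    # appear to be generated by the site's build and may change over time.
    title = b.find("div", {"class": 'title_1gJgg'}).get_text()  # problem title
    titles.append(title)
    # time limit and memory limit
    limit = b.find("div", {"class": 'limitations_1_mdc'}).get_text()
    limits.append(limit)
    # problem statement: collect the text of every paragraph
    information = b.find_all("div", {"class": 'content_1QxDh'})[0]
    li = []
    for i in information.find_all('p'):
        li.append(i.get_text())
    informations.append(li)
driver.quit()  # close the browser
ti = pd.DataFrame(columns=['title', 'limit', 'information'])
ti['title'] = titles
ti['limit'] = limits
ti['information'] = informations
ti.to_csv('information.csv', index=False)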
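The `information` column holds a Python list of paragraphs for each problem. When such a list is written to CSV it is stored in its string form (for example `['a', 'b']`), and it comes back as a plain string when the file is read again. One possible workaround is to encode each list as JSON before writing and decode it after reading. The sketch below uses the standard `csv` and `json` modules in place of pandas, and the sample row is made up for illustration.

```python
import csv
import io
import json

# Made-up sample row with the same three columns as information.csv
rows = [
    {'title': '1001 A + B Problem',
     'limit': 'sample limit text',
     'information': ['First paragraph.', 'Second, with a comma.']},
]

# An in-memory buffer stands in for the CSV file
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['title', 'limit', 'information'])
writer.writeheader()
for row in rows:
    # Store the paragraph list as a JSON string in a single CSV cell
    encoded = dict(row, information=json.dumps(row['information'], ensure_ascii=False))
    writer.writerow(encoded)

# Read the data back and turn the JSON string into a list again
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row['information'] = json.loads(row['information'])
    restored.append(row)

print(restored[0]['information'])
```

The same encoding step could be applied to the `informations` list before it is assigned to the DataFrame, with `json.loads` applied to the column after `pd.read_csv`.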

If anything here falls short, corrections are welcome. Thank you.