Scraping every HDU OJ problem with Python (images excluded)

Author: edxuanlen

Article link: https://blog.csdn.net/edxuanlen/article/details/80252135

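title: Scraping the HDU problem set with Python
date: 2018-05-07 01:39:09
tags:
- python3
- web scraping
categories: python3

description: Scrape every problem from HDU. The HDU site has been unstable lately, so I scraped the problems to make practicing easier.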

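Several points come up here: how to strip the HTML tags from the text that Python extracts with regular expressions, how to disguise the request as coming from a browser, and how to handle errors.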
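
As a small illustration of the browser disguise mentioned above, the sketch below attaches a User-Agent header to a request with the standard library's `urllib.request.Request`. The helper name `build_request` is an assumption made for this example and is not part of the original script. Nothing is sent over the network here.

```python
import urllib.request

# Browser-like User-Agent string (the same value the script below defines)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36"
    )
}

def build_request(url):
    """Hypothetical helper: wrap a URL in a Request that carries HEADERS."""
    return urllib.request.Request(url, headers=HEADERS)

req = build_request("http://acm.hdu.edu.cn/showproblem.php?pid=1000")
# urllib stores header names in capitalized form, i.e. "User-agent"
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` in place of the bare URL would send the header along with the request.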

import os
import re
import urllib.request

# Request header that makes the script look like a browser
headers = {'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/51.0.2704.63 Safari/537.36'}

def getHtml(url):
    req = urllib.request.Request(url, headers=headers)    # attach the header
    page = urllib.request.urlopen(req)
    html = page.read()
    unicodehtml = html.decode("gbk")    # decode the GBK bytes into a str
    return unicodehtml

def zhenghe(str1, id, imgre):
    # Fetch the page at str1 + id and return every match of the pattern
    html = getHtml(str1 + id)
    return re.findall(imgre, html)

num = 1000
Url = "http://acm.hdu.edu.cn/showproblem.php?pid="
reg = r'<div class=panel_content>[\s\S]*?</div>'    # regular expression: one match per content panel
imgre = re.compile(reg)
dr = re.compile(r'<[^>]+>', re.S)    # matches any HTML tag
os.makedirs("hdu题库", exist_ok=True)    # make sure the output directory exists
while num <= 6275:
    matches = zhenghe(Url, str(num), imgre)
    t = open("hdu题库/hdu%s.txt" % num, "w", encoding="utf-8")
    for i in matches:
        dd = dr.sub('', i)    # strip the HTML tags
        t.write(dd)           # write the text to the file
        t.write("\n")
    t.close()
    num = num + 1
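
To see what the tag-stripping step does in isolation, here is a minimal sketch run on a made-up HTML fragment. The sample string and the helper name `strip_tags` are assumptions for illustration. The sketch also calls `html.unescape`, which the script above does not do, because removing tags leaves entities such as `&lt;` in the text.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")    # same tag pattern as in the script

def strip_tags(fragment):
    """Remove HTML tags, then turn entities such as &lt; back into characters."""
    text = TAG_RE.sub("", fragment)
    return html.unescape(text)

# Made-up fragment, not copied from a real problem page
sample = "<div class=panel_content>Given A and B (A &lt; B), output <b>A + B</b>.</div>"
print(strip_tags(sample))
```

Unescaping after the tags are removed matters: doing it first would turn `&lt;` into a literal `<`, which the tag pattern could then mistake for the start of a tag.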

There is a problem, though: if decoding as GBK fails, the program terminates, so some error handling is needed.

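def getHtml(url):
    req = urllib.request.Request(url, headers=headers)    # send the browser-like header
    page = urllib.request.urlopen(req)
    html = page.read()
    try:
        unicodehtml = html.decode("gbk")
    except UnicodeDecodeError:
        print("%s could not be decoded\n" % url)
        return "a"    # on error, return the placeholder string "a"
    return unicodehtml

while num <= 6275:
    matches = zhenghe(Url, str(num), imgre)
    if not matches:    # nothing matched (the placeholder "a" never matches), so write nothing
        num = num + 1  # advance first, otherwise continue would retry this page forever
        continue
    t = open("hdu题库/hdu%s.txt" % num, "w", encoding="utf-8")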
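
An alternative to skipping a page that fails to decode is to decode it leniently. The sketch below is one possible approach and not the original script's method: the helper name `decode_page` is hypothetical, and using GB18030 as the fallback is an assumption (it was designed to be backward compatible with GBK).

```python
def decode_page(raw):
    """Decode page bytes as GBK, falling back to lenient GB18030 decoding."""
    try:
        return raw.decode("gbk")
    except UnicodeDecodeError:
        # Replace undecodable bytes with U+FFFD instead of giving up
        return raw.decode("gb18030", errors="replace")

good = "杭电".encode("gbk")
bad = good + b"\xff"    # a lone 0xFF byte is not valid GBK

text = decode_page(bad)
print(decode_page(good) == "杭电", "\ufffd" in text)
```

With this approach a single bad byte costs one replacement character rather than the whole problem statement.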

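At this point nearly everything works, but the result is still not ideal: the input and output sections are not separated, and there is no title.
Making the regular expressions more specific fixes this. The final version of the code: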

import os
import re
import urllib.request

# Request header that makes the script look like a browser
headers = {'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/51.0.2704.63 Safari/537.36'}

def getHtml(url):
    req = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(req)
    html = page.read()
    try:
        unicodehtml = html.decode("gbk")
    except UnicodeDecodeError:
        print("%s could not be decoded\n" % url)
        return "a"    # placeholder returned when decoding fails
    return unicodehtml

num = 1013    # only problem 1013 is fetched here; the first version looped from 1000 to 6275
Url = "http://acm.hdu.edu.cn/showproblem.php?pid="
# One pattern per part of the page: title, section headings, section bodies, sample input, sample output
reg = [
    r"<td align=center><h1 style='color:#1A5CC8'>[\s\S]*?</h1>",
    r"<br><br><div class=panel_title align=left>[\s\S]*?</div> <div class=panel_content>",
    r"</div> <div class=panel_content>[\s\S]*?<br></div><div class=panel_bottom>",
    r'Sample Input</div><div class=panel_content><pre><div style="font-family:Courier New,Courier,monospace;">[\s\S]*?</div>',
    r'Sample Output</div><div class=panel_content><pre><div style="font-family:Courier New,Courier,monospace;">[\s\S]*?</div>',
]
patterns = [re.compile(r) for r in reg]
dr = re.compile(r'<[^>]+>', re.S)    # matches any HTML tag
os.makedirs("hdu题库", exist_ok=True)    # make sure the output directory exists
while num <= 1013:
    html = getHtml(Url + str(num))    # download the page once and reuse it for every pattern
    if html == "a":    # decoding failed, so skip this problem
        num = num + 1
        continue
    t = open("hdu题库/hdu%s.txt" % num, "w", encoding="utf-8")
    for imgre in patterns:
        for i in imgre.findall(html):
            dd = dr.sub('', i)    # strip the HTML tags
            dd = dd.replace('Input', 'Input\n')      # put the heading on its own line
            dd = dd.replace('Output', 'Output\n')
            t.write(dd)
            t.write("\n\n")
    t.close()
    num = num + 1
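
The idea of pairing each section title with its content can also be expressed as a single pattern with two capture groups. The sketch below runs on a made-up fragment whose markup is modeled on the class names used in the patterns above. The fragment, the pattern, and the helper name `split_sections` are all assumptions for illustration, and the real pages may differ.

```python
import re

# Group 1 captures the section title, group 2 the section body
SECTION_RE = re.compile(
    r"<div class=panel_title align=left>(.*?)</div>\s*"
    r"<div class=panel_content>([\s\S]*?)</div>"
)
TAG_RE = re.compile(r"<[^>]+>")

def split_sections(page):
    """Return a list of (title, body) pairs with HTML tags removed from the body."""
    sections = []
    for title, body in SECTION_RE.findall(page):
        sections.append((title.strip(), TAG_RE.sub("", body).strip()))
    return sections

# Made-up fragment, not copied from a real problem page
sample = (
    "<div class=panel_title align=left>Input</div> "
    "<div class=panel_content>Two integers A and B.<br></div>"
    "<div class=panel_bottom>&nbsp;</div>"
    "<div class=panel_title align=left>Sample Input</div>"
    "<div class=panel_content><pre>1 2</pre></div>"
)
for title, body in split_sections(sample):
    print(title + "\n" + body + "\n")
```

Because the title and body arrive as separate values, each heading can be written on its own line without the extra `Input` and `Output` substitutions.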