Scraping OJ problems with Python (1): analyzing OJ page elements with BeautifulSoup

weixin_30349597

Original link: http://www.cnblogs.com/duoduo369/archive/2012/03/30/2425474.html

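Finally done, so here are my notes.

My task covered two judges, hdoj and toj, though in practice it was really just one. hdoj took about four days, while toj was done in a single morning, so I'll set toj aside and use hdoj throughout. The goal is to scrape the text, images and so on from an OJ and store them in a database. And to check that the result is correct, the ever-handy Django is a must.

btw: so that's what Django's static files are about. More on that later.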

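First open the HDU (Hangzhou Dianzi University) site and find the Problem Archive. The problem list is at http://acm.hdu.edu.cn/listproblem.php?vol=1, and there are a lot of them. Pick any problem, for example 1056 (one that gave me a lot of grief) at http://acm.hdu.edu.cn/showproblem.php?pid=1056, or 1057 at http://acm.hdu.edu.cn/showproblem.php?pid=1057. The first job is to analyze the elements of this page. Why? All of this will end up in a database sooner or later, so we first need to know which columns the table will have, and which parts are common across problems, so that one function can handle them all. So open Firebug in Firefox, or the developer tools in Chrome, and you will see something like this.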

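Looking at the page, a problem seems to have: 1. title 2. limit des 3. problem des 4. input 5. output 6. sample input 7. sample output 8. hint 9. author 10. source 11. recommend 12. images. At first I assumed the first five were always present. Surely every problem has a title, a limits line, a problem description, input and output? That held until my first version ran into the oddball problem 1056, which has no input or output section at all. I was scraping from problem 1000 to 2000, and every time it reached 1056 Python raised an exception and the run died. I had no idea what was going on and searched everywhere, until I finally looked at 1056 itself. Oh. So that's how it is.

So don't take anything for granted.

Later I asked a senior student for help. He had built a similar OJ scraper before and gave me a diagram. It is too good to keep to myself, so I'm uploading it here. Seniors know everything. For now, my task only needs the problem column.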

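OK, it turns out that every HDU problem lives at http://acm.hdu.edu.cn/showproblem.php?pid= followed by a problem number (4 digits). Presumably the OJs store their problems in a database too.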
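The URL pattern above makes it easy to enumerate problem pages. Here is a minimal sketch in Python 3 (the article's own code is Python 2). The helper names problem_url and problem_urls and the 1000 to 1002 range are illustrative assumptions, not part of the original scraper:

```python
# Hypothetical helpers: build showproblem.php URLs from problem numbers.
BASE_URL = "http://acm.hdu.edu.cn/showproblem.php?pid="


def problem_url(pid):
    """Return the HDU problem page URL for an integer problem number."""
    return BASE_URL + str(pid)


def problem_urls(first, last):
    """Return the URLs for problem numbers first..last, inclusive."""
    return [problem_url(pid) for pid in range(first, last + 1)]


if __name__ == "__main__":
    for url in problem_urls(1000, 1002):
        print(url)
```

Each URL produced this way can be passed to a page-scraping function such as the catch(url) shown in this article.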

Right, now let's walk through the code to see how BeautifulSoup parses the page.

Code first:

  1 #! -*- encoding:utf-8 -*-
  2 import urllib2
  3 import traceback
  4 from BeautifulSoup import BeautifulSoup
  5 from sqlalchemy import *
  6 from sqlalchemy.orm import *
  7 
  8 def catch(url=None, pro_image='/images/hdoj/'):
  9     """Return 12 infos:
 10     1.title 2.limit des 3.problem des 4.input 5.output
 11     6.sample input 7.sample output 8.hint 9.author
 12     10.source 11.recommend 12.images
 13     the last element is a list of images."""
 14     content_stream = urllib2.urlopen(url)
 15     content = content_stream.read()
 16     print 'catching: ' + url
 17     soup = BeautifulSoup(content)
 18     table = soup.table
 19     
 20     #images the real url
 21     images_src = table.findAll('img')[1:]
 22     images = []
 23 
 24     len_img = len(images_src)
 25 
 26     for i in range(len_img):
 27         image = str(images_src[i].attrs[0][1])
 28         images.append(image)
 29         
 30     # now we change the images url
 31     
 32     for i in range(len_img):
 33         images_src[i]['src'] = pro_image + images_src[i].attrs[0][1].split('/')[-1]
 34             
 35     #title
 36     table_title = table.find('h1')
 37     table_title.hidden = True
 38     #title below limits description
 39     table_limit_des = table_title.findNext('span')
 40     table_limit_des.hidden = True
 41     # problem description, input, output, sample input, sample output
 42     try:
 43         table_problem_des = table.find(text='Problem Description').findNext('div', {'class':'panel_content'})
 44         table_problem_des.hidden = True
 45     except Exception as e:
 46         table_problem_des = None
 47 
 48     #input
 49     try:
 50         table_input = table.find(text='Input').findNext('div', {'class':'panel_content'})
 51         table_input.hidden = True
 52     except Exception as e:
 53         table_input = None
 54     #output
 55     try:
 56         table_output = table.find(text='Output').findNext('div', {'class':'panel_content'})
 57         table_output.hidden = True
 58     except Exception as e:
 59         table_output = None
 60     #sample input
 61     try:
 62         table_sample_input = table.find(text='Sample Input').findNext('div', {'class':'panel_content'})
 63         table_sample_input.hidden = True
 64     except Exception as e:
 65         table_sample_input = None
 66     #sample output
 67     try:
 68         table_sample_output = table.find(text='Sample Output').findNext('div', {'class':'panel_content'})
 69         table_sample_output.hidden = True
 70     except Exception as e:
 71         table_sample_output = None
 72 
 73     # hint
 74     try:
 75         table_hint = table_sample_output.i.next.next
 76     except Exception as e:
 77         table_hint = None
 78     try:
 79         table_sample_output = table_sample_output.i.previous.previous.previous
 80     except Exception as e:
 81         pass
 82         
 83     # source
 84     try:
 85         table_source = table.find(text='Source').findNext('div', {'class':'panel_content'})
 86         table_source.hidden = True
 87     except Exception as e:
 88        # print e
 89         table_source = None
 90 
 91     #recommend
 92     try:
 93         table_recommend = table.find(text='Recommend').findNext('div', {'class':'panel_content'})
 94         table_recommend.hidden = True
 95     except Exception as e:
 96       #  print e
 97         table_recommend = None
 98 
 99     # author 
100     try:
101         table_author = table.find(text='Author').findNext('div', {'class':'panel_content'})
102         table_author.hidden = True
103     except Exception as e:
104       #  print e
105         table_author = None
106 
107     
108 
109     
110     info = []
111 
112 
113     info.append(str(table_title))
114     info.append(str(table_limit_des))
115     info.append(str(table_problem_des))
116     info.append(str(table_input))
117     info.append(str(table_output))
118     info.append(str(table_sample_input))
119     info.append(str(table_sample_output))
120     info.append(str(table_hint))
121     info.append(str(table_author))
122     info.append(str(table_source))
123     info.append(str(table_recommend))
124     info.append(images)
125     
126     return info

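Line 2, import urllib2, imports a Python library module.

With it imported, line 14, content_stream = urllib2.urlopen(url), opens the page.

Line 15, content = content_stream.read(), reads the page source.

You can even print content to take a look. It matches what Firebug shows for the site.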
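For readers on Python 3, where urllib2 no longer exists, the same fetch can be sketched with urllib.request. This is an illustration, not the author's code. The function name fetch and its opener parameter are hypothetical; opener exists only so the function can be tried without a network connection:

```python
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Open url and return the raw page bytes.

    opener is a hypothetical injection point: any callable that takes a URL
    and returns an object with read() and close(); urlopen by default.
    """
    stream = opener(url)
    try:
        return stream.read()
    finally:
        # Release the connection even if read() fails.
        stream.close()
```

The result is bytes, so it still has to be decoded or handed to a parser that detects the encoding itself.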

Line 4, from BeautifulSoup import BeautifulSoup, imports the BeautifulSoup class that line 17 uses to parse content.


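As you can see, everything on the page has been fetched, and it really does match what Firebug shows. Of course, whether it is what Firebug displays or what BeautifulSoup parses, this is a copy held locally in our cache, not the live page on the server. That is why BS (BeautifulSoup) can modify the tags directly, which is especially useful for changing image paths.

Now BS takes the stage. First comes prettier output. The two lines in the figure below need no explanation, since the names say it all.

OK, for full details on BS see the Chinese documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html

I'll just explain my own code.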

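First, line 18, table = soup.table, pulls out the page's <table> tag. After a bit of analysis I found that everything I need on HDU sits inside the <table> tag, so all later searches start from table.

Next come the images. Why handle images first? Because: 1. the images have to be saved, so I need their real original URLs; 2. in the saved page, each image's src has to point to a local path. In other words, if the original is src = "/data/images/whatever", I need to change it to "/images/hdoj/whatever" before storing it in the database. So the code first records the original image URLs, then uses BS to change the <img src> in the cached copy to what I want, and only then moves on. This is why it matters that BS works on the cached copy rather than on the server: we cannot change what is on the server.

The image-related code: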

 1     #images the real url
 2     images_src = table.findAll('img')[1:]
 3     images = []
 4 
 5     len_img = len(images_src)
 6 
 7     for i in range(len_img):
 8         image = str(images_src[i].attrs[0][1])
 9         images.append(image)
10         
11     # now we change the images url
12     
13     for i in range(len_img):
14         images_src[i]['src'] = pro_image + images_src[i].attrs[0][1].split('/')[-1]

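Line 2, images_src = table.findAll('img')[1:], collects all the image tags (the [1:] slice drops the first one). Calling type() on it gives <type 'list'>, while the type of images_src[i] is <class 'BeautifulSoup.Tag'>.

Line 3, images = [], is the list that will end up in info holding the real HDU image URLs (needed to download the images).

Line 8, image = str(images_src[i].attrs[0][1]), is a bit convoluted. Why? In the BS documentation I found that BeautifulSoup.Tag has an attrs attribute. Printing it gave [(u'src', u'http://www.cnblogs.com/../data/images/1828-1.jpg')], a list containing one tuple, and the second element of that tuple happens to be the image URL. So attrs[0] is the tuple (u'src', u'http://www.cnblogs.com/../data/images/1828-1.jpg'), and attrs[0][1] is the image URL.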
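Indexing attrs[0][1] only works while src happens to be the first attribute of the <img> tag. Below is a sketch of a lookup by name on the same list-of-tuples shape. It uses plain Python data rather than a real Tag, and both the sample list and the attr_value helper are made up for illustration:

```python
def attr_value(attrs, name, default=None):
    """Look up an attribute by name in a list of (name, value) tuples."""
    return dict(attrs).get(name, default)


# Made-up attribute list in the list-of-tuples shape, with src NOT first.
sample_attrs = [(u"alt", u"figure"), (u"src", u"../data/images/1828-1.jpg")]

print(sample_attrs[0][1])               # positional: picks up the alt text
print(attr_value(sample_attrs, "src"))  # by name: picks up the URL
```

On a real BeautifulSoup 3 Tag, tag['src'] reads the attribute by name directly. It is the same indexing the article's code already uses when it assigns the new src.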

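Line 14, images_src[i]['src'] = pro_image + images_src[i].attrs[0][1].split('/')[-1], rewrites the URL in the cached copy. With pro_image='/images/hdoj/', the src saved with the page matches my local image path, which makes it easy to display with Django later.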
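The path rewrite on that line can be tried in isolation. This is a small sketch; the function name local_src and the sample path are assumptions for illustration:

```python
def local_src(original_src, prefix="/images/hdoj/"):
    """Keep only the file name of original_src and place it under prefix."""
    filename = original_src.split("/")[-1]
    return prefix + filename


# Made-up sample path in the style of the HDU image URLs.
print(local_src("../../data/images/1828-1.jpg"))  # /images/hdoj/1828-1.jpg
```

Because only the file name survives, two images with the same name in different remote directories would map to the same local path.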

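With the images handled, the text comes next. After many tries, I found that the title and the limits line really are always present. Hence the following code: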

1     #title
2     table_title = table.find('h1')
3     table_title.hidden = True
4     #title below limits description
5     table_limit_des = table_title.findNext('span')
6     table_limit_des.hidden = True

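Line 2, table_title = table.find('h1'), finds the <h1> tag inside table, because that is where HDU puts the title.

Line 5 works the same way, so I won't explain it.

Line 6, table_limit_des.hidden = True, hides the tag: when the tag is later converted with str(), only its contents are output, without the enclosing tag itself. A picture shows it most directly:

Everything after that follows the pattern of the code below, so I'll only explain the first one. Code first: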

1      try:
2          table_problem_des = table.find(text='Problem Description').findNext('div', {'class':'panel_content'})
3          table_problem_des.hidden = True
4      except Exception as e:
5          table_problem_des = None

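Why catch exceptions? As I said earlier, anything can happen on an OJ, and even the input section can be missing, so the problem description, input, output and the rest all have to be wrapped in exception handling. For example, if the Problem Description heading is not found, table.find(text='Problem Description') returns None, and the chained findNext('div', {'class':'panel_content'}) call then raises an exception, because None has no findNext method. The simple approach is to build a list: info holds these 12 items, and any item that is missing is set to None.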
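The repeated try/except blocks could also be written as one helper that checks for None explicitly. This is only a sketch of that alternative, not the author's code. The name section_after is hypothetical, and it assumes table behaves like the BeautifulSoup 3 Tag used in the article (find(text=...) returns a text node or None, and that node offers findNext):

```python
def section_after(table, heading):
    """Return the panel_content div that follows heading, or None.

    table is assumed to behave like a BeautifulSoup 3 Tag.
    """
    label = table.find(text=heading)
    if label is None:
        # Heading missing (as Input and Output are on problem 1056).
        return None
    return label.findNext('div', {'class': 'panel_content'})
```

A caller would use it as table_input = section_after(table, 'Input') and still set hidden = True on any result that is not None. Unlike a blanket except Exception, this version lets unrelated errors surface instead of silently turning them into a missing section.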

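Finally, the items are appended to the info list and returned, with images as the last element. Note that by this point the src of the <img> tags inside the problem text in info already differs from the URLs in images.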
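Since info is positional, a caller has to remember which index holds which field. One way to make that explicit is to pair the list with field names. This is a sketch: the names in FIELD_NAMES and the info_to_dict helper are my own labels, following the order in which catch() appends the items:

```python
# Hypothetical field names, in the order catch() appends to info.
FIELD_NAMES = [
    "title", "limit_des", "problem_des", "input", "output",
    "sample_input", "sample_output", "hint", "author",
    "source", "recommend", "images",
]


def info_to_dict(info):
    """Pair the 12 positional items returned by catch() with field names."""
    if len(info) != len(FIELD_NAMES):
        raise ValueError(
            "expected %d items, got %d" % (len(FIELD_NAMES), len(info)))
    return dict(zip(FIELD_NAMES, info))
```

Keep in mind that catch() passes every text field through str(), so a missing section arrives as the string 'None' rather than the None object.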

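OK, that's the end of part one, catch. For the finer points of BeautifulSoup you really do need to look at the documentation, but that is no trouble. Within the next few days I'll try to write up store (database storage with sqlalchemy) and django (so that's how static files are used for display). Onward.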
