Scraping University Announcements: Using BeautifulSoup

Latest recommended article published on 2021-11-14 22:36:12

dreams512

Categories: python, web crawler. Tags: web crawler, python

Article link: https://blog.csdn.net/chuan_yu_chuan/article/details/53996409

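# -*- coding: utf-8 -*-
# Python 2 script: the urllib2 module exists only in Python 2.
import re
import urllib2

from bs4 import BeautifulSoup


def print_zh(key):
    # The matched title holds \uXXXX escape sequences rather than Chinese
    # characters; decoding with 'unicode-escape' restores the readable text
    # without calling eval() on scraped content.
    print(key.decode('unicode-escape'))

# Keywords to look for in announcement titles:
# 项目 ("project") and 交流 ("exchange")
keyList = [u'项目', u'交流']
url = 'http://urp.tust.edu.cn/bulletinPageList.jsp?pageNum=1&groupIds=Nyw4'
req = urllib2.Request(url)
res = urllib2.urlopen(req)
soup = BeautifulSoup(res.read(), "lxml")
# Each announcement is one <li class="an-list"> element
lists = soup.select('li.an-list')
for li in lists:
    lise = li.select('div[class="an-title block"]')
    if lise:
        # Pull the title attribute out of the stringified result list
        te = re.findall(r'title="(.*)"', str(lise))[0]
        for key in keyList:
            # repr() is often used with regular expressions: it shows how
            # Python represents the string internally, e.g. u'\u5c31\u4e1a'
            tempkey = str(repr(key))
            # A second repr() exposes the escape characters so they can be
            # matched literally, e.g. u'\\u5c31\\u4e1a'
            tempkey = repr(tempkey)
            # Strip the leading u' and the trailing ' (together with the
            # outer quotes added by the second repr)
            tempkey = tempkey[3:len(tempkey) - 2]
            if re.search(tempkey, te):
                # Print the Chinese title
                print_zh(te)
                lise2 = li.select('div[class="dep-angency block"]')
                href = re.findall(r'href="(.*)"\s', str(lise2))
                # Print the corresponding link(s)
                print(href)
                depart = lise2[0].select('a.deptlink')[0].get_text()
                # Print the department
                print(depart)
                date_d = li.select("p")
                # Print the date
                print(date_d[0].get_text())
                print('\n')
                # One matching keyword is enough for this announcement
                break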
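
The script above targets Python 2, where the page text has to be matched through its escaped `\uXXXX` form. In Python 3, strings are Unicode by default, so the same filtering can likely be done with a plain substring test and BeautifulSoup attribute access instead of regular expressions over `str(tag)`. The following is only a minimal sketch under stated assumptions: `SAMPLE_HTML` is made-up markup modeled on the selectors used above (the real page may be structured differently), `find_announcements` is a hypothetical helper name, and the built-in `html.parser` is used in place of `lxml` to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the selectors in the script above.
# The real bulletin page may place the title and link attributes elsewhere.
SAMPLE_HTML = """
<ul>
  <li class="an-list">
    <div class="an-title block"><a title="2017年大学生创新项目申报通知">...</a></div>
    <div class="dep-angency block"><a class="deptlink" href="/dept/1">教务处</a></div>
    <p>2017-01-03</p>
  </li>
  <li class="an-list">
    <div class="an-title block"><a title="关于期末考试安排的通知">...</a></div>
    <div class="dep-angency block"><a class="deptlink" href="/dept/1">教务处</a></div>
    <p>2017-01-02</p>
  </li>
  <li class="an-list">
    <div class="an-title block"><a title="赴海外交流学习报名通知">...</a></div>
    <div class="dep-angency block"><a class="deptlink" href="/dept/2">国际交流处</a></div>
    <p>2017-01-01</p>
  </li>
</ul>
"""


def find_announcements(html, keywords):
    """Return announcements whose title contains any of the keywords."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for li in soup.select("li.an-list"):
        # Assumption: the title text sits in a title attribute of some
        # element inside the an-title div.
        title_tag = li.select_one("div.an-title [title]")
        if title_tag is None:
            continue
        title = title_tag["title"]
        # Python 3 strings are Unicode, so a substring test is enough.
        if not any(key in title for key in keywords):
            continue
        dept_link = li.select_one("div.dep-angency a.deptlink")
        date_tag = li.find("p")
        results.append({
            "title": title,
            "href": dept_link.get("href") if dept_link else None,
            "department": dept_link.get_text(strip=True) if dept_link else None,
            "date": date_tag.get_text(strip=True) if date_tag else None,
        })
    return results


if __name__ == "__main__":
    for item in find_announcements(SAMPLE_HTML, ["项目", "交流"]):
        print(item["title"], item["href"], item["department"], item["date"])
```

To run this against the live page, the HTML string could be fetched with `urllib.request.urlopen(url).read()` and passed to `find_announcements` in place of `SAMPLE_HTML`; whether the selectors still match the current site is not something this sketch can confirm.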

Reference: http://www.mamicode.com/info-detail-1377315.html
