Python Web Scraping Practice

Latest recommended article published on 2024-07-28 14:14:53

zy_dream

Category: python

Article link: https://blog.csdn.net/zy_dream/article/details/53493542

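Today's scraping exercise pulls content from my university's career center website. It is a basic scraper, well suited to beginners.

It uses requests and BeautifulSoup.

The two problems I ran into were garbled text (an encoding issue) and irregular URLs:

The URL extracted from the page cannot be opened directly as it is: the href is a relative path beginning with ../ (see get_html in the code).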
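
The garbled-text problem is most likely a decoding mismatch: when a server sends a text response without declaring a charset, requests falls back to ISO-8859-1, so a UTF-8 page read through .text comes out garbled. The sketch below reproduces the effect without any network access. The sample string is made up for illustration.

```python
# Sample text is made up for illustration; no network access is needed.
original = "就业中心"

# What the server actually sends: UTF-8 bytes.
raw = original.encode("utf-8")

# What you get when those bytes are decoded with the wrong codec.
garbled = raw.decode("iso-8859-1")

# Re-encoding with the same wrong codec restores the original bytes,
# which can then be decoded correctly as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled != original)   # True
print(repaired == original)  # True
```

With requests, reading resp.content (raw bytes) or setting resp.encoding = 'utf-8' before touching resp.text should avoid the round trip altogether.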

The code is as follows:

# -*- coding: utf-8 -*-
# Python 3; requires the requests and beautifulsoup4 packages.
import re

import requests
from bs4 import BeautifulSoup


def fetch(url):
    # Return the raw bytes of a page, or None if the request fails.
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # turn 4xx/5xx responses into HTTPError
    except requests.RequestException as e:
        print("Request failed, reason:", e)
        return None
    return resp.content


def get_subject(url):
    # Find the link to the zdgz.htm page on the home page.
    html = fetch(url)
    if html is None:
        return None
    soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
    links = soup.find_all('a', href=re.compile(r'zdgz\.htm'))
    return links[0] if links else None


def enter_zdgz(base, link):
    # This step matters: the href is relative, so the base URL is prepended.
    info = fetch(base + link['href'])
    if info is None:
        return []
    soup = BeautifulSoup(info, 'html.parser', from_encoding='utf-8')
    # The title is Chinese text on the site: "schedule (December)".
    return soup.find_all('a', title=re.compile(r'安排表（12月）'))


def get_html(base, link):
    # The href starts with '../'; keep only the part after it.
    temp = link['href'].split('../')[1]
    print("Full schedule link:", base + temp)
    return fetch(base + temp)


def get_info(html):
    # Parse the raw bytes as UTF-8 (this avoids garbled text) and save the tables.
    try:
        with open('info.txt', 'w', encoding='utf-8') as f:
            soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
            infos = soup.find_all('table', style=re.compile(r'width: 565px;border-collapse: collapse'))
            for info in infos:
                f.write(info.get_text())
    except IOError as e:
        print("File error: " + str(e))


def main():
    base = 'http://jiuye.xupt.edu.cn/'
    link = get_subject(base)
    if link is None:
        return
    tests = enter_zdgz(base, link)
    if not tests:
        return
    for test in tests:
        print(test)
    # As in the original version, only the last matching link is followed.
    html = get_html(base, tests[-1])
    if html is None:
        return
    print(html.decode('utf-8', errors='replace'))
    get_info(html)


main()

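A few points in the code above are worth noting:

1. When get_subject looks for the target link, it may find more than one, because find_all() returns a list. Take whichever element you need for your case. For example, when I scraped Douban movies earlier, I handled this with a for loop.

2. To get the actual URL from a tag you have found, use link['href'].

3. To deal with the irregular URLs, the approach I went for was string replacement.

I don't know why calling the string's replace directly had no effect, so I switched to regular expressions. (A likely cause: str.replace returns a new string and leaves the original unchanged, so the result has to be assigned.)

But for today's problem the regular expression did not work either, because with # str=re.compile('../') the pattern ../ matches anything of the form XX/. In a regular expression an unescaped . matches any character, so ../ means "any two characters followed by /"; the dots would have to be escaped, as in r'\.\./', to match the literal string.

So I switched to Python's string splitting function split(), and the problem was solved. It really is a handy function.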
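
The three approaches from point 3 can be compared side by side in a minimal sketch. The page URL and the href below are hypothetical values chosen for illustration, not taken from the actual site. The last line shows urljoin from the standard library, which is not what the original code uses but resolves ../ against the URL of the page the link was found on.

```python
import re
from urllib.parse import urljoin

# Hypothetical values, for illustration only.
page_url = "http://jiuye.xupt.edu.cn/channel/zdgz.htm"
href = "../info/1001/2345.htm"

# str.replace returns a new string; href itself is left unchanged.
href.replace("../", "")
stripped = href.replace("../", "")

# In a regex, "." matches any character, so "../" also matches "fo/".
loose = re.sub("../", "", "info/page.htm")
# Escaping the dots matches only a literal "../".
strict = re.sub(r"\.\./", "", href)

# urljoin resolves the relative path against the page URL.
full = urljoin(page_url, href)

print(href)      # ../info/1001/2345.htm
print(stripped)  # info/1001/2345.htm
print(loose)     # inpage.htm
print(strict)    # info/1001/2345.htm
print(full)      # http://jiuye.xupt.edu.cn/info/1001/2345.htm
```

Using urljoin has the advantage that it also copes with hrefs that contain no ../ at all, or more than one, where split('../')[1] would fail or return the wrong piece.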

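Remaining problems:

1. In the final step, where the content to be written to the text file is scraped, the attribute used in the regex match to identify the tag is a poor choice:

infos=soup.find_all('table',style=re.compile(r'width: 565px;border-collapse: collapse'))

Matching on such a specific attribute value may be very precise, but what if a site with similar content changes it to 566px or 564px?

2. How can the scraped content be written to the text file in a structured format? (I have actually done this before.)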
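
One possible way to approach both open questions is sketched below, under stated assumptions. TABLE_STYLE is a hypothetical, more tolerant pattern that accepts any pixel width and optional spaces. write_rows is a hypothetical helper that writes one tab-separated line per table row. The rows are made-up sample data standing in for the text of each table cell.

```python
import csv
import io
import re

# Tolerant pattern: any pixel width, optional spaces after ":" and ";".
TABLE_STYLE = re.compile(r"width:\s*\d+px;\s*border-collapse:\s*collapse")


def write_rows(rows, out):
    """Write each row (a list of cell strings) as one tab-separated line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Strip stray whitespace that table cells often carry.
        writer.writerow([cell.strip() for cell in row])


# Made-up rows standing in for the text of each table cell.
rows = [["Date", "Company", "Room"], [" 12-05 ", "Example Co.", "A101"]]
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

The compiled pattern could be passed as the style argument of find_all in place of the exact string, and the rows could be built by collecting the text of the td cells in each tr. Writing to a file opened with open('info.txt', 'w', encoding='utf-8', newline='') instead of the StringIO buffer should work the same way.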