Python Web Scraping: Crawling Library Borrowing Records

隨兴

Category: python. Tags: urllib, BeautifulSoup

Article link: https://blog.csdn.net/a429367172/article/details/86532683

Environment

Python 3.6
BeautifulSoup4 v4.6

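Analysis

Many accounts in the library management system still use the default password, and since I have recently been learning web scraping, I wanted to crawl it for practice. There is no malicious intent; this post will be removed on request if it infringes.

This post covers:

A program that simulates user login
Notes from studying the BeautifulSoup documentation
A small program that crawls and saves HTML files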

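Simulating user login

Method 1: requests

First, build the POST data from the username and password and send it to the login endpoint so that the session receives a cookie. (Note that the payload may need to be JSON rather than form data.)
Then, relying on that cookie, request the target page with GET.
With requests it looks like this: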

import requests

url = "http://interlib.sdust.edu.cn/opac/reader/space"
url_history = "http://interlib.sdust.edu.cn/opac/loan/historyLoanList"
url_log = "http://interlib.sdust.edu.cn/opac/reader/doLogin"


# Create a session; it keeps cookies between requests
req = requests.Session()

loginid = '' # username
passwd = '' # password

# Build the login form
data = {
    'rdid' : loginid,
    'rdPasswd' : passwd,
    'returnUrl': '',
    'password': ''
}

# POST the login form. Passing the dict sends it form-encoded;
# if the server expects JSON, use req.post(url_log, json=data) instead.
response = req.post(url_log, data=data)

# GET the borrowing-history page using the session cookie
index1 = req.get(url_history)

print(index1.status_code)
print(index1.text)
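
The remark above that the payload may be JSON rather than form data can be checked without touching any server. The sketch below only prepares two requests and prints how each would be encoded; the helper name prepare_login, the placeholder URL, and the field values are illustrative assumptions, not part of the original program.

```python
import requests


def prepare_login(url, fields, as_json=False):
    """Build (but do not send) a login POST so its encoding can be inspected."""
    if as_json:
        request = requests.Request("POST", url, json=fields)
    else:
        request = requests.Request("POST", url, data=fields)
    return request.prepare()


# Placeholder URL and credentials, used only to show the encoding
fields = {"rdid": "demo", "rdPasswd": "secret"}
form_req = prepare_login("http://example.invalid/login", fields)
json_req = prepare_login("http://example.invalid/login", fields, as_json=True)

print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json_req.body)
```

Comparing these two outputs with what the browser's Network tab shows for the real login form is a quick way to decide between data= and json=.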

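Method 2: log in, then carry the resulting cookie on later requests

Use the browser's developer tools: switch to the Network tab and tick Preserve log (important!), so that the login request stays visible after the page redirects.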

import hashlib
import sys
import io
import urllib.parse
import urllib.request
import http.cookiejar

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8') # change the default encoding of standard output

headers = {'User-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'}

url = "http://interlib.sdust.edu.cn/opac/reader/space"
url_history = "http://interlib.sdust.edu.cn/opac/loan/historyLoanList"
url_log = "http://interlib.sdust.edu.cn/opac/reader/doLogin"


loginid = ''
passwd = ''

# The login form submits the MD5 hex digest of the password
md5 = hashlib.md5(passwd.encode("utf-8"))
passwd = md5.hexdigest()

data = {
    "rdid" : loginid,
    "rdPasswd" : passwd,
    'returnUrl': '/loan/historyLoanList',
    'password': ''
}

# URL-encode the form and convert it to bytes for the POST body
data = urllib.parse.urlencode(data).encode('utf-8')


# Build the login request
req = urllib.request.Request(url_log, headers=headers, data=data)

# Create the cookie jar
cookie = http.cookiejar.CookieJar()

# Build an opener from the cookie jar
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

# Send the login request; from now on this opener carries the cookie, proving it has logged in
resp = opener.open(req)

# Build the follow-up request and send it through the same opener
req = urllib.request.Request(url, headers = headers)
resp = opener.open(req)

print(resp.read().decode('utf-8'))
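
The two preparation steps in the script above (MD5-hashing the password, then URL-encoding the form into bytes) can be isolated into a small helper and checked offline. This is a minimal sketch; the function name build_login_body and the sample values are assumptions for illustration, while the field names are the ones used in the script.

```python
import hashlib
import urllib.parse


def build_login_body(reader_id, password, return_url="/loan/historyLoanList"):
    """Return the URL-encoded POST body with the password replaced by its MD5 hex digest."""
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    fields = {
        "rdid": reader_id,
        "rdPasswd": digest,
        "returnUrl": return_url,
        "password": "",
    }
    # urlencode yields a str; Request(data=...) needs bytes
    return urllib.parse.urlencode(fields).encode("utf-8")


# Sample values only; no request is sent
body = build_login_body("demo", "hello")
print(body)
```

The result can be passed directly as the data argument of urllib.request.Request.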

Alternatively, wrap the crawl in a class:

import hashlib
import sys
import io
import urllib.parse
import urllib.request
import http.cookiejar

url = "http://interlib.sdust.edu.cn/opac/reader/space"
url_history = "http://interlib.sdust.edu.cn/opac/loan/historyLoanList"
url_log = "http://interlib.sdust.edu.cn/opac/reader/doLogin"

class Libary:

    def __init__(self):
        self.cnt = 1 # used to vary the student number
        self.html_list = []
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')  # change the default encoding of standard output
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
        }

    def parse_html(self):
        pass

    def getNextUser(self):
        loginid = ''
        passwd = ''

        # Inspecting the form fields with Chrome's developer tools shows the password is MD5-hashed
        md5 = hashlib.md5(passwd.encode("utf-8"))
        passwd = md5.hexdigest()

        data = {
                "rdid" : loginid,
                "rdPasswd" : passwd,
                'returnUrl': '/loan/historyLoanList',
                'password': ''
        }
        self.cnt = self.cnt + 1 # used to produce the next username
        data = urllib.parse.urlencode(data).encode('utf-8')
        return data

    def saveAllHtml(self):
        no = 1
        for html in self.html_list:
            with open("E://1//html{}.html".format(no),"wb") as f:
                f.write(html)
            no = no + 1

    def getNextHtml(self):
        # Build the login request
        data = self.getNextUser()
        req = urllib.request.Request(url_log, headers=self.headers, data=data)

        # Create the cookie jar
        cookie = http.cookiejar.CookieJar()

        # Build an opener from the cookie jar
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

        # Send the login request; from now on this opener carries the cookie, proving it has logged in
        resp = opener.open(req)

        # Read the body once; a second read() on the same response returns empty bytes
        html = resp.read()
        self.html_list.append(html)
        return html

if __name__ == "__main__":
    lib = Libary()
    # print(lib.getNextHtml().decode('utf-8'))
    for i in range(6):
        lib.getNextHtml()
    lib.saveAllHtml()

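Ideas for improving efficiency
1. Use multiple threads to crawl in parallel, taking care with thread synchronization.
2. Save the HTML files, so that later runs can extract information from the saved files instead of crawling again.
3. Save each page as soon as it has been fetched.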
Parsing the data:

The parts of the HTML that need parsing are shown below.
The pagination info is in the element with class = meneame:
   <div class="meneame" style="text-align:right;">
       <span class="disabled">总共 72 条记录数</span>
       <span class="disabled">总共 2 页</span>
When a user has no borrowing records, this class does not appear in the page.
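
As a minimal sketch of reading that pager, the function below finds the meneame div, pulls the first number out of its first span, and converts the record count into a page count. It returns 0 when the div is absent, matching the no-records case. The name count_pages and the default of 50 rows per page are assumptions for illustration, and the built-in html.parser is used so that lxml is not required.

```python
import math
import re

from bs4 import BeautifulSoup


def count_pages(html, rows_per_page=50):
    """Return how many pages are needed for the record count shown in the pager."""
    soup = BeautifulSoup(html, "html.parser")
    pager = soup.find("div", class_="meneame")
    if pager is None:
        return 0  # no pager means no borrowing records
    span = pager.find("span", class_="disabled")
    match = re.search(r"\d+", span.get_text()) if span else None
    if match is None:
        return 0
    return math.ceil(int(match.group()) / rows_per_page)


sample = """
<div class="meneame" style="text-align:right;">
    <span class="disabled">总共 72 条记录数</span>
    <span class="disabled">总共 2 页</span>
</div>
"""
print(count_pages(sample))
print(count_pages("<html><body>no pager here</body></html>"))
```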
Each book row is marked by the id bookItem_:
   <tr><td width="80"><input type="checkbox" name="bookItemCheckbox"
                           id="bookItem_" value="86806" />  借书</td>
                           <td width="100">0055703</td>
                           <td><a href="/opac/book/86806" target="_blank">教父</a></td>
                           <td width="140">(美)普佐(Mario Puzo)著;文和平注释</td>
                           <td width="100">H319.4/187</td>
                           <td width="120">青岛未分配流通库</td>
                           <td width="100">中文图书</td>
                           <td width="100">2018-01-03</td>
                       </tr>
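
A row like the one above could be turned into a dictionary roughly as follows. This is a sketch under assumptions: the function parse_loan_rows and the field names in FIELDS are invented here, and the meaning of each column (for example, that the second cell is a barcode) is inferred from the single sample row rather than confirmed by the site.

```python
from bs4 import BeautifulSoup

# Assumed column meanings, inferred from the sample row (cells after the checkbox)
FIELDS = ["barcode", "title", "author", "call_number", "location", "doc_type", "date"]


def parse_loan_rows(html):
    """Return one dict per table row that contains a bookItemCheckbox input."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for checkbox in soup.find_all("input", attrs={"name": "bookItemCheckbox"}):
        tr = checkbox.find_parent("tr")
        if tr is None:
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        record = dict(zip(FIELDS, cells[1:]))  # skip the checkbox cell
        record["book_id"] = checkbox.get("value")
        rows.append(record)
    return rows


sample_row = """
<table><tr><td width="80"><input type="checkbox" name="bookItemCheckbox"
    id="bookItem_" value="86806" />  借书</td>
    <td width="100">0055703</td>
    <td><a href="/opac/book/86806" target="_blank">教父</a></td>
    <td width="140">(美)普佐(Mario Puzo)著;文和平注释</td>
    <td width="100">H319.4/187</td>
    <td width="120">青岛未分配流通库</td>
    <td width="100">中文图书</td>
    <td width="100">2018-01-03</td>
</tr></table>
"""
for row in parse_loan_rows(sample_row):
    print(row["title"], row["call_number"], row["date"])
```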
Extracting a table with beautifulsoup4:

https://blog.csdn.net/yf999573/article/details/53322902

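BeautifulSoup documentation

It mainly covers:

Constructing a BeautifulSoup object
Attributes of Tag objects in the soup
Traversing the soup tree
Searching the soup tree

There is too much to repeat here; see the documentation for details: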

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
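
For a quick feel of the four topics listed above, here is a minimal sketch on a made-up HTML fragment (the fragment and variable names are illustrative assumptions). It uses the built-in html.parser, so only bs4 itself needs to be installed.

```python
from bs4 import BeautifulSoup

html = '<div id="box" class="a b"><p>one</p><p>two</p></div>'

# 1. Constructing a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# 2. Tag attributes: name, dictionary-style access, multi-valued class
div = soup.div
tag_name = div.name
tag_id = div["id"]
tag_classes = div["class"]

# 3. Traversing the tree: children, parent, siblings
child_names = [child.name for child in div.children]
first_p = div.p
second_text = first_p.next_sibling.string
parent_name = first_p.parent.name

# 4. Searching the tree: find_all returns every match, find returns the first
texts = [p.string for p in soup.find_all("p")]
found = soup.find("div", class_="a")

print(tag_name, tag_id, tag_classes)
print(child_names, second_text, parent_name)
print(texts, found is div)
```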

The crawler script

# -*- coding:utf-8 -*-
"""
time   = '2019/1/16 21:47'
author = 'Gregory'
filename = 'lib_craw.py'
"""
import hashlib
import sys
import io
import urllib.parse
import urllib.request
import http.cookiejar
from bs4 import BeautifulSoup
import math
import time

url_main = "http://interlib.sdust.edu.cn/opac/reader/space"
url_history = "http://interlib.sdust.edu.cn/opac/loan/historyLoanList"
url_login = "http://interlib.sdust.edu.cn/opac/reader/doLogin"

# Send requests with urllib and keep the cookie with http.cookiejar
class Libary:

    def __init__(self):
        self.page_count = 0
        self.cnt = 1
        self.book_htmls = []
        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')  # change the default encoding of standard output
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
        }

    # For now, parsing the HTML only extracts the total number of pages
    def parse_html(self, html):
        self.get_page_count(html)

    # Note: the HTML passed in shows only 10 records per page
    def get_page_count(self, html_ten):
        soup = BeautifulSoup(html_ten, 'lxml')

        self.page_count = 0
        # Locate the pager; it is absent when the user has no borrowing records
        div = soup.find('div', class_='meneame')
        if div is not None and div.span is not None:
            s = div.span.string  # e.g. "总共 72 条记录数"
            print(s)
            log_num = int(s.split(' ')[1])

            self.page_count = math.ceil(log_num / 50) # round up
            print("Page count: %d" % self.page_count)

    # Build the login request body for the next user
    def getNextUser(self):
        loginid = ''
        passwd = ''

        # Inspecting the form fields with Chrome's developer tools shows the password is MD5-hashed
        md5 = hashlib.md5(passwd.encode("utf-8"))
        passwd = md5.hexdigest()

        data = {
            "rdid": loginid,
            "rdPasswd": passwd,
            'returnUrl': '/loan/historyLoanList',
            'password': ''
        }
        self.cnt = self.cnt + 1
        data = urllib.parse.urlencode(data).encode('utf-8')
        return data

    # Save all borrowing-history pages of one user
    def saveAllHtml(self):
        no = 1
        for html in self.book_htmls:
            with open("E:\\craw_lib2\\no{}_page{}.html".format(self.cnt-1,no), "wb") as f:
                f.write(html)
            no = no + 1

    def getNextUserHtml(self):
        # Build the login request
        data = self.getNextUser()
        req = urllib.request.Request(url_login, headers=self.headers, data=data)

        # Create the cookie jar
        cookie = http.cookiejar.CookieJar()

        # Build an opener from the cookie jar
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))

        # Send the login request; from now on this opener carries the cookie, proving it has logged in
        resp = opener.open(req) # note: this response page shows at most 10 records

        # Get this user's page count: total records / 50, rounded up
        self.get_page_count(resp.read())
        # Fetch and save the HTML of every page
        for page in range(self.page_count):
            # Build the request for the next page, 50 records per page
            page_request_data = {
                'page': page + 1,
                'rows': 50,
                'prevPage': 1,
                'hasNextPage': 'true',
                'searchType': 'title'
            }
            page_request_data = urllib.parse.urlencode(page_request_data).encode('utf-8')
            req = urllib.request.Request(url_history, headers=self.headers,data = page_request_data)
            resp = opener.open(req)

            self.book_htmls.append(resp.read())

        self.saveAllHtml()
        self.book_htmls.clear() # after saving this user's pages, empty the list for the next user
        # return resp.read()


if __name__ == "__main__":
    lib = Libary()
    # print(lib.getNextHtml().decode('utf-8'))
    for i in range(11):
        lib.getNextUserHtml()
        time.sleep(10) # pause 10 seconds between users so the crawl is not frequent enough to get the IP blocked

    # lib.saveAllHtml()

'''
The POST request for the historyLoan page carries:
page: 1
rows: 50
prevPage: 1
hasNextPage: true
searchType: title
'''

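Results:

All the saved HTML files:

Contents:

PS. The crawling approach here is fairly crude; suggestions from more experienced readers are welcome.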
