A Python Scraping Attempt on the North China Electric Power University Library Reader Recommendation System: Web Data Scraping with a Python Crawler

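Preface

The term project for this semester's Data Warehousing and Data Mining course is to implement a data mining method in code. I had noticed the university library's book recommendation (purchase request) system before and found its data quite interesting, so I took this chance to give it a try.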

Tools

  • Python 3.7
  • PyCharm editor
  • Google Chrome browser

Page analysis

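On campus the system is usually reached by IP address over the intranet. Off campus the university provides a VPN service, and the WebVPN supports automatic login with no time limit. Before scraping, my guess was that putting the cookie that holds my campus VPN login into the request headers would be enough to simulate a logged-in session, and this turned out to be the case.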
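
A minimal sketch of that idea, assuming the cookie string has been copied from the browser's developer tools into an environment variable. The variable name NCEPU_COOKIE and the helper build_request are made up for illustration; the code only builds the request object, and nothing is sent until urlopen is called on it.

```python
import os
import urllib.request as ur


def build_request(url, cookie, user_agent='Mozilla/5.0'):
    """Build a request that carries the login cookie in its headers."""
    return ur.Request(url, headers={'User-Agent': user_agent, 'Cookie': cookie})


# Hypothetical variable name; reading the cookie from the environment keeps
# the session credentials out of the source code.
cookie = os.environ.get('NCEPU_COOKIE', '')
request = build_request(
    'https://202-204-70-2-8080.webvpn.ncepu.edu.cn/asord/asord_hist.php?page=1',
    cookie,
)
# The header stored on the request is exactly the cookie string passed in.
print(request.get_header('Cookie') == cookie)
```

Keeping the cookie outside the script also means the script can be shared without leaking the login session.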
The data comes mainly from the following two pages:

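1. Recommendation data

Inspecting the page in the browser's developer tools gives the URL, the cookie, and the XPath expressions used to locate the data.

2. Detailed bibliographic information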

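Note the following points for this part:

  1. The catalogue search page is requested with POST. The search parameters are submitted as form data, and the form data must be serialized to JSON.
  2. If the book is not found, the "total" key in the returned response dictionary is 0. If it is found, we take the first book in relevance order, read the MARC record number that uniquely identifies it, and follow the link. The second-level page then shows all the detailed information for the book.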
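  3. The page does not actually show when the book was shelved. However, field 005 of the MARC record is a timestamp. In the MARC format this field records the date and time of the latest transaction on the record rather than a shelving date as such. Still, judging from the recommendation and publication dates, it seems reasonable to treat it as an approximation of when the book was shelved in the NCEPU library.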
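  4. Many books on the second-level page differ in the format and layout of their information, and I never found a good scraping strategy or storage scheme. In the end I decided to put all the information into a list and export it as a local .npy file.
    [Figure: in the detailed information, field 005 is a timestamp]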
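
To use the 005 timestamp it first has to be parsed. A small sketch, assuming the field follows the usual MARC layout `yyyymmddhhmmss.f`; the function name parse_marc_005 is hypothetical and the sample value is invented, not taken from the catalogue.

```python
from datetime import datetime


def parse_marc_005(value):
    """Parse a MARC 005 value such as '20190612103000.0' into a datetime.

    Only the first 14 characters (yyyymmddhhmmss) are used; the trailing
    fraction of a second is ignored.
    """
    return datetime.strptime(value.strip()[:14], '%Y%m%d%H%M%S')


# Invented sample value, for illustration only.
stamp = parse_marc_005('20190612103000.0')
print(stamp.date())
```

Once parsed, the value can be compared directly with the recommendation date, for example to estimate how long a purchase took.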

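Code

1. Scraping the recommendation data

Libraries used:

  • urllib, the HTTP request library built into Python 3
  • lxml, a Python parsing library, used here to locate elements with XPath expressions
  • xlwt, used to write the scraped data to an Excel file

Code that fetches the data returned by the page: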

import urllib.request as ur
import lxml.etree as le
import user_agent  # the local user_agent.py module shown below

# Paste the Cookie header copied from the browser's developer tools here.
# It carries the WebVPN login session, so keep the real value private.
COOKIE = '<cookie string copied from the browser>'

# page selects which page of the listing to fetch
def get_response(page):
    request = ur.Request(
        url='https://202-204-70-2-8080.webvpn.ncepu.edu.cn/asord/asord_hist.php?page=' + str(page),
        # The headers carry a User-Agent and a Cookie. The User-Agent is picked
        # at random on each call to disguise the client; the Cookie holds the
        # login session.
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
            'Cookie': COOKIE,
        }
    )
    response = ur.urlopen(request).read().decode('utf-8')
    # Parse the HTML so it can be queried with XPath
    lxml_x = le.HTML(response)
    return lxml_x
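
The fetch above lets any network failure propagate and sets no timeout. A hedged sketch of a more defensive fetch follows; fetch_html and its opener parameter are hypothetical names, and the stand-in opener exists only so the example runs without touching the network.

```python
import io
import urllib.request as ur


def fetch_html(request, opener=ur.urlopen, timeout=10):
    """Return the decoded page body, or None if the request fails."""
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode('utf-8', errors='replace')
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print('request failed:', error)
        return None


# Stand-in opener so the sketch runs offline.
def fake_opener(request, timeout):
    return io.BytesIO('<html><body>ok</body></html>'.encode('utf-8'))


print(fetch_html('any request object', opener=fake_opener))
```

In the scraper, the default opener would be used and the returned text passed to le.HTML() as before.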

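The user_agent.get_user_agent_pc() function used above returns a random User-Agent string to disguise the request headers. It lives in a small local module, user_agent.py: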

import random
# Desktop User-Agent strings
user_agent_pc = [
    # Chrome
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    # Firefox
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    # Opera
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    # QQ Browser
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
    # Sogou Browser
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    # 360 Browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    # UC Browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
]
# Mobile User-Agent strings
user_agent_phone = [
    # iPhone
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    # iPad
    'Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    # Android
    'Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    # QQ Browser for Android
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    # Android Opera Mobile
    'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
    # Android Pad Moto Xoom
    'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
]

def get_user_agent_pc():
    """Return a random desktop User-Agent string."""
    return random.choice(user_agent_pc)

def get_user_agent_phone():
    """Return a random mobile User-Agent string."""
    return random.choice(user_agent_phone)

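With get_response() we can already get the content the page returns. Now we need to pick out the data we care about: title, author, recommendation date, the library's purchase status, and the library's feedback: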

def first(nodes, default=''):
    """Return the first XPath result, or default when nothing matched."""
    return nodes[0] if nodes else default

n = 1  # next Excel row to write; row 0 holds the header
for page in range(1, 33):  # 32 pages
    response_ = get_response(page)
    for i in range(2, 22):  # up to 20 records per page, in table rows 2 to 21
        # XPath prefix selecting the data cells of table row i
        row = '//*[@id="container"]/table/tr[' + str(i) + ']/td[@class="whitetext"]'
        title = first(response_.xpath(row + '[2]/text()'), None)
        if title is None:
            break  # no such row: the last page holds fewer than 20 records
        author = first(response_.xpath(row + '[3]/text()'))
        date = first(response_.xpath(row + '[5]/text()'))
        status = first(response_.xpath(row + '[6]/text()'))
        # The note cell is often empty; store '无' ("none") in that case
        note = first(response_.xpath(row + '[7]/text()'), '无')
        # When the cell's own text is only two characters long, the real
        # status is read from the nested <font> element instead
        if len(status) == 2:
            status = first(response_.xpath(row + '[6]/font/text()'), status)
        # mysheet is the Excel worksheet; its definition is covered below
        mysheet.write(n, 0, title)
        mysheet.write(n, 1, author)
        mysheet.write(n, 2, date)
        mysheet.write(n, 3, status)
        mysheet.write(n, 4, note)
        n += 1
        print(n)  # running row counter, as a progress indicator

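Before the scraping loop runs, the mysheet variable must be defined. It is the worksheet returned by the Workbook.add_sheet() method of the xlwt library and is used to write the Excel sheet.

import xlwt

workbook = xlwt.Workbook(encoding='utf-8')
# Sheet name: "recommendation data"
mysheet = workbook.add_sheet('荐购数据', cell_overwrite_ok=True)
# Column headers: title, author, recommendation date, status, note
header = ['书名', '作者', '荐购日期', '状态', '备注']
for i, name in enumerate(header):
    mysheet.write(0, i, name)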

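Finally, with this page fully scraped, the data is saved to an Excel file.

workbook.save('荐购表.xls')  # file name: "recommendation table"
print('Exported to the Excel file!')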
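
As an aside, the same five columns could also be written as CSV with only the standard library. This is a swapped-in alternative to xlwt, not what the scraper above uses; write_rows is a hypothetical helper and the sample row is invented.

```python
import csv
import io

# Same column headers as the Excel sheet:
# title, author, recommendation date, status, note
HEADER = ['书名', '作者', '荐购日期', '状态', '备注']


def write_rows(stream, rows):
    """Write the header line followed by the data rows as CSV."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Invented sample row, written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_rows(buffer, [('Sample title', 'Sample author', '2020-04-01', 'ordered', '无')])
print(buffer.getvalue())
```

To write a real file, the buffer would be replaced by something like open('荐购表.csv', 'w', newline='', encoding='utf-8-sig'); the utf-8-sig encoding helps Excel recognise the Chinese text.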

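Let's run it. Scraping the 632 records on this site took 19 seconds in total. A new file, 荐购表.xls, appears in the project root directory.
With the 632 recommendation records now stored locally, we move on to the next scraping step.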

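2. Scraping the full bibliographic details

Libraries used:

  • requests, an HTTP library built on urllib3
  • lxml, see above
  • xlrd, used to read data from the Excel file
  • numpy (Numerical Python), used here to save lists to local files

Key code: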

  • Build the search string and search for the book
import json
import requests
import xlrd
import user_agent  # the local user_agent.py module shown earlier

# Paste the Cookie header copied from the browser here; keep the real value private.
COOKIE = '<cookie string copied from the browser>'

# Assumption: table is the first sheet of the Excel file saved in part 1
table = xlrd.open_workbook('荐购表.xls').sheet_by_index(0)
search_word = []
marc_s = []

# Search with a "title + author" string; rows 1 to 628 (row 0 is the header)
for i in range(1, 629):
    title = table.cell(i, 0).value
    author = table.cell(i, 1).value
    search_word.append(title + ' ' + author)
url = 'https://202-204-70-2-8080.webvpn.ncepu.edu.cn/opac/ajax_search_adv.php'
for j in search_word:
    form_data = {"searchWords": [{"fieldList": [{"fieldCode": "", "fieldValue": j}]}], "filters": [],
                 "limiter": [], "sortField": "relevance", "sortType": "desc", "pageSize": 20, "pageCount": 1,
                 "locale": "zh_CN", "first": True}
    headers = {
        'User-Agent': user_agent.get_user_agent_pc(),
        'Cookie': COOKIE,
        'Content-Type': 'application/json'
    }
    # The form is submitted as JSON, so json.dumps() is used to serialize it
    response = requests.post(url=url, headers=headers, data=json.dumps(form_data))
    r = response.text
    try:
        # Slice out the value between '"marcRecNo":"' (13 characters) and '","num":'
        marcRecNo = r[r.index('"marcRecNo":"') + 13:r.index('","num":')]
    except ValueError:
        # Marker not present in the response: the book was not found
        marcRecNo = ''
    # The MARC record number that uniquely identifies each book
    marc_s.append(marcRecNo)
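
The two steps in this search, serializing the form and pulling out the first marcRecNo, can also be sketched in isolation. The helper names build_payload and first_marc_no are hypothetical, the regular expression is a swapped-in alternative to the index-based slicing, and the response fragment (including its "content" key) is invented for illustration.

```python
import json
import re


def build_payload(query):
    """Return the JSON body for one search, mirroring the form data above."""
    form_data = {
        "searchWords": [{"fieldList": [{"fieldCode": "", "fieldValue": query}]}],
        "filters": [], "limiter": [],
        "sortField": "relevance", "sortType": "desc",
        "pageSize": 20, "pageCount": 1,
        "locale": "zh_CN", "first": True,
    }
    return json.dumps(form_data)


def first_marc_no(response_text):
    """Return the first marcRecNo in the raw response text, or '' if absent."""
    match = re.search(r'"marcRecNo"\s*:\s*"([^"]*)"', response_text)
    return match.group(1) if match else ''


# Invented response fragment, for illustration only.
sample = '{"total":1,"content":[{"marcRecNo":"0000123456","num":1}]}'
print(first_marc_no(sample))
print(first_marc_no('{"total":0}') == '')
```

The regular expression does not depend on which key follows marcRecNo, so it keeps working if the server reorders the fields.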
  • Look up more details by MARC record number
import os
import urllib.request as ur
import lxml.etree as le
import numpy as np
import user_agent  # the local user_agent.py module shown earlier

# Paste the Cookie header copied from the browser here; keep the real value private.
COOKIE = '<cookie string copied from the browser>'

os.makedirs('书籍信息', exist_ok=True)  # output folder: "book information"
for marc in marc_s:
    if marc == '':
        continue  # the search found no match for this book
    request = ur.Request(
        url='https://202-204-70-2-8080.webvpn.ncepu.edu.cn/opac/item.php?marc_no=' + marc,
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
            'Cookie': COOKIE,
        }
    )
    response = ur.urlopen(request).read()
    lxml_ = le.HTML(response)
    # Collect every text node under the detail list
    inf_s = lxml_.xpath('//*[@id="item_detail"]/dl/descendant::*/text()')
    del inf_s[-6:]  # drop the last six text nodes
    print(inf_s)
    m = np.array(inf_s)
    # Save the book's information locally, named after the second text node
    try:
        np.save('书籍信息/%s.npy' % inf_s[1], m)
    except (IndexError, OSError) as e:
        print('could not save record', marc, e)
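
Since the saved list is just text, a JSON file per book is one possible alternative to .npy, with the title cleaned up before it is used as a file name. This is a sketch of that variation, not the method used above; safe_name and save_book are hypothetical helpers and the record is invented.

```python
import json
import os
import re
import tempfile


def safe_name(title):
    """Replace characters that are not allowed in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'


def save_book(info, folder):
    """Write the list of strings to <folder>/<title>.json and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_name(info[1]) + '.json')
    with open(path, 'w', encoding='utf-8') as handle:
        json.dump(info, handle, ensure_ascii=False, indent=2)
    return path


# Invented record; index 1 is used as the file name, as in the code above.
demo = ['Title:', 'C/C++ primer: a sample', 'ISBN:', '000-0-000-00000-0']
print(save_book(demo, tempfile.mkdtemp()))
```

A title containing a slash or colon would otherwise make the save fail, which the original try/except hides silently.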
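  • Run the program
    The console output and the folder next to the code show that the book information has been saved locally. Done!
    Although it looks as if a few simple lines of code achieved the goal, problems kept cropping up along the way. My coding skills are limited, so I did all the debugging and running late at night, hoping not to put pressure on the university network.
    After the scrape finished, I found the data very thin and lacking in dimensions. My data analysis skills are also limited: I had planned to use this data for association rules and Bayesian-network-based prediction, but gave up after the attempts went nowhere. Consider it an experiment and an exercise, a record of my learning process; I hope it at least gives you a smile.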
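
On the worry about loading the university network: one common way to be gentler is to pause between requests. A minimal sketch, assuming a fetch function such as get_response is passed in; crawl_pages and its parameters are hypothetical names, and the stand-in fetch below makes no network calls.

```python
import time


def crawl_pages(fetch, pages, delay=1.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing `delay` seconds between calls."""
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        results.append(fetch(page))
    return results


# Stand-in fetch function; in the scraper it would be get_response.
print(crawl_pages(lambda page: 'page %d' % page, range(1, 4), delay=0))
```

With a one-second delay the 32 listing pages would take about half a minute instead of 19 seconds, which is a small price for a lighter footprint.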