Machine Learning Training Session 3: Crawler Overview

Latest recommended article published on 2023-11-08 19:57:15

HHUCESTA

Article link: https://blog.csdn.net/NOTFOUND_Liu/article/details/90381576

Life is short, I use Python

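1 Overview

1.1 What is a crawler

A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

In principle, anything a browser can display can be fetched by a crawler and stored in some structured format.

1.2 How to crawl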

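Network request: imitate a browser by constructing a network request and sending it to the website's server; the server returns data according to your request, and that data is called the response.
Parsing the response: the data returned by the server contains redundant information, or is not in the format you want, so you need to extract the useful parts from the response.
Saving the data: the parsed data is saved to the local disk or to a database.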
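
The three steps above can be sketched as three small functions. This is a minimal illustration, not the post's own code: the function names, the choice of extracting the page title, and the output file are all assumptions made for the example. Only the parsing step is run here, on an inline HTML string, so that the block works without network access.

```python
import json
import re

import requests


def fetch(url):
    """Step 1, network request: send a GET request and return the response text."""
    # A browser-like User-Agent; the exact string is an arbitrary example.
    headers = {"User-Agent": "Mozilla/5.0"}
    res = requests.get(url, headers=headers, timeout=10)
    res.raise_for_status()
    return res.text


def parse(html):
    """Step 2, parsing: keep only what we care about (here, the page title)."""
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return {"title": match.group(1).strip() if match else None}


def save(record, path):
    """Step 3, saving: append the record to a local file as one JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Parsing demonstrated on a hand-written HTML string instead of a live page.
sample_html = "<html><head><title> Example Domain </title></head></html>"
print(parse(sample_html))  # {'title': 'Example Domain'}
```

On a real page the three calls chain together as save(parse(fetch(url)), "pages.jsonl"), where the URL and the file name are whatever you choose.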

1.3 What do I need to learn

[Figure: knowledge needed for learning crawlers]

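As the figure above shows, learning to write crawlers takes a fair amount of knowledge, from front-end basics to Python parsing libraries to working with databases. Don't be intimidated, though: not every crawler requires mastery of all of this, and a simple crawler needs only a dozen or so lines of code.

1.4 How to learn crawling

There are plenty of tutorials online: books, blogs, videos, and so on.

A blogger worth following: Cui Qingcai (崔庆才)

Blog: https://cuiqingcai.com/927.html

Video: https://www.bilibili.com/video/av19057145?from=search&seid=18162235690175184998

A recommended book: Python爬虫开发与项目实战 (Python Crawler Development and Project Practice)

https://book.douban.com/subject/27061630/

Some excellent projects on GitHub, plus a Zhihu answer: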

https://www.zhihu.com/question/58151047/answer/640461600

https://github.com/facert/awesome-spider

https://github.com/Nyloner/Nyspider

https://github.com/CriseLYJ/awesome-python-login-model

https://github.com/Jack-Cherish/python-spider

https://github.com/iawia002/annie

Key point: write a lot of code and practice a lot.

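2 Web requests

2.1 How it works

A web page request has two stages:

Request: every page shown to a user has to go through this step, in which an access request is sent to the server.
Response: after receiving the request, the server checks that it is valid and then sends the response content to the user (the client). The client receives the content and displays it. This is the familiar web page request, as shown in Figure 8.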

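There are also two common request methods:

GET: the most common method, generally used to fetch or query a resource; the request parameters are placed in the URL.
POST: compared with GET, it can also upload parameters as a form; the request parameters are sent in the request body rather than in the URL.

So before writing a crawler, first decide whom to send the request to and which method to use.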
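
To make the difference between GET and POST concrete, the sketch below builds one request of each kind with the requests library and inspects where the parameters end up. Nothing is sent over the network; the endpoint and the parameter names are hypothetical and chosen only for illustration.

```python
import requests

# Hypothetical endpoint, used only to build the requests; nothing is sent.
URL = "https://example.com/search"
PARAMS = {"q": "python", "page": "2"}

# GET: the parameters are encoded into the query string of the URL.
get_req = requests.Request("GET", URL, params=PARAMS).prepare()
print(get_req.url)   # https://example.com/search?q=python&page=2
print(get_req.body)  # None

# POST: the parameters are form-encoded into the request body,
# and the URL itself carries no parameters.
post_req = requests.Request("POST", URL, data=PARAMS).prepare()
print(post_req.url)   # https://example.com/search
print(post_req.body)  # q=python&page=2
print(post_req.headers["Content-Type"])  # application/x-www-form-urlencoded
```

In everyday code the same distinction appears as requests.get(url, params=...) versus requests.post(url, data=...).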

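Request headers:

User-Agent: identifies the browser. A request sent directly with the requests library is easy to recognize as coming from a crawler and may be blocked by anti-crawling measures, so the crawler should present itself as an ordinary browser.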
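
The following sketch shows why the default header gives a crawler away and how passing a headers dict overrides it. It prepares requests through a requests.Session without sending them; the target URL is a placeholder, and the browser string is the one used in the Maoyan example in section 3.

```python
import requests

URL = "https://example.com"  # placeholder target; no request is actually sent

# Browser-like User-Agent string (the same one used in section 3).
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36")

session = requests.Session()

# Without custom headers, requests announces itself as "python-requests/<version>".
default_req = session.prepare_request(requests.Request("GET", URL))
print(default_req.headers["User-Agent"])

# A headers dict replaces the default value for this request.
custom_req = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": BROWSER_UA})
)
print(custom_req.headers["User-Agent"])
```

Changing the User-Agent addresses only this one signal; a site may use other checks that this sketch does not cover.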

HTTP status codes:

200: the request succeeded
301: the page has moved permanently to another URL
404: the page does not exist
500: internal server error
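
The four codes listed above can be looked up in Python's standard library, which is handy when logging what a crawler received. The helper name below is an assumption made for this sketch.

```python
from http import HTTPStatus


def describe(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


for code in (200, 301, 404, 500):
    print(code, describe(code))
# 200 OK
# 301 Moved Permanently
# 404 Not Found
# 500 Internal Server Error
```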

2.2 A simple exercise

2.2.1 Download an image

import requests

# Image URL used in the original post
url = "https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1558123762362&di=7325fef0a13db4518b50d520cdeac899&imgtype=0&src=http%3A%2F%2Fpic.rmb.bdstatic.com%2F153063492896c462ab8ed8e62c6da02eae4a1600ba.jpeg%40wm_2%2Ct_55m%2B5a625Y%2B3L%2BaWsOa9rua1geWJjee6vw%3D%3D%2Cfc_ffffff%2Cff_U2ltSGVp%2Csz_99%2Cx_62%2Cy_62"
res = requests.get(url, timeout=10)
res.raise_for_status()  # stop here if the server did not return a success status
with open("a.jpg", "wb") as f:  # "wb": an image is binary data
    f.write(res.content)

2.2.2 Download an audio clip

import requests

# Fetch the audio file and write its raw bytes to disk
with open("a.m4a", "wb") as f:
    f.write(requests.get("http://audio.xmcdn.com/group46/M02/AE/D3/wKgKj1tqygrC3FcmACyA_Gr_L8E289.m4a").content)

3 Full example: crawling the Maoyan TOP100

import requests
import json
import time
from requests.exceptions import RequestException
import re
from multiprocessing import Pool
import multiprocessing as mp

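def get_one_page(url):
    """Fetch one page and return its HTML, or None if the request fails."""
    # Browser-like User-Agent so the request does not look like an obvious script
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"}
    try:
        res = requests.get(url, headers=headers, timeout=10)
        if res.status_code == 200:
            return res.text
        return None
    except RequestException:
        return None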

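def parse_one_page(content):
    """Yield one dict per film found in the page HTML."""
    # Raw string so that \d reaches the regex engine unchanged; re.S lets "." match newlines
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?title="(.*?)".*?data-src="(.*?)".*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    # Alternative pattern that reads the title from the <a> tag (image and title groups swap places):
    #pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'+r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'+r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',re.S)
    items = re.findall(pattern, content)
    for item in items:
        yield {"index": item[0],
               "title": item[1],
               "img": item[2],
               "stars": item[3].strip()[3:],        # drop the 3-character "主演：" prefix
               "releasetime": item[4].strip()[5:],  # drop the 5-character "上映时间：" prefix
               "score": item[5] + item[6]}          # integer part + fractional part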

def save_to_file(item):
    # Append one JSON object per line; the with block closes the file automatically
    with open("result.txt", "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

def main(offset):
    url = "https://maoyan.com/board/4?offset=" + str(offset)
    content = get_one_page(url)
    if content is None:  # request failed or returned a non-200 status
        print("Failed to fetch", url)
        return
    for item in parse_one_page(content):
        print(item)
        save_to_file(item)

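if __name__ == "__main__":
    start = time.time()
    # One worker process per CPU core; each worker handles one page of 10 films
    pool = Pool(processes=mp.cpu_count())
    pool.map(main, [i*10 for i in range(10)])  # offsets 0, 10, ..., 90
    pool.close()
    pool.join()
    # Sequential version, for comparison:
    # for i in range(10):
    #     main(i*10)
    end = time.time()
    print("Elapsed: {} seconds".format(end-start))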
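
To see what the regular expression in parse_one_page captures, it helps to run it on a small piece of HTML offline. The sketch below does that with a hand-written <dd> block. The sample markup, film title, names, and URL are invented for illustration and only imitate the structure the pattern expects; the live page may differ.

```python
import re

# Same pattern as in parse_one_page, split across lines for readability.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?title="(.*?)".*?data-src="(.*?)"'
    r'.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# Invented sample entry (not copied from the real site).
SAMPLE = """
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" title="Example Film" class="image-link">
    <img data-src="https://example.com/poster.jpg" alt="Example Film" />
  </a>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""


def parse(html):
    """Yield one dict per <dd> entry, mirroring parse_one_page."""
    for item in PATTERN.findall(html):
        yield {"index": item[0],
               "title": item[1],
               "img": item[2],
               "stars": item[3].strip()[3:],        # strip the "主演：" label
               "releasetime": item[4].strip()[5:],  # strip the "上映时间：" label
               "score": item[5] + item[6]}


films = list(parse(SAMPLE))
print(films[0])
```

Because the slices [3:] and [5:] depend on the exact length of the Chinese labels, a change in the page's wording would silently cut the wrong characters, which is one reason to test the pattern on saved HTML before crawling.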