Web Scraping Learning Roadmap

郭大头哈哈哈

Article link: https://blog.csdn.net/weixin_44359695/article/details/100704242

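Learning web scraping progresses from easy to hard through three stages:

- Scraping static pages

- Scraping dynamic pages

- Distributed page scraping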

Scraping static pages:

Environment setup:

(1) Install the HTTP request module, requests:

pip install requests

(2) Install the HTML parsing module, lxml:

pip install lxml
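
One quick way to confirm that both packages installed correctly is to import them and print their versions. This is only a minimal sketch; the version numbers printed depend on your environment.

```python
import requests
from lxml import etree

# requests exposes its version as a string
print('requests', requests.__version__)
# lxml exposes its version as a tuple of integers
print('lxml', etree.LXML_VERSION)
```

If either import fails, the corresponding pip install step did not succeed in the interpreter you are running.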

Scraping steps:

(1) Import the required modules

import requests
from lxml import etree

(2) Build the network request and store the result in response

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
response = requests.get('http://www.shicimingju.com/', headers=headers)

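Notes:
(1) The User-Agent request header in headers identifies who is sending the request; setting it makes the request look like it comes from a browser.
(2) The requests.get call here takes two arguments: the URL to request and the request headers.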
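
To see what the User-Agent header changes, the request can be prepared without being sent and its headers inspected. The sketch below is illustrative only: prepare_get and BROWSER_UA are hypothetical names introduced here, and nothing touches the network.

```python
import requests

# Browser-like User-Agent string (same shape as the one used in this article)
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36')

def prepare_get(url, headers=None):
    """Build a GET request without sending it, so its final headers can be inspected."""
    session = requests.Session()
    request = requests.Request('GET', url, headers=headers)
    return session.prepare_request(request)

# Without a custom header, requests identifies itself as "python-requests/<version>"
default_ua = prepare_get('http://www.shicimingju.com/').headers['User-Agent']
# With the header set, the server sees a browser-like User-Agent instead
custom_ua = prepare_get('http://www.shicimingju.com/',
                        headers={'User-Agent': BROWSER_UA}).headers['User-Agent']

print(default_ua)
print(custom_ua)
```

Some sites reject or throttle the default python-requests identifier, which is why the browser-like value is commonly supplied.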

(3) Convert the downloaded page into an HTML tree for parsing (etree.HTML takes the page text and parses it into an element node)

html = etree.HTML(response.text)
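
As a rough illustration of what etree.HTML returns, the sketch below parses a made-up, deliberately sloppy snippet instead of a downloaded page. The parser repairs the markup and hands back the root element of the tree.

```python
from lxml import etree

# A made-up snippet: no <html>/<body> wrapper and unclosed tags
snippet = '<p>hello<b>world'
root = etree.HTML(snippet)

print(root.tag)                                   # tag name of the root element
print(root.xpath('//p/text()'))                   # text directly inside <p>
print(root.xpath('//b/text()'))                   # text inside the nested <b>
print(etree.tostring(root, encoding='unicode'))   # the repaired markup
```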

(4) Extract the data you need (by running XPath queries against the HTML tree)

# lxml only supports XPath 1.0, which has no ends-with() function, so contains() is used here
data = html.xpath('//div[contains(@class, "100")]')

See here for details on XPath syntax.
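
lxml implements XPath 1.0, which does not provide ends-with(). When a true suffix match is needed, it can be emulated with substring() and string-length(). The sketch below runs on made-up markup and compares that emulation with the looser contains() test.

```python
from lxml import etree

# Made-up markup for illustration only
html = etree.HTML('''
<div class="min-width-100">A</div>
<div class="width-1000">B</div>
<div class="col-50">C</div>
''')

suffix = '100'
# substring(@class, start) with start = string-length(@class) - len(suffix) + 1
# yields the last len(suffix) characters of the class attribute
ends_with = ('//div[substring(@class, string-length(@class) - {n} + 1) = "{s}"]'
             .format(n=len(suffix), s=suffix))

contains_hits = html.xpath('//div[contains(@class, "100")]/text()')
suffix_hits = html.xpath(ends_with + '/text()')

print(contains_hits)  # contains() also matches "width-1000"
print(suffix_hits)    # the suffix test matches only "min-width-100"
```

contains() is usually enough when class names are distinctive; the substring form is worth the extra length only when a looser match would pull in unwanted nodes.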

(5) Save the data to a file or a database
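
The storage step could look roughly like the sketch below, which writes the same rows to a CSV file and to a SQLite database using only the standard library. The rows, column names, table name, and file names are all made up for illustration; a real scraper would pass in its own extracted data.

```python
import csv
import os
import sqlite3
import tempfile

# Made-up rows standing in for scraped data
rows = [('poem-1', 'http://example.com/1.jpg'),
        ('poem-2', 'http://example.com/2.jpg')]

def save_csv(path, rows):
    """Write rows to a CSV file with a header line."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'img_url'])
        writer.writerows(rows)

def save_sqlite(path, rows):
    """Insert rows into a SQLite table, creating the table if needed."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute('CREATE TABLE IF NOT EXISTS images (title TEXT, img_url TEXT)')
            conn.executemany('INSERT INTO images VALUES (?, ?)', rows)
    finally:
        conn.close()

# Write into a fresh temporary directory so repeated runs do not interfere
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'images.csv')
db_path = os.path.join(out_dir, 'images.db')
save_csv(csv_path, rows)
save_sqlite(db_path, rows)
print(csv_path, db_path)
```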

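Static pages: scraping data with multiple processes

(1) Use the multiprocessing module. It is part of the Python standard library, so it does not need to be installed with pip.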

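(2) Scraper design:

Define an initial request function that performs the network request
Define the parser functions that handle the responses (there can be several)
Define the callback function that runs when a child process finishes
Define the main function

Below is a complete scraper that fetches some images and links from the 诗词名句网 site (shicimingju.com). It is for reference only.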

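import os
import time
from multiprocessing import Pool, cpu_count
from urllib.parse import urljoin, urlparse

import requests
from lxml import etree

# Perform the network request and pass the response to the given parser
def start_request(*, url, headers=None, parse):
    if headers is None:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

    response = requests.get(url=url, headers=headers, timeout=10)
    if response.status_code == 200:
        return parse(response)
    return None

# Parser for the start page: extract the image URLs
def parser1(response):
    html = etree.HTML(response.text)
    img_urls = html.xpath('//div[contains(@class, "min-width-100")]//a/img/@src')
    # Resolve relative src values against the page URL so they can be requested directly
    return [urljoin(response.url, src) for src in img_urls]

# Parser for image responses (parser2): save the image to disk
def parser2(response):
    # Name the file after the last path segment of the URL; a random name
    # between 1 and 100 could collide and overwrite an earlier download
    name = os.path.basename(urlparse(response.url).path) or 'image.jpg'
    with open(name, 'wb') as f:
        f.write(response.content)
    return None

# Called in the main process each time a child process finishes a task
def callBack(msg):
    # A list result holds newly found URLs: add them to the task scheduler
    if isinstance(msg, list):
        tasks.extend(msg)


# Only run the scheduling code in the main process
if __name__ == '__main__':
    # List used as the task scheduler: URLs waiting to be dispatched
    tasks = []
    # Handles of every dispatched task, used to detect completion
    results = []
    # Build the process pool
    pool = Pool(cpu_count())
    # Dispatch the initial request through the pool
    results.append(pool.apply_async(start_request, kwds={'url': 'http://www.shicimingju.com/', 'parse': parser1}, callback=callBack))
    while True:
        # Dispatch every URL currently waiting in the scheduler
        while tasks:
            item = tasks.pop()
            results.append(pool.apply_async(start_request, kwds={'url': item, 'parse': parser2}, callback=callBack))
        # Finished once every dispatched task is done and nothing new was queued.
        # In CPython the callback runs before ready() turns True, so no URL is missed.
        if all(r.ready() for r in results) and not tasks:
            break
        # Avoid a busy loop while waiting for the child processes
        time.sleep(0.1)
    # Close the process pool
    pool.close()
    # Make the main process wait for the child processes to finish
    pool.join()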

Hang tight: the section on scraping non-static pages is being updated...