Web Scraping Study Notes

Latest recommended article published 2024-10-11 17:30:36

胡萝卜粥

Tags: python

Article link: https://blog.csdn.net/weixin_42750816/article/details/116452397

Personal Web Scraping Notes

Preface

These notes are for improving my own ability to extract information from web pages.

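requests in Practice

KFC Restaurant Lookup
The KFC website has a dedicated page for looking up store information by city:
http://www.kfc.com.cn/kfccda/storelist/index.aspx

After entering a city and clicking search, the developer tools show a request whose response carries the store addresses.

The request details show both the request URL and the request data that is sent with it.
Analysis complete.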
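Before the code, it may help to see where each piece of the request ends up. The sketch below uses only the standard library and assumes the five fields appear as form data in the developer tools, with op=keyword as the only query parameter; the city 北京 is just an example value.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the analysis above; op=keyword is part of the URL itself.
endpoint = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

# Assumed form fields; '北京' is only an example city.
form_fields = {
    'cname': '',
    'pid': '',
    'keyword': '北京',
    'pageIndex': '1',
    'pageSize': '10',
}

# The query string travels in the URL.
query = parse_qs(urlsplit(endpoint).query)

# The form fields are URL-encoded and travel in the POST body.
# This is the same encoding requests applies when the dict is passed as data=.
body = urlencode(form_fields)

print(query)
print(body)
```

With requests, passing the dict as data= puts it in the body, while params= would append it to the URL instead.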

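Code

import requests

# op=keyword stays in the query string; the remaining fields go in the POST body.
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}

value = input("Enter the city to search for: ")

# Request data as seen in the developer tools.
data = {
    'cname': '',
    'pid': '',
    'keyword': value,
    'pageIndex': '1',
    'pageSize': '10'
}

# data= sends the fields as a form-encoded body; params= would append them to the URL instead.
response = requests.post(url, data=data, headers=headers)
text = response.text
print(text)

This only fetches the data; no processing is done.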

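Drug Administration (NMPA) License Data
http://scxk.nmpa.gov.cn:81/xk/

Open the site and do a quick analysis.
When the page refreshes, it sends a request whose response contains exactly the data shown on the page, so the job is simple: replicate that request.

We send data to the request URL. The fields page, productName, and conditionType are clearly the site's query conditions. The first refresh was made without filling in any condition, so these are the defaults: first page, no search term, search by license number.
OK, analysis complete.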

Code

import requests

url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}

# Query conditions: first page, no search term, search by license number.
data = {
    'on': 'true',
    'page': '1',
    'pageSize': '15',
    'productName':'',
    'conditionType': '1',
    'applyname': '',
    'applysn': '',
}

# Send the conditions as form data and parse the JSON response.
response = requests.post(url=url, data=data, headers=headers).json()
value = []
for i in response['list']:
    value.append(i['EPS_NAME'])
print(value)

Only the company names are taken from the result, which amounts to a small bit of data processing.
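Since page is one of the query conditions, the same extraction could be repeated page by page. Below is a minimal sketch of that loop; collect_names, fetch_page, and the sample data are hypothetical and exist only for illustration. Only the 'list' and 'EPS_NAME' keys come from the code above.

```python
def collect_names(fetch_page, max_pages):
    """Collect EPS_NAME values page by page until a page comes back empty."""
    names = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)        # parsed JSON for this page
        rows = payload.get('list', [])
        if not rows:                      # an empty page ends the loop early
            break
        names.extend(row['EPS_NAME'] for row in rows if 'EPS_NAME' in row)
    return names


# Made-up stand-in for the real request, so the sketch runs offline.
sample_pages = {
    1: {'list': [{'EPS_NAME': 'Company A'}, {'EPS_NAME': 'Company B'}]},
    2: {'list': [{'EPS_NAME': 'Company C'}]},
}


def fake_fetch(page):
    return sample_pages.get(page, {'list': []})


names = collect_names(fake_fetch, max_pages=5)
print(names)
```

For a real run, fetch_page would be a function that posts the form with its page field set to the given number and returns the parsed JSON.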

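Parsing Scraped Data (bs4, XPath, Regular Expressions)

Image Site: Analysis and Download
http://pic.netbian.com/4kmeinv/

Open the image site directly.
Press F12 to open the developer tools and inspect the requests.

The first request is a plain GET that fetches the HTML,
so we just replicate it with GET; no request data needs to be sent.

Printing the text returned by the GET request shows garbled characters.
However, the HTML declares its encoding, so setting the response encoding to gbk fixes it.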
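One common cause of this garbling is that requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset. The sketch below reproduces the effect offline with the standard library; the HTML string is a made-up stand-in for the real page.

```python
import re

# Made-up page that declares gbk, encoded the way the server would send it.
page_text = '<html><head><meta charset="gbk"><title>4K高清图片</title></head></html>'
raw = page_text.encode('gbk')

# Decoding with the wrong codec keeps the ASCII parts but garbles the Chinese text.
garbled = raw.decode('iso-8859-1')

# The declared charset is plain ASCII, so it can still be read from the garbled text.
declared = re.search(r'charset=["\']?([\w-]+)', garbled, re.I).group(1)

# Decoding the same bytes with the declared charset restores the text.
fixed = raw.decode(declared)

print(declared)
print(fixed == page_text)
```

Setting response.encoding = 'gbk' before reading response.text has the same effect as the final decode here.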

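Next, analyze the HTML content.
The HTML formatter at https://tool.lu/
can be used to pretty-print it.

It is easy to see that
the div with class="slist" holds the images shown on the page.
The src values containing uploads should be the image files to download.
Joining each src with the site address gives the full download URL.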
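A small sketch of that joining step, using urljoin from the standard library instead of plain string concatenation. The src value is a hypothetical example of the shape seen in the page, not a real file.

```python
from urllib.parse import urljoin

page_url = 'https://pic.netbian.com/4kmeinv/'
src = '/uploads/allimg/example/sample.jpg'  # hypothetical src value

# urljoin resolves a site-relative path against the page URL.
full_url = urljoin(page_url, src)

# Plain concatenation with a trailing slash on the base doubles the slash.
concatenated = 'https://pic.netbian.com/' + src

print(full_url)
print(concatenated)
```

urljoin also leaves an already absolute URL unchanged, so it is safe to apply whether the src is relative or not.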

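The important part is extracting exactly the information we need from the HTML.
XPath is used here; it locates information by node.
A search of the whole page shows that only one element has class="slist", so we can match anywhere in the document with // to locate that class, then select the ul and li nodes beneath it:
//div[@class="slist"]/ul/li
Once each image's li is located, take the src of each image:
./a/img/@src
Analysis complete.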
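The two-step lookup can be tried offline on a small snippet. This sketch swaps in the standard library's xml.etree.ElementTree as a stand-in for lxml: it supports only a subset of XPath, needs well-formed markup, and has no /@src step, so the attribute is read with .get('src'). The snippet and its paths are made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet shaped like the structure described above.
sample = """
<html><body>
<div class="slist"><ul>
<li><a href="/tupian/1.html"><img src="/uploads/allimg/a/1.jpg" alt="one"/></a></li>
<li><a href="/tupian/2.html"><img src="/uploads/allimg/a/2.jpg" alt="two"/></a></li>
</ul></div>
<div class="page"><a href="/index_2.html">next</a></div>
</body></html>
"""

root = ET.fromstring(sample.strip())

# Step 1: locate every li under the div whose class is "slist".
items = root.findall(".//div[@class='slist']/ul/li")

# Step 2: from each li, walk down to a/img and read its src attribute.
srcs = [li.find('./a/img').get('src') for li in items]

print(srcs)
```

With lxml, the second step can be written in XPath directly as ./a/img/@src, which returns a list of attribute values.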

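Code

import os
import requests
from lxml import etree
from urllib.parse import urljoin

url = 'https://pic.netbian.com/4kmeinv/'

response = requests.get(url=url)
response.encoding = 'gbk'  # the page declares gbk; without this the text is garbled
html = response.text
xpathBody = etree.HTML(html)
listLie = xpathBody.xpath('//div[@class="slist"]/ul/li')

os.makedirs('pic', exist_ok=True)  # the output directory must exist before writing

for i, listOne in enumerate(listLie, start=1):
    # src is site-relative, so resolve it against the page URL
    picDataUrl = urljoin(url, str(listOne.xpath('./a/img/@src')[0]))
    picData = requests.get(url=picDataUrl).content
    picName = os.path.join('pic', str(i) + '.jpg')
    with open(picName, 'wb') as fb:
        fb.write(picData)

Image analysis and download complete.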

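Novel Site: Analysis and Retrieval
The site chosen this time is http://xiyouji.5000yan.com/

Start with a quick analysis.
This is again a simple GET request that returns HTML, so replicating the request headers is enough.
Running the returned content through the HTML formatter shows that

each index entry matches what the page displays, and the href in each entry is the URL of that chapter's content.

Analysis complete. bs4 is used here to extract the content.
The chapter names sit inside a section tag, and the page has only one such tag. Use find('section').find_all('a') to get all a tags under section, then get('href') to read the href of each a tag.

After clicking a chapter name, the target page is also loaded with a GET request.

The chapter text turns out to be wrapped in a div with class='grap';
find('div', class_="grap") locates it.
Analysis complete.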

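Code

from bs4 import BeautifulSoup
import os
import requests
from urllib.parse import urljoin

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = "http://xiyouji.5000yan.com/"

page_text = requests.get(url, headers=headers)
page_text.encoding = 'utf-8'
html = page_text.text

book = BeautifulSoup(html, 'lxml')
# The chapter index is the only section tag on the page.
listLie = book.find('section').find_all('a')

os.makedirs('book', exist_ok=True)  # the output directory must exist before writing
txtName = os.path.join('book', 'xiyou.txt')

# Open the file once for the whole loop. Reopening it with 'wb' for every
# chapter truncates it each time, so only the last chapter would remain.
with open(txtName, 'w', encoding='utf-8') as fb:
    for listOne in listLie:
        # urljoin leaves absolute hrefs unchanged and resolves relative ones
        url_text = urljoin(url, listOne.get('href'))

        text = requests.get(url=url_text, headers=headers)
        text.encoding = 'utf-8'
        text_html = text.text

        txt = BeautifulSoup(text_html, 'lxml')
        contents = txt.find('div', class_="grap")
        if contents is None:  # skip links that are not chapter pages
            continue
        fb.write(contents.text + '\n')

Download complete.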
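When every chapter goes into one file, the open mode matters: 'w' and 'wb' truncate the file each time it is opened, so opening inside the loop keeps only the last chapter. A minimal offline sketch of the difference, using placeholder strings in a temporary directory:

```python
import os
import tempfile

chapters = ['chapter 1 text\n', 'chapter 2 text\n', 'chapter 3 text\n']  # placeholders

tmp_dir = tempfile.mkdtemp()
reopened_path = os.path.join(tmp_dir, 'reopened.txt')
single_open_path = os.path.join(tmp_dir, 'single_open.txt')

# Reopening with 'w' for every chapter: each open truncates the file.
for chapter in chapters:
    with open(reopened_path, 'w', encoding='utf-8') as f:
        f.write(chapter)

# Opening once and writing every chapter inside the same with block.
with open(single_open_path, 'w', encoding='utf-8') as f:
    for chapter in chapters:
        f.write(chapter)


def read_text(path):
    with open(path, encoding='utf-8') as f:
        return f.read()


reopened_result = read_text(reopened_path)
single_open_result = read_text(single_open_path)
print(repr(reopened_result))
print(repr(single_open_result))
```

Opening in append mode ('a') inside the loop would also keep every chapter, at the cost of appending to whatever an earlier run left in the file.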

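Asynchronous vs. Synchronous Scraping

To compare the time taken by the asynchronous and synchronous approaches, the 4K image site is used again (the site analysis is not repeated).
Briefly: parse the page as before and save the image URLs into a list.

The synchronous version simply loops over the URL list and downloads each one.
The asynchronous version uses Pool to start a thread pool.

Code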

import os
import time
import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API
from urllib.parse import urljoin

def get_url():
    # Collect the image URLs from the listing page.
    url = 'https://pic.netbian.com/4kmeinv/'

    response = requests.get(url=url)
    response.encoding = 'gbk'
    html = response.text
    xpathBody = etree.HTML(html)
    listLie = xpathBody.xpath('//div[@class="slist"]/ul/li')

    completeListLie = []

    for listOne in listLie:
        picDataUrl = urljoin(url, str(listOne.xpath('./a/img/@src')[0]))
        completeListLie.append(picDataUrl)
    return completeListLie

def save_pic(url):
    picData = requests.get(url=url).content
    # Name the file after the last URL segment. The pattern "/uploads/allimg/(.*?)/"
    # captures a directory segment, so images sharing a directory would overwrite
    # one another.
    picName = os.path.join('pic', url.rsplit('/', 1)[-1])
    with open(picName, 'wb') as fb:
        fb.write(picData)

def run_sync(urls):
    # Synchronous version: download one URL after another.
    for listOne in urls:
        save_pic(listOne)

def run_pool(urls):
    # Thread pool version: 3 worker threads download concurrently.
    with Pool(3) as pool:
        pool.map(save_pic, urls)

if __name__ == "__main__":
    os.makedirs('pic', exist_ok=True)
    start = time.time()
    completeListLie = get_url()
    run_pool(completeListLie)  # swap in run_sync(completeListLie) to time the synchronous version
    print("Elapsed time:", time.time() - start)

After running the program, the thread pool version clearly takes much less time.
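The timing difference can be reproduced without touching the network. In this sketch fake_download is a hypothetical stand-in that sleeps to mimic waiting on I/O, and the example.com URLs are placeholders that are never requested.

```python
import time
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API


def fake_download(url):
    # Stand-in for a real download: sleeping mimics time spent waiting on the network.
    time.sleep(0.1)
    return url


urls = ['https://example.com/img/%d.jpg' % n for n in range(6)]  # placeholders

# Synchronous: each wait has to finish before the next one starts.
start = time.perf_counter()
sync_results = [fake_download(u) for u in urls]
sync_elapsed = time.perf_counter() - start

# Thread pool: up to 3 waits overlap, and map returns results in input order.
start = time.perf_counter()
with Pool(3) as pool:
    pool_results = pool.map(fake_download, urls)
pool_elapsed = time.perf_counter() - start

print('synchronous:', round(sync_elapsed, 2), 'seconds')
print('thread pool:', round(pool_elapsed, 2), 'seconds')
```

The gain comes from overlapping waits, so it applies to I/O-bound work like downloads; CPU-bound work would not speed up the same way with threads.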

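Multi-Task Asynchronous Coroutines

Scraping images with multi-task asynchronous coroutines
The 4K image site is the target again (the image URL analysis is not repeated; the download URLs are obtained as before and put into a list).
Using coroutines requires understanding: 1. coroutine objects, 2. task objects, 3. the event loop, 4. async functions (defined with async def), 5. callback functions.
First create a coroutine object by calling the async function with a URL, wrap the coroutine object in a task object, then create the event loop, register the tasks with it, and start it.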
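Those five pieces can be seen together in a small offline sketch. It uses only asyncio from the standard library; fake_download is a hypothetical stand-in for the real request, and the example.com URLs are placeholders. asyncio.run creates the event loop, runs main, and closes the loop.

```python
import asyncio


async def fake_download(url):
    # Async function: awaiting asyncio.sleep hands control back to the event loop,
    # which is what a real network wait would do.
    await asyncio.sleep(0.1)
    return url.rsplit('/', 1)[-1]


finished = []


def on_done(task):
    # Callback: receives the finished task object as its only argument.
    finished.append(task.result())


async def main(urls):
    tasks = []
    for url in urls:
        coroutine = fake_download(url)          # coroutine object, nothing runs yet
        task = asyncio.create_task(coroutine)   # task object scheduled on the running loop
        task.add_done_callback(on_done)
        tasks.append(task)
    # gather waits for every task and returns results in the order the tasks were passed.
    return await asyncio.gather(*tasks)


urls = ['https://example.com/img/%d.jpg' % n for n in range(3)]  # placeholders
results = asyncio.run(main(urls))
print(results)
```

Creating the tasks inside main means they are always attached to the loop that asyncio.run starts.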
Code

import os
import time
import asyncio
import aiohttp
import requests
from lxml import etree
from urllib.parse import urljoin

# Collect the image download URLs.
def get_url():
    url = 'https://pic.netbian.com/4kmeinv/'

    response = requests.get(url=url)
    response.encoding = 'gbk'
    html = response.text
    xpathBody = etree.HTML(html)
    listLie = xpathBody.xpath('//div[@class="slist"]/ul/li')

    completeListLie = []

    for listOne in listLie:
        picDataUrl = urljoin(url, str(listOne.xpath('./a/img/@src')[0]))
        completeListLie.append(picDataUrl)
    return completeListLie

# Async function: calling it returns a coroutine object.
async def do_some_work(session, url):
    # Do not call blocking code such as requests in here; it would block the
    # event loop and the downloads would run one after another.
    async with session.get(url) as response:
        reader = await response.read()

    # Name the file after the last URL segment so each image gets its own file.
    number_name = url.rsplit('/', 1)[-1]
    picName = os.path.join('pic', number_name)
    with open(picName, 'wb') as fb:
        fb.write(reader)
    return number_name

# A callback always takes one argument: task is the finished task object.
def callback(task):
    name = task.result()
    print(name + ' has been downloaded')

async def main(urls):
    # One session is shared by all downloads.
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            coroutine = do_some_work(session, url)  # coroutine object
            task = asyncio.create_task(coroutine)   # task object
            task.add_done_callback(callback)
            tasks.append(task)
        # Wait for every task. While a task awaits the network it is suspended,
        # handing control back to the event loop so the other tasks can run.
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    os.makedirs('pic', exist_ok=True)
    start = time.time()
    urls = get_url()
    asyncio.run(main(urls))  # create the event loop, run the tasks, close the loop
    print("Coroutine elapsed time:", time.time() - start)