Scraping PPT Templates

Latest recommended article published on 2024-05-10 19:42:04

灯繁

Tags: web scraping

Article link: https://blog.csdn.net/weixin_52300580/article/details/110674818


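A record of my first time scraping PPT files
As a first-year university student, this is the first crawler I have written. The process may be a bit roundabout, but it is easy to follow. Some parts are probably not well written, and corrections are welcome.

Preface

I wrote this article to keep myself alert, to sort out my thinking, and to get better at writing crawlers. The code is admittedly still quite basic.

II. Steps

1. Import libraries

The code is as follows (example):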

from bs4 import BeautifulSoup
from lxml import etree
import requests
from selenium import webdriver
import urllib
import time
import os

The full code is as follows

The site scraped this time is 优品 (www.ypppt.com)

# http://www.ypppt.com/moban/lunwen/list-2.html
# http://www.ypppt.com/moban/lunwen/
# /html/body/div[2]/div/ul/li[1]/a

from bs4 import BeautifulSoup
import requests
import time

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38"
}

num = 1  # running counter used to name the downloaded files
for page in range(1, 6):
    # The first list page has its own URL; pages 2-5 follow the list-{n}.html pattern.
    if page == 1:
        new_url = 'http://www.ypppt.com/moban/lunwen/'
    else:
        new_url = ['http://www.ypppt.com/moban/lunwen/list-{}.html'.format(page)]
        new_url = new_url[0]    # a list can hold values of any type; it holds one URL here, so take element 0
    #   new_url = 'http://www.ypppt.com/moban/lunwen/list-{}.html'.format(page)  # equivalent, without the list
    print("Scraping " + new_url)
    time.sleep(4)  # pause before each list page so the site is not hit too quickly
    response = requests.get(new_url, headers=headers)
    response.encoding = 'utf-8'
    jx = BeautifulSoup(response.content, 'lxml')
    mains = jx.find('ul', {'class': 'posts clear'})
    main_ppts = mains.find_all('li')
    for i in main_ppts:
        # Open the detail page of one template
        a = i.a.attrs['href']
        b = requests.get('http://www.ypppt.com' + a, headers=headers)
        b.encoding = b.apparent_encoding

        # The download button on the detail page links to the download page
        c = BeautifulSoup(b.content, 'lxml')
        down = c.find('div', {'class': 'button'})
        down1 = down.a.attrs['href']
        down_1 = requests.get('http://www.ypppt.com' + down1, headers=headers)
        down_1.encoding = down_1.apparent_encoding

        # The first entry of the download list holds the file URL
        down_2 = BeautifulSoup(down_1.content, 'lxml')
        e = down_2.find('ul', {'class': 'down clear'})
        f = e.find('li')
        download_url = f.a.attrs['href']
        download = requests.get(url=download_url, headers=headers).content

        # Save the archive as 1.zip, 2.zip, ... in the current directory
        with open(str(num) + '.zip', 'wb') as fp:
            fp.write(download)
        print(str(num) + ' downloaded successfully')
        num += 1

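The first thing to write is the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
}
As for the time.sleep call, it is worth consulting other references on how to use it; I am not entirely sure about it myself.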
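
Since the placement of time.sleep is left open above, here is a minimal sketch of one common arrangement: pausing between consecutive requests instead of once at startup. The helper name `polite_iter` and the delay values are assumptions made for illustration; they are not part of the original script.

```python
import time


def polite_iter(urls, delay_seconds=4.0):
    """Yield each URL, sleeping delay_seconds between consecutive items.

    Hypothetical helper: the name and the default delay are illustrative.
    """
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        yield url


if __name__ == "__main__":
    pages = [
        "http://www.ypppt.com/moban/lunwen/",
        "http://www.ypppt.com/moban/lunwen/list-2.html",
    ]
    for url in polite_iter(pages, delay_seconds=1.0):
        print("would request", url)
```

Wrapping the delay in a generator keeps the download loop itself unchanged: the loop simply iterates over `polite_iter(...)` instead of the raw list.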

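Once all the libraries are imported, we can start parsing the pages

Step 1

After opening the site, I found that every list page except the first follows a regular URL pattern,
so I used an if/else check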

for page in range(1, 6):
    if page == 1:
        new_url = 'http://www.ypppt.com/moban/lunwen/'
    else:
        new_url = ['http://www.ypppt.com/moban/lunwen/list-{}.html'.format(page)]
        new_url = new_url[0]
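
The same page rule can be sketched as a small function, which avoids wrapping the URL in a list. The names `BASE_URL` and `list_page_url` are hypothetical and only used for this illustration; the URL pattern is the one shown above.

```python
BASE_URL = "http://www.ypppt.com/moban/lunwen/"


def list_page_url(page):
    """Return the URL of list page `page` (1-based)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if page == 1:
        # The first page has no list-N.html suffix.
        return BASE_URL
    return "{}list-{}.html".format(BASE_URL, page)


if __name__ == "__main__":
    for page in range(1, 6):
        print(list_page_url(page))
```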

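Step 2
I still prefer BeautifulSoup for parsing pages
The code is

BeautifulSoup(response.content, 'lxml')

Be sure to pass response.content (response.text also works) rather than the response object itself. BeautifulSoup expects the markup as a string or bytes and raises an error otherwise. It took me a long time to find this fix.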
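
To make the point about passing markup concrete, here is a small offline sketch that parses bytes the same way `response.content` would be parsed. `SAMPLE_HTML` is hand-written stand-in markup, not fetched from the real site, and `extract_links` is a hypothetical helper name. The sketch uses Python's built-in `html.parser` instead of `lxml` so that it runs without the extra dependency; this is a substitution, not what the script above uses.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for response.content (bytes); not real site data.
SAMPLE_HTML = b"""
<html><body>
  <ul class="posts clear">
    <li><a href="/article/1.html">First template</a></li>
    <li><a href="/article/2.html">Second template</a></li>
  </ul>
</body></html>
"""


def extract_links(markup):
    """Return the href of the first <a> in each <li> of the 'posts clear' list."""
    soup = BeautifulSoup(markup, "html.parser")
    posts = soup.find("ul", {"class": "posts clear"})
    if posts is None:
        # The list is missing, e.g. the page layout changed or the request failed.
        return []
    return [li.a.attrs["href"] for li in posts.find_all("li") if li.a is not None]


if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

Checking for `None` before calling `find_all` avoids an `AttributeError` when the expected list is not on the page.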

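Step 3
Now we can finally start extracting things from the page,
which brings us to my favorite, bs4
The idea is to locate the tags one step at a time

a = i.a.attrs['href']
b = requests.get('http://www.ypppt.com' + a)

This is how I simulate clicking into a page. I do not really know selenium yet, so this is the only approach I can use for now.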
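
The string concatenation above works when the href is site-relative. As a hedged alternative, the standard library's `urllib.parse.urljoin` also handles hrefs that are already absolute. `SITE_ROOT` and `absolute_url` are hypothetical names, and the example hrefs are made up for illustration.

```python
from urllib.parse import urljoin

SITE_ROOT = "http://www.ypppt.com"


def absolute_url(href, base=SITE_ROOT):
    """Resolve an href from the page against the site root."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Site-relative href: the site root is prepended.
    print(absolute_url("/article/1.html"))
    # Already absolute href: returned unchanged.
    print(absolute_url("http://example.com/file.zip"))
```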

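Step 4
The code above does not do this, but you could create a folder and save all the files into it.
The imported os library can do that, for example with os.mkdir(). If you are interested, have a look at how more experienced developers handle it.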
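
A minimal sketch of the folder idea follows. It uses `os.makedirs(..., exist_ok=True)` rather than `os.mkdir()`, because `os.mkdir()` raises an error when the folder already exists; that swap is my choice, not something from the original script. The helper name `save_to_folder` and the folder name are hypothetical.

```python
import os
import tempfile


def save_to_folder(data, folder, filename):
    """Write bytes to folder/filename, creating the folder if needed.

    Returns the path of the written file.
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as fp:
        fp.write(data)
    return path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "ppt_downloads")
        print(save_to_folder(b"demo bytes", target, "1.zip"))
```

In the download loop, the `with open(str(num) + '.zip', 'wb')` step could be replaced by a call such as `save_to_folder(download, 'ppt_downloads', str(num) + '.zip')`.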
