A Python crawler that uses beautifulsoup4 and selenium to scrape a dynamically loaded article list and find broken links

Latest recommended article published on 2023-06-01 21:56:14

夏洛特疯猫

Category: Crawlers. Tags: python, web

Article link: https://blog.csdn.net/weixin_44867493/article/details/107317850

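1. The production server limits the number of requests and blocks clients that exceed the limit, so the test runs against a local server. The local resource files are identical to the production ones.

2. The list data on the page is loaded dynamically, so BeautifulSoup cannot parse the raw response directly. Use webdriver from the selenium library to drive Chrome, let the page finish loading, and then parse the rendered source with BeautifulSoup.

3. Links usually appear in the href attribute of a tags and the src attribute of img tags. A link is either a relative path or an absolute path: a relative path starts with / and points to a resource on this site, while an absolute path starts with http and is the full address of an image or resource. The two cases have to be told apart.

4. The whole list can only be covered by following each list entry into its detail page. That takes nested loops and a second round of parsing, which is somewhat tedious.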
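
The relative/absolute distinction in point 3 can also be handled with `urljoin` from the standard library, which resolves a link against the page it was found on. This is only a sketch: the name `resolve_link` and the `PAGE_URL` value are assumptions made for illustration and are not part of the script in this article.

```python
from urllib.parse import urljoin

# Hypothetical base: the address of the page on which the link was found.
PAGE_URL = 'http://localhost/cqzoo/zxdt/list.html'

def resolve_link(link, page_url=PAGE_URL):
    """Return an absolute URL for a link taken from an href or src attribute."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against page_url, so both cases go through the same call.
    return urljoin(page_url, link)

print(resolve_link('/cqzoo/images/a.png'))        # http://localhost/cqzoo/images/a.png
print(resolve_link('detail.html?id=1'))           # http://localhost/cqzoo/zxdt/detail.html?id=1
print(resolve_link('https://example.com/b.png'))  # https://example.com/b.png
```

Resolving every link this way means the later request code does not need separate branches for building the URL, only for reporting the result.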

Step 1

Import the required libraries

# coding=utf8

from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver

Set up the logging module

import logging
logging.basicConfig(level=logging.INFO,format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')

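Step 2: Launch the browser

def gethtmlurl(url):
    logging.info('Starting test')
    browser = webdriver.Chrome()
    try:
        browser.get(url)
        time.sleep(10)  # give the page's scripts time to load the list
        return browser.page_source
    finally:
        browser.quit()  # close Chrome even if loading fails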
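
A fixed 10-second sleep is either wasted time or too short when the list loads slowly. One alternative is to poll until the list items exist and give up after a timeout; selenium provides `WebDriverWait` for this purpose. The sketch below shows the same polling idea with only the standard library, so it runs without a browser. `wait_until` and its parameters are hypothetical names, not selenium APIs.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(interval)

# Stand-in for a page whose list only appears on the third check.
calls = {'n': 0}

def list_is_loaded():
    calls['n'] += 1
    return calls['n'] >= 3

wait_until(list_is_loaded, timeout=5, interval=0.01)
print(calls['n'])  # 3
```

With a real browser, the condition could be something like `lambda: browser.find_elements(By.CSS_SELECTOR, '#list li')`, with `By` imported from `selenium.webdriver.common.by`; the call returns an empty list until the items have been rendered.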

Step 3: Write the parsing ("soup-making") routine

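def getlinkurl(source):
    logging.info('Start parsing')
    soup=BeautifulSoup(source,'html.parser')
    for child in soup.find_all('ul',id='list'):# find the ul that holds the article list
        for i in child.find_all('li'):# each li is one article
            print(i.find(class_='title').string)# print the title of each article
            a=i.a.attrs['href']# the article link
            if a.startswith('http'):# absolute URL (external link, not a relative path): request it directly
                try:
                    b=requests.get(a,timeout=10)
                    b.raise_for_status()# raises HTTPError on a 4xx/5xx response
                    print('_____________Article link is reachable_____________')
                except requests.exceptions.RequestException:# connection error, timeout, or 4xx/5xx status
                    print(a)# print the link that cannot be reached
                    print('*************Article link is NOT reachable***************')
                logging.info('Finished checking this article')
            else:# relative path: open the detail page on the local server
                b=requests.get('http://localhost/'+a.lstrip('/'),timeout=10)# lstrip avoids a double slash
                b.encoding=b.apparent_encoding# use the detected encoding for the detail page
                c=b.text
                soup_b=BeautifulSoup(c,'html.parser')# parse the detail page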

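                for child in soup_b.find_all('div',class_='text'):# loop over the body blocks of the detail page

                    for child_b in child.find_all('a'):# find every a tag
                        link=child_b.get('href')# None when the a tag has no href
                        if link is None:
                            print('*************a tag has no href*****************')
                        elif link.startswith('/cqzoo'):# image or resource hosted on this site
                            try:
                                check=requests.get('http://localhost'+link,timeout=10)# site-relative link, so prepend the local host
                                check.raise_for_status()# raises HTTPError on a 4xx/5xx response
                                print(link)
                                print('_____________Article image or resource link is reachable_____________')
                            except requests.exceptions.RequestException:
                                print(link)
                                print('***************Article image or resource link is NOT reachable***********')
                        else:
                            try:
                                check=requests.get(link,timeout=10)# request the link as written
                                check.raise_for_status()
                                print('_____________External link is reachable_____________')
                            except requests.exceptions.RequestException:
                                print(link)# print the external link that cannot be reached
                                print('**********External link is NOT reachable*******************')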
               
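                    for child_c in child.find_all('img'):# find every img tag
                        img_link=child_c.get('src')
                        if img_link is None:# img tag without a src attribute
                            continue
                        if img_link.startswith('/cqzoo'):
                            try:
                                img_check=requests.get('http://localhost'+img_link,timeout=10)
                                img_check.raise_for_status()
                                print(img_link)
                                print('______________Article image is reachable________________')
                            except requests.exceptions.RequestException:
                                print(img_link)
                                print('***************Article image is NOT reachable***********')
                        else:
                            try:
                                check=requests.get(img_link,timeout=10)
                                check.raise_for_status()
                                print(img_link)
                                print('_____________External image link is reachable_____________')
                            except requests.exceptions.RequestException:
                                print(img_link)# print the external image link that cannot be reached
                                print('**********External image link is NOT reachable*******************')
                    logging.info('Finished checking this article')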
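
The checks above print each result as they go. To get a single report at the end, the failures can be collected into a list instead. The following is a minimal sketch and not the method used in this article: it swaps `requests` for the standard library's `urllib.request` so it has no third-party dependency, and `fetch_status` and `find_broken` are hypothetical names.

```python
import urllib.error
import urllib.request

def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, ValueError, OSError):
        return None  # DNS failure, refused connection, timeout, malformed URL

def find_broken(urls, fetch=fetch_status):
    """Return (url, status) pairs for every URL that is not reachable."""
    broken = []
    seen = set()
    for url in urls:
        if url in seen:  # skip URLs that were already checked
            continue
        seen.add(url)
        status = fetch(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken

# Offline demonstration with a stub in place of real requests.
stub = {'http://localhost/ok.png': 200, 'http://localhost/gone.png': 404}
print(find_broken(list(stub), fetch=stub.get))
# [('http://localhost/gone.png', 404)]
```

Passing the fetch function as a parameter keeps the reporting logic testable without a running server.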

Step 4: Chain the functions together and run them

def run():
    url='http://localhost/cqzoo/zxdt/list.html?_t=1594198936938&type=yqxw'
    source=gethtmlurl(url)
    getlinkurl(source)

if __name__=='__main__':
    run()

Run the script and check the results in the console.
