Python Day 4: Web Scraping with Selenium Scrolling and Common Anti-Scraping Measures

KathAmy

Last modified 2022-08-17 19:30:35

Categories: selenium, Python, web scraping. Tags: python, web scraping, selenium

First published 2022-08-16 19:16:51

Article link: https://blog.csdn.net/qq_67780151/article/details/126370731

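This article covers page scrolling with Selenium in Python web scraping and ways to deal with common anti-scraping measures. It first analyzes page data from CNKI (cnki.net), then explains how to scroll a page automatically with Selenium and how to use the requests library with post-login cookies to log in automatically. It also covers how to get and reuse cookies with Selenium, and how to use proxy IPs with requests when scraping.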

Day 4: Selenium Scrolling and Common Anti-Scraping Measures

1. Analyzing CNKI page data

'''
Author:KathAmy
Date:2022/8/16  9:15
Wear out the keyboard and improve together!
'''
from selenium.webdriver import Chrome
from time import sleep
from bs4 import BeautifulSoup


def analysis_data(html: str):  # Parse the data: title, author, organization
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select_one('.wx-tit>h1')
    if title:
        title = title.text

    author = soup.select_one('#authorpart a')
    if author:
        author = author.text

    organization = soup.select_one('.wx-tit>h3:nth-child(3) a')
    if organization:
        organization = organization.text

    print(title)
    print(author)
    print(organization)
    print('-----------------------------------divider-----------------------------------')


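def get_paper(key_word='数据分析'):  # default keyword means "data analysis"
    # 1. Create the browser, open CNKI, and enter the search keyword
    # Note: the find_element_by_* helpers were removed in Selenium 4.3;
    # on newer versions use b.find_element(By.ID, ...) with selenium.webdriver.common.by.By
    global b
    b = Chrome()
    b.get('https://www.cnki.net/')
    b.find_element_by_id('txt_SearchText').send_keys(f'{key_word}\n')
    sleep(1)

    # 2. Get the search results
    for x in range(5):
        # Get one page of data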
        all_a = b.find_elements_by_css_selector('.result-table-list .name>a'