A General-Purpose Template for Multithreaded Data Scraping in Python

q56731523

Tags: python, programming languages, golang, backend, web crawler

Article link: https://blog.csdn.net/weixin_44617651/article/details/134310188

First, we import the libraries we need: requests and BeautifulSoup. The requests library sends HTTP requests, and the BeautifulSoup library parses HTML documents.

import requests
from bs4 import BeautifulSoup

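Next, we define a function that sends an HTTP request and returns the response. It uses the get method of the requests library to send a GET request to the given URL, routed through the proxy we pass in.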

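def get(url, proxies):
    # Send a browser-like User-Agent; some sites reject the default one used by requests.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    # A timeout keeps a stalled connection from hanging the caller; 10 seconds is an arbitrary choice.
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    return response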
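
The get function above makes a single attempt, so one dropped connection or proxy hiccup loses the page. A minimal sketch of a retry wrapper with exponential backoff follows. The name fetch_with_retry and its parameters are hypothetical and not part of the original post, and the wrapper accepts any zero-argument callable rather than depending on requests directly.

```python
import time


def fetch_with_retry(fetch, retries=3, backoff=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or `retries` attempts are used up.

    fetch   -- zero-argument callable, e.g. lambda: get(url, proxies)
    retries -- total number of attempts (hypothetical default)
    backoff -- base delay in seconds, doubled after each failure
    sleep   -- injectable for testing; defaults to time.sleep
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            # In real use you would likely narrow this to
            # requests.RequestException.
            if attempt == retries - 1:
                raise
            # Wait backoff, 2*backoff, 4*backoff, ... before retrying.
            sleep(backoff * 2 ** attempt)
```

With the functions from this article, a call could look like fetch_with_retry(lambda: get(url, proxies)). Catching every exception is a simplification made for the sketch.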

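Next, we define a function that parses the response and extracts the information we need. It uses BeautifulSoup's find_all method to find every paragraph, then a list comprehension to extract the text of each one.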

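def parse(response):
    # Parse the HTML with Python's built-in parser (no extra dependency needed).
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect every <p> element and keep only its text content.
    paragraphs = soup.find_all('p')
    text = [p.text for p in paragraphs]
    return text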
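
The list returned by parse often contains empty strings, stray whitespace, and repeated paragraphs. A small post-processing step can tidy it up before the text is stored or printed. The helper name clean_paragraphs is hypothetical and not from the original post; this is one possible sketch, using only the standard library.

```python
def clean_paragraphs(paragraphs):
    """Normalize whitespace and drop empty or repeated paragraphs."""
    seen = set()
    cleaned = []
    for raw in paragraphs:
        # Collapse runs of spaces, tabs, and newlines into single spaces.
        text = " ".join(raw.split())
        # Keep only non-empty text that has not been seen before.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Sample input shaped like the output of parse().
sample = ["  Hello\n world ", "", "Hello world", "\t", "Second paragraph"]
print(clean_paragraphs(sample))  # ['Hello world', 'Second paragraph']
```

The original order of first occurrences is preserved, which matters when the paragraphs form a readable article.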

Finally, we call these functions to send the request, parse the response, and print the extracted text.

# Proxy address as given in the original post; replace host and port with your own proxy.
proxies = {
    'http': 'http://duoip:8000',
    'https': 'http://duoip:8000'
}

url = 'target website'  # placeholder: replace with the actual URL of the target site
response = get(url, proxies)
text = parse(response)
print(text)
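
The script above fetches a single URL sequentially. Since the title promises a multithreaded template, here is one way the same pieces might be run concurrently with the standard library's concurrent.futures.ThreadPoolExecutor. The function name crawl_all, the max_workers default, and the fetch parameter are assumptions made for this sketch, not part of the original post.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a thread pool.

    fetch is any callable taking one URL, for example
    lambda u: parse(get(u, proxies)).
    Returns (results, errors): two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every URL up front and remember which future belongs to it.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # Handle each task as soon as it finishes, in completion order.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed page should not abort the whole crawl.
                errors[url] = exc
    return results, errors
```

With the functions from this article, a call could look like crawl_all(urls, lambda u: parse(get(u, proxies))), where urls is a list of target addresses. Threads suit this workload because each task spends most of its time waiting on the network; keep max_workers modest so the target site and the proxy are not overloaded.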