A Basic Approach to Scraping Web Page Data with Python

qq_29798761

Published 2024-08-22 17:52:05

Article link: https://blog.csdn.net/qq_29798761/article/details/141434291

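1. Install the required tools and libraries

Python: used to write the scraper script.
requests: a library for sending HTTP requests.
BeautifulSoup: a library for parsing HTML documents (installed as the beautifulsoup4 package).
pymysql: a Python library for connecting to and working with MySQL databases.

You can install these libraries with pip: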

pip install requests beautifulsoup4 pymysql

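2. Write the scraper script

Create a Python script that scrapes data from the target website. This typically involves sending an HTTP request, receiving the response, and then parsing the HTML content with BeautifulSoup.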

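import requests
from bs4 import BeautifulSoup


def scrape_data(url):
    # Send the request; the timeout keeps the script from hanging on a slow server
    response = requests.get(url, timeout=10)
    # Raise an HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract data according to the structure of the site.
    # For example, collect every article title.
    titles = soup.find_all('h2')  # only an example; the actual tag may differ
    return [title.get_text(strip=True) for title in titles]


if __name__ == '__main__':
    url = 'YOUR_TARGET_WEBSITE_URL'
    titles = scrape_data(url)
    print(titles)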
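
Scraped heading text often contains stray whitespace or repeated entries, which would end up as duplicate rows once the titles are stored. The following is a minimal sketch of a cleanup step, not part of the original script; the helper name clean_titles is hypothetical, and it assumes the input is a plain list of strings such as the one returned by scrape_data.

```python
def clean_titles(titles):
    """Normalize whitespace, drop empty strings, and remove duplicates.

    The first occurrence of each title is kept, so the original order survives.
    """
    seen = set()
    cleaned = []
    for title in titles:
        # Collapse runs of spaces, tabs, and newlines into single spaces
        text = ' '.join(title.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

It could be applied as titles = clean_titles(scrape_data(url)) before the list is printed or stored.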

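Practice: extracting a page's friendly links (partner links)

This exercise adapts the code above to collect the title and href values of every <a> tag under an element with a specific class on a given HTML page, which requires a small change in how BeautifulSoup is used. The example below locates a parent element by its class name, iterates over all <a> tags under it, and extracts their title and href attributes.

Note that http://**.html is a placeholder URL, so the actual HTML structure has not been verified. The example assumes that the <a> tags sit inside a <div> element with a class name such as class-name. If your actual HTML structure differs, adjust the class name or other selectors accordingly.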

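import requests
from bs4 import BeautifulSoup


def scrape_data(url):
    # Send the HTTP GET request; the timeout avoids waiting forever
    response = requests.get(url, timeout=10)
    # Check that the request succeeded; raises HTTPError on a 4xx/5xx status
    response.raise_for_status()

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Locate the parent element by class name; 'class-name' is a placeholder.
    # Passing a list to class_ matches elements that have ANY of the listed
    # classes; to require several classes at once, use
    # soup.select_one('.class1.class2') instead.
    parent_element = soup.find(class_='class-name')

    # If the parent element was not found, return an empty list
    # (raise an exception here instead if that suits your needs better)
    if parent_element is None:
        return []

    # Walk through every <a> tag under the parent element
    links = []
    for a_tag in parent_element.find_all('a'):
        # A missing title becomes an empty string; a missing href becomes None
        title = a_tag.get('title', '')
        href = a_tag.get('href')
        # Store the extracted values as a dictionary
        links.append({'title': title, 'href': href})
    return links


if __name__ == '__main__':
    url = 'http://**.html'
    links = scrape_data(url)
    # Print the extracted link information
    for link in links:
        print(link)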

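In this example, the scrape_data function first sends an HTTP GET request to the given URL and parses the returned HTML with BeautifulSoup. It then looks for a parent element with the specified class name and iterates over all <a> tags under it. For each <a> tag it extracts the title and href attributes and stores them as a dictionary in a list. Finally, the function returns the list of link information, or an empty list if the parent element was not found.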
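
The href values collected this way may be relative paths, in-page anchors, or missing altogether. As a hedged follow-up, the sketch below resolves them against the page URL using the standard library's urllib.parse. The helper name normalize_links and the example.com sample data are hypothetical, and the input is assumed to be the list of {'title': ..., 'href': ...} dictionaries returned by scrape_data.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(links, base_url):
    """Resolve relative hrefs against base_url and drop unusable entries."""
    cleaned = []
    for link in links:
        href = (link.get('href') or '').strip()
        # Skip anchors without a target, in-page fragments, and script links
        if not href or href.startswith('#') or href.lower().startswith('javascript:'):
            continue
        absolute = urljoin(base_url, href)
        # Keep only web links; this drops schemes such as mailto: or tel:
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue
        cleaned.append({'title': link.get('title', ''), 'href': absolute})
    return cleaned


if __name__ == '__main__':
    # Hypothetical sample data; example.com stands in for the real site
    sample = [
        {'title': 'Partner A', 'href': '/partners/a.html'},
        {'title': 'Partner B', 'href': 'https://b.example.org/'},
        {'title': '', 'href': None},
        {'title': 'Top', 'href': '#top'},
    ]
    for link in normalize_links(sample, 'https://example.com/links/index.html'):
        print(link)
```

Passing the same URL that was given to scrape_data as base_url keeps the resolved links consistent with the page they came from.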

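3. Store the data in a MySQL database

Next, insert the scraped data into a MySQL database. The code below assumes that a database named mydb already exists and contains a table mytable with a title column.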

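import pymysql


def store_in_database(data):
    # Connection settings are placeholders; replace them with your own
    connection = pymysql.connect(host='localhost', user='root', password='password',
                                 database='mydb', charset='utf8mb4')
    try:
        with connection.cursor() as cursor:
            # Parameterized insert: the driver escapes each value
            sql = "INSERT INTO `mytable` (`title`) VALUES (%s)"
            cursor.executemany(sql, [(title,) for title in data])
        connection.commit()
    finally:
        connection.close()


# `titles` is the list returned by scrape_data() in step 2
store_in_database(titles)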
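
The function above stores only titles, while the friendly-links example produces dictionaries with both a title and an href. One possible way to store those is sketched below under stated assumptions: the table name friend_links, its title and href columns, and the helper names links_to_rows and store_links are all hypothetical, and the table is assumed to exist already.

```python
def links_to_rows(links):
    """Turn [{'title': ..., 'href': ...}] into (title, href) tuples.

    Entries without an href are skipped; a missing title becomes ''.
    """
    return [(link.get('title') or '', link['href'])
            for link in links if link.get('href')]


def store_links(links, **connect_kwargs):
    """Insert links into the hypothetical `friend_links` table.

    Returns the number of rows sent to the database.
    """
    import pymysql  # imported here so links_to_rows() works without the driver

    rows = links_to_rows(links)
    if not rows:
        return 0
    connection = pymysql.connect(**connect_kwargs)
    try:
        with connection.cursor() as cursor:
            # Parameterized insert: the driver escapes each value
            sql = "INSERT INTO `friend_links` (`title`, `href`) VALUES (%s, %s)"
            cursor.executemany(sql, rows)
        connection.commit()
    finally:
        connection.close()
    return len(rows)
```

With the connection settings used earlier, a call might look like store_links(links, host='localhost', user='root', password='password', database='mydb').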