A Bumpy Session with Cursor: Generating Code to Scrape the Database Kernel Monthly Report

Tonight I spent an hour sparring with Cursor before finally getting it to scrape the Database Kernel Monthly Report (数据库内核月报). The process was considerably bumpier than the earlier CSDN exercise, touched on more topics, and drove home the importance of prompt wording, requirement decomposition, and analysis.

The full exchange is long, so for now this is a quick record of the chat; I will clean it up and expand it when I have time.

1. The initial requirement

You are a Python web-scraping expert. You need to generate Python code that collects every article link from http://mysql.taobao.org/monthly/ and saves the documents in Markdown format in a local taobao directory.
Additional notes:
1. Work out how the page is rendered so that links are not extracted incorrectly.
2. To improve throughput, consider multithreading and other speed-up techniques.
3. Avoid being blocked by anti-scraping measures.

The code it suggested:

import os
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE_URL = "http://mysql.taobao.org/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
MAX_RETRIES = 3
THREADS = 8

os.makedirs(SAVE_DIR, exist_ok=True)

def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
    for i in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
            if resp.status_code == 200:
                return resp
        except Exception:
            if i == retries - 1:
                raise
            time.sleep(random.uniform(1, 3))
    return None

def get_all_links():
    resp = get_with_retry(BASE_URL)
    soup = BeautifulSoup(resp.text, "lxml")
    links = []
    for a in soup.select("a"):
        href = a.get("href", "")
        # 文章链接通常形如 202x-xx.html
        if re.match(r"\d{4}-\d{2}\.html", href):
            links.append(BASE_URL + href)
    return links

def save_article(url):
    try:
        resp = get_with_retry(url)
        soup = BeautifulSoup(resp.text, "lxml")
        # 获取文章标题
        title = soup.title.text.strip() if soup.title else url.split("/")[-1]
        # 获取正文内容
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
        # 生成markdown内容
        md_content = f"# {title}\n\n原文链接: [{url}]({url})\n\n---\n\n{content}"
        # 文件名
        filename = os.path.join(SAVE_DIR, url.split("/")[-1].replace(".html", ".md"))
        with open(filename, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filename}")
        # 随机延时,防止被封
        time.sleep(random.uniform(0.5, 1.5))
    except Exception as e:
        print(f"下载失败: {url},原因: {e}")

def main():
    links = get_all_links()
    print(f"共发现{len(links)}篇文章,开始下载...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url) for url in links]
        for future in as_completed(futures):
            pass
    print("全部下载完成。")

if __name__ == "__main__":
    main()

2. "Found 0 articles"

Fetching the articles failed: the script printed "共发现0篇文章,开始下载" (found 0 articles, starting download). Analyze the cause and improve the implementation.

Problem analysis:

  • The scraper most likely reports "共发现0篇文章" (0 articles found) because the article-link format in the page source does not match the regular expression, or because a redirect or anti-scraping measure altered the page content, so no links were captured.

Suggestions:

  • Inspect the page source: print the first few hundred characters and confirm the real format of the article links.
  • Loosen the regex: some links may not look like 202x-xx.html but like 202x-xx/ or something else entirely.
  • Check encoding and anti-scraping behavior: some sites return different content depending on the User-Agent, or return garbled text.

Debug snippet (run this first and look at the output):

import requests

url = "http://mysql.taobao.org/monthly/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=headers)
print(resp.status_code)
print(resp.apparent_encoding)
print(resp.text[:1000])  # 打印前1000字符

Run the code above and paste the output, so I can fine-tune the regex and parsing logic and make sure every article link is captured.

If you would rather have a more robust scraper right away, tell me which strategy you prefer (for example, a looser regex, or simply grabbing every link that ends in .html).

The result:

D:\book_download\venv\Scripts\python.exe D:/book_download/book_download/book/ai_get_url.py
200
utf-8
<!DOCTYPE html>
<html>

<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">

<!-- <title>数据库内核月报</title> -->
<meta name="description" content="数据库内核月报, 来着阿里云 PolarDB 数据库内核团队。">

<link rel="stylesheet" href="/monthly/css/typo.css">
<link rel="stylesheet" href="/monthly/css/animate.css">
<link rel="stylesheet" href="/monthly/css/main.css">
<link rel="canonical" href="http://mysql.taobao.org//monthly/">
<link rel="alternate" type="application/rss+xml" title="数据库内核月报" href="http://mysql.taobao.org//monthly/feed.xml" />

<link rel="stylesheet" href="//cdn.staticfile.org/highlight.js/8.3/styles/tomorrow.min.css">
<script src="/monthly/js/highlight.min.js"></script>
<!-- <link rel="stylesheet" href="/monthly/themes/tomorrow.css">
<script src="/monthly/highlight/highlight.pack.js"> -->
<script>hl

Process finished with exit code 0

    Thanks for the output!

    The page returns status 200 with utf-8 encoding and the source looks normal, but you only pasted the first few lines, which do not yet include the article links.

    Still, this already shows the page is not being redirected or blocked by anti-scraping measures.

    Next debug snippet:

    import requests
    from bs4 import BeautifulSoup
    
    url = "http://mysql.taobao.org/monthly/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, "lxml")
    
    # 打印所有a标签的href
    for a in soup.find_all("a"):
        print(a.get("href"))

    Run this and paste the output, so I can write the exact regex and scraper code needed to capture all the article links.

    The debug code returned:

    /monthly/
    /monthly/2025/04
    /monthly/2025/03
    /monthly/2025/02
    /monthly/2025/01
    /monthly/2024/12
    /monthly/2024/11
    /monthly/2024/10

    ...

    Based on these hrefs, Cursor came back with updated code that matches the /monthly/YYYY/MM directory links:
    import os
    import re
    import time
    import random
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor, as_completed
    
    BASE_URL = "http://mysql.taobao.org"
    MONTHLY_URL = BASE_URL + "/monthly/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    MAX_RETRIES = 3
    THREADS = 8
    
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
        for i in range(retries):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
                if resp.status_code == 200:
                    return resp
            except Exception:
                if i == retries - 1:
                    raise
                time.sleep(random.uniform(1, 3))
        return None
    
    def get_all_article_links():
        resp = get_with_retry(MONTHLY_URL)
        soup = BeautifulSoup(resp.text, "lxml")
        links = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/YYYY/MM 目录
            m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
            if m:
                links.add(BASE_URL + href)
        return sorted(links)
    
    def save_article(url):
        try:
            resp = get_with_retry(url)
            soup = BeautifulSoup(resp.text, "lxml")
            # 获取文章标题
            title = soup.title.text.strip() if soup.title else url.split("/")[-2] + "-" + url.split("/")[-1]
            # 获取正文内容
            content_div = soup.find("div", class_="content")
            if not content_div:
                content_div = soup.find("div", id="content")
            content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
            # 生成markdown内容
            md_content = f"# {title}\n\n原文链接: [{url}]({url})\n\n---\n\n{content}"
            # 文件名
            filename = os.path.join(SAVE_DIR, url.rstrip("/").split("/")[-2] + "-" + url.rstrip("/").split("/")[-1] + ".md")
            with open(filename, "w", encoding="utf-8") as f:
                f.write(md_content)
            print(f"已保存: {filename}")
            time.sleep(random.uniform(0.5, 1.5))
        except Exception as e:
            print(f"下载失败: {url},原因: {e}")
    
    def main():
        links = get_all_article_links()
        print(f"共发现{len(links)}篇文章,开始下载...")
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [executor.submit(save_article, url) for url in links]
            for future in as_completed(futures):
                pass
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()

    3. Scraping directory links instead of article content

    There are two problems:
    1. The Chinese text in the scraped articles is garbled, for example:
    05
    MySQL · 5.7优化 · Metadata Lock子系统的优化

    06
    MySQL · 5.7特性 · 在线Truncate undo log 表空间

    2. You scraped the article links listed in each directory, not the actual article content inside them.

    Your feedback is spot on. Here is a detailed analysis of both problems and how to fix them:


    1. Garbled Chinese text

    Cause:

    • Garbled output usually means requests did not detect the page encoding correctly, so resp.text is decoded with the wrong charset.
    • The page declares <meta charset="utf-8">, but requests sometimes guesses a different encoding, so it needs to be set explicitly.

    Fix:

    • After receiving the response, force resp.encoding = 'utf-8' before reading resp.text (see the short sketch below).
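
    A minimal sketch of this fix; the resp.apparent_encoding fallback is only an illustrative extra, not something the generated code relies on:

    import requests

    def fetch_text(url, headers=None):
        resp = requests.get(url, headers=headers, timeout=10)
        # The site declares <meta charset="utf-8">, so decode as UTF-8 first.
        resp.encoding = "utf-8"
        text = resp.text
        # Assumption: if forced UTF-8 still yields replacement characters,
        # fall back to the encoding requests detects from the response body.
        if "\ufffd" in text and resp.apparent_encoding:
            resp.encoding = resp.apparent_encoding
            text = resp.text
        return text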

    2. Scraping the links in each directory rather than the actual article content

    Cause:

    • What is being fetched now is each month's directory page (such as /monthly/2025/04), and the real content is in that page's body (usually <div class="content">), not in the other links on the page.
    • You only need to scrape the body of each directory page; there is no need to follow the sub-links.

    So the revised code:
    • Only scrapes the body of each /monthly/YYYY/MM directory page.
    • Forces the encoding to fix the garbled Chinese.
    import os
    import re
    import time
    import random
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor, as_completed
    
    BASE_URL = "http://mysql.taobao.org"
    MONTHLY_URL = BASE_URL + "/monthly/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    MAX_RETRIES = 3
    THREADS = 8
    
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
        for i in range(retries):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
                resp.encoding = 'utf-8'  # 强制设定编码,防止中文乱码
                if resp.status_code == 200:
                    return resp
            except Exception:
                if i == retries - 1:
                    raise
                time.sleep(random.uniform(1, 3))
        return None
    
    def get_all_article_links():
        resp = get_with_retry(MONTHLY_URL)
        soup = BeautifulSoup(resp.text, "lxml")
        links = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/YYYY/MM 目录
            m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
            if m:
                links.add(BASE_URL + href)
        return sorted(links)
    
    def save_article(url):
        try:
            resp = get_with_retry(url)
            soup = BeautifulSoup(resp.text, "lxml")
            # 获取文章标题
            title = soup.title.text.strip() if soup.title else url.split("/")[-2] + "-" + url.split("/")[-1]
            # 获取正文内容
            content_div = soup.find("div", class_="content")
            if not content_div:
                content_div = soup.find("div", id="content")
            if content_div:
                content = content_div.get_text("\n", strip=True)
            else:
                content = soup.get_text("\n", strip=True)
            # 生成markdown内容
            md_content = f"# {title}\n\n原文链接: [{url}]({url})\n\n---\n\n{content}"
            # 文件名
            filename = os.path.join(SAVE_DIR, url.rstrip("/").split("/")[-2] + "-" + url.rstrip("/").split("/")[-1] + ".md")
            with open(filename, "w", encoding="utf-8") as f:
                f.write(md_content)
            print(f"已保存: {filename}")
            time.sleep(random.uniform(0.5, 1.5))
        except Exception as e:
            print(f"下载失败: {url},原因: {e}")
    
    def main():
        links = get_all_article_links()
        print(f"共发现{len(links)}篇文章,开始下载...")
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [executor.submit(save_article, url) for url in links]
            for future in as_completed(futures):
                pass
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()

    3.1 Breaking the requirement down further

    The Chinese display problem is fixed, but you are still scraping the article titles listed in the directory rather than the article content. Let's do this: first try to fetch the article content under the @http://mysql.taobao.org/monthly/2024/12/ directory, still saving it in Markdown format in the taobao directory.

    import os
    import requests
    from bs4 import BeautifulSoup
    
    url = "http://mysql.taobao.org/monthly/2024/12/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    resp = requests.get(url, headers=headers, timeout=10)
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text, "lxml")
    
    # 获取标题
    title = soup.find("h1")
    md_content = f"# {title.text.strip() if title else '数据库内核月报'}\n\n原文链接: [{url}]({url})\n\n---\n\n"
    
    # 获取所有条目(通常是h3或h2标题+后续内容)
    for tag in soup.find_all(['h3', 'h2']):
        # 只抓取以“# XX”开头的条目
        if tag.text.strip().startswith("#"):
            md_content += f"## {tag.text.strip()}\n\n"
            # 收集该标题下的所有段落,直到下一个h3/h2
            for sibling in tag.find_next_siblings():
                if sibling.name in ['h3', 'h2']:
                    break
                # 只抓正文段落和列表
                if sibling.name in ['p', 'ul', 'ol', 'pre', 'blockquote']:
                    md_content += sibling.get_text("\n", strip=True) + "\n\n"
    
    filename = os.path.join(SAVE_DIR, "2024-12.md")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(md_content)
    
    print(f"已保存: {filename}")

    There's still a problem: you never open the sub-links inside the directory. For example, the first article opens at @http://mysql.taobao.org/monthly/2024/12/01/ , so you should open each link in turn and fetch the article content.

    import os
    import re
    import requests
    from bs4 import BeautifulSoup
    
    BASE_URL = "http://mysql.taobao.org"
    MONTH_URL = "/monthly/2024/12/"
    FULL_MONTH_URL = BASE_URL + MONTH_URL
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def get_soup(url):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.encoding = 'utf-8'
        return BeautifulSoup(resp.text, "lxml")
    
    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/2024/12/01/ 这种子文章链接
            if re.match(r"/monthly/2024/12/\d{2}/$", href):
                # 绝对化
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))
    
    def get_article_content(article_url):
        soup = get_soup(article_url)
        # 标题
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        # 正文
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
        return f"## {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n\n---\n"
    
    def main():
        article_links = get_article_links(FULL_MONTH_URL)
        print(f"共发现{len(article_links)}篇文章")
        md_content = f"# 数据库内核月报 2024/12\n\n原文目录: [{FULL_MONTH_URL}]({FULL_MONTH_URL})\n\n---\n\n"
        for link in article_links:
            print(f"抓取: {link}")
            md_content += get_article_content(link)
        filename = os.path.join(SAVE_DIR, "2024-12.md")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filename}")
    
    if __name__ == "__main__":
        main()

    That's it, nicely done~ But you need to save each article in its own .md file, using the article title as the file name. Keep it up.

    import os
    import re
    import requests
    from bs4 import BeautifulSoup
    
    BASE_URL = "http://mysql.taobao.org"
    MONTH_URL = "/monthly/2024/12/"
    FULL_MONTH_URL = BASE_URL + MONTH_URL
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def safe_filename(title):
        # 只保留中英文、数字、下划线和横线
        return re.sub(r'[\\/:*?"<>|]', '', title).strip()
    
    def get_soup(url):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.encoding = 'utf-8'
        return BeautifulSoup(resp.text, "lxml")
    
    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/2024/12/01/ 这种子文章链接
            if re.match(r"/monthly/2024/12/\d{2}/$", href):
                # 绝对化
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))
    
    def save_article(article_url):
        soup = get_soup(article_url)
        # 标题
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        filename = safe_filename(title) + ".md"
        # 正文
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
        md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
        filepath = os.path.join(SAVE_DIR, filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filepath}")
    
    def main():
        article_links = get_article_links(FULL_MONTH_URL)
        print(f"共发现{len(article_links)}篇文章")
        for link in article_links:
            save_article(link)
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()
    

    4. Formatting the content

    The content is fine. Now format the text and code inside so it looks cleaner.

    import os
    import re
    import requests
    from bs4 import BeautifulSoup
    
    BASE_URL = "http://mysql.taobao.org"
    MONTH_URL = "/monthly/2024/12/"
    FULL_MONTH_URL = BASE_URL + MONTH_URL
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def safe_filename(title):
        # 只保留中英文、数字、下划线和横线
        return re.sub(r'[\\/:*?"<>|]', '', title).strip()
    
    def get_soup(url):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.encoding = 'utf-8'
        return BeautifulSoup(resp.text, "lxml")
    
    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/2024/12/01/ 这种子文章链接
            if re.match(r"/monthly/2024/12/\d{2}/$", href):
                # 绝对化
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))
    
    def format_content(content_div):
        md_lines = []
        for elem in content_div.children:
            if elem.name == 'pre':
                # 代码块
                code = elem.get_text("\n", strip=True)
                md_lines.append(f'\n```\n{code}\n```\n')
            elif elem.name in ['ul', 'ol']:
                # 列表
                for li in elem.find_all('li'):
                    md_lines.append(f'- {li.get_text(strip=True)}')
                md_lines.append('')
            elif elem.name == 'blockquote':
                # 引用
                quote = elem.get_text("\n", strip=True)
                md_lines.append(f'> {quote}\n')
            elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                # 标题
                level = int(elem.name[1])
                md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
            elif elem.name == 'p' or elem.name is None:
                # 普通段落
                text = elem.get_text("\n", strip=True)
                if text:
                    md_lines.append(f'{text}\n')
        return '\n'.join(md_lines)
    
    def save_article(article_url):
        soup = get_soup(article_url)
        # 标题
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        filename = safe_filename(title) + ".md"
        # 正文
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        if content_div:
            content = format_content(content_div)
        else:
            content = soup.get_text("\n", strip=True)
        md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
        filepath = os.path.join(SAVE_DIR, filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filepath}")
    
    def main():
        article_links = get_article_links(FULL_MONTH_URL)
        print(f"共发现{len(article_links)}篇文章")
        for link in article_links:
            save_article(link)
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()
    

    5. Putting it together: back to the original requirement

    Good, that's it. Combining the ideas above, go back to the original requirement: collect all the article links from http://mysql.taobao.org/monthly/ and save the documents in Markdown format in the local taobao directory; everything else stays the same.

    import os
    import re
    import time
    import random
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor, as_completed
    
    BASE_URL = "http://mysql.taobao.org"
    MONTHLY_URL = BASE_URL + "/monthly/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    THREADS = 8
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def safe_filename(title):
        return re.sub(r'[\\/:*?"<>|]', '', title).strip()
    
    def get_soup(url):
        for _ in range(3):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=10)
                resp.encoding = 'utf-8'
                if resp.status_code == 200:
                    return BeautifulSoup(resp.text, "lxml")
            except Exception:
                time.sleep(random.uniform(1, 2))
        return None
    
    def get_all_month_links():
        soup = get_soup(MONTHLY_URL)
        links = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/YYYY/MM 目录
            m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
            if m:
                links.add(BASE_URL + href + "/")
        return sorted(links)
    
    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        if not soup:
            return links
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/YYYY/MM/XX/ 这种子文章链接
            m = re.match(r"/monthly/\d{4}/\d{2}/\d{2}/$", href)
            if m:
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))
    
    def format_content(content_div):
        md_lines = []
        for elem in content_div.children:
            if elem.name == 'pre':
                code = elem.get_text("\n", strip=True)
                md_lines.append(f'\n```\n{code}\n```\n')
            elif elem.name in ['ul', 'ol']:
                for li in elem.find_all('li'):
                    md_lines.append(f'- {li.get_text(strip=True)}')
                md_lines.append('')
            elif elem.name == 'blockquote':
                quote = elem.get_text("\n", strip=True)
                md_lines.append(f'> {quote}\n')
            elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                level = int(elem.name[1])
                md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
            elif elem.name == 'p' or elem.name is None:
                text = elem.get_text("\n", strip=True)
                if text:
                    md_lines.append(f'{text}\n')
        return '\n'.join(md_lines)
    
    def save_article(article_url):
        soup = get_soup(article_url)
        if not soup:
            print(f"获取失败: {article_url}")
            return
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        filename = safe_filename(title) + ".md"
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        if content_div:
            content = format_content(content_div)
        else:
            content = soup.get_text("\n", strip=True)
        md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
        filepath = os.path.join(SAVE_DIR, filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filepath}")
        time.sleep(random.uniform(0.5, 1.2))
    
    def main():
        month_links = get_all_month_links()
        print(f"共发现{len(month_links)}个月份")
        all_article_links = []
        for month_url in month_links:
            article_links = get_article_links(month_url)
            print(f"{month_url} 发现{len(article_links)}篇文章")
            all_article_links.extend(article_links)
        print(f"总共发现{len(all_article_links)}篇文章,开始下载...")
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [executor.submit(save_article, url) for url in all_article_links]
            for future in as_completed(futures):
                pass
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()
    

    6. Refining the directory layout and capturing images

    Nicely done~ The requirement is basically met. A few refinements:
    1. Create subdirectories based on the year and month in the link, and put each month's articles inside.
    2. Only scrape articles from 2024 and 2025.
    3. If an article contains tables, images, and so on, capture those as well.

    import os
    import re
    import time
    import random
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.parse import urljoin, urlparse
    
    BASE_URL = "http://mysql.taobao.org"
    MONTHLY_URL = BASE_URL + "/monthly/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    THREADS = 8
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def safe_filename(title):
        return re.sub(r'[\\/:*?"<>|]', '', title).strip()
    
    def get_soup(url):
        for _ in range(3):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=10)
                resp.encoding = 'utf-8'
                if resp.status_code == 200:
                    return BeautifulSoup(resp.text, "lxml")
            except Exception:
                time.sleep(random.uniform(1, 2))
        return None
    
    def get_all_month_links():
        soup = get_soup(MONTHLY_URL)
        links = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 只抓2024和2025年
            m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
            if m:
                links.add(BASE_URL + href + "/")
        return sorted(links)
    
    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        if not soup:
            return links
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/2024/12/01/ 这种子文章链接
            m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
            if m:
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))
    
    def download_image(img_url, save_path):
        try:
            resp = requests.get(img_url, headers=HEADERS, timeout=10)
            if resp.status_code == 200:
                with open(save_path, 'wb') as f:
                    f.write(resp.content)
                return True
        except Exception:
            pass
        return False
    
    def format_table(table):
        rows = table.find_all('tr')
        md = []
        for i, row in enumerate(rows):
            cols = row.find_all(['td', 'th'])
            line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
            md.append(line)
            if i == 0:
                md.append('|' + '---|' * len(cols))
        return '\n'.join(md) + '\n'
    
    def format_content(content_div, article_url, save_dir):
        md_lines = []
        for elem in content_div.children:
            if elem.name == 'pre':
                code = elem.get_text("\n", strip=True)
                md_lines.append(f'\n```\n{code}\n```\n')
            elif elem.name in ['ul', 'ol']:
                for li in elem.find_all('li'):
                    md_lines.append(f'- {li.get_text(strip=True)}')
                md_lines.append('')
            elif elem.name == 'blockquote':
                quote = elem.get_text("\n", strip=True)
                md_lines.append(f'> {quote}\n')
            elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                level = int(elem.name[1])
                md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
            elif elem.name == 'p' or elem.name is None:
                text = elem.get_text("\n", strip=True)
                if text:
                    md_lines.append(f'{text}\n')
            elif elem.name == 'table':
                md_lines.append(format_table(elem))
            elif elem.name == 'img':
                img_src = elem.get('src')
                if img_src:
                    img_url = urljoin(article_url, img_src)
                    img_name = os.path.basename(urlparse(img_url).path)
                    img_save_path = os.path.join(save_dir, img_name)
                    if download_image(img_url, img_save_path):
                        md_lines.append(f'![{img_name}]({img_name})\n')
        # 处理段落内图片
        for img in content_div.find_all('img'):
            img_src = img.get('src')
            if img_src:
                img_url = urljoin(article_url, img_src)
                img_name = os.path.basename(urlparse(img_url).path)
                img_save_path = os.path.join(save_dir, img_name)
                if download_image(img_url, img_save_path):
                    # 替换原img为markdown图片语法
                    img.replace_with(f'![{img_name}]({img_name})')
        return '\n'.join(md_lines)
    
    def save_article(article_url, year, month):
        soup = get_soup(article_url)
        if not soup:
            print(f"获取失败: {article_url}")
            return
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        filename = safe_filename(title) + ".md"
        # 按年份和月份创建子目录
        subdir = os.path.join(SAVE_DIR, year, month)
        os.makedirs(subdir, exist_ok=True)
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        if content_div:
            content = format_content(content_div, article_url, subdir)
        else:
            content = soup.get_text("\n", strip=True)
        md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
        filepath = os.path.join(subdir, filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filepath}")
        time.sleep(random.uniform(0.5, 1.2))
    
    def main():
        month_links = get_all_month_links()
        print(f"共发现{len(month_links)}个月份")
        all_tasks = []
        for month_url in month_links:
            m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
            if not m:
                continue
            year, month = m.group(1), m.group(2)
            article_links = get_article_links(month_url)
            print(f"{month_url} 发现{len(article_links)}篇文章")
            for link in article_links:
                all_tasks.append((link, year, month))
        print(f"总共发现{len(all_tasks)}篇文章,开始下载...")
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
            for future in as_completed(futures):
                pass
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()
    

    Hmm, don't save the images as separate files; embed them back into the .md documents so they are easier for readers to view.

    import os
    import re
    import time
    import random
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.parse import urljoin
    
    BASE_URL = "http://mysql.taobao.org"
    MONTHLY_URL = BASE_URL + "/monthly/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    THREADS = 8
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def safe_filename(title):
        return re.sub(r'[\\/:*?"<>|]', '', title).strip()
    
    def get_soup(url):
        for _ in range(3):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=10)
                resp.encoding = 'utf-8'
                if resp.status_code == 200:
                    return BeautifulSoup(resp.text, "lxml")
            except Exception:
                time.sleep(random.uniform(1, 2))
        return None
    
    def get_all_month_links():
        soup = get_soup(MONTHLY_URL)
        links = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
            if m:
                links.add(BASE_URL + href + "/")
        return sorted(links)
    
    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        if not soup:
            return links
        for a in soup.find_all("a", href=True):
            href = a["href"]
            m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
            if m:
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))
    
    def format_table(table):
        rows = table.find_all('tr')
        md = []
        for i, row in enumerate(rows):
            cols = row.find_all(['td', 'th'])
            line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
            md.append(line)
            if i == 0:
                md.append('|' + '---|' * len(cols))
        return '\n'.join(md) + '\n'
    
    def format_content(content_div, article_url):
        md_lines = []
        for elem in content_div.children:
            if elem.name == 'pre':
                code = elem.get_text("\n", strip=True)
                md_lines.append(f'\n```\n{code}\n```\n')
            elif elem.name in ['ul', 'ol']:
                for li in elem.find_all('li'):
                    md_lines.append(f'- {li.get_text(strip=True)}')
                md_lines.append('')
            elif elem.name == 'blockquote':
                quote = elem.get_text("\n", strip=True)
                md_lines.append(f'> {quote}\n')
            elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                level = int(elem.name[1])
                md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
            elif elem.name == 'p' or elem.name is None:
                text = elem.get_text("\n", strip=True)
                if text:
                    md_lines.append(f'{text}\n')
            elif elem.name == 'table':
                md_lines.append(format_table(elem))
            elif elem.name == 'img':
                img_src = elem.get('src')
                if img_src:
                    img_url = urljoin(article_url, img_src)
                    md_lines.append(f'![image]({img_url})\n')
        # 处理段落内图片
        for img in content_div.find_all('img'):
            img_src = img.get('src')
            if img_src:
                img_url = urljoin(article_url, img_src)
                img.replace_with(f'![image]({img_url})')
        return '\n'.join(md_lines)
    
    def save_article(article_url, year, month):
        soup = get_soup(article_url)
        if not soup:
            print(f"获取失败: {article_url}")
            return
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        filename = safe_filename(title) + ".md"
        subdir = os.path.join(SAVE_DIR, year, month)
        os.makedirs(subdir, exist_ok=True)
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        if content_div:
            content = format_content(content_div, article_url)
        else:
            content = soup.get_text("\n", strip=True)
        md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
        filepath = os.path.join(subdir, filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filepath}")
        time.sleep(random.uniform(0.5, 1.2))
    
    def main():
        month_links = get_all_month_links()
        print(f"共发现{len(month_links)}个月份")
        all_tasks = []
        for month_url in month_links:
            m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
            if not m:
                continue
            year, month = m.group(1), m.group(2)
            article_links = get_article_links(month_url)
            print(f"{month_url} 发现{len(article_links)}篇文章")
            for link in article_links:
                all_tasks.append((link, year, month))
        print(f"总共发现{len(all_tasks)}篇文章,开始下载...")
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
            for future in as_completed(futures):
                pass
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()
    

    The images still aren't displaying.

    Using this optimization idea of yours, refine the earlier code once more so that the scraped images display directly in the .md files.

    import os
    import re
    import time
    import random
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.parse import urljoin
    
    BASE_URL = "http://mysql.taobao.org"
    MONTHLY_URL = BASE_URL + "/monthly/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    THREADS = 8
    os.makedirs(SAVE_DIR, exist_ok=True)
    
    def safe_filename(title):
        # 只保留中英文、数字、下划线和横线
        return re.sub(r'[\\/:*?"<>|]', '', title).strip()
    
    def get_soup(url):
        for _ in range(3):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=10)
                resp.encoding = 'utf-8'
                if resp.status_code == 200:
                    return BeautifulSoup(resp.text, "lxml")
            except Exception:
                time.sleep(random.uniform(1, 2))
        return None
    
    def get_all_month_links():
        soup = get_soup(MONTHLY_URL)
        links = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 只抓2024和2025年
            m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
            if m:
                links.add(BASE_URL + href + "/")
        return sorted(links)
    
    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        if not soup:
            return links
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # 匹配 /monthly/2024/12/01/ 这种子文章链接
            m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
            if m:
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))
    
    def format_table(table):
        rows = table.find_all('tr')
        md = []
        for i, row in enumerate(rows):
            cols = row.find_all(['td', 'th'])
            line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
            md.append(line)
            if i == 0:
                md.append('|' + '---|' * len(cols))
        return '\n'.join(md) + '\n'
    
    def format_content(content_div, article_url):
        md_lines = []
        for elem in content_div.children:
            if elem.name == 'pre':
                code = elem.get_text("\n", strip=True)
                md_lines.append(f'\n```\n{code}\n```\n')
            elif elem.name in ['ul', 'ol']:
                for li in elem.find_all('li', recursive=False):
                    # 处理列表项中的图片
                    imgs = li.find_all('img')
                    for img in imgs:
                        img_src = img.get('src')
                        if img_src:
                            img_url = urljoin(article_url, img_src)
                            md_lines.append(f'![image]({img_url})')
                    text = li.get_text(" ", strip=True)
                    if text:
                        md_lines.append(f'- {text}')
                md_lines.append('')
            elif elem.name == 'blockquote':
                quote = elem.get_text("\n", strip=True)
                md_lines.append(f'> {quote}\n')
            elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                level = int(elem.name[1])
                md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
            elif elem.name == 'p' or elem.name is None:
                # 处理段落中的图片
                if hasattr(elem, 'find_all'):
                    imgs = elem.find_all('img')
                    for img in imgs:
                        img_src = img.get('src')
                        if img_src:
                            img_url = urljoin(article_url, img_src)
                            md_lines.append(f'![image]({img_url})')
                text = elem.get_text("\n", strip=True)
                if text:
                    md_lines.append(f'{text}\n')
            elif elem.name == 'table':
                md_lines.append(format_table(elem))
        return '\n'.join(md_lines)
    
    def save_article(article_url, year, month):
        soup = get_soup(article_url)
        if not soup:
            print(f"获取失败: {article_url}")
            return
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        filename = safe_filename(title) + ".md"
        # 按年份和月份创建子目录
        subdir = os.path.join(SAVE_DIR, year, month)
        os.makedirs(subdir, exist_ok=True)
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        if content_div:
            content = format_content(content_div, article_url)
        else:
            content = soup.get_text("\n", strip=True)
        md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
        filepath = os.path.join(subdir, filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filepath}")
        time.sleep(random.uniform(0.5, 1.2))
    
    def main():
        month_links = get_all_month_links()
        print(f"共发现{len(month_links)}个月份")
        all_tasks = []
        for month_url in month_links:
            m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
            if not m:
                continue
            year, month = m.group(1), m.group(2)
            article_links = get_article_links(month_url)
            print(f"{month_url} 发现{len(article_links)}篇文章")
            for link in article_links:
                all_tasks.append((link, year, month))
        print(f"总共发现{len(all_tasks)}篇文章,开始下载...")
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
            for future in as_completed(futures):
                pass
        print("全部下载完成。")
    
    if __name__ == "__main__":
        main()

    Finally, finally, the requirement is met~

    7. Results

    The scrape succeeded~

    What the saved articles look like

    8. Summing up the best prompt

    To wrap up our conversation: what prompt should I have given you so that you would have generated this target code on the first try?

    Best prompt example

    You are a Python web-scraping expert. Please write an efficient Python scraper for me with the following requirements:

    1. Target site: scrape all monthly-report articles from 2024 and 2025 under http://mysql.taobao.org/monthly/.
    2. Directory structure: automatically discover every month and article link under 2024 and 2025, and create local subdirectories by year/month (e.g. taobao/2024/12/).
    3. Saving content: save each article as its own .md file in Markdown format, named after the article title, in the corresponding subdirectory.
    4. Content formatting: the body should be cleanly laid out, supporting code blocks, tables (converted to Markdown tables), lists, block quotes, and other common structures.
    5. Image handling: embed images in the Markdown files via their original remote URLs (i.e. ![alt](url)); do not download them locally, and make sure they display directly in a Markdown preview.
    6. Efficiency and anti-scraping: support multithreaded downloads, and use a User-Agent, random delays, automatic retries, and similar measures.
    7. Robustness: skip and continue on network errors, unexpected page structure, and similar issues.
    8. Dependencies: list the required third-party libraries.

    Please provide complete, runnable Python code directly.


    Why this prompt works

    • It pins down the scope (2024/2025, all monthly reports, all articles)
    • It pins down the directory structure and file naming
    • It pins down the content format and image handling
    • It pins down the efficiency and anti-scraping requirements
    • It pins down robustness and dependencies

    You can add or remove details as needed, but as long as these points are covered, you can usually get high-quality scraper code in a single pass!

    9. Lessons learned

    To implement this requirement, what was your final approach to generating the code? List it step by step. What are the key points and the hard parts of this code, and which concepts does it involve? Please walk through it against the code; I'm not sure which parts of the code your steps correspond to.

    1. Step breakdown mapped to the code

    2. Key points and hard parts, with code

    Key point 1: high-quality content parsing and formatting
    • The format_content function converts the HTML body into clean Markdown, handling the common tags (paragraphs, headings, code blocks, tables, images, and so on).
    • The format_table function converts HTML tables into Markdown tables.
    Hard part 1: image URL handling
    • urljoin(article_url, img_src) produces a complete image URL, which is inserted as ![image](url) so images render directly in the .md file (see the short sketch after this list).
    Hard part 2: multithreading and error handling
    • ThreadPoolExecutor fetches pages concurrently, and get_soup retries internally, which keeps large crawls robust.
    Hard part 3: legal file names
    • The safe_filename function strips characters that are illegal in file names from the title, so saving never fails.
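
    A quick standalone sketch of hard parts 1 and 3 (the image path and the title passed to safe_filename below are made-up examples):

    import re
    from urllib.parse import urljoin

    article_url = "http://mysql.taobao.org/monthly/2024/12/01/"

    # Hard part 1: urljoin resolves relative image paths against the article URL.
    print(urljoin(article_url, "pic/arch.png"))           # http://mysql.taobao.org/monthly/2024/12/01/pic/arch.png
    print(urljoin(article_url, "/monthly/pic/arch.png"))  # http://mysql.taobao.org/monthly/pic/arch.png

    # Hard part 3: strip characters that are illegal in file names before saving.
    def safe_filename(title):
        return re.sub(r'[\\/:*?"<>|]', '', title).strip()

    print(safe_filename('A/B: test?'))  # AB test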

    3. Concepts involved and where they appear in the code

    • requests / BeautifulSoup: every function that fetches pages or parses HTML
    • Regular expressions: link filtering and file-name cleanup (see the small example below)
    • os / path: directory and file handling
    • concurrent.futures: multithreaded concurrency
    • Markdown syntax: format_content, format_table, etc.
    • Exception handling: try/except and the retry logic
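
    Regular expressions, for instance, are exactly where the very first version went wrong: it looked for YYYY-MM.html file names while the index actually links to /monthly/YYYY/MM directories. A small check using hrefs from the debug output above:

    import re

    hrefs = ["/monthly/", "/monthly/2025/04", "/monthly/2024/12"]

    for href in hrefs:
        old = re.match(r"\d{4}-\d{2}\.html", href)          # pattern from the first attempt
        new = re.match(r"/monthly/(\d{4})/(\d{2})$", href)  # pattern from the working version
        print(href, bool(old), bool(new))

    # /monthly/ False False
    # /monthly/2025/04 False True
    # /monthly/2024/12 False True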

    4. Summary

    • Go through each function and code block and map it to its role in the overall flow.
    • The key points and hard parts all lie in content parsing, image handling, multithreading, and robustness, and each has a concrete implementation in the code.