Scraping WordPress Posts with Python and Converting Them to CSDN Article Format

MWHLS

Published 2021-05-29 16:17:38

Category: python. Tags: python, web crawler, WordPress, csdn

Article link: https://blog.csdn.net/asd123pwj/article/details/117387279

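First published, with later updates, at: https://mwhls.top/2186.html
For newer updates, see mwhls.top.
If images or the table of contents are missing, the formatting is broken, or you want more related content, see the first-published page above.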

Recommended reference: Getting Started with Python Web Crawlers

1. Project overview

2. Approach

3. Code

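Project overview

The script scrapes the first N articles from the blog homepage and returns each one as a title followed by its content.
It works well and meets almost every goal I had.
The one shortcoming is that it cannot publish automatically; I still have to post each article myself.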

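Approach

The approach is very similar to the previous article, so I only outline the main steps and the problems I ran into here.
- See: Scraping StackOverflow Question Pages with Python and Converting Them to a WordPress-Ready Format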

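Fetch the page's HTML text
Get the article links from the blog homepage
- When splitting, re always matched as much text as possible, and the text between the links varies, so I could not slice it with split as in the previous article.
  - That is, when matching with href="(.*)", the match always ran from the first href=" to the closing " of the last article's link.
- I then looked up a few approaches and tried re's split, but hit the same problem: the text was split, yet the pieces still included content that should not have been there.
- I finally found the fix in the official re documentation:
  - Add a ?, changing (.*) to (.*?), which gives non-greedy matching that consumes as few characters as possible.
  - See: https://docs.python.org/zh-cn/3/library/re.html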
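The difference between the two patterns can be seen in a small sketch. The HTML fragment and the example.com URLs below are made up for illustration and are not taken from the real homepage; only the two regular expressions correspond to the ones discussed above.

```python
import re

# Hypothetical homepage fragment with two post links on a single line.
html = (
    '<h2 class="post-title"><a href="https://example.com/1.html" rel="bookmark">First</a></h2>'
    '<h2 class="post-title"><a href="https://example.com/2.html" rel="bookmark">Second</a></h2>'
)

# Greedy: (.*) grabs as much as it can, so the single match runs from the
# first href=" to the last '" rel=' in the string.
greedy = re.findall(r'href="(.*)" rel=', html)

# Non-greedy: (.*?) stops at the first '" rel=' after each href=",
# which yields one match per link.
lazy = re.findall(r'href="(.*?)" rel=', html)

print(len(greedy))  # number of greedy matches
print(lazy)         # list of individual URLs
```

With the greedy pattern the two links collapse into one oversized match, while the non-greedy pattern returns each URL separately.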
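Scrape the articles behind the links from the previous step
- For the conversion to CSDN format, I originally planned to use the edit-mode page available to the administrator account,
- because that is the page I always use to publish articles.
- I then tried scraping the article page directly; it worked, so the script scrapes it directly.
- Code formatting was a problem, though: two consecutive blank lines inside a code block broke the result.
- Adding WordPress's code tag before and after the <code> element finally fixed it.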
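As a minimal sketch of that last step, the replacement can be isolated into a helper and tried on a small input. The function name wrap_code_blocks and the sample HTML are hypothetical; the marker string and the two replace calls mirror the ones used in the article's code.

```python
OPEN = '<pre class="wp-block-code"><code>'
CLOSE = '</code></pre>'
MARKER = '<!-- wp:code -->'  # the comment string used in the article's code


def wrap_code_blocks(html: str) -> str:
    """Put the WordPress code comment before and after every code block."""
    html = html.replace(OPEN, '\n\n' + MARKER + '\n' + OPEN)
    html = html.replace(CLOSE, CLOSE + '\n' + MARKER + '\n')
    return html


# Hypothetical post body whose code block contains two consecutive blank lines.
sample = '<p>intro</p>' + OPEN + 'a = 1\n\n\nb = 2' + CLOSE + '<p>end</p>'
wrapped = wrap_code_blocks(sample)
print(wrapped)
```

The code inside the block is left untouched; only the surrounding markers are added, so the blank lines stay inside a clearly delimited block.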
Output the result

Code

import re
import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import pyperclip


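def main():
    article_num = input("Number of articles to convert (default 3, max 15): ").strip()
    #   input() always returns a str; an empty answer means "use the default".
    #   Clamp to 1..15 to honor the limit stated in the prompt.
    article_num = min(max(int(article_num), 1), 15) if article_num else 3

    homepage = "https://mwhls.top/"
    html = ask_url_get_html(homepage)
    url = get_url_from_homepage(html, article_num)
    #   The homepage may yield fewer links than requested, so count what was found.
    article_num = len(url)

    #   Process the links in reverse list order (last link first).
    for pos in range(article_num - 1, -1, -1):
        print('Fetching the HTML of article {0}...'.format(article_num - pos))
        html = ask_url_get_html(url[pos])
        html_data = get_data_from_post_page(html, url[pos])
        pyperclip.copy(html_data[0])
        print('Fetched. The article title has been copied to the clipboard.')
        input("Press Enter to copy the article content.")
        pyperclip.copy(html_data[1])
        input("Article content copied. Press Enter to continue.")
    print("Conversion finished; the program ended normally.")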


def ask_url_get_html(url):
    #   Fetch the HTML text at url
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    request = urllib.request.Request(url, headers=head)
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        #   Print whatever diagnostics are available, then re-raise: there is
        #   no html to return, and falling through would hit an unbound name.
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        raise
    return html


def get_url_from_homepage(html, article_num):
    #   Extract the article links from the homepage
    #   (.*?) is non-greedy, so each match stops at the first '" rel=' it meets
    find_url = re.compile(r'href="(.*?)" rel=')
    soup = BeautifulSoup(html, "html.parser")
    item = str(soup.find_all("h2", class_='post-title', limit=article_num))
    url = re.findall(find_url, item)
    return url


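def get_data_from_post_page(html, url):
    #   Extract the title and content from an article page and convert them to CSDN format
    #   str() of a find_all() result looks like "[<div ...>...</div>]", so the
    #   content pattern runs from the first </div> to the list's closing bracket.
    find_content = re.compile('</div>(.*)]', re.DOTALL)
    find_title = re.compile('title">(.*)</h1>')
    soup = BeautifulSoup(html, "html.parser")
    item = str(soup.find_all("h1", class_='post-title'))
    title = re.findall(find_title, item)
    item = str(soup.find_all("div", class_='entry'))
    #   Surround every code block with the WordPress code comment so that
    #   consecutive blank lines inside the block no longer break the formatting.
    item = item.replace('<pre class="wp-block-code"><code>', '\n\n<!-- wp:code -->\n<pre class="wp-block-code"><code>')
    item = item.replace('</code></pre>', '</code></pre>\n<!-- wp:code -->\n')

    content = re.findall(find_content, item)
    #   Prepend a notice that links back to the original post.
    content[0] = """
*First published, with later updates, at: [{0}]({1})
For newer updates, see [mwhls.top](https://mwhls.top/).
If images or the table of contents are missing, the formatting is broken, or you want more related content, see the first-published page above.*
    """.format(url, url) + content[0]
    html_data = []
    html_data.append(title[0])
    html_data.append(content[0])

    return html_data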


if __name__ == '__main__':
    main()
