Scraping blogs with Python: crawling cnblogs (博客园) articles

weixin_39942995

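Article tags: Python blog scraping

Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39942995/article/details/112876934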

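Summary (generated automatically by CSDN): this article shows how to crawl cnblogs articles with Python's requests and BeautifulSoup libraries, collecting each article's title, link, date, like count, comment count, and view count, and storing the data in a MySQL database.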

This article describes how to crawl cnblogs articles. The steps are as follows.

Contents: main file, spider-cnblogs, code

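I had wanted to play around with a crawler for a while. I tried it in Java before, but every tool really does have its specialty: writing a crawler in Python is much more convenient.

Today's result: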

main file

The main logic is encapsulated in spider-cnblogs; the main file only passes in a URL. The code is listed further down.

spider-cnblogs

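The rough idea is as follows: send the request with requests, parse the HTML with BeautifulSoup (CSS selectors are the recommended way to pick out the content you want), and then persist the parsed data to a database. Here the database is a MySQL instance installed on an Alibaba Cloud ECS server.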
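To make the parsing step concrete, here is a minimal sketch that runs the spider's two CSS selectors against a small HTML fragment. The fragment is made up for illustration and only imitates the class names the selectors rely on; the real cnblogs markup may differ. The function name `parse_posts` is also hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the class names the selectors expect;
# this is an assumption, not a copy of the real cnblogs page.
SAMPLE_HTML = """
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/example/p/1.html">First post</a>
  <footer class="post-item-foot">
    <span class="post-meta-item"><span>2021-01-20 10:00</span></span>
    <a class="post-meta-item"><span>3</span></a>
    <a class="post-meta-item"><span>1</span></a>
    <a class="post-meta-item"><span>120</span></a>
  </footer>
</article>
"""


def parse_posts(html):
    """Return ([(title, href), ...], [flat list of metadata strings])."""
    soup = BeautifulSoup(html, "html.parser")
    # One <a class="post-item-title"> per post carries the title and the link
    titles = [(a.get_text(strip=True), a["href"])
              for a in soup.select(".post-item-title")]
    # Spans inside the direct .post-meta-item children of the footer
    meta = [span.get_text(strip=True)
            for span in soup.select(".post-item-foot>.post-meta-item span")]
    return titles, meta


titles, meta = parse_posts(SAMPLE_HTML)
print(titles)
print(meta)
```

Working on a saved fragment like this lets you check the selectors without hitting the site on every run.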

Code

main.py

from black_fish.cnblogs.spider_cnblogs import Cnblogs

if __name__ == '__main__':
    # Home page, 48-hour top views, candidates
    Cnblogs.executeSpider("https://www.cnblogs.com")
    Cnblogs.executeSpider("https://www.cnblogs.com/aggsite/topviews")
    Cnblogs.executeSpider("https://www.cnblogs.com/candidate/")

spider-cnblogs

import requests
from bs4 import BeautifulSoup
import pymysql


class Cnblogs:
    def __init__(self, id, title, href, date, star_num, comment_num, view_num):
        self.id = id
        self.title = title
        self.href = href
        self.date = date
        self.star_num = star_num
        self.view_num = view_num
        self.comment_num = comment_num

    def print(self):
        print(self.id, self.title, self.href, self.date, self.star_num, self.comment_num, self.view_num)

    # Called on the class itself (Cnblogs.executeSpider(url)), so mark it static
    @staticmethod
    def executeSpider(cnblogs_url):
        response = requests.get(cnblogs_url)
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        bs = BeautifulSoup(response.text, "html.parser")
        # Titles and links
        mainItems = bs.select(".post-item-title")
        # Publish date, like count, comment count, view count (four values per post)
        timeItems = bs.select(".post-item-foot>.post-meta-item span")
        t_list = [timeItem.string for timeItem in timeItems]
        # Keyword arguments: PyMySQL 1.0+ no longer accepts these positionally
        db = pymysql.connect(host="47.103.6.247", user="username",
                             password="password", database="black_fish_db")
        cursor = db.cursor()
        sql = "insert into cnblogs(title, href, date, star_num, comment_num, view_num) values(%s,%s,%s,%s,%s,%s)"
        for m_index, main_item in enumerate(mainItems):
            # The metadata list is flat, so post i owns entries 4*i .. 4*i+3
            cnblog = Cnblogs(0, main_item.string, main_item.attrs['href'],
                             t_list[m_index * 4], int(t_list[m_index * 4 + 1]), int(t_list[m_index * 4 + 2]),
                             int(t_list[m_index * 4 + 3]))
            val = (cnblog.title, cnblog.href, cnblog.date, cnblog.star_num, cnblog.comment_num, cnblog.view_num)
            print(val)
            cursor.execute(sql, val)
        db.commit()
        db.close()

That concludes this walkthrough of crawling cnblogs articles. Thank you for following 懒咪学编程 (c.lanmit.com).
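One fragile spot in executeSpider is the positional indexing (`m_index * 4`, `m_index * 4 + 1`, and so on): it assumes every post contributes exactly four metadata values. The sketch below is one possible way to make that assumption explicit and fail early when the page layout changes. The names `FIELDS_PER_POST` and `group_meta` are hypothetical and not part of the original spider.

```python
FIELDS_PER_POST = 4  # date, likes, comments, views (assumed layout)


def group_meta(titles, meta):
    """Pair each (title, href) with its four metadata values.

    titles: list of (title, href) tuples
    meta:   flat list of strings, four per post, in page order
    """
    if len(meta) != len(titles) * FIELDS_PER_POST:
        raise ValueError(
            f"expected {len(titles) * FIELDS_PER_POST} metadata values, "
            f"got {len(meta)}"
        )
    rows = []
    for i, (title, href) in enumerate(titles):
        start = i * FIELDS_PER_POST
        date, stars, comments, views = meta[start:start + FIELDS_PER_POST]
        # Counts arrive as text and are stored as integers
        rows.append((title, href, date, int(stars), int(comments), int(views)))
    return rows


rows = group_meta(
    [("Post A", "https://example.invalid/a"), ("Post B", "https://example.invalid/b")],
    ["2021-01-20 10:00", "1", "2", "30", "2021-01-21 11:00", "4", "5", "60"],
)
for row in rows:
    print(row)
```

Each tuple has the same column order as the insert statement, so the list could be handed to the DB-API `cursor.executemany(sql, rows)` in place of the per-row `cursor.execute` call.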

Article address: https://c.lanmit.com/bianchengkaifa/Python/102238.html
