Scraping CSDN Blog Information

LucianaiB

Published 2024-08-21 09:21:11

Category: Crawler Learning. Tags: crawler, python

Article link: https://blog.csdn.net/lwcwam/article/details/141380433

Background:

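This is a Python script for scraping article information from a CSDN blog user. It uses the requests library to send HTTP requests and the pyquery library to parse the returned HTML. The user enters a CSDN username (for example "lwcwam"); the program then visits that user's blog list page and extracts and prints some basic information, such as the number of original articles, followers, likes, and comments.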

The main flow of the code is as follows:

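User input: the program first asks the user for a CSDN ID.
Build the URL: the URL of the user's blog list is built from the entered ID and the current page number.
Send the request: requests.get() fetches the built URL and returns the page content.
Parse the data: pyquery parses the HTML and extracts the required information, including the user's basic statistics and the article list on each page.
Loop over pages: the program visits each page of articles in turn and stops when a response is shorter than 30000 characters, which the script treats as a page with nothing left to scrape.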
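The URL-building and page-loop steps above can be sketched without touching the network by passing the fetch function in as a parameter. The names `list_url`, `iter_pages`, and `fake_fetch` are hypothetical, and the `max_pages` cap is an added safety assumption that the original script does not have; the URL pattern and the 30000-character threshold are taken from the script.

```python
from typing import Callable, Iterator, Tuple

BASE_URL = "http://blog.csdn.net/"  # same base address as in the script


def list_url(account: str, page_num: int) -> str:
    """Build the article-list URL for one page of a user's blog."""
    return BASE_URL + account + "/article/list/" + str(page_num)


def iter_pages(
    account: str,
    fetch: Callable[[str], str],
    min_length: int = 30000,
    max_pages: int = 100,  # safety cap, not part of the original script
) -> Iterator[Tuple[int, str]]:
    """Yield (page number, html) until a page is shorter than min_length."""
    page_num = 1
    while page_num <= max_pages:
        html = fetch(list_url(account, page_num))
        if len(html) < min_length:
            break
        yield page_num, html
        page_num += 1


def fake_fetch(url: str) -> str:
    """Stand-in for requests.get(url).text: two long pages, then a short one."""
    page = int(url.rsplit("/", 1)[1])
    return "x" * 40000 if page <= 2 else "short"


for number, html in iter_pages("lwcwam", fake_fetch):
    print(number, len(html))
```

In real use, `fetch` could be something like `lambda url: requests.get(url, headers=headers).text`, which keeps the loop logic testable on its own.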

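When extracting information, the code uses CSS selectors to locate specific HTML elements and read each article's title, date, read count, and comment count. The results for each page are printed so the user can review them.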

Here is the code:

# Example input: lwcwam

import requests
from pyquery import PyQuery as pq

# Current page number of the blog list
page_num = 1

account = input('Enter CSDN id: ').strip()
#account = "lwcwam"
# Home page address
baseUrl = 'http://blog.csdn.net/' + account
# Append the page number to form the URL of the page to scrape
myUrl = baseUrl + '/article/list/' + str(page_num)

# Request headers with a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

# Fetch the page
myPage = requests.get(myUrl,headers=headers).text

doc = pq(myPage)

# Basic profile statistics: the script reads each value from the title attribute of a dl element
data_info = doc("aside .data-info dl").items()
labels = ["Original", "Followers", "Likes", "Comments"]
for label,item in zip(labels, data_info):
    # str() avoids a TypeError when the title attribute is missing
    print(label+":"+str(item.attr("title")))

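# Level, visits, points, and rank from the grade box
grade_box = doc(".grade-box dl").items()
for i,item in enumerate(grade_box):
    if i==0:
        childitem = item("dd > a")
        # Keep only the first two characters of the title; fall back to "" if it is missing
        print("Level:"+(childitem.attr("title") or "")[0:2])
    if i==1:
        childitem = item("dd")
        print("Visits:"+str(childitem.attr("title")))
    if i==2:
        childitem = item("dd")
        print("Points:"+str(childitem.attr("title")))
    if i==3:
        print("Rank:"+str(item.attr("title")))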


# Fetch the information on each page
while True:

    # Append the page number to form the URL of the page to scrape
    myUrl = baseUrl + '/article/list/' + str(page_num)
    # Fetch the page
    myPage = requests.get(myUrl,headers=headers).text
    # Stop when the response is shorter than 30000 characters,
    # which the script treats as a page with no more articles
    if len(myPage) < 30000:
        break

    print('-----------------------------Page %d---------------------------------' % (page_num,))

    doc = pq(myPage)
    articles = doc(".article-list > div").items()
    articleList = []
    for i,item in enumerate(articles):
        # Skip the first div in the list
        if i == 0:
            continue
        # Drop the first two characters of the link text
        title = item("h4 > a").text()[2:]
        date = item("p > .date").text()
        num_item = item("p > .read-num").items()
        article = [date, title]
        for j,jitem in enumerate(num_item):
            if j == 0:
                # First .read-num element: read count
                read_num = jitem.text()
                article.append(read_num)
            else:
                # Later .read-num elements: comment count
                comment_num = jitem.text()
                article.append(comment_num)
        articleList.append(article)
    # Print only complete rows: date, title, reads, comments
    for item in articleList:
        if len(item) == 4:
            print("%s %s %s %s"%(item[0],item[1],item[2],item[3]))
    page_num = page_num + 1