Learning Python Web Scraping: Crawling a Blogger's Article List on CSDN

Latest recommended article published on 2024-08-14 11:00:39

九圣残炎


Category: Python爬虫 (Python crawlers); tags: python, xpath

Article link: https://blog.csdn.net/qq_40878316/article/details/106383759


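Preface:

In the previous article, scraping through the API returned only ten articles, although the page shows more than ten. Refreshing keeps adding more, and the earlier ones stay in place. This suggests that the API sends only ten records per call, while the page caches the records it has already received and appends the new ones. That is my rough understanding, at least.

Scraping an API is fairly simple: once the request parameters have been listed and analyzed, you can generally obtain the data that the backend returns to the frontend. Many sites, however, expose no API for the data you want, and then another method is needed, such as XPath. This article uses the article list of a CSDN blogger as a practice example.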

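Step 1: Open the developer tools

Pick any prolific blogger, open their article list, and press F12 to open the developer tools.

Step 2: Analyze the page

Hovering the mouse over the HTML in the developer tools covers the corresponding part of the page with a translucent blue overlay. The data we want to scrape is inside that highlighted area.

From this analysis, the content we want sits in the h4 under div[@class="article-item-box csdn-tracking-statistics"], which is itself under div[@class="article-list"].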

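Step 3: The XPath expression

From the analysis above we know roughly where the content is located, and the XPath expression follows readily:

//*/div[@class="article-list"]/div/h4

There are many ways to write this XPath. If you really do not know how to write it, you can use Copy XPath in the developer tools and then adjust the result slightly.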
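
To check an XPath expression without touching the network, it can be run against a small hand-written fragment first. The sketch below assumes lxml is installed. The HTML is a made-up imitation of the structure described in step 2, not a copy of the real CSDN page, and `extract_articles` is a name invented for this example. It takes the first non-blank text node of each link, which is one possible way to avoid relying on a fixed index.

```python
from lxml import etree

# Hypothetical fragment imitating the structure described in step 2.
SAMPLE_HTML = """
<html><body>
<div class="article-list">
  <div class="article-item-box csdn-tracking-statistics">
    <h4>
      <a href="https://example.com/article/1">
        <span>Original</span>
        First article
      </a>
    </h4>
  </div>
  <div class="article-item-box csdn-tracking-statistics">
    <h4>
      <a href="https://example.com/article/2">
        <span>Repost</span>
        Second article
      </a>
    </h4>
  </div>
</div>
</body></html>
"""


def extract_articles(html):
    """Return (title, href) pairs for every h4 under the article list."""
    tree = etree.HTML(html)
    pairs = []
    for h4 in tree.xpath('//div[@class="article-list"]/div/h4'):
        href = h4.xpath('a/@href')[0]
        # text() yields only the text nodes that are direct children of <a>;
        # the <span> label is a child element, so its text is not included.
        texts = [t.strip() for t in h4.xpath('a/text()') if t.strip()]
        pairs.append((texts[0], href))
    return pairs


if __name__ == "__main__":
    for title, href in extract_articles(SAMPLE_HTML):
        print(f"{title}: {href}")
```

If the expression returns an empty list on the real page, the class names or nesting have probably changed and the path needs to be adjusted.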

Step 4: Write the crawler

With the steps above done, we can start writing the crawler.

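from urllib import request

# Replace <blogger-id> with the ID of the blogger to crawl
url = "https://blog.csdn.net/<blogger-id>"
html = request.urlopen(request.Request(url)).read().decode('utf8')
# Print the page to check that it was fetched correctly
print(html)
# Save a copy so it can be inspected later
with open('c.html', 'w', encoding='utf8') as f:
    f.write(html)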
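
A bare `urlopen` call raises an exception when the site refuses the request or the network is down. Below is a minimal sketch of a slightly more defensive fetch, using only the standard library. The User-Agent string, the timeout value, and the helper names `build_request` and `fetch_html` are assumptions made for this example.

```python
from urllib import request, error

# Assumed values for illustration; neither comes from the article.
DEFAULT_UA = "Mozilla/5.0 (compatible; example-crawler)"
TIMEOUT_SECONDS = 10


def build_request(url, user_agent=DEFAULT_UA):
    """Create a Request that carries an explicit User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the decoded page body, or None if the request fails."""
    try:
        with request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as resp:
            return resp.read().decode("utf8")
    except error.URLError as exc:  # HTTPError is a subclass of URLError
        print(f"request failed for {url}: {exc}")
        return None
```

Returning None keeps the caller simple: it can skip the page instead of crashing in the middle of a crawl.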

What we fetched here is the entire page, and the whole point of using XPath is to extract just the data we want from it.

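from urllib import request
from lxml import etree

url = "https://blog.csdn.net/ThinkWon"
html = request.urlopen(request.Request(url)).read().decode('utf8')

# Parse the HTML and select the h4 node of every article entry
htmls = etree.HTML(html)
result = htmls.xpath('//*/div[@class="article-list"]/div/h4')

for res in result:
    # The address is simple: just take the href attribute of the a tag
    url = res.xpath('a/@href')[0]
    # text() also returns whitespace and newlines as text, so the title is
    # taken from the second text node and the surrounding padding is stripped
    title = res.xpath('a/text()')[1].strip()
    print(f'{title}:{url}')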

Printed output:
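
How the printed titles look depends on how the whitespace in each text node is handled, because the raw text keeps the indentation and newlines of the HTML source. The sketch below compares deleting space characters with collapsing all whitespace. The sample string and the helper name `clean_title` are made up for illustration.

```python
def clean_title(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(raw.split())


# Made-up text node, padded the way an indented HTML source would pad it.
raw_title = "\n        Python crawler notes   part 1\n      "

# Deleting spaces keeps the newlines and glues the words together.
print(repr(raw_title.replace(" ", "")))
# Splitting and re-joining removes the padding and keeps the words apart.
print(repr(clean_title(raw_title)))
```

For titles that contain no spaces the two approaches differ only in the leftover newlines, but for titles with English words the second one is the safer choice.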

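Of course, this only scrapes the first page. If the blogger you follow has written a lot, the code above clearly cannot reach the articles on the other pages. So how do we solve this?

When you click the second page or "next page", the URL changes. When you then click back to the first page, the URL is no longer the one we first used to enter the blogger's article page but the actual list URL: https://blog.csdn.net/<blogger-id>/article/list/<page>

So to scrape the other pages, we only need to change the URL:

url = 'https://blog.csdn.net/<blogger-id>/article/list/' + str(page)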
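
Since the page number is normally an integer, it has to be converted to text before it is joined to the URL. Here is a minimal sketch that builds the list-page URLs for a range of pages. `page_urls` and the ID `some_blogger` are names invented for this example.

```python
BASE = "https://blog.csdn.net"


def page_urls(blogger_id, begin_page, end_page):
    """Return the list-page URLs for begin_page..end_page, inclusive."""
    return [
        f"{BASE}/{blogger_id}/article/list/{page}"
        for page in range(begin_page, end_page + 1)
    ]


# "some_blogger" is a placeholder ID, not a real account.
for url in page_urls("some_blogger", 1, 3):
    print(url)
```

An f-string formats the integer directly, so no explicit `str(page)` call is needed.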

Step 5: Write to an Excel spreadsheet

from urllib import request
from lxml import etree
import xlwt

url = "https://blog.csdn.net/thinkwon/article/list/2"
html = request.urlopen(request.Request(url)).read().decode('utf8')

htmls = etree.HTML(html)
result = htmls.xpath('//*/div[@class="article-list"]/div/h4')

# Create the workbook and write the header row
workbook = xlwt.Workbook()
sheet = workbook.add_sheet('CSDN blogger articles')
sheet.write(0, 0, 'Title')
sheet.write(0, 1, 'Article URL')
# Data rows start at row 1, directly below the header
for i, res in enumerate(result, start=1):
    url = res.xpath('a/@href')[0]
    title = res.xpath('a/text()')[1].strip()
    sheet.write(i, 0, title)
    sheet.write(i, 1, url)
workbook.save('csdn.xls')

Run result:
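
The result is a legacy .xls workbook with one row per article. If a plain-text format is good enough, the same rows can be written with the standard library's `csv` module instead. This is an alternative, not what the article's code does, and the sample rows and the helper name `write_rows` are made up for illustration.

```python
import csv
import io


def write_rows(rows, fileobj):
    """Write a header row plus (title, url) rows as CSV to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["Title", "Article URL"])
    writer.writerows(rows)


# Made-up rows standing in for scraped results.
sample_rows = [
    ("First article", "https://example.com/article/1"),
    ("Second, with a comma", "https://example.com/article/2"),
]

# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
write_rows(sample_rows, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open('csdn.csv', 'w', newline='', encoding='utf-8-sig')`. The `utf-8-sig` encoding adds a byte order mark, which helps Excel detect the encoding of non-ASCII titles.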

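Step 6: Refactoring

With the steps above complete, we have the data we wanted and could basically call it a day. Still, for the sake of tidiness, and even though there is not much code, I reorganized it: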

from urllib import request
from lxml import etree
import xlwt
import ssl
import random

# Pool of User-Agent strings; one is picked at random for each request
ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
]
# XPath that selects the h4 node of every article entry
xPathStr = '//*/div[@class="article-list"]/div/h4'
# Fetch a page and return its HTML
def readURL(url):
    user_agent = random.choice(ua_list)
    headers = {'User-Agent': user_agent}
    # Note: this context disables TLS certificate verification
    context = ssl._create_unverified_context()
    html = request.urlopen(request.Request(url, headers=headers), context=context).read().decode('utf8')
    return html

# Select the nodes with XPath and write them to the sheet.
# i is the last row written so far; the updated value is returned.
def htmlxPath(html, sheet, i):
    htmls = etree.HTML(html)
    result = htmls.xpath(xPathStr)
    for res in result:
        i += 1
        url = res.xpath('a/@href')[0]
        title = res.xpath('a/text()')[1].strip()
        sheet.write(i, 0, title)
        sheet.write(i, 1, url)
    return i

def articleSpider(url, beginPage, endPage):
    workbook = xlwt.Workbook()
    sheet = workbook.add_sheet('CSDN blogger articles')
    sheet.write(0, 0, 'Title')
    sheet.write(0, 1, 'Article URL')
    i = 0
    for page in range(beginPage, endPage + 1):
        # Build the page URL; ?t=1 filters for original articles
        fullurl = f'{url}{page}?t=1'
        html = readURL(fullurl)
        i = htmlxPath(html, sheet, i)
    workbook.save('csdn.xls')

if __name__ == '__main__':
    bloggerId = input('Enter the ID of the blogger to crawl: ')
    beginPage = int(input('Enter the first page: '))
    endPage = int(input('Enter the last page: '))
    url = f'https://blog.csdn.net/{bloggerId}/article/list/'
    articleSpider(url, beginPage, endPage)

Crawl result:
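
How many rows end up in the result depends on the page range typed in at the prompt. A possible variation is to keep requesting pages until one of them yields no entries. The sketch below assumes that a page past the last one simply produces an empty list, which I have not verified for CSDN. `crawl_until_empty` and the stub functions are invented for this example, with the stubs standing in for the network request and the XPath step.

```python
import time


def crawl_until_empty(fetch, parse, start_page=1, max_pages=50, delay=0.0):
    """Collect parsed rows page by page until a page yields nothing.

    fetch(page) returns an HTML string and parse(html) returns a list of rows.
    max_pages is a safety cap so the loop always terminates.
    """
    rows = []
    for page in range(start_page, start_page + max_pages):
        page_rows = parse(fetch(page))
        if not page_rows:
            break
        rows.extend(page_rows)
        if delay:
            time.sleep(delay)  # pause between requests to go easy on the site
    return rows


# Stubs standing in for the network and the XPath step.
FAKE_PAGES = {1: "a,b", 2: "c"}


def fake_fetch(page):
    return FAKE_PAGES.get(page, "")


def fake_parse(html):
    return html.split(",") if html else []


print(crawl_until_empty(fake_fetch, fake_parse))
```

Passing the fetch and parse steps in as functions also makes the loop easy to test without any network access.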