Scraping cnblogs (博客园) News with XPath

Author: 不吃猪大肠的小可爱一枚

Category: web crawlers. Tags: python, mysql, xpath

Article link: https://blog.csdn.net/qq_52190083/article/details/113387359

Scraping cnblogs with Python

This article shows how to use lxml to scrape the news section of cnblogs (博客园) and store the scraped content in a MySQL database.

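1. Identify the target URLs and the content to scrape:
The target of this crawler is the cnblogs news section. Paging through it by hand shows the URLs of the first three pages: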

https://news.cnblogs.com/
https://news.cnblogs.com/n/page/2/
https://news.cnblogs.com/n/page/3/
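So the page URLs follow a regular pattern. Rewriting the first page's URL as https://news.cnblogs.com/n/page/1/ also loads the page normally, which gives us one URL template for every page. The next step is to decide what to scrape and how to extract it.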
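The URL pattern described above can be sketched in a few lines of Python. The helper name `build_page_urls` and its `last_page` parameter are made up for illustration; only the URL template comes from the text.

```python
# Sketch: build the list of page URLs from the pattern observed above.
BASE_URL = 'https://news.cnblogs.com/n/page/{}/'


def build_page_urls(last_page):
    """Return the URLs for pages 1..last_page (inclusive)."""
    return [BASE_URL.format(page) for page in range(1, last_page + 1)]


urls = build_page_urls(3)
for url in urls:
    print(url)
```

Keeping the template in one constant means a change to the site's URL scheme only has to be fixed in one place.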
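2. The fields we scrape are the title, the article summary, the number of recommendations, the number of comments, the number of views, the publisher, and the publication date. To extract them, I parse the page with lxml and then query the parsed tree with XPath. Both steps are introduced briefly below.
Parsing the page with lxml: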

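import requests
from lxml import etree
url = 'https://news.cnblogs.com/n/page/1/'
headers = {'user-agent': 'Mozilla/5.0'}  # any browser-like user agent string
res = requests.get(url, headers=headers)
html = etree.HTML(res.text)   # etree.HTML returns the root Element of the parsed document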

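Extracting text with XPath:
Inspect the page with the Google Chrome developer tools: use the element picker to click the element you want, then right-click its node in the Elements panel and choose Copy > Copy XPath to get the element's XPath.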

The copied XPath selects the element itself, not its text. To get the text content, append "/text()" to the copied XPath.
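The difference that "/text()" makes can be seen on a small inline document. This is only a sketch: the HTML fragment below is made up for illustration and is not copied from the real cnblogs page.

```python
from lxml import etree

# A made-up HTML fragment for illustration; it is NOT the real page markup.
SAMPLE_HTML = '''
<div id="news_list">
  <div class="news_block">
    <h2><a href="/n/1/">  First headline </a></h2>
  </div>
  <div class="news_block">
    <h2><a href="/n/2/">Second headline</a></h2>
  </div>
</div>
'''

html = etree.HTML(SAMPLE_HTML)

# Without /text() the query returns Element objects, not strings.
link_elements = html.xpath('//*[@id="news_list"]/div/h2/a')

# With /text() the query returns the text nodes as Python strings.
titles = [t.strip() for t in html.xpath('//*[@id="news_list"]/div/h2/a/text()')]

print(link_elements[0].tag)
print(titles)
```

Note that `xpath()` always returns a list here, which is why the crawler indexes the result with `[0]` before calling `strip()`.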
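3. With the steps above we can extract the data; next we need to decide how to store it. Many people would reach for a text file or an Excel sheet, and that works for small amounts of data, but it becomes unwieldy as the volume grows. This article therefore stores the data in a MySQL database. Here is how to write to MySQL from Python: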

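import pymysql
# Fill in your own connection settings; 3306 is MySQL's default port
conn = pymysql.connect(host="", user="", password="", port=3306, database="")  # connect to the database
cursor = conn.cursor()
# Run an SQL INSERT statement (sql is a statement string you supply)
cursor.execute(sql)
# Commit the transaction so the inserted rows are persisted
conn.commit()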
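As a minimal sketch of the insert step, the statement can be written once with `%s` placeholders and the values passed separately, so the driver handles quoting. The table name `news`, its column names, and the helper `save_news` are assumptions made for illustration; they are not taken from the article.

```python
# Sketch: insert one scraped record with a parameterized query.
# The table name `news` and its column names are assumptions for illustration.
INSERT_SQL = (
    "INSERT INTO news "
    "(title, summary, recommend, comment, browse, publisher, publish_time) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)


def save_news(cursor, row):
    """Insert one 7-field record; the driver escapes the values."""
    if len(row) != 7:
        raise ValueError("expected 7 fields, got {}".format(len(row)))
    cursor.execute(INSERT_SQL, tuple(row))


# Typical use with pymysql (all connection settings are placeholders):
#   conn = pymysql.connect(host="...", user="...", password="...",
#                          port=3306, database="...")
#   with conn.cursor() as cursor:
#       save_news(cursor, ("title", "summary", "1", "0", "10", "someone", "2021-01-29"))
#   conn.commit()
```

Because `save_news` only needs an object with an `execute` method, it can be exercised with a stand-in cursor before pointing it at a real database.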

4. The complete crawler code:

import time
import requests
from lxml import etree
import pymysql

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
# Fill in your own connection settings; 3306 is MySQL's default port
conn = pymysql.connect(host="", user="", password="", port=3306, database="")
cursor = conn.cursor()


def get_infor(url):
    res = requests.get(url, headers=headers)
    html = etree.HTML(res.text)
    # Each news item on the page sits in its own news_block div
    items = html.xpath('//*[@id="news_list"]/div[@class="news_block"]')
    for item in items:
        try:
            title = item.xpath('div[@class="content"]/h2/a/text()')[0].strip()
            text_content = item.xpath('div[2]/div[1]/text()')[1].strip()
            # Optional fields fall back to "空" ("empty") when nothing matches
            recommend_byte = item.xpath('div/div[@class="diggit"]/span[@class="diggnum"]/text()')
            recommend = recommend_byte[0].strip() if recommend_byte else "空"
            comment_byte = item.xpath('div/div/div[@class="entry_footer"]/span[@class="comment"]/a/text()')
            comment = comment_byte[0].strip() if comment_byte else "空"
            browse = item.xpath('div[@class="content"]/div[@class="entry_footer"]/span[@class="view"]/text()')[0].strip()
            # xpath() returns a list, so join the matches instead of storing the list's repr
            publisher_byte = item.xpath('div[@class="content"]/div/span[@class="tag"]/a[@class="gray"]/text()')
            publisher = ", ".join(p.strip() for p in publisher_byte) or "空"
            # Named publish_time so it does not shadow the time module
            publish_time = item.xpath('div/div[@class="entry_footer"]/span[@class="gray"]/text()')[0].strip()
        except IndexError:
            # Skip an item with a missing required field instead of abandoning the rest of the page
            continue
        # "dbname" is used as the table name here; the table must have exactly seven columns
        cursor.execute("insert into dbname values(%s,%s,%s,%s,%s,%s,%s)",
                       (title, text_content, recommend, comment, browse, publisher, publish_time))


def main():
    start_time = time.time()
    urls = ['https://news.cnblogs.com/n/page/{}/'.format(i) for i in range(1, 101)]
    for url in urls:
        get_infor(url)
    conn.commit()
    end_time = time.time()
    print("Took {} seconds in total".format(end_time - start_time))


if __name__ == '__main__':
    main()
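The crawler repeats one pattern several times: run an XPath query, take the first result if there is one, otherwise fall back to a default. A small helper can express that once. This is a sketch only; the name `first_text` is hypothetical and the HTML fragment is made up for illustration, so the real page markup may differ.

```python
from lxml import etree


def first_text(node, path, default="空"):
    """Return the first stripped text result of an XPath query, or `default` if nothing matched."""
    results = node.xpath(path)
    return results[0].strip() if results else default


# Made-up fragment for illustration only; the real page markup may differ.
block = etree.HTML(
    '<div class="news_block"><h2><a> A title </a></h2>'
    '<span class="view">12 views</span></div>'
).xpath('//div[@class="news_block"]')[0]

title = first_text(block, 'h2/a/text()')
views = first_text(block, 'span[@class="view"]/text()')
comments = first_text(block, 'span[@class="comment"]/a/text()')  # no match, so the default is used
print(title, views, comments)
```

With a helper like this, a missing optional field never raises IndexError, so the try/except is only needed for fields that are truly required.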

This article still has shortcomings; corrections and suggestions are welcome.
