Web Scraping Case Study: Chinese Super League News

Latest recommended article published on 2023-05-30 19:35:06

Crazy ProMonkey

Categories: python, project practice. Tags: python, web scraping, regular expressions

Article link: https://blog.csdn.net/fleehom/article/details/121393198

Requirements:

http://sports.163.com/zc/

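Extract the title, title URL, tags, publication time, and comment count of every news item on the site, and save them to a file.

Case analysis:

(1) Request

Inspecting the site shows that the request URL changes with the page number. The first page uses the base URL directly; pages 2 through 9 use a zero-padded single digit (zc_02.html to zc_09.html); pages 10 and up use the two-digit number as-is. An if/elif/else branch on the page number is therefore enough to build each URL: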

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.9 Safari/537.36",
    # If the request is rejected, a Cookie can be added to the headers as well
}

for i in range(1, 20 + 1):
    if i == 1:
        url = "https://sports.163.com/zc/"
    elif 2 <= i <= 9:
        url = f"https://sports.163.com/special/00051C89/zc_0{i}.html"
    else:
        url = f"https://sports.163.com/special/00051C89/zc_{i}.html"
    # Fetch the page; the parsing snippets below work on this response
    response = requests.get(url, headers=headers, timeout=10)
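
The branching above can also be folded into a small helper. The following is a minimal sketch: the function name `build_page_url` and the constant names are illustrative, and the URL pattern is simply the one used in the loop above (whether the site still serves these paths is not verified here).

```python
FIRST_PAGE_URL = "https://sports.163.com/zc/"
PAGE_URL_TEMPLATE = "https://sports.163.com/special/00051C89/zc_{page:02d}.html"


def build_page_url(page: int) -> str:
    """Return the list-page URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return FIRST_PAGE_URL
    # {page:02d} zero-pads single-digit pages (zc_02 to zc_09) and leaves
    # two-digit pages unchanged (zc_10, zc_11, and so on).
    return PAGE_URL_TEMPLATE.format(page=page)


if __name__ == "__main__":
    for page in (1, 2, 10):
        print(page, build_page_url(page))
```

Using a format specifier removes the separate branch for two-digit pages, so only the first page remains a special case.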

(2) Parsing

First extract the node that contains the page's news list. With regular expressions, the code is:

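import re

html_str = response.content.decode()
# Cut out the block between the list container and the pagination bar
node = re.search('<div class="new_list">(.*?)<div class="news_page clearfix">', html_str, re.S).group(1)
# Split that block into one HTML fragment per news item
node_list = re.findall('<div class="news_item">(.*?)</ul>', node, re.S)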

With XPath, the code is:

from lxml import etree

html_str = response.content.decode()
root = etree.HTML(html_str)
# One element per news item inside the list container
li_node_list = root.xpath("//div[@class='new_list']/div[@class='news_item']")

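BeautifulSoup works as well; any of these approaches will do.

Once the list of item nodes has been extracted, each item can be parsed in detail.

Regex approach: the main tools to be comfortable with are re.search() and re.findall().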

rank = 0  # initialise once, before the page loop
for item in node_list:
    rank += 1
    # group(1) is the href, group(2) is the link text
    link = re.search('<h3><a href="(.*?)">(.*?)</a></h3>', item, re.S)
    title_url = link.group(1)
    title = link.group(2)
    tag = re.search('<div class="keywords">(.*?)</div>', item, re.S).group(1)
    tag = re.findall('>(.*?)</a>', tag, re.S)
    tag = "、".join(tag)
    date = re.search('<div class="post_date">(.*?)</div>', item, re.S).group(1)
    comment = re.search('<span class="icon">(.*?)</span>', item, re.S).group(1)
    print(f"Rank: {rank}, title: {title}, title URL: {title_url}, tags: {tag}, published: {date}, comments: {comment}")
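
To try the regex approach end to end without hitting the network, the sketch below runs the same patterns against a hand-written HTML fragment. The fragment, the helper names `first_group` and `parse_item`, and the sample values are all made up for illustration; the real page markup may differ. Unlike a bare `re.search(...).group(1)`, the helper returns an empty string when a field is missing instead of raising `AttributeError`.

```python
import re

# Hypothetical fragment shaped like the markup the patterns above expect.
SAMPLE_ITEM = """
<h3><a href="https://example.com/news/1.html">Sample headline</a></h3>
<div class="info">
  <div class="keywords"><a href="#">TagA</a><a href="#">TagB</a></div>
  <div class="post_date">2021-11-18 10:00:00</div>
  <div class="share_join"><a class="comment"><span class="icon">12</span></a></div>
</div>
"""


def first_group(pattern: str, text: str, group: int = 1) -> str:
    """Return the requested capture group, or "" when nothing matches."""
    match = re.search(pattern, text, re.S)
    return match.group(group).strip() if match else ""


def parse_item(item: str) -> dict:
    """Extract the five fields from one news item's HTML."""
    link = '<h3><a href="(.*?)">(.*?)</a></h3>'
    keywords = first_group('<div class="keywords">(.*?)</div>', item)
    return {
        "title": first_group(link, item, 2),      # link text
        "title_url": first_group(link, item, 1),  # href attribute
        "tag": "、".join(re.findall('>(.*?)</a>', keywords, re.S)),
        "date": first_group('<div class="post_date">(.*?)</div>', item),
        "comment": first_group('<span class="icon">(.*?)</span>', item),
    }


if __name__ == "__main__":
    print(parse_item(SAMPLE_ITEM))
```

Returning a dictionary per item keeps the fields together, which makes the later saving step independent of how the fields were extracted.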

XPath approach:

rank = 0  # initialise once, before the page loop
# Iterate item by item so that the fields of each news entry stay aligned
for li_node in li_node_list:
    rank += 1
    # Each li_node is the parent node of one news item; query its children one by one.
    # When querying relative to an already-selected node, the XPath must start with a dot.
    title = li_node.xpath('./h3/a/text()')
    title = "".join(title)
    # BeautifulSoup equivalent: select_one('h3>a').string
    title_url = li_node.xpath('./h3/a/@href')
    title_url = "".join(title_url)
    # BeautifulSoup equivalent: select_one('h3>a')["href"]
    tag = li_node.xpath('./div[@class="info"]/div[@class="keywords"]/a/text()')
    tag = " ".join(tag)
    date = li_node.xpath('./div[@class="info"]/div[2]/text()')
    date = "".join(date)
    comment = li_node.xpath('./div[@class="info"]/div[3]/a[@class="comment"]/span[@class="icon"]/text()')
    comment = "".join(comment)
    print(f"Rank: {rank}, title: {title}, title URL: {title_url}, tags: {tag}, published: {date}, comments: {comment}")

After parsing, the data needs to be saved, either to a database or to a text file.

Saving to a file:

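# The with block closes the file automatically, so no explicit f.close() is needed
with open('./体育新闻.csv', 'a', encoding='utf-8') as f:
    f.write(f'Rank: {rank}; title: {title}; title URL: {title_url}; tags: {tag}; published: {date}; comments: {comment}\n')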
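
Since the file carries a `.csv` extension, one option is to write real CSV rows with the standard `csv` module so that spreadsheet tools can open the result. This is a sketch of an alternative, not the format used above; the function name `save_rows`, the column names, and the sample row are illustrative.

```python
import csv
import os
import tempfile

FIELDS = ["rank", "title", "title_url", "tag", "date", "comment"]


def save_rows(path: str, rows: list) -> None:
    """Append rows (dicts keyed by FIELDS) to a CSV file, adding a header once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves no file behind in the project.
    demo_path = os.path.join(tempfile.mkdtemp(), "sports_news.csv")
    save_rows(demo_path, [{
        "rank": 1, "title": "Sample headline",
        "title_url": "https://example.com/news/1.html",
        "tag": "TagA、TagB", "date": "2021-11-18 10:00:00", "comment": "12",
    }])
    with open(demo_path, encoding="utf-8", newline="") as f:
        print(f.read())
```

Collecting one page's rows and writing them in a single call also avoids reopening the file for every news item.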

Saving to a database:

from a_help.mysql_helper import MySQLHelper

mysql_helper = MySQLHelper(cache_count=1)

...

mysql_helper.insert("news", ["rank", "title", "title_url", "tag", "date", "comment"], [rank, title, title_url, tag, date, comment])

mysql_helper.close()
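
`MySQLHelper` is the author's own helper module, and its internals are not shown here. For readers without it, the same insert can be tried with Python's built-in `sqlite3` module. Note that this swaps MySQL for SQLite purely for illustration; the table layout and the function name `insert_news` are assumptions.

```python
import sqlite3

COLUMNS = ["rank", "title", "title_url", "tag", "date", "comment"]


def insert_news(conn: sqlite3.Connection, row: list) -> None:
    """Insert one news record using a parameterized query."""
    # Column names are quoted because words such as "rank" may be reserved.
    columns = ", ".join(f'"{name}"' for name in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(f"INSERT INTO news ({columns}) VALUES ({placeholders})", row)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    conn.execute(
        'CREATE TABLE news ("rank" INTEGER, "title" TEXT, "title_url" TEXT, '
        '"tag" TEXT, "date" TEXT, "comment" TEXT)'
    )
    insert_news(conn, [1, "Sample headline", "https://example.com/news/1.html",
                       "TagA、TagB", "2021-11-18 10:00:00", "12"])
    print(conn.execute("SELECT * FROM news").fetchall())
    conn.close()
```

Passing the values as query parameters, rather than formatting them into the SQL string, keeps titles containing quotes from breaking the statement.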

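When the volume of data to scrape is large, the stability of the project becomes a concern. Adding time.sleep() between requests helps, and so does fault-tolerance logic. Some simple examples: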

time.sleep(2)  # pause between requests to avoid hammering the server
if response:
    ...
else:
    print(f"Request for page {i} failed")

    def get(self, url, params=None, headers=None, timeout=None):
        response = None
        for i in range(1, self.max_retry_count + 1):
            try:
                ...
            except Exception as e:
                print(f"Retry attempt {i}, error: {e}...")
                ...
        return response

    def post(self, url, headers=None, timeout=None, data=None, json=None):
        response = None
        for i in range(1, self.max_retry_count + 1):
            try:
                ...
            except Exception as e:
                print(f"Retry attempt {i}, error: {e}...")
                ...
        return response
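
The bodies of `get` and `post` are elided in the source, so the following is only a generic illustration of the retry-with-delay pattern they hint at, not the author's implementation. `call_with_retry`, `max_retry_count`, and `delay` are hypothetical names, and the flaky function stands in for a real HTTP request.

```python
import time


def call_with_retry(func, max_retry_count: int = 3, delay: float = 0.0):
    """Call func() up to max_retry_count times; return None if every attempt fails."""
    for attempt in range(1, max_retry_count + 1):
        try:
            return func()
        except Exception as e:
            print(f"Retry attempt {attempt}, error: {e}")
            if attempt < max_retry_count:
                time.sleep(delay)  # pause before the next attempt
    return None


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Stand-in for a request: fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated failure")
        return "ok"

    print(call_with_retry(flaky, max_retry_count=5))
```

Returning `None` after the last failed attempt matches the `if response: ... else: ...` check shown earlier, so the caller can log the failed page and move on.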

Results: