Innovation Training [5]: Scraping Sogou News

ayy洋

Article link: https://blog.csdn.net/weixin_43710646/article/details/115462391

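What was scraped

This week I used selenium + chromeDriver to scrape news about Shandong University (山东大学) from Sogou News, collecting the title, link, time, and source of each item. I scraped 100 pages in total and obtained a little over 380 records. Different result pages are fetched by changing the page number in the page={} parameter of the URL.
Scrape URL: https://www.sogou.com/sogou?interation=1728053249&interV=&pid=sogou-wsse-c7dec8e09376bf8e&query=%E5%B1%B1%E4%B8%9C%E5%A4%A7%E5%AD%A6&page=1&ie=utf8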
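
The per-page URLs can be generated instead of edited by hand. Below is a minimal sketch using only the standard library. `build_page_url` is a hypothetical helper name, and the parameter values are simply copied from the URL above; the post does not say whether `interation` and `pid` are actually required by Sogou.

```python
from urllib.parse import urlencode

BASE = "https://www.sogou.com/sogou"

def build_page_url(page, query="山东大学"):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        "interation": "1728053249",            # copied from the post's URL
        "interV": "",
        "pid": "sogou-wsse-c7dec8e09376bf8e",  # copied from the post's URL
        "query": query,                        # urlencode percent-encodes this as UTF-8
        "page": page,
        "ie": "utf8",
    }
    return BASE + "?" + urlencode(params)

if __name__ == "__main__":
    for page in (1, 2, 3):
        print(build_page_url(page))
```

Building the query string this way also shows where `%E5%B1%B1%E4%B8%9C%E5%A4%A7%E5%AD%A6` comes from: it is the UTF-8 percent-encoding of 山东大学.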

Tools

selenium
chromeDriver

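Code walkthrough

Each news item is contained in an HTML element with class='vrwrap'.
The title is in the element with class='vr-title'. The link is the href attribute of the a tag inside it, and the title text is the text of that a tag.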

# requires: from selenium.webdriver.common.by import By
title=context.find_element(By.CLASS_NAME, "vr-title")
href=title.find_element(By.TAG_NAME, "a").get_attribute("href")  # link
topic=title.text  # title

The time information is in the element with class='fz-mid': the text of its first span is the source, and the text of its second span is the time.

# requires: from selenium.webdriver.common.by import By
time1=context.find_element(By.CLASS_NAME, "fz-mid").find_elements(By.TAG_NAME, "span")
laiyuan=time1[0].text  # news source
time2=time1[1].text    # time
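
To make the nesting concrete without launching a browser, here is a rough sketch that applies the same lookups to a small hand-written fragment, using the standard library's `xml.etree.ElementTree` instead of Selenium. The markup in `SAMPLE` is invented for illustration: only the class names come from the description above, the `h3` and `p` tag names are assumptions, and `parse_items` is a hypothetical helper. The real Sogou page is larger and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mirrors the class names described above.
SAMPLE = """
<div class="results">
  <div class="vrwrap">
    <h3 class="vr-title"><a href="https://example.com/news/1">Example headline one</a></h3>
    <p class="fz-mid"><span>Example Source</span><span>2021-04-01</span></p>
  </div>
  <div class="vrwrap">
    <h3 class="vr-title"><a href="https://example.com/news/2">Example headline two</a></h3>
    <p class="fz-mid"><span>Another Source</span><span>2021-04-02</span></p>
  </div>
</div>
"""

def parse_items(markup):
    """Return one dict per 'vrwrap' block found in the markup."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(".//*[@class='vrwrap']"):
        link = item.find(".//*[@class='vr-title']/a")      # a tag inside the title
        spans = item.findall(".//*[@class='fz-mid']/span")  # [source, time]
        rows.append({
            "href": link.get("href"),
            "topic": link.text,
            "laiyuan": spans[0].text,  # news source
            "time": spans[1].text,
        })
    return rows

if __name__ == "__main__":
    for row in parse_items(SAMPLE):
        print(row)
```

Note that `[@class='vrwrap']` matches the attribute value exactly, whereas Selenium's class-name lookup also matches elements that carry additional classes.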

The full code:

import time
import warnings

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

warnings.filterwarnings("ignore")

# Search URL for 山东大学; the page number is filled into page={}
url='https://www.sogou.com/sogou?interation=1728053249&interV=&pid=sogou-wsse-c7dec8e09376bf8e&query=%E5%B1%B1%E4%B8%9C%E5%A4%A7%E5%AD%A6&page={}&ie=utf8'
driver=webdriver.Chrome()

href_list=[]
topic_list=[]
time_list=[]
laiyuan_list=[]

# Load the result page with the given page number
def next_page(page):
    # Keep going if the page load times out
    try:
        driver.get(url.format(page))
        time.sleep(1)
    except TimeoutException:
        print("TimeoutException on page", page)

# Extract link, title, time and source from the current page
def get_urls():
    try:
        results=driver.find_element(By.CLASS_NAME, "results")

        for context in results.find_elements(By.CLASS_NAME, "vrwrap"):

            title=context.find_element(By.CLASS_NAME, "vr-title")
            href=title.find_element(By.TAG_NAME, "a").get_attribute("href")  # link
            topic=title.text  # title

            time1=context.find_element(By.CLASS_NAME, "fz-mid").find_elements(By.TAG_NAME, "span")
            laiyuan=time1[0].text  # news source
            time2=time1[1].text    # time

            row=[href, topic, time2, laiyuan]
            print(row)  # show the current news item

            href_list.append(href)
            topic_list.append(topic)
            time_list.append(time2)
            laiyuan_list.append(laiyuan)

    except Exception as err:
        # Any failure stops processing of the current page
        print("Scrape failed:", err)

def main():
    # Pages 1 to 100 inclusive
    for i in range(1, 101):
        next_page(i)
        print("Scraping page {}".format(i))
        get_urls()

    dframe = pd.DataFrame({'link': href_list, 'title': topic_list, 'time': time_list, 'source': laiyuan_list})
    dframe.to_csv('sougou_urls.csv', index=False, sep=',', encoding='utf_8_sig')
    driver.quit()
    print("Finished scraping Sogou News!")

if __name__ == '__main__':
    main()
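
The `encoding='utf_8_sig'` argument passed to `to_csv` writes a UTF-8 byte order mark at the start of the file, which helps spreadsheet programs such as Excel detect the encoding so that Chinese titles display correctly. Here is a small sketch of the same idea using the standard `csv` module rather than pandas. `save_rows`, the column names, the demo path, and the sample row are all made up for illustration.

```python
import csv
import os
import tempfile

def save_rows(path, rows):
    """Write rows to a CSV file that starts with a UTF-8 byte order mark."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["link", "title", "time", "source"])  # illustrative header
        writer.writerows(rows)

if __name__ == "__main__":
    demo_path = os.path.join(tempfile.gettempdir(), "sougou_urls_demo.csv")
    save_rows(demo_path, [("https://example.com/news/1", "示例标题", "2021-04-01", "示例来源")])
    with open(demo_path, "rb") as f:
        print(f.read(3))  # the first three bytes are the byte order mark
```

Reading the file back with `encoding="utf-8-sig"` strips the mark again, so the first column name is not polluted by it.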

Results

[Image: screenshot of the CSV contents]
The resulting CSV file contains a little over 380 rows.
