Python Web Scraping Example: Zhihu Hot Search

Site Address

Zhihu Hot List: https://www.zhihu.com/billboard

Scraper Code

```python
import json
import time

import requests
from bs4 import BeautifulSoup


def get_zhihu_hot():
    while True:
        url = "https://www.zhihu.com/billboard"
        resp = requests.get(url)
        resp.encoding = 'utf-8'
        soup = BeautifulSoup(resp.text, 'html.parser')

        # Hot-list titles are rendered in elements with the class "HotList-itemTitle".
        title_ls = [item.text for item in soup.find_all(class_='HotList-itemTitle')]

        # The matching links live in the JSON embedded in the
        # <script id="js-initialData"> tag.
        js_text_dict = json.loads(soup.find('script', {'id': 'js-initialData'}).get_text())
        hot_list = js_text_dict['initialState']['topstory']['hotList']
        url_ls = [item['target']['link']['url'] for item in hot_list]

        news_ls = [{'title': title_ls[i], 'url': url_ls[i]} for i in range(len(title_ls))]
        news_ls.reverse()

        for i, news in enumerate(news_ls, start=1):
            print(('\033[1;37m' + str(i) + '\033[0m').center(50, "*"))
            print('\033[1;36m' + news.get('title') + '\033[0m')

        news_length = len(news_ls)
        user_input = input("Enter a news number to get its link, q/Q to quit, r/R to refresh: ")
        if user_input in ('q', 'Q'):
            break
        elif user_input in ('r', 'R'):
            continue
        elif user_input in [str(n) for n in range(1, news_length + 1)]:
            # int() is safer than eval() for untrusted user input.
            news_index = int(user_input) - 1
            print(news_ls[news_index].get('url'))
            print('\033[1;33m' + 'Hold Ctrl and click the link to open it' + '\033[0m')
            print('\033[5;31m' + 'The hot list will refresh automatically in 10s' + '\033[0m')
            time.sleep(10)
        else:
            print("Invalid input.")
            print('\033[5;31m' + 'The hot list will refresh automatically in 3s' + '\033[0m')
            time.sleep(3)

    print("Done, exiting the Zhihu hot list!")


if __name__ == '__main__':
    get_zhihu_hot()
```
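
One caveat: Zhihu may return an error page or empty markup for plain `requests` calls that carry no browser-like headers. If that happens, a hedged variant of the request is to pass an explicit `User-Agent`; the header value below is purely illustrative and not something taken from the original script.

```python
# Minimal sketch: fetch the billboard page with an explicit User-Agent.
# Assumption: Zhihu may block requests that do not look like a browser.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
resp = requests.get("https://www.zhihu.com/billboard", headers=headers, timeout=10)
resp.raise_for_status()  # fail fast if the request was rejected
print(resp.status_code, len(resp.text))
```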

Scraper Output

Running the script prints the numbered hot-list titles in the terminal; entering a number prints the link for that entry, q/Q exits, and r/R refreshes the list.

Scrapy + Splash Alternative

Below is a simple Scrapy example that crawls Zhihu's hot topics.

First, install Scrapy and the other required libraries:

```
pip install scrapy
pip install requests
pip install scrapy-splash
```

Then create a new Scrapy project:

```
scrapy startproject zhihu
cd zhihu
```

Next, add some configuration to `settings.py`:

```python
BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

Here we use Splash to render the page, which is why the Splash-related settings are needed. `DOWNLOAD_DELAY` is the download delay; to avoid being blocked by the site, it is best to keep it fairly long.

Next, create a spider class in a file named `zhihu_spider.py`:

```python
import scrapy
from scrapy_splash import SplashRequest


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/hot']

    script = '''
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(2))
        return splash:html()
    end
    '''

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='execute', args={
                'lua_source': self.script
            })

    def parse(self, response):
        for item in response.css('.HotItem'):
            yield {
                'title': item.css('.HotItem-title a::text').get(),
                'link': item.css('.HotItem-title a::attr(href)').get(),
            }
```

Here we use SplashRequest to fetch the page, with a Lua script that waits for it to finish loading. CSS selectors then extract the title and link of each hot topic, which are stored in a dict and yielded.

Finally, run the spider:

```
scrapy crawl zhihu -o zhihu.csv
```

This scrapes the titles and links of Zhihu's hot topics and saves them to a CSV file.
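
The settings above assume a Splash instance is already listening at `http://localhost:8050` (commonly started from the scrapinghub/splash Docker image). Once the crawl has written `zhihu.csv`, a quick way to sanity-check the export is a small sketch like the one below; the `title` and `link` column names simply mirror the fields yielded by the spider above.

```python
# Minimal sketch for inspecting the exported items; assumes the crawl above
# produced zhihu.csv with 'title' and 'link' columns.
import csv

with open("zhihu.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f), start=1):
        print(f"{i}. {row['title']} -> {row['link']}")
```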