Searching Weibo by keyword and crawling the first 40 pages of posts and images

# -*- coding: utf-8 -*-
"""
@author: tanderick
"""
import requests
import re 
import os
import urllib.parse
import time

# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'}
# Search keyword
keyword = '简历'
# Create a folder named after the keyword
filepath = r'C:/weibo/' + keyword
if not os.path.exists(filepath):
    os.makedirs(filepath)  # makedirs also creates C:/weibo if it does not exist yet
# Fetch 40 pages of search results for the keyword and append the raw HTML to a txt file
kw = urllib.parse.quote(keyword)
s_url = 'https://s.weibo.com/weibo?q=' + kw + '&wvr=6&b=1&Refer=SWeibo_box'
for i in range(1, 41):  # Weibo search pages are numbered from 1
    html = requests.get(s_url + '&page=' + str(i), headers=headers).text
    html = urllib.parse.unquote(html)
    print(i)
    with open(filepath + '/' + keyword + '.txt', 'a', encoding='utf-8') as f:
        f.write(html)
    time.sleep(0.5)
# Read the saved file back in
with open(filepath + '/' + keyword + '.txt', 'r', encoding='utf-8') as h:
    html = h.read()
# Parse user ids, nick-names, post text and image cards, then download
uids = re.findall('<a href="//weibo.com/(.*?)?refer_flag=1001030103_" class=".*?" target=".*?" nick-name="(.*?)" suda-data=".*?">.*?</a>', html)
contents = re.findall(' <p class="txt" node-type="feed_list_content" nick-name=".*?">(.*?)</p>', html, re.S)
pic_id = re.findall('<!--card-wrap-->(.*?)<!--/card-wrap-->', html, re.S)
for i in range(len(uids)):
    uid, nickname = uids[i]
    out_filepath = filepath + '/' + nickname
    if not os.path.exists(out_filepath):
        os.mkdir(out_filepath)
    # Save the user id, nick-name and post text (tags stripped)
    with open(out_filepath + '/微博内容.txt', 'a', encoding='utf-8') as f:
        f.write(str(uids[i]) + '\r\n' + re.sub('<.*?>', '', contents[i], flags=re.S))
    # Collect image URLs from the post card and download them
    pic_urls1 = re.findall('img src="(.*?)jpg".*?', pic_id[i])
    pic_urls2 = re.findall('cover_img=(.*?)jpg.*?', pic_id[i])
    for url1 in pic_urls1:
        url1 = re.sub(r'https:', '', str(url1))
        filename = url1.split('/')[-1]
        response = requests.get(r'http:' + url1 + 'jpg', headers=headers)
        with open(out_filepath + '/' + filename + 'jpg', 'wb') as f:
            f.write(response.content)
        print(r'http:' + url1 + 'jpg' + ' downloaded')
    for url2 in pic_urls2:
        url2 = re.sub(r'https:', '', str(url2))
        filename = url2.split('/')[-1]
        response = requests.get(r'http:' + url2 + 'jpg', headers=headers)
        with open(out_filepath + '/' + filename + 'jpg', 'wb') as f:
            f.write(response.content)
        print(r'http:' + url2 + 'jpg' + ' downloaded')
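The script above hits s.weibo.com anonymously. Weibo's search pages frequently redirect anonymous clients to a login page, in which case the regexes above match nothing; the usual workaround is to send the cookie of a logged-in browser session along with the User-Agent. The following is only a minimal sketch of that idea; `COOKIE` is a placeholder for a value copied from your own browser (for example from the network tab of the developer tools), not a real credential.

```python
import requests

# Placeholder: paste the Cookie header of a logged-in weibo.com browser session here.
COOKIE = 'SUB=...; SUBP=...'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
    'Cookie': COOKIE,
}

# Quick sanity check: with a valid cookie this should return the normal search page.
resp = requests.get('https://s.weibo.com/weibo',
                    params={'q': '简历', 'page': 1},
                    headers=headers, timeout=10)
print(resp.status_code, len(resp.text))
```

The same `headers` dict can then be passed to every `requests.get` call in the script above.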
Below is another Python example for crawling Weibo search results by keyword and turning the text into a word cloud:

```python
import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Fetch one page of search results
def get_page(keyword, page):
    url = 'https://s.weibo.com/weibo?q=' + keyword + '&page=' + str(page)
    response = requests.get(url, headers=headers)
    return response.text

# Parse a result page and extract the post text
def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    content_list = soup.find_all('p', class_='txt')
    content = ''
    for item in content_list:
        content += item.text.strip()
    return content

# Segment the post text into words
def cut_words(content):
    word_list = jieba.cut(content)
    words = ' '.join(word_list)
    return words

# Generate the word-cloud image
def generate_wordcloud(words):
    wc = WordCloud(background_color='white', width=800, height=600, max_words=200, font_path='msyh.ttc')
    wc.generate(words)
    wc.to_file('wordcloud.png')

# Main entry point
def main():
    keyword = input('Enter the keyword to search for: ')
    page = int(input('Enter the number of pages to crawl: '))
    content = ''
    for i in range(1, page + 1):
        html = get_page(keyword, i)
        content += get_content(html)
    words = cut_words(content)
    generate_wordcloud(words)

if __name__ == '__main__':
    main()
```

This version takes a keyword and a page count, crawls the matching Weibo posts, segments the text with jieba and renders it as a word-cloud image. Note that crawling Weibo normally requires logging in and supplying a cookie; otherwise requests are blocked by the anti-crawler mechanism.
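If the first script has already saved the raw pages to C:/weibo/简历/简历.txt, the word-cloud step can also be run offline on that file instead of crawling again. The following is only a sketch under that assumption; it uses the same kind of content regex as the crawler above, and it assumes jieba and wordcloud are installed and that the msyh.ttc font file is reachable (adjust font_path otherwise).

```python
# -*- coding: utf-8 -*-
import re
import jieba
from wordcloud import WordCloud

keyword = '简历'
filepath = r'C:/weibo/' + keyword

# Read the HTML dump produced by the crawler above
with open(filepath + '/' + keyword + '.txt', 'r', encoding='utf-8') as f:
    html = f.read()

# Extract post text and strip any tags left inside the captures
posts = re.findall('<p class="txt" node-type="feed_list_content" nick-name=".*?">(.*?)</p>', html, re.S)
text = re.sub('<.*?>', ' ', ' '.join(posts), flags=re.S)

# Segment with jieba and render the word cloud
words = ' '.join(jieba.cut(text))
wc = WordCloud(background_color='white', width=800, height=600,
               max_words=200, font_path='msyh.ttc')
wc.generate(words)
wc.to_file(filepath + '/wordcloud.png')
```

Running the word-cloud step offline this way avoids re-hitting Weibo while you experiment with the rendering settings.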
