Scraping images and text from Baidu Tieba: Sun Xiaochuan bar images and celebrity bar analysis

Project overview
Results
Source code

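Preface:

Most people are familiar with Baidu Tieba, especially the 孙笑川吧 (Sun Xiaochuan bar): everyone there is a character, they all have a way with words, and I love the place. Its images are also the kind of absurd memes that are hard to keep a straight face at, so I decided to scrape the images from the Sun Xiaochuan bar, which turned out to be a lot of fun.

Every celebrity also has their own traits, and the fans in each celebrity's bar talk about different things. For example, the 周杰伦吧 (Jay Chou bar) is full of terms like "jay" and "七里香", while "大哥" and "家具" keep turning up in the 成龙吧 (Jackie Chan bar). To analyze what each celebrity's fans are discussing, I scraped the text from these bars as well.

Note that Baidu Tieba limits how often it can be scraped: after the code has run about a dozen times, access is blocked, though it is restored the next day. My advice: once yours is blocked, go borrow someone else's and get that one blocked too, hehe.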
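Since access gets blocked after repeated runs, it may help to slow down and retry gently rather than hammer the site. The sketch below is one way to do that with exponential backoff. The helper names `backoff_delays` and `fetch_with_backoff` are hypothetical, and the delay values are arbitrary guesses, not limits published by Baidu.

```python
import random
import time


def backoff_delays(retries, base=2.0, cap=60.0):
    """Return the wait in seconds before each retry: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def fetch_with_backoff(fetch, url, retries=4, sleep=time.sleep):
    """Call fetch(url) until it returns a truthy result or the retries run out.

    `fetch` is any callable taking a URL; `sleep` can be swapped out for testing.
    """
    for delay in [0.0] + backoff_delays(retries):
        if delay:
            # Add up to one second of random jitter so retries are not perfectly regular.
            sleep(delay + random.uniform(0, 1))
        result = fetch(url)
        if result:
            return result
    return None
```

With `requests`, `fetch` could be `lambda u: requests.get(u, headers=headers, timeout=10)`, because a `Response` object is truthy only when its status code is below 400. A block page served with status 200 would not be caught this way, so treat this as a starting point only.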

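Screenshots of the results:

Some of the images are quite explicit, so I'm not posting the originals.

Output of the code:

These are the top five terms after word segmentation for each celebrity's bar. I expect you can guess who each column belongs to without my saying.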
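The "top five terms" step boils down to counting tokens and skipping stopwords. Here is a minimal sketch of that counting on an already-segmented token list. The function name `top_words` and the sample tokens are made up for illustration; in the real script the tokens come from jieba.

```python
from collections import Counter


def top_words(tokens, stopwords, n=5):
    """Return the n most frequent tokens that are neither stopwords nor whitespace."""
    counts = Counter(t for t in tokens if t.strip() and t not in stopwords)
    return [word for word, _ in counts.most_common(n)]


# Made-up sample tokens, standing in for segmented thread titles.
tokens = ["jay", "七里香", ",", "jay", "的", "七里香", "jay", "演唱会"]
stopwords = {",", "的"}
print(top_words(tokens, stopwords, n=2))
```

Keeping the stopwords in a `set` makes each membership check constant time, which matters once the token list grows to thousands of entries.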

Source code:

Scraping images from the Sun Xiaochuan bar:

import os
import re
import time

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.42",
    # Paste the Cookie header from your own logged-in browser session here.
    "Cookie": "<your Baidu Tieba cookie>",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    # Accept-Encoding is left to requests, which only advertises encodings it can decode.
}

out_dir = "img_sun_xiao_chuan"
os.makedirs(out_dir, exist_ok=True)  # create the output folder if it is missing

for i in range(1, 20):  # pages 1 to 19 of the thread
    url = "https://tieba.baidu.com/p/8050287300?pn={}".format(i)
    response = requests.get(url, headers=headers)
    html_data = response.text
    img_urls = re.findall('<img class="BDE_Image" src="(.*?)" size=".*?" changedsize=".*?" width=".*?" height=".*?">', html_data)
    # Save the images
    for n, x in enumerate(img_urls):
        img_resp = requests.get(x)
        img_name = "{}_{}.jpg".format(i, n)  # page number plus index, so names never collide
        with open(os.path.join(out_dir, img_name), mode="wb") as f:
            f.write(img_resp.content)
        print("over", img_name)
        time.sleep(1)  # pause between downloads
print("all_over!!!")
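The regular expression above only matches when the `<img>` attributes appear in exactly that order, so it may silently miss images if the markup changes. A more forgiving variant, sketched below, finds every `<img>` tag first and then checks its `class` and `src` separately. The function name `extract_image_urls` and the sample HTML are hypothetical; the real Tieba markup may differ.

```python
import re

IMG_TAG_RE = re.compile(r'<img\b[^>]*>')
# The lookbehind keeps attributes such as data-src or data-class from matching.
CLASS_RE = re.compile(r'(?<![\w-])class="([^"]*)"')
SRC_RE = re.compile(r'(?<![\w-])src="([^"]+)"')


def extract_image_urls(html):
    """Return the src of every <img> whose class list contains BDE_Image."""
    urls = []
    for tag in IMG_TAG_RE.findall(html):
        cls = CLASS_RE.search(tag)
        src = SRC_RE.search(tag)
        if cls and src and "BDE_Image" in cls.group(1).split():
            urls.append(src.group(1))
    return urls


# Made-up sample markup for illustration only.
sample_html = (
    '<img class="BDE_Image" src="https://example.com/a.jpg" size="1" width="10" height="10">'
    '<img src="https://example.com/b.jpg" class="BDE_Image">'
    '<img class="emoji" src="https://example.com/c.png">'
)
print(extract_image_urls(sample_html))
```

For anything beyond a quick script, a real HTML parser is usually the safer choice, since regular expressions cannot handle every legal way of writing a tag.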

Writing the text from a celebrity's bar to Excel:

# -*- coding: utf-8 -*-
import re
from urllib import parse

import requests
import xlwt

alist = []  # one list of thread titles per page
name = input("Enter your favorite celebrity's name (李荣浩, 周杰伦, 陈奕迅, etc.): ")

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.42",
    # Paste the Cookie header from your own logged-in browser session here.
    "Cookie": "<your Baidu Tieba cookie>",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
}

# pn is a thread offset in steps of 50; this range covers pn=50 to pn=450
# (use range(0, 9) to include the first page as well).
for i in range(1, 10):
    url = "https://tieba.baidu.com/f?kw={}&ie=utf-8&pn={}".format(parse.quote(name), i * 50)
    response = requests.get(url, headers=headers)
    html_data = response.text
    txt_content = re.findall('<a rel=".*?" href=".*?" title=".*?" target="_blank" class="j_th_tit ">(.*?)</a>', html_data)
    alist.append(txt_content)

book = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = book.add_sheet('tieba_titles', cell_overwrite_ok=True)
sheet.write(0, 0, "Fans' thread titles")
for i, data in enumerate(alist):
    for j, title in enumerate(data):
        sheet.write(i + 1, j, title)  # row i+1 holds the titles from page i
savepath = r'C:\Users\黄永生\Desktop\Excel\chenyixun.xls'  # change this to a path on your machine
book.save(savepath)
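The title regex depends on the exact order of the `<a>` attributes. As an alternative, the standard library's `html.parser` can pick out the title links by class alone. This is only a sketch: the class name `j_th_tit` is taken from the regex above, while `TitleParser`, `extract_titles`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of every <a> whose class list contains j_th_tit."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None  # None while outside a title link

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "j_th_tit" in classes:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buf is not None:
            self.titles.append("".join(self._buf).strip())
            self._buf = None


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up sample markup, modeled on the regex in the script above.
sample_html = (
    '<a rel="noopener" href="/p/1" title="t1" target="_blank" class="j_th_tit ">七里香 &amp; jay</a>'
    '<a href="/p/2" class="other">skip me</a>'
)
print(extract_titles(sample_html))
```

A side benefit is that `HTMLParser` decodes entities such as `&amp;` by default, so the titles arrive as plain text before they are written to Excel.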

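Analyzing the text from a celebrity's bar and finding the top five terms:

This step uses a stopword list (hit_stopwords.txt) to filter out useless punctuation; the file is provided in another article on my homepage.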

import csv
import re
from collections import Counter
from urllib import parse

import jieba
import requests

name = input("Enter your favorite celebrity's name (李荣浩, 周杰伦, 陈奕迅, etc.): ")

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.42",
    # Paste the Cookie header from your own logged-in browser session here.
    "Cookie": "<your Baidu Tieba cookie>",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
}

# Scrape the thread titles and store them one per line.
with open("data2.txt", 'w', encoding='utf-8') as f:
    for i in range(1, 10):  # pn is a thread offset in steps of 50
        url = "https://tieba.baidu.com/f?kw={}&ie=utf-8&pn={}".format(parse.quote(name), i * 50)
        response = requests.get(url, headers=headers)
        html_data = response.text
        txt_content = re.findall('<a rel=".*?" href=".*?" title=".*?" target="_blank" class="j_th_tit ">(.*?)</a>', html_data)
        for txt in txt_content:
            f.write(txt + "\n")

# Load the stopword list (punctuation and filler words to ignore).
with open('hit_stopwords.txt', encoding='utf-8') as sf:
    stopwords = {line.strip() for line in sf}

# Segment every title with jieba.
code = []
with open('data2.txt', 'r', encoding='utf-8') as f1:
    for line in f1:
        code += jieba.lcut(line.strip())

# Count every token that is neither a stopword nor whitespace.
d = Counter(word for word in code if word.strip() and word not in stopwords)
# Keep the five most frequent terms (fewer if there is not enough data).
p = [word for word, _ in d.most_common(5)]
with open("data1.csv", 'w', encoding='utf-8', newline='') as out:
    csv.writer(out).writerow(p)  # first row






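Because of the way Baidu Tieba URLs are built, you can type in a celebrity's name directly and scrape the text of the corresponding bar. For example:

Just type "周杰伦" at the prompt.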
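To make the URL trick concrete, the sketch below builds the list-page URL the same way the scripts above do: the bar name is percent-encoded with `parse.quote` and dropped into the `kw` parameter, while `pn` advances in steps of 50. The helper name `build_list_url` is hypothetical.

```python
from urllib import parse


def build_list_url(name, page):
    """Build the thread-list URL for a bar; page is 0-based and pn advances by 50 per page."""
    return "https://tieba.baidu.com/f?kw={}&ie=utf-8&pn={}".format(
        parse.quote(name), page * 50
    )


print(build_list_url("周杰伦", 1))
```

`parse.quote` encodes the name as UTF-8 by default, which matches the `ie=utf-8` parameter in the URL.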

If this article helped you, please give it a like.

Guff_hys: Python data structures, big data development study, Python training projects (CSDN blog)

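Padding the word count:

Baidu Tieba is one of the largest Chinese-language communities, an interest-based community platform launched by Baidu. On Baidu Tieba, users can create bars for the topics they like, share their hobbies, and talk with like-minded friends. It covers all kinds of topics, including film and TV, music, games, anime, food, and travel, so users can find content related to their interests and exchange experience and opinions.

Baidu Tieba offers plenty of interactive features: users can post, reply, like, and bookmark within a bar, and can follow the topics and users they care about to get the latest news and updates. It also supports publishing images, video, audio, and other kinds of content, giving users richer ways to express their thoughts and feelings.

As an open and free community platform, Baidu Tieba follows the idea of "letting everyone find their own interests". It gives users a space to communicate and share, where they can find like-minded friends, speak freely, and show off their personality and talent. Whether you are looking for a hobby or for new friends, Baidu Tieba can meet the need, and it is a community platform that is hard to leave.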
