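Scraping Jinjiang with Python: counting how many distinct ids appear in a thread on the Jinjiang Literature City (晋江文学城) Netizen Exchange Board ("Rabbit Board")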

Latest recommended article published on 2021-02-10 18:18:52

weixin_39557576


Article tags: python爬取晋江 (scraping Jinjiang with Python)

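The Netizen Exchange Board (网友交流区) of Jinjiang Literature City (晋江文学城), commonly known as the "Rabbit Board" (兔区), is an anonymous forum whose main topic is celebrity gossip.

1: Threads on this board have the following characteristics:

First: every reply in a thread displays only a single id;

Second: within one thread, the id of a given logged-in account is fixed and does not change.

2: To find out how many fixed ids a thread contains, proceed as follows:

First: determine how many pages the thread has;

Second: find the id on every reply;

Third: deduplicate the ids of accounts that replied more than once.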
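The third step, deduplication, can be sketched on its own before any scraping is involved. This is a minimal illustration, not the code used later in the article; the function name `unique_in_order` and the sample ids are made up for the example.

```python
def unique_in_order(ids):
    """Return the distinct ids, keeping the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(ids))


# Hypothetical ids, invented for illustration only
replies = ["a1b2", "c3d4", "a1b2", "e5f6", "c3d4"]
distinct = unique_in_order(replies)
print(len(distinct), distinct)
```

Using `dict.fromkeys` avoids the repeated `in` scan over a list, which matters little for a five-page thread but keeps the step short.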

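3: Analyzing the page structure:

First: finding how many pages the thread has:

Start by opening the first page of the thread.

Take the following thread as an example (it is hard to find a thread that will not start a flame war; I figured an anime/manga (二次元) thread would be safer):

URL: http://bbs.jjwxc.net/showmsg.php?board=2&boardpagemsg=1&id=6577455

The first page shows "共5页" ("5 pages in total"), which tells us the thread has 5 pages, so we only need to extract that value.

To locate the element that holds it, right-click the page and choose "检查网页源代码" (view page source):

[Screenshots: 1.PNG, 2.PNG]

A thread with only one page has no such element, so the code has to handle that case as well: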

#1: first check how many pages the thread has
print("Enter the URL of the thread's first page:")
url = input()
req = requests.get(url, cookies=cookies, headers=headers)
text = req.content.decode('GB2312', 'ignore')
soup = BeautifulSoup(text, features='lxml')
try:
    # the pager at the top of the page contains text such as "共5页"
    page_top = soup.find(name='div', attrs={'id': 'pager_top'}).text
    page_count_text = re.findall(r'\d+(?:\.\d+)?', page_top)
    page_count = int(page_count_text[0])
except (AttributeError, IndexError, ValueError):
    # no usable pager found: the thread has only one page
    page_count = 1
print('This thread has ' + str(page_count) + ' page(s)')
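The snippet above takes the first number found in the pager text. If the pager ever lists other numbers before the total, a pattern anchored on the "共…页" wording would be more specific. The sketch below is one possible variant; the helper name `parse_page_count` is hypothetical, and it assumes the pager text contains the phrase exactly as it appears on the example thread ("共5页").

```python
import re


def parse_page_count(pager_text):
    """Return the page count from pager text such as '共5页'; default to 1."""
    # Assumption: the total is written as 共<number>页, as on the example thread
    match = re.search(r"共\s*(\d+)\s*页", pager_text)
    return int(match.group(1)) if match else 1


print(parse_page_count("共5页"))
print(parse_page_count(""))  # a one-page thread has no pager text
```

Returning 1 when nothing matches mirrors the `except` branch of the original code, so a one-page thread needs no special handling by the caller.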

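Second: finding the fixed ids

The id is shown in the screenshot; right-click and choose "网页源代码" (page source) to see which element holds it:

[Screenshots: 3.PNG, 4.PNG]

So the code is as follows: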

#2: extract the id shown on every reply
dis_li = []  # distinct ids across all pages, in order of first appearance
for i in range(page_count):
    # Assumption: the forum numbers pages from 0, so page=0 is the first page
    url_2 = url + '&page=' + str(i)
    req = requests.get(url_2, cookies=cookies, headers=headers)
    text = req.content.decode('GB2312', 'ignore')
    soup = BeautifulSoup(text, features='lxml')
    authorname = soup.find_all(name='td', attrs={'class': 'authorname'})
    for cell in authorname:
        tag = cell.find(color="#999999")
        # the same person may reply several times in a thread,
        # so keep each id only once
        if tag is not None and tag.string and tag.string not in dis_li:
            dis_li.append(tag.string)
    time.sleep(5)  # pause between page requests

# print the deduplicated ids one by one
print('Ids in this thread:')
for count_id, ev_id in enumerate(dis_li, start=1):
    print(str(count_id) + '______' + ev_id)
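Building the per-page URLs by string concatenation works for the example URL, but it silently produces a duplicate `page` parameter if the pasted URL already has one. A small sketch with the standard library's `urllib.parse` avoids that. The helper `page_urls` and its `first_index` parameter are hypothetical; the article does not state whether the forum numbers pages from 0 or from 1, so the default of 0 is an assumption to check against the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(first_page_url, page_count, first_index=0):
    """Return one URL per page, replacing any existing 'page' parameter."""
    parts = urlsplit(first_page_url)
    # keep every query parameter except a pre-existing 'page'
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    urls = []
    for n in range(page_count):
        # first_index is an assumption: 0 if pages are numbered from zero
        page_query = urlencode(query + [("page", str(first_index + n))])
        urls.append(urlunsplit(parts._replace(query=page_query)))
    return urls


for u in page_urls(
        "http://bbs.jjwxc.net/showmsg.php?board=2&boardpagemsg=1&id=6577455", 5):
    print(u)
```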

Result of running the script:

[Screenshot: 5.PNG]

The complete code is as follows:

# -*- coding: utf-8 -*-
import re
import time

import requests
from bs4 import BeautifulSoup

# fill in your own browser's User-Agent string
headers = {'user-agent': 'fill in your own'}

# placeholder: requests expects a dict of cookie names and values,
# so replace this entry with your own cookies
cookies = {'cookies': 'fill in your own'}

#1: first check how many pages the thread has
print("Enter the URL of the thread's first page:")
url = input()
req = requests.get(url, cookies=cookies, headers=headers)
text = req.content.decode('GB2312', 'ignore')
soup = BeautifulSoup(text, features='lxml')
try:
    # thread with more than one page: the pager contains text such as "共5页"
    page_top = soup.find(name='div', attrs={'id': 'pager_top'}).text
    page_count_text = re.findall(r'\d+(?:\.\d+)?', page_top)
    page_count = int(page_count_text[0])
except (AttributeError, IndexError, ValueError):
    # thread with only one page: there is no pager
    page_count = 1
print('This thread has ' + str(page_count) + ' page(s)')

#2: extract the id shown on every reply
dis_li = []  # distinct ids across all pages, in order of first appearance
for i in range(page_count):
    # Assumption: the forum numbers pages from 0, so page=0 is the first page
    url_2 = url + '&page=' + str(i)
    req = requests.get(url_2, cookies=cookies, headers=headers)
    text = req.content.decode('GB2312', 'ignore')
    soup = BeautifulSoup(text, features='lxml')
    authorname = soup.find_all(name='td', attrs={'class': 'authorname'})
    for cell in authorname:
        tag = cell.find(color="#999999")
        # the same person may reply several times in a thread,
        # so keep each id only once
        if tag is not None and tag.string and tag.string not in dis_li:
            dis_li.append(tag.string)
    time.sleep(5)  # pause between page requests

# print the deduplicated ids one by one
print('Ids in this thread:')
for count_id, ev_id in enumerate(dis_li, start=1):
    print(str(count_id) + '______' + ev_id)
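One detail of the listing worth a second look is `decode('GB2312', 'ignore')`: any character outside GB2312 is dropped without warning. GB18030 is a superset of GB2312, so decoding with it loses nothing for pages that really are GB2312. The sketch below only illustrates the difference on a made-up byte string; whether the forum's pages actually contain such characters is an assumption, and `decode_page` is a hypothetical helper name.

```python
def decode_page(raw_bytes):
    """Decode forum HTML bytes with GB18030, a superset of GB2312."""
    # errors="replace" marks undecodable bytes instead of silently dropping them
    return raw_bytes.decode("gb18030", errors="replace")


# Made-up sample: the last character is not part of GB2312
sample = "晋江镕".encode("gb18030")
print(sample.decode("gb2312", "ignore"))
print(decode_page(sample))
```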
