There is a detailed blog post on this topic: https://blog.csdn.net/weixin_42488570/article/details/80794087
Goal: extract each post's user ID, user level, user gender, the joke text, and the laugh and comment counts, as shown in the figure below:
User ID:
user = re.findall('<h2.*?>(.*?)</h2>', text, flags=re.DOTALL)
Joke text:
text = re.findall('<div class="content">.*?<span>(.*?)</span>', text, re.S)
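To see how these non-greedy patterns behave, here is a small self-contained check against a made-up HTML fragment (the fragment is invented for illustration; the real page markup may differ slightly):

```python
import re

# A made-up snippet mimicking the page structure (for illustration only)
html = '''
<div class="author"><h2>
Alice
</h2></div>
<div class="content"><span>First joke
line two</span></div>
'''

# re.S / re.DOTALL lets '.' match newlines, so the patterns can span lines;
# the non-greedy '.*?' stops at the first closing tag it reaches.
users = re.findall('<h2.*?>(.*?)</h2>', html, flags=re.DOTALL)
contents = re.findall('<div class="content">.*?<span>(.*?)</span>', html, re.S)

print([u.strip() for u in users])  # the captured names carry surrounding newlines, so strip them
print(contents)
```

Note that the `<h2>` capture includes the newlines around the name; this is exactly the stray whitespace we will clean up later.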
import re

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'
    # 'referer': 'https://dytt8.net/html/gndy/dyzz/list_23_2.html'
}


def judgment_sex(class_name):
    # The captured class fragment looks like ' womenIcon' or ' manIcon'
    if 'women' in class_name:
        return '女'  # female
    else:
        return '男'  # male


def parse_page(url):
    response = requests.get(url, headers=headers)
    text = response.text
    users = re.findall(r'<h2.*?>(.*?)</h2>', text, flags=re.DOTALL)
    sexs = re.findall(r'<div class="articleGender(.*?)">', text, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', text, re.S)
    laughs = re.findall(r'<i class="number.*?>(\d+)</i>', text, flags=re.DOTALL)
    info_lists = []
    for user, sex, content, laugh in zip(users, sexs, contents, laughs):
        info = {
            'user': user,
            'sex': judgment_sex(sex),
            'content': content,
            'laugh': laugh
        }
        info_lists.append(info)
    print(info_lists)
    # Save the results to a local file (optional)
    with open('C:\\Users\\wei\\Desktop\\qiushi.txt', 'a+', encoding='utf-8') as f:
        for info_list in info_lists:
            f.write(info_list['user'] + '\n')
            f.write(info_list['sex'] + '\n')
            f.write(info_list['content'] + '\n')
            f.write(info_list['laugh'] + '\n')


def spider():
    url = 'https://www.qiushibaike.com/text/page/2/'
    parse_page(url)


if __name__ == '__main__':
    spider()
Results
Looking at the output, we can see that it still contains extra whitespace and other stray strings.
To optimize, we strip those extra characters out.
The modified code is as follows:
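One plausible way to do that cleanup is to strip the surrounding whitespace and turn `<br/>` tags into plain newlines in each captured string (a minimal sketch; the original post's revised code may differ):

```python
import re


def clean_content(raw):
    """Remove <br/> tags and surrounding whitespace from a captured snippet."""
    no_br = re.sub(r'<br\s*/?>', '\n', raw)  # replace <br> / <br/> with newlines
    return no_br.strip()                     # drop leading/trailing whitespace


print(clean_content('\n\n一段糗事<br/>第二行\n'))
```

Calling `clean_content(content)` before building the `info` dict would store cleaned text instead of the raw capture.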