5.17
Learning regular expressions
Main code for scraping the novel 斗破苍穹 (Doupo Cangqiong):
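import re
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
# Append mode; utf-8 so the Chinese text is written without garbling
f = open('C:\\Users\\456\\Desktop\\doupo.txt', 'a+', encoding='utf-8')

def get_info(url):
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        # re.S lets '.' match newlines, so paragraphs that span several lines are captured
        contents = re.findall('<p>(.*?)</p>', res.content.decode('utf-8'), re.S)
        for content in contents:
            f.write(content + '\n')
    else:
        # Skip pages that did not return 200
        pass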
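To see what the non-greedy group and the re.S flag each contribute to the pattern used above, here is a small offline sketch. The sample_html fragment is made up for illustration and is not taken from the novel site.

```python
import re

# Hypothetical page fragment (an assumption, not real site markup)
sample_html = """<div>
<p>First paragraph,
still the first paragraph.</p>
<p>Second paragraph.</p>
</div>"""

# Same pattern as in get_info: non-greedy group, '.' also matches newlines
non_greedy = re.findall('<p>(.*?)</p>', sample_html, re.S)

# Without re.S, '.' stops at a newline, so the two-line paragraph is missed
without_dotall = re.findall('<p>(.*?)</p>', sample_html)

# A greedy group runs from the first <p> to the last </p> in one match
greedy = re.findall('<p>(.*)</p>', sample_html, re.S)

print(non_greedy)
print(without_dotall)
print(greedy)
```

The first call returns both paragraphs separately, the second returns only the single-line one, and the third returns one long string that still contains the inner tags.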
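#Problems encountered
headers = {'User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
should be written as
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
In the first version the key and the value sit inside one pair of quotes, so the braces contain a single string and Python builds a set, not a dict with a 'User-Agent' key.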
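The difference between the two spellings can be checked without sending any request. In this minimal sketch the User-Agent value is shortened to 'Mozilla/5.0' purely for readability, and the variable names wrong_headers and right_headers are illustrative.

```python
# One quoted string inside braces: Python reads this as a set literal
wrong_headers = {'User-Agent:Mozilla/5.0'}

# Key and value quoted separately, with the colon outside the quotes: a dict
right_headers = {'User-Agent': 'Mozilla/5.0'}

print(type(wrong_headers).__name__)
print(type(right_headers).__name__)
print(right_headers['User-Agent'])
```

Printing type(headers) before calling requests.get is a quick way to catch this kind of quoting slip.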
Main code for scraping jokes from Qiushibaike (糗事百科):
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
info_lists = []  # records collected from every page

def judgment_sex(class_name):
    # The gender is encoded in the CSS class name of the icon
    if class_name == 'womenIcon':
        return 'female'
    else:
        return 'male'

def get_info(url):
    res = requests.get(url, headers=headers)
    ids = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    print(ids)
    # Match either gender class; matching only manIcon skips female users and misaligns zip
    levels = re.findall('<div class="articleGender .*?">(.*?)</div>', res.text, re.S)
    print(levels)
    sexs = re.findall('<div class="articleGender (.*?)">', res.text, re.S)
    print(sexs)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    print(contents)
    for user_id, level, sex, content in zip(ids, levels, sexs, contents):
        info = {
            'id': user_id.strip(),
            'level': level.strip(),
            'sex': judgment_sex(sex),
            'content': content.strip()
        }
        info_lists.append(info)

if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(str(num)) for num in range(1, 36)]
    for single_url in urls:
        get_info(single_url)
    # Open the file once; utf-8 avoids garbled Chinese text
    with open('C:\\Users\\456\\Desktop\\duanzi.txt', 'a+', encoding='utf-8') as f:
        for info_list in info_lists:
            f.write(info_list['id'] + '\n')
            f.write(info_list['level'] + '\n')
            f.write(info_list['sex'] + '\n')
            f.write(info_list['content'] + '\n')
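Errors encountered and how they were fixed
1. When using zip, the for loop never produced any info. I was not yet familiar with zip, so I printed ids, levels, sexs, and contents one after another and found that contents was empty. Then it clicked: zip stops at its shortest input, and the contents = re.findall() statement was written incorrectly and matched nothing. After fixing it, the output was written to the txt file on the desktop.
2. f = open('C:\\Users\\456\\Desktop\\duanzi.txt', 'a+', encoding='utf-8'): I had pasted the path exactly as copied. Inside a normal string literal it should use double backslashes \\ rather than a single \. I had also left out utf-8, which produced garbled text.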
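Both mistakes can be reproduced offline. The sketch below uses made-up sample lists, and it writes to a temporary directory instead of the desktop path so that it runs on any machine. The raw-string spelling is shown only as an alternative to doubling the backslashes.

```python
import os
import tempfile

# 1. zip stops at the shortest input, so one empty list empties the whole loop
ids = ['user_a', 'user_b']        # made-up sample data
levels = ['12', '30']
sexs = ['manIcon', 'womenIcon']
contents = []                     # what a pattern that matches nothing returns
rows = list(zip(ids, levels, sexs, contents))
print(rows)

# 2a. Doubled backslashes and a raw string spell the same Windows path.
# A plain 'C:\Users...' literal would not even compile, because \U starts
# a unicode escape in Python 3.
escaped = 'C:\\Users\\456\\Desktop\\duanzi.txt'
raw = r'C:\Users\456\Desktop\duanzi.txt'
print(escaped == raw)

# 2b. Passing encoding='utf-8' on both write and read keeps Chinese text intact
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'duanzi.txt')
    with open(path, 'a+', encoding='utf-8') as f:
        f.write('糗事百科\n')
    with open(path, encoding='utf-8') as f:
        round_trip = f.read()
print(round_trip)
```

Comparing len(ids), len(levels), len(sexs), and len(contents) before the zip call is a cheap way to spot a pattern that matched nothing.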
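Remaining problems
I cannot tell whether the content inside a <p> tag is the novel text, an advertisement, or some other link.
I have not found a way to remove the <br/> tags from the jokes file.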
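Both remaining problems might be approachable with a little more regex work. The sketch below is only one possible direction: strip_br and body_paragraphs are hypothetical helper names, and the id "chapter" in the sample markup is an assumption, since the real container of the novel text has to be looked up in the page source.

```python
import re

def strip_br(text):
    """Replace <br>, <br/> and <br /> (any letter case) with newlines."""
    return re.sub(r'<br\s*/?>', '\n', text, flags=re.I)

def body_paragraphs(html, container_id):
    """Return the <p> contents found inside <div id="container_id">...</div>.

    Assumes the container holds no nested <div>; a nested one would end
    the non-greedy match too early.
    """
    pattern = '<div id="{}">(.*?)</div>'.format(re.escape(container_id))
    block = re.search(pattern, html, re.S)
    if block is None:
        return []
    return re.findall('<p>(.*?)</p>', block.group(1), re.S)

# Made-up markup: one advertisement block and one chapter block
sample_page = (
    '<div id="ad"><p>Buy now</p></div>'
    '<div id="chapter"><p>Line one</p><p>Line two</p></div>'
)

print(body_paragraphs(sample_page, 'chapter'))
print(strip_br('first<br/>second<br>third<BR />end'))
```

Narrowing the search to one container first keeps advertisement paragraphs out of the result, provided the site really wraps the chapter text in its own element. For heavily nested pages an HTML parser would likely be more robust than regular expressions.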