Scraping Jokes from a Website

Latest recommended article published on 2019-01-30 12:13:32

南方划水的banana

Categories: python, web crawlers. Tags: regular expressions, url

Article link: https://blog.csdn.net/superce/article/details/66529776

Use the requests library and a regular expression to scrape jokes from a page and save them to a .txt file.

Link: https://github.com/Spacewe/python

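# Requires Python 3 and the requests package
import re
import requests

url = "http://hahahahhaahah.com/"
# Send a browser-like User-Agent so the site serves the normal page
header = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
haha = requests.get(url, headers=header)
haha.encoding = 'utf-8'  # decode the response body as UTF-8
# print(haha.text)  # uncomment to inspect the raw HTML

# Non-greedy match of everything between <p> and </p>;
# re.S lets '.' match newlines, so multi-line paragraphs are captured too
heihei = re.findall('<p>(.*?)</p>', haha.text, re.S)

# Text mode with an explicit encoding; the with-block closes the file for us
with open('neihan.txt', 'w', encoding='utf-8') as fp:
    for each in heihei:
        print(each)
        print('-' * 100)
        fp.write(each)
        fp.write("\n\n")  # blank line so consecutive jokes do not run together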
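
The regex step can be tried offline, without touching the network. The sketch below is only an illustration in Python 3: `SAMPLE_HTML` is a made-up page, and `extract_paragraphs` is a hypothetical helper that is not part of the original script. It applies the same `<p>(.*?)</p>` pattern with `re.S`, then strips nested tags and decodes HTML entities, which the original script leaves in the saved text.

```python
import html
import re

# Made-up sample page for illustration; the real site's markup may differ.
SAMPLE_HTML = """
<div class="post">
<p>First joke,
split over two lines.</p>
<p>Second joke with <b>bold</b> text &amp; an entity.</p>
</div>
"""

# Same pattern as the script: non-greedy, and re.S lets '.' cross newlines.
P_PATTERN = re.compile(r'<p>(.*?)</p>', re.S)
# Matches any single tag, e.g. <b> or </b>, so it can be removed.
TAG_PATTERN = re.compile(r'<[^>]+>')


def extract_paragraphs(page):
    """Return the cleaned text of every <p>...</p> block in page."""
    results = []
    for raw in P_PATTERN.findall(page):
        text = TAG_PATTERN.sub('', raw)  # drop nested tags such as <b>
        text = html.unescape(text)       # turn &amp; back into &
        results.append(text.strip())
    return results


jokes = extract_paragraphs(SAMPLE_HTML)
for joke in jokes:
    print(joke)
    print('-' * 20)
```

Without `re.S`, the first paragraph would be skipped because it contains a line break. Note that the pattern only matches a bare `<p>` tag; paragraphs written as `<p class="...">` would need a looser pattern or an HTML parser.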