Scraping Jokes for a Good Laugh

Latest recommended article published on 2019-04-18 00:26:19

半吊子Py全栈工程师

Columns: 爬虫 (Crawlers), python之多方面应用 (Assorted Python Applications). Tags: 爬虫 (crawler), py2

Article link: https://blog.csdn.net/qq_26877377/article/details/79532657

Scraping jokes with Python 2 ~~

# coding=utf-8

import urllib2
import re


class Pacong(object):
    def __init__(self, begin=1):
        self.begin = begin      # number of the next page to fetch
        self.confirm = True     # loop flag used by command()
        self.filename = 1       # counter used to name the output files

    def get_html(self):
        """Fetch one HTML page and hand it to clear()."""
        url = "http://xiaohua.zol.com.cn/aiqing/" + str(self.begin) + ".html"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        # Decode the page as GBK, then re-encode it as UTF-8
        html = response.read().decode("gbk").encode("utf-8")
        self.clear(html)

    def clear(self, html):
        """Extract the wanted text from the HTML with a regular expression."""
        pattern = re.compile(r'<div\sclass="summary-text">(.*?)</div>', re.S)
        content_list = pattern.findall(html)

        # Each page is written to a different file
        filenames = "page_" + str(self.filename) + ".txt"
        for i in content_list:
            # "&nbsp;" is removed first so that no stray ";" is left behind
            i = i.replace("<p>", " ").replace("</p>......", " ").replace("</p>", " ").replace("&nbsp;", "").replace("&nbsp", "")
            self.writePage(i, filenames)
        self.filename += 1

    def writePage(self, i, filenames):
        """Append the cleaned text to a local file."""
        with open(filenames, "a+") as f:
            f.write(i)

    def command(self):
        """Control the program loop."""
        while self.confirm:
            # Call the method on self rather than on the global duanzi instance
            self.get_html()
            com = raw_input("Continue scraping pages? (press Enter to continue, type exit to quit) ")
            if com == "exit":
                break
            self.begin += 1


if __name__ == '__main__':
    duanzi = Pacong()
    duanzi.command()
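
The extraction step in clear() can be tried offline, without hitting the site. The snippet below is only a minimal sketch: the function name extract_jokes and the SAMPLE_HTML fragment are hypothetical and hand-written for illustration (not taken from the real page), and the extra "&nbsp;" replacement and the strip() call are my additions on top of the replace chain used above. It runs under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as in clear(): non-greedy match of each summary-text block.
SUMMARY_PATTERN = re.compile(r'<div\sclass="summary-text">(.*?)</div>', re.S)


def extract_jokes(html):
    """Return the cleaned text of every summary-text block in html.

    Hypothetical helper for illustration; it mirrors the replace chain
    in clear() and additionally strips surrounding whitespace.
    """
    jokes = []
    for raw in SUMMARY_PATTERN.findall(html):
        text = (raw.replace("<p>", " ")
                   .replace("</p>......", " ")
                   .replace("</p>", " ")
                   .replace("&nbsp;", "")   # assumption: entity may end with ";"
                   .replace("&nbsp", ""))
        jokes.append(text.strip())
    return jokes


# Hand-written sample, NOT fetched from the real site.
SAMPLE_HTML = """
<div class="summary-text"><p>First joke.</p></div>
<div class="other">ignored</div>
<div class="summary-text">
<p>&nbsp;&nbsp;Second joke, line one.</p>
<p>Line two.</p>......
</div>
"""

if __name__ == "__main__":
    for joke in extract_jokes(SAMPLE_HTML):
        print(joke)
```

Separating the extraction from the file writing this way makes it easy to check the regular expression against a saved copy of a page before running the full crawler.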
