A simple Python crawler: collecting email addresses from Tieba

Latest recommended article published on 2018-04-18 22:13:12

liuyukuan

Categories: Python, regular expressions, crawlers

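This is a fairly simple crawler that uses only two standard-library modules, re and urllib.
The program is written for Python 2.7.
The urllib module fetches the raw HTML of a page.
The re module matches specific strings in that HTML.

1. Find the last page of the thread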

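html = urllib.urlopen(url).read()
# The "尾页" (last page) link holds the final page number in its pn= parameter.
# [^"]* keeps the match inside a single href; (\d+) captures only the digits.
reyuan = r'<a href="[^"]*pn=(\d+)">尾页</a>'
recom = re.compile(reyuan)
refind = recom.findall(html)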

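Note: declare the source file encoding as utf-8 (the #coding:utf-8 line). The pattern contains the Chinese text 尾页, and in Python 2 it is compared byte for byte with the downloaded page, so if the file is declared as gb2312 the bytes differ and the last-page link is not found. This is purely a character-encoding issue. In Python 3, where strings are Unicode and the page is decoded before matching, this particular problem does not arise.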
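
To try the last-page pattern without touching the network, here is a minimal Python 3 sketch run against a made-up fragment of pager HTML. The function name `last_page` and the sample markup are assumptions for illustration, not Tieba's actual page source. The fallback to page 1 when no link is found is also an assumption about how a single-page thread should be treated.

```python
import re

# Pattern for the "尾页" (last page) link; [^"]* keeps the match inside one href.
LAST_PAGE_RE = re.compile(r'<a href="[^"]*pn=(\d+)">尾页</a>')


def last_page(html_bytes, encoding="utf-8"):
    """Return the last page number found in the pager, or 1 if there is none."""
    html = html_bytes.decode(encoding, errors="replace")
    match = LAST_PAGE_RE.search(html)
    return int(match.group(1)) if match else 1


# Hypothetical pager markup, encoded the way a UTF-8 page would arrive.
sample = ('<a href="/p/4194772383?pn=2">2</a>'
          '<a href="/p/4194772383?pn=7">尾页</a>').encode("utf-8")

print(last_page(sample))              # 7
print(last_page(b"<p>no pager</p>"))  # 1
```

Decoding with the wrong encoding garbles 尾页 and the link is missed, which is the Python 3 counterpart of the gb2312 problem described in the note.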

2. Walk through the pages and collect the emails

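a = int(refind[0])  # last page number, taken from the code above
i = 1               # page numbers start at 1
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}'
while i <= a:
    content = urllib.urlopen(url + str(i)).read()
    # Prints "Now downloading page i of a"
    print("现在在下载第" + str(i) + "页，总共" + str(a) + "页")
    i += 1
    # Collect every substring on the page that looks like an email address
    items = re.findall(pattern, content)
    for item in items:
        print item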
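
The email pattern can be tried on a small string before running it on real pages. The following is a Python 3 sketch; the function name `find_emails` and the sample text are made up for illustration. It also drops duplicates while keeping first-seen order, which the loop above does not do.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}')


def find_emails(text):
    """Return the email-like substrings in text, without duplicates."""
    seen = set()
    result = []
    for email in EMAIL_RE.findall(text):
        key = email.lower()  # treat addresses as case-insensitive when comparing
        if key not in seen:
            seen.add(key)
            result.append(email)
    return result


sample = "求资源 123456@qq.com 谢谢！also abc.def@163.com, 123456@qq.com"
print(find_emails(sample))  # ['123456@qq.com', 'abc.def@163.com']
```

Because the character classes list only ASCII letters, digits and a few symbols, the surrounding Chinese text is never swallowed into a match.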
 

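Note: if the Chinese text printed by the loop comes out garbled (for example in a Windows console that expects GB2312), re-encode each string literal before printing:

        print("现在在下载第".decode("utf-8").encode("gb2312")+str(i)+"页，总共".decode("utf-8").encode("gb2312")+str(a) +"页".decode("utf-8").encode("gb2312"))
        i += 1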
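
The garbling comes from a mismatch between the bytes the script emits (UTF-8) and the encoding the console expects. A small Python 3 sketch of the same effect, using a made-up message string; Python 3's `print` normally performs this conversion itself, so the explicit re-encoding is only relevant to the Python 2 code in this article.

```python
message = "现在在下载第1页"

utf8_bytes = message.encode("utf-8")   # what a UTF-8 source file stores
gb_bytes = message.encode("gb2312")    # what a GB2312 console expects

# Reading UTF-8 bytes as if they were GBK yields the wrong characters.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled == message)              # False

# decode("utf-8").encode("gb2312") converts between the two byte forms.
converted = utf8_bytes.decode("utf-8").encode("gb2312")
print(converted == gb_bytes)           # True
print(converted.decode("gb2312"))      # 现在在下载第1页
```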
 

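3. Save the emails to a file

# Open the output file, write one email per line, then close it
file = open("E:\\python\\qqcom1.txt","w+")
for item in items:
    file.write(item+ '\n')
file.close()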

Note: remember to close the file when you are done.
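
One way to make sure the file is always closed, even if a write fails, is a `with` block. A minimal Python 3 sketch; the helper name `save_emails` and the temporary path are assumptions for the example, not part of the original script.

```python
import os
import tempfile


def save_emails(emails, path):
    """Write one email per line and return how many lines were written."""
    with open(path, "w", encoding="utf-8") as handle:
        for email in emails:
            handle.write(email + "\n")
    # The file is closed here, whether or not an exception occurred.
    return len(emails)


path = os.path.join(tempfile.mkdtemp(), "emails.txt")
count = save_emails(["123456@qq.com", "abc.def@163.com"], path)

with open(path, encoding="utf-8") as handle:
    print(handle.read().splitlines())  # ['123456@qq.com', 'abc.def@163.com']
print(count)  # 2
```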

4. The complete script

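#coding:utf-8
import urllib
import re

url = "http://tieba.baidu.com/p/4194772383?pn="

def get_ye(url):
    # Return the last page number, read from the "尾页" (last page) link
    html = urllib.urlopen(url).read()
    reyuan = r'<a href="[^"]*pn=(\d+)">尾页</a>'
    recom = re.compile(reyuan)
    refind = recom.findall(html)
    # If no such link is found, treat the thread as a single page
    return int(refind[0]) if refind else 1


def get_qq():
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}'
    ye = get_ye(url)   # fetch the page count once, not on every iteration
    i = 1              # current page
    j = 0              # number of emails written so far
    file = open("E:\\python\\qqcom1.txt","w+")
    while i <= ye:
        content = urllib.urlopen(url+str(i)).read()
        # Prints "Now downloading page i of ye"
        print("现在在下载第"+str(i)+"页，总共"+str(ye) +"页")
        i += 1
        items = re.findall(pattern,content)
        for item in items:
            file.write(item+ '\n')
            j += 1
    print "结束"   # "Finished"
    # The last line of the file records the total number of emails
    file.write(str(j)+ '\n')
    print j
    file.close()


if __name__=="__main__":
    get_qq()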

If the output is garbled, append .decode("utf-8").encode("gb2312") to each Chinese string literal that is printed.
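
For readers on Python 3, where `urllib.urlopen` has moved to `urllib.request.urlopen` and `print` is a function, the same flow might look like the sketch below. The function names `fetch` and `crawl` and the injectable `fetch` parameter are assumptions added so the logic can be exercised without network access. The page markup in the usage example is invented, and decoding pages as UTF-8 is also an assumption.

```python
import re
import urllib.request

LAST_PAGE_RE = re.compile(r'<a href="[^"]*pn=(\d+)">尾页</a>')
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}')


def fetch(url):
    """Download a page and decode it as UTF-8 (assumed page encoding)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(base_url, fetch=fetch):
    """Return every email-like string found on pages 1..last of a thread."""
    first = fetch(base_url + "1")
    match = LAST_PAGE_RE.search(first)
    last = int(match.group(1)) if match else 1
    emails = EMAIL_RE.findall(first)
    for page in range(2, last + 1):
        emails.extend(EMAIL_RE.findall(fetch(base_url + str(page))))
    return emails


# Offline usage example with a fake two-page thread (hypothetical markup).
fake_pages = {
    "t?pn=1": 'a@qq.com <a href="t?pn=2">尾页</a>',
    "t?pn=2": "b@163.com and c@sina.com",
}
found = crawl("t?pn=", fetch=fake_pages.__getitem__)
print(found)  # ['a@qq.com', 'b@163.com', 'c@sina.com']
```

Calling `crawl` with only the article's URL prefix would use the real `fetch`; whether the patterns still match depends on the site's current markup, which may have changed since this was written.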

(Written on December 16, 2015, http://blog.csdn.net/bzd_111)
