The simplest way to scrape email addresses in Python: a simple Python crawler that extracts emails

Latest recommended article published on 2023-09-26 05:08:50

weixin_39880623


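Recently my teacher assigned an exercise: implement a crawler that fetches roughly 100 web pages and extracts the email addresses from them.

So I spent a few days getting familiar with Python, and the result is the very simple crawler below. It has all sorts of problems.

First, a note on installing Python libraries, because I wasted quite a bit of time on this.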

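First come pip and distribute. These two tools are used to manage and install Python libraries. For details, see http://jiayanjujyj.iteye.com/blog/1409819

On Windows, run python distribute_setup.py on the command line (from the directory that contains distribute_setup.py). After that, the easy_install command can be used to install other modules.

pyquery depends on lxml. That module needs a C compiler on the local machine, and if VS or MinGW-related tools are installed, it is easy to run into all kinds of installation failures. In theory, with the right configuration, lxml gets built with the local compiler. In practice I just could not get it to install, so I ended up at http://www.lfd.uci.edu/~gohlke/pythonlibs/.

On that site, find the build that matches your version, install it, and you are done.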

 1 import urllib2
 2 import re
 3 from pyquery import PyQuery as pq
 4 from lxml import etree
 5 import sys
 6 import copy
 7 ##reload(sys)
 8 ##sys.setdefaultencoding("utf8")

11 mailpattern = re.compile('[^\._:>\\-][\w\.-]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+')
12 #mailpattern = re.compile('[A-Za-z0-9]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+')

14 url = "http://www.xxx.cn"
15 firstUrls = []  #to store the urls
16 secondUrls = []
17 count = 1  #to count levels
18 furls = open("E:/py/crawler/urlsRecord.txt","a")
19 fmail = open("E:/py/crawler/mailresult.txt","a")
20

23 def geturls(data):  #the function to get the urls in the html
24     urls = []
25     d = pq(data)
26     label_a = d.find('a')  # use pyquery to find the a tags
27     if label_a:
28         label_a_href = d('a').map(lambda i, e:pq(e)('a').attr('href'))
29         for u in label_a_href:
30             if u[0:10]!="javascript":
31                 if u[0:4] == "http":
32                     urls.append(u)
33                 else:
34                     urls.append(url + u)
35         for u in urls:
36             furls.write(u)
37             furls.write('\n')
38     return urls
39

41 def savemails(data):  #the function to save the emails
42     mailResult = mailpattern.findall(data)
43     if mailResult:
44         for u in mailResult:
45             print u
46             fmail.write(u)
47             fmail.write('\n')
48
49 def gethtml(url):
50     fp = urllib2.urlopen(url)
51     mybytes = fp.read()
52     myWebStr = mybytes.decode("gbk")  # what is read must be converted from bytes to text
53     fp.close()
54     return myWebStr
55

57 furls.write(url+'\n')
58
59 myWebStr = gethtml(url)
60 if myWebStr:
61     savemails(myWebStr)
62     firstUrls = geturls(myWebStr)
63     if firstUrls:
64         for i in range(0,len(firstUrls)):
65             html = gethtml(firstUrls[i])
66             if html:
67                 savemails(html)
68 ##                tempurls = geturls(html)  # I wanted to crawl one more level here, but it was painfully slow, so I stopped
69 ##                if tempurls:
70 ##                    nexturls = nexturls + tempurls

72 ##    if nexturls:
73 ##        for i in range(0,len(nexturls)):
74 ##            nexthtml = gethtml(nexturls[i])
75 ##            if nexthtml:
76 ##                savemails(nexthtml)

80 fmail.close()
81 furls.close()
82
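One weak spot in geturls is that relative links are built by plain string concatenation (url + u), which only produces a valid address when the href happens to start with a slash. As a hedged sketch, written for Python 3 rather than the Python 2 used in the post, the standard-library function urllib.parse.urljoin can resolve links instead. The helper name resolve_links and the sample href values are made up for illustration and are not part of the original crawler.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Return absolute http(s) URLs for the given href values.

    Hypothetical helper for illustration; not part of the original script.
    """
    resolved = []
    for href in hrefs:
        if not href:  # <a> tags without an href attribute yield None
            continue
        absolute = urljoin(base_url, href.strip())
        # Keep only web pages; this drops javascript:, mailto: and similar links.
        if absolute.startswith(("http://", "https://")):
            resolved.append(absolute)
    return resolved


links = resolve_links(
    "http://www.xxx.cn/",
    ["about.html", "/contact/", "http://example.com/x", "javascript:void(0)", None],
)
print(links)
```

Filtering on the resolved scheme also covers the javascript check that the original code does with u[0:10].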

The problems with the program as it stands are:

1. If it is run as is, an encoding error occurs:

Traceback (most recent call last):
  File "E:\py\crawler.py", line 67, in <module>
    savemails(html)
  File "E:\py\crawler.py", line 46, in savemails
    fmail.write(u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u81f3' in position 0: ordinal not in range(128)

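I googled it, and it turned out to be an encoding problem.

reload(sys)

sys.setdefaultencoding("utf8")

This makes the error go away; these are the two lines commented out at lines 7 and 8 of the code above. But then a new problem appeared: although the error no longer occurs, the print statement on line 45 stopped producing any output. In fact, print statements anywhere in the program stopped working. Why is that?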
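Rather than changing the interpreter-wide default with sys.setdefaultencoding, a more common approach is to open the file with an explicit encoding, so that write() knows how to turn Unicode text into bytes. The sketch below uses io.open, which is available in both Python 2.6+ and Python 3. The temporary file path and the sample string are placeholders, not the files used by the script. As for print going silent, one frequently cited explanation, assuming the script was run inside IDLE, is that reload(sys) resets sys.stdout to the original console stream instead of IDLE's replacement, so the output no longer reaches the IDLE window.

```python
import io
import os
import tempfile

# Placeholder path for this demo; the script itself writes to
# E:/py/crawler/mailresult.txt.
path = os.path.join(tempfile.mkdtemp(), "mailresult.txt")

text = u"\u81f3huaweibiancheng@163.com"

# With an explicit encoding, Unicode text is encoded as UTF-8 on write...
with io.open(path, "a", encoding="utf-8") as fmail:
    fmail.write(text)
    fmail.write(u"\n")

# ...and decoded back to Unicode text on read.
with io.open(path, "r", encoding="utf-8") as fmail:
    saved = fmail.read()

print(saved == text + u"\n")
```

With this approach the two commented-out lines at the top of the script would not be needed, and sys.stdout is left alone.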

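At line 46, I tried printing the value that caused the problem and found that the link contained the following:

至huaweibiancheng@163.com

When fmail.write(u) hits something like this, the write fails. I looked it up, and the Unicode code point of '至' is indeed 81f3 (checked at http://ipseeker.cn/tools/pywb.php).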
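The stray character ends up in the match because the first character class in mailpattern is a negated class: it accepts any single character other than the few listed, and that includes a Chinese character. Here is a small Python 3 sketch of this behaviour, compared with the stricter pattern that is commented out on line 12 of the script. The variable names loose, strict and sample are mine.

```python
import re

# The two patterns from the script, written as raw strings.
loose = re.compile(r'[^\._:>\-][\w\.-]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+')
strict = re.compile(r'[A-Za-z0-9]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+')

# "\u81f3" is the character that appears in front of the address on the page.
sample = "\u81f3huaweibiancheng@163.com"

loose_matches = loose.findall(sample)
strict_matches = strict.findall(sample)

print(loose_matches)   # the leading character is part of the match
print(strict_matches)  # only the ASCII address is matched
```

The stricter pattern has its own trade-off: it does not allow dots, underscores or hyphens in the part before the @, so addresses such as first.last@example.com would be truncated.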

At this point I wondered: is it that write() cannot write Chinese? I tested with the following code:

poem = '至huaweibiancheng@163.com'
f = open("E:/py/poem.txt","w")
f.write(poem)
f.close()
f = open("E:/py/poem.txt",'r')
while True:
    line = f.readline()
    if len(line) == 0:
        break
    print(line)
f.close()

Result:

至huaweibiancheng@163.com

Then I changed the code to poem = u'\u81f3', and exactly the same error appeared:

Traceback (most recent call last):
  File "E:\py\test.py", line 212, in <module>
    f.write(poem)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u81f3' in position 0: ordinal not in range(128)

In other words, in the fetched page the text exists as u'\u81f3'huaweibiancheng@163.com, and that cannot be written to the file, hence the error.
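The difference between the two tests is the string type. In Python 2, a literal without the u prefix is already a byte string, which write() passes through unchanged, while u'\u81f3' is Unicode text that has to be encoded first, and the implicit codec is ASCII. The sketch below reproduces that encoding step explicitly. It is written for Python 3, where every str is Unicode, so the failure is triggered with an explicit encode("ascii") call rather than by write().

```python
text = "\u81f3huaweibiancheng@163.com"

# Encoding with ASCII fails on the very first character, as in the traceback.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(ascii_error)

# Encoding with UTF-8 succeeds; the first character becomes three bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes[:3])
```

In the Python 2 script, the equivalent fix would be to encode each match explicitly before writing it, for example with u.encode("utf-8"), or to open the output file with an explicit encoding.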

If anyone who knows this well can explain, I would really appreciate it.
