Multithreaded Crawling of Baidu Baike

Latest recommended article published on 2024-09-26 19:15:00

ao91544

Tags: python

Original link: http://www.cnblogs.com/vorphan/p/7476431.html

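Preface:
This started as a single note in Evernote, and it took me three blog posts to work through it... I really am a beginner. Crawling Baidu Baike turned out not to be very different from crawling the story site (故事网): after changing the encoding and a bit of debugging, the pages could be crawled. What I am crawling is probably still very basic, though. After a little over 3,000 links the server seemed to start refusing my requests, and the crawl was also slow. I will probably need to look into optimization or multiprocessing later, since Python's multithreading is "fake" (the GIL keeps threads from running Python code in parallel).
Reference code and project source for this post: http://kexue.fm/archives/4385/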

Source code:

#! -*- coding:utf-8 -*-
import datetime
import re
import threading
import time
from multiprocessing.dummy import Pool, Queue
from urllib import parse

import pymysql
import requests as rq
from bs4 import BeautifulSoup

# Browser-like request headers (the User-Agent string is an example value).
# Without them the server may close the connection; see the BUG note below.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

tasks = Queue()
tasks_pass = set()  # links that have already been queued
start_url = 'http://baike.baidu.com/item/科学'
tasks.put(start_url)
tasks_pass.add(start_url)
count = 0  # total number of pages crawled
lock = threading.Lock()  # guards tasks_pass and count across threads

url_split_re = re.compile(r'&|\+')

def clean_url(url):
    # Keep only scheme, host and path, then cut at the first '&' or '+'.
    url = parse.urlparse(url)
    return url_split_re.split(parse.urlunparse((url.scheme, url.netloc, url.path, '', '', '')))[0]

def main():
    global count
    # One database connection per worker thread instead of one per link.
    db = pymysql.connect(host='localhost', user='testuser', password='test123',
                         database='TESTDB', charset='utf8')
    sql = "insert ignore into baidubaike(url,title) values(%s,%s);"
    while True:
        url = tasks.get()  # take a URL off the queue (blocks while the queue is empty)
        try:
            web = rq.get(url, headers=HEADERS, timeout=10).content.decode('utf8', 'ignore')
        except rq.RequestException:
            continue  # skip pages that fail to download
        urls = re.findall(r'href="(/item/.*?)"', web)  # find all in-site links
        for u in urls:
            u = parse.unquote(u)  # decode percent-encoded characters
            u = clean_url('http://baike.baidu.com' + u)
            with lock:
                if u not in tasks_pass:  # queue only links not seen before
                    tasks_pass.add(u)
                    tasks.put(u)
        # Store the title of the page that was just downloaded.
        title = BeautifulSoup(web, "lxml").title
        text = title.get_text() if title else ''
        print(datetime.datetime.now(), '   ', url, '   ', text)
        with db.cursor() as dbc:
            dbc.execute(sql, (url, text))
        db.commit()
        with lock:
            count += 1
            if count % 100 == 0:
                print('%s done.' % count)

# Multithreaded crawl with 4 threads; main is passed as the pool initializer,
# so every worker thread runs the crawl loop and never returns from it.
pool = Pool(4, main)
total = 0
while True:  # end the script if no new links were queued during the last 60 seconds
    time.sleep(60)
    if len(tasks_pass) > total:
        total = len(tasks_pass)
    else:
        break

pool.terminate()
print("terminated normally")
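
The deduplication in the crawler depends on clean_url mapping different spellings of the same entry to one key. The following is a minimal sketch of that step in isolation, using only the standard library; the href values are hypothetical examples, not links taken from a real Baidu Baike page.

```python
import re
from urllib import parse

url_split_re = re.compile(r'&|\+')

def clean_url(url):
    """Drop params, query and fragment, then cut at the first '&' or '+'."""
    parts = parse.urlparse(url)
    bare = parse.urlunparse((parts.scheme, parts.netloc, parts.path, '', '', ''))
    return url_split_re.split(bare)[0]

# Hypothetical hrefs that all point at the same entry.
hrefs = ['/item/Python?fr=aladdin', '/item/Python#1', '/item/Python&x=1', '/item/Python']

seen = set()   # plays the role of tasks_pass
queued = []    # plays the role of the tasks queue
for href in hrefs:
    u = clean_url('http://baike.baidu.com' + href)
    if u not in seen:
        seen.add(u)
        queued.append(u)

print(queued)
```

Only one URL ends up in the queue, so the entry is fetched once however it was linked. Note that cutting at '+' also shortens any path that legitimately contains a plus sign, which is a trade-off of this approach.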

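BUG:

raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

The cause is that the request headers were not disguised: the requests were sent without browser-like headers (for example a User-Agent), and the server closed the connection.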
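
As a rough illustration of disguising the request headers, the sketch below builds a request that carries a browser-like User-Agent. It uses only the standard library and does not send anything over the network; the User-Agent string and the helper name build_request are assumptions made for this example, not values taken from the linked source.

```python
from urllib import parse, request

# Example browser-like headers; the exact User-Agent string is an assumption.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

def build_request(url):
    """Return a urllib Request for `url` that carries the headers above."""
    # Percent-encode non-ASCII path characters such as Chinese entry titles.
    safe_url = parse.quote(url, safe=":/?&=%")
    return request.Request(safe_url, headers=HEADERS)

req = build_request("http://baike.baidu.com/item/科学")
print(req.full_url)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

With the requests library used in the crawler, the same headers can be passed directly, for example rq.get(url, headers=HEADERS).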

Source: http://blog.csdn.net/u013424864/article/details/60778031

Reposted from: https://www.cnblogs.com/vorphan/p/7476431.html
