Repost: a multithreaded proxy-server scraper, store and validator written in Python

最新推荐文章于 2024-08-21 15:12:02 发布
popeyes_hsz
Article link: https://blog.csdn.net/popeyes_hsz/article/details/2247588
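I once wrote one of these in PHP, but because PHP has no multithreading, both fetching and validation were very slow
(libcurl can fetch pages in parallel, but that covers only the download step; post-processing the fetched data is still awkward).

So I decided to rewrite it in Python, which does support threads.
I had not used Python for more than a year and had forgotten much of the syntax and many language features. After three days of
spare-time tinkering, the program is finally ready to share.

The source code is below. I hope experienced readers will help me improve it.
This is only the second program I have written since learning Python, so there are surely parts that could be more concise; advice from experts is welcome.
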
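Current features:
   1. Automatically fetches proxy lists from 12 websites and saves them to a database
   2. Automatically checks whether each proxy works, and stores the response time measured during the check as an indicator of the proxy's speed
   3. Writes the proxies to separate files by category: checked, unchecked, high-anonymity, anonymous, and transparent
   4. Supported output formats: xml, htm, csv, txt, tab; fields and layout can be customized for each format
   5. Easy to extend: adding a new site to fetch only takes one change to a global variable and two new functions (the interface is documented in detail)
   6. Uses sqlite as the database: small, convenient, simple, zero configuration, zero installation; it fits in your back pocket
   7. Multithreaded fetching and multithreaded validation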
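
Feature 5 rests on a naming convention: site N is handled by a pair of functions, build_list_urls_N and parse_page_N. Below is a minimal sketch of how such a pair can be looked up by number. It is written for Python 3 rather than the Python 2.4 the program targets, and the site functions, the example.com URLs, the `collect` helper and the fake fetcher are all hypothetical stand-ins; nothing is downloaded.

```python
# Hypothetical site 1: the URL scheme and page format are invented for illustration.
def build_list_urls_1(page=2):
    """Return the absolute URLs of the list pages to fetch."""
    return ['http://www.example.com/list?page=%d' % i for i in range(1, page + 1)]


def parse_page_1(html=''):
    """Return [ip, port, type, area] rows found in one page of 'ip:port' lines."""
    rows = []
    for line in html.splitlines():
        ip, sep, port = line.strip().partition(':')
        if sep and port.isdigit():
            rows.append([ip, port, -1, '--'])   # -1: proxy type unknown
    return rows


def collect(index, fetch):
    """Run the build/parse pair of site `index`; `fetch` maps a URL to HTML text."""
    build = globals()['build_list_urls_%d' % index]
    parse = globals()['parse_page_%d' % index]
    proxies = []
    for url in build():
        proxies.extend(parse(fetch(url)))
    return proxies


# A fake fetcher stands in for the real HTTP download.
fake_pages = {
    'http://www.example.com/list?page=1': '10.0.0.1:8080\n10.0.0.2:3128\n',
    'http://www.example.com/list?page=2': 'not a proxy line\n10.0.0.3:80\n',
}
result = collect(1, lambda url: fake_pages.get(url, ''))
```

Looking the functions up in `globals()` keeps the "add two functions and bump a counter" workflow while avoiding `eval` on a constructed string.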

My environment: Windows XP + Python 2.4; other versions are untested.
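
The listing targets Python 2.4 (`urllib.FancyURLopener`, `print` statements), so it will not run unchanged on Python 3. As a rough sketch only, the validation step might look like this on Python 3, assuming `urllib.request` is used in its place; the helper names `build_opener_for` and `judge` are made up for this example, and the rule (marker string found and response faster than the timeout) mirrors the one described in the configuration comments.

```python
import time
import urllib.request


def build_opener_for(ip, port):
    """Return an opener that sends HTTP requests through the proxy ip:port."""
    handler = urllib.request.ProxyHandler({'http': 'http://%s:%s/' % (ip, port)})
    return urllib.request.build_opener(handler)


def judge(body, elapsed, marker, timeout):
    """Return 1 if the marker string came back within the timeout, else 0."""
    return 1 if (marker in body and elapsed < timeout) else 0


def check_one_proxy(ip, port, url, marker, timeout=30):
    """Fetch `url` through the proxy and return (active, seconds_used)."""
    opener = build_opener_for(ip, port)
    start = time.time()
    try:
        body = opener.open(url, timeout=timeout).read().decode('utf-8', 'replace')
    except Exception:
        body = ''          # any network error counts as a failed check
    elapsed = time.time() - start
    return judge(body, elapsed, marker, timeout), elapsed
```

Keeping the pass/fail rule in its own function makes it possible to exercise the decision without touching the network.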

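Download: click here (242 KB)
The code is commented in detail so that even Python beginners can follow it; the regular expressions that parse each of the 12 sites are all annotated.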
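
As a small illustration of the annotated-regex style used throughout the listing, here is a sketch of a `re.VERBOSE` pattern that pulls ip, port and type out of table rows. It is written for Python 3; the sample HTML, the `PROXY_ROW` pattern and the `parse_rows` helper are invented for the example and do not correspond to any of the 12 real sites.

```python
import re

# In VERBOSE mode, whitespace in the pattern is ignored and '#' starts a comment,
# so each capture group can be labelled on its own line.
PROXY_ROW = re.compile(r'''
    <td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*   # ip
    <td>(\d{2,5})</td>\s*                   # port
    <td>([^<]*)</td>                        # type label
    ''', re.VERBOSE)

# Same numeric codes as the program: 2 high-anonymity, 1 anonymous, 0 transparent, -1 unknown.
TYPE_CODES = {'high anonymity': 2, 'anonymous': 1, 'transparent': 0}


def parse_rows(html):
    """Return [ip, port, type] for every table row that matches the pattern."""
    return [[ip, port, TYPE_CODES.get(label.strip(), -1)]
            for ip, port, label in PROXY_ROW.findall(html)]


sample = '''
<tr><td>10.0.0.1</td> <td>8080</td> <td>anonymous</td></tr>
<tr><td>10.0.0.2</td>
    <td>3128</td>
    <td>elite</td></tr>
'''
rows = parse_rows(sample)
```

Because VERBOSE mode discards unescaped whitespace, a literal space inside such a pattern has to be written as `\s` or `\ `.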
# -*- coding: gb2312 -*-
# vi:ts=4:et

"""
The program currently fetches proxy lists from the following sites:

http://www.cybersyndrome.net/
http://www.pass-e.com/
http://www.cnproxy.com/
http://www.proxylists.net/
http://www.my-proxy.com/
http://www.samair.ru/proxy/
http://proxy4free.com/
http://proxylist.sakura.ne.jp/
http://www.ipfree.cn/
http://www.publicproxyservers.com/
http://www.digitalcybersoft.com/
http://www.checkedproxylists.com/

Q: How do I add a new site of my own and have the program fetch it automatically?
A:

Look at how the following functions are defined in the source. The number at the end of
each function name starts at 1 and increases by one per list; it has currently reached 13.

def build_list_urls_1(page=5):
def parse_page_1(html=''):

def build_list_urls_2(page=5):
def parse_page_2(html=''):

.......

def build_list_urls_13(page=5):
def parse_page_13(html=''):


All you have to do is add two functions, build_list_urls_14 and parse_page_14.
Suppose you want to fetch from www.somedomain.com the pages
    /somepath/showlist.asp?page=1
    ...  through
    /somepath/showlist.asp?page=8  (8 pages in total)

Then build_list_urls_14 should be defined as follows.
Set the default value of the page parameter to the number of pages to fetch (8),
so that all 8 pages are fetched.
def build_list_urls_14(page=8):
    .....
    return [        # a flat list; each element is the absolute URL of a page to fetch
        'http://www.somedomain.com/somepath/showlist.asp?page=1',
        'http://www.somedomain.com/somepath/showlist.asp?page=2',
        'http://www.somedomain.com/somepath/showlist.asp?page=3',
        ....
        'http://www.somedomain.com/somepath/showlist.asp?page=8'
    ]

Next, write a function parse_page_14(html='') that parses the HTML of the pages returned
by the function above and extracts the proxy addresses from it.
Note: this function is called in a loop, once for every page returned by
build_list_urls_14; the html argument is the HTML text of that page.

ip:   must be a numeric IP in xxx.xxx.xxx.xxx form, not a host name such as www.xxx.com
port: must be a number of 2 to 5 digits
type: must be one of the numbers 2, 1, 0, -1, which stand for the proxy type
      2: high-anonymity proxy  1: anonymous proxy  0: transparent proxy  -1: type unknown
area: country or region of the proxy; must be converted to UTF-8

def parse_page_14(html=''):
    ....
    return [
        [ip,port,type,area],
        [ip,port,type,area],
        .....
        ....
        [ip,port,type,area]
    ]

Finally, and most importantly: increase the global variable web_site_count by 1, i.e. web_site_count=14



Q: I followed the instructions above and added one custom site. How do I add another?
A: Now that you know how to add build_list_urls_14 and parse_page_14,

add the two functions
def build_list_urls_15(page=5):
def parse_page_15(html=''):

in the same way, and update the global variable to web_site_count=15

"""


import urllib,time,random,re,threading,string

web_site_count=13   # number of site lists to fetch
day_keep=2          # delete invalid proxies kept in the database for more than day_keep days
indebug=1

thread_num=100                   # start thread_num threads to check proxies
check_in_one_call=thread_num*25  # maximum number of proxies checked in one run


skip_check_in_hour=1    # do not re-check the same proxy within skip_check_in_hour hours
skip_get_in_hour=8      # minimum interval between two fetches of new proxies (hours)

proxy_array=[]          # proxies waiting to be added to the database
update_array=[]         # check results waiting to be written to the database

db=None                 # global database object
conn=None
dbfile='proxier.db'     # database file name

target_url="http://www.baidu.com/"   # URL requested through the proxy when checking it
target_string="030173"               # if the returned HTML contains this string
target_timeout=30                    # and the response took less than target_timeout seconds,
                                     # the proxy is considered valid



# File format for the exported proxy data; to disable the export, set output_type=''

output_type='xml'                   # one of the following, xml by default
                                    # xml
                                    # htm
                                    # tab         tab separated,   Excel compatible
                                    # csv         comma separated, Excel compatible
                                    # txt         xxx.xxx.xxx.xxx:xx format

# Output file names; make sure this list has exactly six elements
output_filename=[
            'uncheck',             # proxies that have not been checked yet
            'checkfail',           # proxies that were checked and marked invalid
            'ok_high_anon',        # valid high-anonymity proxies, sorted by speed, fastest first
            'ok_anonymous',        # valid anonymous proxies, sorted by speed, fastest first
            'ok_transparent',      # valid transparent proxies, sorted by speed, fastest first
            'ok_other'             # valid proxies of unknown type, sorted by speed
            ]


# Format of the output data. Supported columns:
# _ip_ , _port_ , _type_ , _status_ , _active_ ,
# _time_added_, _time_checked_ ,_time_used_ ,  _speed_, _area_

output_head_string=''             # string written at the top of the output file
output_format=''                  # format of each data record
output_foot_string=''             # string written at the bottom of the output file



if   output_type=='xml':
    output_head_string="<?xml version='1.0' encoding='gb2312'?><proxylist>\n"
    output_format="""<item>
            <ip>_ip_</ip>
            <port>_port_</port>
            <speed>_speed_</speed>
            <last_check>_time_checked_</last_check>
            <area>_area_</area>
        </item>
            """
    output_foot_string="</proxylist>"
elif output_type=='htm':
    output_head_string="""<table border=1 width='100%'>
        <tr><td>Proxy</td><td>Last checked</td><td>Speed</td><td>Area</td></tr>
        """
    output_format="""<tr>
    <td>_ip_:_port_</td><td>_time_checked_</td><td>_speed_</td><td>_area_</td>
    </tr>
    """
    output_foot_string="</table>"
else:
    output_head_string=''
    output_foot_string=''

if output_type=="csv":
    output_format="_ip_, _port_, _type_,  _speed_, _time_checked_,  _area_\n"

if output_type=="tab":
    output_format="_ip_\t_port_\t_speed_\t_time_checked_\t_area_\n"

if output_type=="txt":
    output_format="_ip_:_port_\n"


# Write the output files
def output_file():
    global output_filename,output_head_string,output_foot_string,output_type
    if output_type=='':
        return
    fnum=len(output_filename)
    content=[]
    for i in range(fnum):
        content.append([output_head_string])

    conn.execute("select * from `proxier` order by `active`,`type`,`speed` asc")
    rs=conn.fetchall()

    for item in rs:
        type,active=item[2],item[4]
        if   active is None:
            content[0].append(formatline(item))   # not checked yet
        elif active==0:
            content[1].append(formatline(item))   # invalid proxy
        elif active==1 and type==2:
            content[2].append(formatline(item))   # high anonymity
        elif active==1 and type==1:
            content[3].append(formatline(item))   # anonymous
        elif active==1 and type==0:
            content[4].append(formatline(item))   # transparent
        elif active==1 and type==-1:
            content[5].append(formatline(item))   # unknown type
        else:
            pass

    for i in range(fnum):
        content[i].append(output_foot_string)
        f=open(output_filename[i]+"."+output_type,'w')
        f.write(string.join(content[i],''))
        f.close()

# Format one record for output
def formatline(item):
    global output_format
    arr=['_ip_','_port_','_type_','_status_','_active_',
        '_time_added_','_time_checked_','_time_used_',
        '_speed_','_area_']
    s=output_format
    for i  in range(len(arr)):
        s=string.replace(s,arr[i],str(formatitem(item[i],i)))
    return s


# Per-column post-processing of database values: Chinese text is encoded, time columns are converted
def formatitem(value,colnum):
    global output_type
    if value is None:               # test for None first so that .encode() is never called on it
        value=''
    elif (colnum==9):
        value=value.encode('cp936')

    if colnum==5 or colnum==6 or colnum==7:      # time_xxx columns
        value=float(value or 0)                  # an empty value counts as "never"
        if value<1:
            value=''
        else:
            value=formattime(value)

    if value=='' and output_type=='htm':value='&nbsp;'   # keep empty table cells visible
    return value



def check_one_proxy(ip,port):
    global update_array
    global check_in_one_call
    global target_url,target_string,target_timeout

    url=target_url
    checkstr=target_string
    timeout=target_timeout
    ip=string.strip(ip)
    proxy=ip+':'+str(port)
    proxies = {'http': 'http://'+proxy+'/'}
    opener = urllib.FancyURLopener(proxies)
    opener.addheaders = [
        ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)')
        ]
    t1=time.time()

    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())

    try:
        f = opener.open(url)
        s= f.read()
        pos=s.find(checkstr)
    except:
        pos=-1                      # any error counts as a failed check
    t2=time.time()
    timeused=t2-t1
    if (timeused<timeout and pos>=0):   # find() returns -1 when the string is absent
        active=1
    else:
        active=0
    update_array.append([ip,port,active,timeused])
    print len(update_array),' of ',check_in_one_call," ",ip,':',port,'--',int(timeused)


def get_html(url=''):
    opener = urllib.FancyURLopener({})      # no proxy
    # www.my-proxy.com needs the Cookie below to be fetched correctly
    opener.addheaders = [
            ('User-agent','Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)'),
            ('Cookie','permission=1')
            ]
    if (url.find("?")==-1):
        url=url+'?rnd='+str(random.random())
    else:
        url=url+'&rnd='+str(random.random())
    try:
        f = opener.open(url)
        return f.read()
    except:
        return ''                   # a failed download yields an empty page




################################################################################
#
##        by Go_Rush(阿舜) from http://ashun.cnblogs.com/
#
################################################################################


def build_list_urls_1(page=5):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://proxy4free.com/page%(num)01d.html'%{'num':i})
    return ret

def parse_page_1(html=''):
    matches=re.findall(r'''
            <td>([\d\.]+)<\/td>[\s\n\r]*   #ip
            <td>([\d]+)<\/td>[\s\n\r]*     #port
            <td>([^\<]*)<\/td>[\s\n\r]*    #type
            <td>([^\<]*)<\/td>             #area
            ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='anonymous'):
            type=1
        elif (type=='high anonymity'):
            type=2
        elif (type=='transparent'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '1',ip,port,type,area
    return ret



def build_list_urls_2(page=1):
    return ['http://www.digitalcybersoft.com/ProxyList/fresh-proxy-list.shtml']

def parse_page_2(html=''):
    # The space in "Elite Proxy" is written as \s because re.VERBOSE ignores literal spaces
    matches=re.findall(r'''
        ((?:[\d]{1,3}\.){3}[\d]{1,3})\:([\d]+)      #ip:port
        \s+(Anonymous|Elite\sProxy)[+\s]+           #type
        (.+)\r?\n                                   #area
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='Anonymous'):
            type=1
        else:
            type=2
        ret.append([ip,port,type,area])
        if indebug:print '2',ip,port,type,area
    return ret




def build_list_urls_3(page=15):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.samair.ru/proxy/proxy-%(num)02d.htm'%{'num':i})
    return ret

def parse_page_3(html=''):
    matches=re.findall(r'''
        <tr><td><span\sclass\="\w+">(\d{1,3})<\/span>\. #ip(part1)
        <span\sclass\="\w+">
        (\d{1,3})<\/span>                               #ip(part2)
        (\.\d{1,3}\.\d{1,3})                            #ip(part3,part4)

        \:\r?\n(\d{2,5})<\/td>                          #port
        <td>([^<]+)<\/td>                               #type
        <td>[^<]+<\/td>
        <td>([^<]+)<\/td>                               #area
        <\/tr>''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]+"."+match[1]+match[2]
        port=match[3]
        type=match[4]
        area=match[5]
        if (type=='anonymous proxy server'):
            type=1
        elif (type=='high-anonymous proxy server'):
            type=2
        elif (type=='transparent proxy'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '3',ip,port,type,area
    return ret




def build_list_urls_4(page=3):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.pass-e.com/proxy/index.php?page=%(n)01d'%{'n':i})
    return ret

def parse_page_4(html=''):
    matches=re.findall(r"""
        list
        \('(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'        #ip
        \,'(\d{2,5})'                                   #port
        \,'(\d)'                                        #type
        \,'([^']+)'\)                                   #area
        \;\r?\n""",html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        area=unicode(area, 'cp936')
        area=area.encode('utf8')
        if (type=='1'):      # for the meaning of type, see the javascript in the fetched page
            type=1
        elif (type=='3'):
            type=2
        elif (type=='2'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '4',ip,port,type,area
    return ret




def build_list_urls_5(page=12):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.ipfree.cn/index2.asp?page=%(num)01d'%{'num':i})
    return ret

def parse_page_5(html=''):
    matches=re.findall(r"<font color=black>([^<]*)</font>",html)
    ret=[]
    # the cells come in groups of three: area, ip, port
    for index, match in enumerate(matches):
        if (index%3==0 and index+2<len(matches)):   # skip an incomplete trailing group
            ip=matches[index+1]
            port=matches[index+2]
            type=-1      # this site does not state the proxy type
            area=unicode(match, 'cp936')
            area=area.encode('utf8')
            if indebug:print '5',ip,port,type,area
            ret.append([ip,port,type,area])
        else:
            continue
    return ret



def build_list_urls_6(page=3):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.cnproxy.com/proxy%(num)01d.html'%{'num':i})
    return ret

def parse_page_6(html=''):
    matches=re.findall(r'''<tr>
        <td>([^&]+)                     #ip
        &#8204‍
        \:([^<]+)                       #port
        </td>
        <td>HTTP</td>
        <td>[^<]+</td>
        <td>([^<]+)</td>                #area
        </tr>''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=-1          # this site does not state the proxy type
        area=match[2]
        area=unicode(area, 'cp936')
        area=area.encode('utf8')
        ret.append([ip,port,type,area])
        if indebug:print '6',ip,port,type,area
    return ret






def build_list_urls_7(page=1):
    return ['http://www.proxylists.net/http_highanon.txt']

def parse_page_7(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=2
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '7',ip,port,type,area
    return ret







def build_list_urls_8(page=1):
    return ['http://www.proxylists.net/http.txt']

def parse_page_8(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=-1
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '8',ip,port,type,area
    return ret





def build_list_urls_9(page=6):
    page=page+1
    ret=[]
    for i in range(0,page):
        ret.append('http://proxylist.sakura.ne.jp/index.htm?pages=%(n)01d'%{'n':i})
    return ret

def parse_page_9(html=''):
    matches=re.findall(r'''
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})        #ip
        \:(\d{2,5})                                 #port
        <\/TD>[\s\r\n]*
        <TD>([^<]+)</TD>                            #area
        [\s\r\n]*
        <TD>([^<]+)</TD>                            #type
    ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[3]
        area=match[2]
        if (type=='Anonymous'):
            type=1
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '9',ip,port,type,area
    return ret


def build_list_urls_10(page=5):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.publicproxyservers.com/page%(n)01d.html'%{'n':i})
    return ret

def parse_page_10(html=''):
    matches=re.findall(r'''
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})    #ip
        <\/td>[\s\r\n]*
        <td[^>]+>(\d{2,5})<\/td>                #port
        [\s\r\n]*
        <td>([^<]+)<\/td>                       #type
        [\s\r\n]*
        <td>([^<]+)<\/td>                       #area
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[0]
        port=match[1]
        type=match[2]
        area=match[3]
        if (type=='high anonymity'):
            type=2
        elif (type=='anonymous'):
            type=1
        elif (type=='transparent'):
            type=0
        else:
            type=-1
        ret.append([ip,port,type,area])
        if indebug:print '10',ip,port,type,area
    return ret




def build_list_urls_11(page=10):
    page=page+1
    ret=[]
    for i in range(1,page):
        ret.append('http://www.my-proxy.com/list/proxy.php?list=%(n)01d'%{'n':i})

    ret.append('http://www.my-proxy.com/list/proxy.php?list=s1')
    ret.append('http://www.my-proxy.com/list/proxy.php?list=s2')
    ret.append('http://www.my-proxy.com/list/proxy.php?list=s3')
    return ret

def parse_page_11(html=''):
    matches=re.findall(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\:(\d{2,5})',html)
    ret=[]

    # the proxy type applies to the whole page and is read from its "Level" label
    if (html.find('(Level 1)')>0):
        type=2
    elif (html.find('(Level 2)')>0):
        type=1
    elif (html.find('(Level 3)')>0):
        type=0
    else:
        type=-1

    for match in matches:
        ip=match[0]
        port=match[1]
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '11',ip,port,type,area
    return ret




def build_list_urls_12(page=4):
    ret=[]
    ret.append('http://www.cybersyndrome.net/plr4.html')
    ret.append('http://www.cybersyndrome.net/pla4.html')
    ret.append('http://www.cybersyndrome.net/pld4.html')
    ret.append('http://www.cybersyndrome.net/pls4.html')
    return ret

def parse_page_12(html=''):
    matches=re.findall(r'''
        onMouseOver\=
        "s\(\'(\w\w)\'\)"                           #area
        \sonMouseOut\="d\(\)"\s?c?l?a?s?s?\=?"?
        (\w?)                                       #type
        "?>
        (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})        #ip
        \:(\d{2,5})                                 #port
        ''',html,re.VERBOSE)
    ret=[]
    for match in matches:
        ip=match[2]
        port=match[3]
        area=match[0]
        type=match[1]
        if (type=='A'):
            type=2
        elif (type=='B'):
            type=1
        else:
            type=0
        ret.append([ip,port,type,area])
        if indebug:print '12',ip,port,type,area
    return ret



def build_list_urls_13(page=3):
    url='http://www.checkedproxylists.com/'
    html=get_html(url)
    matchs=re.findall(r"""
        href\='([^']+)'>(?:high_anonymous|anonymous|transparent)
        \sproxy\slist<\/a>""",html,re.VERBOSE)
    return map(lambda x: url+x, matchs)

def parse_page_13(html=''):
    html_matches=re.findall(r"eval\(unescape\('([^']+)'\)",html)
    content=''                  # stays empty when the page has no escaped block
    if (len(html_matches)>0):
        content=urllib.unquote(html_matches[0])
    matches=re.findall(r"""<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})<\/td>
            <td>(\d{2,5})<\/td><\/tr>""",content,re.VERBOSE)
    ret=[]
    if   (html.find('<title>Checked Proxy Lists - proxylist_high_anonymous_')>0):
        type=2
    elif (html.find('<title>Checked Proxy Lists - proxylist_anonymous_')>0):
        type=1
    elif (html.find('<title>Checked Proxy Lists - proxylist_transparent_')>0):
        type=0
    else:
        type=-1

    for match in matches:
        ip=match[0]
        port=match[1]
        area='--'
        ret.append([ip,port,type,area])
        if indebug:print '13',ip,port,type,area
    return ret




# Thread class

class TEST(threading.Thread):
    def __init__(self,action,index=None,checklist=None):
        threading.Thread.__init__(self)
        self.index =index
        self.action=action
        self.checklist=checklist

    def run(self):
        if (self.action=='getproxy'):
            get_proxy_one_website(self.index)
        else:
            check_proxy(self.index,self.checklist)


def check_proxy(index,checklist=[]):
    for item in checklist:
        check_one_proxy(item[0],item[1])


def patch_check_proxy(threadCount,action=''):
    global check_in_one_call,skip_check_in_hour,conn
    threads=[]
    if   (action=='checknew'):        # check newly added proxies that were never checked
        orderby=' `time_added` desc '
        strwhere=' `active` is null '
    elif (action=='checkok'):         # re-check proxies that passed an earlier check
        orderby=' `time_checked` asc '
        strwhere=' `active`=1 '
    elif (action=='checkfail'):       # re-check proxies that failed an earlier check
        orderby=' `time_checked` asc '
        strwhere=' `active`=0 '
    else:                            # check everything
        orderby=' `time_checked` asc '
        strwhere=' 1=1 '
    sql="""
           select `ip`,`port` FROM `proxier` where
                 `time_checked` < (unix_timestamp()-%(skip_time)01s) 
                 and %(strwhere)01s 
                 order by %(order)01s 
                 limit %(num)01d
        """%{     'num':check_in_one_call,
             'strwhere':strwhere,
                'order':orderby,
            'skip_time':skip_check_in_hour*3600}
    conn.execute(sql)
    rows = conn.fetchall()

    check_in_one_call=len(rows)

    # work out how many proxies each thread will check
    if len(rows)>=threadCount:
        num_in_one_thread=len(rows)/threadCount
    else:
        num_in_one_thread=1

    threadCount=threadCount+1
    print "Now checking the following proxy servers....."
    for index in range(1,threadCount):
        # give each thread its checklist; the last thread also takes the remainder
        checklist=rows[(index-1)*num_in_one_thread:index*num_in_one_thread]
        if (index+1==threadCount):
            checklist=rows[(index-1)*num_in_one_thread:]

        t=TEST(action,index,checklist)
        t.setDaemon(True)
        t.start()
        threads.append((t))
    for thread in threads:
        thread.join(60)
    update_proxies()            # write all check results to the database


def get_proxy_one_website(index):
    global proxy_array
    # look up the build/parse pair for this site by name
    build_func=globals()['build_list_urls_'+str(index)]
    parse_func=globals()['parse_page_'+str(index)]
    urls=build_func()
    for url in urls:
        html=get_html(url)
        print url
        proxylist=parse_func(html)
        for proxy in proxylist:
            ip=string.strip(proxy[0])
            port=string.strip(proxy[1])
            if (re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$").search(ip)):
                type=str(proxy[2])
                area=string.strip(proxy[3])
                proxy_array.append([ip,port,type,area])


def get_all_proxies():
    global web_site_count,conn,skip_get_in_hour

    # find out when proxies were last added, to avoid fetching again too soon
    rs=conn.execute("select max(`time_added`) from `proxier` limit 1")
    last_add=rs.fetchone()[0]
    if (last_add and my_unix_timestamp()-last_add<skip_get_in_hour*3600):
        print """
 Skipping the proxy-list fetch!
 The most recent fetch was at: %(t)1s
 That is less than the minimum fetch interval of %(n)1d hours ago.
 To fetch now anyway, change the global variable skip_get_in_hour.
            """%{'t':formattime(last_add),'n':skip_get_in_hour}
        return

    print "Now fetching proxy lists from the following "+str(web_site_count)+" sites...."
    threads=[]
    count=web_site_count+1
    for index in range(1,count):
        t=TEST('getproxy',index)
        t.setDaemon(True)
        t.start()
        threads.append((t))
    for thread in threads:
        thread.join(60)
    add_proxies_to_db()

def add_proxies_to_db():
    global proxy_array
    count=len(proxy_array)
    for i in range(count):
        item=proxy_array[i]
        # the statement is built on one logical line so no stray newline ends up inside the ip value
        sql=("insert into `proxier` (`ip`,`port`,`type`,`time_added`,`area`) values('"
            +item[0]+"',"+item[1]+","+item[2]+",unix_timestamp(),'"+clean_string(item[3])+"')")
        try:
            conn.execute(sql)
            print "%(num)2.1f%%\t"%{'num':100.0*(i+1)/count},item[0],":",item[1]
        except:
            pass        # e.g. the ip is already in the table (primary key)


def update_proxies():
    global update_array
    for item in update_array:
        sql='''
             update `proxier` set `time_checked`=unix_timestamp(), 
                `active`=%(active)01d, 
                 `speed`=%(speed)02.3f                 
                 where `ip`='%(ip)01s' and `port`=%(port)01d                            
            '''%{'active':item[2],'speed':item[3],'ip':item[0],'port':item[1]}
        try:
            conn.execute(sql)
        except:
            pass

# sqlite has no unix_timestamp function, so we provide our own
def my_unix_timestamp():
    return int(time.time())

def clean_string(s):
    tmp=re.sub(r"['\,\s]", ' ', s)
    return re.sub(r"\s+", ' ', tmp)

def formattime(t):
    return time.strftime('%c',time.gmtime(t+8*3600))    # shown as UTC+8


def open_database():
    global db,conn,day_keep,dbfile

    try:
        from pysqlite2 import dbapi2 as sqlite
    except ImportError:
        print """
        This program stores its data in sqlite and needs the pysqlite module.
        Download pysqlite (272kb) from:
        http://initd.org/tracker/pysqlite/wiki/pysqlite#Downloads
        and pick "Windows binaries for Python 2.x"
        """
        raise SystemExit

    try:
        db = sqlite.connect(dbfile,isolation_level=None)
        # make unix_timestamp() callable from SQL
        db.create_function("unix_timestamp", 0, my_unix_timestamp)
        conn  = db.cursor()
    except:
        print "Could not open the sqlite database; make sure the script's directory is writable"
        raise SystemExit

    sql="""
        /* store the database as utf-8; this only takes effect before the first table is created */
        PRAGMA encoding = "utf-8";

       /* ip:     plain IP address only (xxx.xxx.xxx.xxx) */
       /* type:   proxy type  2: high anonymity  1: anonymous  0: transparent  -1: unknown */
       /* status: not used yet; reserved for future extensions */
       /* active: whether the proxy works  1: yes  0: no */
       /* speed:  response time; the smaller the value, the faster the proxy */

        CREATE TABLE IF NOT EXISTS  `proxier` (
          `ip` varchar(15) NOT NULL default '',
          `port` int(6)  NOT NULL default '0',
          `type` int(11) NOT NULL default '-1',
          `status` int(11) default '0',
          `active` int(11) default NULL,
          `time_added` int(11)  NOT NULL default '0',
          `time_checked` int(11) default '0',
          `time_used` int(11)  default '0',
          `speed` float default NULL,
          `area` varchar(120) default '--',      /*  location of the proxy server */
          PRIMARY KEY (`ip`)
        );
        /*
        CREATE INDEX IF NOT EXISTS `type`        ON proxier(`type`);
        CREATE INDEX IF NOT EXISTS `time_used`   ON proxier(`time_used`);
        CREATE INDEX IF NOT EXISTS `speed`       ON proxier(`speed`);
        CREATE INDEX IF NOT EXISTS `active`      ON proxier(`active`);
        */
    """
    conn.executescript(sql)
    # drop dead proxies that were added more than day_keep days ago
    conn.execute("""DELETE FROM `proxier`
                        where `time_added`< (unix_timestamp()-?)
                        and `active`=0""",(day_keep*86400,))

    # print a short summary of what the database currently holds
    conn.execute("select count(`ip`) from `proxier`")
    m1=conn.fetchone()[0]
    if m1 is None:return

    conn.execute("""select count(`time_checked`)
                        from `proxier` where `time_checked`>0""")
    m2=conn.fetchone()[0]

    if m2==0:
        m3,m4,m5=0,"not checked yet","not checked yet"
    else:
        conn.execute("select count(`active`) from `proxier` where `active`=1")
        m3=conn.fetchone()[0]
        conn.execute("""select max(`time_checked`), min(`time_checked`)
                             from `proxier` where `time_checked`>0 limit 1""")
        rs=conn.fetchone()
        m4,m5=rs[0],rs[1]
        m4=formattime(m4)
        m5=formattime(m5)
    print """
    %(m1)1d proxies in total; %(m2)1d have been checked and %(m3)1d of those are working.
            Most recent check: %(m4)1s
            Oldest check: %(m5)1s
    Tip: proxies last checked more than 24 hours ago should be checked again
    """%{'m1':m1,'m2':m2,'m3':m3,'m4':m4,'m5':m5}



def close_database():
    global db,conn
    conn.close()
    db.close()
    conn=None
    db=None

if __name__ == '__main__':
    open_database()
    get_all_proxies()
    patch_check_proxy(thread_num)
    output_file()
    close_database()
    print "All done"
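
The listing registers its own unix_timestamp() function with sqlite and builds the insert statement by joining strings. Here is a minimal sketch of the same idea with bound parameters instead, so a quote in the area field cannot break the statement. It is not the author's code: it assumes Python 3 and the standard-library sqlite3 module rather than pysqlite2 on Python 2.4, uses an in-memory database, and keeps only a few columns of the proxier table. The names open_db and add_proxy are made up for this illustration.

```python
import sqlite3
import time


def unix_timestamp():
    """Current time as whole seconds since the epoch."""
    return int(time.time())


def open_db(path=":memory:"):
    # isolation_level=None gives autocommit, as in the listing
    db = sqlite3.connect(path, isolation_level=None)
    # register a zero-argument SQL function named unix_timestamp()
    db.create_function("unix_timestamp", 0, unix_timestamp)
    # simplified table: only some of the columns from the listing
    db.execute(
        """CREATE TABLE IF NOT EXISTS proxier (
               ip TEXT PRIMARY KEY,
               port INTEGER NOT NULL,
               type INTEGER NOT NULL DEFAULT -1,
               time_added INTEGER NOT NULL DEFAULT 0,
               area TEXT DEFAULT '--')"""
    )
    return db


def add_proxy(db, ip, port, proxy_type, area):
    """Insert one proxy; return False if the ip is already stored."""
    try:
        db.execute(
            "INSERT INTO proxier (ip, port, type, time_added, area) "
            "VALUES (?, ?, ?, unix_timestamp(), ?)",
            (ip, port, proxy_type, area),
        )
        return True
    except sqlite3.IntegrityError:
        # the primary key on ip rejects duplicates
        return False


db = open_db()
first = add_proxy(db, "10.0.0.1", 8080, 2, "O'Brien's lab")
second = add_proxy(db, "10.0.0.1", 3128, 1, "duplicate ip")
rows = db.execute("SELECT ip, port, area, time_added FROM proxier").fetchall()
```

Catching only sqlite3.IntegrityError, rather than a bare except, means a duplicate ip is skipped quietly while any other database error still surfaces.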
Reposted from http://www.cnblogs.com/ashun/archive/2007/06/01/python_proxy_checker.html
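
The fetch step in the listing starts one daemon thread per site and then joins each with a 60 second timeout, so one slow site cannot block the run forever. The sketch below shows that pattern on its own, under some assumptions: it is written for Python 3, it uses a plain threading.Thread in place of the listing's TEST class, and fetch_site is a hypothetical stand-in that only produces a made-up address instead of downloading anything.

```python
import threading

results = []
results_lock = threading.Lock()


def fetch_site(index):
    # Hypothetical stand-in for fetching and parsing one site's proxy list.
    found = [("10.0.%d.1" % index, 8080)]
    with results_lock:  # several threads append to the shared list
        results.extend(found)


def fetch_all(site_count, timeout=60):
    threads = []
    for index in range(1, site_count + 1):
        t = threading.Thread(target=fetch_site, args=(index,))
        t.daemon = True  # a stuck fetch must not keep the process alive
        t.start()
        threads.append(t)
    for t in threads:
        t.join(timeout)  # wait up to `timeout` seconds for this thread
    # any thread still alive at this point has timed out
    return [t for t in threads if t.is_alive()]


stuck = fetch_all(3)
```

Note that join(timeout) returns silently when the time is up, so checking is_alive() afterwards is the only way to learn which fetches did not finish.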