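Version: Python 3.6
Libraries: atexit, re, threading, time, urllib3, bs4
Amazon has anti-scraping measures, so the request headers must carry at least one field. This example sends a User-Agent, but requests still fail fairly often and have to be retried.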
# -*- coding: utf-8 -*-
# created by Zhang Q.L. on 2018/5/7 0007

from atexit import register
from re import compile
from threading import Thread
from time import ctime

import urllib3
import bs4

header = {
    'User-Agent': 'AppleWebKit/537.36 (KHTML, like Gecko)'
}
headerSample = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}

REGEX = compile(r'#([\d,]+) in Books')  # raw string so that \d is not treated as a string escape
url = 'https://item.jd.com/7081550.html'
urltest = 'https://www.amazon.com//dp/'
urltest2 = 'https://www.amazon.com//dp/0132269937'
ISBNs = {
    '0132269937': 'Core Python Programming',
    '0132356139': 'Python Web Development with Django',
    '0137143419': 'Python Fundamentals',
}


def httpget(isbn):
    http = urllib3.PoolManager()   # create a PoolManager instance first
    urllib3.disable_warnings()     # suppress warnings about unverified HTTPS certificates
    # page = http.request('GET', '%s' % urltest2, headers=header)
    page = http.request('GET', '%s%s' % (urltest, isbn), headers=header)  # send the GET request
    print(page.status)             # HTTP status code returned by the server
    # print(page.data)             # raw response body (bytes)
    # print(page.data.decode())    # decode using the default 'utf-8' encoding
    res = bs4.BeautifulSoup(page.data, 'lxml')  # parse the page with the lxml parser
    res = str(res)
    # print(res)
    return REGEX.findall(res)[0]   # raises IndexError if the page contains no ranking


def _showRanking(isbn):
    print('- %r ranked %s' % (ISBNs[isbn], httpget(isbn)))


def _main():
    print('At', ctime(), 'on Amazon...')
    for isbn in ISBNs:
        # one thread per book, so the lookups run concurrently
        Thread(target=_showRanking, args=(isbn,)).start()


@register
def _atexit():
    # runs at interpreter exit, i.e. after all threads have finished
    print('all DONE at:', ctime())


if __name__ == '__main__':
    _main()
Output:
D:\装机软件\python3.6\python3.exe C:/Users/Administrator/PycharmProjects/Python核心编程/多线程编程/amazon-nothread.py
At Tue May 8 15:10:44 2018 on Amazon...
200
200
200
- 'Python Fundamentals' ranked 4,517,952
- 'Python Web Development with Django' ranked 1,243,459
- 'Core Python Programming' ranked 674,874
all DONE at: Tue May 8 15:10:50 2018

Process finished with exit code 0
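Compared with the version that does not use threads, there are two main differences:
1. Because the requests are handled concurrently, the total run time is shorter;
2. With threads, the results are printed in the order in which the threads finish, whereas the single-threaded version prints them in the order of iteration over the variable, that is, the order of the keys in the ISBNs dictionary.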
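Both differences can be reproduced without touching the network. The sketch below is an illustration under stated assumptions rather than a measurement: `DELAYS`, `lookup`, `run_sequential` and `run_threaded` are hypothetical names, and `time.sleep` with made-up durations stands in for the HTTP round trip.

```python
import time
from threading import Thread

# Simulated per-book lookup times in seconds (made-up values, not measurements).
DELAYS = {'A': 0.3, 'B': 0.1, 'C': 0.2}


def lookup(name, finished):
    time.sleep(DELAYS[name])   # stands in for the network round trip
    finished.append(name)      # record the completion order


def run_sequential():
    finished = []
    start = time.time()
    for name in DELAYS:        # one lookup after another, in key order
        lookup(name, finished)
    return finished, time.time() - start


def run_threaded():
    finished = []
    start = time.time()
    threads = [Thread(target=lookup, args=(name, finished)) for name in DELAYS]
    for t in threads:
        t.start()
    for t in threads:          # wait for every lookup to complete
        t.join()
    return finished, time.time() - start


if __name__ == '__main__':
    print('sequential:', run_sequential())
    print('threaded:  ', run_threaded())
```

The sequential run takes roughly the sum of the delays and reports the names in key order, while the threaded run takes roughly as long as the slowest lookup and reports the names in the order they finished. Unlike the script above, this sketch joins the threads explicitly so that the elapsed time can be measured.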