[Multithreading] Amazon Book Ranking Lookup

weixin_33819479

Article tags: crawler, python

Original link: http://www.cnblogs.com/auqarius/p/9008408.html

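Version: Python 3.6

Libraries: atexit, re, threading, time, urllib3, bs4

Amazon has anti-crawler measures, so the request headers must contain at least one field. This example adds a User-Agent, but the request still fails fairly often and has to be retried.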
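
Since the request has to be retried when Amazon rejects it, a small retry helper can make that explicit. The following is only a minimal sketch, not part of the original script: `retry_call` and `flaky_fetch` are hypothetical names, and `flaky_fetch` simulates a blocked request instead of touching the network.

```python
import time


def retry_call(func, attempts=3, delay=0.0):
    """Call func() until it succeeds or the attempts run out.

    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # deliberately broad for this sketch
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # wait before the next try
    raise last_error


# Hypothetical stand-in for a flaky request: fails twice, then succeeds.
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('blocked by anti-crawler check')
    return '#674,874 in Books'


result = retry_call(flaky_fetch, attempts=5)
print(result, 'after', calls['count'], 'attempts')
```

In the script below, the same idea could be applied as `retry_call(lambda: httpget(isbn), attempts=5, delay=1.0)`, because `httpget` raises `IndexError` when the ranking text is missing from the page.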

# _*_coding:utf-8_*_
# created by Zhang Q.L. on 2018/5/7 0007
from atexit import register
from re import compile
from threading import Thread
from time import ctime
import urllib3
import bs4

header = {
    'User-Agent': 'AppleWebKit/537.36 (KHTML, like Gecko)'
}
headerSample = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
REGEX = compile(r'#([\d,]+) in Books')  # raw string, so \d is not treated as a string escape
url = 'https://item.jd.com/7081550.html'
urltest = 'https://www.amazon.com//dp/'
urltest2 = 'https://www.amazon.com//dp/0132269937'
ISBNs = {
    '0132269937': 'Core Python Programming',
    '0132356139': 'Python Web Development with Django',
    '0137143419': 'Python Fundamentals',
}


def httpget(isbn):
    http = urllib3.PoolManager()  # first create a PoolManager instance
    urllib3.disable_warnings()  # suppress warnings about unverified HTTPS certificates
    # page = http.request('GET', '%s' % urltest2, headers=header)  # send the GET request
    page = http.request('GET', '%s%s' % (urltest, isbn), headers=header)  # send the GET request
    print(page.status)  # status code returned by the server
    # print(page.data)  # data returned by the server, as raw bytes
    # print(page.data.decode())  # decode using the default 'utf-8' encoding
    res = bs4.BeautifulSoup(page.data, 'lxml')  # parse with the lxml parser
    res = str(res)
    # print(res)
    return REGEX.findall(res)[0]  # raises IndexError if the ranking is not in the page


def _showRanking(isbn):
    print('- %r ranked %s' % (ISBNs[isbn], httpget(isbn)))


def _main():
    print('At', ctime(), 'on Amazon...')
    for isbn in ISBNs:
        # one thread per ISBN, so the requests run concurrently
        Thread(target=_showRanking, args=(isbn,)).start()


@register
def _atexit():
    # runs at interpreter exit, after all non-daemon threads have finished
    print('all DONE at:', ctime())


if __name__ == '__main__':
    _main()
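
The script takes the first match of the pattern `#([\d,]+) in Books` and fails with an `IndexError` when Amazon serves a page without the ranking (for example a robot check). A more defensive extraction might look like the sketch below. `extract_rank` is a hypothetical helper, and both sample strings are made up for illustration, not real Amazon responses.

```python
import re

# Same pattern as the script, written as a raw string.
RANK_RE = re.compile(r'#([\d,]+) in Books')


def extract_rank(html):
    """Return the ranking string such as '674,874', or None if it is absent."""
    match = RANK_RE.search(html)
    return match.group(1) if match else None


# Made-up page fragments used only to exercise the helper.
sample_ok = '<li>Best Sellers Rank: #674,874 in Books</li>'
sample_blocked = '<html><body>Robot Check</body></html>'

print(extract_rank(sample_ok))       # 674,874
print(extract_rank(sample_blocked))  # None
```

Returning `None` lets the caller decide whether to retry or to report the book as unranked, instead of letting the thread die with a traceback.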

Output:

D:\装机软件\python3.6\python3.exe C:/Users/Administrator/PycharmProjects/Python核心编程/多线程编程/amazon-nothread.py
At Tue May  8 15:10:44 2018 on Amazon...
200
200
200
- 'Python Fundamentals' ranked 4,517,952
- 'Python Web Development with Django' ranked 1,243,459
- 'Core Python Programming' ranked 674,874
all DONE at: Tue May  8 15:10:50 2018

Process finished with exit code 0

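Compared with the version that does not use threads, there are two main differences:

1. Because the requests are handled concurrently, the total running time is shorter;

2. With threads, the results are printed in the order in which the requests complete, whereas the single-threaded version prints them in the order in which the variable is iterated, that is, the order of the dictionary keys.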
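
Both differences can be reproduced without touching Amazon. The sketch below is an illustration under stated assumptions: `fake_fetch` replaces the network request with `time.sleep`, and the names and delays in `DELAYS` are invented so that the completion order differs from the key order.

```python
import threading
import time

# Invented per-item delays (seconds) standing in for network latency.
DELAYS = {'A': 0.3, 'B': 0.1, 'C': 0.2}


def fake_fetch(name, out, lock):
    """Simulate a slow request, then record that it finished."""
    time.sleep(DELAYS[name])
    with lock:
        out.append(name)


def run_sequential():
    out, lock = [], threading.Lock()
    start = time.perf_counter()
    for name in DELAYS:  # key order of the dictionary
        fake_fetch(name, out, lock)
    return out, time.perf_counter() - start


def run_threaded():
    out, lock = [], threading.Lock()
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(name, out, lock))
               for name in DELAYS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every simulated request
    return out, time.perf_counter() - start


seq_order, seq_time = run_sequential()
thr_order, thr_time = run_threaded()
print('sequential:', seq_order, round(seq_time, 2))
print('threaded:  ', thr_order, round(thr_time, 2))
```

The sequential run takes roughly the sum of the delays and keeps the key order, while the threaded run takes roughly the longest single delay and records items as they finish. Note that dictionary insertion order is guaranteed only from Python 3.7 onward; in CPython 3.6 it is an implementation detail.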

Reposted from: https://www.cnblogs.com/auqarius/p/9008408.html