Example 1: Crawling a JD.com product page
URL used in the lecture notes: https://item.jd.com/2967929.html
URL used here: https://item.jd.com/46106440551.html
Does crawling this page require logging in?
>>> import requests
>>> r = requests.get("https://item.jd.com/46106440551.html")
>>> r.status_code
200
>>> r.encoding
'UTF-8'
>>> r.text[:1000]
"<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F46106440551.html'</script>"
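Full code (the output is the same as above):
import requests

url = "https://item.jd.com/46106440551.html"
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding  # use the encoding guessed from the body
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")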
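A status code of 200 does not by itself mean the product page was returned: the body shown above is a one-line script that redirects to the login page. The following is a minimal sketch, using only the standard library, of how such a response could be recognized. The function name find_login_redirect and the regular expression are illustrative assumptions, not part of requests or of JD's site.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches a body such as <script>window.location.href='...'</script>
REDIRECT_RE = re.compile(r"window\.location\.href\s*=\s*'([^']+)'")

def find_login_redirect(html):
    """Return (login_url, return_url) if html is a JavaScript redirect, else None."""
    match = REDIRECT_RE.search(html)
    if match is None:
        return None
    login_url = match.group(1)
    # parse_qs decodes the percent-encoded ReturnUrl parameter
    query = parse_qs(urlparse(login_url).query)
    return_url = query.get("ReturnUrl", [None])[0]
    return login_url, return_url

# The response body recorded in the interactive session above
body = ("<script>window.location.href='https://passport.jd.com/uc/login"
        "?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F46106440551.html'</script>")
result = find_login_redirect(body)
print(result)
```

If the function returns a tuple, the request was bounced to the login page and r.text does not contain the product data.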
Example 2: Crawling an Amazon product page
https://www.amazon.cn/gp/product/B01M8L5Z3Y
>>> r =requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
>>> r.status_code
503
>>> r.encoding
'ISO-8859-1'
>>> r.encoding = r.apparent_encoding
>>> r.text
'<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\n<!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\n<!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<title dir="ltr">Amazon CAPTCHA</title>\n<meta name="viewport" content="width=device-width">\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\n<script>\n\nif (true === true) {\n var ue_t0 = (+ new Date()),\n ue_csm = window,\n ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },\n ue_furl = "fls-cn.amazon.cn",\n ue_mid = "AAHKV2X7AFYLW",\n ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\n ue_sn = "opfcaptcha.amazon.cn",\n ue_id = \'E7YJGGJ3D0XX4DAM3FFV\';\n}\n</script>\n</head>\n<body>\n\n<!--\n To discuss automated access to Amazon data please contact api-services-support@amazon.com.\n For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.\n-->\n\n<!--\nCorreios.DoNotSend\n-->\n\n<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">\n\n <div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">\n\n <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>\n\n <div class="a-box a-alert a-alert-info a-spacing-base">\n <div class="a-box-inner">\n <i class="a-icon a-icon-alert"></i>\n <h4>请输入您在下方看到的字符</h4>\n 
<p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>\n </div>\n </div>\n\n <div class="a-section">\n\n <div class="a-box a-color-offset-background">\n <div class="a-box-inner a-padding-extra-large">\n\n <form method="get" action="/errors/validateCaptcha" name="">\n <input type=hidden name="amzn" value="xlyeQnq5dAW5IOMzogbbFQ==" /><input type=hidden name="amzn-r" value="/gp/product/B01M8L5Z3Y" />\n <div class="a-row a-spacing-large">\n <div class="a-box">\n <div class="a-box-inner">\n <h4>请输入您在这个图片中看到的字符:</h4>\n <div class="a-row a-text-center">\n <img src="https://images-na.ssl-images-amazon.com/captcha/docvmtpr/Captcha_pdbxophjdl.jpg">\n </div>\n <div class="a-row a-spacing-base">\n <div class="a-row">\n <div class="a-column a-span6">\n <label for="captchacharacters">输入字符</label>\n </div>\n <div class="a-column a-span6 a-span-last a-text-right">\n <a οnclick="window.location.reload()">换一张图</a>\n </div>\n </div>\n <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">\n </div>\n </div>\n </div>\n </div>\n\n <div class="a-section a-spacing-extra-large">\n\n <div class="a-row">\n <span class="a-button a-button-primary a-span12">\n <span class="a-button-inner">\n <button type="submit" class="a-button-text">继续购物</button>\n </span>\n </span>\n </div>\n\n </div>\n </form>\n\n </div>\n </div>\n\n </div>\n\n </div>\n\n <div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>\n\n <div class="a-text-center a-spacing-small a-size-mini">\n <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_claim?ie=UTF8&nodeId=200347160">使用条件</a>\n <span class="a-letter-space"></span>\n <span class="a-letter-space"></span>\n <span class="a-letter-space"></span>\n <span class="a-letter-space"></span>\n <a href="https://www.amazon.cn/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=200347130">隐私声明</a>\n 
</div>\n\n <div class="a-text-center a-size-mini a-color-secondary">\n © 1996-2015, Amazon.com, Inc. or its affiliates\n <script>\n if (true === true) {\n document.write(\'<img src="https://fls-cn.amaz\'+\'on.cn/\'+\'1/oc-csi/1/OP/requestId=E7YJGGJ3D0XX4DAM3FFV&js=1" />\');\n };\n </script>\n <noscript>\n <img src="https://fls-cn.amazon.cn/1/oc-csi/1/OP/requestId=E7YJGGJ3D0XX4DAM3FFV&js=0" />\n </noscript>\n </div>\n </div>\n <script>\n if (true === true) {\n var head = document.getElementsByTagName(\'head\')[0],\n prefix = "https://images-cn.ssl-images-amazon.com/images/G/01/csminstrumentation/",\n elem = document.createElement("script");\n elem.src = prefix + "csm-captcha-instrumentation.min.js";\n head.appendChild(elem);\n\n elem = document.createElement("script");\n elem.src = prefix + "rd-script-6d68177fa6061598e9509dc4b5bdd08d.js";\n head.appendChild(elem);\n }\n </script>\n</body></html>\n'
>>> r.request.headers
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> kv = {'user-agent':'Mozilla/5.0'}
>>> url="https://www.amazon.cn/gp/product/B01M8L5Z3Y"
>>> r = requests.get(url,headers = {'user-agent':'Mozilla//5.0'})
>>> r.status_code
200
>>> r.request.headers
{'user-agent': 'Mozilla//5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> r.text[:1000]
'<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\n<!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\n<!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<title dir="ltr">Amazon CAPTCHA</title>\n<meta name="viewport" content="width=device-width">\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\n<script>\n\nif (true === true) {\n var ue_t0 = (+ new Date()),\n ue_csm = window,\n ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },\n ue_furl = "fls-cn.amazon.cn",\n ue_mid = "AAHKV2X7AFYLW",\n '
>>>
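Note that both responses above, the 503 and the 200, carry the title "Amazon CAPTCHA", so the status code alone does not tell you whether the real product page came back. Below is a small standard-library sketch of a title check. The helper names page_title and is_captcha_page are hypothetical, and matching on the word "captcha" is only a heuristic assumption about this particular page.

```python
import re

# Captures the text of the first <title> element, allowing attributes such as dir="ltr"
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def page_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None

def is_captcha_page(html):
    """Heuristic: treat a page whose title mentions CAPTCHA as a robot check."""
    title = page_title(html)
    return title is not None and "captcha" in title.lower()

# A shortened stand-in for the HTML returned in the session above
sample = '<html><head><title dir="ltr">Amazon CAPTCHA</title></head><body></body></html>'
print(page_title(sample))
print(is_captcha_page(sample))
```

In a crawler, is_captcha_page(r.text) could be called right after r.raise_for_status() to decide whether the content is worth parsing.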
Full code (this uses the URL from Example 1, and the result now differs):
import requests

url = "https://item.jd.com/46106440551.html"
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # present the request as coming from a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Crawl failed")
Example 3: Submitting search keywords to Baidu and 360
Search engine keyword submission interfaces
Baidu keyword interface: http://www.baidu.com/s?wd=keyword
360 keyword interface: http://www.so.com/s?q=keyword
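The two interfaces differ only in the name of the query parameter (wd for Baidu, q for 360). As a rough sketch of how a keyword ends up in the URL, the standard library's urllib.parse.urlencode can build the same kind of query string. The helper build_search_url is a hypothetical name introduced here for illustration; requests performs this step itself when given params.

```python
from urllib.parse import urlencode

def build_search_url(base, param_name, keyword):
    """Append one percent-encoded query parameter to base."""
    return base + "?" + urlencode({param_name: keyword})

baidu_url = build_search_url("http://www.baidu.com/s", "wd", "Python")
so_url = build_search_url("http://www.so.com/s", "q", "Python")
print(baidu_url)
print(so_url)
```

Encoding the keyword this way keeps spaces and other special characters from breaking the URL.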
>>> import requests
>>> kv = {'wd':'Python'}
>>> r = requests.get("http://www.baidu.com/s",params=kv)
>>> r.status_code
200
>>> r.request.url
'http://www.baidu.com/s?wd=Python'
>>> len(r.text)
527861
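Full code for the Baidu search:
import requests

keyword = "Python"
try:
    kv = {'wd': keyword}  # Baidu expects the keyword in the wd parameter
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")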
>>> import requests
>>> kv = {'q':'Python'}
>>> r = requests.get('http://www.so.com/s',params=kv)
>>> r.status_code
200
>>> r.request.url
'https://www.so.com/s?q=Python'
>>> len(r.text)
285264
>>>
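Full code for the 360 search:
import requests

keyword = "Python"
try:
    kv = {'q': keyword}  # 360 expects the keyword in the q parameter
    r = requests.get("http://www.so.com/s", params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")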
Example 4: Crawling and saving a web image
Format of a web image link:
http://www.example.com/picture.jpg
National Geographic: http://www.nationalgeographic.com.cn/
Pick a web page that contains an image:
http://www.nationalgeographic.com.cn/photography/photo_of_the_day/3921.html
Image URL: http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg
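When saving the image, it is convenient to reuse the file name that appears at the end of the image URL. The sketch below shows one way to extract it with the standard library. The helper name filename_from_url is an assumption made for illustration, and parsing the URL first is a design choice so that a query string, if present, does not end up in the file name.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path component of url, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)

image_url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
name = filename_from_url(image_url)
print(name)
```

For a plain image link like this one, url.split('/')[-1] gives the same result.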
>>> import requests
>>> path ="D://abc.jpg"
>>> url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
>>> r=requests.get(url)
>>> r.status_code
200
>>> with open(path,'wb') as f:
...     f.write(r.content)
...
228206
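Complete code:
import os
import requests

url = "http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
root = "D://pics//"  # target directory; the name is only an example
path = root + url.split('/')[-1]  # keep the image's original file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)  # the with block closes the file automatically
        print("File saved successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")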
Example 5: Automatically looking up the location of an IP address
https://m.ip138.com/iplookup.asp?ip=ipaddress
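Because the IP address is simply appended to the lookup URL, a mistyped address (for example one containing a comma instead of a dot) is sent to the server unchanged. A minimal sketch, assuming you want to catch such mistakes locally, validates the address with the standard library's ipaddress module first. The helper build_lookup_url and the constant LOOKUP_URL are hypothetical names introduced here.

```python
import ipaddress

LOOKUP_URL = "https://m.ip138.com/iplookup.asp?ip="

def build_lookup_url(ip_text):
    """Validate ip_text and return the lookup URL; raise ValueError if it is invalid."""
    ip = ipaddress.ip_address(ip_text.strip())
    return LOOKUP_URL + str(ip)

lookup_url = build_lookup_url("202.204.80.112")
print(lookup_url)

try:
    build_lookup_url("202,204.80.112")  # comma instead of a dot
except ValueError as exc:
    print("invalid address:", exc)
```

The returned URL can then be passed to requests.get in the same way as in the session that follows.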
>>> import requests
>>> url='https://m.ip138.com/iplookup.asp?ip='
>>> r=requests.get(url+'202.204.80.112')
# The author got an error at this step: the target machine actively refused the connection
>>> r.status_code
200
>>> r.text[-500:]
"'\x078\x07#���QE\x00J�1�\x1fO�Gr̮�X��O\x04�B�\x1e;�Ǩ��\x00���\x17,�ry<�O9<�})���m�7o�?�\x1e�Q@\x19\r#�\x0c\x19��s�N��S�4��\x18�\x16\x03�O\x1cz�ۏ��4Q@\x16��H�\x13�(\x00��ہ�?\n�!��\x0eX�\x10H#�P}\x0e(��%�G+˹���>�:��嘀�\x00�#��\x14VS�z~�\x06n$K� .\x08�����0F ��\x7f�\x14U��_?́\x0e�.�bx�I��%\x03a�p\x17\x1c\x0e3�\x1f^�QT\x00�,u\x07\x07\x07�鏧'��I-8$�\x00 \x13�9=\x07A�QE\x00R��\x16���7s������\x12+\x07$�F\t��\x07\x1c\x1c�;zv��\x0cg�?��|���X\x1c��y� �����^��-�7O�\x1f_�\x14PH����ǒ3����VA&'�'k\x10�$�\x0cp9�}(�� \x04�D��\x1drs�w���P��H�\t$���z�E\x002�$D1\x04����pH�ק|�l�S��(�\x08�'�O~�ک�3\x1cd�ө'֊)�u�03䕊�pAV\x04\x1c��oS\\�f�q��x\x18��Q�8���\x0b$\x00\x0f\x03�����VKO&�\\�\x00�\x008�3��\n(�R��?�̽$�$�8'��9?�5\x12�m\x1f3t\x1d���E\x06\x07��"
Full code (the crawl failed):
import requests

url = 'https://m.ip138.com/iplookup.asp?ip='
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except requests.RequestException:
    print("Crawl failed")