Beginner Web Scraping in Practice, Part 1

今天不穿棉裤

Category: Python爬虫实战 (Python Web Scraping in Practice); tags: python

Article link: https://blog.csdn.net/m0_46641433/article/details/107804129

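A beginner's introduction to web scraping, following the Beijing Institute of Technology MOOC

This is my first time systematically studying a third-party Python library, and my first post on CSDN, so please bear with me!

1. Scraping a JD.com product page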

# Full code
import requests
url = "https://item.jd.com/67119061697.html"
try:
    r = requests.get(url)
    r.raise_for_status()                 # raise HTTPError on a 4xx/5xx status
    r.encoding = r.apparent_encoding     # guess the encoding from the body
    print(r.text[:1000])                 # print, not return: this code is not inside a function
except requests.RequestException:
    print("Scraping failed")
# Line-by-line walkthrough in IDLE
>>> import requests
>>> url = "https://item.jd.com/67119061697.html"
>>> r = requests.get(url)
>>> r.status_code
200
>>> r.encoding
'UTF-8'
>>> r.text[:1000]
"<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F67119061697.html'</script>"

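The JD response differs from what was shown in the course: instead of the product page, we get a short script that redirects to JD's login page. Let's set that aside for now and move on.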
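A status code of 200 does not guarantee that the body is the page we asked for. As a minimal sketch, the check below recognizes a script-only redirect like the one JD returned and recovers the page it was guarding. The helper names find_js_redirect and original_target are made up for this illustration, and the regular expression only covers the exact pattern seen in the output above.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the single-quoted target of a window.location.href assignment.
# Assumption: the redirect looks like the JD response shown in this post.
_REDIRECT_RE = re.compile(r"window\.location\.href\s*=\s*'([^']+)'")


def find_js_redirect(html):
    """Return the URL a JavaScript redirect points to, or None if there is none."""
    match = _REDIRECT_RE.search(html)
    return match.group(1) if match else None


def original_target(redirect_url):
    """Return the decoded ReturnUrl query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(redirect_url).query)
    values = query.get("ReturnUrl")
    return values[0] if values else None


# The body copied from the IDLE session above.
body = ("<script>window.location.href='https://passport.jd.com/uc/login?"
        "ReturnUrl=http%3A%2F%2Fitem.jd.com%2F67119061697.html'</script>")

target = find_js_redirect(body)
print(target)                   # the login URL
print(original_target(target))  # http://item.jd.com/67119061697.html
```

Running a check like this on r.text before parsing makes it obvious when the site has sent a login redirect rather than product data.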

2. Scraping an Amazon product page

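# Full code
import requests
url = "https://www.amazon.cn/dp/B08BVMKLHS?ref_=Oct_DLandingS_D_651328d8_60&smid=A3CQWPW49OI3BQ"
kv = {'user-agent': 'Mozilla/5.0'}       # present the request as coming from a browser
try:
    r = requests.get(url, headers=kv)
    r.raise_for_status()                 # raise HTTPError on a 4xx/5xx status
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Scraping failed")
# Line-by-line walkthrough in IDLE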
>>> import requests
>>> r=requests.get("https://www.amazon.cn/dp/B08BVMKLHS?ref_=Oct_DLandingS_D_651328d8_60&smid=A3CQWPW49OI3BQ")
>>> r.status_code
503
>>> r.encoding
'ISO-8859-1'
>>> r.encoding=r.apparent_encoding
>>> r.text[:1000]
'<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->\n<!--[if IE 7]>    <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->\n<!--[if IE 8]>    <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!-->\n<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<title dir="ltr">Amazon CAPTCHA</title>\n<meta name="viewport" content="width=device-width">\n<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">\n<script>\n\nif (true === true) {\n    var ue_t0 = (+ new Date()),\n        ue_csm = window,\n        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },\n        ue_furl = "fls-cn.amazon.cn",\n        ue_mid = "AAHKV2X7AFYLW",\n '
>>> r.request.headers
{'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> kv={'user-agent':'Mozilla/5.0'}
>>> url="https://www.amazon.cn/dp/B08BVMKLHS?ref_=Oct_DLandingS_D_651328d8_60&smid=A3CQWPW49OI3BQ"
>>> r=requests.get(url,headers=kv)
>>> r.status_code
200
>>> r.request.headers
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

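When we fetch the Amazon page from IDLE without custom headers, status_code is 503 rather than 200, which means the request was not served normally.

Inspecting r.request.headers shows that the User-Agent header of our request is **'python-requests/2.24.0'**. In other words, our crawler is honestly telling Amazon: ***this request was generated by Python's Requests library, and Amazon is free to refuse such requests.*** We therefore need to change the header so that the request looks like it comes from a browser.

The next step is to build a key-value pair that overrides the header: **kv={'user-agent':'Mozilla/5.0'}**. With this, the request presents itself as a browser, and we can retrieve the product information.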
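To make the reasoning concrete, here is a purely hypothetical sketch of the kind of check a server could run on incoming headers. It is not Amazon's actual logic, which is not public; the function name is_script_client and the blocked prefixes are assumptions for illustration. It also shows why the lowercase key 'user-agent' works: HTTP header names are case-insensitive.

```python
# Hypothetical server-side filter; NOT Amazon's real implementation.
# Assumption: the server rejects clients whose User-Agent starts with one of these.
BLOCKED_PREFIXES = ("python-requests/", "python-urllib/")


def is_script_client(headers):
    """Return True if the User-Agent header identifies a known scripting library."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    return user_agent.lower().startswith(BLOCKED_PREFIXES)


# The two header sets printed by r.request.headers in the IDLE session above.
default_headers = {'User-Agent': 'python-requests/2.24.0',
                   'Accept-Encoding': 'gzip, deflate',
                   'Accept': '*/*', 'Connection': 'keep-alive'}
browser_like = {'user-agent': 'Mozilla/5.0',
                'Accept-Encoding': 'gzip, deflate',
                'Accept': '*/*', 'Connection': 'keep-alive'}

print(is_script_client(default_headers))  # True
print(is_script_client(browser_like))     # False
```

Real sites may use many other signals, so changing the User-Agent is a first step to try rather than a guarantee.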

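3. Submitting search keywords to Baidu/360

Both Baidu and 360 expose a URL interface for search keywords.

Baidu: http://www.baidu.com/s?wd=keyword

360: http://www.so.com/s?q=keyword

Replacing keyword with your own term performs a search for it.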

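# Full code
import requests
keyword = "Python"
try:
    kv = {'wd': keyword}                 # Baidu's query parameter is wd
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)                 # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Scraping failed")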

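Here we use the params argument of requests.get (an argument, not a method): the dictionary is URL-encoded and appended to the URL as a query string.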
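The effect of params can be reproduced with the standard library alone, which helps when you want to see the final URL without sending a request. This is a sketch of the idea rather than the code Requests runs internally; the function build_search_url and the SEARCH_ENGINES table are names introduced only for this example.

```python
from urllib.parse import urlencode

# Base URLs and parameter names are the ones listed in this post.
SEARCH_ENGINES = {
    "baidu": ("http://www.baidu.com/s", "wd"),
    "360": ("http://www.so.com/s", "q"),
}


def build_search_url(engine, keyword):
    """Return the search URL for a keyword, percent-encoding it as needed."""
    base, param = SEARCH_ENGINES[engine]
    # urlencode turns {'wd': 'Python'} into 'wd=Python' and escapes unsafe characters.
    return base + "?" + urlencode({param: keyword})


print(build_search_url("baidu", "Python"))  # http://www.baidu.com/s?wd=Python
print(build_search_url("360", "Python"))    # http://www.so.com/s?q=Python
```

Non-ASCII keywords are percent-encoded as UTF-8, which is why passing a dictionary is safer than concatenating strings by hand.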

4. Fetching and saving an image from the web

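# Full code
import requests
import os
url = "http://image.ngchina.com.cn/2018/0815/20180815032736913.jpg"
root = "D://"
path = root + url.split('/')[-1]         # save under the image's own file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()             # do not save an error page as an image
        with open(path, 'wb') as f:      # the with block closes the file for us
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Scraping failed")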

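Here we first find the address of an image on some website; a URL ending in .jpg or a similar extension indicates a file that can be saved. We use the os library to check whether the target directory exists and create it if it does not. Note ***path = root+url.split('/')[-1]***: this takes the last segment of the URL, so the file is saved under the image's own name. Finally we open the file with a with ... as statement and write the image in binary mode.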
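The file-name step can be made a little more robust. The sketch below assumes the image URL might carry a query string or end in a slash, cases that url.split('/')[-1] does not handle well; the helper name local_path_for and the directory name "images" are invented for this example.

```python
import os
import posixpath
from urllib.parse import urlsplit


def local_path_for(url, root):
    """Build a local path from the last segment of the URL's path component."""
    # urlsplit separates out any ?query or #fragment, which a plain
    # url.split('/')[-1] would leave attached to the file name.
    filename = posixpath.basename(urlsplit(url).path)
    if not filename:
        raise ValueError("URL does not end in a file name: " + url)
    # os.path.join picks the right separator for the current platform.
    return os.path.join(root, filename)


url = "http://image.ngchina.com.cn/2018/0815/20180815032736913.jpg"
path = local_path_for(url, "images")
print(os.path.basename(path))  # 20180815032736913.jpg
```

The returned path can then be passed to open(path, 'wb') exactly as in the code above.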

When designing a program, we should handle as many of the errors that can realistically occur as possible, so that the program stays stable.

5. IP address lookup

Here we use the IP138 website to look up an IP address.

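# Full code
import requests
url = "https://www.ip138.com/iplookup.asp?ip="
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()                 # raise HTTPError on a 4xx/5xx status
    r.encoding = r.apparent_encoding
    print(r.text[-500:])                 # the result is near the end of the page
except requests.RequestException:
    print("Scraping failed")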

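I looked at other bloggers' posts, but I still don't know why this snippet fails to return a result. The idea is the same as the Baidu/360 search above: the site exposes a lookup interface through its URL.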
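Because the address is simply appended to the lookup URL, one small safeguard is to validate it before sending anything. This is a sketch using the standard ipaddress module; it does not explain or fix the failure described above, and the helper name build_lookup_url is introduced only for this example.

```python
import ipaddress

# Lookup endpoint as given in this post.
LOOKUP_URL = "https://www.ip138.com/iplookup.asp?ip="


def build_lookup_url(ip_text):
    """Return the lookup URL for an address, rejecting malformed input."""
    # ip_address raises ValueError for strings that are not valid IPv4/IPv6 addresses.
    ip = ipaddress.ip_address(ip_text.strip())
    return LOOKUP_URL + str(ip)


print(build_lookup_url("202.204.80.112"))
# https://www.ip138.com/iplookup.asp?ip=202.204.80.112
```

A malformed address such as "202.204.80.999" raises ValueError here, before any network request is made.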

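That's all for today's examples! Writing this blog may help others a little, and it also makes my own review easier later on. I'll correct any shortcomings in these examples as I come to understand them. Bye!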
