Introduction to Web Crawlers

dengdouweng1282

Tags: crawler, python, json

Original link: http://www.cnblogs.com/lqerio/p/11163853.html

What is a crawler

A crawler is an automated program that sends requests to a website and extracts the data it needs from the responses.

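Crawler workflow

Step 1: Send the request

A request is usually sent to the target site through an HTTP library, which is equivalent to opening a browser and typing in the URL.
Protocols: http, https
Common libraries: urllib, urllib3, requests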
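
As a minimal sketch of this step, the snippet below builds a GET request with urllib and attaches an explicit User-Agent header and a timeout. The User-Agent string and the helper names `build_request` and `fetch` are assumptions made for illustration; they do not come from the original article.

```python
from urllib import request

# Hypothetical User-Agent string, chosen only for illustration.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; demo-crawler/0.1)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request object that carries an explicit User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; fetch() does.
req = build_request("http://www.baidu.com/")
print(req.get_method(), req.full_url)
```

Calling `fetch("http://www.baidu.com/")` would perform the actual request. Separating construction from sending makes it easy to inspect what will be sent before any traffic leaves the machine.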

Step 2: Get the response

The server returns the requested content, typically HTML, binary files (video, audio), documents, or JSON strings.
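
The different kinds of response bodies call for different handling. The sketch below is one possible way to dispatch on the Content-Type header: it decodes text, parses JSON, and leaves binary data as bytes. The function name `interpret_body`, the UTF-8 fallback, and the sample payloads are assumptions for illustration only.

```python
import json
from email.message import Message


def interpret_body(content_type, body):
    """Turn raw response bytes into text, parsed JSON, or leave them as bytes."""
    header = Message()
    header["Content-Type"] = content_type
    mime = header.get_content_type()                    # e.g. "text/html"
    charset = header.get_content_charset() or "utf-8"   # assumed fallback

    if mime == "application/json":
        return json.loads(body.decode(charset))
    if mime.startswith("text/"):
        return body.decode(charset)
    return body  # video, audio, images and so on: keep the bytes untouched


page = interpret_body("text/html; charset=utf-8", "<h1>你好</h1>".encode("utf-8"))
data = interpret_body("application/json", b'{"ip": "127.0.0.1"}')
blob = interpret_body("video/mp4", b"\x00\x01\x02")
print(type(page).__name__, type(data).__name__, type(blob).__name__)
```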

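Step 3: Parse the content

Locate the information you need and extract it with regular expressions or a parsing library.
Common tools: re, beautifulsoup4, XPath (for example through lxml)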

Step 4: Save the data

Persist the parsed data to a file or a database.
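
A minimal sketch of this step, using only the standard library, might write the same records to CSV and to SQLite. The records, column names, and table name below are made up for illustration. An in-memory buffer and an in-memory database are used so the example leaves no files behind.

```python
import csv
import io
import sqlite3

# Hypothetical parsed records: (date, weather, temperature).
records = [
    ("Day 1", "sunny", "31/24"),
    ("Day 2", "rain", "28/23"),
]

# Option 1: CSV. The StringIO buffer stands in for
# open("weather.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date", "weather", "temperature"])
writer.writerows(records)
csv_text = buffer.getvalue()

# Option 2: SQLite. ":memory:" keeps the demo self-contained;
# pass a file path instead to persist the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, weather TEXT, temperature TEXT)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", records)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM weather").fetchone()[0]

print(csv_text)
print(row_count)
```

A CSV file is convenient for a one-off export, while a database is usually the better fit once the crawler runs repeatedly and the data needs to be queried or deduplicated.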

Basic urllib usage

# Send a network request and read the content the server returns.
# Import the request module from the urllib package
from urllib import request

# URL to request
url = "http://www.baidu.com/"

# Send the request to Baidu; urlopen returns a response object
response = request.urlopen(url)

# read() returns the HTML as bytes
html = response.read()
#print(html)


#with open("./test1.txt",mode="wb") as fr:
#    fr.write(html)

'''
read() returns raw bytes, so multi-byte characters (such as Chinese) do not display correctly until the bytes are decoded.
bytes.decode() performs the decoding; pass it the character encoding to use.
'''
html = html.decode("UTF-8")
#print(html)

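# Get the URL that actually produced the response. It may differ from the requested URL if the server redirected the request
print(response.geturl())

# Get the response metadata; for HTTP and HTTPS this is the response headers
print(response.info())

# Get the HTTP status code of the response
# 200: request succeeded
# 302: redirect
# 404: requested resource not found
# 500: internal server error
print(response.getcode())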

Output

http://www.baidu.com/
Bdpagetype: 1
Bdqid: 0x87631268003a535b
Cache-Control: private
Content-Type: text/html
Cxy_all: baidu+f902db9f8065b69f528ff8659528b11a
Date: Wed, 10 Jul 2019 06:48:16 GMT
Expires: Wed, 10 Jul 2019 06:47:36 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=06DAF6CD02CFB27033679A54438F5D23:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=06DAF6CD02CFB27033679A54438F5D23; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1562741296; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: delPer=0; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=1432_21082_29238_28519_29099_28836_29220; path=/; domain=.baidu.com
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked


200
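
One detail worth knowing about the status codes listed above: urlopen follows redirects automatically and raises urllib.error.HTTPError for 4xx and 5xx responses, so getcode() is normally only reached on success. The sketch below shows one way to handle both error types. The function name `fetch_status` and its return shape are assumptions for illustration.

```python
from urllib import error, request


def fetch_status(url, timeout=10):
    """Return (status_code, body); map urllib's exceptions to a status value."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.getcode(), response.read()
    except error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status such as 404 or 500.
        return exc.code, b""
    except error.URLError as exc:
        # No HTTP response at all: DNS failure, refused connection, unknown scheme...
        return None, str(exc.reason).encode()


# An unknown URL scheme fails before any network traffic is sent.
status, detail = fetch_status("nosuchscheme://example.invalid/")
print(status, detail)
```

HTTPError is a subclass of URLError, so it has to be caught first. Otherwise the more general handler would swallow it and the status code would be lost.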

Basic requests usage

import requests

url = 'http://www.baidu.com'

r = requests.get(url)

# The text comes out garbled
# print(r.text)  


print(r.encoding)  
# encoding is ISO-8859-1

"""
    How requests chooses the response encoding:
    requests.utils.get_encoding_from_headers makes the decision from the response headers
    1. The headers set content-type,
        the value contains "text" but no charset: encoding is set to ISO-8859-1
    2. The headers set content-type,
        and the value includes a charset: encoding is that charset value
    3. The headers do not set content-type: encoding is None, and r.text falls back to an encoding guessed from the body (r.apparent_encoding)
"""

# Fix the garbled text by setting encoding to UTF-8 explicitly
r.encoding = 'UTF-8'  
html1 = r.text

# The output is now readable
print(html1)

Output

ISO-8859-1
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
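
To see why the wrong encoding produces garbled text, the following sketch reproduces the effect with plain bytes and no network access. It takes the page title from the output above, encodes it as UTF-8 (what the server sends), and decodes it both ways. The variable names are illustrative.

```python
title = "百度一下，你就知道"
raw = title.encode("utf-8")            # the bytes the server actually sends

garbled = raw.decode("iso-8859-1")     # what a wrong charset guess produces
fixed = raw.decode("utf-8")            # decoding with the correct charset

# ISO-8859-1 maps every byte to exactly one character, so the wrong decode
# loses nothing and can be undone by converting back to bytes first.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(len(title), len(raw), len(garbled))
print(fixed == title, recovered == title)
```

With requests, setting `r.encoding` before reading `r.text` plays the role of the second decode. `r.apparent_encoding` offers a guess based on the body when the correct charset is not known in advance.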

Parsing content

## Matching with regular expressions
import re
import requests

url = 'http://ip.tool.chinaz.com/'
#url = 'https://www.baidu.com/'
r = requests.get(url)

s = r.text
# print(s)
'''
On chinaz, the markup that contains the IP address is <dd class="fz24">59.172.176.224</dd>.
You can see it by opening the page in a Chromium-based browser and pressing F12.
'''
# Raw string; double quotes need no escaping inside single quotes
reg_str = r'<dd class="fz24">(.*?)</dd>'
res = re.compile(pattern=reg_str)

print(res.findall(s))

Output

['59.172.176.224']
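
Because the live page returns a different IP address for every visitor, the pattern is easier to study against a fixed string. The sketch below uses a hand-written HTML sample (only the `<dd class="fz24">` element comes from the article; the surrounding tags are made up) and contrasts the non-greedy `(.*?)` with the greedy `(.*)`.

```python
import re

# Sample markup modelled on the snippet quoted above; the other tags are invented.
sample_html = """
<dl>
  <dt>IP</dt>
  <dd class="fz24">59.172.176.224</dd>
  <dd class="fz14">other text</dd>
</dl>
"""

# (.*?) is non-greedy: it stops at the first </dd> that follows.
lazy = re.compile(r'<dd class="fz24">(.*?)</dd>', re.S)
# (.*) is greedy: with re.S it runs on to the last </dd> in the document.
greedy = re.compile(r'<dd class="fz24">(.*)</dd>', re.S)

print(lazy.findall(sample_html))
print(greedy.findall(sample_html))
```

The non-greedy form is what keeps the match confined to a single element, which is why the original pattern uses it.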

## Parsing with BeautifulSoup (CSS selectors)
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen("http://www.weather.com.cn/weather/101200101.shtml")

soup = BeautifulSoup(response, "html.parser")

print(soup)

# select() takes a CSS selector and returns a list of every matching element
date = soup.select("li.sky > h1")
wea = soup.select("li.sky > p.wea")
tem = soup.select("li.sky > p.tem")
wind = soup.select("p.win")  # named wind to avoid shadowing the built-in dir()
level = soup.select("p.win > i")


record = []

for i in range(len(date)):
    _date = date[i].text
    _wea = wea[i].text
    # _tem = tem[i].text.strip().replace("\n","")
    _tem = "".join(tem[i].stripped_strings)
    span = wind[i].select("span")
    # Join the wind directions with "转" ("turning to"); this also works if an entry has a single <span>
    _dir = "转".join(s.get("title", "") for s in span)
    _level = level[i].text
    record.append([_date, _wea, _tem, _dir, _level])

    print([_date, _wea, _tem, _dir, _level])

Output

<!DOCTYPE html>

<html>
<head>
<link href="http://i.tq121.com.cn" rel="dns-prefetch"/>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>【武汉天气】武汉天气预报,蓝天,蓝天预报,雾霾,雾霾消散,天气预报一周,天气预报15天查询</title>
<meta content="zh-cn" http-equiv="Content-Language"/>
<meta content="武汉天气预报,武汉今日天气,武汉周末天气,武汉一周天气预报,武汉蓝天,武汉蓝天预报,武汉雾霾,武汉雾霾消散,武汉40日天气预报" name="keywords">
<meta content="武汉天气预报，及时准确发布中央气象台天气信息，便捷查询武汉今日天气，武汉周末天气，武汉一周天气预报，武汉蓝天预报，武汉天气预报，武汉40日天气预报，还提供武汉的生活指数、健康指数、交通指数、旅游指数，及时发布武汉气象预警信号、各类气象资讯。" name="description"/>
<style>
     .xyn-zan {
            width: 30px;
            height: 51px;
            background: url(http://i.tq121.com.cn/i/weather2015/city/zan-bj.png) no-repeat;
            margin-bottom: 7px;
            text-align: center
        }

        .xyn-zan img {
            padding-top: 5px;
            cursor: pointer;
        }

        .xyn-zan p {
            line-height: 15px;
            color: #a4a4a4;
        }

        .xyn-weather-box {
            width: 354px;
            height: 261px;
            border: 1px solid #c2e5ff;
            background: #fff;
            position: absolute;
            top: -67px;
            left: 250px;
            z-index: 999999;
            display: none;

# The output continues for much longer and is omitted here
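
Since the printed page is too long to show the extracted rows, here is a small offline sketch of the same selector technique. The HTML is hand-written and only imitates the structure that the selectors above expect; the values, the English `title` attributes, and the " -> " separator are made up. It also iterates per `li.sky` item rather than over parallel lists, which is a swapped-in variation and not the article's original loop.

```python
from bs4 import BeautifulSoup

# Hand-written markup that imitates the structure targeted by the selectors.
sample_html = """
<ul>
  <li class="sky">
    <h1>Day 1</h1>
    <p class="wea">sunny</p>
    <p class="tem"><span>31</span>/<i>24</i></p>
    <p class="win"><em><span title="south wind"></span><span title="north wind"></span></em><i>&lt;3</i></p>
  </li>
  <li class="sky">
    <h1>Day 2</h1>
    <p class="wea">rain</p>
    <p class="tem"><span>28</span>/<i>23</i></p>
    <p class="win"><em><span title="east wind"></span></em><i>3-4</i></p>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for item in soup.select("li.sky"):
    # Selecting inside each item keeps the fields of one day together.
    date = item.select_one("h1").get_text(strip=True)
    wea = item.select_one("p.wea").get_text(strip=True)
    tem = "".join(item.select_one("p.tem").stripped_strings)
    wind = " -> ".join(s.get("title", "") for s in item.select("p.win span"))
    level = item.select_one("p.win > i").get_text(strip=True)
    rows.append([date, wea, tem, wind, level])

for row in rows:
    print(row)
```

Working item by item means a missing element in one day cannot shift the values of the following days, which can happen when separate lists are indexed in parallel.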