Python Office Automation, Task 4: A Simple Python Web Scraper

1. The requests module

import requests

res = requests.get('https://www.baidu.com')   # send an HTTP GET request ('res', not 're', which would shadow the re module)
print(f'type of res: {type(res)}')
print(f'res.status_code: {res.status_code}')

out:

        type of res: <class 'requests.models.Response'>

        res.status_code: 200

Note: ① mind the trailing "s" in both requests and https;

        ② res.status_code tells us whether the request succeeded; a value of 200 means success.
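As a minimal sketch (reusing the Baidu URL from above), the status code can be checked before the response is used; requests also provides raise_for_status(), which raises an HTTPError for 4xx/5xx responses:

import requests

res = requests.get('https://www.baidu.com')
if res.status_code == 200:   # 200 means the request succeeded
    print('request OK')
else:
    print(f'request failed with status {res.status_code}')
res.raise_for_status()       # or let requests raise an HTTPError on 4xx/5xx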

2. Downloading a txt file

import requests

res = requests.get("https://apiv3.shanbay.com/codetime/articles/mnvdu")
print('res.status_code: %s' % res.status_code)
with open('luxun_article.txt', 'w', encoding='utf-8') as file:   # an article by Lu Xun; pass an explicit encoding
    print('saving the article')
    file.write(res.text)

out:

        res.status_code: 200

        saving the article
        2661

Note: the 2661 is the return value of file.write(), i.e. the number of characters written; it appears because the snippet was run interactively (e.g. in Jupyter). The same goes for the 36585 below.

3. Downloading an image

import requests

res = requests.get('https://img-blog.csdnimg.cn/20210424184053989.PNG')
print('res.status_code: %s' % res.status_code)
with open('datawhale.png', 'wb') as file:   # 'wb': write in binary mode
    print('downloading the image')
    file.write(res.content)                 # res.content holds the raw bytes of the response
    

out:

        res.status_code: 200

        downloading the image

        36585

Note: whether you are saving images or text, using a with ... as ...: block is good practice. Saving does not literally fail without it, but with guarantees the file is flushed and closed even if an error occurs; otherwise you must remember to call file.close() yourself.
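For comparison, a minimal sketch of what the with block does for you behind the scenes (same datawhale.png example as above):

import requests

res = requests.get('https://img-blog.csdnimg.cn/20210424184053989.PNG')
file = open('datawhale.png', 'wb')
try:
    file.write(res.content)
finally:
    file.close()   # without close(), buffered data may never reach the disk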

4. Parsing HTML

        When you type a URL into the browser, the browser sends a request to the server, and the server responds. What the server actually returns to the browser is HTML code; the browser then renders that HTML into the page we normally see.

import requests

res = requests.get('https://baidu.com')
res.encoding = 'utf-8'   # tell requests which encoding to use when decoding res.text
print(res.text)

out:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

 Note: if the response is not decoded with the correct encoding ('utf-8' here), the printed text will come out garbled. Common encodings include ASCII, GBK and UTF-8.
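When the correct encoding is unknown, requests can guess it from the response body; a minimal sketch (the hands-on example in section 6 uses exactly this trick):

import requests

res = requests.get('https://baidu.com')
print(res.encoding)                     # encoding declared by the response headers (may be wrong)
print(res.apparent_encoding)            # encoding guessed from the body content
res.encoding = res.apparent_encoding    # apply the guess before reading res.text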

5. Introducing BeautifulSoup

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

res = requests.get('https://book.douban.com/top250', headers=headers)
soup = BeautifulSoup(res.text, 'lxml')   # 'lxml' names the HTML parser to use
print(f'type of soup: {type(soup)}')
print(soup)

out:

        type of soup: <class 'bs4.BeautifulSoup'>

        (too long, omitted)

 Note: ① headers is the request header sent along with the request. A request without headers may be identified as a crawler and refused by the server, although most sites do not enforce this.

        BeautifulSoup provides some methods for searching the parse tree:

  • find() returns the first element that matches
  • find_all() returns all elements that match
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

res = requests.get('https://book.douban.com/top250', headers=headers)
soup = BeautifulSoup(res.text, 'lxml')
print(soup.find('a'))       # the first <a> tag
print(soup.find_all('a'))   # a list containing every <a> tag

 out:

        <a class="nav-login" href="https://accounts.douban.com/passport/login?source=book" rel="nofollow">登录/注册</a>

         (omitted)

 Besides tag names, BeautifulSoup supports other ways of locating elements:

# locate the <div> tag whose id is 'doubanapp-tip'
soup.find('div', id='doubanapp-tip')
# locate every <span> tag whose class is 'rating_nums'
soup.find_all('span', class_='rating_nums')
# class is a reserved keyword in Python, so class_ stands in for the HTML class attribute
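A found Tag behaves like a small document: .text returns its text content and tag['href'] reads an attribute; both are used in the hands-on example below. BeautifulSoup also accepts CSS selectors through select()/select_one(). A minimal sketch against the same Top 250 soup:

rating = soup.find('span', class_='rating_nums')
if rating is not None:
    print(rating.text)                    # text content of the tag
first_link = soup.find('a')
print(first_link['href'])                 # read the href attribute
print(soup.select_one('#doubanapp-tip'))  # CSS selector: by id
print(soup.select('span.rating_nums'))    # CSS selector: tag name plus class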

6. Hands-on: scraping Ziroom apartment listings

        Scrape the Wuhan rental listings on the Ziroom site: each home's name, price, area, orientation, layout, location, floor, whether it has an elevator, year built, door-lock type and greenery, and save the result as a spreadsheet (a CSV file, which Excel can open).

        (1) Use a pool of User-Agent strings so the crawler is harder to identify; such UA strings are easy to find online. Picking one at random per request is shown right after the list.

# rotating User-Agent strings protects the crawler to some extent
user_agent = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"]

        (2) Crawl the first 50 pages

import requests
from bs4 import BeautifulSoup
import random
import time
import csv

user_agents = user_agent   # reuse the UA pool defined in step (1); the original double assignment was redundant
def get_info():
    csvheader = ['name', 'area', 'orientation', 'layout', 'location',
                 'floor', 'elevator', 'year built', 'door lock', 'greenery']
    # 'a+' appends, so rerunning the script will add a second header row
    file = open('ziroom_wuhan.csv', 'a+', newline='', encoding='utf-8')
    writer = csv.writer(file)
    writer.writerow(csvheader)
    for i in range(1, 51):                      # pages 1 through 50 (range(1, 50) would stop at 49)
        print('taking a short break first!')
        time.sleep(random.choice([1, 2, 3]))    # random.choice needs a sequence, not a set
        url = 'https://wh.ziroom.com/z/p%s/' % i
        print(f'scraping page {i}')
        headers = {'User-Agent': random.choice(user_agents)}   # 'User-Agent', not 'User_Agent'
        res = requests.get(url, headers=headers)
        res.encoding = res.apparent_encoding
        soup = BeautifulSoup(res.text, 'lxml')
        all_info = soup.find_all('div', class_='info-box')
        for info in all_info:
            href = info.find('a')
            if href is not None:
                href = 'https:' + href['href']
                try:
                    print('scraping ' + href)
                    house_info = get_house_info(href)
                    writer.writerow(house_info)
                except Exception:
                    print('failed to scrape %s' % href)
    file.close()

        (3) Scrape the details of each listing

import requests
from bs4 import BeautifulSoup
import time
import random
import csv

'''  CSV columns: ['name','area','orientation','layout','location','floor','elevator','year built','door lock','greenery']
     room_info = [name, area, orien, area_type, location, loucen, dianti, niandai, mensuo, lvhua]
'''
def get_house_info(href):
    time.sleep(1)
    headers = {'User-Agent': random.choice(user_agents)}                    # 'User-Agent', not 'User_Agent'
    res1 = requests.get(href, headers=headers)
    res1 = res1.content.decode('utf-8', 'ignore')
    soup = BeautifulSoup(res1, 'lxml')
    name = soup.find('h1', class_="Z_name").text                            # <h1 class="Z_name">
    sinfo = soup.find('div', class_='Z_home_b clearfix').find_all('dd')     # 'div', not 'dir': <div class="Z_home_b clearfix">
    area = sinfo[0].text
    orien = sinfo[1].text          # orientation
    area_type = sinfo[2].text      # layout
    cinfo = soup.find('ul', class_="Z_home_o").find_all('li')               # <ul class="Z_home_o">
    location = cinfo[0].find('span', class_="va").text                      # <span class="va">
    loucen = cinfo[1].find('span', class_="va").text                        # floor
    dianti = cinfo[2].find('span', class_="va").text                        # elevator
    niandai = cinfo[3].find('span', class_="va").text                       # year built
    mensuo = cinfo[4].find('span', class_="va").text                        # door lock
    lvhua = cinfo[5].find('span', class_="va").text                         # greenery (the '=' after class_ was missing)
    room_info = [name, area, orien, area_type, location, loucen, dianti, niandai, mensuo, lvhua]
    return room_info
    
if __name__ == '__main__':
    get_info()
