Day 2 of School Practical Training

她是爱是暖是光

Category: Python crawlers. Tags: Python crawlers.

Article link: https://blog.csdn.net/Curtainner/article/details/83547358

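Today's main topic is web crawlers. Comments and discussion are welcome.

Tools: PyCharm Professional (the Community Edition also works, though it has fewer features). Third-party libraries: requests, BeautifulSoup4, lxml, html5lib.

API documentation for the requests library: http://docs.python-requests.org/en/master/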

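1. First, what is a crawler?

A crawler, also known as a spider, is a program that automatically fetches information from the internet and extracts the parts that are valuable to us.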

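2. What can a crawler be used for?

Typical uses include collecting a company's user information, analyzing the users of a website, and applications of crawler technology in commercial banking, among others.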

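3. The basic crawler workflow

(1) Send the request: send a request to the server for a given URL. The request can carry extra header information.

(2) Receive the response: if the server responds normally, we get back a response containing the content we asked for. This may be HTML, a JSON string, or binary data (video, images) and so on.

(3) Parse the content: HTML can be parsed with an HTML parser, JSON can be converted into a JSON object and read from there, and binary data can be saved to a file for further processing.

(4) Save the data: write it to a local file or to a database (MySQL, Redis, MongoDB, etc.).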
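Steps (2) to (4) can be sketched without touching the network. This is a minimal sketch using only the standard library, with the response bytes supplied inline instead of fetched. The names `TitleParser`, `handle_response` and `save_record`, the sample payloads, and the output file name are all made up for illustration and are not part of the script in this post.

```python
import json
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text found inside <title> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def handle_response(content_type, body):
    """Step 3: choose a parsing strategy from the Content-Type value.

    body is the raw bytes of the response received in step 2.
    """
    if 'html' in content_type:
        parser = TitleParser()
        parser.feed(body.decode('utf-8'))
        return {'kind': 'html', 'title': parser.title}
    if 'json' in content_type:
        return {'kind': 'json', 'data': json.loads(body.decode('utf-8'))}
    # Anything else (images, video) is treated as opaque binary data.
    return {'kind': 'binary', 'size': len(body)}


def save_record(record, path):
    """Step 4: persist the parsed result, here as a JSON file."""
    Path(path).write_text(json.dumps(record, ensure_ascii=False), encoding='utf-8')


html_result = handle_response('text/html; charset=utf-8',
                              b'<html><title>Demo</title></html>')
json_result = handle_response('application/json', b'{"likes": 3}')
binary_result = handle_response('image/jpeg', b'\xff\xd8\xff')

out_path = Path(tempfile.gettempdir()) / 'crawler_demo.json'
save_record(html_result, out_path)
print(html_result, json_result, binary_result)
```

In a real crawler the `content_type` and `body` values would come from the response object, and a full HTML parser such as BeautifulSoup would replace the tiny `TitleParser`.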

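4. Now let's start writing a crawler

We use the 校花网 (Xiaohuar) site as the example (our teacher is clearly a seasoned hand, starting us off by crawling this one, haha). The site is http://www.xiaohuar.com/. The code is below, with detailed comments.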

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import os       # directory operations: create, delete, move
from urllib.request import urlretrieve      # download remote data to a local file

# Crawl the profile information from the site
def get_xiaohua_info():
    # Define the request headers to imitate a browser
    headers={
        'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/'
            '537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    # Send the HTTP request as a browser would: get or post
    responses = requests.get(url=URL,headers=headers,timeout=10)
    # Check the status code; 200 means success
    #print(responses.status_code)
    if responses.status_code==200:
        # responses.encoding sets the encoding used to decode responses.text
        responses.encoding='utf-8'
        # responses.text returns the page source as a string
        # responses.content returns the raw bytes
        #print(responses.text)

        # Use bs4 to build filters that select the content we want
        # BeautifulSoup() arguments: (raw page bytes, parser name: html5lib or lxml)
        bs = BeautifulSoup(responses.content,'html5lib')
        # Filter rules: div all_lanmu, then its ul, then each li
        # find_all() arguments: filter by tag name, then by attribute name and value
        div_list = bs.find_all('div',attrs={'class':'all_lanmu'})
        file = open('xiaohuar_data.txt','w',encoding='utf-8')
        txt = ''
        #print(div_list)
        # Iterate over the all_lanmu divs
        for div_lanmu in div_list:
            div_title = div_lanmu.find('div',attrs={'class':'title'})
            a_title = div_lanmu.find('a')
            # tag.string: get the text inside the tag
            #print(a_title.string)
            lanmu_title = a_title.string
            txt += lanmu_title + '\n\n'
            ul = div_lanmu.find('ul',attrs={'class':'twoline'})
            # Skip columns that have no such list
            if ul is not None:
                li_list = div_lanmu.find_all('li')
                #print(li_list)
                # Fields to collect: name, school, likes, paths (image, detail page)
                for li in li_list:
                    name = li.find('span').string
                    school = li.find('b',attrs={'class':'b1'}).string
                    like = li.find('b',attrs={'class':'b2'}).string
                    img_path = li.find('img')['lazysrc']
                    two_page = li.find('a')['href']
                    #print(name,school,like,img_path,two_page)
                    txt += 'Name: '+name+'\n'
                    txt += 'School: ' + school + '\n'
                    txt += 'Likes: ' + like + '\n'
                    txt += 'Detail page: ' + two_page + '\n'
                    # Turn a relative image path into an absolute URL
                    if not img_path.startswith('http'):
                        img_path = URL.rstrip('/')+'/'+img_path.lstrip('/')
                    txt += 'Image: ' + img_path + '\n'
                    get_xiaohua_pic(img_path=img_path,name=name)
        file.write(txt)
        file.close()
    else:
        print('Cannot access the site, status code:',responses.status_code)

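# Download one profile picture
def get_xiaohua_pic(img_path,name):
    download = 'download'
    # Create the download directory if it does not exist yet
    os.makedirs(download,exist_ok=True)
    #name = img_path.split('/')  # split the string
    #name = name[len(name)-1]    # take the last element
    # Catch download errors so one bad image does not stop the crawl
    try:
        urlretrieve(img_path,os.path.join(download,name+'.jpg'))
    except Exception as exc:
        print('Sorry, the download failed:',exc)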

if __name__ == '__main__':
    # Target site
    URL = 'http://www.xiaohuar.com/'
    # Call the function
    get_xiaohua_info()
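
The script builds the image URL and the local file name by plain string concatenation. One alternative, shown here only as a sketch and not as what the original code does, is `urllib.parse.urljoin` from the standard library, which copes with leading slashes and with paths that are already absolute URLs. The helper names `absolute_url` and `safe_filename` and the sample paths are hypothetical.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://www.xiaohuar.com/'


def absolute_url(base, path):
    """Resolve path against base; absolute URLs are returned unchanged."""
    return urljoin(base, path)


def safe_filename(name, extension='.jpg'):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', name).strip('_')
    return (cleaned or 'unnamed') + extension


print(absolute_url(BASE_URL, '/d/file/example.jpg'))
print(absolute_url(BASE_URL, 'd/file/example.jpg'))
print(absolute_url(BASE_URL, 'http://img.example.com/a.jpg'))
print(safe_filename('a/b: c'))
```

Cleaning the name matters because the script uses the scraped name directly as the file name, so a slash or colon in it would make the download fail.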

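I recommend reading up on how to use BeautifulSoup4. Personally I find the find and find_all methods rather confusing. You can also parse pages with lxml, which I plan to learn when I get the chance.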
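
Since `find` and `find_all` cause the most confusion, here is a small sketch that runs them on an inline HTML snippet. The markup is made up for illustration: it only imitates the class names used in the script, and the real page may well be structured differently. The parser is Python's built-in `html.parser`, chosen here so that no extra parser package is required.

```python
from bs4 import BeautifulSoup

html = """
<div class="all_lanmu">
  <div class="title"><a href="/list">Column A</a></div>
  <ul class="twoline">
    <li><span>Alice</span><b class="b1">School X</b><b class="b2">12</b></li>
    <li><span>Bob</span><b class="b1">School Y</b><b class="b2">7</b></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None when nothing matches.
first_li = soup.find('li')
missing = soup.find('table')

# find_all() always returns a list-like ResultSet, which may be empty.
all_li = soup.find_all('li')
no_tables = soup.find_all('table')

# attrs filters on attribute values; class_ is a shorthand for the class attribute.
names = [li.find('span').string for li in all_li]
schools = [b.string for b in soup.find_all('b', attrs={'class': 'b1'})]
likes = [b.string for b in soup.find_all('b', class_='b2')]

print(first_li.span.string, missing, len(all_li), len(no_tables))
print(names, schools, likes)
```

The practical difference is in the failure mode: chaining onto a `find` result that is `None` raises `AttributeError`, while looping over an empty `find_all` result simply does nothing.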
