1. Introduction to Web Crawlers

橘长

Last modified: 2022-02-11 08:28:36

Category: Crawlers. Tags: crawler, python, search engine

First published: 2022-02-11 08:27:39

Article link: https://blog.csdn.net/qq_43635902/article/details/122871469

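Crawlers

A typical use case: filtering job postings when you are looking for work.

Crawler: a program that automatically collects large amounts of information from the internet.
"Large amounts": as a rule of thumb here, a dataset only counts as large once it reaches roughly ten million records.

Also known as: crawler, spider, web robot, web spider.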

Definition

A program that automatically collects large amounts of information from the internet.

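Why do we need crawlers?

1. They collect data automatically, which saves labor and money.
2. Big data: crawlers are a data source.
3. Data analysis and data mining: crawlers are a data source.
4. They supply data to search engines.
5. The pay is good and the job market is broad.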

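Types of crawlers

1. General-purpose crawlers
		Powerful and broad in coverage; typically used by search engines.
2. Focused crawlers
		Single-purpose (they target one website or app). The overwhelming majority of crawlers written in practice are focused crawlers.
3. Incremental crawlers (updates)
		They must distinguish new data from data that has already been collected.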
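
The requirement that an incremental crawler tell new data from old can be sketched with a set of fingerprints. This is a minimal illustration under stated assumptions, not a production design: the names `fingerprint` and `filter_new` and the sample pages are hypothetical, and a real crawler would persist the set (for example in a database) instead of keeping it in memory.

```python
import hashlib


def fingerprint(url, body):
    """Return a stable hash identifying one version of one page."""
    digest = hashlib.sha256()
    digest.update(url.encode('utf-8'))
    digest.update(b'\0')  # separator so url and body cannot run together
    digest.update(body.encode('utf-8'))
    return digest.hexdigest()


def filter_new(pages, seen):
    """Return only the pages not seen before, recording them in `seen`."""
    fresh = []
    for url, body in pages:
        key = fingerprint(url, body)
        if key not in seen:
            seen.add(key)
            fresh.append((url, body))
    return fresh


seen = set()
first_run = [('/a', 'v1'), ('/b', 'v1')]
second_run = [('/a', 'v1'), ('/b', 'v2'), ('/c', 'v1')]

new_first = filter_new(first_run, seen)    # both pages are new
new_second = filter_new(second_run, seen)  # '/a' is unchanged and is skipped
print(new_first, new_second)
```

Hashing the URL together with the body means a page whose content changed counts as new, which is usually what an update crawl wants.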

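The gentlemen's agreement (robots.txt)

1. Append /robots.txt to a site's root URL to view its "gentlemen's agreement". It states which crawlers may fetch which paths, and which they may not.
		The agreement is a convention that crawlers are expected to honor, not a technical barrier. Violating it can get you sued; staying within it is generally treated as acceptable crawling. Whether the data is used commercially also matters.
2. sitemap
		Lets search engines discover and index a site's pages quickly.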
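
Python's standard library can evaluate robots.txt rules through `urllib.robotparser`. The sketch below feeds it a hypothetical robots.txt inlined as a string so that it runs offline; the `my-crawler` user agent and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL
allowed_public = parser.can_fetch('my-crawler', 'https://example.com/books/1.html')
allowed_private = parser.can_fetch('my-crawler', 'https://example.com/private/data.html')
print(allowed_public, allowed_private)
```

Against a live site, the same parser can load the file itself: call `set_url()` with the site's robots.txt address, then `read()`, before calling `can_fetch()`.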

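How crawlers work

1. Everything is data-driven: the goal is to get the data, by whatever technique works.
2. The underlying mechanism: request and response.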
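
To make the request/response cycle concrete, here is a self-contained sketch that plays both roles on the local machine. The `HelloHandler` server, the page body, and the `demo-crawler` User-Agent are all hypothetical stand-ins for a real website and a real crawler.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET with a small HTML page."""

    def do_GET(self):
        body = b'<html><body>hello crawler</body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: build the request, send it, then read the response
url = 'http://127.0.0.1:%d/' % server.server_port
request = Request(url, headers={'User-Agent': 'demo-crawler'})
with urlopen(request, timeout=5) as response:
    status = response.status
    content_type = response.headers.get('Content-Type')
    html = response.read().decode('utf-8')

server.shutdown()
server.server_close()
print(status, content_type, html)
```

A response carries three things a crawler cares about: the status code, the headers, and the body. The crawler's job is to send the right request and then extract data from the body.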

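A first crawler program

1. How to send a GET request
	Python ships with a built-in package, urllib.
	The function to use is urlopen(), which lives in the urllib.request module.
	from urllib.request import urlopen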

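from urllib.request import urlopen

url = 'http://www.baidu.com'
# Send a GET request; the timeout keeps the call from hanging forever
with urlopen(url, timeout=10) as res:
    # read() returns the response body as bytes
    body = res.read()
# bytes.decode() defaults to UTF-8
print(body.decode())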

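Compressed data

gzip
import gzip

data = 'hello crawler'.encode('utf-8')

# Compress: takes bytes, returns gzip-compressed bytes
packed = gzip.compress(data)

# Decompress: takes gzip-compressed bytes, returns the original bytes
unpacked = gzip.decompress(packed)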
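
A crawler often does not know in advance whether a response body is compressed. One way to decide, sketched below, is to check for the two-byte gzip magic number before decompressing. The helper name `maybe_gunzip` and the sample page are hypothetical.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # every gzip stream starts with these two bytes


def maybe_gunzip(raw):
    """Decompress `raw` if it looks like gzip data, else return it unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


page = '<html>爬虫</html>'.encode('utf-8')
compressed = gzip.compress(page)

from_plain = maybe_gunzip(page)       # returned as-is
from_gzip = maybe_gunzip(compressed)  # decompressed back to the page bytes
print(from_plain == from_gzip == page)
```

Checking the response's `Content-Encoding` header is the other common signal; the magic-number check has the advantage of working on the bytes alone.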

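XPath

XPath is a query language for selecting nodes in XML and HTML documents.
In Python it is available through the third-party lxml library.
pip install lxml
Using the Tsinghua PyPI mirror:
Set it as the default
Upgrade pip to the latest version (>=10.0.0), then configure it:

pip install pip -U
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
If your connection to the default pip index is poor, use the mirror temporarily to upgrade pip:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U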

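1. nodename: selects elements with that tag name
2. /: selects from the root node, or steps to the direct children of the previous node
3. //: selects matching nodes anywhere in the document, at any depth
4. @attribute: selects an attribute
5. []: a predicate that filters nodes
6. text(): selects the text inside an element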
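
The syntax rules above can be tried on a small inline page. This sketch assumes lxml is installed; the HTML fragment is hypothetical and only loosely modeled on a book index page.

```python
from lxml import etree

# Hypothetical page fragment, inlined so the example runs offline
HTML = """
<html><body>
  <div class="novellist">
    <ul>
      <li><a href="/book/1/">Book One</a></li>
      <li><a href="/book/2/">Book Two</a></li>
    </ul>
  </div>
  <div class="footer"><a href="/about/">About</a></div>
</body></html>
"""

tree = etree.HTML(HTML)

# //a matches every <a> element anywhere; @href takes its attribute
all_links = tree.xpath('//a/@href')
# [@class='novellist'] keeps only the matching <div>; / then walks child by child
titles = tree.xpath("//div[@class='novellist']/ul/li/a/text()")
hrefs = tree.xpath("//div[@class='novellist']/ul/li/a/@href")
# A path that starts with a single / is absolute, beginning at the root
top_divs = tree.xpath('/html/body/div')
print(all_links, titles, hrefs, len(top_divs))
```

Every `xpath()` call returns a list, even when only one node matches, so results are normally indexed or looped over.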

import gzip
import os
from urllib.parse import urljoin
from urllib.request import urlopen

from lxml import etree

BASE_URL = 'https://www.xbiquge.la'
OUT_DIR = '小说'  # output directory ("novels")


def fetch_html(url):
    """Download a page and return it as a parsed lxml tree."""
    with urlopen(url, timeout=3) as res:
        raw = res.read()
    try:
        # The server may send a gzip-compressed body
        raw = gzip.decompress(raw)
    except (OSError, EOFError):
        # Not gzip data: use the bytes as received
        pass
    # etree.HTML() is given the page source as a string
    return etree.HTML(raw.decode('utf-8'))


os.makedirs(OUT_DIR, exist_ok=True)

# Index page: every book title and the URL of its chapter list
index = fetch_html(BASE_URL + '/xiaoshuodaquan/')
book_names = index.xpath("//div[@class='novellist']/ul/li/a/text()")
book_urls = index.xpath("//div[@class='novellist']/ul/li/a/@href")

for book_name, book_url in zip(book_names, book_urls):
    book = fetch_html(urljoin(BASE_URL, book_url))
    chapter_names = book.xpath('//div[@id="list"]/dl/dd/a/text()')
    chapter_urls = book.xpath('//div[@id="list"]/dl/dd/a/@href')
    for chapter_name, chapter_url in zip(chapter_names, chapter_urls):
        # Chapter links are site-relative, e.g. /0/951/827334.html
        chapter = fetch_html(urljoin(BASE_URL, chapter_url))
        paragraphs = chapter.xpath('//div[@id="content"]/text()')
        text = chapter_name + '\n\n\n\n\n\n\n' + ''.join(paragraphs) + '\n'
        print(text)
        # Append each chapter to the book's own text file
        with open(os.path.join(OUT_DIR, book_name + '.txt'), 'a', encoding='utf-8') as f:
            f.write(text)

Homework

The crawl dies halfway through. Do you start over, or pick up where you left off?
How would you resume?
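
One possible approach to resuming, offered as a sketch and not as the intended answer, is to record each finished URL in a checkpoint file and skip recorded URLs on the next run. The function names, the `done.json` file, and the stand-in `fetch` callable are all hypothetical; the example does no network I/O.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of URLs already finished, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return set(json.load(f))


def save_done(path, done):
    """Write the finished URLs to disk so a restart can skip them."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sorted(done), f)


def crawl(urls, checkpoint, fetch):
    """Fetch every URL not yet recorded in the checkpoint file."""
    done = load_done(checkpoint)
    fetched = []
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        fetch(url)
        fetched.append(url)
        done.add(url)
        save_done(checkpoint, done)  # record progress after every page
    return fetched


checkpoint = os.path.join(tempfile.mkdtemp(), 'done.json')
urls = ['/0/951/1.html', '/0/951/2.html', '/0/951/3.html']

# Simulate an interrupted run by processing only the first two URLs
first = crawl(urls[:2], checkpoint, fetch=lambda url: None)
# The second run gets the full list and fetches only what is missing
second = crawl(urls, checkpoint, fetch=lambda url: None)
print(first, second)
```

Rewriting the whole file after every page is simple but slow for large crawls; appending one line per URL or using a small database are common alternatives.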

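51job
Crawl the job posting information:
Job title
Salary
Benefits
Responsibilities
Work location
Company name
Position highlights
Company's main business
Company website
Experience requirements
Education requirements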
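
Before writing the crawler, it can help to fix the shape of one record. The sketch below models the fields listed above as a dataclass and writes records out as CSV. The class name, the English field names, and the sample record are hypothetical; nothing here reflects the real structure of 51job pages.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class JobPosting:
    """One scraped posting; the field names are hypothetical."""
    title: str = ''
    salary: str = ''
    benefits: str = ''
    responsibilities: str = ''
    location: str = ''
    company: str = ''
    highlights: str = ''
    company_business: str = ''
    company_website: str = ''
    experience: str = ''
    education: str = ''


def to_csv(postings):
    """Serialize postings to CSV text with one header row."""
    buffer = io.StringIO()
    names = [f.name for f in fields(JobPosting)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for posting in postings:
        writer.writerow(asdict(posting))
    return buffer.getvalue()


# Made-up sample record, not real 51job data
sample = JobPosting(title='Crawler Engineer', salary='unknown', location='unknown')
csv_text = to_csv([sample])
print(csv_text)
```

Defaulting every field to an empty string lets the crawler store a posting even when some fields are missing from the page.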
