Web Scraping Study Notes

Latest recommended article published on 2024-07-25 18:55:58

P.A.I

Category: Python. Tags: python, regular expressions, xpath, html

Article link: https://blog.csdn.net/qq_45625654/article/details/114800757

Web Scraping Study Notes (for reference only)

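1. Regular expressions

1. Basic regex symbols

".": a dot matches any single character except a newline
"*": an asterisk matches the preceding subexpression zero or more times
"?": a question mark matches the preceding subexpression zero or one time
"\": the escape character; it makes the special character that follows match literally
"()": parentheses capture the enclosed content so that it can be extracted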
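The five symbols above can be tried out one at a time. The snippet below is a minimal sketch; the sample strings are made up for illustration and do not come from the original notes.

```python
import re

# "." matches any single character except a newline
dot = re.findall(r"a.b", "a-b a\nb axb")

# "*" matches the preceding subexpression zero or more times
star = re.findall(r"ab*", "a ab abbb")

# "?" matches the preceding subexpression zero or one time
question = re.findall(r"colou?r", "color colour")

# "\" escapes the dot so it only matches a literal "."
escaped = re.findall(r"3\.50", "3.50 3x50")

# "()" extracts only the part inside the parentheses
captured = re.findall(r"id=(.*?);", "id=42; price=3.50;")

print(dot)       # ['a-b', 'axb']
print(star)      # ['a', 'ab', 'abbb']
print(question)  # ['color', 'colour']
print(escaped)   # ['3.50']
print(captured)  # ['42']
```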

2. Using regular expressions in Python

findall: returns all matching strings as a list

re.findall(pattern, string, flags=0)
pattern: the regular expression
string: the text to search
flags: flags that enable special behavior

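import re
# findall

# The sample text is kept in Chinese because the pattern relies on its
# full-width punctuation ("：" and "，"). It reads: "My Weibo password is: 123456,
# QQ password is: 64815681, card password is: 46431316, git password is: 546646".
content = '我的微博密码是：123456，qq密码是：64815681，卡密码是：46431316，git密码是：546646'
passwords = re.findall("：(.*?)，", content)
# The last password has no trailing "，", so it is not matched
print(passwords)  # ['123456', '64815681', '46431316']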

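When the pattern contains several "(.*?)" groups, findall returns a list of tuples, with each captured value placed at the position of its group. The re.S flag makes "." match "\n" (newline) as well. In this example the result is the same with or without the flag, because the sample text contains no newline (tested on Python 3.9).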
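Both points can be checked with a short sketch. The sample strings here are invented for illustration: the first shows the tuples produced by two groups, the second shows a case where re.S does change the result.

```python
import re

# Two capture groups: findall returns one tuple per match
accounts = "user=alice, pwd=123; user=bob, pwd=456;"
pairs = re.findall(r"user=(.*?), pwd=(.*?);", accounts)
print(pairs)  # [('alice', '123'), ('bob', '456')]

# The text inside <p> spans two lines
html = "<p>line one\nline two</p>"

# Without re.S, "." stops at the newline, so nothing matches
without_flag = re.findall(r"<p>(.*?)</p>", html)

# With re.S, "." also matches the newline
with_flag = re.findall(r"<p>(.*?)</p>", html, re.S)

print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```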

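search: unlike findall, search only matches the first string that satisfies the pattern

re.search(pattern, string, flags=0)

The result is a match object, so call group() on it to read the match; if nothing matches, search returns None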

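password_search = re.search('密码是：(.*?)，', content)
print(password_search)           # the match object itself
print(password_search.group(0))  # the entire matched text
print(password_search.group(1))  # the content of the first group: 123456
# Make sure the punctuation in the pattern matches the text exactly:
# full-width Chinese characters and ASCII characters are different
# group() or group(0) returns the entire matched text
# group(1) returns the text captured by the first pair of parentheses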

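Summary
- ".*": greedy mode, matches the longest string that satisfies the pattern
- ".*?": non-greedy mode, matches the shortest string that satisfies the pattern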
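The difference is easiest to see on a string with two candidate end markers. This is a small sketch with a made-up HTML fragment:

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: ".*" runs to the last "</b>" it can find
greedy = re.findall(r"<b>(.*)</b>", text)

# Non-greedy: ".*?" stops at the first "</b>"
lazy = re.findall(r"<b>(.*?)</b>", text)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```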

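3. Extraction tips for regular expressions

There is no need to call compile
Grab the large block first, then the small pieces inside it
(.*?): what is inside the parentheses is extracted; what is outside only anchors the match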
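The "large block first, then small pieces" tip might look like the sketch below. The HTML fragment and class names are hypothetical; the point is that searching the whole page directly also picks up items from the wrong list.

```python
import re

html = """
<ul class="nav"><li>Home</li><li>About</li></ul>
<ul class="books"><li>Python</li><li>Regex</li></ul>
"""

# Matching <li> on the whole page mixes both lists together
everything = re.findall(r"<li>(.*?)</li>", html)

# Step 1 (large): isolate the block we care about
block = re.search(r'<ul class="books">(.*?)</ul>', html, re.S).group(1)

# Step 2 (small): extract the items inside that block only
titles = re.findall(r"<li>(.*?)</li>", block)

print(everything)  # ['Home', 'About', 'Python', 'Regex']
print(titles)      # ['Python', 'Regex']
```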

2. Building a simple web scraper

1. Third-party Python libraries

# Install a third-party library with pip
pip install <library-name>

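2. Getting page source with requests

GET

import requests                 # import the requests module
url = "<target URL>"            # target address
html = requests.get(url)        # request the page
html_bytes = html.content       # page source as bytes
html_str = html_bytes.decode()  # decode the bytes into a str (UTF-8 by default)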

import requests


url = 'https://www.baidu.com/'
html = requests.get(url)
html_bytes = html.content
html_str = html_bytes.decode()
print(html)
# print(html_bytes)
print(html_str)

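POST

A POST request is a little more involved than a GET request: it carries parameters, and the parameter format differs from page to page. It may be form data (a dict) or JSON.

import requests
url = '<target URL>'
data = {
    'key_1': 'value1',
    'key_2': 'value2'
}
html_content = requests.post(url, data=data).content.decode()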
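To make the "dict or JSON" distinction concrete, the sketch below builds both body formats with the standard library only, without sending anything. The field names are hypothetical. With requests, passing a dict as data= produces a form-encoded body like the first one, while passing it as json= produces a JSON body like the second.

```python
import json
from urllib.parse import urlencode

payload = {"name": "kingname", "password": "1234567"}  # hypothetical fields

# Form-encoded body, the shape a dict takes when sent as form data
form_body = urlencode(payload)

# JSON body, the shape the same dict takes when sent as JSON
json_body = json.dumps(payload)

print(form_body)  # name=kingname&password=1234567
print(json_body)  # {"name": "kingname", "password": "1234567"}
```

If the server answers with an error or an empty result, sending the other format is usually the first thing to try.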

Combining requests with regular expressions

import requests
import re

url = 'https://www.baidu.com/'
html = requests.get(url)
html_bytes = html.content
html_str = html_bytes.decode()
html_content = re.findall('<title>(.*?)<',html_str,re.S)
print(html)
print(html_str)
print(html_content)

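3. Multithreaded scraping

The multiprocessing library

from multiprocessing.dummy import Pool  # assumed import (missing in the original); thread-based Pool


def pool_num(num):
    return num * num


if __name__ == '__main__':
    pool = Pool(3)
    li_num = [x for x in range(0, 10)]
    result = pool.map(pool_num, li_num)
    print(result)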

# Multithreaded vs. single-threaded
# Single-threaded
import time
import requests


def query(url):
    requests.get(url)

if __name__ == '__main__':
    url = 'https://www.baidu.com/'
    start_time = time.time()
    for i in range(0, 100):
        query(url)
    end_time = time.time()
    print(end_time - start_time)

# 12.462252140045166


# Multithreaded
import time
import requests
from multiprocessing.dummy import Pool  # assumed import; thread-based Pool


def query(url):
    requests.get(url)

if __name__ == '__main__':
    url = 'https://www.baidu.com/'
    url_list = []
    pool = Pool(5)
    start_time = time.time()
    for i in range(0, 100):
        url_list.append(url)
    pool.map(query, url_list)
    end_time = time.time()
    print(end_time - start_time)


# 2.781147003173828

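With 5 threads, the run took roughly one fifth of the single-threaded time (about 2.78 s versus 12.46 s). As the workload keeps growing, the efficiency of multithreading may drop, possibly even below that of a single thread; at that point asynchronous I/O is the way to solve the problem.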
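The comparison can be reproduced without touching the network. In this sketch, fake_request is a hypothetical stand-in for requests.get that simply sleeps, and the URLs are placeholders, so the timings only illustrate the effect of overlapping waits.

```python
import time
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API


def fake_request(url):
    """Stand-in for requests.get(url): sleeps to imitate network latency."""
    time.sleep(0.05)
    return url


urls = ["https://example.com/page/%d" % i for i in range(20)]  # placeholder URLs

# One request after another
start = time.time()
serial_result = [fake_request(u) for u in urls]
serial_seconds = time.time() - start

# Five requests in flight at a time
start = time.time()
with Pool(5) as pool:
    pooled_result = pool.map(fake_request, urls)
pooled_seconds = time.time() - start

print("serial: %.2fs, 5 threads: %.2fs" % (serial_seconds, pooled_seconds))
```

pool.map returns results in the same order as the input list, so the two result lists are identical even though the threaded calls overlap.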

4. Common search algorithms for crawlers

Depth-first search

[Figure: depth-first traversal path (image unavailable)]

The traversal path is as shown in the figure: the crawler follows one chain of links to the end before backtracking.
Breadth-first search

[Figure: breadth-first traversal path (image unavailable)]

Breadth-first search visits every link on the current level before going one level deeper.
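Since the figures are missing, the two visiting orders can be sketched in code instead. The site map below is hypothetical: each key is a page and each value is the list of pages it links to.

```python
from collections import deque

# Hypothetical site map: page -> pages it links to
links = {
    "home": ["news", "shop"],
    "news": ["news/1", "news/2"],
    "shop": ["shop/a"],
    "news/1": [], "news/2": [], "shop/a": [],
}


def depth_first(start):
    """Follow each branch to the bottom before moving to the next one."""
    order, stack, seen = [], [start], {start}
    while stack:
        page = stack.pop()
        order.append(page)
        # Push in reverse so that the first link is explored first
        for nxt in reversed(links[page]):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return order


def breadth_first(start):
    """Visit every page on one level before going a level deeper."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


dfs_order = depth_first("home")
bfs_order = breadth_first("home")
print(dfs_order)  # ['home', 'news', 'news/1', 'news/2', 'shop', 'shop/a']
print(bfs_order)  # ['home', 'news', 'shop', 'news/1', 'news/2', 'shop/a']
```

The only structural difference is the container: a stack gives depth-first order, a queue gives breadth-first order.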

3. High-performance HTML parsing

1. XPath

A query language
```
pip install lxml
```

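import lxml.html
selector = lxml.html.fromstring('page source')
info = selector.xpath('XPath expression')
# Core idea: writing XPath is writing an address

# Get text content
//tag1[@attr='value']/tag2[@attr='value']/.../text()
# Get an attribute value
//tag1[@attr='value']/tag2[@attr='value']/.../@attrN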

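For tags that have no attributes, or tags whose attributes are all the same, the attribute predicate can be omitted
Special cases
1. Attribute values that start with the same string
```
<body>
    <div id = 'test-1'>need</div>
    <div id = 'test-2'>need</div>
    <div id = 'testresult'>need</div>
    <div id = 'qwer'>don't need</div>
</body>    
# Select the first three divs
# //div[starts-with(@id,'test')]/text()
```
2. Attribute values that contain the same string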
```
# //div[contains(@id,'test')]/text()
```
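3. XPath can be run again on an object returned by XPath; the same "large block first, then small pieces" idea applies
4. When a tag contains both child tags and text, use string(.) to get all of the text
5. The handiest trick of all: when you cannot work out an XPath, open Chrome, press F12, right-click the element, and choose Copy XPath to get an exact path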
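The special cases in the list above can be combined into one runnable sketch. It assumes lxml is installed; the HTML is a made-up variation of the earlier fragment with one extra div (id "mixed", hypothetical) that holds both text and a child tag.

```python
import lxml.html

source = """
<html><body>
    <div id="test-1">first</div>
    <div id="test-2">second</div>
    <div id="testresult">third</div>
    <div id="qwer">skip me</div>
    <div id="mixed">outer text <span>inner text</span> tail</div>
</body></html>
"""
selector = lxml.html.fromstring(source)

# Case 1: id starts with "test"
starts = selector.xpath("//div[starts-with(@id, 'test')]/text()")

# Case 2: id contains "result"
contains = selector.xpath("//div[contains(@id, 'result')]/text()")

# Case 3: run XPath again on a returned element
mixed = selector.xpath("//div[@id='mixed']")[0]
direct_text = mixed.xpath("text()")

# Case 4: string(.) joins the element's own text with its children's text
full_text = mixed.xpath("string(.)")

print(starts)       # ['first', 'second', 'third']
print(contains)     # ['third']
print(direct_text)  # ['outer text ', ' tail']
print(full_text)    # outer text inner text tail
```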

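2. Beautiful Soup 4 (BS4)

pip install beautifulsoup4

Parsing the source

from bs4 import BeautifulSoup
soup = BeautifulSoup('page source', parser)
soup = BeautifulSoup(source, 'lxml')  # or 'html.parser'

Finding content
1. find()
  - find(tag name or attribute filters) returns a Tag object
  - call .string directly on the returned object to get the text inside the tag
  - when several objects match, the first one is returned; when nothing matches, None is returned
2. find_all()
  - returns a list of Tag objects; when nothing matches, it returns an empty list rather than None
3. Parameters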
  - ```
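  find_all(name, attrs, recursive, text, **kwargs)
  name: HTML tag name
  attrs: dict of attributes
  recursive: whether to search all descendants (True) or only direct children (False)
  text: a string or regular expression matched against the text content (named string in newer versions)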
```
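A short sketch of find() and find_all(), assuming beautifulsoup4 is installed. It uses the built-in 'html.parser' so that lxml is not required, and the HTML and class names are invented for illustration.

```python
from bs4 import BeautifulSoup

source = """
<div class="book"><span class="title">Python</span></div>
<div class="book"><span class="title">XPath</span></div>
"""
soup = BeautifulSoup(source, "html.parser")

# find() returns the first matching Tag
first = soup.find("span", class_="title")

# find_all() returns every matching Tag; attrs takes a dict
all_titles = soup.find_all("span", attrs={"class": "title"})
title_texts = [tag.string for tag in all_titles]

# Nothing matches: find() gives None, find_all() gives an empty list
missing_one = soup.find("table")
missing_all = soup.find_all("table")

print(first.string)              # Python
print(title_texts)               # ['Python', 'XPath']
print(missing_one, missing_all)  # None []
```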

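4. Asynchronous loading

1. AJAX

AJAX updates data without refreshing the page. Typically HTML builds the page skeleton and AJAX transfers the data.

2. JSON

JSON is a formatted string whose structure resembles a combination of Python dicts and lists; it is a lightweight data-interchange format.

import json
person_json = json.dumps(person)  # person is the data to convert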

import json
person = {
    "basic_info":{
        'name':123,
        'age':456,
        'sex':789
    },
    "work_info": {
        'name': 123,
        'age': 456,
        'sex': 789
    },
}

new_json = json.dumps(person,indent=4)

print(new_json)
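The reverse direction matters just as much when scraping, because an AJAX response arrives as a JSON string and has to be turned back into Python objects. A minimal round-trip sketch with made-up sample data:

```python
import json

person = {"name": "kingname", "skills": ["python", "xpath"], "age": 18}  # sample data

# dict -> JSON string
text = json.dumps(person)

# JSON string -> dict
restored = json.loads(text)

# Non-ASCII characters are escaped by default; ensure_ascii=False keeps them readable
escaped = json.dumps({"name": "café"})
readable = json.dumps({"name": "café"}, ensure_ascii=False)

print(type(text).__name__)  # str
print(restored == person)   # True
print(escaped)              # {"name": "caf\u00e9"}
print(readable)             # {"name": "café"}
```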

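3. Asynchronous GET and POST requests

Open the browser's developer tools and check the request method in the Network tab
Fake AJAX: the data is hidden inside the page's JavaScript code instead of coming from a separate request. The fix is to match it in the page source with a regular expression and then parse it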

```
import requests
import re
import json

url = 'http://exercise.kingname.info/exercise_ajax_2.html'
html = requests.get(url).content.decode()
data = re.search("'(.*?)'", html, re.S).group(1)
data = json.loads(data)
print(data['code'])
```

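4. Asynchronous loading with chained requests

Requesting the first page returns the parameters for the second request, and the second request returns the parameters for the third. Each request can only be sent once the useful information in the previous response has been parsed.

5. Request headers (Headers)

Disguise your own headers so that the request looks like it comes from a browser

# Simulated login based on asynchronous loading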
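The chain can be sketched without a real site. Everything below is hypothetical: FAKE_SERVER holds canned responses, and fetch stands in for an HTTP call such as requests.get(...) followed by JSON parsing.

```python
import json

# Canned responses standing in for a real server: (path, parameter) -> body
FAKE_SERVER = {
    ("/step1", None): '{"token": "abc"}',
    ("/step2", "abc"): '{"secret": "xyz"}',
    ("/step3", "xyz"): '{"data": "final payload"}',
}


def fetch(path, param=None):
    """Stand-in for an HTTP request; returns the parsed JSON body."""
    return json.loads(FAKE_SERVER[(path, param)])


# Each response supplies the parameter that the next request needs
token = fetch("/step1")["token"]
secret = fetch("/step2", token)["secret"]
result = fetch("/step3", secret)["data"]

print(result)  # final payload
```

Skipping a step fails immediately with a KeyError here, which mirrors how a real server rejects a request that lacks the value issued by the previous response.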

import requests
import json

url = 'http://exercise.kingname.info/exercise_headers_backend'
data = {
'Host': 'exercise.kingname.info',
'Connection': 'keep-alive',
'Accept': '*/*',
'X-Requested-With': 'XMLHttpRequest',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
'anhao': 'kingname',
'Content-Type': 'application/json; charset=utf-8',
'Referer': 'http://exercise.kingname.info/exercise_headers.html',
'Accept-Encoding':'gzip, deflate',
'Accept-Language':'zh-CN,zh;q=0.9',
'Cookie': '__cfduid=d629ac17306baf9b8309880cf39bb1b341614666646',
}
html = requests.get(url, headers=data).content.decode()
print(json.loads(html))

html = requests.get(url, headers=headers_dict)
html = requests.post(url, json=payload, headers=headers_dict)

To be safe, it is best to set all of the header fields, even though some sites can be accessed after changing only the User-Agent.
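Typing every header into a dict by hand is error-prone. One possible shortcut is to paste the raw header text copied from the browser's Network tab and convert it. parse_raw_headers is a hypothetical helper written for this sketch, not part of requests, and the header text is abbreviated.

```python
raw_headers = """\
Accept: */*
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Referer: http://exercise.kingname.info/exercise_headers.html
"""


def parse_raw_headers(raw):
    """Turn 'Name: value' lines into a dict usable as a headers argument."""
    headers = {}
    for line in raw.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, since values such as URLs contain colons
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


headers = parse_raw_headers(raw_headers)
print(len(headers))        # 4
print(headers["Referer"])  # http://exercise.kingname.info/exercise_headers.html
```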

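6. Browser automation (skipped for now because the webdriver and Chrome versions did not match; to be filled in later)

Selenium
- pip install selenium

5. Distributed scraping with Scrapy

Basics (skipped)
Project file structure
- spiders folder: holds the spider files
- items.py: defines the data to scrape
- pipelines.py: handles the data after it has been scraped
- settings.py: the scraper's configuration

1. Scrapy and MongoDB

Step 1: define the data to scrape in items.py.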

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    size = scrapy.Field()
    comments = scrapy.Field()

class PersonInfoItem(scrapy.Item):
    
    name = scrapy.Field()      # name
    age = scrapy.Field()       # age
    salary = scrapy.Field()    # income
    phone = scrapy.Field()     # phone number

Step 2: in pipelines.py, write the code that connects to the database and the code that saves the data to it (this example uses MongoDB).

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

import pymongo
from scrapy.utils.project import get_project_settings


class BaiduPipeline:

    def __init__(self):
        settings = get_project_settings()
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db =client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        person_info = dict(item)
        self.post.insert_one(person_info)  # insert() was removed in PyMongo 4
        return item

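In settings.py, configure the parameters the database needs (MongoDB)

MONGODB_HOST = 'localhost'      # host
MONGODB_PORT = 27017            # port
MONGODB_DBNAME = 'Chapetr6'     # database name
MONGODB_DOCNAME = 'spider'      # collection

Step 3: yield the items from the spider file

The most painful part is importing the items package. If the import raises an error:
- set the folder that contains items.py as the root folder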

import scrapy
from baidu.items import PersonInfoItem

class BaiduExampleSpider(scrapy.Spider):
    name = 'baidu_example'
    start_urls = ['http://exercise.kingname.info/exercise_xpath_3.html']

    def parse(self, response):
        rows = response.xpath('/html/body/div/table/tbody/tr')
        for row in rows:
            item = PersonInfoItem()
            # Query relative to the current row, not the whole list of rows
            person_info = row.xpath('td/text()').extract()
            item['name'] = person_info[0]
            item['age'] = person_info[1]
            item['salary'] = person_info[2]
            item['phone'] = person_info[3]
            print(person_info[0], person_info[1], person_info[2], person_info[3])
            yield item    # hand the item over to the pipeline for processing

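2. Scrapy and Redis (test failed because of a problem with my computer)

pip install scrapy_redis
Overview:

Redis is used as the queue

Running:

cd into the folder where Redis is stored and run redis-server.exe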


The first thing to do is change the class that the spider inherits from
