DAY2 Cross-Modal Practice: The Jina Ecosystem

木臂阿童铁啊

Tags: jina, python, programming languages

Article link: https://blog.csdn.net/qq_39668239/article/details/128729312

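Jina is an open-source framework for processing unstructured data, with an emphasis on separating the front end from the back end and on asynchronous processing. It organizes processing logic with Executors and Flows and defines tasks through YAML configuration files. DocumentArray is a container for multi-level unstructured data and supports multiple modalities such as text, images, and video. This article shows how to use Jina to split, convert, and match text, and how to process and split images and videos.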

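Task 02: The Jina Ecosystem

What to learn

Time: Day 2-3
Goal: get familiar with the Jina ecosystem and its basic operations
Documentation: [jina.md]

Exercises

Successfully start a gRPC service
Import data of any modality into Jina's DocArray
Code exercise: code/jina_demo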

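Getting started with Jina

A YAML file specifies how a Flow runs. A Flow can be thought of as the coordinator of a series of tasks.

A Flow connects multiple Executors together. An Executor is Jina's algorithm unit: it wraps a piece of processing logic such as an encoder or a ranker.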

# toy.yml
jtype: Flow
with:
  port: 51000
  ## Set the remote procedure call (RPC) protocol
  protocol: grpc
executors:
## Executor classes defined in the Python file listed under py_modules
- uses: FooExecutor
  name: foo
  py_modules:
    - test.py
- uses: BarExecutor
  name: bar
  py_modules:
    - test.py

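The Executor is the core building block for separating the front end from the back end in Jina. The front end calls a back-end interface by specifying a URL; the back end receives and parses the request, then dispatches it to the method bound to the requested endpoint.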

# Create test.py in the same directory as the YAML file
# Import Document, DocumentArray, Executor, and the requests decorator
from jina import DocumentArray, Executor, requests, Document

# Define the FooExecutor and BarExecutor classes, with methods foo and bar.
# Each method receives a DocumentArray from the network request (no need to understand what that is yet)
# and appends a Document with the text "foo was here" or "bar was here" to it.
class FooExecutor(Executor):
    # @requests binds a method to a route, the same idea as a website routing /index and /login to different handlers.
    # Pass the on parameter to bind a specific route, e.g. @requests(on='/index'); without it, the method handles every endpoint.
    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        docs.append(Document(text='foo was here'))


class BarExecutor(Executor):
    @requests
    def bar(self, docs: DocumentArray, **kwargs):
        docs.append(Document(text='bar was here'))
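
The routing idea behind @requests can be illustrated without Jina at all. The sketch below is a toy model in plain Python, not Jina's implementation: the decorator name `route`, the `ToyExecutor` class, and its `dispatch` method are all hypothetical. It records which endpoint each method is bound to and falls back to a default handler when no specific route matches.

```python
# Toy model of endpoint routing; NOT Jina's implementation.
# `route`, `ToyExecutor` and `dispatch` are hypothetical names used for illustration.

DEFAULT = '__default__'


def route(func=None, *, on=DEFAULT):
    """Mark a method as the handler for endpoint `on` (or as the default handler)."""
    def mark(f):
        f._endpoint = on
        return f
    # Support both the bare form @route and the form @route(on='/index')
    return mark(func) if func is not None else mark


class ToyExecutor:
    def dispatch(self, endpoint, docs):
        """Call the handler bound to `endpoint`, else the default handler."""
        handlers = {}
        for name in dir(self):
            method = getattr(self, name)
            bound_to = getattr(method, '_endpoint', None)
            if bound_to is not None:
                handlers[bound_to] = method
        handler = handlers.get(endpoint, handlers.get(DEFAULT))
        if handler is not None:
            handler(docs)
        return docs


class Demo(ToyExecutor):
    @route(on='/index')
    def index(self, docs):
        docs.append('indexed')

    @route
    def fallback(self, docs):
        docs.append('default handler')


index_result = Demo().dispatch('/index', [])
login_result = Demo().dispatch('/login', [])
print(index_result)  # handled by the method bound to /index
print(login_result)  # no /login route, so the default handler runs
```

In this model a request to /login reaches the default handler, which mirrors how a method decorated with a bare @requests serves endpoints that have no dedicated binding.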

Start the gRPC service:

jina flow --uses toy.yml

Create a client.py file and run it:

# Import the client and Document from Jina
from jina import Client, Document

c = Client(host='grpc://0.0.0.0:51000')  # If the connection fails, try localhost instead
result = c.post('/', Document())  # Send an empty Document to the server
print(result.texts)

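Jina reads the jtype, port, and protocol from the YAML file, loads the Executors defined in py_modules in order, and listens on the port. client.py sends the Document in a POST request; the server parses the request and runs the Executors in the order they are defined, each appending to the incoming documents. The printed result.texts should therefore contain the empty text of the input Document followed by 'foo was here' and 'bar was here'.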
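
The sequence described above (read the configuration, load the named Executors in order, then pass each request through them) can be sketched in plain Python. This is a simplified model, not how Jina is implemented: the `REGISTRY` dictionary, the `build_pipeline` and `post` helpers, and the use of a plain list of strings in place of a DocumentArray are assumptions made for illustration.

```python
# Simplified model of a Flow built from a config; NOT Jina's actual code.
# REGISTRY, build_pipeline and post are hypothetical helpers.

class FooExecutor:
    def __call__(self, docs):
        docs.append('foo was here')


class BarExecutor:
    def __call__(self, docs):
        docs.append('bar was here')


# Maps the names used in the config to classes, much like py_modules does for the YAML file
REGISTRY = {'FooExecutor': FooExecutor, 'BarExecutor': BarExecutor}

# Same information as toy.yml, written as a dict to avoid a YAML dependency
config = {
    'jtype': 'Flow',
    'with': {'port': 51000, 'protocol': 'grpc'},
    'executors': [
        {'uses': 'FooExecutor', 'name': 'foo'},
        {'uses': 'BarExecutor', 'name': 'bar'},
    ],
}


def build_pipeline(cfg):
    """Instantiate the executors in the order they are listed."""
    return [REGISTRY[entry['uses']]() for entry in cfg['executors']]


def post(pipeline, docs):
    """Pass the documents through every executor in order."""
    for executor in pipeline:
        executor(docs)
    return docs


result = post(build_pipeline(config), [''])  # one empty "document"
print(result)  # ['', 'foo was here', 'bar was here']
```

Swapping the two entries under `executors` reverses the order of the appended strings, which is the point of keeping the wiring in configuration rather than in the Executor code.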

Besides communicating over gRPC, a Flow can also be defined and called in pure Python.

from jina import Document, DocumentArray, Flow, Executor, requests

class FooExecutor(Executor):
    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        docs.append(Document(text='foo was here'))


class BarExecutor(Executor):
    @requests
    def bar(self, docs: DocumentArray, **kwargs):
        docs.append(Document(text='bar was here'))


f = (
    Flow()
    .add(uses=FooExecutor, name='fooExecutor')
    .add(uses=BarExecutor, name='barExecutor')
)  # Build a Flow that chains the two Executors

with f:  # Start the Flow
    response = f.post(
        on='/'
    )  # Send a request to the Flow
    print(response.texts)

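Defining the Flow in YAML, separately from the Executor code, has the following advantages:

The data flow on the server is non-blocking and asynchronous: an idle Executor processes a new request immediately.
Load balancing is added automatically when needed, to maximize throughput.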
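
To make "non-blocking and asynchronous" concrete, here is a small asyncio sketch. It does not use Jina; `handle_request` and the sleep that stands in for model inference are hypothetical. Three requests are started together, and each one yields control while it waits, so the total time should be close to the duration of one request rather than the sum of all three.

```python
import asyncio
import time


async def handle_request(request_id, seconds=0.05):
    """Pretend to process a request; the sleep stands in for I/O or model inference."""
    await asyncio.sleep(seconds)  # yields control so other requests can run
    return f'request {request_id} done'


async def main():
    start = time.perf_counter()
    # Start all three requests at once instead of one after another
    results = await asyncio.gather(*(handle_request(i) for i in range(3)))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results)
print(f'elapsed: {elapsed:.2f}s')
```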

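DocArray

Definition

A toolkit for storing unstructured data and data structures; its defining features are hierarchy and nesting.

DocArray stores data in a hierarchy of levels. The first level can be an entire video, the second level the individual shots of that video, and the third level a single frame of a shot. Other modalities fit in as well: a fourth level might store passages of dialogue, a fifth level might store … A search can then go through the description of a frame or through the meaning of a line of dialogue, so search granularity, structural variety, and richness of results can all go well beyond traditional text retrieval.

Basic concepts

Document: a data structure for representing nested unstructured data, and the basic data type of DocArray. Text, images, video, audio, 3D, tables, and any nesting or combination of them can all be represented as a Document, which gives all kinds of data a uniform structure that is easy to process downstream.
DocumentArray: a container for efficiently accessing, processing, and understanding multiple Documents; it holds a list of Documents.
Dataclass: a high-level API for representing multimodal data intuitively.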
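
The layered structure described above can be pictured with a small stand-in class. The `Node` dataclass below is a hypothetical simplification, not DocArray's Document; it keeps only a text label and a list of chunks, which is enough to show a video containing shots containing frames, plus a recursive walk over the levels.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a Document: a label plus nested chunks."""
    text: str
    chunks: List['Node'] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, text) for the node and all of its nested chunks."""
    yield depth, node.text
    for chunk in node.chunks:
        yield from walk(chunk, depth + 1)


video = Node('video', [
    Node('shot 1', [Node('frame 1'), Node('frame 2')]),
    Node('shot 2', [Node('frame 3')]),
])

levels = list(walk(video))
for depth, text in levels:
    print('  ' * depth + text)  # indentation shows the nesting level
```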

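Text processing

Creating text

A Document can hold many types of data.

from jina import Document
d = Document(text='hello,world!')
print(d.text)
# If the text is large or comes from a URI, define the URI first and then load the text into the Document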
d = Document(uri='https://www.w3.org/History/19921103-hypertext/hypertext/README.html')
d.load_uri_to_text()
print(d.text)
# Multilingual text is supported
d = Document(text='👋	नमस्ते दुनिया!	你好世界！こんにちは世界！	Привет мир!')
print(d.text)

Splitting text

from jina import Document  # Import Document

d = Document(text='👋	नमस्ते दुनिया!	你好世界！こんにちは世界！	Привет мир!')
d.chunks.extend([Document(text=c) for c in d.text.split('!')])  # Split on the ASCII '!' and store each piece as a chunk
d.summary()
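
One detail worth checking in the split above: str.split('!') only matches the ASCII exclamation mark, while the Chinese and Japanese sentences end with the full-width '！'. The plain-Python sketch below (no Jina needed) compares that split with a regular expression that treats both characters as separators. The variable names are illustrative, and empty pieces are filtered out, which the original snippet does not do.

```python
import re

text = '👋\tनमस्ते दुनिया!\t你好世界！こんにちは世界！\tПривет мир!'

# ASCII '!' only: the two sentences ending in the full-width '！' stay glued together
ascii_parts = [p.strip() for p in text.split('!') if p.strip()]

# Treat both the ASCII and the full-width exclamation mark as separators
both_parts = [p.strip() for p in re.split('[!！]', text) if p.strip()]

print(len(ascii_parts), ascii_parts)
print(len(both_parts), both_parts)
```

With the ASCII-only split the text yields two non-empty pieces; with both separators it yields four, one per greeting.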

Converting text to vectors

from jina import DocumentArray, Document  # Import DocumentArray and Document

# A DocumentArray behaves like a list that holds Documents
da = DocumentArray([Document(text='hello world'),
                    Document(text='goodbye world'),
                    Document(text='hello goodbye')])
# Build the vocabulary
vocab = da.get_vocabulary()  # Output: {'hello': 2, 'world': 3, 'goodbye': 4}

# Convert each text to an ndarray of vocabulary indices
for d in da:
    d.convert_text_to_tensor(vocab, max_length=10)  # max_length is the length of the resulting tensor; it is optional
    print(d.tensor)

# Convert each ndarray back to text
for d in da:
    d.convert_tensor_to_text(vocab)
    print(d.text)
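
What get_vocabulary and convert_text_to_tensor do can be approximated in a few lines of plain Python. The sketch below is an approximation under stated assumptions, not DocArray's code: it assumes index 0 is reserved for padding and index 1 for unknown words (consistent with the vocabulary above starting at 2), splits on whitespace only, and pads on the left. The helper names `build_vocab`, `text_to_ids`, and `ids_to_text` are hypothetical.

```python
PAD, UNKNOWN = 0, 1  # assumed reserved indices


def build_vocab(texts):
    """Assign each new word the next free index, starting at 2."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 2)
    return vocab


def text_to_ids(text, vocab, max_length=10):
    """Map words to indices, then left-pad (or truncate) to max_length."""
    ids = [vocab.get(word, UNKNOWN) for word in text.lower().split()]
    ids = ids[-max_length:]
    return [PAD] * (max_length - len(ids)) + ids


def ids_to_text(ids, vocab):
    """Invert the mapping, skipping padding."""
    words = {index: word for word, index in vocab.items()}
    return ' '.join(words.get(i, '<unk>') for i in ids if i != PAD)


texts = ['hello world', 'goodbye world', 'hello goodbye']
vocab = build_vocab(texts)
encoded = [text_to_ids(t, vocab) for t in texts]
decoded = [ids_to_text(ids, vocab) for ids in encoded]
print(vocab)       # {'hello': 2, 'world': 3, 'goodbye': 4}
print(encoded[0])  # [0, 0, 0, 0, 0, 0, 0, 0, 2, 3]
print(decoded)     # the original three texts
```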

Text matching

The Document class is quite capable: load_uri_to_text loads the text behind a URI directly.

from jina import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
# Split the text into lines; strip() removes leading and trailing whitespace, and empty lines are skipped
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(lambda d: d.embed_feature_hashing())  # Embed every line with feature hashing
q = (
    Document(text='she entered the room')
    .embed_feature_hashing()
    .match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)
)
print(q)
print(q.matches[:, ('text', 'scores__jaccard')])
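
The matching step combines two ideas: feature hashing, which maps each word to one of a fixed number of buckets, and the Jaccard distance, which compares which buckets two texts occupy. The sketch below shows both in plain Python. It is a simplified illustration, not DocArray's embed_feature_hashing: the bucket count of 64, the MD5-based hash, the candidate sentences, and the helper names are all assumptions.

```python
import hashlib


def hashed_buckets(text, n_dim=64):
    """Return the set of bucket indices that the words of `text` hash into."""
    buckets = set()
    for word in text.lower().split():
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        buckets.add(int(digest, 16) % n_dim)  # stable across runs, unlike the built-in hash()
    return buckets


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; 0 means the two bucket sets are identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Made-up candidate sentences, for illustration only
candidates = [
    'she entered the room',
    'he left the house early',
    'the weather was fine',
]
query = hashed_buckets('she entered the room')
ranked = sorted(candidates, key=lambda c: jaccard_distance(query, hashed_buckets(c)))
for c in ranked:
    print(round(jaccard_distance(query, hashed_buckets(c)), 3), c)
```

A smaller distance means a closer match, so the candidate identical to the query comes first with a distance of 0.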

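Image processing

from jina import Document

# Document.load_uri_to_image_tensor() reads an image and converts it to a tensor
# Image processing pipeline
d = (
    Document(uri='apple.png')
    .load_uri_to_image_tensor()
    .set_image_tensor_shape(shape=(224, 224))  # Resize to 224 x 224
    .set_image_tensor_normalization()  # Normalize
    .set_image_tensor_channel_axis(-1, 0)  # Move the channel axis from last to first
)
# Save the tensor back to an image file
# channel_axis=0 says the tensor is channel-first, i.e. (3, 224, 224) rather than (224, 224, 3)
d.save_image_tensor_to_file('apple-proc.png', channel_axis=0)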
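
The normalization and channel-axis steps are simple array operations. The sketch below reproduces them on a tiny 1 x 2 image using nested lists, so it runs without NumPy or Jina. The mean and standard deviation are the widely used ImageNet statistics, which I assume match the defaults of set_image_tensor_normalization; the helper names are hypothetical.

```python
MEAN = (0.485, 0.456, 0.406)  # assumed per-channel mean (ImageNet statistics)
STD = (0.229, 0.224, 0.225)   # assumed per-channel standard deviation


def normalize(image):
    """Scale uint8 values to [0, 1], then standardize each channel. Input is H x W x C."""
    return [
        [[(value / 255 - MEAN[c]) / STD[c] for c, value in enumerate(pixel)]
         for pixel in row]
        for row in image
    ]


def channel_last_to_first(image):
    """Rearrange H x W x C into C x H x W."""
    n_channels = len(image[0][0])
    return [[[pixel[c] for pixel in row] for row in image] for c in range(n_channels)]


def shape(nested):
    """Shape of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


image = [[[255, 0, 128], [0, 255, 64]]]  # 1 x 2 image with 3 channels
processed = channel_last_to_first(normalize(image))
print(shape(image), '->', shape(processed))  # (1, 2, 3) -> (3, 1, 2)
```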

Many real-world images are complex and contain many different elements, which makes them hard to search as a whole.

Splitting images

# Slide a (64, 64) window over the image with a stride of (10, 10);
# as_chunks=True adds each window's tensor to the Document's chunks
d.load_uri_to_image_tensor()
d.convert_image_tensor_to_sliding_windows(window_shape=(64, 64), strides=(10, 10), as_chunks=True)
# Plot all chunks as one sprite image and save it
d.chunks.plot_image_sprites('test_simpsons-chunks-stride-10.png')
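
How many chunks a sliding window produces follows from simple arithmetic: along each axis, a window of size w moving with stride s over a length n fits floor((n - w) / s) + 1 times. The sketch below computes this for a hypothetical 224 x 224 image (the real image size is not given in the text) and also lists the top-left corners of the windows. The helper names are illustrative.

```python
def window_count(length, window, stride):
    """Number of window positions along one axis (no padding)."""
    if length < window:
        return 0
    return (length - window) // stride + 1


def window_origins(height, width, window_shape, strides):
    """Top-left (row, col) corner of every sliding window."""
    rows = range(0, height - window_shape[0] + 1, strides[0])
    cols = range(0, width - window_shape[1] + 1, strides[1])
    return [(r, c) for r in rows for c in cols]


height, width = 224, 224  # assumed image size, for illustration only
per_axis = window_count(height, 64, 10)
origins = window_origins(height, width, (64, 64), (10, 10))
print(per_axis, len(origins))  # 17 windows per axis, 289 in total
```

A smaller stride increases overlap between neighboring windows and grows the number of chunks quadratically, which is worth keeping in mind before indexing them all.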

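Video processing

Unlike an image, the tensor read from a video is a four-dimensional array. The first dimension is time, which can also be read as the video frame id.

For example, suppose the tensor has shape (600, 176, 320, 3) and the video is 10 s long. Each frame is then an image of shape (176, 320, 3), and the frame rate is 600 / 10 = 60 fps.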
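
The arithmetic in this example can be written out directly. The helper names below are hypothetical; the shape and duration are the ones from the example above.

```python
def frame_rate(n_frames, duration_seconds):
    """Frames per second for a clip of known length."""
    return n_frames / duration_seconds


def timestamp(frame_index, fps):
    """Time in seconds at which a given frame is shown."""
    return frame_index / fps


video_shape = (600, 176, 320, 3)  # (frames, height, width, channels)
n_frames, *frame_shape = video_shape
fps = frame_rate(n_frames, 10)

print(tuple(frame_shape))   # (176, 320, 3), the shape of a single frame
print(fps)                  # 60.0
print(timestamp(150, fps))  # 2.5
```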

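from jina import Document

# Load the video
d = Document(uri='toy.mp4')
# To extract only key frames, pass only_keyframes=True (my guess was that frames are picked by low correlation;
# more likely these are simply the key frames stored by the video codec)
d.load_uri_to_video_tensor()
for b in d.tensor:
    # Append each frame tensor to the chunks
    d.chunks.append(Document(tensor=b))
# Plot the chunks as a sprite image
d.chunks.plot_image_sprites('mov.png')
# Save the tensor as a video
d = (
    Document(uri='toy.mp4')
    .load_uri_to_video_tensor()  # Read the video
    .save_video_tensor_to_file('60fps.mp4', 60)  # Save it as a 60 fps video
)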

