Notes on the ggml Library Documentation

Background

I have recently been studying the PowerInfer codebase. Since it is a modified version of llama.cpp, this essentially means reading the llama.cpp code; and llama.cpp, in turn, relies on the ggml library for building and executing its computation graphs.

For both llama.cpp and the ggml library, there is essentially no official reference beyond the docs directory in each repository and the comments in the code, which is frustrating.

By chance I opened ggml.h and found that it starts with a block of comments that makes a very clear, concise, and useful reference. I am therefore posting the translated documentation together with the English original, for anyone studying llama.cpp and the ggml library.

Origin: https://github.com/ggerganov/ggml/blob/master/include/ggml.h

Translated Documentation

GGML Tensor Library

This documentation is still a work in progress. If you wish some specific topics to be covered, feel free to drop a comment:

https://github.com/ggerganov/whisper.cpp/issues/40

Overview

This library implements:

  • a set of tensor operations
  • automatic differentiation
  • basic optimization algorithms

The aim of this library is to provide a minimalistic approach for various machine learning tasks. This includes, but is not limited to, the following:

  • linear regression
  • support vector machines
  • neural networks

The library allows the user to define a certain function using the available tensor operations. This function definition is represented internally via a computation graph. Each tensor operation in the function definition corresponds to a node in the graph. Having the computation graph defined, the user can choose to compute the function's value and/or its gradient with respect to the input variables. Optionally, the function can be optimized using one of the available optimization algorithms.

For example, here we define the function: f(x) = a*x^2 + b

{
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
    };

    // memory allocation happens here
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

    ggml_set_param(ctx, x); // x is an input variable

    struct ggml_tensor * a  = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b  = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
    struct ggml_tensor * f  = ggml_add(ctx, ggml_mul(ctx, a, x2), b);

    ...
}

Notice that the function definition above does not involve any actual computation. The computation is performed only when the user explicitly requests it. For example, to compute the function's value at x = 2.0:

{
    ...

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    // translator's note: the computation graph is built by topological sorting, walking
    // backwards from the final node, which is why the last result tensor is passed in here;
    // this relies on the assumption that an operator may have several inputs but produces
    // exactly one output
    ggml_build_forward_expand(gf, f);

    // set the input variable and parameter values
    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute_with_ctx(ctx, gf, n_threads);

    printf("f = %f\n", ggml_get_f32_1d(f, 0));

    ...
}

The actual computation is performed in the ggml_graph_compute() function.

The ggml_new_tensor_...() functions create new tensors. They are allocated in the memory buffer provided to the ggml_init() function. You have to be careful not to exceed the memory buffer size. Therefore, you have to know in advance how much memory you need for your computation. Alternatively, you can allocate a large enough buffer and, after defining the computation graph, call the ggml_used_mem() function to find out how much memory was actually needed.
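For instance, a rough sketch of this measure-then-size pattern might look like the following (the 128 MB buffer size is an arbitrary assumption for illustration, not a ggml recommendation):

{
    // over-allocate, build the graph, then measure how much was actually used
    struct ggml_init_params params = {
        .mem_size   = 128*1024*1024, // assumption: "large enough" for this graph
        .mem_buffer = NULL,
    };

    struct ggml_context * ctx = ggml_init(params);

    // ... define tensors and build the computation graph here ...

    // memory actually consumed so far by tensors and graph metadata in this context
    printf("needed: %zu bytes\n", ggml_used_mem(ctx));

    ggml_free(ctx);
}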

The ggml_set_param() function marks a tensor as an input variable. This is used by the automatic differentiation and optimization algorithms.

The described approach allows a function graph to be defined once and its forward or backward graphs to be computed multiple times. All computations will use the same memory buffer allocated in the ggml_init() function. This way the user can avoid the overhead of memory allocation at runtime.
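A minimal sketch of this reuse pattern, assuming the context ctx, the graph gf, the tensors x and f, and n_threads from the example above:

{
    // evaluate the same graph at several inputs without any new allocations
    for (int i = 0; i < 3; i++) {
        ggml_set_f32(x, (float) i); // new input value
        ggml_graph_compute_with_ctx(ctx, gf, n_threads);
        printf("f(%d) = %f\n", i, ggml_get_f32_1d(f, 0));
    }
}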

The library supports multi-dimensional tensors - up to 4 dimensions. The FP16 and FP32 data types are first-class citizens, but in theory the library can be extended to support FP8 and integer data types.
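For example, a sketch of allocating a 4-dimensional FP16 tensor, assuming an initialized ggml_context * ctx as in the examples above (the dimension sizes are arbitrary):

{
    // 4 dimensions: ne[0] x ne[1] x ne[2] x ne[3] = 8 x 4 x 2 x 3
    struct ggml_tensor * t = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 8, 4, 2, 3);

    // total element count and the size of the data buffer in bytes
    printf("elements: %lld, bytes: %zu\n", (long long) ggml_nelements(t), ggml_nbytes(t));
}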

Each tensor operation produces a new tensor. Initially the library was envisioned to support only unary and binary operations. Most of the available operations fall into one of these two categories. With time, it became clear that the library needs to support more complex operations. The way to support these operations is not clear yet, but a few examples are demonstrated in the following operations (a short sketch using one of them follows the list):

  • ggml_permute()
  • ggml_conv_1d_1s()
  • ggml_conv_1d_2s()
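As a sketch of one of these more complex operations, ggml_permute() reorders the axes of a tensor by rewriting its ne/nb metadata only; the data itself is not copied. This assumes an initialized ggml_context * ctx, and the shapes below are arbitrary:

{
    // a is a 10 x 20 matrix (ne = [10, 20, 1, 1])
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 10, 20);

    // swap the first two axes; the result is a view of the same data
    struct ggml_tensor * p = ggml_permute(ctx, a, 1, 0, 2, 3);

    // the view is no longer contiguous in memory
    assert(!ggml_is_contiguous(p));
}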

For each tensor operator, the library implements a forward and a backward computation function. The forward function computes the output tensor values given the input tensor values. The backward function computes the adjoints of the input tensors given the adjoint of the output tensor. For a detailed explanation of what this means, take a calculus class, or watch the following video:

What is Automatic Differentiation?
https://www.youtube.com/watch?v=wG_nF1awSSY

Tensor data (struct ggml_tensor)

The tensors are stored in memory via the ggml_tensor struct. The structure provides information about the size of the tensor, the data type, and the memory buffer where the tensor data is stored. Additionally, it contains pointers to the "source" tensors, i.e. the tensors that were used to compute the current tensor. For example:

{
struct ggml_tensor * c = ggml_add(ctx, a, b);
// since tensor c is the result of a + b, its "source" tensors are a and b, respectively
assert(c->src[0] == a); 
assert(c->src[1] == b);
}

The multi-dimensional tensors are stored in row-major order. The ggml_tensor struct contains fields for the number of elements in each dimension ("ne") as well as the number of bytes ("nb", a.k.a. stride). This allows tensors that are not contiguous in memory to be stored, which is useful for operations such as transposition and permutation. All tensor operations have to take the stride into account and must not assume that the tensor is contiguous in memory.
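As an illustrative sketch, for a contiguous 2-dimensional FP32 tensor the strides follow directly from the element size and the dimensions, while a transposed view of it reuses the same data with the strides swapped (again assuming an initialized ggml_context * ctx):

{
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3); // ne = [4, 3]

    // contiguous layout: nb[0] = element size, nb[1] = nb[0]*ne[0]
    assert(a->nb[0] == ggml_element_size(a)); // 4 bytes for F32
    assert(a->nb[1] == a->nb[0]*a->ne[0]);    // 16 bytes per row

    // the transposed view shares the data but swaps ne/nb, so it is not contiguous
    struct ggml_tensor * at = ggml_transpose(ctx, a);
    assert(at->ne[0] == a->ne[1] && at->nb[0] == a->nb[1]);
}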

The data of the tensor is accessed via the "data" pointer. For example:

{
const int nx = 2;
const int ny = 3;

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nx, ny);

for (int y = 0; y < ny; y++) {
    for (int x = 0; x < nx; x++) {
       *(float *) ((char *) a->data + y*a->nb[1] + x*a->nb[0]) = x + y;
    }
}

...
}

Alternatively, there are helper functions, such as ggml_get_f32_1d() and ggml_set_f32_1d(), that can be used.
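For instance, the loop above could be sketched with the 1-d helpers instead of raw pointer arithmetic, flattening the (x, y) index since the freshly created tensor a is contiguous:

{
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            ggml_set_f32_1d(a, y*nx + x, (float)(x + y));
        }
    }

    const float v = ggml_get_f32_1d(a, 1); // reads element (x = 1, y = 0)
}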

English Original

GGML Tensor Library

This documentation is still a work in progress. If you wish some specific topics to be covered, feel free to drop a comment:

https://github.com/ggerganov/whisper.cpp/issues/40

Overview

This library implements:

  • a set of tensor operations
  • automatic differentiation
  • basic optimization algorithms

The aim of this library is to provide a minimalistic approach for various machine learning tasks. This includes, but is not limited to, the following:

  • linear regression
  • support vector machines
  • neural networks

The library allows the user to define a certain function using the available tensor operations. This function definition is represented internally via a computation graph. Each tensor operation in the function definition corresponds to a node in the graph. Having the computation graph defined, the user can choose to compute the function’s value and/or its gradient with respect to the input variables. Optionally, the function can be optimized using one of the available optimization algorithms.

For example, here we define the function: f(x) = a*x^2 + b

{
struct ggml_init_params params = {
    .mem_size = 16*1024*1024,
    .mem_buffer = NULL,
};

// memory allocation happens here
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

ggml_set_param(ctx, x); // x is an input variable

struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
struct ggml_tensor * f = ggml_add(ctx, ggml_mul(ctx, a, x2), b);

...
}

Notice that the function definition above does not involve any actual computation. The computation is performed only when the user explicitly requests it. For example, to compute the function’s value at x = 2.0:

{
...

struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, f);

// set the input variable and parameter values
ggml_set_f32(x, 2.0f);
ggml_set_f32(a, 3.0f);
ggml_set_f32(b, 4.0f);

ggml_graph_compute_with_ctx(ctx, &gf, n_threads);

printf("f = %f\n", ggml_get_f32_1d(f, 0));

...
}

The actual computation is performed in the ggml_graph_compute() function.

The ggml_new_tensor_…() functions create new tensors. They are allocated in the memory buffer provided to the ggml_init() function. You have to be careful not to exceed the memory buffer size. Therefore, you have to know in advance how much memory you need for your computation. Alternatively, you can allocate a large enough memory and after defining the computation graph, call the ggml_used_mem() function to find out how much memory was actually needed.

The ggml_set_param() function marks a tensor as an input variable. This is used by the automatic differentiation and optimization algorithms.

The described approach allows to define the function graph once and then compute its forward or backward graphs multiple times. All computations will use the same memory buffer allocated in the ggml_init() function. This way the user can avoid the memory allocation overhead at runtime.

The library supports multi-dimensional tensors - up to 4 dimensions. The FP16 and FP32 data types are first class citizens, but in theory the library can be extended to support FP8 and integer data types.

Each tensor operation produces a new tensor. Initially the library was envisioned to support only the use of unary and binary operations. Most of the available operations fall into one of these two categories. With time, it became clear that the library needs to support more complex operations. The way to support these operations is not clear yet, but a few examples are demonstrated in the following operations:

  • ggml_permute()
  • ggml_conv_1d_1s()
  • ggml_conv_1d_2s()

For each tensor operator, the library implements a forward and backward computation function. The forward function computes the output tensor value given the input tensor values. The backward function computes the adjoint of the input tensors given the adjoint of the output tensor. For a detailed explanation of what this means, take a calculus class, or watch the following video:

What is Automatic Differentiation?
https://www.youtube.com/watch?v=wG_nF1awSSY

Tensor data (struct ggml_tensor)

The tensors are stored in memory via the ggml_tensor struct. The structure provides information about the size of the tensor, the data type, and the memory buffer where the tensor data is stored. Additionally, it contains pointers to the “source” tensors - i.e. the tensors that were used to compute the current tensor. For example:

{
struct ggml_tensor * c = ggml_add(ctx, a, b);

assert(c->src[0] == a);
assert(c->src[1] == b);
}

The multi-dimensional tensors are stored in row-major order. The ggml_tensor struct contains fields for the number of elements in each dimension (“ne”) as well as the number of bytes (“nb”, a.k.a. stride). This allows to store tensors that are not contiguous in memory, which is useful for operations such as transposition and permutation. All tensor operations have to take the stride into account and not assume that the tensor is contiguous in memory.

The data of the tensor is accessed via the “data” pointer. For example:

{
const int nx = 2;
const int ny = 3;

struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nx, ny);

for (int y = 0; y < ny; y++) {
    for (int x = 0; x < nx; x++) {
        *(float *) ((char *) a->data + y*a->nb[1] + x*a->nb[0]) = x + y;
    }
}

...
}

Alternatively, there are helper functions, such as ggml_get_f32_1d() and ggml_set_f32_1d() that can be used.
