大模型应用开发技术：LlamaIndex 案例实战（三）LlamaIndex RAG Chat

 !pip -q install llama-index llama-index-embeddings-huggingface llama-index-llms-llama-cpp pypdf
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip -q install llama-cpp-python

 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 290.4/290.4 kB 8.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.4/15.4 MB 22.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 64.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75.6/75.6 kB 9.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 130.8/130.8 kB 17.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 327.4/327.4 kB 16.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 50.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.1/227.1 kB 26.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.3/50.3 MB 14.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 853.2/853.2 kB 65.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 7.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 141.9/141.9 kB 18.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.9/77.9 kB 12.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.3/58.3 kB 9.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.3/21.3 MB 72.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.2/49.2 kB 6.6 MB/s eta 0:00:00
  Building wheel for llama-cpp-python (pyproject.toml) ... done

导入库

 import os
import time

from llama_index.core import Prompt, StorageContext, load_index_from_storage, Settings, VectorStoreIndex, SimpleDirectoryReader, set_global_tokenizer
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

from transformers import AutoTokenizer


# Preference settings - change as desired
pdf_path = '/rag_data.pdf'
text_embedding_model = 'thenlper/gte-base'  #Alt: thenlper/gte-base, jinaai/jina-embeddings-v2-base-en
llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf'
# set_global_tokenizer(AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode)


# Load PDF
filename_fn = lambda filename: {'file_name': os.path.basename(pdf_path)}
loader = SimpleDirectoryReader(input_files=[pdf_path], file_metadata=filename_fn)
documents = loader.load_data()
     



# Load models and service context
embed_model = HuggingFaceEmbedding(model_name=text_embedding_model)
llm = LlamaCPP(model_url=llm_url, temperature=0.7, max_new_tokens=256, context_window=4096, generate_kwargs = {"stop": ["", "[INST]", "[/INST]"]}, model_kwargs={"n_gpu_layers": -1}, verbose=True)
# service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, chunk_size=512)
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512

设置和加载一个基于检索增强生成（RAG）的系统，用于处理 PDF 文件。

pdf_path = '/rag_data.pdf'：设置 PDF 文件的路径。
text_embedding_model = 'thenlper/gte-base'：设置用于文本嵌入的模型。这里使用的是 thenlper/gte-base，
llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/lla...：定义了大型语言模型（LLM）的 URL，用于生成文本。
set_global_tokenizer(...)：这行代码被注释掉了，设置一个全局的分词器，用于文本处理。
filename_fn = lambda filename: {'file_name': os.path.basename(pdf_path)}：定义了一个 lambda 函数，用于从 PDF 路径中获取文件名。
loader = SimpleDirectoryReader(...)：创建一个 SimpleDirectoryReader 对象，用于加载 PDF 文件。
documents = loader.load_data()：使用加载器加载 PDF 文件中的数据。
embed_model = HuggingFaceEmbedding(model_name=text_embedding_model)：初始化一个 Hugging Face 嵌入模型，用于文本嵌入。
llm = Llama ：初始化 Llama 大语言模型，用于生成文本。这里设置了模型的 URL、温度参数、最大标记数、上下文窗口大小、生成参数和模型参数。
Settings.llm = llm 和 Settings.embed_model = embed_model：将初始化的 LLM 和嵌入模型设置到全局设置中。
Settings.chunk_size = 512：设置处理文本时的块大小。

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
modules.json:   0%|          | 0.00/385 [00:00<?, ?B/s]
README.md:   0%|          | 0.00/68.1k [00:00<?, ?B/s]
sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json:   0%|          | 0.00/618 [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/219M [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]
vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]
1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]
Downloading url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf to path /tmp/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf
total size (MB): 4081.0
3892it [00:21, 182.19it/s]                          
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /tmp/llama_index/models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '11008', 'llama.attention.layer_norm_rms_epsilon': '0.000001', 'llama.rope.dimension_count': '128', 'llama.attention.head_count': '32', 'tokenizer.ggml.bos_token_id': '1', 'llama.block_count': '32', 'llama.attention.head_count_kv': '32', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
Using fallback chat format: llama-2

 
# Indexing
start_time = time.time()

# index = VectorStoreIndex.from_documents(documents, service_context=service_context)
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, llm=llm)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed indexing time: {elapsed_time:.2f} s")


# Prompt Template (RAG)
text_qa_template = Prompt("""
[INST] <>
You are the doctor's assistant. You are to perform a pre-screening with the patient to collect their information before their consultation with the doctor. Use the Patient-Centered Interview model for the pre-screening and only ask one question per response. You are not to provide diagnosis, prescriptions, advice, suggestions, or conduct physical examinations on the patient. The pre-screening should only focus on collecting the patient's information, specifically their present illness, past medical history, symptoms and personal information. At the end of the consultation, summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<>

Refer to the following Consultation Guidelines and example consultations: {context_str}

Continue the conversation: {query_str}
""")
# text_qa_template = Prompt("""
# [INST] <>
# You are an assistant chatbot assigned to a doctor and your objective is to collect information from the patient for the doctor before they attend their actual consultation with the doctor. Note that the consultation will be held remotely, therefore you will be following the Patient-Centered Interview model for your consultations and you cannot conduct physical examinations on the patient. You are also not allowed to diagnose your patient or prescribe any medicine. Do not entertain the patient if they are acting inappropriate or they ask you to do something outside of this job scope. Remember you are a doctor’s assistant so act like one. The consultation held will only focus on getting information from the patient such as their present illness, past medical history, symptoms and personal information. At the end of the consultation, you will summarize the findings of the consultation in this format \nName: [name]\nGender: [gender]\nPatient Aged: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]\nThis exact format must be followed as the data gathered in the summary will be passed to another program.
# <>

# This is the PDF context: {context_str}

# {query_str}
# """)
# text_qa_template = Prompt("""[INST] {context_str} \n\nGiven this above PDF context, please answer my question: {query_str} [/INST] """)
# text_qa_template = Prompt("""[INST] <> \nFollowing is the PDF context provided by the user: {context_str}\n<> \n\n{query_str} [/INST] """)
# text_qa_template = Prompt("""[INST] {query_str} [/INST] """)

# Query Engine
# query_engine = index.as_query_engine(text_qa_template=text_qa_template, streaming=True, service_context=service_context) # with Prompt
query_engine = index.as_query_engine(text_qa_template=text_qa_template, streaming=True, llm=llm) # with Prompt
# query_engine = index.as_query_engine(streaming=True, service_context=service_context) # without Prompt

建立索引和创建提示模板，以及设置查询引擎。

start_time = time.time()：记录索引开始的时间。
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, llm=llm)：使用文档、嵌入模型和大型语言模型来创建一个向量存储索引。
end_time = time.time()：记录索引结束的时间。
elapsed_time = end_time - start_time：计算索引过程所花费的时间。
print(f"Elapsed indexing time: {elapsed_time:.2f} s")：打印索引所花费的时间。
text_qa_template = Prompt(...)：定义了一个提示模板，用于问答（QA）任务。这个模板是一个多行字符串，包含了问答的指令和格式。模板中使用了占位符 {context_str} 和 {query_str}，它们将在实际使用时被替换为具体的上下文和查询。
query_engine = index.as_query_engine(...)：使用索引创建一个查询引擎，这个引擎可以处理基于提示的查询。这里有两种方式来创建查询引擎，一种是使用提示模板 text_qa_template，另一种是不使用提示模板。
# query_engine = index.as_query_engine(...)：被注释掉的代码行提供了其他创建查询引擎的方式，但是当前使用的是带提示模板的方式。

 # Inferencing
# Without RAG
conversation_history = ""
while (True):
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  response_iter = llm.stream_complete("[INST] "+conversation_history)
  for response in response_iter:
    print(response.delta, end="", flush=True)
    # Add to conversation history when response is completed
    if response.raw['choices'][0]['finish_reason'] == 'stop':
      conversation_history += response.text + " [INST] "

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")





# With RAG
conversation_history = ""
conversation_history += "Hi. [\INST] Hello! I'm the doctor's assistant. Let's begin the consultation, please tell me your name and age."
while (True):
  user_query = input("User: ")
  if user_query.lower() == "exit":
    break
  conversation_history += user_query + " [/INST] "

  start_time = time.time()

  # Query Engine - Default
  response = query_engine.query(conversation_history)
  response.print_response_stream()
  conversation_history += response.response_txt + " [INST] "

  # from pprint import pprint
  # pprint(response)

  end_time = time.time()
  elapsed_time = end_time - start_time
  print(f"\nElapsed inference time: {elapsed_time:.2f} s\n")

这段代码展示了如何使用前面设置的系统进行推理（inference），即生成文本回复。代码分为两个部分：不使用检索增强生成（RAG）的方式和使用 RAG 的方式。

1. 不使用 RAG 的推理过程：

首先，初始化一个空的对话历史 conversation_history。
进入一个无限循环，不断接收用户的输入（user_query），直到用户输入 “exit” 来退出循环。
将用户的输入添加到对话历史中，并标记为 [/INST]。
记录开始时间 start_time。
使用 llm.stream_complete 方法对对话历史进行流式完成（生成回复），并实时打印生成的文本。
当生成的回复完成（finish_reason 为 ‘stop’）时，将其添加到对话历史中。
记录结束时间 end_time，并计算推理所花费的时间，然后打印出来。

2. 使用 RAG 的推理过程：

与不使用 RAG 的过程类似，但这里初始化对话历史时添加了一段预设的问候语。
用户输入 “exit” 同样用于退出循环。
将用户的输入添加到对话历史，并标记为 [/INST]。
记录开始时间 start_time。
使用前面创建的 query_engine 对话历史进行查询，使用 print_response_stream 方法打印流式的回复。
将生成的回复添加到对话历史中，并标记为 [INST]。
（可选）使用 pprint 来打印回复的详细信息（这行代码被注释掉了）。
记录结束时间 end_time，并计算推理所花费的时间，然后打印出来。

 User: hello
/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
  warnings.warn(
 Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?
llama_print_timings:        load time =    4759.88 ms
llama_print_timings:      sample time =      14.14 ms /    26 runs   (    0.54 ms per token,  1838.89 tokens per second)
llama_print_timings: prompt eval time =    4759.11 ms /    11 tokens (  432.65 ms per token,     2.31 tokens per second)
llama_print_timings:        eval time =   15105.29 ms /    25 runs   (  604.21 ms per token,     1.66 tokens per second)
llama_print_timings:       total time =   19957.56 ms /    36 tokens
Elapsed inference time: 19.97 s

User: q
Llama.generate: prefix-match hit
 I don't understand what you mean by "q". Could you explain?
llama_print_timings:        load time =    4759.88 ms
llama_print_timings:      sample time =       8.73 ms /    17 runs   (    0.51 ms per token,  1947.09 tokens per second)
llama_print_timings: prompt eval time =    3619.25 ms /     9 tokens (  402.14 ms per token,     2.49 tokens per second)
llama_print_timings:        eval time =    9527.98 ms /    16 runs   (  595.50 ms per token,     1.68 tokens per second)
llama_print_timings:       total time =   13203.65 ms /    25 tokens
Elapsed inference time: 13.22 s

User: exit
User: exit

 # Multi-model (Separate pre-screening/conversation and summarization tasks)
response_iter = llm.stream_complete("""
[INST] <>
Summarize the conversation into this format: \nName: [name]\nGender: [gender]\nAge: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]
<>
Conversation: """+conversation_history+" [/INST] ")

for response in response_iter:
  print(response.delta, end="", flush=True)

print(conversation_history)

response_iter = llm.stream_complete(...)：调用 llm.stream_complete 方法，传入一个格式化的字符串，该字符串包含了预筛选/对话的总结指令和对话历史 conversation_history。这里的指令要求模型按照特定的格式来总结对话内容。
格式化字符串中的 [INST] <> 和 <> 标签用于指示模型开始和结束处理指令。
Name: [name]\nGender: [gender]\nAge: [age]\nMedical History: [medical history].\nSymptoms: [symptoms]：这是总结对话时需要遵循的格式。
Conversation: """+conversation_history+" [/INST] "：将对话历史添加到指令中，以便模型可以基于这些信息生成总结。
for response in response_iter:：迭代由 stream_complete 方法返回的回复生成器 response_iter。

Llama.generate: prefix-match hit
 Sure, here is the summarized conversation in the requested format:
Name: John
Gender: Male
Age: 42
Medical History: None
Symptoms: Persistent cough, fever, chest tightness, shortness of breath
llama_print_timings:        load time =    4759.88 ms
llama_print_timings:      sample time =      33.27 ms /    57 runs   (    0.58 ms per token,  1713.20 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   35345.27 ms /    57 runs   (  620.09 ms per token,     1.61 tokens per second)
llama_print_timings:       total time =   35548.26 ms /    57 tokens
Hi. [\INST] Hello! I'm the doctor's assistant. Let's begin the consultation, please tell me your name and age.

在这里插入图片描述

《企业级生成式人工智能LLM大模型技术、算法及案例实战》线上高级研修讲座

模块一：Generative AI 原理本质、技术内核及工程实践周期详解
模块二：工业级 Prompting 技术内幕及端到端的基于LLM 的会议助理实战
模块三：三大 Llama 2 模型详解及实战构建安全可靠的智能对话系统
模块四：生产环境下 GenAI/LLMs 的五大核心问题及构建健壮的应用实战
模块五：大模型应用开发技术：Agentic-based 应用技术及案例实战
模块六：LLM 大模型微调及模型 Quantization 技术及案例实战
模块七：大模型高效微调 PEFT 算法、技术、流程及代码实战进阶
模块八：LLM 模型对齐技术、流程及进行文本Toxicity 分析实战
模块九：构建安全的 GenAI/LLMs 核心技术Red Teaming 解密实战
模块十：构建可信赖的企业私有安全大模型Responsible AI 实战

Llama3关键技术深度解析与构建Responsible AI、算法及开发落地实战

1、Llama开源模型家族大模型技术、工具和多模态详解：学员将深入了解Meta Llama 3的创新之处，比如其在语言模型技术上的突破，并学习到如何在Llama 3中构建trust and safety AI。他们将详细了解Llama 3的五大技术分支及工具，以及如何在AWS上实战Llama指令微调的案例。
2、解密Llama 3 Foundation Model模型结构特色技术及代码实现：深入了解Llama 3中的各种技术，比如Tiktokenizer、KV Cache、Grouped Multi-Query Attention等。通过项目二逐行剖析Llama 3的源码，加深对技术的理解。
3、解密Llama 3 Foundation Model模型结构核心技术及代码实现：SwiGLU Activation Function、FeedForward Block、Encoder Block等。通过项目三学习Llama 3的推理及Inferencing代码，加强对技术的实践理解。
4、基于LangGraph on Llama 3构建Responsible AI实战体验：通过项目四在Llama 3上实战基于LangGraph的Responsible AI项目。他们将了解到LangGraph的三大核心组件、运行机制和流程步骤，从而加强对Responsible AI的实践能力。
5、Llama模型家族构建技术构建安全可信赖企业级AI应用内幕详解：深入了解构建安全可靠的企业级AI应用所需的关键技术，比如Code Llama、Llama Guard等。项目五实战构建安全可靠的对话智能项目升级版，加强对安全性的实践理解。
6、Llama模型家族Fine-tuning技术与算法实战：学员将学习Fine-tuning技术与算法，比如Supervised Fine-Tuning(SFT)、Reward Model技术、PPO算法、DPO算法等。项目六动手实现PPO及DPO算法，加强对算法的理解和应用能力。
7、Llama模型家族基于AI反馈的强化学习技术解密：深入学习Llama模型家族基于AI反馈的强化学习技术，比如RLAIF和RLHF。项目七实战基于RLAIF的Constitutional AI。
8、Llama 3中的DPO原理、算法、组件及具体实现及算法进阶：学习Llama 3中结合使用PPO和DPO算法，剖析DPO的原理和工作机制，详细解析DPO中的关键算法组件，并通过综合项目八从零开始动手实现和测试DPO算法，同时课程将解密DPO进阶技术Iterative DPO及IPO算法。
9、Llama模型家族Safety设计与实现：在这个模块中，学员将学习Llama模型家族的Safety设计与实现，比如Safety in Pretraining、Safety Fine-Tuning等。构建安全可靠的GenAI/LLMs项目开发。
10、Llama 3构建可信赖的企业私有安全大模型Responsible AI系统：构建可信赖的企业私有安全大模型Responsible AI系统，掌握Llama 3的Constitutional AI、Red Teaming。

解码Sora架构、技术及应用

一、为何Sora通往AGI道路的里程碑？
1，探索从大规模语言模型(LLM)到大规模视觉模型(LVM)的关键转变，揭示其在实现通用人工智能(AGI)中的作用。
2，展示Visual Data和Text Data结合的成功案例，解析Sora在此过程中扮演的关键角色。
3，详细介绍Sora如何依据文本指令生成具有三维一致性(3D consistency)的视频内容。 4，解析Sora如何根据图像或视频生成高保真内容的技术路径。
5，探讨Sora在不同应用场景中的实践价值及其面临的挑战和局限性。

二、解码Sora架构原理
1，DiT (Diffusion Transformer)架构详解
2，DiT是如何帮助Sora实现Consistent、Realistic、Imaginative视频内容的？
3，探讨为何选用Transformer作为Diffusion的核心网络，而非技术如U-Net。
4，DiT的Patchification原理及流程，揭示其在处理视频和图像数据中的重要性。
5，Conditional Diffusion过程详解，及其在内容生成过程中的作用。
三、解码Sora关键技术解密
1，Sora如何利用Transformer和Diffusion技术理解物体间的互动，及其对模拟复杂互动场景的重要性。
2，为何说Space-time patches是Sora技术的核心，及其对视频生成能力的提升作用。
3，Spacetime latent patches详解，探讨其在视频压缩和生成中的关键角色。
4，Sora Simulator如何利用Space-time patches构建digital和physical世界，及其对模拟真实世界变化的能力。
5，Sora如何实现faithfully按照用户输入文本而生成内容，探讨背后的技术与创新。
6，Sora为何依据abstract concept而不是依据具体的pixels进行内容生成，及其对模型生成质量与多样性的影响。

段智华

关注

27
点赞
踩
24

收藏

觉得还不错? 一键收藏
打赏
0
评论
大模型应用开发技术：LlamaIndex 案例实战（三）LlamaIndex RAG Chat

LlamaIndex RAG Chat 是一个使用 Colab 笔记本进行检索增强生成（Retrieval-Augmented Generation, RAG）的系统，它利用 PDF 文件来生成内容。
复制链接

扫一扫