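Peking University team releases a legal large language model built on chinese-llama-plus: data and model fully open source, with the complete merge-and-use workflow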

weixin_39394916

Last modified 2023-06-14 18:19:20

Tags: llama, artificial intelligence

First published 2023-06-14 17:13:18

Article link: https://blog.csdn.net/weixin_39394916/article/details/131212005

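The previous post covered a legal large language model, LaWGPT, which currently handles basic legal questions reasonably well. Yesterday I found that Peking University has also open-sourced a legal large model, lawyer-llama. By training on a large-scale legal corpus, the model systematically learns the Chinese legal knowledge system, so that it can master Chinese legal knowledge and apply it to legal practice in China.

Let's look at an example from the paper.

Compare with the BELLE (Be Everyone's Large Language model Engine) model on the left of the figure above. When asked about "the legal marriage age in China", Lawyer LLaMA gives an answer that is correct and also reads more like a lawyer's. Moreover, even when the necessary legal provisions are supplied, as in question B of the figure, BELLE still fails to answer correctly, whereas Lawyer LLaMA gives a well-reasoned, professional answer.

BELLE's answers also show that dropping a general-purpose large model straight into a specialized vertical domain tends to cause many problems. The authors argue that for a large model to adapt well to the particular demands of the legal domain, it must satisfy three conditions: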

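Precise wording that avoids ambiguity: in law, changing a single word can turn the resulting legal relationship into its opposite. For example, in Chinese 定金 (a deposit given as a guarantee) and 订金 (an advance payment) differ by only one character, yet their meaning and legal effect under contract law are entirely different.
Understanding and distinguishing legal terms: law has its own vocabulary. Many terms appear only in the legal domain, such as the concept of a legal person (法人), while even more terms carry somewhat different meanings in law than in everyday life, and the model has to tell these apart.
Understanding real-world situations: beyond a basic grasp and systematic command of legal terms and legal analysis, the model should be able to understand real-life problems precisely. In other words, its core ability must be applying legal theory to solve specific problems.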

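Based on these ideas, the authors start from the open-source LLaMA model and aim to solve the problem of applying large models to the legal domain in the following steps:

Injecting legal knowledge: collect a large amount of raw text from the legal domain, such as statutes, judicial interpretations, and national legal documents, and continue training the original model on this new data.
Acquiring domain-specific skills: a good legal model should handle the common tasks of the legal domain, such as concept explanation, case analysis, and legal consultation. The authors therefore collected a set of real task cases and used ChatGPT to generate the corresponding answers for supervised fine-tuning, giving the model the ability to solve tasks specific to the legal domain.
Reducing hallucination with information retrieval: to reduce hallucination, the authors also introduced an information retrieval module. Before generating each reply, it first uses the user's query and the context to retrieve relevant legal articles, and then generates the reply based on those articles.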
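
The third step, retrieving legal articles before generating a reply, can be illustrated in a few lines of Python. This is a minimal sketch and not the Lawyer LLaMA implementation: the tiny in-memory article list (short paraphrases written for this example), the word-overlap scoring, and the names `tokenize`, `retrieve_articles`, and `build_prompt` are all assumptions made for illustration.

```python
# Toy corpus of paraphrased legal articles (illustrative only, not the project's database).
ARTICLES = [
    "Article A: The legal marriage age is 22 for men and 20 for women.",
    "Article B: A deposit paid as a guarantee is forfeited if the payer breaches the contract.",
    "Article C: A legal person is an organization with capacity for civil rights and conduct.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve_articles(query, articles, top_k=1):
    """Rank articles by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(articles, key=lambda a: len(query_words & tokenize(a)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, articles):
    """Place the retrieved articles in front of the user question."""
    context = "\n".join(articles)
    return f"Relevant legal articles:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "What is the legal marriage age in China?"
hits = retrieve_articles(query, ARTICLES)
prompt = build_prompt(query, hits)
print(prompt)
```

A real system would swap the word-overlap ranking for a proper retriever, but the shape stays the same: retrieve first, then let the model answer with the retrieved text in its context.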

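With these three steps, the authors completed Lawyer LLaMA. Its overall workflow is shown in the figure below:

Let's go straight to the results reported in the paper.

In a like-for-like comparison, lawyer-llama does come out clearly ahead on each of the aspects examined.

The paper also contains detailed comparison data; take a look if you are interested: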

https://arxiv.org/pdf/2305.15062.pdf

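Merge and usage workflow

Now for the hands-on part.

First, get the weights. This takes two downloads (although the official repo lists three steps).

1. Download consolidated.00.pth from the 7B model

https://huggingface.co/nyanko7/LLaMA-7B/tree/main

2. Download the lawyer-llama weights

https://huggingface.co/pkupie/lawyer-llama-13b-beta1.0/tree/main

Merge the two with the official decrypt.py script, which uses the original LLaMA file as the key to decrypt the encrypted lawyer-llama weights: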

for f in "/path/to/model/pytorch_model"*".enc"; \
    do if [ -f "$f" ]; then \
       python3 decrypt.py "$f" "/path/to_original_llama/7B/consolidated.00.pth" "/path/to/model"; \
    fi; \
done
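
For readers who prefer Python to the shell loop above, a rough equivalent might look like the following. It is a sketch under assumptions: the helper name `build_decrypt_commands` is invented for this example, it calls the current interpreter via `sys.executable` instead of `python3`, and the paths are the same placeholders used in the shell version.

```python
import glob
import os
import subprocess
import sys


def build_decrypt_commands(model_dir, key_path, script="decrypt.py"):
    """Return one decrypt.py command per encrypted shard found in model_dir."""
    pattern = os.path.join(model_dir, "pytorch_model*.enc")
    commands = []
    for enc_file in sorted(glob.glob(pattern)):
        if os.path.isfile(enc_file):  # same guard as the `[ -f "$f" ]` test
            commands.append([sys.executable, script, enc_file, key_path, model_dir])
    return commands


if __name__ == "__main__":
    # Placeholder paths, as in the shell version; replace them with real locations.
    key = "/path/to_original_llama/7B/consolidated.00.pth"
    for cmd in build_decrypt_commands("/path/to/model", key):
        subprocess.run(cmd, check=True)  # stop at the first shard that fails
```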

Here is the script:


import os
import sys
import hashlib
import multiprocessing


def xor_bytes(data, key):
    return bytes(a ^ b for a, b in zip(data, (key * (len(data) // len(key) + 1))[:len(data)]))

def xor_worker(task_queue, result_queue):
    while True:
        chunk_idx, data, key = task_queue.get()
        result_queue.put((chunk_idx, xor_bytes(data, key)))
        task_queue.task_done()

def write_result_chunk(fp, w_chunk_idx, pending, hasher):
    if not pending:
        return w_chunk_idx, pending
    pending.sort()
    for pending_idx, (chunk_idx, chunk) in enumerate(pending):
        if chunk_idx != w_chunk_idx:
            return w_chunk_idx, pending[pending_idx:]
        fp.write(chunk)
        hasher.update(chunk)
        w_chunk_idx += 1
    return w_chunk_idx, []

def main(input_file, key_file, output_dir):
    # os.cpu_count() can return None; fall back to a single worker in that case
    worker_count = max(1, (os.cpu_count() or 2) - 1)
    print(f"Decrypting file {input_file} with {worker_count} workers")

    task_queue = multiprocessing.JoinableQueue(worker_count * 3)
    result_queue = multiprocessing.Queue()
    processes = [
        multiprocessing.Process(target=xor_worker, args=(task_queue, result_queue))
        for _ in range(worker_count)
    ]
    for p in processes:
        p.daemon = True
        p.start()

    chunk_size = 10 * 1024 * 1024
    key_chunk_size = 10 * 1024 * 1024

    hasher = hashlib.sha256()

    # Get the checksum from the input file name
    input_file_basename = os.path.basename(input_file)
    checksum_hex = input_file_basename.split(".")[-2]

    with open(input_file, "rb") as in_file, open(key_file, "rb") as key_file:
        # Get the size of the input file
        file_size = os.path.getsize(input_file)

        # Minus the checksum size
        file_size -= hasher.digest_size

        # Read the checksum from the beginning of the input file
        expected_hash = in_file.read(hasher.digest_size)

        # Create the output file path without the checksum in the filename
        # remove .<checksum>.enc
        input_file_basename = input_file_basename[:-len(checksum_hex) - 5]
        output_file = os.path.join(output_dir, input_file_basename)

        with open(output_file, "wb") as out_file:
            r_chunk_idx = 0  # how many chunks we have read
            w_chunk_idx = 0  # how many chunks have been written
            write_pending = []  # have xor results, awaiting to be written to file

            bytes_read = 0
            while True:
                chunk = in_file.read(chunk_size)
                if not chunk:
                    break

                key_chunk = key_file.read(key_chunk_size)
                if not key_chunk:
                    key_file.seek(0)
                    key_chunk = key_file.read(key_chunk_size)
                
                task_queue.put((r_chunk_idx, chunk, key_chunk))
                # read available results
                while not result_queue.empty():
                    write_pending.append(result_queue.get())
                    
                w_chunk_idx_new, write_pending = write_result_chunk(out_file, w_chunk_idx, write_pending, hasher)

                bytes_read += (w_chunk_idx_new - w_chunk_idx) * chunk_size
                progress = bytes_read / file_size * 100
                sys.stdout.write(f"\rProgress: {progress:.2f}%")
                sys.stdout.flush()
                
                w_chunk_idx = w_chunk_idx_new
                r_chunk_idx += 1

            # wait for xor workers
            sys.stdout.write('\rWaiting for workers...')
            sys.stdout.flush()
            task_queue.join()
            while not result_queue.empty():
                write_pending.append(result_queue.get())
            sys.stdout.write('\rWriting final chunks...')
            sys.stdout.flush()
            write_result_chunk(out_file, w_chunk_idx, write_pending, hasher)

            computed_hash = hasher.digest()

            if computed_hash != expected_hash:
                print("\nError: Checksums do not match. The file may be corrupted.")
                sys.exit(1)

        print("\nDecryption completed.")

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: decrypt.py input_file key_file output_dir")
        sys.exit(1)

    main(sys.argv[1], sys.argv[2], sys.argv[3])
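
To make the file layout that the script expects easier to see, here is a small in-memory round trip. It is a sketch, not part of the official tooling: `toy_encrypt` and `toy_decrypt` are hypothetical helpers, and the layout they use (a 32-byte SHA-256 digest of the plaintext followed by the XOR-ed payload) is inferred from how `main` reads `expected_hash` and compares it with the hash of the decrypted chunks. The chunked, multi-process reading of the real script is left out.

```python
import hashlib


def xor_bytes(data, key):
    """XOR data with key, repeating the key when it is shorter than data."""
    repeated = (key * (len(data) // len(key) + 1))[:len(data)]
    return bytes(a ^ b for a, b in zip(data, repeated))


def toy_encrypt(plaintext, key):
    """Prefix the SHA-256 digest of the plaintext, then append the XOR-ed payload."""
    return hashlib.sha256(plaintext).digest() + xor_bytes(plaintext, key)


def toy_decrypt(blob, key):
    """Split off the digest, undo the XOR, and verify the checksum."""
    digest_size = hashlib.sha256().digest_size  # 32 bytes
    expected_hash, payload = blob[:digest_size], blob[digest_size:]
    plaintext = xor_bytes(payload, key)
    if hashlib.sha256(plaintext).digest() != expected_hash:
        raise ValueError("Checksums do not match")
    return plaintext


weights = b"pretend these bytes are model weights"
key = b"llama-7b"  # stands in for the bytes of consolidated.00.pth
blob = toy_encrypt(weights, key)
assert toy_decrypt(blob, key) == weights  # XOR with the same key undoes itself
```

Because XOR with the same key is its own inverse, decrypting with anything other than the original key file produces different bytes, and the checksum comparison catches that.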

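The repo also provides a legal-article retrieval module.

3. Download the retrieval module from Baidu Netdisk (extraction code: r0vx), then run python server.py inside it to start the retrieval service. By default it listens on port 9098.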
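
Before launching the demo, it can be handy to confirm that the retrieval service is actually listening. The following is a minimal sketch using only the standard library. `is_port_open` is a hypothetical helper name, and the host 127.0.0.1 is an assumption based on the `--classifier_url` value used in the demo command.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 9098 is the default port mentioned for server.py.
    if is_port_open("127.0.0.1", 9098):
        print("Retrieval service is reachable on port 9098.")
    else:
        print("Nothing is listening on port 9098; start server.py first.")
```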

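Once the model is decrypted, it is ready to use.

Running with the interactive interface

Run the following command to start the interactive web page, then visit http://127.0.0.1:7863.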

python demo_web.py \
--port 7863 \
--checkpoint /path/to/model \
--classifier_url "http://127.0.0.1:9098/check_hunyin"

Run this and you can start using the model.

Repo: https://github.com/AndrewZhe/lawyer-llama
