QAnything installation steps on Windows

Environment preparation

  • RTX 4090 GPU
  • Windows 10 Professional
  • Install Ubuntu under WSL
    • Set WSL to version 2
      • Reference: https://blog.csdn.net/SUNbrightness/article/details/116783604
    • wsl --list -o lists the installable Linux distributions
    • wsl --install -d Ubuntu-22.04 installs that Ubuntu release
    • You will be asked to set a username and password; make sure to remember them
  • Install Git
  • MobaXterm, for remote connections to the WSL VM
  • Install python3-pip inside Ubuntu
    • If the install fails, run apt update first to refresh the package lists, then retry; it should then succeed
  • Install the NVIDIA driver utilities (a consolidated command sketch follows this list)
    • Run nvidia-smi once first to check whether a driver is already visible; it may well be, and if so, skip the apt command below
    • apt install nvidia-utils-535
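
  A consolidated sketch of the environment commands above; the PowerShell half runs on Windows, the rest inside the Ubuntu shell, and the package names are the ones from this list:

    # --- elevated PowerShell on Windows ---
    wsl --set-default-version 2       # new distros default to WSL version 2
    wsl --list -o                     # list the installable Linux distributions
    wsl --install -d Ubuntu-22.04     # prompts for a username and password; remember them

    # --- inside the Ubuntu shell ---
    sudo apt update                   # refresh package lists first, or python3-pip may fail
    sudo apt install -y python3-pip
    nvidia-smi                        # if the driver is already visible, stop here
    sudo apt install -y nvidia-utils-535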

Setting up the Git repositories

  • Source code: git clone https://github.com/netease-youdao/QAnything.git
  • Embedding models: git clone https://www.modelscope.cn/netease-youdao/QAnything.git
    • After downloading, unzip models.zip and place the resulting models directory in the QAnything root. This models directory holds the embedding model of the RAG framework, used for text representation.
  • Download the LLM:
    • git lfs install
    • git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything
    • Note: the ModelScope mirror (git clone https://www.modelscope.cn/netease-youdao/Qwen-7B-QAnything.git) is broken; download the LLM from Hugging Face instead, or all sorts of problems follow. chatglm3-6b worked fine with -b hf, but this Qwen checkout did not. I spent several days on this without figuring out why, so do not download it from anywhere else.
    • Place the model under QAnything/assets/custom_models (see the sketch after this list)
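
  A sketch of the resulting layout, assuming the clones above (the scratch path /tmp/qanything-models and the unzip destination are my own choices, not from the QAnything docs):

    git clone https://github.com/netease-youdao/QAnything.git
    cd QAnything

    # embedding models: fetch the ModelScope repo, then unzip models.zip into the QAnything root
    git clone https://www.modelscope.cn/netease-youdao/QAnything.git /tmp/qanything-models
    unzip /tmp/qanything-models/models.zip -d .

    # LLM weights: clone from Hugging Face straight into assets/custom_models
    git lfs install
    git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything assets/custom_models/Qwen-7B-QAnything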

Execution

  • Install the required Python packages
    • sudo apt update
    • sudo apt install python3-pip
    • pip install -r requirements.txt
  • Check the NVIDIA driver
    • Install the NVIDIA driver utilities if needed (apt install nvidia-utils-535, as above)
    • nvidia-smi shows the GPU status
  • Run the shell scripts
    • First convert the scripts from Windows (CRLF) to Unix (LF) line endings (an equivalent dos2unix one-liner follows the sed commands):
    sed -i "s/\r//" scripts/run_for_local_option.sh
    sed -i "s/^M//" scripts/run_for_local_option.sh
    sed -i "s/\r//" scripts/run_for_cloud_option.sh
    sed -i "s/^M//" scripts/run_for_cloud_option.sh
    sed -i "s/\r//" run.sh
    sed -i "s/^M//" run.sh
    sed -i "s/\r//" close.sh
    sed -i "s/^M//" close.sh
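    • Equivalently, the dos2unix tool (a standard apt package) does the same CRLF-to-LF conversion without hand-written sed patterns:
      sudo apt install -y dos2unix
      dos2unix scripts/run_for_local_option.sh scripts/run_for_cloud_option.sh run.sh close.sh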
    
    • Install third-party tools:
      • sudo apt-get install jq
      • sudo apt-get install bc
    • Pin the requests Python package:
      • pip install requests==2.28.1
    • Install Docker (add Docker's apt repository, then the nvidia-docker2 runtime):
sudo apt-get update

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

echo \
  "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container-experimental.list | sudo tee /etc/apt/sources.list.d/libnvidia-container-experimental.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2

sudo service docker stop
sudo service docker start

sudo apt install docker-compose

# test that Docker can see the GPU
sudo docker run --runtime=nvidia  --rm -it --name tensorflow-1.14.0 tensorflow/tensorflow:1.14.0-gpu-py3
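
# Optional sanity checks before moving on (standard Docker commands; the CUDA image tag is just an example)
sudo service docker status                 # the daemon should be running
sudo docker run --rm hello-world           # CPU-only smoke test
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi   # GPU passthrough test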
  • Run the launch script:
    • sudo bash ./run.sh -c local -i 0 -b hf -m Qwen-7B-QAnything -t qwen-7b-qanything (with -b hf this hung; it turned out the model I had downloaded from ModelScope was broken, so the GPU load never finished. Switching to chatglm3-6b made it work.)
    • This step pulls quite a few Docker images, so expect a lot of downloading
    • Choose the -b backend carefully; some backends simply do not work. With chatglm3-6b, vllm failed to run, hf worked, and default failed. The invocation that finally worked for me is shown below.
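  • For reference, the invocation that eventually worked end-to-end for me (the same command as in Problem 17 below):
    bash ./run.sh -c local -i 0 -b hf -m chatglm3-6b -t chatglm3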

Troubleshooting

    1. Docker errors out when run:
       TypeError: HTTPConnection.request() got an unexpected keyword argument 'chunked'
       docker.errors.DockerException: Error while fetching server API version: HTTPConnection.request() got an unexpected keyword argument 'chunked'
      • Solution, following https://github.com/google-deepmind/alphafold/issues/812: downgrade the requests Python library to 2.28.1
        • Concretely: pip install requests==2.28.1
    2. urllib3.exceptions.ProtocolError: ('Connection aborted.', PermissionError(13, 'Permission denied'))

      • Running the command with sudo resolves this
    3. ERROR: for qanything-container-local Cannot start service qanything_local: could not select device driver "nvidia" with capabilities: [[gpu]]
      • Probably caused by the NVIDIA driver setup; installing the NVIDIA driver utilities resolves it
      • Attempt 1:
        1. Installing a driver downloaded from the NVIDIA website did not work: the Linux 64-bit, Linux aarch64, and FreeBSD x64 installers were all ineffective, so this route is a dead end
        2. Running nvidia-smi printed some package suggestions; based on those, installed apt install nvidia-utils-535
        3. After that, nvidia-smi shows the GPU
        • Reference: https://blog.csdn.net/m0_73139694/article/details/135473124
      • Reference: https://github.com/THUDM/CodeGeeX/issues/103
    4. After installing nvidia-utils, nvidia-smi reports: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

      • Attempt 1 (did not work):
        1. sudo add-apt-repository ppa:graphics-drivers/ppa --yes
        2. sudo apt update
        3. sudo apt install nvidia-driver-550   # the newest driver at the time

      • Attempt 2:
        1. sudo apt-get remove --purge '^nvidia-.*'
           sudo apt-get remove --purge '^libnvidia-.*'
           sudo apt-get remove --purge '^cuda-.*'
        2. wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run
        3. sudo sh cuda_12.3.2_545.23.08_linux.run

      • After enabling Hyper-V and rerunning nvidia-smi, it unexpectedly worked (under WSL2 the GPU driver is supplied by the Windows host, which is likely why installing Linux drivers inside the guest kept failing)
      • Reference: https://blog.csdn.net/wjinjie/article/details/108997692
      • Reference: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-make-sure-that-the-latest-nvidia-driver-is-installed-and-running/197141/2
      • Reference: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-make-sure-that-the-latest-nvidia-driver-is-installed-and-running/197141
    5. docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory')), and: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

      • The Docker daemon was not running, so the Docker service was unavailable
      • Worked around it by installing Docker a different way: snap install docker (this later caused problems of its own; see attempt 8 under Problem 6)
    6. Error at runtime: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

      • Attempt 1, install the NVIDIA Container Toolkit:
        distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
        curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
        curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
        sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
        sudo snap restart docker   # Docker here was installed via snap; otherwise restart the service (systemctl restart docker)
        docker run --rm -it --gpus all ubuntu:22.04 nvidia-smi   # failed, with the same error as above
        nvidia-smi   # not executed
        After reinstalling with the current recommended procedure it worked; reference: https://blog.csdn.net/SUNbrightness/article/details/116783604

      • Attempt 2 (problematic):
        • Perhaps libnvidia-ml.so.1 sits in the wrong location. nvidia-container-cli list shows the file lives under /usr/lib/wsl, so it does exist
        • Rename the host copies out of the way, on the theory that duplicate library names cause the problem:
          mv /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1.bak
          mv /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1.bak
        • Reference: https://www.cnblogs.com/devilmaycry812839668/p/17296525.html
      • Attempt 3, install nvidia-docker2 (problematic):
        • sudo docker run --rm -it ls -al /usr/lib/x86_64-linux-gnu/libnv*   # following this approach, symlink the NVIDIA libraries straight into the container's lib directory
        • Reference: https://github.com/NVIDIA/nvidia-container-toolkit/issues/289
        • Reference: https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/docker.html
      • Attempt 4, try an nvidia/cuda container (did not work):
        • docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi (from https://www.bilibili.com/read/cv31412472/)
      • Attempt 5: removing the nvidia options from docker-compose-windows.yaml let the stack start, but it then errored at rank_embed
      • Attempt 6: set nvidia-smi to persistence mode (did not work)
        • Reference: https://github.com/NVIDIA/nvidia-docker/issues/1648
      • Attempt 7: try the approach in https://www.cnblogs.com/dudu/p/18010103
      • Attempt 8: could the snap-installed Docker be the culprit? Uninstall the snap Docker and retry (see the sketch below)
        • Confirmed: the root cause was installing Docker via snap
        • Docker installation walkthrough: https://blog.csdn.net/SUNbrightness/article/details/116783604
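      • A minimal sketch of backing out the snap install and returning to the apt packages (the apt repository setup is the one from the Docker install section above):
        sudo snap remove docker                  # drop the snap-installed Docker
        sudo apt-get update
        sudo apt-get install -y docker-ce docker-ce-cli containerd.io
        sudo service docker start
        docker run --rm --gpus all ubuntu:22.04 nvidia-smi   # should now see the GPU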
    7. please install FlashAttention https://github.com/Dao-AILab/flash-attention

      • pip install ninja
      • pip install torch
      • pip install flash-attn --no-build-isolation
      • That raised OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root, so CUDA has to be installed first:
      • wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run
      • sudo sh cuda_12.3.2_545.23.08_linux.run
      • export CUDA_HOME=/usr/local/cuda-X.X (substituting the installed version; see the sketch below)
      • Reference: https://gist.github.com/Brainiarc7/470a57e5c9fc9ab9f9c4e042d5941a40
      • Reference: https://blog.csdn.net/OOFFrankDura/article/details/113632416
      • Reference: https://github.com/Dao-AILab/flash-attention, the GPU-acceleration open-source project itself
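      • The usual environment setup after a runfile install looks like this (the version path matches the 12.3.2 installer above; persisting it in ~/.bashrc is my own habit, not a QAnything requirement):
        export CUDA_HOME=/usr/local/cuda-12.3
        export PATH="$CUDA_HOME/bin:$PATH"
        export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
        echo 'export CUDA_HOME=/usr/local/cuda-12.3' >> ~/.bashrc   # persist for new shells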
    8. RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

      • Method 1:
        • pip install ultralytics
        • Reference: https://github.com/ultralytics/ultralytics/issues/5793
        • This was only a diagnostic; it did not help
      • Judging from reports online, the model files themselves were corrupt. I had downloaded from https://www.modelscope.cn; the original model is about 14 GB, but my local copy was only 2 GB. Downloading the pytorch model files with a browser instead of git and swapping them in made everything work (see the check below)
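      • A quick way to check for the same truncated-download problem (du and git lfs ls-files are standard tools; the 14G figure is the one quoted above):
        du -sh Qwen-7B-QAnything/                  # should be on the order of 14G, not 2G
        cd Qwen-7B-QAnything && git lfs ls-files   # a '-' marker means only an LFS pointer was fetched, not the real weights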
    9. ValueError: Cannot find any model weight files. Please check your (cached) weight path: /model_repos/CustomLLM/Qwen-7B-QAnything

      • https://github.com/lm-sys/FastChat/blob/main/fastchat/model/compression.py implements this check. The files do in fact exist; the problem is that a plain git clone pulls an incomplete model. Re-fetch the weights; a model pulled directly with git clone is not usable. Take particular care here.
    10. import flash_attn rms_norm fail, import flash_attn rotary fail, import flash_attn fail

      • Handled as follows:
        git clone https://github.com/Dao-AILab/flash-attention
        cd flash-attention && pip install .   # building from source is slow; pip install flash-attn --no-build-isolation is faster
        # Below are optional. Installing them might be slow.
        pip install csrc/layer_norm   # very CPU-intensive; be patient
        pip install csrc/rotary

      • Build error: ninja: build stopped: subcommand failed.
        Traceback (most recent call last):
          File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
            subprocess.run(
          File "/usr/lib/python3.10/subprocess.py", line 526, in run
            raise CalledProcessError(retcode, process.args,
        subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
      • Fix: in cpp_extension.py, change the ninja -v invocation to --version
    11. /usr/bin/ld: cannot find /mnt/c/QAnything/flash-attention/csrc/rotary/build/temp.linux-x86_64-cpython-310/rotary.o: No such file or directory

      • I had installed CUDA 12, so the versions do not match and need adjusting
      • The attempted fixes did not help either; reference: https://github.com/Dao-AILab/flash-attention/issues/484
    12. run.sh does not start successfully with -b hf or -b vllm (for the Qwen model; the root causes turned out to be the broken model download and wrong parameters, see the run step above and Problem 17)

    13. requests.exceptions.ReadTimeout: HTTPConnectionPool(host='0.0.0.0', port=36001): Read timed out. (read timeout=60)

      • In practice the answer had already been generated; the timeout appears only because the service on port 36001 never came up
      • Port 36001 is started by llm_server_run.sh, which had not started successfully; that is what needs investigating
      • The service is a Sanic application, and it failed to bind port 36001
      • Restarting llm_server_run.sh brought it back to normal (see the port check below)
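      • Standard ways to confirm whether the server on 36001 is actually listening (nothing QAnything-specific about these):
        ss -tlnp | grep 36001                          # is any process bound to the port?
        curl -s -m 5 http://127.0.0.1:36001/ || echo "36001 not responding"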
    14. After installing Ubuntu under WSL, the Windows drives were no longer mounted

      • This always worked before and suddenly stopped; strange
      • Reconnecting to WSL restored the mounts; presumably a transient glitch
    15. Failed to install npm dependencies.

      • Resolved by installing npm: apt install -y npm
    16. Failed to build the front end.

      • At first this looked like an environment problem, but reinstalling the whole system environment did not help either
      • Running npm run build directly inside front_end surfaced the real error:
        file:///mnt/c/QAnything/QAnything/front_end/node_modules/vite/bin/vite.js:7
            await import('source-map-support').then((r) => r.default.install())
            ^^^^^

        SyntaxError: Unexpected reserved word

      • npm --version reported 8.5; the Node.js/npm shipped by apt install npm is too old (top-level await needs a newer Node.js), so install a newer npm instead
      • Upgrading the npm version fixed the build (see the sketch below)
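      • One common way to get a newer Node.js/npm than the apt default is the NodeSource setup script (Node 18 here is just an example; any release new enough for top-level await should work):
        curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
        sudo apt-get install -y nodejs                # ships with a matching npm
        node --version && npm --version
        cd front_end && npm install && npm run build  # retry the front-end build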
    17. peer closed connection without sending complete message body (incomplete chunked read)

      • The full error:
    2024-02-26 12:06:58 | ERROR | stderr | ERROR:    Exception in ASGI application
    2024-02-26 12:06:58 | ERROR | stderr | Traceback (most recent call last):
    2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in __call__
    2024-02-26 12:06:58 | ERROR | stderr |     await wrap(partial(self.listen_for_disconnect, receive))
    2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 257, in wrap
    2024-02-26 12:06:58 | ERROR | stderr |     await func()
    2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 234, in listen_for_disconnect
    2024-02-26 12:06:58 | ERROR | stderr |     message = await receive()
    2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 587, in receive
    2024-02-26 12:06:58 | ERROR | stderr |     await self.message_event.wait()
    2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    2024-02-26 12:06:58 | ERROR | stderr |     await fut
    2024-02-26 12:06:58 | ERROR | stderr | asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fa254ef6140
    
      • When an answer was ready, no actual response was returned; the result never appeared
      • Attempted: pip install gradio_client==0.2.7, per https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/11855
      • After talking it through with an expert, the model's own output might be malformed; they suggested trying a different model such as MiniChat-2B
      • In the end it became clear the parameters were simply wrong. The correct command is:
      • bash ./run.sh -c local -i 0 -b hf -m chatglm3-6b -t chatglm3; with this it works normally

References

  • https://blog.csdn.net/lianghao118/article/details/136087436
  • https://github.com/Dao-AILab/flash-attention
  • https://github.com/QwenLM/Qwen#quickstart