Problem description
Installing HuggingFace datasets fails with an error.
System: macOS 10.13.6
Environment: Conda virtual environment, python==3.8.1
Command:
pip install datasets
Error message:
CMake Error at CMakeLists.txt:268 (find_package):
By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Arrow", but
CMake did not find one.
Could not find a package configuration file provided by "Arrow" with any of
the following names:
ArrowConfig.cmake
arrow-config.cmake
Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
"Arrow_DIR" to a directory containing one of the above files. If "Arrow"
provides a separate development package or SDK, be sure it has been
installed.
-- Configuring incomplete, errors occurred!
See also "/private/var/folders/sd/6d0w7lz121v38498dngh6y540000gn/T/pip-install-ewqnh087/pyarrow_673989b028794d389cba544b08d75516/build/temp.macosx-10.9-x86_64-cpython-38/CMakeFiles/CMakeOutput.log".
error: command '/usr/local/bin/cmake' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow, which is required to install pyproject.toml-based projects
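One plausible reading of this log, which the log itself does not confirm, is that pip found no prebuilt pyarrow wheel for this interpreter and macOS version, fell back to the source distribution, and then needed an Arrow C++ installation that was not present. A minimal sketch for checking that assumption follows; `has_wheel` is a hypothetical helper name, and the check needs network access.

```shell
# Hypothetical helper: succeeds if pip can download a prebuilt wheel
# for the given package, fails if only a source distribution is usable.
has_wheel() {
    pkg="$1"
    tmpdir="$(mktemp -d)"
    # --only-binary=:all: forbids source distributions; --no-deps limits
    # the check to the named package itself.
    if python3 -m pip download --only-binary=:all: --no-deps \
        --dest "$tmpdir" "$pkg" >/dev/null 2>&1; then
        rc=0
    else
        rc=1
    fi
    rm -rf "$tmpdir"
    return "$rc"
}
```

Calling `has_wheel pyarrow && echo "wheel available" || echo "source build required"` inside the activated environment shows which path pip would take. If no wheel is available, upgrading pip or pinning an older pyarrow release may be worth trying before building Arrow C++ by hand.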
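Solution
In the Conda virtual environment, first fetch the test data and set the environment variables:
$ cd /Users/../anaconda3/envs/env_name # first change to the virtual environment directory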
$ git clone https://github.com/apache/arrow.git
$ pushd arrow
$ git submodule update --init
$ export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
$ export ARROW_TEST_DATA="${PWD}/testing/data"
$ popd
Installing the dependencies for Arrow C++ and PyArrow from conda-forge then failed with `CondaValueError: Malformed version string '~': invalid character(s).`
$ conda activate env_name # activate the virtual environment
$ conda install -c conda-forge \
--file arrow/ci/conda_env_unix.txt \
--file arrow/ci/conda_env_cpp.txt \
--file arrow/ci/conda_env_python.txt \
--file arrow/ci/conda_env_gandiva.txt \
compilers # download from the channel
$ export ARROW_HOME=$CONDA_PREFIX
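The `CondaValueError` names the character `~`, so one way to narrow it down is to look for that character in the requirement files passed with `--file`. This is only a diagnostic sketch: it assumes the offending spec lives in one of those files, which the error message does not confirm, and `find_tilde_specs` is a hypothetical helper name.

```shell
# Hypothetical helper: print file:line:content for every line that
# contains '~' in the given requirement files. Prints nothing and
# still succeeds when no such line exists.
find_tilde_specs() {
    grep -n -H '~' "$@" || true
}
```

For example, `find_tilde_specs arrow/ci/conda_env_*.txt` lists candidate lines. Comparing them against the output of `conda --version` can help decide whether the installed conda is too old to parse such a spec.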
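Next attempt: start from a system-level virtual environment, install the Arrow C++ dependencies, and set the environment variables. Installing into the existing virtual environment surfaced dependency conflicts among many deep-learning packages, so a new virtual environment has to be created:
- lamini 1.0.2 requires pydantic==1.10.*, but gradio 4.4.0 requires pydantic>=2.0
- tensorflow 2.6.5 requires typing-extensions<3.11,>=3.7, but most other packages require typing-extensions>=4.7.0
- tensorflow 2.6.5 requires numpy~=1.19.2, but most other packages require numpy>=1.22.0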
$ brew update && brew bundle --file=arrow/cpp/Brewfile
$ python3 -m venv pyarrow-dev # create a new virtual environment
$ source ./pyarrow-dev/bin/activate
$ pip install -r arrow/python/requirements-build.txt # includes oldest-supported-numpy, which cannot be used with Conda or Homebrew
$ mkdir dist
$ export ARROW_HOME=$(pwd)/dist
$ export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
$ export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH
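Because the later CMake step depends on these variables, it can help to confirm they are set in the current shell before building. The following is a small sketch; `check_arrow_env` is a hypothetical helper name, not part of Arrow.

```shell
# Hypothetical helper: verify that ARROW_HOME is set and that it
# appears as an entry of CMAKE_PREFIX_PATH.
check_arrow_env() {
    if [ -z "${ARROW_HOME:-}" ]; then
        echo "ARROW_HOME is not set" >&2
        return 1
    fi
    # Wrap the list in colons so every entry can be matched as ":entry:".
    case ":${CMAKE_PREFIX_PATH:-}:" in
        *":${ARROW_HOME}:"*) ;;
        *)
            echo "ARROW_HOME is not on CMAKE_PREFIX_PATH" >&2
            return 1
            ;;
    esac
    echo "ok: ARROW_HOME=${ARROW_HOME}"
}
```

Running `check_arrow_env` in the same shell that will run `cmake` catches the common case where the exports were made in a different terminal session.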
Install
$ mkdir arrow/cpp/build
$ pushd arrow/cpp/build
$ cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_BUILD_TYPE=Debug \
-DARROW_BUILD_TESTS=ON \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_HDFS=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DPARQUET_REQUIRE_ENCRYPTION=ON \
..
$ make -j4
$ make install
$ popd
The cmake step failed with another error, so I gave up for now 😢
CMake Error at /usr/local/Cellar/cmake/3.22.2/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find Boost (missing: Boost_INCLUDE_DIR system filesystem)
(Required is at least version "1.64")
Call Stack (most recent call first):
/usr/local/Cellar/cmake/3.22.2/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
/usr/local/Cellar/cmake/3.22.2/share/cmake/Modules/FindBoost.cmake:2375 (find_package_handle_standard_args)
cmake_modules/ThirdpartyToolchain.cmake:307 (find_package)
cmake_modules/ThirdpartyToolchain.cmake:1271 (resolve_dependency)
CMakeLists.txt:542 (include)
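The trace shows CMake's own `FindBoost` module failing to locate the Boost headers and the `system` and `filesystem` components. A possible next step, untested in this write-up, is to install Boost with Homebrew and point CMake at it through `BOOST_ROOT`, a hint variable that `FindBoost` honors. The sketch below assumes Homebrew is the Boost provider; `boost_cmake_flag` is a hypothetical helper name.

```shell
# Hypothetical helper: print a -DBOOST_ROOT=... flag for the Homebrew
# Boost prefix, or fail if Boost headers are not found there.
boost_cmake_flag() {
    prefix="$(brew --prefix boost 2>/dev/null)" || return 1
    # brew can print a prefix even when the formula is not installed,
    # so confirm that the header directory really exists.
    [ -d "$prefix/include/boost" ] || return 1
    printf '%s%s\n' '-DBOOST_ROOT=' "$prefix"
}
```

After `brew install boost`, the flag can be appended to the cmake invocation above, for example `cmake "$(boost_cmake_flag)" ... ..`. Whether the Homebrew Boost version satisfies the required minimum of 1.64 should be checked separately.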
For reference, these are the environment versions used in the course; to be revisited later:
- pyarrow==13.0.0
- numpy==1.24.3
- datasets==2.14.4
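The pinned versions above can be captured in a requirements file so a later attempt starts from the same set. This is a sketch only: the file name `requirements-course.txt` is an arbitrary choice, and whether a prebuilt pyarrow 13.0.0 wheel exists for macOS 10.13.6 has not been verified here.

```shell
# Write the course versions to a requirements file (name is arbitrary).
cat > requirements-course.txt <<'EOF'
pyarrow==13.0.0
numpy==1.24.3
datasets==2.14.4
EOF

# To install (needs network access), run inside a fresh environment.
# --only-binary=pyarrow makes pip fail fast instead of starting a
# CMake source build when no pyarrow wheel matches the platform:
#   python3 -m pip install --only-binary=pyarrow -r requirements-course.txt
```

Keeping the install in a fresh environment avoids the pydantic, typing-extensions, and numpy conflicts seen with the existing one.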
References
https://arrow.apache.org/docs/developers/python.html#python-development