Calling pd.read_parquet() raises the following error:
$ python read_parquet.py
Traceback (most recent call last):
  File "read_parquet.py", line 3, in <module>
    df = pd.read_parquet('t1')
  File "/opt/userhome/atom_guoyanan/anaconda3/lib/python3.7/site-packages/pandas/io/parquet.py", line 281, in read_parquet
    impl = get_engine(engine)
  File "/opt/userhome/atom_guoyanan/anaconda3/lib/python3.7/site-packages/pandas/io/parquet.py", line 32, in get_engine
    raise ImportError("Unable to find a usable engine; "
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
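# read_parquet.py
import pandas as pd
df = pd.read_parquet('jd')  # path to the Parquet data
print(df.columns)           # column names
print(df.head(2).T)         # first two rows, transposed so each field prints on its own line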
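pandas needs a Parquet engine, and as the error message says, either pyarrow or fastparquet is enough. Installing the packages resolves the error (the commands below install both):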
$ conda install -c conda-forge pyarrow
$ conda install -c conda-forge fastparquet
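To confirm the installation landed in the environment that runs the script, a quick check like the one below can help. This is only a minimal sketch: the function name available_parquet_engines is made up for illustration, and it only tests whether the packages can be located. It does not verify that their versions are new enough for the installed pandas.

```python
import importlib.util

# Engine packages named in the pandas error message, in the order it lists them.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate packages that can be located in this environment.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    found = available_parquet_engines()
    if found:
        print("Parquet engines found:", ", ".join(found))
    else:
        print("No Parquet engine found; install pyarrow or fastparquet.")
```

If the list comes back empty after installing, the packages most likely went into a different conda environment than the one whose python is running the script.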
Running the script again now prints the result:
(base) [atom_xxx@kdd7216 ~]$ python read_parquet.py
Index(['company_id', 'created_at', 'degree_id', 'edited_at',
       'experience_begin', 'experience_end', 'gender', 'level', 'number',
       'origin_company_id', 'position_id', 'real_company_id', 'refreshed_at',
       'salary_begin', 'salary_end', 'status', 'updated_at', 'user_id'],
      dtype='object')
                                       0                      1
company_id                       1228459                1228463
created_at         2013-12-27 16:01:33.0  2014-02-25 13:11:16.0
degree_id                              2                      1
edited_at          2018-07-16 13:36:24.0  2018-07-16 12:31:39.0
experience_begin                       2                      1
experience_end                        10                      0
gender                                 0                      0
level                                  1                      1
number                                10                      1
origin_company_id                1228459                1228463
position_id                          211                    469
real_company_id                        0                      0
refreshed_at       2018-07-16 13:36:24.0  2018-07-16 12:31:39.0
salary_begin                           0                  10000
salary_end                             0                  15000
status                                 4                      4
updated_at         2018-07-16 13:36:34.0  2018-07-16 12:31:39.0
user_id                               95                    270