c# hdf5 write string: Storing string datasets in HDF5 with Unicode

I am trying to store variable-length strings read from a file that contains special characters such as ø, æ, and å. Here is my code:

import h5py as h5
file = h5.File('deleteme.hdf5','a')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text",(1,),dtype=dt)
dset.attrs[str(1)] = "some text with ø, æ, å"

However, the text does not appear to be stored properly. The stored data shows up as:

"some text with \37777777703\37777777670, \37777777703\37777777646,\37777777703\37777777645"

How can I store the special characters properly? I have tried to follow the guide provided in the documentation here: Strings in HDF5 - Variable-length UTF-8

Edit:

The output was from h5dump. The answer below verified that the characters are properly stored as utf-8.

Solution

With:

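import h5py as h5

# Write a variable-length str dataset plus a string attribute
file = h5.File('deleteme.hdf5', 'w')
dt = h5.special_dtype(vlen=str)
dset = file.create_dataset("text", (3,), dtype=dt)
dset[:] = 'ø æ å'.split()
dset.attrs["1"] = "some text with ø, æ, å"
file.close()

# Read both back. Note: on h5py >= 3.0 the dataset is returned as bytes;
# use file['text'].asstr()[:] there to get str objects.
file = h5.File('deleteme.hdf5', 'r')
print(file['text'][:])
print(file['text'].attrs["1"])
file.close()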

I see:

$ python3 stack44661467.py
['ø' 'æ' 'å']
some text with ø, æ, å

That is, h5py treats the strings as Unicode both when writing and when reading.

With the dump utility:

$ h5dump deleteme.hdf5
HDF5 "deleteme.hdf5" {
GROUP "/" {
   DATASET "text" {
      DATATYPE H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777703\37777777670", "\37777777703\37777777646",
      (2): "\37777777703\37777777645"
      }
      ATTRIBUTE "1" {
         DATATYPE H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE SCALAR
         DATA {
         (0): "some text with \37777777703\37777777670, \37777777703\37777777646, \37777777703\37777777645"
         }
      }
   }
}
}

Note that in both cases (the dataset and the attribute) the datatype is marked as UTF-8:

DATATYPE H5T_STRING {
   STRSIZE H5T_VARIABLE;
   STRPAD H5T_STR_NULLTERM;
   CSET H5T_CSET_UTF8;
   CTYPE H5T_C_S1;
}

That's what the docs say:

They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.

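A sequence such as \37777777703\37777777670 is just how h5dump prints the non-ASCII bytes of the UTF-8 encoding: the low 8 bits of the two octal escapes are 0xC3 and 0xB8, which is ø in UTF-8. Let h5py (or any other UTF-8-aware reader) worry about decoding those bytes into the proper Unicode character.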
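To check this without opening the file in h5py, the escapes in the h5dump output can be decoded by hand. The following is a minimal sketch using only the standard library; the helper name `decode_h5dump_escapes` is made up for illustration, and it assumes that each `\NNN...` escape is a sign-extended octal rendering of a single byte (which matches the values shown above) and that no escape is immediately followed by a literal digit.

```python
import re

def decode_h5dump_escapes(text: str) -> str:
    """Convert h5dump-style octal escapes back into the UTF-8 text they encode.

    Hypothetical helper: assumes every backslash escape is an octal number
    whose low 8 bits are one byte of the stored string.
    """
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-7]+)", text):
        # Copy the plain text that precedes this escape unchanged
        out += text[pos:match.start()].encode("utf-8")
        # Keep only the low byte of the sign-extended octal value
        out.append(int(match.group(1), 8) & 0xFF)
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return out.decode("utf-8")

dumped = (r"some text with \37777777703\37777777670, "
          r"\37777777703\37777777646, \37777777703\37777777645")
print(decode_h5dump_escapes(dumped))  # some text with ø, æ, å
```

The decoded string matching the original attribute value shows that the file holds valid UTF-8 and that only the h5dump display differs.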
