Learning Python Pandas: the HDFStore warning when storing data (your performance may suffer as PyTables will pickle....) - python黑洞网

Author: 听爸爸的话


Published 2019-08-28 11:41

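HDFStore is a dictionary-like container for tabular data: it can hold many DataFrames inside a single object.

Each DataFrame is stored under a key, and you use that key to access its data.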
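The key-based access described above can be sketched as follows. This is a minimal illustration rather than the code from this post: the two frames, the key names and the temporary file path are all made up, and it assumes PyTables (the `tables` package) is installed.

```python
import os
import tempfile

import pandas as pd

# Two small, made-up frames to store side by side (numeric columns only).
users = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
orders = pd.DataFrame({"order_id": [10, 11], "amount": [9.99, 24.50]})

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_store.h5")

# The context manager closes the file even if an error occurs.
with pd.HDFStore(path, mode="w") as store:
    store.put("users", users)      # explicit put under a key
    store["orders"] = orders       # dict-style assignment also works
    stored_keys = sorted(store.keys())
    users_back = store.get("users")
    orders_back = store["orders"]

print(stored_keys)  # keys are reported with a leading slash
```

Using `with` instead of a manual `close()` is a design choice here: it avoids leaving the HDF5 file open if something fails between opening and closing.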

However, if you store data in HDF without specifying a format and the data contains string columns, you get the following warning:


PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['user_id', 'version', 'country']]
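A warning of this kind can be reproduced on a small scale. The sketch below is an assumption about the cause, not the data from this post: it builds a hypothetical object column that mixes a string, an integer and a missing value, writes it with the default (fixed) format, and records any `PerformanceWarning` that pandas raises. It assumes PyTables is installed.

```python
import os
import tempfile
import warnings

import pandas as pd
from pandas.errors import PerformanceWarning

# Hypothetical data: an object column mixing str, int and a missing value.
df = pd.DataFrame({"user_id": pd.Series(["u1", 2, None], dtype=object)})

path = os.path.join(tempfile.mkdtemp(), "mixed_demo.h5")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure no warning is suppressed
    with pd.HDFStore(path, mode="w") as store:
        store.put("demo", df)  # no format given, so the default 'fixed' is used

# Keep only the warnings of the category shown in the message above.
perf_warnings = [w for w in caught if issubclass(w.category, PerformanceWarning)]
print(len(perf_warnings) > 0)
```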

In my data these columns all hold strings. It is only a warning and the data is still written, but reading it back raises an error:


Traceback (most recent call last):
  File "/Users/guojicheng/Desktop/Python/3/Projects/CsvToDatabase4/maintest-hdf.py", line 28, in <module>
    df = hstore.get('eos_20190612')
  File "/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 695, in get
    return self._read_group(group)
  File "/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1423, in _read_group
    return s.read(**kwargs)
  File "/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2995, in read
    start=_start, stop=_stop)
  File "/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2540, in read_array
    ret = node[0][start:stop]
  File "/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 821, in read
    listarr = self._read_array(start, stop, step)
  File "tables/hdf5extension.pyx", line 2155, in tables.hdf5extension.VLArray._read_array
ValueError: cannot set WRITEABLE flag to True of this array
Closing remaining open files:eos_data.h5...done

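This error follows from the situation the warning describes: the store pickled these object columns, and the traceback shows the read failing inside PyTables' `vlarray.py` while loading them. Note also that `inferred_type->mixed` in the warning suggests the columns contain more than plain strings, for example missing values or numbers alongside them. So why do string columns cause trouble?

My understanding at the time was that our strings are unicode by default and that HDF does not handle unicode well, so I first tried converting the string columns: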


 
df['user_id'] = df['user_id'].str.decode('utf-8')
df['version'] = df['version'].str.decode('utf-8')
df['country'] = df['country'].str.decode('utf-8')

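After this conversion the warning no longer appears. But when I read the data back, these columns had all become float64. The likely reason is that `.str.decode` is meant for bytes values: applied to Python 3 `str` values it returns NaN, so the columns are written as float64 NaN and the original strings are lost rather than converted. Recovering usable strings from that is a lot of trouble, so I do not recommend this approach.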
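The float64 result mentioned above can be checked without touching HDF at all. The following is a small sketch with made-up values: it applies `.str.decode` once to `str` values and once to `bytes` values, which are the input that method is designed for.

```python
import pandas as pd

# Hypothetical string values, as read_csv would typically produce in Python 3.
text = pd.Series(["1001", "1002"], dtype=object)
decoded = text.str.decode("utf-8")  # str has no .decode, so each value becomes NaN
print(decoded.isna().all())

# bytes values are what .str.decode is meant for.
raw = pd.Series([b"1001", b"1002"], dtype=object)
decoded_bytes = raw.str.decode("utf-8").tolist()
print(decoded_bytes)
```

If the first check prints `True` on your installation, the "conversion" has replaced every string with a missing value, which would explain both the vanished warning and the float64 dtype.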

There is another approach: specify the storage format when writing the data. This way no conversion is needed:


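import pandas as pd

# Open (and overwrite) the store, then load the CSV into a DataFrame
hstore = pd.HDFStore('eos_data.h5', mode='w')
df = pd.read_csv('testdata.csv')
print(df.dtypes)
# format='table' is specified explicitly instead of the default fixed format
hstore.put('eos_20190612', df, format='table', append=False)
hstore.close()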

Above I set the format to table, and this produces no warning. Now let's read the data back and check whether the column types have changed:


hstore = pd.HDFStore('eos_data.h5', mode='r')
df = hstore.get('eos_20190612')
print(df.dtypes)
hstore.close()

The result is:


 
user_id                               object
version                               object
country                               object
dtype: object
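For a self-contained check of the table-format round trip, here is a minimal sketch. It does not use the post's `testdata.csv` (which is not shown); the three rows, the key `eos_demo` and the temporary path are invented for illustration. It uses `DataFrame.to_hdf` and `pd.read_hdf` as an alternative to managing an `HDFStore` by hand, and assumes PyTables is installed.

```python
import os
import tempfile
import warnings

import pandas as pd
from pandas.errors import PerformanceWarning

# Made-up rows standing in for testdata.csv, which is not shown in the post.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "version": ["1.0", "1.1", "2.0"],
    "country": ["CN", "US", "DE"],
})

path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Write with format='table', then read the same key back.
    df.to_hdf(path, key="eos_demo", mode="w", format="table")
    restored = pd.read_hdf(path, key="eos_demo")

perf_warnings = [w for w in caught if issubclass(w.category, PerformanceWarning)]
print(len(perf_warnings))            # number of pickling warnings recorded
print(restored["country"].tolist())  # string values after the round trip
```

One caveat worth hedging: this sketch uses columns that contain only strings. If a column truly mixes strings with other Python objects, the table format may refuse to serialize it, so cleaning such columns first (for example with `astype(str)`) may still be necessary.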