Question

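While trying to store a dataset in Parquet files in order to upload it to HuggingFace, I've encountered a strange phenomenon: when storing 50-byte arrays as a column, the file size is larger when using `FIXED_LEN_BYTE_ARRAY` than when using `BYTE_ARRAY`.

Here is the code that demonstrates this: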

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

tmp_dir = os.path.expanduser("~/tmp/pq") # directory for example files to be stored
os.makedirs(tmp_dir, exist_ok=True) # make sure the directory exists before writing
n_bytes = 50
n_records = 4096 * 126
np_rng = np.random.default_rng(1)
array = np.frombuffer(
    np_rng.bytes(n_bytes * n_records), dtype=np.dtype((np.bytes_, n_bytes)))
table_byte_array = pa.Table.from_pydict({'byte_data': array})
pq.write_table(table_byte_array, f'{tmp_dir}/byte_array_example.parquet')
table_fixed_len_byte_array = pa.Table.from_pydict(
    {'byte_data': array},
    schema=pa.schema([
        pa.field('byte_data', pa.binary(array.itemsize), nullable=False)]))
pq.write_table(
    table_fixed_len_byte_array,
    f'{tmp_dir}/fixed_len_byte_array_example.parquet')
file_sizes = []
for name in ['byte_array', 'fixed_len_byte_array']:
    filename = f'{tmp_dir}/{name}_example.parquet'
    table = pq.read_table(filename)
    file_size = os.path.getsize(filename)
    file_sizes.append(file_size)
    print(f'{name}: {table.schema.field("byte_data").type} {file_size}')
print(
    f'size ratio: {file_sizes[1] / file_sizes[0]:.3f}; '
    f'size diff: {file_sizes[1] - file_sizes[0]}')

Output:

byte_array: binary 25493170
fixed_len_byte_array: fixed_size_binary[50] 25850348
size ratio: 1.014; size diff: 357178

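This seems very counterintuitive: not storing the length (i.e. storing less data) results in a larger file. Why is this the case?

System used for the tests above: Ubuntu 22.04.4 LTS; Python 3.10.11; pyarrow version 12.0.1. When examining the output of `pqrs schema <name>_example.parquet --detailed`, I see the same `encodings` for both files: `encodings: RLE_DICTIONARY PLAIN RLE PLAIN`.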

Answer 1

Why file with `BYTE_ARRAY` column type is smaller?

Because it does not store all of the input data: `table_byte_array = pa.Table.from_pydict({'byte_data': array})` truncates each entry at its first NULL byte. This can be verified with the following example:

import numpy as np
import pyarrow as pa
np_array = np.array([
    b'\x00Very important data. Keep it safe!',
    b'Less important data. Keep it safe??'])
table = pa.Table.from_pydict({'sample': np_array})
print(table['sample'][0].as_py())

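One might expect it to print `b'\x00Very important data. Keep it safe!'`, but it does not: the entry is cut off at the NULL byte. Moreover, if some bytes are added after a NULL byte in the entries of `array`, the output `.parquet` file is identical, which confirms that this data is not saved in the output `.parquet` file. Since this behavior looks like a bug, I have reported it.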
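As a rough sanity check on the truncation explanation, the sketch below estimates how much payload each column type would hold for the question's data. It rests on assumptions that are not taken from the files themselves: values end up PLAIN-encoded, each `BYTE_ARRAY` value carries a 4-byte length prefix, random bytes do not compress, and page headers, dictionary pages and footer metadata are ignored. The function names are hypothetical helpers for this illustration.

```python
# Assumed parameters, copied from the question's script.
N_BYTES = 50
N_RECORDS = 4096 * 126
LENGTH_PREFIX = 4  # assumption: PLAIN BYTE_ARRAY stores a 4-byte length per value


def expected_truncated_length(n_bytes: int) -> float:
    """Expected length of a uniformly random entry cut at its first NULL byte."""
    p_non_null = 255 / 256
    # The stored length is at least k exactly when the first k bytes are non-NULL.
    return sum(p_non_null ** k for k in range(1, n_bytes + 1))


def estimate_payloads(n_bytes: int = N_BYTES, n_records: int = N_RECORDS):
    """Return (fixed-length payload, truncated variable-length payload) in bytes."""
    fixed = n_records * n_bytes
    truncated = n_records * (expected_truncated_length(n_bytes) + LENGTH_PREFIX)
    return fixed, truncated


fixed_size, byte_array_size = estimate_payloads()
print(f"FIXED_LEN_BYTE_ARRAY payload: {fixed_size}")
print(f"BYTE_ARRAY payload (truncated): {byte_array_size:.0f}")
print(f"estimated diff: {fixed_size - byte_array_size:.0f}")
```

Under these assumptions the estimated difference comes out close to the 357178-byte difference reported in the question, which is consistent with truncation (rather than encoding overhead) being the cause.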

So `FIXED_LEN_BYTE_ARRAY` is better, right?

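For some applications, possibly, but HuggingFace's datasets library does not support fixed-size binary arrays: trying to load such a file without writing a custom loading script results in `ValueError("Arrow type fixed_size_binary[50] does not have a datasets dtype equivalent.")` raised from `datasets/features/features.py`.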

How can we store an `np.ndarray` as `BYTE_ARRAY` then?

A specific workaround: in the code from the question, replace `table_byte_array = pa.Table.from_pydict({'byte_data': array})` with:

table_byte_array = pa.Table.from_pydict({
    'byte_data': pa.array(
        array, type=pa.binary(array.itemsize)
    ).cast(target_type=pa.binary())
})
