I have a PyArrow Parquet file that is too large to fit into memory. Because the data splits easily along a partition key, I am happy to partition it manually and build a PyArrow dataset from the file. However, even after partitioning, the rows inside each partition still need to be re-sorted so that the data can later be read back in its natural order.
In case it is relevant: the partition key is `chain_id`, and the schema is:
```python
import pyarrow as pa

pa.schema([
    ("chain_id", pa.uint32()),
    ("pair_id", pa.uint64()),
    ("block_number", pa.uint32()),
    ("timestamp", pa.timestamp("s")),
    ("tx_hash", pa.binary(32)),
    ("log_index", pa.uint32()),
])
```
I plan to use the following process (a rough code sketch follows the list):
- Determine partition ids beforehand (`chain_id` in the above schema)
- Open a new dataset for writing
- For each partition id:
  - Create an in-memory temporary PyArrow table
  - Read (iterate batches of) the source Parquet file
  - Identify rows belonging to this partition and add them to the in-memory table
  - Sort the rows in memory
  - Append the table to the dataset
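For concreteness, here is a rough sketch of that process as I currently imagine it, assuming pyarrow >= 7 (for `Table.sort_by`). The file paths, the list of partition ids, and the sort keys (`block_number`, `log_index`) are placeholders of my own, not a known-good solution:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Placeholder paths -- adjust to your environment.
SOURCE_PATH = "trades.parquet"
DATASET_DIR = "trades_partitioned"

source = pq.ParquetFile(SOURCE_PATH)

# Partition ids are assumed to be known beforehand.
chain_ids = [1, 56, 137]

for chain_id in chain_ids:
    # Collect only this partition's rows, streaming the source file
    # batch by batch so the whole file is never held in RAM at once.
    batches = []
    for batch in source.iter_batches(batch_size=64_000):
        mask = pc.equal(batch.column("chain_id"), chain_id)
        filtered = batch.filter(mask)
        if filtered.num_rows:
            batches.append(filtered)
    if not batches:
        continue

    # A single partition is assumed to fit into memory.
    table = pa.Table.from_batches(batches)

    # Restore the natural order inside the partition
    # (the sort keys here are an assumption on my part).
    table = table.sort_by([("block_number", "ascending"), ("log_index", "ascending")])

    # Append this partition to a Hive-style partitioned dataset on disk.
    ds.write_dataset(
        table,
        DATASET_DIR,
        format="parquet",
        partitioning=ds.partitioning(pa.schema([("chain_id", pa.uint32())]), flavor="hive"),
        existing_data_behavior="overwrite_or_ignore",
    )
```

With this layout, each partition ends up as its own `chain_id=<value>/` directory, which is what I mean below by appending a full table to the dataset.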
However, I found the PyArrow documentation on `FileSystemDataset` hard to follow on these points. My questions are roughly the following:
- How do I add full tables to a `FileSystemDataset`, assuming each table is the whole content of a partition by itself?
- Are there any existing tools to partition (PyArrow) Parquet files into datasets, without needing to write a manual script?
- What kind of ordering guarantees does `pyarrow.dataset.FileSystemDataset` give, assuming I always want to read the data back in the inserted, presorted order with `to_batches`? (A sketch of how I intend to read is below.)
- Any other PyArrow tips for dealing with data and datasets that do not fit into RAM?
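For context, this is roughly how I intend to read the data back, streaming one partition with `to_batches` (the directory name and filter value are again placeholders):

```python
import pyarrow.dataset as ds

# Placeholder directory written by the partitioning step above.
dataset = ds.dataset("trades_partitioned", format="parquet", partitioning="hive")

# Stream one partition; my question is whether these batches come back
# in the presorted order the rows were written in.
for batch in dataset.to_batches(filter=ds.field("chain_id") == 1):
    ...  # process one RecordBatch at a time
```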