I have a PyArrow Parquet file that is too large to fit into memory. Because the data splits easily along a partition key, I am happy to partition it manually and build a PyArrow dataset from the file. However, even after partitioning, the rows inside each partition still need to be re-sorted so that the data can later be read back in its natural order.
In case it is relevant: the partition key is `chain_id`, and the schema is:
```python
import pyarrow as pa

pa.schema([
    ("chain_id", pa.uint32()),
    ("pair_id", pa.uint64()),
    ("block_number", pa.uint32()),
    ("timestamp", pa.timestamp("s")),
    ("tx_hash", pa.binary(32)),
    ("log_index", pa.uint32()),
])
```
I plan to use the following process (a rough code sketch follows the list):
- Determine partition ids beforehand (`chain_id` in the above schema)
- Open a new dataset for writing
- For each partition id:
  - Create an in-memory temporary PyArrow table
  - Read (iterate batches of) the source Parquet file
  - Identify rows belonging to this partition and add them to the in-memory table
  - Sort the rows in memory
  - Append the table to the dataset
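For concreteness, here is a rough sketch of that process as I currently imagine it, assuming pyarrow >= 7 (for `Table.sort_by`). The file paths, the list of partition ids, and the sort keys (`block_number`, `log_index`) are placeholders of my own, not a known-good solution:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Placeholder paths -- adjust to your environment.
SOURCE_PATH = "trades.parquet"
DATASET_DIR = "trades_partitioned"

source = pq.ParquetFile(SOURCE_PATH)

# Partition ids are assumed to be known beforehand.
chain_ids = [1, 56, 137]

for chain_id in chain_ids:
    # Collect only this partition's rows, streaming the source file
    # batch by batch so the whole file is never held in RAM at once.
    batches = []
    for batch in source.iter_batches(batch_size=64_000):
        mask = pc.equal(batch.column("chain_id"), chain_id)
        filtered = batch.filter(mask)
        if filtered.num_rows:
            batches.append(filtered)
    if not batches:
        continue

    # A single partition is assumed to fit into memory.
    table = pa.Table.from_batches(batches)

    # Restore the natural order inside the partition
    # (the sort keys here are an assumption on my part).
    table = table.sort_by([("block_number", "ascending"), ("log_index", "ascending")])

    # Append this partition to a Hive-style partitioned dataset on disk.
    ds.write_dataset(
        table,
        DATASET_DIR,
        format="parquet",
        partitioning=ds.partitioning(pa.schema([("chain_id", pa.uint32())]), flavor="hive"),
        existing_data_behavior="overwrite_or_ignore",
    )
```

With this layout, each partition ends up as its own `chain_id=<value>/` directory, which is what I mean below by appending a full table to the dataset.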
However, I found the PyArrow documentation on `FileSystemDataset` hard to follow on these points. My questions are roughly the following:
- How do I add full tables to a `FileSystemDataset`, assuming each table is the whole content of a partition by itself?
- Are there any existing tools to partition (PyArrow) Parquet files into datasets, without needing to write a manual script?
- What kind of ordering guarantees does `pyarrow.dataset.FileSystemDataset` give, assuming I always want to read the data back in the inserted, presorted order with `to_batches`? (A sketch of how I intend to read is below.)
- Any other PyArrow tips for dealing with data and datasets that do not fit into RAM?
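For context, this is roughly how I intend to read the data back, streaming one partition with `to_batches` (the directory name and filter value are again placeholders):

```python
import pyarrow.dataset as ds

# Placeholder directory written by the partitioning step above.
dataset = ds.dataset("trades_partitioned", format="parquet", partitioning="hive")

# Stream one partition; my question is whether these batches come back
# in the presorted order the rows were written in.
for batch in dataset.to_batches(filter=ds.field("chain_id") == 1):
    ...  # process one RecordBatch at a time
```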