Partitioning PyArrow Parquet file and writing it out sorted to a dataset

I have a PyArrow Parquet file that is too large to fit in memory. Because the data splits easily into different shards, I would like to partition it manually and create a PyArrow dataset from the file. Although I am partitioning, the rows within each partition themselves need to be sorted, so that the data can later be read back in its natural order.

In case it is relevant, here is the schema. The partition key is chain_id:

import pyarrow as pa

pa.schema([
    ("chain_id", pa.uint32()),
    ("pair_id", pa.uint64()),
    ("block_number", pa.uint32()),
    ("timestamp", pa.timestamp("s")),
    ("tx_hash", pa.binary(32)),
    ("log_index", pa.uint32()),
])

I plan to use the following process (a rough sketch follows the list):

  • Determine partition ids beforehand (chain_id in the above schema)
  • Open a new dataset for writing
  • For each partition id
    • Create an in-memory temporary PyArrow table
    • Read (iterate batches) the source Parquet file
      • Identify rows belonging to this partition and add them to the in-memory table
    • Sort the rows in memory
    • Append table to the dataset
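As a rough sketch of the process above (assuming a hypothetical source file source.parquet, an output directory dataset/, and that the natural row order within a partition is block_number, log_index), this could look roughly like the following:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

source = pq.ParquetFile("source.parquet")

# Determine partition ids beforehand; this assumes the chain_id column
# alone fits into memory.
chain_ids = pc.unique(source.read(columns=["chain_id"])["chain_id"]).to_pylist()

for chain_id in chain_ids:
    # Read the source file batch by batch and keep only the rows
    # belonging to this partition.
    batches = []
    for batch in source.iter_batches():
        filtered = batch.filter(pc.equal(batch.column("chain_id"), chain_id))
        if filtered.num_rows:
            batches.append(filtered)

    # Build the in-memory table for the partition and sort it into its
    # assumed natural order (block_number, log_index).
    table = pa.Table.from_batches(batches)
    table = table.sort_by([("block_number", "ascending"), ("log_index", "ascending")])

    # Write the sorted table into a Hive-style partitioned dataset; each
    # chain_id ends up in its own dataset/chain_id=<value>/ directory.
    ds.write_dataset(
        table,
        "dataset",
        format="parquet",
        partitioning=["chain_id"],
        partitioning_flavor="hive",
        existing_data_behavior="overwrite_or_ignore",
    )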

However, the documentation for FileSystemDataset in PyArrow is sparse. My questions are roughly:

  • How do I add full tables to a FileSystemDataset, assuming each table is the whole content of a partition?
  • Are there any existing tools to partition (PyArrow) Parquet files into datasets, without needing to write a manual script?
  • What kind of ordering guarantees does pyarrow.dataset.FileSystemDataset give, assuming I always want to read the data back with to_batches in the presorted order in which it was inserted?
  • Any other PyArrow tips for dealing with data and datasets that do not fit into RAM?
Best answer

Dataset reads do preserve order; see http://arrow.apache.org/docs/python/api/dataset.html.
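To illustrate, a minimal read-back sketch (assuming the hypothetical dataset/ directory from the question, a hypothetical chain_id value of 1, and single-threaded scanning so fragments and row groups are read sequentially):

import pyarrow.dataset as ds

dataset = ds.dataset("dataset", format="parquet", partitioning="hive")

# With use_threads=False the fragments are scanned one after another, so
# within a partition the batches come back in the order the rows were written.
for batch in dataset.to_batches(filter=(ds.field("chain_id") == 1), use_threads=False):
    ...  # process the presorted rows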

My own partitioning utility does something similar to what you need. It handles the partitioning and sorting, and can be tweaked further or used as a source of ideas.

Answers

No answers yet.



