Question

I have a challenge that I am not sure how to solve with a Spark Scala DataFrame.

I have a DataFrame that has, among many columns, two key columns: brand (3 distinct values) and customerId (~700K distinct values).

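When I write it out with partitionBy(brand), I get 3 folders, one per brand, with about 200 files in each folder. Each file can contain data for many customers, and the data for a single customerId can be spread across different files.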

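The challenge is to partition by brand while ensuring that the data for a single customerId is completely contained in a single file. If a file contains data for more than one customer, that is fine, but each customer's data must be complete within that file.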

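I could use partitionBy(brand, customerId), but that would create 3 top-level folders and then 700K subfolders, which I don't want because it is unmanageable.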

I need 3 folders with about 100 files in each. The data for a given customerId must be fully contained in a single file, though a single file may hold the complete data for multiple customerIds.

Any help is much appreciated!

Answer 1

You can add a logical bucketing column, new_col, derived from customerId.
For example, for customerId = (1,2,3,4,5,6)
you could generate new_col = (1,1,1,2,2,2).

Group the rows by new_col so that all rows sharing a value are kept together;
this keeps the data for the same (or neighbouring) customerIds in one place.

Then reduce the number of files in each folder with coalesce(100).
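
The idea above might be sketched roughly as follows. This is only an illustration under several assumptions: the helper writeBucketed and its parameters numBuckets and outputPath are hypothetical, Parquet is an assumed output format (the question does not name one), and the bucket is derived by hashing customerId rather than by the range-style grouping shown in the example above. It also passes the target count to repartition instead of calling coalesce afterwards.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

object BucketedWrite {
  // Hypothetical helper; numBuckets and outputPath are illustrative parameters.
  def writeBucketed(df: DataFrame, numBuckets: Int, outputPath: String): Unit = {
    // Derive a bucket id in [0, numBuckets) from customerId.
    // Assumption: hash bucketing stands in for the answer's new_col.
    val withBucket = df.withColumn(
      "new_col",
      pmod(hash(col("customerId")), lit(numBuckets))
    )

    withBucket
      // All rows with the same (brand, new_col) land in the same Spark
      // partition, so every customerId within a brand ends up in one partition.
      .repartition(numBuckets, col("brand"), col("new_col"))
      .drop("new_col")
      .write
      // One folder per brand; each Spark partition writes at most one file
      // into each brand folder.
      .partitionBy("brand")
      .parquet(outputPath) // assumed format
  }
}
```

Called as, say, BucketedWrite.writeBucketed(df, 100, "/some/output/path"), this should give at most 100 files per brand folder, though possibly fewer, because several (brand, new_col) pairs can hash to the same partition.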
