Spark: Write all related data into a single file

I have a challenge, and I don't know how to solve it with a Spark Scala DataFrame.

I have a DataFrame with many columns, of which the 2 key columns are brand (3 distinct values) and customerId (~700K distinct values).

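When I write this DataFrame with partitionBy(brand), I get 3 folders, 1 per brand, each containing about 200 files. Each file can contain data for many customers, and the data for a given customerId may be spread across different files.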

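The challenge is to partition by brand while making sure that the data for a single customerId is fully contained in a single file. It is fine if a file contains data for multiple customers, but each customer's data must be complete within that file.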

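I could use partitionBy(brand, customerId), but that would produce 3 top-level folders with 700K subfolders beneath them, which I don't want because it would be unmanageable.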

I need 3 folders with about 100 files in each. The data for a given customerId must be fully contained in a single file, although a single file may hold the complete data for multiple customerIds.

Any help is much appreciated!

Answer
You can add a logical column derived from customerId, for example new_col.
For instance, if customerId = (1,2,3,4,5,6),
you can generate new_col = (1,1,1,2,2,2).
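
One way to produce the mapping above is integer division of the id by a fixed bucket size. This is only a sketch: the object name `Bucketing`, the function `bucketOf`, and the bucket size of 3 are illustrative assumptions, and it assumes customerId is a positive integer.

```scala
object Bucketing {
  // Hypothetical helper: maps ids 1..bucketSize to bucket 1,
  // the next bucketSize ids to bucket 2, and so on.
  // Assumes customerId >= 1 and bucketSize >= 1.
  def bucketOf(customerId: Long, bucketSize: Long): Long =
    (customerId - 1) / bucketSize + 1
}

// Usage: reproduces the example mapping from the answer.
val ids = Seq(1L, 2L, 3L, 4L, 5L, 6L)
val newCol = ids.map(id => Bucketing.bucketOf(id, 3L))
println(newCol.mkString("(", ",", ")"))
```

If the ids are not dense integers, hashing the id and taking it modulo the desired number of buckets serves the same purpose.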

Then group the data on new_col,
so that all rows with the same or similar customerId end up together.
 
Finally, reduce the number of files in each folder using coalesce(100).
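
Note that coalesce only merges existing partitions and does not move rows by key, so on its own it may not bring a customer's rows together. A minimal sketch of a variant that does: `repartition` on columns hash-partitions the rows, which plays the role of new_col here. The helper name `writeByBrand`, the output path parameter, and the Parquet format are assumptions; the column names brand and customerId come from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object WriteByBrand {
  // Hypothetical helper; assumes `df` has columns "brand" and "customerId".
  def writeByBrand(df: DataFrame, outputPath: String, filesPerBrand: Int = 100): Unit = {
    df
      // Hash-partition so that all rows sharing (brand, customerId)
      // land in the same one of `filesPerBrand` partitions.
      .repartition(filesPerBrand, col("brand"), col("customerId"))
      .write
      // One folder per brand; each write task emits at most one file
      // per brand it holds, so each folder gets at most `filesPerBrand` files.
      .partitionBy("brand")
      .parquet(outputPath) // Parquet is an assumed format
  }
}
```

This relies on the writer not splitting a task's output further, which holds as long as spark.sql.files.maxRecordsPerFile is left at its default of 0. File sizes may be uneven if some customers have far more rows than others.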



