Question

I've read a dozen pages of documentation, and it seems that:

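I can skip learning Scala.
The API is completely implemented in Python (I don't need to learn Scala for anything).
The interactive mode works as completely and as quickly as it does in Scala, and troubleshooting is equally easy.
Python modules like NumPy can still be imported (no crippled Python environment).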

Are there fall-short areas that will make it impossible?

Answer 1

In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).

My earlier answers are reproduced below:

Original answer as of Spark 0.9

A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):

Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
Spark 0.8.1 added support for persist(), sample(), and sort().
The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
Spark 0.9 also adds Python bindings for MLlib (docs).
I've implemented tools to help keep the Java API up-to-date.

As of Spark 0.9, the main missing features in PySpark are:

zip() / zipPartitions.
Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
Support for running on YARN clusters.
Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
Support for job cancellation.

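Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. There is an open thread on the Spark users mailing list discussing its current performance.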

If you discover any missing features in PySpark, please report them.

Original answer as of Spark 0.7.2:

The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.

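The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then, and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/sstart-developers/TMGvtxYN9/MoeFpD17VeAIJ . In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.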
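
To illustrate the general idea of finding missing features automatically, here is a minimal sketch in Python. It is not the technique proposed in the thread, which is not reproduced here; it simply diffs the public method names of two classes. `ReferenceRDD` and `WrapperRDD` are hypothetical stand-ins invented for this example, not real Spark classes.

```python
import inspect


def public_methods(cls):
    """Return the names of public callables defined on or inherited by cls."""
    return {
        name
        for name, member in inspect.getmembers(cls, callable)
        if not name.startswith("_")
    }


def missing_methods(reference_cls, wrapper_cls):
    """Names present on reference_cls but absent from wrapper_cls, sorted."""
    return sorted(public_methods(reference_cls) - public_methods(wrapper_cls))


# Hypothetical stand-ins for a full API and a wrapper that lags behind it.
class ReferenceRDD:
    def map(self, f): ...
    def filter(self, f): ...
    def sample(self, fraction): ...
    def zip(self, other): ...


class WrapperRDD:
    def map(self, f): ...
    def filter(self, f): ...


if __name__ == "__main__":
    # Lists the methods the wrapper still needs to expose.
    print(missing_methods(ReferenceRDD, WrapperRDD))
```

A name-only comparison like this will not catch signature differences or methods that were renamed on purpose, so its output is best treated as a to-do list to review by hand.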

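Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would let you specify the types of your objects and thereby use faster, specialized serializers; I hope to resume work on this at some point.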
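
As a rough illustration of what "binary pickling plus batching" means, here is a minimal sketch using only the standard library. It is not PySpark's actual serializer: the function names and the `batch_size` parameter are made up for this example, and it uses Python 3's `pickle` module in place of the Python 2 `cPickle` mentioned above. The idea is to pickle records in groups, so that the per-call overhead is paid once per batch rather than once per record.

```python
import io
import pickle
from itertools import islice


def dump_batched(items, stream, batch_size=1024):
    """Pickle items in lists of up to batch_size, one pickle frame per batch."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # HIGHEST_PROTOCOL selects a binary pickle format.
        pickle.dump(batch, stream, protocol=pickle.HIGHEST_PROTOCOL)


def load_batched(stream):
    """Yield items one at a time from a stream written by dump_batched."""
    while True:
        try:
            batch = pickle.load(stream)
        except EOFError:
            # No more frames in the stream.
            return
        for item in batch:
            yield item


if __name__ == "__main__":
    buffer = io.BytesIO()
    records = ({"id": i, "tags": ("a", "b")} for i in range(5))
    dump_batched(records, buffer, batch_size=2)
    buffer.seek(0)
    print(list(load_batched(buffer)))
```

Because generic pickling needs no type registration, it works for arbitrary objects; a specialized serializer trades that convenience for speed on known types.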

PySpark is implemented using a regular CPython interpreter, so libraries like NumPy should work fine (this wouldn't have been the case if PySpark had been written in Jython).

It's pretty easy to get started with PySpark; simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer and will let you evaluate its interactive features. If you like to use IPython, you can run IPYTHON=1 ./pyspark in your shell to launch PySpark with an IPython shell.

Answer 2

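I'd like to add some points about why many people who have used both APIs recommend the Scala API. It's hard for me to do this without just pointing out general weaknesses of Python vs Scala and my own distaste for dynamically typed and interpreted languages for writing production-quality code. So here are some reasons specific to the use case: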

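Performance will never be quite as good as with Scala, not by orders of magnitude but by fractions; this is partly because Python is interpreted. This gap may widen in the future as Java 8 and JIT technology become part of the JVM and Scala.
Spark is written in Scala, so debugging Spark applications, learning how Spark works, and learning how to use Spark are all much easier in Scala, because you can quite easily CTRL + B into the source code and read the lower levels of Spark to work out what is going on. I find this particularly useful for optimizing jobs and debugging more complicated applications.
Now, my final point may seem like just a Scala vs Python argument, but it's highly relevant to the specific use case, that is, scale and parallel processing. Scala stands for Scalable Language, and many interpret this to mean it was specifically designed with scaling and easy multithreading in mind. It's not just about lambdas; it's the head-to-toe features of Scala that make it the perfect language for Big Data and parallel processing. I have some data science friends who are used to Python and don't want to learn a new language, so they stick to what they know. Python is a scripting language that was not designed for this specific use case; it's a capable tool, but the wrong one for this job. The result is obvious in the code: theirs is often 2 - 5x longer than my Scala code because Python lacks a lot of features. Furthermore, they find it harder to optimize their code because they are further away from the underlying framework.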

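Let me put it this way: if someone knows both Scala and Python, they will nearly always choose to use the Scala API. The only people who use Python are those who simply do not want to learn Scala.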
