I'm using the jupyter/pyspark-notebook Docker image to develop a Spark script. My Dockerfile looks like this:
FROM jupyter/pyspark-notebook
USER root
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && rm requirements.txt
# this is a default user and the image is configured to use it
ARG NB_USER=jovyan
ARG NB_UID=1000
ARG NB_GID=100
ENV USER ${NB_USER}
ENV HOME /home/${NB_USER}
RUN groupadd -f ${USER} && \
    chown -R ${USER}:${USER} ${HOME}
USER ${NB_USER}
RUN export PACKAGES="io.delta:delta-core_2.12:1.0.0"
RUN export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
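Side note on those last two RUN lines: each RUN executes in its own shell, so I suspect the exports never persist into the running container (and the delta-core 1.0.0 coordinate wouldn't match the delta-spark 2.1.0 pin below anyway). If they're meant to stick, something like this ENV form is presumably what's needed (untested on my side):

# Hypothetical replacement: ENV persists into the container,
# whereas a RUN export is discarded when its layer's shell exits.
ENV PACKAGES="io.delta:delta-core_2.12:1.0.0"
ENV PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"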
My requirements.txt:
delta-spark==2.1.0
deltalake==0.10.1
jupyterlab==4.0.6
pandas==2.1.0
pyspark==3.3.3
I build and run the image via docker compose, and then attempt to run this in a notebook:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("LocalDelta") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
and get the following error:
AttributeError Traceback (most recent call last)
Cell In[2], line 2
1 import pyspark
----> 2 from delta import *
4 builder = pyspark.sql.SparkSession.builder.appName("LocalDelta") \
5     .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
6     .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
8 spark = configure_spark_with_delta_pip(builder).getOrCreate()
File /opt/conda/lib/python3.11/site-packages/delta/__init__.py:17
1 #
2 # Copyright (2021) The Delta Lake Project Authors.
3 #
(...)
14 # limitations under the License.
15 #
---> 17 from delta.tables import DeltaTable
18 from delta.pip_utils import configure_spark_with_delta_pip
20 __all__ = ['DeltaTable', 'configure_spark_with_delta_pip']
File /opt/conda/lib/python3.11/site-packages/delta/tables.py:21
1 #
2 # Copyright (2021) The Delta Lake Project Authors.
3 #
(...)
14 # limitations under the License.
15 #
17 from typing import (
18 TYPE_CHECKING, cast, overload, Any, Iterable, Optional, Union, NoReturn, List, Tuple
19 )
---> 21 import delta.exceptions # noqa: F401; pylint: disable=unused-variable
22 from delta._typing import (
23 ColumnMapping, OptionalColumnMapping, ExpressionOrColumn, OptionalExpressionOrColumn
24 )
26 from pyspark import since
File /opt/conda/lib/python3.11/site-packages/delta/exceptions.py:166
162 utils.convert_exception = convert_delta_exception
165 if not _delta_exception_patched:
--> 166 _patch_convert_exception()
167 _delta_exception_patched = True
File /opt/conda/lib/python3.11/site-packages/delta/exceptions.py:154, in _patch_convert_exception()
149 def _patch_convert_exception() -> None:
150 """
151 Patch PySpark's exception convert method to convert Delta's Scala concurrent exceptions to the
152 corresponding Python exceptions.
153 """
--> 154 original_convert_sql_exception = utils.convert_exception
156 def convert_delta_exception(e: "JavaObject") -> CapturedException:
157 delta_exception = _convert_delta_exception(e)
AttributeError: module 'pyspark.sql.utils' has no attribute 'convert_exception'
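If I'm reading the traceback right, delta patches pyspark.sql.utils.convert_exception at import time, and that attribute simply isn't present in whatever pyspark the kernel loaded. A quick check in the same kernel (my own addition, not part of the traceback) confirms it:

import pyspark.sql.utils as utils
# convert_exception exists on pyspark 3.3.x, which is where
# delta-spark 2.x expects to find it; the AttributeError above
# implies this prints False in my environment.
print(hasattr(utils, "convert_exception"))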
It looks like a version incompatibility between Spark and Delta, but I haven't been able to find anything on Stack Overflow or anywhere else to point me in the right direction. I based my setup on this example:
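In case it helps, this is what I've been using to check which versions are actually in play, since the image ships its own pyspark and the pip pin may not be the one the kernel imports (a diagnostic sketch, not something from the original setup):

import importlib.metadata as md
import pyspark

print(md.version("pyspark"))      # what pip reports installed (pinned to 3.3.3)
print(md.version("delta-spark"))  # pinned to 2.1.0
print(pyspark.__version__)        # the module the kernel actually resolved
print(pyspark.__file__)           # and where it was imported from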
Any help would be much appreciated.