Spark version: 3.2

I have a pandas UDF defined for mapInPandas:
def calculate_shap(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for X in iterator:
        yield pd.DataFrame(
            explainer.shap_values(np.array(X), check_additivity=False)[0],
            columns=columns_for_shap_calculation,
        )
return_schema = StructType()
for feature in columns_for_shap_calculation:
    return_schema = return_schema.add(StructField(feature, FloatType()))

shap_values = df.mapInPandas(calculate_shap, schema=return_schema)
In this case, how can I make sure that, when the data goes through mapInPandas, the iterator object is split into exactly the chunks I want?
For example, suppose I have a PySpark DataFrame with 1 million rows and an ID column taking the values 1, 2, 3, and 4, where
- 200K rows have value of 1
- 500K rows have value of 2
- 100K rows have value of 3
- 200K rows have value of 4
In this case I want the iterator to be partitioned by ID, i.e. split into chunks of [200K, 500K, 100K, 200K] rows, with the pandas UDF applied to each chunk.
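One thing worth knowing here: `mapInPandas` does not hand each partition to the UDF as a single DataFrame. The iterator yields Arrow record batches, capped by `spark.sql.execution.arrow.maxRecordsPerBatch` (10,000 rows by default), so even a 500K-row partition arrives as many smaller chunks. A minimal sketch to make the batching visible (`count_batches` is a made-up helper; running it on a real DataFrame needs an active SparkSession):

```python
from typing import Iterator
import pandas as pd

def count_batches(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Emit one row per incoming Arrow batch so the batch sizes are visible
    for batch in iterator:
        yield pd.DataFrame({"batch_rows": [len(batch)]})

# Usage on a real DataFrame:
# df.mapInPandas(count_batches, schema="batch_rows long").show()
```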
One idea I had is to use

df = df.repartition("ID")

and then pass the result to df.mapInPandas. But will this work? It changes my number of partitions, but does it change the iterator object?
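As I understand it, `repartition("ID")` gets you halfway there: hash repartitioning co-locates all rows of a given ID in one partition (though two IDs can still land in the same partition), but the iterator inside `mapInPandas` still yields Arrow-sized batches, not whole partitions. One workaround is to concatenate the iterator so the UDF sees the entire partition at once. A sketch (`whole_partition` is a made-up name, and note this loads the full partition into executor memory):

```python
from typing import Iterator
import pandas as pd

def whole_partition(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Rebuild the full partition from its Arrow batches before processing;
    # this is where explainer.shap_values(...) would run in the real job.
    pdf = pd.concat(iterator, ignore_index=True)
    yield pdf

# Usage: df.repartition("ID").mapInPandas(whole_partition, schema=df.schema)
```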
Alternatively:
df = df.groupBy("ID")

and then pass the result to df.mapInPandas, but how can I make this work using groupBy?
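For reference, `groupBy` returns a `GroupedData` object, which has no `mapInPandas`; the matching API is `applyInPandas`, which calls the function once per group and passes the whole group as a single pandas DataFrame, with no iterator involved. A sketch with a toy function standing in for the SHAP logic (`shap_per_group` is a made-up name; the real job would run `explainer.shap_values` inside it):

```python
import pandas as pd

def shap_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains every row for one ID value (e.g. all 500K rows of ID == 2);
    # in the real job, explainer.shap_values(...) would run on pdf here.
    return pd.DataFrame({"ID": [pdf["ID"].iloc[0]], "n_rows": [len(pdf)]})

# Usage (needs an active SparkSession):
# df.groupBy("ID").applyInPandas(shap_per_group, schema="ID long, n_rows long").show()
```

One caveat: with applyInPandas each whole group must fit in memory on a single executor, which matters for the 500K-row group.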
Whichever approach makes it easier to control the iterator object is fine by me.