Spark version: 3.2

I have a pandas UDF defined for mapInPandas:
def calculate_shap(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for X in iterator:
        yield pd.DataFrame(
            explainer.shap_values(np.array(X), check_additivity=False)[0],
            columns=columns_for_shap_calculation,
        )
return_schema = StructType()
for feature in columns_for_shap_calculation:
    return_schema = return_schema.add(StructField(feature, FloatType()))

shap_values = df.mapInPandas(calculate_shap, schema=return_schema)
In this case, how can I make sure that, when the data goes through mapInPandas, the iterator object is split into exactly the chunks I want?
For example, suppose I have a PySpark DataFrame with 1 million rows and an ID column taking the values 1, 2, 3, and 4, where
- 200K rows have value of 1
- 500K rows have value of 2
- 100K rows have value of 3
- 200K rows have value of 4
In this case I want the iterator to be partitioned by ID, i.e. split into chunks of [200K, 500K, 100K, 200K] rows, with the pandas UDF applied to each chunk.
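One thing worth knowing here: `mapInPandas` does not hand each partition to the UDF as a single DataFrame. The iterator yields Arrow record batches, capped by `spark.sql.execution.arrow.maxRecordsPerBatch` (10,000 rows by default), so even a 500K-row partition arrives as many smaller chunks. A minimal sketch to make the batching visible (`count_batches` is a made-up helper; running it on a real DataFrame needs an active SparkSession):

```python
from typing import Iterator
import pandas as pd

def count_batches(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Emit one row per incoming Arrow batch so the batch sizes are visible
    for batch in iterator:
        yield pd.DataFrame({"batch_rows": [len(batch)]})

# Usage on a real DataFrame:
# df.mapInPandas(count_batches, schema="batch_rows long").show()
```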
One idea I had is to use

df = df.repartition("ID")

and then pass the result to df.mapInPandas. But will this work? It changes my number of partitions, but does it change the iterator object?
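As I understand it, `repartition("ID")` gets you halfway there: hash repartitioning co-locates all rows of a given ID in one partition (though two IDs can still land in the same partition), but the iterator inside `mapInPandas` still yields Arrow-sized batches, not whole partitions. One workaround is to concatenate the iterator so the UDF sees the entire partition at once. A sketch (`whole_partition` is a made-up name, and note this loads the full partition into executor memory):

```python
from typing import Iterator
import pandas as pd

def whole_partition(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Rebuild the full partition from its Arrow batches before processing;
    # this is where explainer.shap_values(...) would run in the real job.
    pdf = pd.concat(iterator, ignore_index=True)
    yield pdf

# Usage: df.repartition("ID").mapInPandas(whole_partition, schema=df.schema)
```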
Alternatively:
df = df.groupBy("ID")

and then pass the result to df.mapInPandas, but how can I make this work using groupBy?
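For reference, `groupBy` returns a `GroupedData` object, which has no `mapInPandas`; the matching API is `applyInPandas`, which calls the function once per group and passes the whole group as a single pandas DataFrame, with no iterator involved. A sketch with a toy function standing in for the SHAP logic (`shap_per_group` is a made-up name; the real job would run `explainer.shap_values` inside it):

```python
import pandas as pd

def shap_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains every row for one ID value (e.g. all 500K rows of ID == 2);
    # in the real job, explainer.shap_values(...) would run on pdf here.
    return pd.DataFrame({"ID": [pdf["ID"].iloc[0]], "n_rows": [len(pdf)]})

# Usage (needs an active SparkSession):
# df.groupBy("ID").applyInPandas(shap_per_group, schema="ID long, n_rows long").show()
```

One caveat: with applyInPandas each whole group must fit in memory on a single executor, which matters for the 500K-row group.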
Whichever approach makes it easier to control the iterator object is fine by me.