I have a Snowflake table with an ARRAY column containing custom embeddings (array size > 1000).
These arrays are sparse, and I would like to reduce their dimensionality with SVD (or one of the Snowpark snowflake.ml.modeling.decomposition
methods).
A toy example of the dataframe would be:
df = session.sql("""
    select 'doc1' as doc_id, array_construct(0.1, 0.3, 0.5, 0.7) as doc_vec
    union
    select 'doc2' as doc_id, array_construct(0.2, 0.4, 0.6, 0.8) as doc_vec
""")
df.show()
# DOC_ID | DOC_VEC
# doc1 | [ 0.1, 0.3, 0.5, 0.7 ]
# doc2 | [ 0.2, 0.4, 0.6, 0.8 ]
However, when I try to fit this dataframe:
from snowflake.ml.modeling.decomposition import TruncatedSVD
tsvd = TruncatedSVD(input_cols="doc_vec", output_cols="out_svd")
print(tsvd)
out = tsvd.fit(df)
I get the following error:
File "snowflake/ml/modeling/_internal/snowpark_trainer.py", line 218, in fit_wrapper_function
args = {"X": df[input_cols]}
~~^^^^^^^^^^^^ File "pandas/core/frame.py", line 3767, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
<...snip...>
KeyError: "None of [Index([ doc_vec ], dtype= object )] are in the [columns]"
Based on the information in this tutorial text_embedding_as_snowpark_python_udf,
I suspect the Snowpark ARRAY needs to be converted to an np.ndarray
before being fed to the underlying sklearn.decomposition.TruncatedSVD.
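If that is the case, I imagine the workaround is to expand the ARRAY into plain FLOAT columns first and pass those column names to the estimator. A minimal sketch of what I have in mind is below; the V_* / SVD_* column names, VEC_DIM, and n_components=2 are placeholders I made up, not anything I found in the Snowpark ML docs:

from snowflake.snowpark.functions import col
from snowflake.snowpark.types import FloatType
from snowflake.ml.modeling.decomposition import TruncatedSVD

VEC_DIM = 4  # toy example; would be >1000 for the real table
vec_cols = [f"V_{i}" for i in range(VEC_DIM)]

# Expand the ARRAY element by element into scalar FLOAT columns:
# DOC_VEC[0] -> V_0, DOC_VEC[1] -> V_1, ...
flat_df = df.select(
    col("DOC_ID"),
    *[col("DOC_VEC")[i].cast(FloatType()).alias(name) for i, name in enumerate(vec_cols)],
)

tsvd = TruncatedSVD(
    n_components=2,                              # assumed target dimensionality
    input_cols=vec_cols,                         # plain numeric columns instead of the ARRAY
    output_cols=[f"SVD_{i}" for i in range(2)],  # one output column per component (my assumption)
)
out = tsvd.fit(flat_df).transform(flat_df)

This feels clumsy for 1000+ dimensions, so I would prefer a way to use the ARRAY column directly.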
Could anyone point me to any example of using a Snowflake ARRAY as input to a Snowpark ML model?