I have a Snowflake table with an ARRAY column containing custom embeddings (array size > 1000).
These arrays are sparse, and I would like to reduce their dimensionality with SVD (or one of the Snowpark snowflake.ml.modeling.decomposition
methods).
A toy example of the dataframe would be:
df = session.sql("""
    select 'doc1' as doc_id, array_construct(0.1, 0.3, 0.5, 0.7) as doc_vec
    union
    select 'doc2' as doc_id, array_construct(0.2, 0.4, 0.6, 0.8) as doc_vec
""")
df.show()
# DOC_ID | DOC_VEC
# doc1 | [ 0.1, 0.3, 0.5, 0.7 ]
# doc2 | [ 0.2, 0.4, 0.6, 0.8 ]
However, when I try to fit this dataframe:
from snowflake.ml.modeling.decomposition import TruncatedSVD
tsvd = TruncatedSVD(input_cols="doc_vec", output_cols="out_svd")
print(tsvd)
out = tsvd.fit(df)
I get the following error:
File "snowflake/ml/modeling/_internal/snowpark_trainer.py", line 218, in fit_wrapper_function
args = {"X": df[input_cols]}
~~^^^^^^^^^^^^ File "pandas/core/frame.py", line 3767, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
<...snip...>
KeyError: "None of [Index([ doc_vec ], dtype= object )] are in the [columns]"
Based on the information in this tutorial text_embedding_as_snowpark_python_udf,
I suspect the Snowpark ARRAY needs to be converted to an np.ndarray
before being fed to the underlying sklearn.decomposition.TruncatedSVD.
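If that is the case, I imagine the workaround is to expand the ARRAY into plain FLOAT columns first and pass those column names to the estimator. A minimal sketch of what I have in mind is below; the V_* / SVD_* column names, VEC_DIM, and n_components=2 are placeholders I made up, not anything I found in the Snowpark ML docs:

from snowflake.snowpark.functions import col
from snowflake.snowpark.types import FloatType
from snowflake.ml.modeling.decomposition import TruncatedSVD

VEC_DIM = 4  # toy example; would be >1000 for the real table
vec_cols = [f"V_{i}" for i in range(VEC_DIM)]

# Expand the ARRAY element by element into scalar FLOAT columns:
# DOC_VEC[0] -> V_0, DOC_VEC[1] -> V_1, ...
flat_df = df.select(
    col("DOC_ID"),
    *[col("DOC_VEC")[i].cast(FloatType()).alias(name) for i, name in enumerate(vec_cols)],
)

tsvd = TruncatedSVD(
    n_components=2,                              # assumed target dimensionality
    input_cols=vec_cols,                         # plain numeric columns instead of the ARRAY
    output_cols=[f"SVD_{i}" for i in range(2)],  # one output column per component (my assumption)
)
out = tsvd.fit(flat_df).transform(flat_df)

This feels clumsy for 1000+ dimensions, so I would prefer a way to use the ARRAY column directly.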
Could anyone point me to any example of using a Snowflake ARRAY as input to a Snowpark ML model?