Question

I have 2 dataframes, as shown in the attachment: expected_df and actual_df.

In my unit test I am trying to check whether the two are equal.

My code:

expected = map(lambda row: row.asDict(), expected_df.collect()) 
actual = map(lambda row: row.asDict(), actual_df.collect())
assert expected = actual

Both dfs contain the same rows, but the row order is different, so the assert fails here. What is the best way to compare such dfs?

Answer 1

You can try pyspark-test:

https://pypi.org/project/pyspark-test/

It is inspired by the pandas testing module, but built for pyspark.

Usage is simple:

from pyspark_test import assert_pyspark_df_equal

assert_pyspark_df_equal(df_1, df_2)

Apart from comparing the dataframes themselves, it also accepts many optional parameters, just like the pandas testing module; you can check them in the documentation.

Note:

The datatypes in pandas and pyspark are a bit different, which is why converting directly with .toPandas and using the pandas testing module might not be the right approach.
This package is for unit/integration testing, so it is meant to be used with small dfs.

Answer 2

This is done in some of the pyspark documentation:

assert sorted(expected_df.collect()) == sorted(actual_df.collect())
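A related sketch that avoids sorting entirely: count the rows as a multiset and compare the counts. It works on plain dicts (what row.asDict() returns), so the helper names as_multiset and same_rows are hypothetical, and it assumes every column value is hashable. Two details motivate it. In Python 3, map returns an iterator, so comparing two map objects with == compares identity, not contents. Also, sorting rows fails with a TypeError when a column mixes None with other values, whereas counting only needs hashing.

```python
from collections import Counter

def as_multiset(rows):
    """Count rows irrespective of order; each row is a dict of column -> value."""
    # Sort the items by column name so column order inside a dict is irrelevant.
    return Counter(tuple(sorted(row.items())) for row in rows)

def same_rows(expected_rows, actual_rows):
    """True when both inputs hold the same rows with the same multiplicities."""
    return as_multiset(expected_rows) == as_multiset(actual_rows)

# With PySpark, the inputs could be built like this (not executed here):
#   expected_rows = [row.asDict() for row in expected_df.collect()]
#   actual_rows = [row.asDict() for row in actual_df.collect()]
expected_rows = [{"id": 1, "name": None}, {"id": 2, "name": "b"}]
actual_rows = [{"id": 2, "name": "b"}, {"id": 1, "name": None}]
rows_match = same_rows(expected_rows, actual_rows)
```

Because it counts occurrences, this also catches a duplicated row that a set-based comparison would miss.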

Answer 3

We solved this by hashing each row with Spark's hash function and then summing the resulting column.

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def hash_df(df):
    """Hashes a DataFrame for comparison.

    Arguments:
        df (DataFrame): A dataframe to generate a hash from

    Returns:
        int: Summed value of hashed rows of an input DataFrame
    """
    # Hash every row into a new hash column
    df = df.withColumn("hash_value", F.hash(*sorted(df.columns))).select("hash_value")

    # Sum the hashes, see https://shortest.link/28YE
    value = df.agg(F.sum("hash_value")).collect()[0][0]

    return value

expected_hash = hash_df(expected_df)
actual_hash = hash_df(actual_df)
assert expected_hash == actual_hash
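To see why the sum does not depend on row order, here is a pure-Python analogue of the same idea. It is only a sketch: it uses SHA-256 truncated to 64 bits instead of Spark's F.hash, and the names row_hash and hash_rows are hypothetical. Keep in mind that equal sums make equality very likely but do not prove it, since different rows can in principle add up to the same total.

```python
import hashlib

def row_hash(row):
    """Hash one row (a dict of column -> value) to a signed 64-bit integer."""
    # Sort by column name so column order does not affect the hash,
    # similar in spirit to sorted(df.columns) in the Spark version.
    payload = repr(sorted(row.items())).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def hash_rows(rows):
    """Sum the per-row hashes; addition is commutative, so row order is irrelevant."""
    return sum(row_hash(row) for row in rows)

expected_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
actual_rows = [{"name": "b", "id": 2}, {"name": "a", "id": 1}]  # rows and columns reordered
hashes_match = hash_rows(expected_rows) == hash_rows(actual_rows)
```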

Answer 4

Unfortunately this cannot be done without sorting on one of the columns (ideally the key column), because a DataFrame gives no guarantee about the ordering of its records: you cannot predict the order in which they will appear. The approach below works fine for me:

expected = expected_df.orderBy("period_start_time").collect()
actual = actual_df.orderBy("period_start_time").collect()
assert expected == actual
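One caveat with sorting on a single column: if that column is not unique, rows that tie on it can still come back in different orders. The sketch below illustrates this with plain dicts standing in for collected rows; sort_rows is a hypothetical helper and the value column is made up for the example.

```python
def sort_rows(rows, columns):
    """Return rows (dicts) sorted by the given columns, in that priority order."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))

# Two rows that tie on period_start_time, delivered in different orders.
expected_rows = [
    {"period_start_time": 1, "value": "x"},
    {"period_start_time": 1, "value": "y"},
]
actual_rows = [
    {"period_start_time": 1, "value": "y"},
    {"period_start_time": 1, "value": "x"},
]

# Sorting on the key alone leaves the tied rows in their incoming order.
key_only = ["period_start_time"]
by_key_only = sort_rows(expected_rows, key_only) == sort_rows(actual_rows, key_only)

# Sorting on every column breaks the tie deterministically.
all_columns = sorted(expected_rows[0])
by_all_columns = sort_rows(expected_rows, all_columns) == sort_rows(actual_rows, all_columns)
```

In PySpark the equivalent would be ordering by every column, for example df.orderBy(*df.columns), before collecting.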

Answer 5

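If adding another library such as pyspark_test is a problem, you could try sorting both dataframes by the same columns, converting them to pandas, and using pd.testing.assert_frame_equal.

I know that the .toPandas method is not usually recommended, because the data is loaded into the driver's memory (see the pyspark documentation).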

For example:

import pandas as pd

# Sort on every column so both frames end up in the same row order
sort_cols = actual_df.columns

pd.testing.assert_frame_equal(
    actual_df.sort(sort_cols).toPandas(),
    expected_df.sort(sort_cols).toPandas()
)

Answer 6

One option is to use chispa.

from chispa.dataframe_comparer import assert_df_equality

assert_df_equality(actual_df, expected_df, ignore_row_order=True)

You can also ignore the column order and pass other arguments. Here is the function signature for quick review.

Signature:
assert_df_equality(
    df1,
    df2,
    ignore_nullable=False,
    transforms=None,
    allow_nan_equality=False,
    ignore_column_order=False,
    ignore_row_order=False,
    underline_cells=False,
    ignore_metadata=False,
)

See the chispa documentation for details.

Answer 7

I have two DataFrames with the same row order. To compare the two I use:

def test_df(df1, df2):
    assert df1.values.tolist() == df2.values.tolist()

Answer 8

Try using "==" instead of "=": assert expected == actual
