pytest assert for pyspark dataframe comparison

I have 2 PySpark DataFrames, as shown in the attached image: expected_df and actual_df.

In my unit test I am trying to check whether the two are equal.

My code:

expected = map(lambda row: row.asDict(), expected_df.collect())
actual = map(lambda row: row.asDict(), actual_df.collect())
assert expected = actual

Both DataFrames contain the same rows, but the row order differs, so the assert fails here. What is the best way to compare such DataFrames?

Answers

https://pypi.org/project/pyspark-test/

This is inspired by the pandas testing module, built for PySpark.

Usage is simple:

from pyspark_test import assert_pyspark_df_equal

assert_pyspark_df_equal(df_1, df_2)

Besides comparing DataFrames, it also accepts many optional parameters, just like the pandas testing module; you can check them in the documentation.

Note:

  1. The datatypes in pandas and PySpark are a bit different, which is why converting with .toPandas and using the pandas testing module might not be the right approach.
  2. This package is for unit/integration testing, so it is meant to be used with small DataFrames.

This is done in some of the PySpark documentation:

assert sorted(expected_df.collect()) == sorted(actual_df.collect())
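One caveat with sorting collected rows: sorted needs every pair of rows to be orderable, and in Python 3 comparing None with a string or number raises TypeError, so nullable columns can break the sort. A minimal sketch of an order-insensitive alternative is to compare the rows as multisets. The data below are hypothetical stand-ins for the output of collect() (PySpark Row objects are tuple subclasses, so plain tuples behave the same way here), and same_rows is a name introduced only for this illustration. It assumes every column value is hashable, which rules out array and map columns.

```python
from collections import Counter

# Hypothetical stand-ins for expected_df.collect() and actual_df.collect().
expected_rows = [(1, "a"), (2, None), (3, "c")]
actual_rows = [(3, "c"), (1, "a"), (2, None)]


def same_rows(left, right):
    """Compare two lists of rows as multisets, ignoring row order."""
    # Counter counts how often each row occurs, so duplicates are respected.
    return Counter(left) == Counter(right)


# Same rows in a different order compare equal.
assert same_rows(expected_rows, actual_rows)

# A missing row is detected.
assert not same_rows(expected_rows, actual_rows[:-1])
```

Because Counter keeps track of multiplicity, a DataFrame with a duplicated row is not considered equal to one where that row appears only once.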

We solved this by hashing each row with Spark's hash function and then summing the resulting column.

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def hash_df(df):
    """Hashes a DataFrame for comparison.

    Arguments:
        df (DataFrame): A dataframe to generate a hash from

    Returns:
        int: Summed value of hashed rows of an input DataFrame
    """
    # Hash every row into a new hash column
    df = df.withColumn("hash_value", F.hash(*sorted(df.columns))).select("hash_value")

    # Sum the hashes, see https://shortest.link/28YE
    value = df.agg(F.sum("hash_value")).collect()[0][0]

    return value

expected_hash = hash_df(expected_df)
actual_hash = hash_df(actual_df)
assert expected_hash == actual_hash
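Summing the per-row hashes ignores row order because addition is commutative. The sketch below is a pure-Python analogue of that idea, not the Spark implementation: it uses hashlib instead of F.hash, and the names row_digest and hash_rows are hypothetical, introduced only for illustration. Note that equal sums strongly suggest, but do not strictly prove, equal contents, since hashes can collide, and a mismatch does not tell you which rows differ.

```python
import hashlib


def row_digest(row):
    """Return a stable 64-bit integer digest of one row (a tuple of values)."""
    raw = repr(row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(raw).digest()[:8], "big")


def hash_rows(rows):
    """Sum the per-row digests; the sum does not depend on row order."""
    return sum(row_digest(row) for row in rows)


# Hypothetical stand-ins for the rows of three DataFrames.
rows_a = [(1, "x"), (2, "y"), (3, "z")]
rows_b = [(3, "z"), (1, "x"), (2, "y")]        # same rows, different order
rows_c = [(1, "x"), (2, "y"), (3, "changed")]  # one value differs

assert hash_rows(rows_a) == hash_rows(rows_b)
assert hash_rows(rows_a) != hash_rows(rows_c)
```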

Unfortunately this cannot be done without sorting on some column (typically the key column), because there is no guarantee about the ordering of records in a DataFrame; you cannot predict the order in which the records will appear. The approach below works fine for me:

expected = expected_df.orderBy("period_start_time").collect()
actual = actual_df.orderBy("period_start_time").collect()
assert expected == actual

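If the overhead of an additional library such as pyspark_test is a problem, you could try sorting both DataFrames by the same columns, converting them to pandas, and using pd.testing.assert_frame_equal.

I know that .toPandas is generally discouraged because the data is loaded into the driver's memory (see the PySpark documentation).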

For example:

import pandas as pd

# Sort both DataFrames by every column so the row order is deterministic
sort_cols = actual_df.columns

pd.testing.assert_frame_equal(
    actual_df.sort(sort_cols).toPandas(),
    expected_df.sort(sort_cols).toPandas()
)
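If the sorting is done on the pandas side instead, after the conversion, the index has to be reset as well, because assert_frame_equal also compares the index by default. The sketch below assumes two small pandas DataFrames standing in for the results of toPandas(); the data and the helper name normalize are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical stand-ins for actual_df.toPandas() and expected_df.toPandas().
actual_pdf = pd.DataFrame({"id": [3, 1, 2], "value": ["c", "a", "b"]})
expected_pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})


def normalize(pdf):
    """Sort by every column and reset the index so row labels line up."""
    return pdf.sort_values(list(pdf.columns)).reset_index(drop=True)


# Raises AssertionError with a readable diff if the frames differ.
pd.testing.assert_frame_equal(normalize(actual_pdf), normalize(expected_pdf))
```

Without reset_index(drop=True), the sorted frame would keep its original row labels and the comparison would fail even though the values match.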

One option is to use chispa:

from chispa.dataframe_comparer import assert_df_equality

assert_df_equality(actual_df, expected_df, ignore_row_order=True)

You can also ignore column order, and there are other arguments as well. Here is a quick look at the function signature.

Signature:
assert_df_equality(
    df1,
    df2,
    ignore_nullable=False,
    transforms=None,
    allow_nan_equality=False,
    ignore_column_order=False,
    ignore_row_order=False,
    underline_cells=False,
    ignore_metadata=False,
)

See the chispa documentation for details.

I have two DataFrames with the same row order. To compare the two I use:

def test_df(df1, df2):
    assert df1.values.tolist() == df2.values.tolist()

Use "==" instead of "=" in the assertion:

assert expected == actual




