Question

Is it possible to find the matching pairs between two DataFrames when they contain duplicates? If a pair has 3 instances in df_1 and only 2 instances in df_2, I want to keep only the instances that don't have a match.

df_1 = pd.DataFrame(data={"ID": ["A", "A", "A", "A", "B", "B", "B", "D"], "Value": [1, 1, 1, 2, 1, 1, 2, 3]})
df_2 = pd.DataFrame(data={"ID": ["A", "A", "A", "B", "B", "C"], "Value": [1, 1, 2, 1, 2, 4]})

Expected result:

remaining_df = pd.DataFrame(data={"ID": ["A", "B", "B", "C", "D"], "Value": [1, 1, 2, 4, 3]})

I'm thinking of concatenating ID and Value, then creating a pivot and counting the concatenated text. I don't have any code yet; I'm still exploring whether this is possible. Can someone give me an idea or point me to the method to use?

Answer 1

My first idea is to use .groupby(["ID", "Value"]) to group rows with the same ID and Value and count them with .size(), which gives two Series.

series1 = df_1.groupby(["ID", "Value"]).size()
series2 = df_2.groupby(["ID", "Value"]).size()

ID  Value
A   1        3
    2        1
B   1        2
    2        1
D   3        1
dtype: int64


ID  Value
A   1        2
    2        1
B   1        1
    2        1
C   4        1
dtype: int64

Next, subtract these Series. A plain series1 - series2 gives NaN for C and D because they exist in only one Series, so use .subtract(..., fill_value=0) to fill the missing elements.

It also needs .abs() to convert negative values to positive.

series = series1.subtract(series2, fill_value=0)
series = series.abs()

ID  Value
A   1        1.0
    2        0.0
B   1        1.0
    2        0.0
C   4        1.0
D   3        1.0
dtype: float64
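To see why fill_value matters, here is a minimal sketch on made-up data (the names s1, s2, plain and filled are only for illustration). It compares the plain minus operator with .subtract(..., fill_value=0) when a label exists on only one side.

```python
import pandas as pd

# Two small count Series whose labels only partly overlap (made-up data).
s1 = pd.Series({"A": 3, "D": 1})
s2 = pd.Series({"A": 2, "C": 1})

# Plain subtraction: a label missing on either side becomes NaN.
plain = s1 - s2

# subtract() with fill_value=0 treats the missing side as a count of 0.
filled = s1.subtract(s2, fill_value=0)

print(plain)
print(filled)
```

With fill_value=0, a label found only in the second Series ends up negative, which is why .abs() is applied afterwards.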

Now drop the results that are 0.

series = series[series != 0]

ID  Value
A   1        1.0
B   1        1.0
C   4        1.0
D   3        1.0
dtype: float64

Finally, clean it up: reset the index and remove the column with the sizes (column 0). reset_index() gives back a DataFrame.

df = series.reset_index().drop(columns=[0])

  ID  Value
0  A      1
1  B      1
2  C      4
3  D      3

Full working code:

import pandas as pd

df_1 = pd.DataFrame(data={"ID": ["A", "A", "A", "A", "B", "B", "B", "D"], "Value": [1, 1, 1, 2, 1, 1, 2, 3]})
df_2 = pd.DataFrame(data={"ID": ["A", "A", "A", "B", "B", "C"], "Value": [1, 1, 2, 1, 2, 4]})

series1 = df_1.groupby(["ID", "Value"]).size()   # Series
series2 = df_2.groupby(["ID", "Value"]).size()   # Series

print("\n--- series1 groupby.size ---\n")
print(series1)
print("\n--- series2 groupby.size ---\n")
print(series2)

series = series1.subtract(series2, fill_value=0).abs()  # Series

print("\n--- series subtract.abs ---\n")
print(series)

series = series[ series != 0 ]  # Series

print("\n--- series drop ---\n")
print(series)

df = series.reset_index().drop(columns=[0])  # DataFrame

print("\n--- df clean ---\n")

print(df)

Result:

--- series1 groupby.size ---

ID  Value
A   1        3
    2        1
B   1        2
    2        1
D   3        1
dtype: int64

--- series2 groupby.size ---

ID  Value
A   1        2
    2        1
B   1        1
    2        1
C   4        1
dtype: int64

--- series subtract.abs ---

ID  Value
A   1        1.0
    2        0.0
B   1        1.0
    2        0.0
C   4        1.0
D   3        1.0
dtype: float64

--- series drop ---

ID  Value
A   1        1.0
B   1        1.0
C   4        1.0
D   3        1.0
dtype: float64

--- df clean ---

  ID  Value
0  A      1
1  B      1
2  C      4
3  D      3
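One caveat: this result has one row per (ID, Value) pair, even when more than one instance is left unpaired. If you need one row per unpaired instance, a possible extension (my own sketch on made-up data, not part of the code above) is to repeat each index entry by its leftover count. The names counts1, counts2, diff and remaining are only for illustration.

```python
import pandas as pd

# Made-up data: ("A", 1) appears 3 times on the left and once on the right,
# so two instances of it are left unpaired.
df_1 = pd.DataFrame({"ID": ["A", "A", "A", "D"], "Value": [1, 1, 1, 3]})
df_2 = pd.DataFrame({"ID": ["A", "C"], "Value": [1, 4]})

counts1 = df_1.groupby(["ID", "Value"]).size()
counts2 = df_2.groupby(["ID", "Value"]).size()

# Absolute difference of the counts, back to integers, zeros dropped.
diff = counts1.subtract(counts2, fill_value=0).abs().astype(int)
diff = diff[diff != 0]

# Repeat every (ID, Value) entry as many times as it is left unpaired.
remaining = diff.index.repeat(diff.to_numpy()).to_frame(index=False)

print(remaining)
```

Note that .abs() hides which DataFrame the leftover rows came from. If that matters, keep the sign of the difference instead of taking the absolute value.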

Answer 2

I think we can simply use a MultiIndex:

# setup
import pandas as pd

A, B, C, D = "ABCD"
df1 = pd.DataFrame(data={"ID": [A, A, A, A, B, B, B, D], "Value": [1, 1, 1, 2, 1, 1, 2, 3]})
df2 = pd.DataFrame(data={"ID": [A, A, A, B, B, C], "Value": [1, 1, 2, 1, 2, 4]})


idval = ["ID", "Value"]  # because I'm lazy
# k numbers the repeats inside each (ID, Value) group: 0, 1, 2, ...
a = df1.assign(k=df1.groupby(idval).cumcount())
b = df2.assign(k=df2.groupby(idval).cumcount())
df = pd.MultiIndex.from_frame(
    a
).symmetric_difference(
    pd.MultiIndex.from_frame(b)
).to_frame(index=False).drop("k", axis=1)

>>> df
  ID  Value
0  A      1
1  B      1
2  C      4
3  D      3
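The cumcount() column k is what makes duplicates distinguishable: the first (A, 1) gets k=0, the second k=1, and so on, so each instance can pair with at most one instance on the other side. As an alternative sketch (a swapped-in technique, not the method above), the same k column can feed an outer merge with indicator=True, which also reports which DataFrame each unpaired row came from. The names keys, merged and unpaired are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": list("AAAABBBD"), "Value": [1, 1, 1, 2, 1, 1, 2, 3]})
df2 = pd.DataFrame({"ID": list("AAABBC"), "Value": [1, 1, 2, 1, 2, 4]})

keys = ["ID", "Value"]

# k numbers the repeats inside each (ID, Value) group: 0, 1, 2, ...
a = df1.assign(k=df1.groupby(keys).cumcount())
b = df2.assign(k=df2.groupby(keys).cumcount())

# Outer merge on (ID, Value, k); the _merge column marks each row as
# "both", "left_only" or "right_only".
merged = a.merge(b, on=keys + ["k"], how="outer", indicator=True)

# Keep only the rows that found no partner on the other side.
unpaired = merged[merged["_merge"] != "both"]

print(unpaired[keys + ["_merge"]])
```

Here left_only means the row exists only in df1 and right_only means it exists only in df2.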
