My first idea is to use .groupby(["ID", "Value"]) to group rows that have the same ID and Value, and then .size() to count them. This gives two Series.
series1 = df_1.groupby(["ID", "Value"]).size()
series2 = df_2.groupby(["ID", "Value"]).size()
ID  Value
A   1        3
    2        1
B   1        2
    2        1
D   3        1
dtype: int64

ID  Value
A   1        2
    2        1
B   1        1
    2        1
C   4        1
dtype: int64
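To make the counting step easier to follow, here is a minimal sketch on toy data (the three-row frame is hypothetical and not part of the question). It only illustrates that .size() returns a Series whose index is built from the group keys.

```python
import pandas as pd

# Hypothetical toy data, used only to show what .size() returns
df = pd.DataFrame({"ID": ["A", "A", "B"], "Value": [1, 1, 2]})

counts = df.groupby(["ID", "Value"]).size()

print(type(counts).__name__)      # the result is a Series, not a DataFrame
print(list(counts.index.names))   # the group keys become the MultiIndex levels
print(counts.loc[("A", 1)])       # number of rows with ID == "A" and Value == 1
```

Because the keys end up in the index, two such Series can later be aligned on (ID, Value) pairs.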
Next it needs to subtract these Series, but s1 - s2 gives NaN for C and D because they exist in only one of the Series. It needs .subtract(..., fill_value=0) to fill the missing elements.
It also needs .abs() to convert negative values to positive.
series = series1.subtract(series2, fill_value=0)
series = series.abs()
ID  Value
A   1        1.0
    2        0.0
B   1        1.0
    2        0.0
C   4        1.0
D   3        1.0
dtype: float64
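The difference between the plain operator and .subtract(..., fill_value=0) can be seen on a small sketch. The two Series here are hypothetical, with single-level labels to keep it short.

```python
import pandas as pd

# Hypothetical Series whose labels only partly overlap
s1 = pd.Series({"A": 3, "D": 1})
s2 = pd.Series({"A": 2, "C": 1})

plain = s1 - s2                          # labels missing on one side become NaN
filled = s1.subtract(s2, fill_value=0)   # missing labels are treated as 0 first

print(plain)
print(filled)
```

In the filled version, C comes out negative because it exists only in s2, which is why .abs() is applied afterwards.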
Now it needs to drop the results that are 0.
series = series[series != 0]
ID  Value
A   1        1.0
B   1        1.0
C   4        1.0
D   3        1.0
dtype: float64
Finally it needs some cleanup: reset the index and remove the column with the size (column 0). reset_index() returns a DataFrame.
df = series.reset_index().drop(columns=[0])
  ID  Value
0  A      1
1  B      1
2  C      4
3  D      3
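As a possible alternative for the cleanup step, the index itself can be turned into a DataFrame, which avoids creating and then dropping column 0. This is only a sketch; the Series below is rebuilt by hand to mirror the filtered result above.

```python
import pandas as pd

# Hand-built copy of the filtered Series shown above (an assumption for this sketch)
idx = pd.MultiIndex.from_tuples(
    [("A", 1), ("B", 1), ("C", 4), ("D", 3)], names=["ID", "Value"]
)
series = pd.Series([1.0, 1.0, 1.0, 1.0], index=idx)

via_reset = series.reset_index().drop(columns=[0])   # approach used in this answer
via_index = series.index.to_frame(index=False)       # build the frame from the index only

print(via_index)
```

Both variants should give the same ID and Value columns; which one reads better is a matter of taste.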
Full working code:
import pandas as pd

df_1 = pd.DataFrame(data={"ID": ["A", "A", "A", "A", "B", "B", "B", "D"], "Value": [1, 1, 1, 2, 1, 1, 2, 3]})
df_2 = pd.DataFrame(data={"ID": ["A", "A", "A", "B", "B", "C"], "Value": [1, 1, 2, 1, 2, 4]})

# count rows for every (ID, Value) pair
series1 = df_1.groupby(["ID", "Value"]).size()  # Series
series2 = df_2.groupby(["ID", "Value"]).size()  # Series
print("--- series1 groupby.size ---")
print(series1)
print("--- series2 groupby.size ---")
print(series2)

# difference of counts; pairs missing on one side count as 0
series = series1.subtract(series2, fill_value=0).abs()  # Series
print("--- series subtract.abs ---")
print(series)

# keep only pairs whose counts differ
series = series[series != 0]  # Series
print("--- series drop ---")
print(series)

# move ID and Value back to columns and drop the count column (named 0)
df = series.reset_index().drop(columns=[0])  # DataFrame
print("--- df clean ---")
print(df)
Result:
--- series1 groupby.size ---
ID  Value
A   1        3
    2        1
B   1        2
    2        1
D   3        1
dtype: int64
--- series2 groupby.size ---
ID  Value
A   1        2
    2        1
B   1        1
    2        1
C   4        1
dtype: int64
--- series subtract.abs ---
ID  Value
A   1        1.0
    2        0.0
B   1        1.0
    2        0.0
C   4        1.0
D   3        1.0
dtype: float64
--- series drop ---
ID  Value
A   1        1.0
B   1        1.0
C   4        1.0
D   3        1.0
dtype: float64
--- df clean ---
  ID  Value
0  A      1
1  B      1
2  C      4
3  D      3
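If the same comparison is needed more than once, the steps could be wrapped in a small helper. This is just a sketch: the name count_diff and its parameters are made up here, and it uses index.to_frame(index=False) for the final cleanup instead of reset_index().drop(...).

```python
import pandas as pd

def count_diff(df_a, df_b, keys):
    """Return the key combinations whose row counts differ between df_a and df_b.

    Hypothetical helper, not part of pandas.
    """
    counts_a = df_a.groupby(keys).size()
    counts_b = df_b.groupby(keys).size()
    # combinations missing on one side are treated as a count of 0
    diff = counts_a.subtract(counts_b, fill_value=0).abs()
    diff = diff[diff != 0]
    # the group keys live in the index, so turn the index into a DataFrame
    return diff.index.to_frame(index=False)

df_1 = pd.DataFrame({"ID": ["A", "A", "A", "A", "B", "B", "B", "D"], "Value": [1, 1, 1, 2, 1, 1, 2, 3]})
df_2 = pd.DataFrame({"ID": ["A", "A", "A", "B", "B", "C"], "Value": [1, 1, 2, 1, 2, 4]})

result = count_diff(df_1, df_2, ["ID", "Value"])
print(result)
```

With the example data it should give the same four rows as the step-by-step version.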