Question

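I want to merge two data frames: the first has columns `time_1` and `time_2` (and others), and the second has a column `time_3`. In general, the second data frame is longer than the first.

I want to merge them so that `time_3` from the second data frame falls between `time_1` and `time_2`, repeating each row of the first data frame once for every `time_3` entry in the second that lies between its `time_1` and `time_2`.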

For example, if the first data frame had the following format

`time_1`	`time_2`	`dummy_data`
2023-10-01 04:02:00	2023-10-01 08:29:00	-245.669907
2023-10-01 04:03:00	2023-10-01 08:49:00	-1772.948571
...	...	...

and the second data frame had the following format

`time_3`	`dummy_data2`
2023-10-01 06:21:13.238024	-131.367901
2023-10-01 06:47:19.796628	-236.277444
2023-10-01 07:37:06.438740	5.915493
2023-10-01 08:16:16.995256	-134.032433
2023-10-01 08:33:53.081095	-103.733212

then the following output would be expected:

`time_1`	`time_2`	`dummy_data`	`time_3`	`dummy_data2`
2023-10-01 04:02:00	2023-10-01 08:29:00	-245.669907	2023-10-01 06:21:13.238024	-131.367901
2023-10-01 04:02:00	2023-10-01 08:29:00	-245.669907	2023-10-01 06:47:19.796628	-236.277444
2023-10-01 04:02:00	2023-10-01 08:29:00	-245.669907	2023-10-01 07:37:06.438740	5.915493
2023-10-01 04:02:00	2023-10-01 08:29:00	-245.669907	2023-10-01 08:16:16.995256	-134.032433
2023-10-01 04:03:00	2023-10-01 08:49:00	-1772.948571	2023-10-01 06:21:13.238024	-131.367901
2023-10-01 04:03:00	2023-10-01 08:49:00	-1772.948571	2023-10-01 06:47:19.796628	-236.277444
2023-10-01 04:03:00	2023-10-01 08:49:00	-1772.948571	2023-10-01 07:37:06.438740	5.915493
2023-10-01 04:03:00	2023-10-01 08:49:00	-1772.948571	2023-10-01 08:16:16.995256	-134.032433
2023-10-01 04:03:00	2023-10-01 08:49:00	-1772.948571	2023-10-01 08:33:53.081095	-103.733212

I can make this work by "cheating": iterating through each row and the list, then joining everything back up later, as shown in the code below. But I'm wondering if there's a more "pandas-y" way to do this that doesn't require the nested loops and the dictionary of indexes?

import pandas as pd

# Load the data: `df` holds time_1/time_2, `df2` holds time_3
df = pd.read_csv("dataframe.csv", parse_dates=["time_1", "time_2"])

df2 = pd.read_csv("datetime_list.csv")
df2["time_3"] = pd.to_datetime(df2["time_3"])


indexes = {}
# Record which indexes of `df2` fall inside the interval of each row of `df`

for i in df.index:
    s = df2["time_3"].between(df.loc[i, "time_1"],
                              df.loc[i, "time_2"],
                              inclusive="left")

    friends = list(s[s].index)
    indexes[i] = friends


rows = []
# Merge them all together, duplicating rows in df where necessary
# (collected in a list because DataFrame.append was removed in pandas 2.0)
for key in indexes.keys():
    for idx in indexes[key]:
        rows.append(pd.concat([df.loc[key],
                               df2.loc[idx]]))
output_df = pd.DataFrame(rows).reset_index(drop=True)
output_df

As you might expect, this solution is very slow. Any suggestions would be highly appreciated.

Answer 1

NumPy broadcasting is what you are looking for. It is many times faster than looping.

# For this to work, the indices on both dataframes must be
# unique. You can use .reset_index(drop=True) if you don't
# care about existing indices.
df1 = df1.reset_index()
df2 = df2.reset_index()

# Prepare the data for numpy broadcasting
t1 = df1["time_1"].to_numpy()[:, None]
t2 = df1["time_2"].to_numpy()[:, None]
t3 = df2["time_3"].to_numpy()

# This is the broadcasting, i.e. compare every value in t1
# against every value in t3, then every value in t3 against
# every value in t2. The result is an n * m matrix.
#
# .nonzero() will return the coordinates of cells that are True.
#   - the x coordinate is the row number in df1
#   - the y coordinate is the row number in df2
x, y = ((t1 < t3) & (t3 < t2)).nonzero()

# Then it's just a matter of combining the rows at the returned coordinates
result = pd.concat(
    [
        df1.iloc[x].reset_index(drop=True),
        df2.iloc[y].reset_index(drop=True),
    ],
    axis=1,
)
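To make the broadcasting step concrete, here is a minimal, self-contained sketch that runs the same idea on the sample rows from the question. The data frames are typed in by hand rather than read from the CSV files, and the comparison is strict on both ends as in the snippet above; switching the first `<` to `<=` should match the `inclusive="left"` behaviour of the loop in the question.

```python
import pandas as pd

# Sample rows copied from the question (hand-typed, not read from CSV)
df1 = pd.DataFrame({
    "time_1": pd.to_datetime(["2023-10-01 04:02:00", "2023-10-01 04:03:00"]),
    "time_2": pd.to_datetime(["2023-10-01 08:29:00", "2023-10-01 08:49:00"]),
    "dummy_data": [-245.669907, -1772.948571],
})
df2 = pd.DataFrame({
    "time_3": pd.to_datetime([
        "2023-10-01 06:21:13.238024",
        "2023-10-01 06:47:19.796628",
        "2023-10-01 07:37:06.438740",
        "2023-10-01 08:16:16.995256",
        "2023-10-01 08:33:53.081095",
    ]),
    "dummy_data2": [-131.367901, -236.277444, 5.915493, -134.032433, -103.733212],
})

# Column vectors (n, 1) for the interval bounds, flat vector (m,) for time_3
t1 = df1["time_1"].to_numpy()[:, None]
t2 = df1["time_2"].to_numpy()[:, None]
t3 = df2["time_3"].to_numpy()

# (n, m) boolean matrix: cell [i, j] is True when df2 row j lies
# strictly inside the interval of df1 row i
x, y = ((t1 < t3) & (t3 < t2)).nonzero()

# Pair up the matching rows side by side
result = pd.concat(
    [df1.iloc[x].reset_index(drop=True), df2.iloc[y].reset_index(drop=True)],
    axis=1,
)
print(result)
```

The first interval ends at 08:29:00, so the last `time_3` value (08:33:53) is matched only by the second row of `df1`, which gives the nine rows shown in the expected output.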

Note that if `df1` and `df2` are large, your computer's memory can take a big hit, because it has to perform and store `n * m` comparisons.
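One possible way to soften the memory cost is to broadcast one slice of `df1` at a time, so that only a `chunk_size * m` boolean matrix exists at any moment. The sketch below is an illustration under assumptions, not part of the original answer: the helper name `interval_join_chunked` and the `chunk_size` parameter are made up for this example, and the final result can of course still be large if many rows match.

```python
import pandas as pd


def interval_join_chunked(df1, df2, chunk_size=1000):
    """Hypothetical helper: broadcast df1 against df2 one chunk at a time."""
    t3 = df2["time_3"].to_numpy()
    pieces = []
    # max(..., 1) makes the loop run once even for an empty df1,
    # so pd.concat below always receives at least one (possibly empty) piece
    for start in range(0, max(len(df1), 1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        t1 = chunk["time_1"].to_numpy()[:, None]
        t2 = chunk["time_2"].to_numpy()[:, None]
        # Boolean matrix of shape (len(chunk), len(df2)) only
        x, y = ((t1 < t3) & (t3 < t2)).nonzero()
        pieces.append(
            pd.concat(
                [
                    chunk.iloc[x].reset_index(drop=True),
                    df2.iloc[y].reset_index(drop=True),
                ],
                axis=1,
            )
        )
    return pd.concat(pieces, ignore_index=True)


# Small demonstration with made-up rows in the same shape as the question
df1 = pd.DataFrame({
    "time_1": pd.to_datetime(["2023-10-01 04:02", "2023-10-01 04:03"]),
    "time_2": pd.to_datetime(["2023-10-01 08:29", "2023-10-01 08:49"]),
})
df2 = pd.DataFrame({
    "time_3": pd.to_datetime(["2023-10-01 06:21", "2023-10-01 08:33"]),
})

result = interval_join_chunked(df1, df2, chunk_size=1)
print(result)
```

A smaller `chunk_size` lowers peak memory at the cost of more Python-level iterations, so the value is worth tuning against the sizes of the real data.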
