Pandas, merging two dataframes where the time of a column in one dataframe is between the time in two columns of the other

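I would like to merge two dataframes: the first has columns `time_1` and `time_2` (among others), and the second has a column `time_3`. In general, the second dataframe is longer than the first.

I want to merge the two dataframes wherever `time_3` from the second dataframe falls between `time_1` and `time_2`, repeating each entry of the first dataframe once for every entry of the second whose `time_3` lies in that range.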

For example, if the first data frame had the following format

time_1               time_2               dummy_data
2023-10-01 04:02:00  2023-10-01 08:29:00  -245.669907
2023-10-01 04:03:00  2023-10-01 08:49:00  -1772.948571
...                  ...                  ...

and the second had the following format

time_3                      dummy_data2
2023-10-01 06:21:13.238024  -131.367901
2023-10-01 06:47:19.796628  -236.277444
2023-10-01 07:37:06.438740  5.915493
2023-10-01 08:16:16.995256  -134.032433
2023-10-01 08:33:53.081095  -103.733212

then the following output would be expected:

time_1               time_2               dummy_data    time_3                      dummy_data2
2023-10-01 04:02:00  2023-10-01 08:29:00  -245.669907   2023-10-01 06:21:13.238024  -131.367901
2023-10-01 04:02:00  2023-10-01 08:29:00  -245.669907   2023-10-01 06:47:19.796628  -236.277444
2023-10-01 04:02:00  2023-10-01 08:29:00  -245.669907   2023-10-01 07:37:06.438740  5.915493
2023-10-01 04:02:00  2023-10-01 08:29:00  -245.669907   2023-10-01 08:16:16.995256  -134.032433
2023-10-01 04:03:00  2023-10-01 08:49:00  -1772.948571  2023-10-01 06:21:13.238024  -131.367901
2023-10-01 04:03:00  2023-10-01 08:49:00  -1772.948571  2023-10-01 06:47:19.796628  -236.277444
2023-10-01 04:03:00  2023-10-01 08:49:00  -1772.948571  2023-10-01 07:37:06.438740  5.915493
2023-10-01 04:03:00  2023-10-01 08:49:00  -1772.948571  2023-10-01 08:16:16.995256  -134.032433
2023-10-01 04:03:00  2023-10-01 08:49:00  -1772.948571  2023-10-01 08:33:53.081095  -103.733212

I can make this work by "cheating": iterating through each row and the list, then joining everything back up later, as shown in the code below. But I'm wondering if there's a more "pandas-y" way to do this that doesn't require the nested loops and the dictionary of indexes?

import pandas as pd

# Load the data: `df` holds time_3, `df2` holds time_1 and time_2
df = pd.read_csv("datetime_list.csv")
df["time_3"] = pd.to_datetime(df["time_3"])

df2 = pd.read_csv("dataframe.csv")


indexes = {}
# Record which indexes of `df` are between which indexes of `df2`

for i in df2.index:
    s = df["time_3"].between(df2.loc[i, "time_1"],
                             df2.loc[i, "time_2"],
                             inclusive="left")

    friends = list(s[s].index)
    indexes[i] = friends


rows = []
# Merge them all together, duplicating rows in df2 where necessary
# (DataFrame.append was removed in pandas 2.0, so collect rows in a list)
for key in indexes:
    for idx in indexes[key]:
        rows.append(pd.concat([df2.loc[key], df.loc[idx]]))
output_df = pd.DataFrame(rows).reset_index(drop=True)
output_df

As you might expect, this solution is very slow. Any suggestions would be greatly appreciated.

Answers

NumPy broadcasting is what you are looking for. It is many times faster than the nested loops.

# For this to work, the indices on both dataframes must be
# unique. You can use .reset_index(drop=True) if you don't
# care about existing indices.
df1 = df1.reset_index()
df2 = df2.reset_index()

# Prepare the data for numpy broadcasting
t1 = df1["time_1"].to_numpy()[:, None]
t2 = df1["time_2"].to_numpy()[:, None]
t3 = df2["time_3"].to_numpy()

# This is the broadcasting, i.e. compare every value in t1
# against every value in t3, then every value in t3 against
# every value in t2. The result is an n * m matrix.
#
# .nonzero() will return the coordinates of cells that are True.
#   - the x coordinate is the row number in df1
#   - the y coordinate is the row number in df2
x, y = ((t1 < t3) & (t3 < t2)).nonzero()

# Then it's just a matter of combining the rows at the returned coordinates
result = pd.concat(
    [
        df1.iloc[x].reset_index(drop=True),
        df2.iloc[y].reset_index(drop=True),
    ],
    axis=1,
)
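To make the approach concrete, here is a minimal, self-contained sketch that runs the same broadcasting idea on the sample rows from the question. The frames are built by hand instead of read from CSV, and the comparison uses `<=` on the left bound to mirror the question's `inclusive="left"`; both choices are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

# Sample rows copied from the question. As in the answer, df1 holds the
# intervals (time_1, time_2) and df2 holds the timestamps (time_3).
df1 = pd.DataFrame({
    "time_1": pd.to_datetime(["2023-10-01 04:02:00", "2023-10-01 04:03:00"]),
    "time_2": pd.to_datetime(["2023-10-01 08:29:00", "2023-10-01 08:49:00"]),
    "dummy_data": [-245.669907, -1772.948571],
})
df2 = pd.DataFrame({
    "time_3": pd.to_datetime([
        "2023-10-01 06:21:13.238024",
        "2023-10-01 06:47:19.796628",
        "2023-10-01 07:37:06.438740",
        "2023-10-01 08:16:16.995256",
        "2023-10-01 08:33:53.081095",
    ]),
    "dummy_data2": [-131.367901, -236.277444, 5.915493, -134.032433, -103.733212],
})

# Column vectors (n, 1) against a row vector (m,) broadcast to (n, m)
t1 = df1["time_1"].to_numpy()[:, None]
t2 = df1["time_2"].to_numpy()[:, None]
t3 = df2["time_3"].to_numpy()

# Half-open interval [time_1, time_2), an assumption taken from the
# question's inclusive="left"
mask = (t1 <= t3) & (t3 < t2)
x, y = mask.nonzero()

result = pd.concat(
    [df1.iloc[x].reset_index(drop=True), df2.iloc[y].reset_index(drop=True)],
    axis=1,
)
print(result)
```

On this data the mask has shape 2 x 5 and the result has nine rows: four matches for the first interval and five for the second, which agrees with the expected output in the question.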

Note that if `df1` and `df2` are large, your computer's memory may take a heavy hit, because it has to perform and store `n * m` comparisons.
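One possible way to soften the memory cost is to broadcast one slice of `df1` at a time, so that only a `chunk_size * m` boolean matrix exists at any moment. The sketch below is an illustration under stated assumptions, not part of the original answer: `interval_join_chunked` and `chunk_size` are hypothetical names, and the left bound is treated as inclusive to mirror the question's `inclusive="left"`.

```python
import pandas as pd


def interval_join_chunked(df1, df2, chunk_size=1000):
    """Hypothetical helper: pair each df1 row with every df2 row whose
    time_3 lies in [time_1, time_2), processing df1 in slices."""
    df1 = df1.reset_index(drop=True)
    df2 = df2.reset_index(drop=True)
    t3 = df2["time_3"].to_numpy()

    pieces = []
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        t1 = chunk["time_1"].to_numpy()[:, None]
        t2 = chunk["time_2"].to_numpy()[:, None]
        # Only a (len(chunk), len(df2)) boolean matrix is held in memory here
        x, y = ((t1 <= t3) & (t3 < t2)).nonzero()
        pieces.append(
            pd.concat(
                [chunk.iloc[x].reset_index(drop=True),
                 df2.iloc[y].reset_index(drop=True)],
                axis=1,
            )
        )

    if not pieces:  # df1 was empty, so there is nothing to join
        return pd.concat([df1, df2.iloc[:0]], axis=1)
    return pd.concat(pieces, ignore_index=True)


# Small usage example with two intervals and three timestamps
intervals = pd.DataFrame({
    "time_1": pd.to_datetime(["2023-10-01 04:02:00", "2023-10-01 04:03:00"]),
    "time_2": pd.to_datetime(["2023-10-01 08:29:00", "2023-10-01 08:49:00"]),
})
stamps = pd.DataFrame({
    "time_3": pd.to_datetime([
        "2023-10-01 06:21:13.238024",
        "2023-10-01 08:16:16.995256",
        "2023-10-01 08:33:53.081095",
    ]),
})
joined = interval_join_chunked(intervals, stamps, chunk_size=1)
print(joined)
```

The total number of comparisons is unchanged, so this trades a little looping overhead for a bounded peak memory footprint; the final result can still be large if many rows match.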




