Pandas: Create new column that contains the most recent index where a condition related to the current row is met

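In the following example, I want to return, for each row, the most recent index among the previous rows where "lower" is greater than or equal to the current row's "upper". I can do this and get the expected result, but it is not truly vectorized and is very inefficient for larger DataFrames.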

import pandas as pd

# Sample DataFrame
data = {"lower": [7, 1, 6, 1, 1, 1, 1, 11, 1, 1],
        "upper": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}

df = pd.DataFrame(data=data)

df["DATE"] = pd.date_range("2020-01-01", periods=len(data["lower"]))
df["DATE"] = pd.to_datetime(df["DATE"])
df.set_index("DATE", inplace=True)

# new column that contains the most recent index of previous rows, where the previous "lower" is greater than or equal to the current "upper"
def get_most_recent_index(row):
    # all rows strictly before the current one
    previous_indices = df.loc[:row.name - pd.Timedelta(minutes=1)]
    recent_index = previous_indices[previous_indices["lower"] >= row["upper"]].index.max()
    return recent_index

df["prev"] = df.apply(get_most_recent_index, axis=1)

print(df)

How can I do this in the most efficient way?

EDIT:

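First of all, thanks to everyone for the answers.

Regarding performance among the four workable solutions, the clear winner is bisect, proposed by Andrej Kesely. I excluded pyjanitor because, with any amount of data close to the size of my dataset, it quickly runs out of memory.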
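
For scale, the allocation error recorded as a comment in the benchmark script below can be checked with simple arithmetic. This is only a sketch of that check; the variable names are illustrative, and the only inputs are the array length from the error message and the 100,000 rows used by the sample generator.

```python
n_values = 4_994_299_505   # array length quoted in the allocation error
bytes_per_value = 8        # int64
rows = 100_000             # default size of the benchmark sample

gib = n_values * bytes_per_value / 2**30
share_of_pairs = n_values / rows**2

print(f"{gib:.1f} GiB, {share_of_pairs:.1%} of all row pairs")
# prints "37.2 GiB, 49.9% of all row pairs"
```

An intermediate result of that size grows quadratically with the number of rows, which is consistent with the join being impractical here.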

baseline: 1min 35s ± 5.15 s per loop (mean ± std. dev. of 2 runs, 2 loops each)

bisect: 1.76 s ± 82.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

enumerate: 1min 13s ± 2.17 s per loop (mean ± std. dev. of 2 runs, 2 loops each)

import pandas as pd
import numpy as np
from bisect import bisect_left
import janitor

def get_sample_df(rows=100_000):
    # Sample DataFrame
    data = {"lower": np.random.default_rng(seed=1).uniform(1,100,rows),
            "upper": np.random.default_rng(seed=2).uniform(1,100,rows)}

    df = pd.DataFrame(data=data)
    df = df.astype(int)

    df["DATE"] = pd.date_range("2020-01-01", periods=len(data["lower"]), freq="min")
    df["DATE"] = pd.to_datetime(df["DATE"])
    df.set_index("DATE", inplace=True)

    return df


def get_baseline():
    df = get_sample_df()

    # new columns that contains the most recent index of previous rows, where the previous "lower" is greater than or equal to the current "upper"
    def get_most_recent_index(row):
        previous_indices = df.loc[:row.name - pd.Timedelta(minutes=1)]
        recent_index = previous_indices[previous_indices["lower"] >= row["upper"]].index.max()
        return recent_index

    df["prev"] = df.apply(get_most_recent_index, axis=1)
    return df


def get_pyjanitor():

    df = get_sample_df()
    df.reset_index(inplace=True)

    # set the DATE column as an index
    # after the operation you can set the original DATE
    # column as an index
    left_df = df.assign(index_prev=df.index)
    right_df = df.assign(index_next=df.index)
    out=(left_df
        .conditional_join(
            right_df,
            ("lower", "upper", ">="),
            ("index_prev", "index_next", "<"),
            df_columns="index_prev",
            right_columns=["index_next", "lower", "upper"])
        )
    # based on the matches, we may have multiple returns
    # what we need is the closest to the current row
    closest=out.index_next-out.index_prev
    grouper=[out.index_next, out.lower,out.upper]
    min_closest=closest.groupby(grouper).transform("min")
    closest=closest==min_closest
    # we have our matches, which are defined by `index_prev`
    # use index_prev to get the relevant DATE
    prev=out.loc[closest, "index_prev"]
    prev=df.loc[prev, "DATE"].array # avoid index alignment here
    index_next=out.loc[closest, "index_next"]
    # now assign back to df, based on index_next and prev
    prev=pd.Series(prev,index=index_next)
    df = df.assign(prev=prev)
    return df

   

def get_bisect():
    df = get_sample_df()

    def get_prev_bs(lower, upper, _date):
        uniq_lower = sorted(set(lower))
        last_seen = {}

        for l, u, d in zip(lower, upper, _date):
            # find index of element that is >= u
            idx = bisect_left(uniq_lower, u)

            max_date = None
            for lv in uniq_lower[idx:]:
                if lv in last_seen:
                    if max_date is None:
                        max_date = last_seen[lv]
                    elif last_seen[lv] > max_date:
                        max_date = last_seen[lv]
            yield max_date
            last_seen[l] = d

    df["prev"] = list(get_prev_bs(df["lower"], df["upper"], df.index))
    return df

def get_enumerate():
    df = get_sample_df()
    df.reset_index(inplace=True)

    date_list=df["DATE"].values.tolist()
    lower_list=df["lower"].values.tolist()
    upper_list=df["upper"].values.tolist()
    new_list=[]
    for i,(x,y) in enumerate(zip(lower_list,upper_list)):
        if i==0:
            new_list.append(None)
        else:
            if (any(j >= y for j in lower_list[0:i])):
                

                for ll,dl in zip(reversed(lower_list[0:i]),reversed(date_list[0:i])):
                    if ll>=y:
                        new_list.append(dl)
                        break
                    else:
                        continue
            else:
                new_list.append(None)
    df["prev"]=new_list
    df["prev"]=pd.to_datetime(df["prev"])
    return df

print("baseline:")
%timeit -n 2 -r 2 get_baseline()

# Unable to allocate 37.2 GiB for an array with shape (4994299505,) and data type int64
# print("pyjanitor:")
# %timeit -n 2 get_pyjanitor()

print("bisect:")
%timeit -n 2 -r 2 get_bisect()

print("enumerate:")
%timeit -n 2 -r 2 get_enumerate()

Best answer

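I'm not sure this can be vectorized (because you have variables that depend on past state). But you can try to speed up the computation using binary search, e.g.: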

from bisect import bisect_left


def get_prev(lower, upper, _date):
    uniq_lower = sorted(set(lower))
    last_seen = {}

    for l, u, d in zip(lower, upper, _date):
        # find index of element that is >= u
        idx = bisect_left(uniq_lower, u)

        max_date = None
        for lv in uniq_lower[idx:]:
            if lv in last_seen:
                if max_date is None:
                    max_date = last_seen[lv]
                elif last_seen[lv] > max_date:
                    max_date = last_seen[lv]
        yield max_date
        last_seen[l] = d


df["prev_new"] = list(get_prev(df["lower"], df["upper"], df.index))
print(df)

Prints:

            lower  upper       prev   prev_new
DATE                                          
2020-01-01      7      2        NaT        NaT
2020-01-02      1      3 2020-01-01 2020-01-01
2020-01-03      6      4 2020-01-01 2020-01-01
2020-01-04      1      5 2020-01-03 2020-01-03
2020-01-05      1      6 2020-01-03 2020-01-03
2020-01-06      1      7 2020-01-01 2020-01-01
2020-01-07      1      8        NaT        NaT
2020-01-08     11      9        NaT        NaT
2020-01-09      1     10 2020-01-08 2020-01-08
2020-01-10      1     11 2020-01-08 2020-01-08
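
As a possible further speed-up, here is a sketch of a different technique from the one in this answer: a monotonic stack combined with binary search. It is not part of the original answer and has not been benchmarked against the timings quoted in the question. The function name get_prev_stack is hypothetical, and the negation trick assumes "lower" and "upper" are numeric. The idea is that an earlier row can be discarded once a later row has a "lower" at least as large, so the surviving candidates have strictly decreasing "lower" values and the most recent match can be found with one bisect call.

```python
from bisect import bisect_right
from datetime import date, timedelta


def get_prev_stack(lower, upper, dates):
    """For each row, return the date of the most recent earlier row
    whose lower >= this row's upper, or None if there is none."""
    neg_lowers = []   # negated "lower" of surviving candidates, strictly increasing
    cand_dates = []   # dates of the same candidates, oldest first
    result = []
    for l, u, d in zip(lower, upper, dates):
        # candidates with lower >= u are exactly the first k entries
        k = bisect_right(neg_lowers, -u)
        # the last of those k entries is the most recent match
        result.append(cand_dates[k - 1] if k else None)
        # the current row makes older candidates with lower <= l redundant
        while neg_lowers and neg_lowers[-1] >= -l:
            neg_lowers.pop()
            cand_dates.pop()
        neg_lowers.append(-l)
        cand_dates.append(d)
    return result


# sample data from the question, with plain dates standing in for the index
lower = [7, 1, 6, 1, 1, 1, 1, 11, 1, 1]
upper = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
dates = [date(2020, 1, 1) + timedelta(days=i) for i in range(len(lower))]

prev = get_prev_stack(lower, upper, dates)
for d, p in zip(dates, prev):
    print(d, p)
```

With the DataFrame from the question, the call would presumably mirror the one above: df["prev_stack"] = get_prev_stack(df["lower"], df["upper"], df.index). Each row is pushed and popped at most once, so the work per row is dominated by the binary search rather than by a scan over all distinct "lower" values.
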

Other answers

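My understanding is that iterating over native Python objects such as lists and dictionaries is faster than iterating over a DataFrame (I could be wrong). So below is what I tried, and it works for your input. It assumes DATE is a regular column, e.g. after df.reset_index(inplace=True):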

date_list=df["DATE"].values.tolist()
lower_list=df["lower"].values.tolist()
upper_list=df["upper"].values.tolist()
new_list=[]
for i,(x,y) in enumerate(zip(lower_list,upper_list)):
    if i==0:
        new_list.append(None)
    else:
        if (any(j >= y for j in lower_list[0:i])):
            

            for ll,dl in zip(reversed(lower_list[0:i]),reversed(date_list[0:i])):
                if ll>=y:
                    new_list.append(dl)
                    break
                else:
                    continue
        else:
            new_list.append(None)
df["prev"]=new_list
df["prev"]=pd.to_datetime(df["prev"])

UPDATE: this join is too heavy, so it is not suitable for this task. @Andrejkesely


You can use a range join to get your matches efficiently; conditional_join from pyjanitor (https://pyjanitor-devs.github.io/pyjanitor/api/Functions/#janitor.Functions.conditional_join.conditional_join) solves this. If you can, please share your performance tests.

# pip install pyjanitor
import pandas as pd
import janitor

# work on a positional index: move DATE from the index into a column
# (as the benchmark in the question does); after the operation you can
# set the original DATE column as the index again
df = df.reset_index()
left_df = df.assign(index_prev=df.index)
right_df = df.assign(index_next=df.index)
out=(left_df
    .conditional_join(
        right_df,
        ("lower", "upper", ">="),
        ("index_prev", "index_next", "<"),
        df_columns="index_prev",
        right_columns=["index_next", "lower", "upper"])
    )
# based on the matches, we may have multiple returns
# what we need is the closest to the current row
closest=out.index_next-out.index_prev
grouper=[out.index_next, out.lower,out.upper]
min_closest=closest.groupby(grouper).transform("min")
closest=closest==min_closest
# we have our matches, which are defined by `index_prev`
# use index_prev to get the relevant DATE
prev=out.loc[closest, "index_prev"]
prev=df.loc[prev, "DATE"].array # avoid index alignment here
index_next=out.loc[closest, "index_next"]
# now assign back to df, based on index_next and prev
prev=pd.Series(prev,index=index_next)
df.assign(prev=prev)

   lower  upper       DATE       prev
0      7      2 2020-01-01        NaT
1      1      3 2020-01-02 2020-01-01
2      6      4 2020-01-03 2020-01-01
3      1      5 2020-01-04 2020-01-03
4      1      6 2020-01-05 2020-01-03
5      1      7 2020-01-06 2020-01-01
6      1      8 2020-01-07        NaT
7     11      9 2020-01-08        NaT
8      1     10 2020-01-09 2020-01-08
9      1     11 2020-01-10 2020-01-08

Another solution, with slightly different results.

import pandas as pd
import numpy as np
# Sample DataFrame
data = {"lower": [7, 1, 6, 1, 1, 1, 1, 11, 1, 1],
        "upper": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}

df = pd.DataFrame(data=data)

df["DATE"] = pd.date_range("2020-01-01", periods=len(data["lower"]))
df["DATE"] = pd.to_datetime(df["DATE"])
df["prev"] = pd.to_datetime(np.nan)

# mark rows where "lower" >= "upper" within the same row, then carry that date forward
df["prev"] = np.where(df["lower"] >= df["upper"], df["DATE"], df["prev"])
df["prev"] = df["prev"].shift(1).ffill()

print(df)

   lower  upper       DATE       prev
0      7      2 2020-01-01        NaT
1      1      3 2020-01-02 2020-01-01
2      6      4 2020-01-03 2020-01-01
3      1      5 2020-01-04 2020-01-03
4      1      6 2020-01-05 2020-01-03
5      1      7 2020-01-06 2020-01-03
6      1      8 2020-01-07 2020-01-03
7     11      9 2020-01-08 2020-01-03
8      1     10 2020-01-09 2020-01-08
9      1     11 2020-01-10 2020-01-08

I'm not sure why we would get `NaT` for the two dates in the middle. My solution does not have `NaT` in those places.
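
The `NaT` rows appear to follow from the condition stated in the question, which compares the "lower" of earlier rows with the "upper" of the current row, whereas the shift/ffill approach tests "lower" >= "upper" within a single row and carries that date forward. A brute-force sketch in plain Python (the helper name prev_position is hypothetical, and it returns row positions rather than dates) makes the difference visible on the sample data:

```python
lower = [7, 1, 6, 1, 1, 1, 1, 11, 1, 1]
upper = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]


def prev_position(lower, upper):
    """For each row j, the last position i < j with lower[i] >= upper[j], else None."""
    result = []
    for j, u in enumerate(upper):
        matches = [i for i in range(j) if lower[i] >= u]
        result.append(matches[-1] if matches else None)
    return result


positions = prev_position(lower, upper)
print(positions)  # [None, 0, 0, 2, 2, 0, None, None, 7, 7]
```

Rows 6 and 7 have "upper" values of 8 and 9, and no earlier row has a "lower" that large (the largest so far is 7), so no match exists and `NaT` is the expected value there. Row 5 also differs: its "upper" of 7 is only matched by position 0 (2020-01-01), not by position 2.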




