Question

I have a Polars DataFrame that looks like this:

┌────────────┬───────┐
│ date       ┆ value │
│ ---        ┆ ---   │
│ str        ┆ i64   │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-02 ┆ 7     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-03 ┆ 12    │
└────────────┴───────┘

I have another DataFrame that looks like this:

┌──────────┬───────┬───────┐
│ category ┆ lower ┆ upper │
│ ---      ┆ ---   ┆ ---   │
│ str      ┆ i64   ┆ i64   │
╞══════════╪═══════╪═══════╡
│ A        ┆ 0     ┆ 5     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B        ┆ 5     ┆ 10    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ C        ┆ 10    ┆ 15    │
└──────────┴───────┴───────┘

I want to join these two DataFrames so that the first one gains a new "category" column, where each row is assigned the category whose lower/upper range contains its value. The final DataFrame should look something like this:

┌────────────┬───────┬──────────┐
│ date       ┆ value ┆ category │
│ ---        ┆ ---   ┆ ---      │
│ str        ┆ i64   ┆ str      │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 3     ┆ A        │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-02 ┆ 7     ┆ B        │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 ┆ 12    ┆ C        │
└────────────┴───────┴──────────┘

Is there a way to do this efficiently using Polars? What about with an unlimited upper bound on category C?

Answer 1

If you have many categories in your second DataFrame, you can "convert" it to a SQL CASE expression:

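import polars as pl

# `df_pl2` is your second dataframe (category, lower, upper)
cond = df_pl2.select(
    pl.format(
        "WHEN value >= {} AND value < {} THEN '{}'",
        pl.col("lower"),
        pl.col("upper"),
        pl.col("category"),
    ).alias("cond")
)
# one WHEN clause per row, followed by the fallback branch
cond = "\n".join(cond["cond"]) + " ELSE 'Not Found'"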

The cond now contains:

WHEN value >= 0 AND value < 5 THEN 'A'
WHEN value >= 5 AND value < 10 THEN 'B'
WHEN value >= 10 AND value < 15 THEN 'C' ELSE 'Not Found'

Then use Polars SQLContext():

ctxt = pl.SQLContext()
ctxt.register("df_pl", df_pl.lazy())  # <-- `df_pl` is your first dataframe

print(
    ctxt.execute(
        f"""
    SELECT date, value,
    CASE
        {cond}
    END AS category
    FROM df_pl
""",
        eager=True,
    )
)

Prints:

┌────────────┬───────┬──────────┐
│ date       ┆ value ┆ category │
│ ---        ┆ ---   ┆ ---      │
│ str        ┆ i64   ┆ str      │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 3     ┆ A        │
│ 2022-01-02 ┆ 7     ┆ B        │
│ 2022-01-03 ┆ 12    ┆ C        │
└────────────┴───────┴──────────┘

If you want an unlimited upper bound, just update the last condition in the CASE expression, for example:

cond = cond.replace("AND value < 15", "")
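Instead of patching the generated string afterwards, the CASE clauses could be built so that the upper bound is optional from the start. The following is only a minimal sketch in plain Python: the helper name `build_case` and the convention that an upper bound of `None` means "no upper limit" are assumptions made for illustration, not part of Polars or of the answer above.

```python
def build_case(bounds, column="value", default="Not Found"):
    """Build the body of a SQL CASE expression from (category, lower, upper) rows.

    An upper bound of None is treated as "no upper limit"
    (an assumption of this sketch).
    """
    clauses = []
    for category, lower, upper in bounds:
        condition = f"{column} >= {lower}"
        if upper is not None:
            # only add the upper check when a bound is actually given
            condition += f" AND {column} < {upper}"
        clauses.append(f"WHEN {condition} THEN '{category}'")
    clauses.append(f"ELSE '{default}'")
    return "\n".join(clauses)


# rows of the second dataframe, with category C left open-ended
bounds = [("A", 0, 5), ("B", 5, 10), ("C", 10, None)]
cond = build_case(bounds)
print(cond)
```

The resulting string can be dropped into the same CASE ... END AS category query. Note that this sketch does not escape single quotes inside category names, so it is only suitable for trusted, simple labels.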

Answer 2

join_asof is best for this "nearest key" type of problem. There are forward/backward join strategies:

import polars as pl

df = pl.DataFrame({"value": [3, 7, 12]}).set_sorted("value")
df2 = pl.DataFrame({"category": ["A", "B", "C"], "lower": [0, 5, 10], "upper": [5, 10, 15]}).set_sorted(["lower", "upper"])

# backward: match each value to the last row whose `lower` is <= value
df.join_asof(df2, left_on="value", right_on="lower", strategy="backward")
# equivalent alternate way, although the above is better for "unlimited upper bound"
df.join_asof(df2, left_on="value", right_on="upper", strategy="forward")

shape: (3, 4)
┌───────┬──────────┬───────┬───────┐
│ value ┆ category ┆ lower ┆ upper │
│ ---   ┆ ---      ┆ ---   ┆ ---   │
│ i64   ┆ str      ┆ i64   ┆ i64   │
╞═══════╪══════════╪═══════╪═══════╡
│ 3     ┆ A        ┆ 0     ┆ 5     │
│ 7     ┆ B        ┆ 5     ┆ 10    │
│ 12    ┆ C        ┆ 10    ┆ 15    │
└───────┴──────────┴───────┴───────┘
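To see why the backward strategy on "lower" copes with an unlimited upper bound, it may help to look at the lookup it performs: for each value, take the last row whose lower bound is less than or equal to it. The sketch below mimics that rule in plain Python with `bisect`; it is an illustration of the matching logic only, not how Polars implements it, and `backward_lookup` is a hypothetical helper name.

```python
from bisect import bisect_right

lowers = [0, 5, 10]            # sorted lower bounds from the second dataframe
categories = ["A", "B", "C"]   # category for each lower bound


def backward_lookup(value):
    """Return the category of the last lower bound <= value, or None if there is none."""
    idx = bisect_right(lowers, value) - 1
    return categories[idx] if idx >= 0 else None


# 99 is above every upper bound and -1 is below every lower bound
results = [backward_lookup(v) for v in [3, 7, 12, 99, -1]]
print(results)
```

Because only the lower bounds take part in the match, any value at or above 10 lands in C regardless of the "upper" column, while a value below the smallest lower bound gets no match.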

Alternatively, you could do a cut expression instead of any type of join:

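df = pl.DataFrame({"value": [3, 7, 12, 99]})
df2 = pl.DataFrame({"lower": [0, 5, 10], "upper": [5, 10, 15]})

# three breakpoints give four bins, so four labels are needed;
# bins are right-closed by default (pass left_closed=True for lower <= value < upper)
df.with_columns(
    category=pl.col("value").cut(
        df2.get_column("upper"), labels=["A", "B", "C", "unlimited"]
    )
)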

shape: (4, 2)
┌───────┬───────────┐
│ value ┆ category  │
│ ---   ┆ ---       │
│ i64   ┆ cat       │
╞═══════╪═══════════╡
│ 3     ┆ A         │
│ 7     ┆ B         │
│ 12    ┆ C         │
│ 99    ┆ unlimited │
└───────┴───────────┘
