
Optimizing Variable Combinations to Maximize a Classification

Published 2024-10-31 20:30


I am working with a dataset where users interact via an app or a website, and I need to determine the optimal combination of variables (x1, x2, ... xn) that will maximize the number of users classified as "APP Lovers." According to the business rule, a user is considered an "APP Lover" if they use the app more than 66% of the time.

Here’s a simplified example of the data structure:

import polars as pl
df = pl.DataFrame({
    "ID": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
    "Favourite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"]
})

In this dataset, each ID represents a user, and variable refers to the function (e.g., x1, x2, x3), with Favourite indicating whether the function was executed via the app or the website.

I pivot the data to count how many actions were performed via APP or WEB:

df2 = (
    df
    .pivot(
        index="ID",
        on="Favourite",
        values="variable",
        aggregate_function="len",
    )
    .fill_null(0)
)

Output:

shape: (3, 3)
┌─────┬─────┬─────┐
│ ID  ┆ APP ┆ WEB │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 0   │
│ 2   ┆ 2   ┆ 1   │
│ 3   ┆ 1   ┆ 2   │
└─────┴─────┴─────┘

Next, I calculate the proportion of app usage for each user and classify them:

(
    df2
    .with_columns(
        Total = pl.col("APP") + pl.col("WEB")
    )
    .with_columns(
        Proportion = pl.col("APP") / pl.col("Total")
    )
    .with_columns(
        pl
        .when(pl.col("Proportion") >= 0.6).then(pl.lit("APP Lover"))
        .when(pl.col("Proportion") > 0.1).then(pl.lit("BOTH"))
        .otherwise(pl.lit("Inactive"))
        .alias("Classification")
    )
)

shape: (3, 6)
┌─────┬─────┬─────┬───────┬────────────┬────────────────┐
│ ID  ┆ APP ┆ WEB ┆ Total ┆ Proportion ┆ Classification │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---        ┆ ---            │
│ i64 ┆ u32 ┆ u32 ┆ u32   ┆ f64        ┆ str            │
╞═════╪═════╪═════╪═══════╪════════════╪════════════════╡
│ 1   ┆ 3   ┆ 0   ┆ 3     ┆ 1.0        ┆ APP Lover      │
│ 2   ┆ 2   ┆ 1   ┆ 3     ┆ 0.666667   ┆ APP Lover      │
│ 3   ┆ 1   ┆ 2   ┆ 3     ┆ 0.333333   ┆ BOTH           │
└─────┴─────┴─────┴───────┴────────────┴────────────────┘

The challenge: In my real dataset, I have at least 19 different x variables. As I described in a question yesterday, I tried iterating over all possible combinations of these variables to find the ones that result in the highest number of "APP Lovers," but the number of combinations (2^19 = 524,288) is too large to compute efficiently.

Question: How can I efficiently determine the best combination of xn variables that maximizes the number of "APP Lovers"? I'm looking for guidance on how to approach this in terms of algorithmic optimization or more efficient iterations.


Solution


Here's my suggestion: take the data

df = pl.DataFrame({
    "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
    "favorite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"]
})

and pivot it such that column xi is true if user id uses that action primarily through the app:

action_through_app = (
    df
    .with_columns(pl.col.favorite == "APP")
    .pivot(index="id", on="variable", values="favorite")
)

For example:

shape: (3, 4)
┌─────┬───────┬───────┬───────┐
│ id  ┆ x1    ┆ x2    ┆ x3    │
│ --- ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ bool  ┆ bool  ┆ bool  │
╞═════╪═══════╪═══════╪═══════╡
│ 1   ┆ true  ┆ true  ┆ true  │
│ 2   ┆ true  ┆ false ┆ true  │
│ 3   ┆ false ┆ true  ┆ false │
└─────┴───────┴───────┴───────┘

Now, given some combination of variables, we can efficiently query how many users would be app lovers by summing the relevant columns per row and checking whether each row's sum is >= 0.6 * the number of columns.

def num_app_lovers(combination):
    return (pl.sum_horizontal(combination) >= 0.6*len(combination)).sum()

action_through_app.select(
    num_app_lovers([pl.col.x1]).alias("x1"),
    num_app_lovers([pl.col.x2]).alias("x2"),
    num_app_lovers([pl.col.x3]).alias("x3"),
    num_app_lovers([pl.col.x1, pl.col.x2]).alias("x12"),
    num_app_lovers([pl.col.x2, pl.col.x3]).alias("x23"),
    num_app_lovers([pl.col.x1, pl.col.x3]).alias("x13"),
    num_app_lovers([pl.col.x1, pl.col.x2, pl.col.x3]).alias("x123"),
)
shape: (1, 7)
┌─────┬─────┬─────┬─────┬─────┬─────┬──────┐
│ x1  ┆ x2  ┆ x3  ┆ x12 ┆ x23 ┆ x13 ┆ x123 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ---  │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32  │
╞═════╪═════╪═════╪═════╪═════╪═════╪══════╡
│ 2   ┆ 2   ┆ 2   ┆ 1   ┆ 1   ┆ 2   ┆ 2    │
└─────┴─────┴─────┴─────┴─────┴─────┴──────┘

Now this lets you query combinations in bulk, but this still doesn't scale well to 2^19 possible combinations. For that problem I'd suggest using evolutionary programming.

Initialize a pool of possible combinations with x1, x2, x3, ... xn. Then, randomly add or remove a column (if > 1 column) to each combination in your pool, and test them with the above query. Keep the top, say, 100 combinations. Repeat this process for a bunch of iterations until the result no longer improves, and return that.



Site category: Technical articles > Q&A

Author: 黑洞官方问答小能手

Link: https://www.pythonheidong.com/blog/article/2040401/51d2f9eafa55e2120fcf/

Source: python黑洞网

Please credit the source when reposting in any form; infringement will be pursued legally.
