Published 2024-10-31 20:30
I am working with a dataset where users interact via an app or a website, and I need to determine the optimal combination of variables (x1, x2, ... xn)
that will maximize the number of users classified as "APP Lovers." According to the business rule, a user is considered an "APP Lover" if they use the app more than 66% of the time.
Here’s a simplified example of the data structure:
import polars as pl

df = pl.DataFrame({
    "ID": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
    "Favourite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"]
})
In this dataset, each ID represents a user, and variable refers to the function (e.g., x1, x2, x3), with Favourite indicating whether the function was executed via the app or the website.
I pivot the data to count how many actions were performed via APP or WEB:
df2 = (
    df
    .pivot(
        index="ID",
        on="Favourite",
        values="variable",
        aggregate_function="len"
    )
    .fill_null(0)
)
Output:
shape: (3, 3)
┌─────┬─────┬─────┐
│ ID ┆ APP ┆ WEB │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 0 │
│ 2 ┆ 2 ┆ 1 │
│ 3 ┆ 1 ┆ 2 │
└─────┴─────┴─────┘
Next, I calculate the proportion of app usage for each user and classify them:
(
    df2
    .with_columns(
        Total=pl.col("APP") + pl.col("WEB")
    )
    .with_columns(
        Proportion=pl.col("APP") / pl.col("Total")
    )
    .with_columns(
        pl
        .when(pl.col("Proportion") >= 0.6).then(pl.lit("APP Lover"))
        .when(pl.col("Proportion") > 0.1).then(pl.lit("BOTH"))
        .otherwise(pl.lit("Inactive"))
    )
)
shape: (3, 6)
┌─────┬─────┬─────┬───────┬────────────┬───────────┐
│ ID ┆ APP ┆ WEB ┆ Total ┆ Proportion ┆ literal │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ u32 ┆ u32 ┆ f64 ┆ str │
╞═════╪═════╪═════╪═══════╪════════════╪═══════════╡
│ 1 ┆ 3 ┆ 0 ┆ 3 ┆ 1.0 ┆ APP Lover │
│ 2 ┆ 2 ┆ 1 ┆ 3 ┆ 0.666667 ┆ APP Lover │
│ 3 ┆ 1 ┆ 2 ┆ 3 ┆ 0.333333 ┆ BOTH │
└─────┴─────┴─────┴───────┴────────────┴───────────┘
The challenge: In my real dataset, I have at least 19 different x variables. I tried iterating over all possible combinations of these variables to find the one that yields the highest number of "APP Lovers," but the number of combinations (2^19) is too large to compute efficiently.
Question: How can I efficiently determine the best combination of xn variables that maximizes the number of "APP Lovers"? I'm looking for guidance on how to approach this in terms of algorithmic optimization or more efficient iterations.
Here's my suggestion. Take the data:
df = pl.DataFrame({
    "id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "variable": ["x1", "x1", "x1", "x2", "x2", "x2", "x3", "x3", "x3"],
    "favorite": ["APP", "APP", "WEB", "APP", "WEB", "APP", "APP", "APP", "WEB"]
})
and pivot it such that column xi is true if user id uses that action primarily through the app:
action_through_app = (
    df
    .with_columns(pl.col.favorite == "APP")
    .pivot(index="id", on="variable", values="favorite")
)
For example:
shape: (3, 4)
┌─────┬───────┬───────┬───────┐
│ id ┆ x1 ┆ x2 ┆ x3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ bool ┆ bool ┆ bool │
╞═════╪═══════╪═══════╪═══════╡
│ 1 ┆ true ┆ true ┆ true │
│ 2 ┆ true ┆ false ┆ true │
│ 3 ┆ false ┆ true ┆ false │
└─────┴───────┴───────┴───────┘
Now, given some combination of variables, we can efficiently query how many users would be app lovers by summing the relevant columns and checking whether each row's sum is >= 0.6 * the number of columns.
def num_app_lovers(combination):
    return (pl.sum_horizontal(combination) >= 0.6 * len(combination)).sum()

action_through_app.select(
    num_app_lovers([pl.col.x1]).alias("x1"),
    num_app_lovers([pl.col.x2]).alias("x2"),
    num_app_lovers([pl.col.x3]).alias("x3"),
    num_app_lovers([pl.col.x1, pl.col.x2]).alias("x12"),
    num_app_lovers([pl.col.x2, pl.col.x3]).alias("x23"),
    num_app_lovers([pl.col.x1, pl.col.x3]).alias("x13"),
    num_app_lovers([pl.col.x1, pl.col.x2, pl.col.x3]).alias("x123"),
)
shape: (1, 7)
┌─────┬─────┬─────┬─────┬─────┬─────┬──────┐
│ x1 ┆ x2 ┆ x3 ┆ x12 ┆ x23 ┆ x13 ┆ x123 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╪═════╪═════╪═════╪══════╡
│ 2 ┆ 2 ┆ 2 ┆ 1 ┆ 1 ┆ 2 ┆ 2 │
└─────┴─────┴─────┴─────┴─────┴─────┴──────┘
Now this lets you query combinations in bulk, but it still doesn't scale to 2^19 possible combinations. For that problem I'd suggest evolutionary programming.
Initialize a pool of combinations with x1, x2, x3, ... xn. Then randomly add or remove a column (removing only if the combination has more than one column) from each combination in the pool, and score the mutants with the query above. Keep the top, say, 100 combinations. Repeat for a number of iterations until the result no longer improves, and return the best combination found.
Author: 黑洞官方问答小能手
Link: https://www.pythonheidong.com/blog/article/2040401/51d2f9eafa55e2120fcf/
Source: python黑洞网