This answer provides a solution to get a rolling sum of a column, grouped by another column, over a date window. To reproduce it here:
import pandas as pd
import datetime
df = pd.DataFrame(
{
'ID': {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
'Date': {
0: datetime.datetime(2019, 7, 1),
1: datetime.datetime(2019, 5, 1),
2: datetime.datetime(2019, 6, 25),
3: datetime.datetime(2019, 5, 27),
4: datetime.datetime(2019, 6, 29),
5: datetime.datetime(2019, 7, 18),
6: datetime.datetime(2019, 7, 15)
},
'Amount': {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
}
)
amounts = df.groupby(["ID"]).apply(lambda g: g.sort_values('Date').rolling('28d', on='Date').sum())
df['amount_4wk_rolling'] = df["Date"].map(amounts.set_index('Date')['Amount'])
Output:
+-------+------------+--------+--------------------+
| ID | Date | Amount | amount_4wk_rolling |
+-------+------------+--------+--------------------+
| 10001 | 01/07/2019 | 50 | 60 |
| 10001 | 01/05/2019 | 15 | 15 |
| 10001 | 25/06/2019 | 10 | 10 |
| 10001 | 27/05/2019 | 20 | 35 |
| 10002 | 29/06/2019 | 25 | 25 |
| 10002 | 18/07/2019 | 35 | 100 |
| 10002 | 15/07/2019 | 40 | 65 |
+-------+------------+--------+--------------------+
However, if two of the dates are the same then I get the error:
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
This makes sense, as I can see on the final line that Date is being used to set an index which is now no longer unique. However, as I don't really understand what that final line does, I'm a little stumped trying to develop an alternative solution.
Could someone help out?
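For context, the failure can be reproduced in isolation. This is a minimal sketch with made-up lookup values (not taken from the data above), showing that mapping through a Series whose index contains duplicate labels raises the same error:

```python
import pandas as pd

# Hypothetical minimal reproduction: mapping through a Series whose index
# contains duplicate labels fails, because pandas cannot decide which of
# the duplicated values to return for that label.
lookup = pd.Series(
    [60.0, 100.0],
    index=pd.to_datetime(["2019-07-18", "2019-07-18"]),
)
dates = pd.Series(pd.to_datetime(["2019-07-18"]))
try:
    dates.map(lookup)
except pd.errors.InvalidIndexError as err:
    print(type(err).__name__, err)
```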
Based on comments to the question, it appears that the OP already found a solution. However, this is an attempt at providing another way to resolve it, by addressing the root cause of the error: duplicate date values.
To resolve it, we can add aggregation by date within the apply. In the snippet below, the Amount values are aggregated using sum, but it is possible that in some contexts another aggregation should be used, e.g. min or max. This is the relevant part:
.apply(
lambda g: (
g
.groupby('Date', as_index=False)
.agg({'Amount': 'sum'})
.rolling('28d', on='Date')
.sum()
)
)
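For instance, swapping sum for max keeps only the largest amount per date before the window is applied. A sketch with made-up data (not taken from the question):

```python
import pandas as pd

# Sketch with made-up data: two rows share 2019-07-18.  Aggregating with
# 'max' keeps the larger amount per date before the 28-day rolling sum.
g = pd.DataFrame({
    "Date": pd.to_datetime(["2019-07-18", "2019-07-18", "2019-07-20"]),
    "Amount": [35, 40, 10],
})
rolled = (
    g
    .groupby("Date", as_index=False)
    .agg({"Amount": "max"})
    .rolling("28d", on="Date")
    .sum()
)
print(rolled["Amount"].tolist())  # [40.0, 50.0]
```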
And here is the full snippet:
import pandas as pd
import datetime
df = pd.DataFrame(
{
'ID': {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
'Date': {
0: datetime.datetime(2019, 7, 1),
1: datetime.datetime(2019, 5, 1),
2: datetime.datetime(2019, 6, 25),
3: datetime.datetime(2019, 5, 27),
4: datetime.datetime(2019, 6, 29),
5: datetime.datetime(2019, 7, 18),
6: datetime.datetime(2019, 7, 18)
},
'Amount': {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
}
)
amounts = (
df
.groupby(["ID"])
.apply(
lambda g: (
g
.groupby('Date', as_index=False)
.agg({'Amount': 'sum'})
.rolling('28d', on='Date')
.sum()
)
)
)
df['amount_4wk_rolling'] = df["Date"].map(amounts.set_index('Date')['Amount'])
# this yields
# ID Date Amount amount_4wk_rolling
# 0 10001 2019-07-01 50 60.0
# 1 10001 2019-05-01 15 15.0
# 2 10001 2019-06-25 10 10.0
# 3 10001 2019-05-27 20 35.0
# 4 10002 2019-06-29 25 25.0
# 5 10002 2019-07-18 35 100.0
# 6 10002 2019-07-18 40 100.0
The problem is the first level index of amounts:
>>> df
ID Date Amount
0 10001 2019-07-01 50
1 10001 2019-05-01 15
2 10001 2019-06-25 10
3 10001 2019-05-27 20
4 10002 2019-06-29 25
5 10002 2019-07-18 35 # <- dup date
6 10002 2019-07-18 40 # <- dup date
>>> amounts
Amount Date ID
ID
10001 1 15.0 2019-05-01 10001.0
3 35.0 2019-05-27 20002.0
2 10.0 2019-06-25 10001.0
0 60.0 2019-07-01 20002.0
10002 4 25.0 2019-06-29 10002.0
5 60.0 2019-07-18 20004.0
6 100.0 2019-07-18 30006.0
If you map amounts on the Date column to merge your data onto df, you get your error because pandas doesn't know which value it should use for 2019-07-18. If you look carefully, the second level of the index of amounts is the index of your original dataframe.
So if you drop the first level index set by groupby, you can use direct assignment:
df['amount_4wk_rolling'] = amounts.droplevel(0)['Amount']
print(df)
# Output:
ID Date Amount amount_4wk_rolling
0 10001 2019-07-01 50 60.0
1 10001 2019-05-01 15 15.0
2 10001 2019-06-25 10 10.0
3 10001 2019-05-27 20 35.0
4 10002 2019-06-29 25 25.0
5 10002 2019-07-18 35 60.0
6 10002 2019-07-18 40 100.0
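To illustrate the alignment, here is a small sketch with made-up values of what droplevel(0) does to a groupby-produced MultiIndex:

```python
import pandas as pd

# Sketch with made-up values: groupby().apply() results carry a
# MultiIndex of (group key, original row label).  Dropping level 0
# leaves only the original row labels, so assignment back to the source
# dataframe aligns on index rather than on (possibly duplicated) dates.
s = pd.Series(
    [60.0, 15.0, 25.0],
    index=pd.MultiIndex.from_tuples(
        [(10001, 0), (10001, 1), (10002, 4)], names=["ID", None]
    ),
)
flat = s.droplevel(0)
print(flat.index.tolist())  # [0, 1, 4]
```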