groupby rolling date window sum with duplicate dates

This answer provides a solution to get a rolling sum of a column grouped by another column based on a date window. To reproduce it here:

import datetime

import pandas as pd

df = pd.DataFrame(
    {
        'ID': {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
        'Date': {
            0: datetime.datetime(2019, 7, 1),
            1: datetime.datetime(2019, 5, 1),
            2: datetime.datetime(2019, 6, 25),
            3: datetime.datetime(2019, 5, 27),
            4: datetime.datetime(2019, 6, 29),
            5: datetime.datetime(2019, 7, 18),
            6: datetime.datetime(2019, 7, 15)
        },
        'Amount': {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
    }
)
amounts = df.groupby(["ID"]).apply(lambda g: g.sort_values('Date').rolling('28d', on='Date').sum())
df['amount_4wk_rolling'] = df["Date"].map(amounts.set_index('Date')['Amount'])

Output:

+-------+------------+--------+--------------------+
|  ID   |    Date    | Amount | amount_4wk_rolling |
+-------+------------+--------+--------------------+
| 10001 | 01/07/2019 |     50 |                 60 |
| 10001 | 01/05/2019 |     15 |                 15 |
| 10001 | 25/06/2019 |     10 |                 10 |
| 10001 | 27/05/2019 |     20 |                 35 |
| 10002 | 29/06/2019 |     25 |                 25 |
| 10002 | 18/07/2019 |     35 |                100 |
| 10002 | 15/07/2019 |     40 |                 65 |
+-------+------------+--------+--------------------+

However, if two of the dates are the same, I get this error:

pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

This makes sense, as I can see on the final line that Date is used to set an index which is now no longer unique. However, since I don't really understand what that final line does, I'm a little stumped trying to develop an alternative solution.
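For reference, the failure can be reproduced in isolation: `Series.map` raises this error whenever the lookup Series has a non-unique index. A minimal sketch (not the original data), assuming a recent pandas version:

```python
import pandas as pd

# A lookup Series whose index contains a duplicate label,
# mimicking amounts.set_index('Date') with repeated dates.
lookup = pd.Series([35, 40], index=['2019-07-18', '2019-07-18'])

try:
    pd.Series(['2019-07-18']).map(lookup)
except pd.errors.InvalidIndexError:
    # map cannot decide which of the two values to use
    print("InvalidIndexError")
```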

Could someone help out?

9 months ago · Santiago Trujillo
2 Answers


Based on the comments to the question, it appears that the OP has already found a solution. However, this is an attempt at providing another way to resolve it: fixing the root cause of the error, the duplicate date values.

To resolve it, we can aggregate by date within the apply. In the snippet below the Amount values are aggregated using sum, but in some contexts another aggregation may be appropriate, e.g. min or max. This is the relevant part:

    .apply(
        lambda g: (
            g
            .groupby('Date', as_index=False)
            .agg({'Amount': 'sum'})
            .rolling('28d', on='Date')
            .sum()
        )
    )

And the full snippet:

import pandas as pd
import datetime

df = pd.DataFrame(
    {
        'ID': {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
        'Date': {
            0: datetime.datetime(2019, 7, 1),
            1: datetime.datetime(2019, 5, 1),
            2: datetime.datetime(2019, 6, 25),
            3: datetime.datetime(2019, 5, 27),
            4: datetime.datetime(2019, 6, 29),
            5: datetime.datetime(2019, 7, 18),
            6: datetime.datetime(2019, 7, 18)
        },
        'Amount': {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
    }
)

amounts = (
    df
    .groupby(["ID"])
    .apply(
        lambda g: (
            g
            .groupby('Date', as_index=False)
            .agg({'Amount': 'sum'})
            .rolling('28d', on='Date')
            .sum()
        )
    )
)

df['amount_4wk_rolling'] = df["Date"].map(amounts.set_index('Date')['Amount'])

# this yields
#       ID       Date  Amount  amount_4wk_rolling
# 0  10001 2019-07-01      50                60.0
# 1  10001 2019-05-01      15                15.0
# 2  10001 2019-06-25      10                10.0
# 3  10001 2019-05-27      20                35.0
# 4  10002 2019-06-29      25                25.0
# 5  10002 2019-07-18      35               100.0
# 6  10002 2019-07-18      40               100.0
9 months ago · Santiago Trujillo


The problem is the first level of the index of amounts:

>>> df
      ID       Date  Amount
0  10001 2019-07-01      50
1  10001 2019-05-01      15
2  10001 2019-06-25      10
3  10001 2019-05-27      20
4  10002 2019-06-29      25
5  10002 2019-07-18      35  # <- dup date
6  10002 2019-07-18      40  # <- dup date

>>> amounts
         Amount       Date       ID
ID                                 
10001 1    15.0 2019-05-01  10001.0
      3    35.0 2019-05-27  20002.0
      2    10.0 2019-06-25  10001.0
      0    60.0 2019-07-01  20002.0
10002 4    25.0 2019-06-29  10002.0
      5    60.0 2019-07-18  20004.0
      6   100.0 2019-07-18  30006.0

If you map amounts onto df via the Date column, you get this error because Pandas doesn't know which value to use for 2019-07-18. If you look carefully, the second level of the index of amounts is the index of your original dataframe.

So if you drop the first index level added by groupby, you can use direct assignment:

df['amount_4wk_rolling'] = amounts.droplevel(0)['Amount']
print(df)

# Output:
      ID       Date  Amount  amount_4wk_rolling
0  10001 2019-07-01      50                60.0
1  10001 2019-05-01      15                15.0
2  10001 2019-06-25      10                10.0
3  10001 2019-05-27      20                35.0
4  10002 2019-06-29      25                25.0
5  10002 2019-07-18      35                60.0
6  10002 2019-07-18      40               100.0
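A closely related variant (a sketch under the same data, not part of the original answer): passing group_keys=False to groupby keeps the original row labels on the apply result, so the extra ID level is never added and no droplevel is needed:

```python
import datetime

import pandas as pd

df = pd.DataFrame(
    {
        'ID': [10001, 10001, 10001, 10001, 10002, 10002, 10002],
        'Date': [
            datetime.datetime(2019, 7, 1),
            datetime.datetime(2019, 5, 1),
            datetime.datetime(2019, 6, 25),
            datetime.datetime(2019, 5, 27),
            datetime.datetime(2019, 6, 29),
            datetime.datetime(2019, 7, 18),
            datetime.datetime(2019, 7, 18),  # duplicate date
        ],
        'Amount': [50, 15, 10, 20, 25, 35, 40],
    }
)

# group_keys=False: the result of apply keeps the original row index,
# so the assignment below aligns row by row, even with duplicate dates.
amounts = df.groupby('ID', group_keys=False).apply(
    lambda g: g.sort_values('Date').rolling('28d', on='Date').sum()
)
df['amount_4wk_rolling'] = amounts['Amount']
print(df)
```

The alignment happens on the original row labels rather than on Date, which is what makes duplicate dates harmless here.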
9 months ago · Santiago Trujillo