Groupby Roll up or Roll Down for any kind of aggregates

TL;DR: How can we achieve something similar to SQL's GROUP BY ROLLUP with any kind of aggregate in pandas? (Credit to @Scott Boston for this term.)

I have the following dataframe:

       P   Q  R     S  T
0   PLAC  NR  F   HOL  F
1   PLAC  NR  F  NHOL  F
2   TRTB  NR  M  NHOL  M
3   PLAC  NR  M  NHOL  M
4   PLAC  NR  F  NHOL  F
5   PLAC   R  M  NHOL  M
6   TRTA   R  F   HOL  F
7   TRTA  NR  F   HOL  F
8   TRTB  NR  F  NHOL  F
9   PLAC  NR  F  NHOL  F
10  TRTB  NR  F  NHOL  F
11  TRTB  NR  M  NHOL  M
12  TRTA  NR  F   HOL  F
13  PLAC  NR  F   HOL  F
14  PLAC   R  F  NHOL  F
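
For anyone who wants to reproduce this, the sample frame can be rebuilt from the printout above; a convenience sketch (values copied verbatim from the table):

import io
import pandas as pd

data = """P Q R S T
PLAC NR F HOL F
PLAC NR F NHOL F
TRTB NR M NHOL M
PLAC NR M NHOL M
PLAC NR F NHOL F
PLAC R M NHOL M
TRTA R F HOL F
TRTA NR F HOL F
TRTB NR F NHOL F
PLAC NR F NHOL F
TRTB NR F NHOL F
TRTB NR M NHOL M
TRTA NR F HOL F
PLAC NR F HOL F
PLAC R F NHOL F"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')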

For the list of columns ['Q', 'R', 'S', 'T'], I want to calculate some aggregates on the P column for the following 4 lists of grouping columns:

  1. ['Q']
  2. ['Q', 'R']
  3. ['Q', 'R', 'S']
  4. ['Q', 'R', 'S', 'T']

I've already written code that groups the above dataframe on an increasing number of columns, calculates the aggregate (using count for the sake of simplicity) on each groupby object, and finally concatenates the results:

cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
    groupCols.append(col)
    result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
result = pd.concat(result)[groupCols+['count']]

However, I have a strong feeling that the above method is not very efficient in terms of CPU time. Is there a more efficient way to apply an aggregate over such a continuously growing list of grouping columns?

Why I think it is not efficient: for the values above, the first iteration groups the dataframe on the Q column and then calculates the aggregate. The next iteration groups the dataframe on Q and R, which means grouping by Q is repeated even though it was already done in the first iteration, so the same work is performed again. If there were some way to reuse the previously created groups, I think it would be more efficient.

OUTPUT:

    Q    R     S    T  count
0  NR  NaN   NaN  NaN     12
1   R  NaN   NaN  NaN      3
0  NR    F   NaN  NaN      9
1  NR    M   NaN  NaN      3
2   R    F   NaN  NaN      2
3   R    M   NaN  NaN      1
0  NR    F   HOL  NaN      4
1  NR    F  NHOL  NaN      5
2  NR    M  NHOL  NaN      3
3   R    F   HOL  NaN      1
4   R    F  NHOL  NaN      1
5   R    M  NHOL  NaN      1
0  NR    F   HOL    F      4
1  NR    F  NHOL    F      5
2  NR    M  NHOL    M      3
3   R    F   HOL    F      1
4   R    F  NHOL    F      1
5   R    M  NHOL    M      1

I already looked into Is there an equivalent of SQL GROUP BY ROLLUP in Python pandas? and Pandas Pivot tables row subtotals, but they don't work in my case. I already tried them: those methods can only be used to get the count, and they fail immediately even for unique counts when the same identifier appears for more than one value:

pd.pivot_table(df, aggCol, columns=cols, aggfunc='count', margins=True).T.reset_index()
    Q    R     S  T  P
0  NR    F   HOL  F  4
1  NR    F  NHOL  F  5
2  NR    M  NHOL  M  3
3  NR  All           3
4   R    F   HOL  F  1
5   R    F  NHOL  F  1
6   R    M  NHOL  M  1
7   R  All           3

UPDATE

To avoid any unnecessary confusion about just getting the count, as suggested in the comments, I have added an example with mean as the aggregate, changing the P column to a numeric type:

    P   Q  R     S  T
0   9  NR  F   HOL  F
1   7  NR  F  NHOL  F
2   3  NR  M  NHOL  M
3   9  NR  M  NHOL  M
4   1  NR  F  NHOL  F
5   0   R  M  NHOL  M
6   1   R  F   HOL  F
7   7  NR  F   HOL  F
8   2  NR  F  NHOL  F
9   2  NR  F  NHOL  F
10  1  NR  F  NHOL  F
11  2  NR  M  NHOL  M
12  3  NR  F   HOL  F
13  6  NR  F   HOL  F
14  0   R  F  NHOL  F

import numpy as np

cols = list('QRST')
aggCol = 'P'
groupCols = []
result = []
for col in cols:
    groupCols.append(col)
    result.append(df.groupby(groupCols)[aggCol]
                  .agg(agg=np.mean)
                  .round(2).reset_index())
result = pd.concat(result)[groupCols+['agg']]
>>> result
    Q    R     S    T   agg
0  NR  NaN   NaN  NaN  4.33
1   R  NaN   NaN  NaN  0.33
0  NR    F   NaN  NaN  4.22
1  NR    M   NaN  NaN  4.67
2   R    F   NaN  NaN  0.50
3   R    M   NaN  NaN  0.00
0  NR    F   HOL  NaN  6.25
1  NR    F  NHOL  NaN  2.60
2  NR    M  NHOL  NaN  4.67
3   R    F   HOL  NaN  1.00
4   R    F  NHOL  NaN  0.00
5   R    M  NHOL  NaN  0.00
0  NR    F   HOL    F  6.25
1  NR    F  NHOL    F  2.60
2  NR    M  NHOL    M  4.67
3   R    F   HOL    F  1.00
4   R    F  NHOL    F  0.00
5   R    M  NHOL    M  0.00
about 3 years ago · Santiago Trujillo
4 answers


I think this is a bit more efficient:

b = df.groupby(cols)[aggCol].count()    # count once at the finest grouping level
l = list(range(b.index.nlevels-1))      # index levels to keep for the next, coarser rollup
p = [b]
while l:
    p.append(b.groupby(level=l).sum())  # re-aggregate the existing counts at a coarser level
    l.pop()                             # drop the last level for the next pass

result = pd.concat(p)

Timings:

7.4 ms ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vs original

20.7 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The idea is to use sum instead of re-counting all the elements each time: count everything once at the finest grouping, then sum those counts over a decreasing number of index levels.
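
As a quick sanity check of that idea (a sketch, assuming b, df and aggCol from the snippets above are in scope), summing the finest-level counts over the first index level should reproduce a direct count grouped by Q alone:

check_rollup = b.groupby(level=0).sum()         # roll the 4-level counts up to Q only
check_direct = df.groupby('Q')[aggCol].count()  # count computed directly per Q
assert check_rollup.equals(check_direct)        # counts compose under summation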


For mean (or averaging) we can use @PierreD's suggestion of keeping both a sum and a count, then aggregating and dividing:

from itertools import zip_longest
cols = list('QRST')
aggCol = 'P'
b = df.groupby(cols)[aggCol].agg(['sum', 'count'])
l = list(range(b.index.nlevels-1))
p = [b]
while l:
    p.append(b.groupby(level=l).sum())
    l.pop()

result = pd.concat(p)
result = result.assign(avg=result['sum']/result['count']).drop(['sum', 'count'], axis=1)
result 

result.index = pd.MultiIndex.from_arrays(list(zip_longest(*result.index)))

Output:

                       avg
(NR, F, HOL, F)   6.250000
(NR, F, NHOL, F)  2.600000
(NR, M, NHOL, M)  4.666667
(R, F, HOL, F)    1.000000
(R, F, NHOL, F)   0.000000
(R, M, NHOL, M)   0.000000
(NR, F, HOL)      6.250000
(NR, F, NHOL)     2.600000
(NR, M, NHOL)     4.666667
(R, F, HOL)       1.000000
(R, F, NHOL)      0.000000
(R, M, NHOL)      0.000000
(NR, F)           4.222222
(NR, M)           4.666667
(R, F)            0.500000
(R, M)            0.000000
NR                4.333333
R                 0.333333
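
To see why the sum/count decomposition recovers the mean, take the top-level NR group from the question's numeric example: its rolled-up sum is 52 over a count of 12, and 52 / 12 ≈ 4.33, which matches both the NR row above and the question's expected output.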
about 3 years ago · Santiago Trujillo


Pandas' groupby feature is versatile, and can take custom functions. I am presenting the solution with a lambda that returns the count, but you could easily substitute np.min or np.max or other custom functions. Please bear in mind that any of these functions should make sense when applied recursively over the groupby's nesting levels (so count, min, max will all make sense; but if you have a statistical function such as mean, you will lose the information needed to calculate correct aggregates at higher groupings).
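
A tiny illustration of that caveat (hypothetical numbers, not from the question's data): with unequal group sizes, the mean of group means differs from the overall mean, so mean cannot simply be re-applied at coarser groupings.

import numpy as np

g1, g2 = [1, 1, 1, 1], [9]             # two groups of different sizes
np.mean([np.mean(g1), np.mean(g2)])    # 5.0 -> mean of the group means
np.mean(g1 + g2)                       # 2.6 -> the true overall mean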

df=pd.DataFrame.from_records(
[['PLAC','NR','F','HOL','F'],
['PLAC','NR',  'F',  'NHOL',  'F'],
['TRTB','NR',  'M',  'NHOL',  'M'],
['PLAC','NR',  'M',  'NHOL',  'M'],
['PLAC','NR',  'F',  'NHOL',  'F'],
['PLAC','R', 'M', 'NHOL',  'M'],
['TRTA','R',  'F',   'HOL',  'F'],
['TRTA','NR',  'F',   'HOL',  'F'],
['TRTB','NR',  'F',  'NHOL',  'F'],
['PLAC','NR',  'F',  'NHOL',  'F'],
['TRTB','NR',  'F',  'NHOL',  'F'],
['TRTB','NR',  'M',  'NHOL',  'M'],
['TRTA','NR',  'F',   'HOL',  'F'],
['PLAC','NR',  'F',   'HOL',  'F'],
['PLAC','R',  'F',  'NHOL',  'F']],
columns = ['P','Q','R','S','T'])

First, define a groupby-dataframe using the most granular groupings:

grdf = df.groupby(['Q','R','S','T'])['P'].apply(lambda x:len(x)).to_frame()

Now use the unstack() method of this dataframe to successively obtain aggregates at less granular grouping levels. For instance, one level higher, with the index as ['Q','R','S']:

df2 = grdf.unstack()
result2 = df2.sum(axis=1).rename(str(df2.index.names)).to_frame()

result2 will look like this:

[result2 output shown as an image in the original post]
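(For this sample, those ['Q', 'R', 'S']-level counts match the corresponding block of the question's expected output, e.g. NR/F/HOL -> 4 and NR/F/NHOL -> 5.)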

Similarly, compute aggregates at all grouping levels desired and append them all to the same dataframe using a function like this (ideally you can make this a recursive function, but I kept it simple so the flow can be easily seen):

def combine_aggregates(df):
    #if type(grdf) == pd.core.frame.DataFrame:
    df1 = df
    result1 = df.sum(axis=1).rename(str(df1.index.names)).to_frame()
    df2 = df1.unstack()
    result2 = df2.sum(axis=1).rename(str(df2.index.names)).to_frame()
    df3 = df2.unstack()
    result3 = df3.sum(axis=1).rename(str(df3.index.names)).to_frame()
    df4 = df3.unstack()
    result4 = df4.sum(axis=1).rename(str(df4.index.names)).to_frame()

    return result1.append(result2).append(result3).append(result4)



combine_aggregates(grdf)     

And the final output will be:

[final output shown as an image in the original post]
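
Side note: DataFrame.append was deprecated and later removed in pandas 2.0, so on recent versions the final line of combine_aggregates would use pd.concat instead, roughly:

return pd.concat([result1, result2, result3, result4])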

about 3 years ago · Santiago Trujillo


Most of the code is the same except for the indexing and one extra engine argument.

I pre-set the index, then group by the index levels, adding one level at a time.

Also, for performance, I try to use numba for numeric types (see pandas' Enhancing performance docs). It seems that, depending on the size of the df, you can add the parallel and nogil options in numba.

Numba's first execution can be slow because it compiles, but subsequent executions should be faster.

l = list('QRST')
df1 = df.set_index(l)
result = [
    df1.groupby(level=l[:i+1])['P'].agg(np.mean, engine='numba').round(2).reset_index()
    for i in range(4)
]
pd.concat(result)
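
If numba is not installed, the same rollup works with the default engine; a minimal sketch (also replacing the hard-coded 4 with len(l) so it adapts to the number of grouping columns):

result = [
    df1.groupby(level=l[:i + 1])['P'].mean().round(2).reset_index()
    for i in range(len(l))
]
pd.concat(result)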
about 3 years ago · Santiago Trujillo


Solution:

Maybe you could try this with recursion.

Like the below:

newdf = pd.DataFrame(columns=df.columns)
cols = list('QRST')
aggCol = 'P'
def aggregation(cols, origcols, aggCol, df, count=1):
    global newdf
    cols = origcols[:count]   # grow the grouping prefix by one column per call
    count += 1
    # aggregate, then reindex the columns so every level carries all of Q, R, S, T (missing -> NaN)
    newdf = newdf.append(df.groupby(cols)[aggCol].agg('mean').round(2).reset_index().T.reindex(origcols + [aggCol]).T, ignore_index=True)
    if cols != origcols:
        aggregation(cols, origcols, aggCol, df, count)

aggregation(cols, cols, aggCol, df)
newdf['agg'] = newdf.pop(aggCol)
print(newdf)

Output:

     Q    R     S    T   agg
0   NR  NaN   NaN  NaN  4.33
1    R  NaN   NaN  NaN  0.33
2   NR    F   NaN  NaN  4.22
3   NR    M   NaN  NaN  4.67
4    R    F   NaN  NaN   0.5
5    R    M   NaN  NaN     0
6   NR    F   HOL  NaN  6.25
7   NR    F  NHOL  NaN   2.6
8   NR    M  NHOL  NaN  4.67
9    R    F   HOL  NaN     1
10   R    F  NHOL  NaN     0
11   R    M  NHOL  NaN     0
12  NR    F   HOL    F  6.25
13  NR    F  NHOL    F   2.6
14  NR    M  NHOL    M  4.67
15   R    F   HOL    F     1
16   R    F  NHOL    F     0
17   R    M  NHOL    M     0

Timings:

Timing with the following code (running it 5000 times):

import time

u11time1 = time.time()

for i in range(5000):
    df = pd.read_clipboard()
    newdf = pd.DataFrame(columns=df.columns)
    cols = list('QRST')
    aggCol = 'P'
    def aggregation(cols, origcols, aggCol, df, count=1):
        global newdf
        cols = origcols[:count]
        count += 1
        newdf = newdf.append(df.groupby(cols)[aggCol].agg('mean').round(2).reset_index().T.reindex(origcols + [aggCol]).T, ignore_index=True)
        if cols != origcols:
            aggregation(cols, origcols, aggCol, df, count)

    aggregation(cols, cols, aggCol, df)
    newdf['agg'] = newdf.pop(aggCol)

u11time2 = time.time()

print('u11 time:', u11time2 - u11time1)

thepyguytime1 = time.time()

for i in range(5000):
    df = pd.read_clipboard()
    cols = list('QRST')
    aggCol = 'P'
    groupCols = []
    result = []
    for col in cols:
        groupCols.append(col)
        result.append(df.groupby(groupCols)[aggCol].agg(count='count').reset_index())
    result = pd.concat(result)[groupCols+['count']]

thepyguytime2 = time.time()

print('ThePyGuy time:', thepyguytime2 - thepyguytime1)

Gives:

u11 time: 120.2678394317627
ThePyGuy time: 153.01533579826355

My code is faster by 33 seconds...

But if you run it only a few times, e.g. 10 times, my code usually still wins, but not by such a big margin. For more iterations, e.g. 5000 times, my code performs much faster than your original for-loop code.

Conclusion is: My solution runs faster :)

about 3 years ago · Santiago Trujillo