Question

0

132

Vistas

NumPy maxima of groups defined by a label array

I have two arrays, one is a list of values and one is a list of IDs corresponding to each value. Some IDs have multiple values. I want to create a new array that contains the maximum value recorded for each id, which will have a length equal to the number of unique ids.

Example using a for loop:

import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

uniq_ids = np.unique(ids)
maximums = np.ones_like(uniq_ids) * np.nan

for i, id in enumerate(uniq_ids):
    maximums[i] = np.max(values[np.where(ids == id)])

print(uniq_ids)
print(maximums)

[0 1 3 5 6]
[5. 3. 6. 8. 8.]

Is it possible to vectorize this so it runs fast? I'm imagining a one-liner that can create the "maximums" array using only NumPy functions, but I haven't been able to come up with anything that works.

over 3 years ago · Santiago Trujillo

5 Respuestas

Responde la pregunta

0

Here's a solution, which, although not 100% vectorized, (per my benchmarks) takes about half the time as your does (using your sample data). The performance improvement probably becomes more drastic with more data:

maximums = [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]

Output:

>>> maximums
[5, 3, 6, 8, 8]

over 3 years ago · Santiago Trujillo Denunciar

0

In trying to visualize the problem:

In [82]: [np.where(ids==id) for id in uniq_ids]
Out[82]: 
[(array([0]),),
 (array([1]),),
 (array([2, 3, 4, 5]),),
 (array([6]),),
 (array([7, 8, 9]),)]

unique can also return:

In [83]: np.unique(ids, return_inverse=True)
Out[83]: (array([0, 1, 3, 5, 6]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))

Which is a variante on what richardec produced:

In [88]: [a for a in np.split(ids, np.arange(1, ids.shape[0])[(np.diff(ids) !=
    ...: 0)])]
Out[88]: [array([0]), array([1]), array([3, 3, 3, 3]), array([5]), array([6, 6, 6])]

That inverse is also produced by doing where on all == at once:

In [90]: ids[:,None] == uniq_ids
Out[90]: 
array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False],
       [False, False, False, False,  True],
       [False, False, False, False,  True],
       [False, False, False, False,  True]])
In [91]: np.nonzero(ids[:,None] == uniq_ids)
Out[91]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))

I'm still thinking through this ...

over 3 years ago · Santiago Trujillo Denunciar

0

EDIT: I'll leave this up as an example of why we can't always use np.vectorize() to make everything magically faster:

One solution is to use numpy's vectorize function:

import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

def my_func(id):
    return np.max(values[np.where(ids==id)])

vector_func = np.vectorize(my_func)
maximums = vector_func(np.unique(ids))

which returns

array([5, 3, 6, 8, 8])

But as for speed, your version has about the same performance when we use

values = np.array([random.randint(1, 100) for i in range(1000000)])
ids = []
for i in range(100000):
    r = random.randint(1, 4)
    if r == 3:
        for x in range(3):
            ids.append(i)
    elif r == 2:
        for x in range(4):
            ids.append(i)
    else:
        ids.append(i)
ids = np.array(ids)

It's about 12 seconds per execution.

over 3 years ago · Santiago Trujillo Denunciar

0

With pandas:

import pandas as pd

def with_pandas(ids, vals):
    df = pd.DataFrame({'ids': ids, 'vals': values})
    return df.groupby('ids')['vals'].max().to_numpy()

Timing:

import numpy as np

values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)

%timeit with_pandas(ids, values)
692 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

over 3 years ago · Santiago Trujillo Denunciar

0

Responde la pregunta

Encuentra empleos remotos

Accepted Answer · 2022-04-05T15:49:03.227Z

np.lexsort sorts by multiple columns. However, this is not compulsory. You can sort ids first and then choose maximum item of each divided group using numpy.maximum.reduceat

def mathfux(values, ids, return_groups=False):
    argidx = np.argsort(ids) #70% time
    ids_sort, values_sort = ids[argidx], values[argidx] #4% time
    div_points = np.r_[0, np.flatnonzero(np.diff(ids_sort)) + 1] #11% time (the most part for np.flatnonzero)
    if return_groups: 
        return ids[div_points], np.maximum.reduceat(values_sort, div_points)
    else: return np.maximum.reduceat(values_sort, div_points)

mathfux(values, ids, return_groups=True)
>>> (array([0, 1, 3, 5, 6]), array([5, 3, 6, 8, 8]))
mathfux(values, ids)
>>> mathfux(values, ids)
array([5, 3, 6, 8, 8])

Usually, some parts of numpy codes could be optimised further in numba. Note that np.argsort is a bottleneck in majority of groupby problems which can't be replaced by any other method. It is unlikely to be improved soon in numba or numpy. So you are reaching an optimal performance here and can't do much in further optimisations.

0

132

NumPy maxima of groups defined by a label array

5 Respuestas

0

0

0

0

0

Encuentra empleos remotos

Andres GPT