'nlargest' returning weird results

I am trying to fetch the top n results per group using 'nlargest', but the behavior seems odd to me. It would be great if someone could help me understand why it behaves this way.

import pandas as pd

filter = pd.DataFrame([['user1','item2',2,1],
                   ['user1','item1',2,0.666667],
                   ['user1','item3',2,0.500000]],
                  columns=['user_id','item_id','num_transactions','RCP'])

sort_RCP_df = (
        filter.set_index("item_id")
        .groupby(["user_id"])["RCP"]
        .nlargest(2)
        .reset_index()
)
print(sort_RCP_df)

user_id item_id RCP
user1   item2   1.000000
user1   item1   0.666667

If I keep nlargest(2), I get the correct output, but if I change the value to 3, the output only contains the columns item_id and RCP.

filter = pd.DataFrame([['user1','item2',2,1],
                   ['user1','item1',2,0.666667],
                   ['user1','item3',2,0.500000]],
                  columns=['user_id','item_id','num_transactions','RCP'])

sort_RCP_df = (
        filter.set_index("item_id")
        .groupby(["user_id"])["RCP"]
        .nlargest(3)
        .reset_index()
)
print(sort_RCP_df)

item_id RCP
item2   1.000000
item1   0.666667
item3   0.500000

Why does the column 'user_id' not appear with nlargest = 3?

And if this is the expected behavior, is there a way I can make 'user_id' part of the output as well?

9 months ago · Santiago Trujillo
1 Answer

The docs hint at the cause of the issue: the Notes for Series.nlargest explicitly call out a performance consideration:

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

If you look deep into the code, Series.nlargest/Series.nsmallest are handled by the SelectNSeries class in pandas/core/algorithms. This class has different behavior depending upon n relative to the length of the Series:

# slow method
if n >= len(self.obj):
    ascending = method == "nsmallest"
    return dropped.sort_values(ascending=ascending).head(n)

# fast method
arr, new_dtype = _ensure_data(dropped.values)
if method == "nlargest":
    arr = -arr
    if is_integer_dtype(new_dtype):
        # GH 21426: ensure reverse ordering at boundaries
        arr -= 1

...

The key takeaway is that when n >= len(Series), the call does not use the normal selection algorithm to find the largest/smallest values and instead falls back to sort_values + head. Substituting this logic for your nlargest call by hand reproduces the output, shown here for n = 2:

sort_RCP_df = (
        filter.set_index("item_id")
        .groupby(["user_id"])["RCP"]
        .apply(lambda s: s.sort_values(ascending=False).head(2))
        .reset_index()
)

#  user_id item_id       RCP
#0   user1   item2  1.000000
#1   user1   item1  0.666667
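
As for keeping 'user_id' in the output regardless of n, one possible workaround (a minimal sketch, not the only approach) is to avoid nlargest on the grouped Series altogether: sort the frame by RCP and take the first n rows of each group with groupby(...).head(n). Because head works as a row filter on the original frame, 'user_id' stays an ordinary column. The names df, n and top_n are illustrative and not part of the question.

```python
import pandas as pd

# Same sample data as in the question; named `df` here (an illustrative
# choice) so the built-in `filter` is not shadowed.
df = pd.DataFrame(
    [["user1", "item2", 2, 1.0],
     ["user1", "item1", 2, 0.666667],
     ["user1", "item3", 2, 0.5]],
    columns=["user_id", "item_id", "num_transactions", "RCP"],
)

n = 3  # deliberately equal to the group size, the case from the question

top_n = (
    df[["user_id", "item_id", "RCP"]]
    .sort_values("RCP", ascending=False)  # largest RCP first
    .groupby("user_id")
    .head(n)                              # first n rows of each group
    .reset_index(drop=True)
)
print(top_n)
```

With this approach the columns are user_id, item_id and RCP whether n is smaller than, equal to, or larger than the group size. Ties in RCP are broken by whatever order sort_values produces, which may differ from nlargest.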
9 months ago · Santiago Trujillo