I am trying to fetch the top n results using nlargest, but the behavior seems odd to me. It would be great if someone could help me understand why it behaves like this.
import pandas as pd

filter = pd.DataFrame(
    [["user1", "item2", 2, 1],
     ["user1", "item1", 2, 0.666667],
     ["user1", "item3", 2, 0.500000]],
    columns=["user_id", "item_id", "num_transactions", "RCP"],
)
sort_RCP_df = (
    filter.set_index("item_id")
    .groupby(["user_id"])["RCP"]
    .nlargest(2)
    .reset_index()
)
print(sort_RCP_df)
  user_id item_id       RCP
0   user1   item2  1.000000
1   user1   item1  0.666667
If I keep nlargest(2), I get the expected output, but if I change the value to 3, I only get the columns item_id and RCP.
filter = pd.DataFrame(
    [["user1", "item2", 2, 1],
     ["user1", "item1", 2, 0.666667],
     ["user1", "item3", 2, 0.500000]],
    columns=["user_id", "item_id", "num_transactions", "RCP"],
)
sort_RCP_df = (
    filter.set_index("item_id")
    .groupby(["user_id"])["RCP"]
    .nlargest(3)
    .reset_index()
)
print(sort_RCP_df)
  item_id       RCP
0   item2  1.000000
1   item1  0.666667
2   item3  0.500000
Why does the column 'user_id' not appear with nlargest(3)?
And if this is the expected behavior, is there a way I can make 'user_id' part of the output as well?
The docs hint at the cause of the issue: in the Notes section they explicitly call out a performance consideration:

    Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.
If you look deep into the code, Series.nlargest/Series.nsmallest are handled by the SelectNSeries class in pandas/core/algorithms. This class has different behavior depending upon n relative to the length of the Series:
# slow method
if n >= len(self.obj):
    ascending = method == "nsmallest"
    return dropped.sort_values(ascending=ascending).head(n)

# fast method
arr, new_dtype = _ensure_data(dropped.values)
if method == "nlargest":
    arr = -arr
    if is_integer_dtype(new_dtype):
        # GH 21426: ensure reverse ordering at boundaries
        arr -= 1
...
The key take-away here is that when n >= the length of the Series, the call doesn't use the normal selection algorithm to calculate the largest/smallest values and instead just computes them with sort_values + head. We can see this matches your output if we substitute your nlargest call with that logic.
sort_RCP_df = (
    filter.set_index("item_id")
    .groupby(["user_id"])["RCP"]
    .apply(lambda s: s.sort_values(ascending=False).head(2))
    .reset_index()
)
#   user_id item_id       RCP
# 0   user1   item2  1.000000
# 1   user1   item1  0.666667
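To answer your second question directly: one robust way to keep user_id in the output is to skip nlargest here and instead sort the whole frame once, then take the first n rows per group with groupby(...).head(n). Because head on a groupby is a row filter, it returns the original columns untouched, so the group key always survives no matter how n compares to the group size. A minimal sketch (df stands in for your filter frame):

```python
import pandas as pd

# Same data as in the question; "df" stands in for the "filter" frame.
df = pd.DataFrame(
    [["user1", "item2", 2, 1.0],
     ["user1", "item1", 2, 0.666667],
     ["user1", "item3", 2, 0.500000]],
    columns=["user_id", "item_id", "num_transactions", "RCP"],
)

# Sort once globally, then keep the first n rows of each group.
# groupby(...).head(n) filters rows and leaves every column in place,
# so user_id is retained regardless of n.
top3 = (
    df.sort_values("RCP", ascending=False)
      .groupby("user_id")
      .head(3)
      .reset_index(drop=True)
)
print(top3)
```

Unlike the nlargest path, this behaves the same for any n, at the cost of sorting the full frame rather than just selecting the top rows.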