Why do the `sklearn` and `statsmodels` implementations of OLS regression give different R^2?

I accidentally noticed that the OLS models implemented by sklearn and statsmodels yield different values of R^2 when no intercept is fitted. Otherwise they seem to agree. The following code:

import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm

np.random.seed(42)

N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Fit each library's OLS model with and without an intercept
sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)

# Compare the R^2 reported by each library
print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)

print(sklearn.__version__, statsmodels.__version__)

prints:

0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0

Where does the difference come from?

This question differs from "Different Linear Regression Coefficients with statsmodels and sklearn", because there sklearn.linear_model.LinearRegression (with an intercept) was fit to X prepared as for statsmodels.api.OLS.

This question differs from "Statsmodels: Calculate fitted values and R squared", because it addresses a difference between two Python packages (statsmodels and scikit-learn), while the linked question is about statsmodels and the common definition of R^2. Both are answered by the same answer; however, that issue has already been discussed here: "Does the same answer imply that the questions should be closed as duplicate?"

9 months ago · Santiago Trujillo
1 Answer

As pointed out by @user333700 in the comments, the definition of R^2 used by statsmodels' OLS implementation differs from the one used by scikit-learn.

From the documentation of the RegressionResults class (emphasis mine):

rsquared

R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.

From the documentation of LinearRegression.score():

score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
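The two definitions quoted above can be compared by hand. The following is a minimal sketch using only NumPy, assuming the same synthetic data as in the question; the variable names (ssr, centered_tss, uncentered_tss, r2_centered, r2_uncentered) are chosen here for illustration and are not taken from either library's source code. It fits the no-intercept model by least squares and evaluates R^2 with both the centered and the uncentered total sum of squares.

```python
import numpy as np

# Same synthetic data as in the question
np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

# Ordinary least squares without an intercept column
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ beta

ssr = np.sum(residuals ** 2)                # residual sum of squares
centered_tss = np.sum((Y - Y.mean()) ** 2)  # total sum of squares around the mean of Y
uncentered_tss = np.sum(Y ** 2)             # total sum of squares around zero

# Definition quoted from the scikit-learn docs: always the centered TSS
r2_centered = 1 - ssr / centered_tss
# Definition quoted from the statsmodels docs when the constant is omitted
r2_uncentered = 1 - ssr / uncentered_tss

print(r2_centered, r2_uncentered)
```

Because the uncentered total sum of squares is never smaller than the centered one, the uncentered R^2 is never below the centered R^2 for the same fit. With data whose mean is far from zero, as here, the gap is large, and the centered value can go negative when the no-intercept model fits worse than a constant prediction at the mean of Y.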

9 months ago · Santiago Trujillo