Logistic regression in 2D
Predict the color of a wine from its components.
[1]:
%matplotlib inline
[4]:
from teachpyx.datasets import load_wines_dataset
data = load_wines_dataset()
X = data.drop(["quality", "color"], axis=1)
y = data["color"]
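Before fitting anything, a quick look at the class balance helps set expectations (a small check added here, not part of the original notebook):
[ ]:
# count the red and white wines in the dataset
data["color"].value_counts()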
[5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
[6]:
from statsmodels.discrete.discrete_model import Logit
model = Logit(y_train == "white", X_train)
res = model.fit()
Optimization terminated successfully.
Current function value: 0.048414
Iterations 11
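Note that statsmodels' Logit does not add an intercept automatically, so the model above is fitted without a constant term. A variant with an intercept would extend the design matrix with sm.add_constant (a sketch, not executed here):
[ ]:
import statsmodels.api as sm

# add an explicit constant column before refitting
model_c = Logit(y_train == "white", sm.add_constant(X_train))
res_c = model_c.fit()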
[7]:
res.summary2()
[7]:
| Model: | Logit | Method: | MLE |
|---|---|---|---|
| Dependent Variable: | color | Pseudo R-squared: | 0.913 |
| Date: | 2024-01-23 00:52 | AIC: | 493.7476 |
| No. Observations: | 4872 | BIC: | 565.1515 |
| Df Model: | 10 | Log-Likelihood: | -235.87 |
| Df Residuals: | 4861 | LL-Null: | -2717.5 |
| Converged: | 1.0000 | LLR p-value: | 0.0000 |
| No. Iterations: | 11.0000 | Scale: | 1.0000 |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| fixed_acidity | -1.4541 | 0.1515 | -9.5981 | 0.0000 | -1.7511 | -1.1572 |
| volatile_acidity | -11.3716 | 0.9995 | -11.3771 | 0.0000 | -13.3306 | -9.4126 |
| citric_acid | 1.7492 | 1.1280 | 1.5507 | 0.1210 | -0.4616 | 3.9599 |
| residual_sugar | 0.1246 | 0.0600 | 2.0756 | 0.0379 | 0.0069 | 0.2422 |
| chlorides | -32.7390 | 3.9560 | -8.2758 | 0.0000 | -40.4926 | -24.9854 |
| free_sulfur_dioxide | -0.0505 | 0.0134 | -3.7724 | 0.0002 | -0.0768 | -0.0243 |
| total_sulfur_dioxide | 0.0632 | 0.0050 | 12.6896 | 0.0000 | 0.0534 | 0.0730 |
| density | 42.0110 | 4.2093 | 9.9806 | 0.0000 | 33.7610 | 50.2610 |
| pH | -8.7417 | 0.9800 | -8.9204 | 0.0000 | -10.6624 | -6.8210 |
| sulphates | -8.8918 | 1.0237 | -8.6857 | 0.0000 | -10.8983 | -6.8853 |
| alcohol | 0.4150 | 0.1233 | 3.3656 | 0.0008 | 0.1733 | 0.6567 |
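The coefficients are on the log-odds scale; exponentiating them turns each into an odds ratio, which is often easier to read (a quick sketch using the fitted results above):
[ ]:
import numpy

# multiplicative change in the odds of "white" for a one-unit increase of each feature
numpy.exp(res.params)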
We keep only the first two features, fixed_acidity and volatile_acidity.
[8]:
X_train2 = X_train.iloc[:, :2]
[9]:
df = X_train2.copy()  # X_train2 is already a DataFrame; no need to rewrap it
df["y"] = y_train
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
df[df.y == "white"].plot(
x="fixed_acidity", y="volatile_acidity", ax=ax, kind="scatter", label="white"
)
df[df.y == "red"].plot(
x="fixed_acidity",
y="volatile_acidity",
ax=ax,
kind="scatter",
label="red",
color="red",
s=2,
)
ax.set_title("Vins rouges et white selon deux composantes");
[10]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train2, y_train == "white")
[10]:
LogisticRegression()
[11]:
model.coef_, model.intercept_
[11]:
(array([[ -1.11120776, -11.79383309]]), array([13.83313405]))
We draw this line on the plot. The decision boundary is the set of points where model.coef_[0, 0] * x + model.coef_[0, 1] * y + model.intercept_ = 0; solving for y at two values of x gives two points to join.
[12]:
# the boundary satisfies coef[0]*x + coef[1]*y + intercept = 0,
# hence y = -(coef[0]*x + intercept) / coef[1]
x0 = 3
y0 = -(model.coef_[0, 0] * x0 + model.intercept_) / model.coef_[0, 1]
x1 = 14
y1 = -(model.coef_[0, 0] * x1 + model.intercept_) / model.coef_[0, 1]
x0, y0, x1, y1
[12]:
(3, array([0.89025431]), 14, array([-0.14615898]))
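As a sanity check (a small sketch added here), both endpoints should lie on the decision boundary, where the decision function is zero:
[ ]:
import pandas

# both points sit on the fitted line, so the scores should be close to 0
pts = pandas.DataFrame(
    {"fixed_acidity": [x0, x1], "volatile_acidity": [y0[0], y1[0]]}
)
model.decision_function(pts)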
[13]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
df[df.y == "white"].plot(
x="fixed_acidity", y="volatile_acidity", ax=ax, kind="scatter", label="white"
)
df[df.y == "red"].plot(
x="fixed_acidity",
y="volatile_acidity",
ax=ax,
kind="scatter",
label="red",
color="red",
s=2,
)
ax.plot(
[x0, x1],
[y0, y1],
"y--",
lw=4,
label="frontière trouvée\npar la régression\nlogistique",
)
ax.legend()
ax.set_title("Vins rouges et blancs\nselon deux composantes");