Régression sans inversion#

Ce notebook mesure le temps de calcul dans deux algorithmes pour résoudre une régression linéaire, le premier inverse un matrice, le second le fait sans inverser une matrice, le troisième reprend l’idée du second mais utilise une décomposition QR puis inverse la matrice R.

[1]:
%matplotlib inline
[2]:
import numpy.random as rnd

X = rnd.randn(1000, 7)
eps = rnd.randn(1000, 1) / 3
y = X.sum(axis=1).reshape((X.shape[0], 1)) + eps
y = y.ravel()
X.shape, y.shape, eps.shape
[2]:
((1000, 7), (1000,), (1000, 1))
[3]:
from mlstatpy.ml.matrices import linear_regression, gram_schmidt

beta1 = linear_regression(X, y, algo=None)
beta2 = linear_regression(X, y, algo="gram")
beta1, beta2
[3]:
(array([0.97915374, 1.00078055, 1.00537618, 1.01021414, 1.0003261 ,
        0.9944518 , 0.98742625]),
 array([0.97915374, 1.00078055, 1.00537618, 1.01021414, 1.0003261 ,
        0.9944518 , 0.98742625]))
[4]:
%timeit linear_regression(X, y, algo=None)
38.4 µs ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
[5]:
%timeit linear_regression(X, y, algo="gram")
310 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[6]:
%timeit linear_regression(X, y, algo="qr")
139 µs ± 8.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
[7]:
Xt = X.T
%timeit gram_schmidt(Xt)
210 µs ± 5.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Un exemple avec scikit-learn.

[8]:
from sklearn.linear_model import LinearRegression

clr = LinearRegression()
%timeit clr.fit(X, y)
443 µs ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Qui utilise la fonction lstsq:

[9]:
from numpy.linalg import lstsq

%timeit lstsq(X, y, rcond=None)
75.5 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Il serait sans doute possible d’optimiser les calculs en réduisant le nombre de copie et de transposées. La version utilisant une décomposition QR est assez rapide. Le code est là matrices.py. Pour dépasser numpy, il faut passer au C++. scikit-learn ajoute des étapes intermédiaires pour vérifier les données ce qui explique la longueur. On résume le tout par un graphique.

[10]:
from onnx_array_api.ext_test_case import measure_time
[11]:
stmts = [
    dict(name="lr_matrix", fct="linear_regression(X, y, algo=None)"),
    dict(name="lr_gram", fct="linear_regression(X, y, algo='gram')"),
    dict(name="lr_qr", fct="linear_regression(X, y, algo='qr')"),
    dict(name="gram", fct="gram_schmidt(Xt)"),
    dict(name="sklearn", fct="clr.fit(X, y)"),
    dict(name="lstsq", fct="lstsq(X, y)"),
]

memo = []
for size, dim in [
    (100, 10),
    (1000, 10),
    (10000, 10),
    (100, 20),
    (1000, 20),
    (10000, 20),
    (100, 50),
    (1000, 50),
]:
    print(size, dim)
    X = rnd.randn(size, dim)
    eps = rnd.randn(size, 1) / 3
    y = X.sum(axis=1).reshape((X.shape[0], 1)) + eps
    y = y.ravel()
    context = dict(
        linear_regression=linear_regression,
        Xt=X.T,
        X=X,
        y=y,
        gram_schmidt=gram_schmidt,
        clr=clr,
        lstsq=lambda X, y: lstsq(X, y, rcond=None),
    )

    for stmt in stmts:
        res = measure_time(
            stmt["fct"], number=20, repeat=20, div_by_number=True, context=context
        )
        res.update(stmt)
        res["size"] = size
        res["dim"] = dim
        memo.append(res)

import pandas

df = pandas.DataFrame(memo)
df.head()
100 10
1000 10
10000 10
100 20
1000 20
10000 20
100 50
1000 50
[11]:
average context_size deviation dim fct max_exec min_exec name number repeat size
0 0.000039 368 0.000019 10 linear_regression(X, y, algo=None) 0.000091 0.000018 lr_matrix 20 20 100
1 0.000365 368 0.000045 10 linear_regression(X, y, algo='gram') 0.000485 0.000312 lr_gram 20 20 100
2 0.000114 368 0.000031 10 linear_regression(X, y, algo='qr') 0.000223 0.000093 lr_qr 20 20 100
3 0.000229 368 0.000020 10 gram_schmidt(Xt) 0.000256 0.000197 gram 20 20 100
4 0.000403 368 0.000031 10 clr.fit(X, y) 0.000464 0.000346 sklearn 20 20 100
[12]:
piv = pandas.pivot_table(df, index=["size", "dim"], columns="name", values="average")
piv
[12]:
name gram lr_gram lr_matrix lr_qr lstsq sklearn
size dim
100 10 0.000229 0.000365 0.000039 0.000114 0.000081 0.000403
20 0.000442 0.000772 0.000057 0.000142 0.000143 0.000433
50 0.001384 0.002303 0.000115 0.000298 0.000619 0.000935
1000 10 0.000335 0.000498 0.000052 0.000168 0.000140 0.000633
20 0.000867 0.001197 0.000093 0.000335 0.000246 0.000641
50 0.003242 0.004482 0.000263 0.001220 0.000945 0.001545
10000 10 0.001434 0.001309 0.000234 0.002760 0.000551 0.001828
20 0.010212 0.010944 0.000293 0.005926 0.002128 0.005581
[13]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
piv[:6].plot(kind="bar", ax=ax[0])
piv[6:].plot(kind="bar", ax=ax[1])
ax[0].set_title("Régression Linéaire, size < 10000")
ax[1].set_title("Régression Linéaire, size >= 10000");
../../_images/notebooks_ml_regression_no_inversion_16_0.png

Streaming versions#

L’idée est différente ici puisqu’il s’agit de calculer toutes les régressions linéaires intermédiaires. Les algorithmes sont décrits par l’exposé Régression linéaire par morceaux.

[14]:
from mlstatpy.ml.matrices import (
    streaming_linear_regression,
    streaming_linear_regression_gram_schmidt,
)


def all_linear_regression(X, y):
    for i in range(X.shape[1], X.shape[0]):
        yield linear_regression(X[:i], y[:i])


stmts = [
    dict(name="lr_matrix", fct="list(all_linear_regression(X, y))"),
    dict(name="lr_st_mat", fct="list(streaming_linear_regression(X, y))"),
    dict(name="lr_st_gram", fct="list(streaming_linear_regression_gram_schmidt(X, y))"),
]

memo = []
for dim in (10,):
    for size in range(100, 3500, 500):
        print(size, dim)
        X = rnd.randn(size, dim)
        eps = rnd.randn(size, 1) / 3
        y = X.sum(axis=1).reshape((X.shape[0], 1)) + eps
        y = y.ravel()
        context = dict(
            X=X,
            y=y,
            all_linear_regression=all_linear_regression,
            streaming_linear_regression=streaming_linear_regression,
            streaming_linear_regression_gram_schmidt=streaming_linear_regression_gram_schmidt,
        )

        for stmt in stmts:
            if "gram" in stmt["name"]:
                nn = 1
                if size >= 1000:
                    continue
            else:
                nn = 5
            res = measure_time(
                stmt["fct"], number=nn, repeat=nn, div_by_number=True, context=context
            )
            res.update(stmt)
            res["size"] = size
            res["dim"] = dim
            memo.append(res)

import pandas

df = pandas.DataFrame(memo)
df.head()
100 10
600 10
1100 10
1600 10
2100 10
2600 10
3100 10
[14]:
average context_size deviation dim fct max_exec min_exec name number repeat size
0 0.002589 368 0.001263 10 list(all_linear_regression(X, y)) 0.005070 0.001753 lr_matrix 5 5 100
1 0.002621 368 0.000038 10 list(streaming_linear_regression(X, y)) 0.002688 0.002572 lr_st_mat 5 5 100
2 0.031022 368 0.000000 10 list(streaming_linear_regression_gram_schmidt(... 0.031022 0.031022 lr_st_gram 1 1 100
3 0.018594 368 0.000749 10 list(all_linear_regression(X, y)) 0.019532 0.017664 lr_matrix 5 5 600
4 0.022098 368 0.001805 10 list(streaming_linear_regression(X, y)) 0.024896 0.020070 lr_st_mat 5 5 600
[15]:
piv = pandas.pivot_table(df, index=["size"], columns="name", values="average")
piv
[15]:
name lr_matrix lr_st_gram lr_st_mat
size
100 0.002589 0.031022 0.002621
600 0.018594 0.204768 0.022098
1100 0.040404 NaN 0.034072
1600 0.062186 NaN 0.052658
2100 0.097438 NaN 0.060824
2600 0.128451 NaN 0.079594
3100 0.161074 NaN 0.090113
[16]:
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
piv[["lr_matrix", "lr_st_mat"]].plot(ax=ax[0])
piv.plot(ax=ax[1])
ax[0].set_title("Régression Linéaire streaming (all)\n10 features")
ax[1].set_title("Régression Linéaire no Gram-Schmidt\n10 features");
../../_images/notebooks_ml_regression_no_inversion_20_0.png

La version streaming devient plus intéressante à partir de 1000 observations, le coût en linéaire en N contrairement à la version classique qui est en \(N^2\). La version Gram-Schmidt devrait être réécrite en C++ pour proposer des temps comparables.

[17]: