mlinsights.metrics¶
- mlinsights.metrics.correlations.non_linear_correlations(df, model, draws=5, minmax=False)[source]¶
Computes non-linear correlations.
- Parameters:
df – pandas.DataFrame or numpy.array
model – machine learned model used to compute the correlations
draws – number of bootstrap draws; the correlation is the average of the results obtained at each draw
minmax – if True, returns three matrices: correlations, min, max; if False, only the correlation matrix
- Returns:
see parameter minmax
\[cor(X_i, X_j) = \frac{cov(X_i, X_j)}{\sigma(X_i)\sigma(X_j)}\]If variables are centered, \(\mathbb{E}X_i=\mathbb{E}X_j=0\), it becomes:
\[cor(X_i, X_j) = \frac{\mathbb{E}(X_i X_j)} {\sqrt{\mathbb{E}X_i^2 \mathbb{E}X_j^2}}\]If rescaled, \(\mathbb{E}X_i^2=\mathbb{E}X_j^2=1\), then it becomes \(cor(X_i, X_j) = \mathbb{E}(X_i X_j)\). Let’s assume we try to find a coefficient such as \(\alpha_{ij}\) minimizes the standard deviation of noise \(\epsilon_{ij}\):
\[X_j = \alpha_{ij}X_i + \epsilon_{ij}\]It is as if coefficient \(\alpha_{ij}\) came from a linear regression which minimizes \(\mathbb{E}(X_j - \alpha_{ij}X_i)^2\). If variables \(X_i\), \(X_j\) are centered and rescaled: \(\alpha_{ij}^* = \mathbb{E}(X_i X_j) = cor(X_i, X_j)\). We extend that definition to a function \(f\) of parameter \(\omega\) defined as \(f(\omega, X) \rightarrow \mathbb{R}\). \(f\) is not linear anymore. Let’s assume parameter \(\omega^*\) minimizes quantity \(\min_\omega (X_j - f(\omega, X_i))^2\). Then \(X_j = \alpha_{ij} \frac{f(\omega^*, X_i)}{\alpha_{ij}} + \epsilon_{ij}\) and we choose \(\alpha_{ij}\) such that \(\mathbb{E}\left(\frac{f(\omega^*, X_i)^2}{\alpha_{ij}^2}\right) = 1\). Let’s define a non-linear correlation bounded by \(f\) as:
\[cor^f(X_i, X_j) = \sqrt{ \mathbb{E} (f(\omega, X_i)^2 )}\]We can verify that this value lies in the interval \([0,1]\). That also means that there is no negative correlation. \(f\) is a machine learned model and most of them usually overfit the data. The database is split into two parts: one is used to train the model, the other one to compute the correlation. The same split is used for every coefficient. The returned matrix is not necessarily symmetric.
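The derivation above can be checked numerically. The sketch below is a minimal illustration, not the library's implementation: it first verifies that, for centered and rescaled variables, the least-squares slope coincides with the Pearson correlation, then applies the train/test procedure with a LinearRegression standing in for \(f\). The synthetic data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
xi = rng.normal(size=1000)
xj = np.sin(xi) + rng.normal(scale=0.2, size=1000)

# center and rescale both variables: E[X] = 0, E[X^2] = 1
xi = (xi - xi.mean()) / xi.std()
xj = (xj - xj.mean()) / xj.std()

# linear case: the slope minimizing E(X_j - alpha X_i)^2 is E(X_i X_j),
# which coincides with the Pearson correlation
alpha = (xi * xj).mean()
corr = np.corrcoef(xi, xj)[0, 1]
print(alpha, corr)

# non-linear case: fit f on one half of the data, then evaluate
# sqrt(E[f(omega, X_i)^2]) on the held-out half
xi_tr, xi_te, xj_tr, xj_te = train_test_split(
    xi.reshape(-1, 1), xj, test_size=0.5, random_state=0)
model = LinearRegression().fit(xi_tr, xj_tr)
pred = model.predict(xi_te)
cor_f = np.sqrt((pred ** 2).mean())
print(cor_f)
```

Because the variables are rescaled before fitting, `cor_f` stays within \([0,1]\) up to sampling noise, and it is never negative, as noted above.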
Compute non linear correlations
The following example computes non-linear correlations on the Iris dataset with a RandomForestRegressor model.
<<<
import pandas
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from mlinsights.metrics import non_linear_correlations

iris = datasets.load_iris()
X = iris.data[:, :4]
df = pandas.DataFrame(X)
df.columns = ["X1", "X2", "X3", "X4"]
cor = non_linear_correlations(df, RandomForestRegressor())
print(cor)
>>>
          X1        X2        X3        X4
X1  0.999017  0.182334  0.879377  0.811052
X2  0.012349  0.993362  0.434102  0.307907
X3  0.864211  0.493669  0.999497  0.958225
X4  0.757984  0.624353  0.968231  0.999414