Predictable t-SNE

t-SNE is not a transformer that can produce outputs for inputs other than the ones used to fit it. The proposed solution is to train a predictor afterwards which approximates the t-SNE outputs, so that the mapping can be applied to data the model never saw.
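
The mechanism can be sketched with plain scikit-learn: fit t-SNE on a training set, train a regressor to reproduce the t-SNE coordinates from the original features, then use that regressor to project points t-SNE never saw. The snippet below is only an illustration of the idea, not the exact implementation of PredictableTSNE; the estimator and its parameters are arbitrary choices.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

Xs, _ = load_digits(n_class=6, return_X_y=True)
Xs_train, Xs_test = train_test_split(Xs, random_state=0)

# 1. t-SNE only produces coordinates for the points it was fitted on
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(Xs_train)

# 2. a regressor learns to map the original features to those coordinates
reg = MLPRegressor(max_iter=500).fit(Xs_train, coords)

# 3. unlike t-SNE, the regressor can project unseen points
coords_test = reg.predict(Xs_test)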

t-SNE on MNIST

Let’s reuse part of the scikit-learn example Manifold learning on handwritten digits: Locally Linear Embedding, Isomap.

import numpy
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from mlinsights.mlmodel import PredictableTSNE


digits = datasets.load_digits(n_class=6)
Xd = digits.data
yd = digits.target
imgs = digits.images
n_samples, n_features = Xd.shape
n_samples, n_features
(1083, 64)

Let’s split the data into train and test sets.

X_train, X_test, y_train, y_test, imgs_train, imgs_test = train_test_split(
    Xd, yd, imgs
)

tsne = TSNE(n_components=2, init="pca", random_state=0)

X_train_tsne = tsne.fit_transform(X_train, y_train)
X_train_tsne.shape
(812, 2)
def plot_embedding(Xp, y, imgs, title=None, figsize=(12, 4)):
    x_min, x_max = numpy.min(Xp, 0), numpy.max(Xp, 0)
    X = (Xp - x_min) / (x_max - x_min)

    fig, ax = plt.subplots(1, 2, figsize=figsize)
    for i in range(X.shape[0]):
        ax[0].text(
            X[i, 0],
            X[i, 1],
            str(y[i]),
            color=plt.cm.Set1(y[i] / 10.0),
            fontdict={"weight": "bold", "size": 9},
        )

    if hasattr(offsetbox, "AnnotationBbox"):
        # only print thumbnails with matplotlib > 1.0
        shown_images = numpy.array([[1.0, 1.0]])  # just something big
        for i in range(X.shape[0]):
            dist = numpy.sum((X[i] - shown_images) ** 2, 1)
            if numpy.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = numpy.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(imgs[i], cmap=plt.cm.gray_r), X[i]
            )
            ax[0].add_artist(imagebox)
    ax[0].set_xticks([]), ax[0].set_yticks([])
    ax[1].plot(Xp[:, 0], Xp[:, 1], ".")
    if title is not None:
        ax[0].set_title(title)
    return ax


plot_embedding(X_train_tsne, y_train, imgs_train, "t-SNE embedding of the digits")
[figure: t-SNE embedding of the digits]

Repeatable t-SNE

We use the class PredictableTSNE, but it works with any other trainable transform too.

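For instance, a minimal sketch, assuming the transformer argument accepts any transform exposing fit_transform (the LocallyLinearEmbedding and KNeighborsRegressor choices here are arbitrary):

from sklearn.manifold import LocallyLinearEmbedding

# hypothetical variant: approximate a locally linear embedding instead of t-SNE
plle = PredictableTSNE(
    transformer=LocallyLinearEmbedding(n_components=2),
    estimator=KNeighborsRegressor(),
)
plle.fit(X_train, y_train)
X_test_lle = plle.transform(X_test)

The rest of the example keeps the default TSNE transformer and MLPRegressor estimator.
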
ptsne = PredictableTSNE()
ptsne.fit(X_train, y_train)
/home/xadupre/github/scikit-learn/sklearn/neural_network/_multilayer_perceptron.py:688: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
PredictableTSNE(estimator=MLPRegressor(), transformer=TSNE())
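
The ConvergenceWarning only says that the default MLPRegressor stopped after its 200 iterations without converging. If that matters, the iteration budget can be raised when building the wrapper (500 is an arbitrary value):

from sklearn.neural_network import MLPRegressor

ptsne_more = PredictableTSNE(estimator=MLPRegressor(max_iter=500))
ptsne_more.fit(X_train, y_train)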


X_train_tsne2 = ptsne.transform(X_train)
plot_embedding(X_train_tsne2, y_train, imgs_train, "Predictable t-SNE of the digits")
[figure: Predictable t-SNE of the digits]

The difference now is that the mapping can be applied to new data.

X_test_tsne2 = ptsne.transform(X_test)
plot_embedding(
    X_test_tsne2, y_test, imgs_test, "Predictable t-SNE on new digits on test database"
)
[figure: Predictable t-SNE on new digits on test database]

By default, the output data is normalized so that results are comparable across runs. The loss below is computed between the normalized output of t-SNE and its approximation by the predictor.

ptsne.loss_
0.01733241511855887
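
What this loss measures can be reproduced approximately with plain scikit-learn: normalize the t-SNE coordinates computed above, fit a regressor on them and take the mean squared error of its predictions on the same points. This is only an illustration, not the internals of PredictableTSNE; the scaler, regressor and its parameters are arbitrary choices.

from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# normalize the t-SNE coordinates of the training set computed earlier
target = StandardScaler().fit_transform(X_train_tsne)
# approximate them with a regressor and measure the error on the same data
approx = MLPRegressor(max_iter=500).fit(X_train, target).predict(X_train)
mean_squared_error(target, approx)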

Repeatable t-SNE with another predictor

The predictor is a MLPRegressor (https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html).


ptsne.estimator_
MLPRegressor()


Let’s replace it with a KNeighborsRegressor and use a StandardScaler as normalizer.

ptsne_knn = PredictableTSNE(
    normalizer=StandardScaler(), estimator=KNeighborsRegressor()
)
ptsne_knn.fit(X_train, y_train)
PredictableTSNE(estimator=KNeighborsRegressor(), normalizer=StandardScaler(),
                transformer=TSNE())


X_train_tsne2 = ptsne_knn.transform(X_train)
plot_embedding(
    X_train_tsne2,
    y_train,
    imgs_train,
    "Predictable t-SNE of the digits\nStandardScaler+KNeighborsRegressor",
)
[figure: Predictable t-SNE of the digits, StandardScaler + KNeighborsRegressor]

X_test_tsne2 = ptsne_knn.transform(X_test)
plot_embedding(
    X_test_tsne2,
    y_test,
    imgs_test,
    "Predictable t-SNE on new digits\nStandardScaler+KNeighborsRegressor",
)
[figure: Predictable t-SNE on new digits, StandardScaler + KNeighborsRegressor]

The loss is smaller, so this model seems to work better, but since the loss is evaluated on the training dataset, it is only a check that the approximation error is not too large.

ptsne_knn.loss_
0.005313852336257696
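
Since t-SNE coordinates are only available for the training points, a less optimistic estimate of the approximation error can be sketched by splitting those points again: fit the predictor on one part and measure the error on the held-out part. The scaler, predictor and split below are arbitrary choices, not what PredictableTSNE does internally.

from sklearn.metrics import mean_squared_error

target = StandardScaler().fit_transform(X_train_tsne)
X_a, X_b, t_a, t_b = train_test_split(X_train, target, random_state=0)
knn = KNeighborsRegressor().fit(X_a, t_a)
mean_squared_error(t_b, knn.predict(X_b))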

Total running time of the script: (0 minutes 6.580 seconds)
