Parallelization of a dot product with processes (joblib)

Using processes to parallelize a dot product is not a very good solution: processes do not share memory, so they need to exchange data. This parallelisation is efficient only if the ratio of exchanged data to computation time is low. The cost of creating new processes is also significant. joblib is used by scikit-learn.
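The ratio mentioned above can be estimated without starting any process. The sketch below is only an illustration: it uses pickle as a stand-in for the data exchange between processes (the mechanism joblib really uses may differ), and the function name transfer_vs_compute is hypothetical. It compares the time spent serializing two vectors with the time spent computing their dot product.

```python
import pickle
import time

import numpy


def transfer_vs_compute(n, repeat=20):
    """Compares serializing two vectors of size n with computing their dot product."""
    va = numpy.random.randn(n)
    vb = numpy.random.randn(n)

    # Cost of a round trip through pickle, used here as a proxy
    # for sending the data to another process (assumption).
    begin = time.perf_counter()
    for _ in range(repeat):
        payload = pickle.dumps((va, vb), protocol=pickle.HIGHEST_PROTOCOL)
        pickle.loads(payload)
    t_pickle = (time.perf_counter() - begin) / repeat

    # Cost of the computation itself.
    begin = time.perf_counter()
    for _ in range(repeat):
        numpy.dot(va, vb)
    t_dot = (time.perf_counter() - begin) / repeat

    return {"N": n, "bytes": len(payload), "pickle": t_pickle, "dot": t_dot}


for n in [1000, 100000]:
    print(transfer_vs_compute(n))
```

A dot product does one multiplication and one addition per pair of values, so very little computation is available to hide the cost of moving the data.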

import numpy
from tqdm import tqdm
from pandas import DataFrame
import matplotlib.pyplot as plt
from joblib import Parallel, delayed
from teachcompute.ext_test_case import measure_time, unit_test_going


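def parallel_dot_joblib(va, vb, max_workers=2):
    # Every worker receives k chunks.
    k = 2
    if va.shape[0] % max_workers != 0:
        raise RuntimeError("size must be a multiple of max_workers.")

    # array_split accepts uneven splits, so no element is dropped
    # when the size is not a multiple of max_workers * k.
    chunks_a = numpy.array_split(va, max_workers * k)
    chunks_b = numpy.array_split(vb, max_workers * k)

    # Every chunk is sent to a process, the partial results are summed.
    r = Parallel(n_jobs=max_workers, backend="loky")(
        delayed(numpy.dot)(a, b) for a, b in zip(chunks_a, chunks_b)
    )
    return sum(r)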

We check that it returns the same values.

va = numpy.random.randn(100).astype(numpy.float64)
vb = numpy.random.randn(100).astype(numpy.float64)
print(parallel_dot_joblib(va, vb), numpy.dot(va, vb))
3.95418605643733 3.9541860564373317
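The two values agree except for the last digits. A likely explanation is that adding partial results changes the order of the floating point additions. The sketch below reproduces the chunked sum without any process (chunked_dot is a hypothetical helper, not part of the original script) and compares the results with numpy.isclose instead of a strict equality.

```python
import numpy


def chunked_dot(va, vb, n_chunks=4):
    """Sums the dot products of n_chunks consecutive slices (no parallelism)."""
    parts = zip(numpy.array_split(va, n_chunks), numpy.array_split(vb, n_chunks))
    return sum(numpy.dot(a, b) for a, b in parts)


rng = numpy.random.default_rng(0)
va = rng.standard_normal(100)
vb = rng.standard_normal(100)
expected = numpy.dot(va, vb)
got = chunked_dot(va, vb)

# The comparison tolerates rounding differences between both summation orders.
print(got, expected, numpy.isclose(got, expected))
```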

Let’s benchmark.

if unit_test_going():
    tries = [10, 20]
else:
    tries = [1000, 2000]

res = []
for n in tqdm(tries):
    va = numpy.random.randn(n).astype(numpy.float64)
    vb = numpy.random.randn(n).astype(numpy.float64)

    m1 = measure_time(
        "dot(va, vb, 2)", dict(va=va, vb=vb, dot=parallel_dot_joblib), repeat=1
    )
    m2 = measure_time("dot(va, vb)", dict(va=va, vb=vb, dot=numpy.dot))
    res.append({"N": n, "numpy.dot": m2["average"], "joblib": m1["average"]})

df = DataFrame(res).set_index("N")
print(df)
df.plot(logy=True, logx=True)
plt.title("Parallel / numpy dot")
[Figure: Parallel / numpy dot]
      numpy.dot    joblib
N
1000   0.000003  0.010030
2000   0.000002  0.008846

Text(0.5, 1.0, 'Parallel / numpy dot')

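The parallelisation is inefficient: for these sizes, the joblib version takes about 10 ms while numpy.dot takes a few microseconds. The cost of starting processes and exchanging data dominates the computation.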
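A back-of-the-envelope model gives an idea of how far these sizes are from the point where processes could pay off. It assumes, as a simplification, that a parallel call costs a fixed overhead plus the sequential time divided by the number of workers, and it ignores the cost of exchanging data, which grows with N. The function break_even_size is hypothetical, and the figures are rough values read from the table above, not new measurements.

```python
def break_even_size(overhead, time_per_element, workers):
    """Returns n such that overhead + n * c / workers == n * c (simplified model)."""
    if workers < 2:
        raise ValueError("workers must be at least 2.")
    # Time saved on every element when the work is split among the workers.
    saved_per_element = time_per_element * (1 - 1 / workers)
    return overhead / saved_per_element


# Assumptions taken from the table above: about 9 ms for one joblib call,
# about 2 microseconds for numpy.dot on 2000 values.
overhead = 0.009
time_per_element = 0.000002 / 2000
print(f"{break_even_size(overhead, time_per_element, workers=2):.1e}")
```

Under these assumptions, the vectors would need around ten million values or more before two processes could compensate for the overhead, and the real threshold is probably higher since the data exchange is not counted.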

Total running time of the script: (0 minutes 3.460 seconds)
