.. _l-gridsearch-cache:

Caching algorithm for a GridSearchCV
====================================

.. index:: grid search, cache, scikit-learn, joblib

.. contents::
    :local:

Ideas
+++++

The goal is to measure the impact of using a cache
while optimizing a pipeline:

::

    [
        ('scale', MinMaxScaler()),
        ('pca', PCA(2)),
        ('poly', PolynomialFeatures()),
        ('bins', KBinsDiscretizer()),
        ('lr', LogisticRegression(solver='liblinear'))
    ]

With the following parameters:

::

    params_grid = {
        'scale__feature_range': [(0, 1), (-1, 1)],
        'pca__n_components': [2, 4],
        'poly__degree': [2, 3],
        'bins__n_bins': [5],
        'bins__encode': ["onehot-dense", "ordinal"],
        'lr__penalty': ['l1', 'l2'],
    }

The benchmark compares different ways to speed up the optimization
by caching. One option is not implemented in :epkg:`scikit-learn`:
``PipelineCache`` implements a cache in memory, as opposed to
:epkg:`joblib`, which stores everything on disk.
The in-memory implementation is faster when the training
runs with one process. :epkg:`joblib` does a better job
when the number of jobs and processes is higher,
even though it may store a large amount of data.

Graphs
++++++

.. plot::

    import matplotlib.pyplot as plt
    import pandas
    from pymlbenchmark.plotting import plot_bench_results

    name = "../../scikit-learn/results/bench_plot_gridsearch_cache.csv"
    df = pandas.read_csv(name)
    plt.close('all')

    plot_bench_results(df, row_cols=['N'], col_cols=['n_jobs'],
                       x_value='dim', hue_cols=['test'],
                       cmp_col_values='test',
                       title="GridSearchCV\nBenchmark caching strategies")
    plt.show()

.. plot::

    import matplotlib.pyplot as plt
    import pandas
    from pymlbenchmark.plotting import plot_bench_xtime

    name = "../../scikit-learn/results/bench_plot_gridsearch_cache.csv"
    df = pandas.read_csv(name)
    plt.close('all')

    plot_bench_xtime(df, row_cols=['n_jobs'], hue_cols=['N'],
                     x_value='mean', cmp_col_values='test',
                     title="GridSearchCV\nBenchmark caching strategies")
    plt.show()

Machine used to run the test
++++++++++++++++++++++++++++

.. runpython::
    :rst:
    :warningout: RuntimeWarning
    :showcode:

    import os
    from pyquickhelper.pandashelper import df2rst
    import pandas

    name = os.path.join(
        __WD__, "../../scikit-learn/results/bench_plot_gridsearch_cache.time.csv")
    df = pandas.read_csv(name)
    print(df2rst(df, number_format=4))

Raw results
+++++++++++

:download:`bench_plot_gridsearch_cache.csv <../../scikit-learn/results/bench_plot_gridsearch_cache.csv>`

.. runpython::
    :rst:
    :warningout: RuntimeWarning
    :showcode:

    import os
    from pyquickhelper.pandashelper import df2rst
    import pandas

    name = os.path.join(
        __WD__, "../../scikit-learn/results/bench_plot_gridsearch_cache.csv")
    df = pandas.read_csv(name)
    print(df2rst(df, number_format=4))

Benchmark code
++++++++++++++

.. literalinclude:: ../../scikit-learn/bench_plot_gridsearch_cache.py
    :language: python
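
For readers who want to try the on-disk option quickly, here is a minimal sketch of a grid search over the pipeline and grid from the *Ideas* section, with the :epkg:`joblib` cache enabled through the ``memory`` parameter of ``Pipeline``. It is not the benchmark script itself. The synthetic dataset, ``cv=3``, ``n_jobs=1``, the temporary cache directory, and the helper name ``build_pipeline`` are assumptions made for illustration only.

```python
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    KBinsDiscretizer, MinMaxScaler, PolynomialFeatures)


def build_pipeline(memory=None):
    # Hypothetical helper: same steps as in the Ideas section.
    # memory=None disables caching; a directory path makes Pipeline
    # cache the fitted transformers on disk through joblib.
    return Pipeline([
        ('scale', MinMaxScaler()),
        ('pca', PCA(2)),
        ('poly', PolynomialFeatures()),
        ('bins', KBinsDiscretizer()),
        ('lr', LogisticRegression(solver='liblinear')),
    ], memory=memory)


params_grid = {
    'scale__feature_range': [(0, 1), (-1, 1)],
    'pca__n_components': [2, 4],
    'poly__degree': [2, 3],
    'bins__n_bins': [5],
    'bins__encode': ["onehot-dense", "ordinal"],
    'lr__penalty': ['l1', 'l2'],
}

# Assumption: a small synthetic dataset, only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

cache_dir = tempfile.mkdtemp()
try:
    search = GridSearchCV(build_pipeline(memory=cache_dir),
                          params_grid, cv=3, n_jobs=1)
    search.fit(X, y)
finally:
    # The cache can grow large; remove it once the search is done.
    shutil.rmtree(cache_dir, ignore_errors=True)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

Many candidates in the grid share the same leading steps (for example the same ``scale`` and ``pca`` settings), which is why reusing fitted transformers can save time. Passing ``memory=None`` to the helper gives the uncached baseline to compare against.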