.. _fasterpolynomialfeaturesrst: ========================== Faster Polynomial Features ========================== .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/sklearn/faster_polynomial_features.ipynb|*` .. code:: ipython3 from jyquickhelper import add_notebook_menu add_notebook_menu() .. contents:: :local: .. code:: ipython3 %matplotlib inline Polynomial Features ------------------- The current implementation of `PolynomialFeatures `__ (0.20.2) implements a term by term product for each pair :math:`X_i, X_j` of features where :math:`i \leqslant j` which is not the most efficient way to do it. .. code:: ipython3 import numpy.random X = numpy.random.random((100, 5)) .. code:: ipython3 from sklearn.preprocessing import PolynomialFeatures poly = PolynomialFeatures(degree=2) Xpoly = poly.fit_transform(X) poly.get_feature_names() .. parsed-literal:: ['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x2^2', 'x2 x3', 'x2 x4', 'x3^2', 'x3 x4', 'x4^2'] .. code:: ipython3 %timeit poly.transform(X) .. parsed-literal:: 114 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) The class `ExtendedFeatures `__ implements a different way to compute the polynomial features as it tries to reduce the number of calls to numpy by using broacasted vector multplications. .. code:: ipython3 from mlinsights.mlmodel import ExtendedFeatures ext = ExtendedFeatures(poly_degree=2) Xpoly = ext.fit_transform(X) ext.get_feature_names() .. parsed-literal:: ['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x2^2', 'x2 x3', 'x2 x4', 'x3^2', 'x3 x4', 'x4^2'] .. code:: ipython3 %timeit ext.transform(X) .. parsed-literal:: 68.7 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) Comparison with 5 features -------------------------- .. code:: ipython3 from cpyquickhelper.numbers import measure_time .. code:: ipython3 res = [] for n in [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000]: X = numpy.random.random((n, 5)) poly.fit(X) ext.fit(X) r1 = measure_time("poly.transform(X)", context=dict(X=X, poly=poly), repeat=5, number=10, div_by_number=True) r2 = measure_time("ext.transform(X)", context=dict(X=X, ext=ext), repeat=5, number=10, div_by_number=True) r3 = measure_time("poly.fit_transform(X)", context=dict(X=X, poly=poly), repeat=5, number=10, div_by_number=True) r4 = measure_time("ext.fit_transform(X)", context=dict(X=X, ext=ext), repeat=5, number=10, div_by_number=True) r1["name"] = "poly" r2["name"] = "ext" r3["name"] = "poly+fit" r4["name"] = "ext+fit" r1["size"] = n r2["size"] = n r3["size"] = n r4["size"] = n res.append(r1) res.append(r2) res.append(r3) res.append(r4) import pandas df = pandas.DataFrame(res) df.tail() .. raw:: html

	average	deviation	min_exec	max_exec	repeat	number	context_size	name	size
63	0.037830	0.005577	0.031248	0.044832	5	10	240	ext+fit	100000
64	0.072671	0.005360	0.067559	0.082539	5	10	240	poly	200000
65	0.075712	0.018271	0.060476	0.100143	5	10	240	ext	200000
66	0.106755	0.019861	0.079880	0.139184	5	10	240	poly+fit	200000
67	0.074090	0.009142	0.063925	0.085899	5	10	240	ext+fit	200000

.. code:: ipython3 piv = df.pivot("size", "name", "average") piv[:5] .. raw:: html

name	ext	ext+fit	poly	poly+fit
size
1	0.000068	0.000402	0.000238	0.000275
2	0.000066	0.000156	0.000166	0.000213
5	0.000031	0.000427	0.000165	0.000196
10	0.000048	0.000237	0.000134	0.000306
20	0.000070	0.000188	0.000109	0.000153

.. code:: ipython3 ax = piv.plot(logy=True, logx=True) ax.set_title("Polynomial Features for 5 features\ndegree=2") ax.set_ylabel("seconds") ax.set_xlabel("number of observations"); .. image:: faster_polynomial_features_14_0.png The gain is mostly visible for small dimensions. Comparison with 1000 observations --------------------------------- In this experiment, the number of observations is fixed to 1000 but the number of features varies. .. code:: ipython3 poly = PolynomialFeatures(degree=2) ext = ExtendedFeatures(poly_degree=2) # implementation of PolynomialFeatures in 0.20.2 extslow = ExtendedFeatures(poly_degree=2, kind="poly-slow") res = [] for n in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 40, 50]: X = numpy.random.random((1000, n)) poly.fit(X) ext.fit(X) extslow.fit(X) r1 = measure_time("poly.transform(X)", context=dict(X=X, poly=poly), repeat=5, number=30, div_by_number=True) r2 = measure_time("ext.transform(X)", context=dict(X=X, ext=ext), repeat=5, number=30, div_by_number=True) r3 = measure_time("extslow.transform(X)", context=dict(X=X, extslow=extslow), repeat=5, number=30, div_by_number=True) r1["name"] = "poly" r2["name"] = "ext" r3["name"] = "extslow" r1["nfeat"] = n r2["nfeat"] = n r3["nfeat"] = n x1 = poly.transform(X) x2 = ext.transform(X) x3 = extslow.transform(X) r1["numf"] = x1.shape[1] r2["numf"] = x2.shape[1] r3["numf"] = x3.shape[1] res.append(r1) res.append(r2) res.append(r3) import pandas df = pandas.DataFrame(res) df.tail() .. raw:: html

	average	deviation	min_exec	max_exec	repeat	number	context_size	name	nfeat	numf
37	0.009331	0.001603	0.008280	0.012519	5	30	240	ext	40	861
38	0.022619	0.002868	0.018793	0.026324	5	30	240	extslow	40	861
39	0.013188	0.000370	0.012828	0.013888	5	30	240	poly	50	1326
40	0.012817	0.000102	0.012700	0.012951	5	30	240	ext	50	1326
41	0.030384	0.000717	0.029955	0.031813	5	30	240	extslow	50	1326

.. code:: ipython3 piv = df.pivot("nfeat", "name", "average") piv[:5] .. raw:: html

name	ext	extslow	poly
nfeat
1	0.000026	0.000059	0.000152
2	0.000055	0.000100	0.000113
3	0.000161	0.000381	0.000237
4	0.000148	0.000221	0.000219
5	0.000185	0.000340	0.000236

.. code:: ipython3 ax = piv.plot(logy=True, logx=True) ax.set_title("Polynomial Features for 1000 observations\ndegree=2") ax.set_ylabel("seconds") ax.set_xlabel("number of features"); .. image:: faster_polynomial_features_19_0.png It is twice faster. Comparison for different degrees -------------------------------- In this experiment, the number of observations and features is fixed, the degree increases. .. code:: ipython3 res = [] for n in [2, 3, 4, 5, 6, 7, 8]: X = numpy.random.random((1000, 4)) poly = PolynomialFeatures(degree=n) ext = ExtendedFeatures(poly_degree=n) poly.fit(X) ext.fit(X) r1 = measure_time("poly.transform(X)", context=dict(X=X, poly=poly), repeat=5, number=30, div_by_number=True) r2 = measure_time("ext.transform(X)", context=dict(X=X, ext=ext), repeat=5, number=30, div_by_number=True) r1["name"] = "poly" r2["name"] = "ext" r1["degree"] = n r2["degree"] = n x1 = poly.transform(X) x2 = ext.transform(X) r1["numf"] = x1.shape[1] r2["numf"] = x2.shape[1] res.append(r1) res.append(r2) import pandas df = pandas.DataFrame(res) df.tail() .. raw:: html

	average	deviation	min_exec	max_exec	repeat	number	context_size	name	degree	numf
9	0.001960	0.000067	0.001915	0.002094	5	30	240	ext	6	210
10	0.003131	0.000118	0.003009	0.003327	5	30	240	poly	7	330
11	0.003076	0.000233	0.002845	0.003393	5	30	240	ext	7	330
12	0.004299	0.000046	0.004243	0.004367	5	30	240	poly	8	495
13	0.004157	0.000035	0.004114	0.004217	5	30	240	ext	8	495

.. code:: ipython3 piv = df.pivot("degree", "name", "average") piv[:5] .. raw:: html

name	ext	poly
degree
2	0.000140	0.000312
3	0.000304	0.000363
4	0.000506	0.000579
5	0.000715	0.000789
6	0.001960	0.002032

.. code:: ipython3 ax = piv.plot(logy=True, logx=True) ax.set_title("Polynomial Features for 1000 observations\nnumber of features is 4") ax.set_ylabel("seconds") ax.set_xlabel("degree"); .. image:: faster_polynomial_features_24_0.png It is worth transposing. Same experiment with interaction_only=True ------------------------------------------ .. code:: ipython3 res = [] for n in [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000]: poly = PolynomialFeatures(degree=2, interaction_only=True) ext = ExtendedFeatures(poly_degree=2, poly_interaction_only=True) X = numpy.random.random((n, 5)) poly.fit(X) ext.fit(X) r1 = measure_time("poly.transform(X)", context=dict(X=X, poly=poly), repeat=2, number=30, div_by_number=True) r2 = measure_time("ext.transform(X)", context=dict(X=X, ext=ext), repeat=2, number=30, div_by_number=True) r1["name"] = "poly" r2["name"] = "ext" r1["size"] = n r2["size"] = n res.append(r1) res.append(r2) import pandas df = pandas.DataFrame(res) df.tail() .. raw:: html

	average	deviation	min_exec	max_exec	repeat	number	context_size	name	size
29	0.010691	0.000073	0.010618	0.010764	2	30	240	ext	50000
30	0.026612	0.000794	0.025817	0.027406	2	30	240	poly	100000
31	0.025052	0.001583	0.023469	0.026635	2	30	240	ext	100000
32	0.058772	0.001345	0.057427	0.060118	2	30	240	poly	200000
33	0.054771	0.004555	0.050216	0.059327	2	30	240	ext	200000

.. code:: ipython3 piv = df.pivot("size", "name", "average") piv[:5] .. raw:: html

name	ext	poly
size
1	0.000042	0.000086
2	0.000034	0.000104
5	0.000068	0.000089
10	0.000032	0.000092
20	0.000040	0.000103

.. code:: ipython3 ax = piv.plot(logy=True, logx=True) ax.set_title("Polynomial Features for 5 features\ndegree is 2 + interaction_only=True") ax.set_ylabel("seconds") ax.set_xlabel("N obs"); .. image:: faster_polynomial_features_29_0.png Memory profiler --------------- .. code:: ipython3 from memory_profiler import memory_usage poly = PolynomialFeatures(degree=2, interaction_only=True) poly.fit(X) memory_usage((poly.transform, (X,)), interval=0.1, max_usage=True) .. parsed-literal:: 258.02734375 .. code:: ipython3 def pick_value(v): try: return v[0] except TypeError: return v res = [] for n in [10000, 50000, 100000, 200000]: X = numpy.random.random((n, 50)) print(n) poly = PolynomialFeatures(degree=2, interaction_only=True) ext = ExtendedFeatures(poly_degree=2, poly_interaction_only=True) poly.fit(X) ext.fit(X) r1 = memory_usage((poly.transform, (X,)), interval=0.1, max_usage=True) r2 = memory_usage((ext.transform, (X,)), interval=0.1, max_usage=True) r1 = {"memory": pick_value(r1)} r2 = {"memory": pick_value(r2)} r1["name"] = "poly" r2["name"] = "ext" r1["size"] = n r2["size"] = n res.append(r1) res.append(r2) import pandas df = pandas.DataFrame(res) df.tail() .. parsed-literal:: 10000 50000 100000 200000 .. raw:: html

	memory	name	size
3	699.679688	ext	50000
4	1243.664062	poly	100000
5	1205.515625	ext	100000
6	1952.316406	poly	200000
7	2029.765625	ext	200000

.. code:: ipython3 piv = df.pivot("size", "name", "memory") piv[:5] .. raw:: html

name	ext	poly
size
10000	392.445312	396.347656
50000	699.679688	718.839844
100000	1205.515625	1243.664062
200000	2029.765625	1952.316406

.. code:: ipython3 ax = piv.plot(logy=True, logx=True) ax.set_title("Polynomial Features for 50 features\ndegree is 2 - memory") ax.set_ylabel("Mb") ax.set_xlabel("N obs"); .. image:: faster_polynomial_features_34_0.png