2019-03-25 Nogil, numpy, cython#

I had to implement a custom criterion to optimize a decision tree and I wanted to leverage scikit-learn instead of rewriting my own. Version 0.21 of scikit-learn introduced some changed in the API which make possible to overload an existing criterion and replace some of the logic by another one: _criterion.pyx. The purpose was to show that a fast implementation requires some tricks (see Custom DecisionTreeRegressor adapted to a linear regression) and piecewise_tree_regression_criterion.pyx, piecewise_tree_regression_criterion_fast.pyx for the code. Other than that, every function to overlaod is marked as nogil. Every function or method marked as nogil cannot go through the GIL (see also PEP-0311), which no python object can be created in that method. In fact, no python can be called inside a Cython method protected with nogil. The issue with that is that any numpy method cannot be called.

My goal was to replace the implemention of the decision tree criterion by something optimizing a linear regression, basically something close to function numpy.linalg.lstsq but that’s inside numpy so unavailable in a nogil method. I needed to use the inner API from BLAS or LAPACK and available in cython through cython_blas (matrix operations) cython_lapack (complex matrix operations). It is fast but not really well documented… I needed to use function dgelss (same from scipy scipy.linalg.lapack.dgelss). which documentation is available through Lapack documentation. Its signature can be found at cython_lapack_signatures.txt and is the following:

cdef void dgelss(int *m, int *n, int *nrhs, double *a, int *lda, double *b, int *ldb,
                 double *s, double *rcond, int *rank,
                 double *work, int *lwork, int *info) nogil

I tried and it failed many times before getting it correctly. Most of the time, python just crashes without telling me what the issue is. I had to use many printf in the cython code to get it right (remember no python call, so no print function). These function do not do any allocation, every needed buffer must be allocated first. The documentation gives some recommendations about the optimal buffer size. The function usually modifies the inputs, they must be copied first if the user wants to reuse them later. I finally implemented dgelss or on github: direct_blas_lapack.pyx.