XD blog

2017-08

2017-08-28 Continuous integration

Maintaining notebooks is a lot of work. I consider a notebook OK if it runs with the latest versions of the packages it uses and if it can be converted into HTML to be included on a website. That is easy with one notebook, but what if you have more than 200 of them to verify for your teaching? I had to automate. I started by running a virtual machine on Azure with Jenkins and everything I needed. This machine is still up and running and tests all my Python packages once a week. But my teaching materials are also open source, so I decided to use continuous integration to test my packages on other distributions. I first used travis and appveyor, but I could not include the compilation of the notebooks into the documentation: it requires a couple of huge dependencies such as LaTeX. appveyor is quite slow and stops any job after one hour. travis was using Ubuntu 12.04 when I started and uses 14.04 now. I recently tried CircleCI 2.0. The design is really nice, it offers Ubuntu 16.04, and the configuration file config.yml is much easier to read and write. One interesting feature is artifacts: users can easily copy whatever they want to keep from the build into a specific folder and make it available for download. The other interesting feature is caching. CircleCI automatically caches a Docker image which contains the dependencies of the package to test. The first run is slow and the following ones are faster thanks to this option, and renewing the cache can easily be done by changing a file. So I decided it was worth spending some time enabling CircleCI everywhere. I describe here all the steps to create a Python package with unit tests and documentation and to build it on CircleCI: Tests unitaires, setup et ingéniérie logiciel (French).
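
The check described above (a notebook is fine if it runs and converts to HTML) can be sketched in a few lines of Python. This is only a minimal sketch, not the setup actually used for these packages: it assumes Jupyter and nbconvert are installed and on the PATH, and the helper names nbconvert_command and check_notebooks are hypothetical.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook, output_dir):
    """Build the command line that executes a notebook and converts it to HTML."""
    return [
        "jupyter", "nbconvert",
        "--to", "html",
        "--execute",                      # run all cells before converting
        "--output-dir", str(output_dir),  # where the HTML file is written
        str(notebook),
    ]


def check_notebooks(folder, output_dir):
    """Execute every notebook below `folder` and return the ones that fail.

    Each failure is a pair (path, end of the error output).
    """
    failures = []
    for notebook in sorted(Path(folder).glob("**/*.ipynb")):
        # Skip the autosave copies Jupyter keeps next to the notebooks.
        if ".ipynb_checkpoints" in notebook.parts:
            continue
        result = subprocess.run(nbconvert_command(notebook, output_dir),
                                capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((notebook, result.stderr[-500:]))
    return failures
```

A CI job could call check_notebooks("notebooks", "build/html") (both folder names are placeholders) and fail the build when the returned list is not empty.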

2017-08-25 Read paper, read git issues too!

Sometimes I go for a walk through scikit-learn/issues, on radio GitHub. I listened to this short news item, Random Forest Imputation, and that is how I discovered the package fancyimpute, which fills missing values in many fancy ways, then knnimpute, and downhill, which implements a couple of gradient descent algorithms with theano. A little bit later: Thompson sampling with the online bootstrap.

2017-08-24 Remove big files from git history

Git repositories always get bigger. I noticed that one of my GitHub repositories was above 500 MB. I was wondering how I could make it smaller. First, let's see the size.

git count-objects -v

count: 0
size: 0
in-pack: 19644
packs: 1
size-pack: 222397
prune-packable: 0
garbage: 0
size-garbage: 0

The size is size-pack (reported in KiB). To clean up, the first option is to rebase the repository, which basically means discarding the whole history and committing the current state of the content. One solution is to keep only the latest commits (see Reduce repository size).

git log -n N
git reset --hard HEAD~N
git push --force
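
To compare the size before and after a cleanup, the size-pack value printed by git count-objects -v can be read programmatically. The sketch below is only an illustration: the helper names are hypothetical, the sample is the output shown earlier in this post, and repository_stats assumes git is available on the PATH.

```python
import subprocess


def parse_count_objects(text):
    """Turn the 'key: value' lines of `git count-objects -v` into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value.strip())
    return stats


def pack_size_mib(stats):
    """size-pack is reported in KiB, so divide by 1024 to get MiB."""
    return stats["size-pack"] / 1024


def repository_stats(path="."):
    """Run git in `path` and parse its output (requires git on the PATH)."""
    out = subprocess.run(["git", "count-objects", "-v"], cwd=path,
                         capture_output=True, text=True, check=True).stdout
    return parse_count_objects(out)


# Output copied from the example above.
sample = """count: 0
size: 0
in-pack: 19644
packs: 1
size-pack: 222397
prune-packable: 0
garbage: 0
size-garbage: 0"""

stats = parse_count_objects(sample)
print(f"{pack_size_mib(stats):.1f} MiB")  # 217.2 MiB
```

Calling repository_stats() before and after rewriting the history gives two numbers to compare.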


2017-08-17 Digital inflation

Every year, I receive many programming projects made by students, more than a hundred in the year that just ended. I keep the projects mainly because students ask me for recommendation letters. Every year, the pile grows.

2017-08-03 PyData Seattle

The videos of the talks at PyData Seattle are available: PyData Seattle 2017. A few to watch for Python beginners:

Keynote Jake VanderPlas (the data scientist's tools)
Stephanie Kim - How to be a 10x Data Scientist
Tom Radcliffe - Robust Algorithms for Machine Learning
Quentin Caudron - Introduction to data analytics with pandas

In the end, data science is mostly about searching for the right graph, the one that will show what had to be seen in the data, or about searching for the tool that will help find the right angle: Jeffrey Heer - Interactive Data Analysis: Visualization and Beyond. For experts: Chris Fregly - High Performance Distributed Tensorflow, Jeff Fischer - Python and IoT: From Chips and Bits to Data Science, Stephen Hoover - Scaling Scikit Learn.

For Spark users, this video might be of interest: Raj Singh - PixieDust make Jupyter Notebooks with Apache Spark Faster, Flexible, and Easier to use (see also PixieDust).

If you have an old system to upgrade: Matt Braymer-Hayes, Erin Haswell - Upgrading Legacy Projects: Lessons Learned.

2017-08-01 The August 1st outage

Like many travelers, I am among the people affected by the outage that prevents SNCF from running its trains to and from the Montparnasse station. The message I received is quite simple and final: "We inform you that, due to a signalling failure, your train will not run. We invite you, as far as possible, to postpone your trip." So I bought another ticket for the same day, hoping that the website did not let me buy an invalid one. Outages happen to everyone and I would not want to blame a public company which, as far as I am concerned, meets my needs. Nevertheless, I would have appreciated it if the email had offered a few alternative options. I am very happy to learn that fifty technicians are inspecting the Vanves-Malakoff control center (according to LeMonde), but I do not know how many are working on replacement routes.


Xavier Dupré