XD blog

2015-02

2015-02-28 Automated build on Travis for a Python module

Many Python modules display a small logo which indicates the build status. I set up the same for the module pyquickhelper, which is hosted on github/pyquickhelper. Travis installs the required packages before building the module, so the first step is to gather all the dependencies:

pip freeze > requirements.txt

I replaced == with >= and removed some of the entries, which gave:

Cython>=0.20.2
Flask>=0.10.1
Flask-SQLAlchemy>=2.0
Jinja2>=2.7.3
Markdown>=2.4.1
...
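
The replacement of == by >= can also be scripted. The snippet below is only a minimal sketch: the function name relax_pins is hypothetical, and it assumes every useful line has the form name==version, as produced by pip freeze.

```python
def relax_pins(lines):
    """Turn exact pins (name==version) into minimum versions (name>=version)."""
    relaxed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # replace only the first '==' found on the line
        relaxed.append(line.replace("==", ">=", 1))
    return relaxed


pinned = ["Cython==0.20.2", "Flask==0.10.1", "", "# a comment"]
print(relax_pins(pinned))
```

Removing the packages that the build does not need still has to be done by hand.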


2015-02-26 Use scikit-learn with your own model

scikit-learn has a very simple API, and it is quite easy to use its features with your own model. The model just needs to be embedded into a class which implements the methods fit, predict, decision_function and score. I wrote a simple model (kNN) which follows those guidelines: SkCustomKnn. One last method is needed for cross-validation, which has to clone the model. Cloning simply calls the constructor with the proper parameters, so it needs a copy of them. That is the purpose of the method get_params. You are all set.
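
To make the list of methods concrete, here is a minimal sketch in pure Python. It is not SkCustomKnn: the class TinyKnn and the helper clone_model are hypothetical names, decision_function is left out for brevity, and the code does not import scikit-learn, it only mimics the interface described above.

```python
from collections import Counter


class TinyKnn:
    """Hypothetical k-nearest-neighbours classifier with a scikit-learn-like interface."""

    def __init__(self, k=1):
        # the constructor only stores parameters, which makes cloning cheap
        self.k = k

    def get_params(self, deep=True):
        # parameters needed to rebuild an unfitted copy of the model
        return {"k": self.k}

    def fit(self, X, y):
        # kNN simply memorises the training set
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def _predict_one(self, x):
        # squared Euclidean distance to every training point
        dists = [(sum((a - b) ** 2 for a, b in zip(x, row)), i)
                 for i, row in enumerate(self.X_)]
        nearest = sorted(dists)[:self.k]
        votes = Counter(self.y_[i] for _, i in nearest)
        return votes.most_common(1)[0][0]

    def predict(self, X):
        return [self._predict_one(x) for x in X]

    def score(self, X, y):
        # accuracy: share of correct predictions
        pred = self.predict(X)
        return sum(p == t for p, t in zip(pred, y)) / len(y)


def clone_model(model):
    # what cloning amounts to: call the constructor with the model's parameters
    return type(model)(**model.get_params())


X = [[0.0], [1.0], [10.0], [11.0]]
y = [0, 0, 1, 1]
model = TinyKnn(k=3).fit(X, y)
print(model.predict([[0.5], [10.5]]))
fresh = clone_model(model)  # same k, but not fitted
```

The clone keeps k but none of the training data, which is what cross-validation needs to train a new model on each fold.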

2015-02-21 Python distribution for Windows

The WinPython distribution now offers Python 3.4 as well as customized versions (or flavors). One of them uses Kivy. Another one is particularly interesting for a data scientist because it includes R. You can then easily switch from Python to R within the same notebook without any extra installation step, which is easy to check with a preinstalled notebook. Since the MinGW compiler is part of the distribution, cython is no longer a problem.

With this latest version, choosing between WinPython and Anaconda becomes difficult on Windows. The one drawback: installing the module paramiko is very simple with Anaconda (with conda install) but turns out to be complicated with WinPython. So, if you need encrypted access to web resources, Anaconda probably remains the safer choice.

2015-02-16 Delay evaluation

The following class is meant to be a kind of repository for many tables. Its main issue is that it loads everything first. That takes time and might not be necessary if not all the tables are required.

import pandas

class DataContainer:
    def __init__(self, big_tables):
        # big_tables: list of tables already loaded in memory
        self.big_tables = big_tables

    def __getitem__(self, i):
        return self.big_tables[i]

filenames = ["file1.txt", "files2.txt"]

def load(filename):
    return pandas.read_csv(filename, sep="\t")

# every file is read here, even if it is never used
container = DataContainer([load(f) for f in filenames])

So the goal is to load the data only when it is required. But I would like to avoid changing the interface of the class, and the logic that loads the data should stay outside the container. However, I want any access to the container to trigger the loading of the data. So instead of giving the class DataContainer the data itself, I give it functions able to load the data.

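def memoize(f):
    # the cache is kept on the instance, so two containers never share entries
    def helper(self, x):
        memo = self.__dict__.setdefault("_memo", {})
        if x not in memo:
            memo[x] = f(self, x)
        return memo[x]
    return helper

class DataContainerDelayed:
    def __init__(self, big_tables):
        # big_tables: list of functions, each one loads a table when called
        self.big_tables = big_tables

    @memoize
    def __getitem__(self, i):
        return self.big_tables[i]()

# t=f binds the current filename to each lambda
container = DataContainerDelayed([lambda t=f: load(t) for f in filenames])
for i in range(0, 2):
    print(container[i])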

I also want to load each table only once, so I used a memoize mechanism.
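
The behaviour is easy to check without any file. The following sketch is an illustration only: LazyTables and fake_load are hypothetical names, the cache is a plain dictionary instead of a decorator, and a small dictionary stands in for a DataFrame.

```python
class LazyTables:
    """Hypothetical container: stores loaders and calls each one at most once."""

    def __init__(self, loaders):
        self._loaders = loaders
        self._cache = {}  # index -> loaded table

    def __getitem__(self, i):
        if i not in self._cache:
            # the first access triggers the load
            self._cache[i] = self._loaders[i]()
        return self._cache[i]


calls = []  # records which fake files were actually read


def fake_load(name):
    calls.append(name)
    return {"name": name, "rows": [1, 2, 3]}  # stand-in for a DataFrame


names = ["a.txt", "b.txt"]
tables = LazyTables([lambda n=name: fake_load(n) for name in names])

print(calls)       # nothing has been loaded yet
first = tables[0]
again = tables[0]  # served from the cache, fake_load is not called again
print(calls)
```

Building the container reads nothing, and asking twice for the same table calls the loader a single time.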

2015-02-09 Playing Space Invaders with lines of code

If you do not believe me, have a look here: codingame. It is not really an arcade game: the point is to implement a strategy that lets you solve a game without a joystick. Have a look at the blog.

2015-02-06 A few tips about PIG

PIG needs to know the exact number of columns. Suppose you have four columns:

c1  c2  c3  c4
1   2   3   4
6   7   8   9
...

PIG will not complain if you write this:

A = LOAD '$CONTAINER/$PSEUDO/fichiers/ExportHDInsightutf8_noheader.txt'
          USING PigStorage('\t') 
           AS (c1:chararray,c2:chararray,c3:chararray) ;

The last column will inevitably end up merged with another one, which will cause an error later in the execution of the script. Another error that is easy to cause by accident:

2015-02-05 23:40:54,964 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - 
ERROR 0: Exception while executing [POUserFunc 
(Name: POUserFunc(org.apache.pig.builtin.YearsBetween)[long] - 
scope-380 Operator Key: scope-380) children: null at []]: 
java.lang.IllegalArgumentException: ReadableInstant objects must not be null

This error appears, for example, when converting a string to the Date format. Yet we know that the values in this column are never null. So why? The file imported on Hadoop actually comes from a text file saved with pandas, and its first line contains the column names. On Hadoop, however, column names are never stored in a file, because there is no concept of a first line. A big file is stored on several machines and in several blocks. Each block has a first line, but the set of blocks as a whole does not really have one, and the blocks are not even ordered. The header is therefore read as an ordinary row, and a column name cannot be converted to a date. So you never insert the column names into a file on Hadoop.
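
One way to avoid the problem is to drop the header before uploading the file. The sketch below is only an illustration: drop_header is a hypothetical helper, and in-memory streams stand in for the local file and the file to upload. With pandas, passing header=False to DataFrame.to_csv should give the same result when the file is saved.

```python
import io


def drop_header(src, dst):
    """Copy a text stream to another one, skipping its first line."""
    next(src, None)  # discard the header line, if there is one
    for line in src:
        dst.write(line)


# in-memory stand-ins for the local file and the file to upload
source = io.StringIO("c1\tc2\tc3\tc4\n1\t2\t3\t4\n6\t7\t8\t9\n")
target = io.StringIO()
drop_header(source, target)
print(target.getvalue())
```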

2015-02-05 Run an IPython notebook offline

I use notebooks intensively for my teaching, and I recently noticed that some of them fail because I updated a module or changed something in my Python installation. So I looked for a way to run my notebooks in batch mode. I found runipy, which runs a notebook and catches the exceptions it raises. After a couple of tries, I decided to modify the code to get more information when a notebook fails. I ended up with a function run_notebook:

from pyquickhelper.ipythonhelper.notebook_helper import run_notebook
output = run_notebook(notebook_filename, 
             working_dir=folder, 
             outfilename=outfile)

I think it is going to save me some time from one year to the next.

Xavier Dupré