XD blog


pydata, python


2015-07-28 PyData 2015 in Seattle

I attended my first PyData conference, in Seattle, and I must say I learned a lot. I discovered much more than I ever could have by searching the Internet for a library to meet a precise need. It was really worth taking a plane to attend. Most of all, I found people very passionate, constantly looking for improvement. So passionate that I would definitely recommend Python over R as a first choice for a machine learning language. R seems to grow only through the number of available packages, but Python is catching up, and its ecosystem is also expanding through various initiatives to improve plotting or the handling of very big datasets.

I would not be surprised if a language named Rython pops up one day.

There were many talks about Spark, such as this one on sparkling pandas, which aims to offer the methods you would expect from any dataframe but on top of a Resilient Distributed Dataset (RDD). I also came across some code to convert Python into Scala: PythonToScale.

The talk given by Rob Story was quite interesting: he walked through many libraries that handle very big datasets on disk or with lazy evaluation. Big datasets here means several GB. Lazy evaluation usually means working with iterators or functional programming, at which cytoolz is very good. He also covered blaze and odo, bcolz, and dask, which is the only one to propose distributed evaluation, as well as SArray and SFrame from GraphLab. The last one is biggus, but I guess it is too early to say.
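To make the idea of lazy evaluation concrete, here is a minimal sketch using only Python generators and itertools from the standard library. It is not taken from the talk and does not use cytoolz; the function names read_values and lazy_pipeline are made up for the illustration, and the integer range merely stands in for rows read from a large file.

```python
from itertools import islice


def read_values(n):
    """Yield integers one at a time (hypothetical stand-in for rows of a big file)."""
    for i in range(n):
        yield i


def lazy_pipeline(n):
    """Chain generators; nothing is computed until the result is consumed."""
    values = read_values(n)
    squares = (v * v for v in values)           # lazy map
    evens = (s for s in squares if s % 2 == 0)  # lazy filter
    return evens


# Only the items needed for the first five results are ever produced,
# even though the nominal input has 10**12 elements.
first_five = list(islice(lazy_pipeline(10**12), 5))
print(first_five)
```

Because every stage is an iterator, memory use stays constant whatever the size of the input, which is the property that makes this style attractive for datasets that do not fit in memory.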

Deep learning was quite popular: a couple of talks, and many people in the room. theano is the main reference. I discovered deeplearning4j and rediscovered caffe, whose development seems to be more active than in the past. neon might become popular too.

Engineers are still looking for a good way to display results. The Jupyter notebook is of course the first choice and many talks relied on it. But for people who are fond of R Shiny, spyre should help them build a simple web application that looks the same way. bokeh is growing fast and attracts a lot of attention. It is now possible to insert custom JavaScript and callbacks. I'll end this section with some extensions of matplotlib: cubehelix and viscm. And because the notebook has become very popular, many JavaScript extensions are appearing (bearcart).

As for machine learning, I discovered python-recsys, a recommendation system that is easy to train and easier to install (on Windows) than crab. lifelines is a package for survival analysis. The presentation Hack the Derivative (link from pydata) was about computing a limit of a real function with complex numbers and studying the precision of the results.
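I did not keep the code from Hack the Derivative, so the sketch below is my own and rests on an assumption: that the technique is what is usually called complex-step differentiation, where f'(x) is approximated by Im(f(x + ih)) / h. The function names and step sizes are chosen for the illustration only. It compares that estimate with a classic forward difference on sin, whose derivative cos is known exactly.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) as Im(f(x + i*h)) / h (assumed technique, see above)."""
    return f(complex(x, h)).imag / h


def forward_difference(f, x, h=1e-8):
    """Classic real-valued estimate (f(x + h) - f(x)) / h, for comparison."""
    return (f(x + h) - f(x)) / h


x = 1.0
exact = math.cos(x)  # known derivative of sin at x
cs = complex_step_derivative(cmath.sin, x)
fd = forward_difference(math.sin, x)

print("complex step error      :", abs(cs - exact))
print("forward difference error:", abs(fd - exact))
```

The complex-step estimate involves no subtraction of nearly equal numbers, so h can be made tiny without the cancellation that limits the forward difference. The function must accept complex arguments, which is why cmath.sin is used instead of math.sin.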

Finally, some other links: Mood Stochastic (the connection was the package Rborist). An interesting paper: Bayesian Online Changepoint Detection and this associated notebook. An interesting library for JavaScript plotting: rickshaw. Somebody to follow when it comes to dealing with huge datasets: Rob Story. I don't remember how I heard about this one: ibis. Anyway, this is a templating library.



Xavier Dupré