{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Python Hadoop Pig\n", "\n", "This notebook aims at showing how to submit a PIG job to remote hadoop cluster (tested with Cloudera). It works better if you know Hadoop otherwise I recommend reading [Map/Reduce avec PIG](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx2/notebooks/td3a_cenonce_session6.html#td3acenoncesession6rst) (French). First, we download data. We are going to upload that data to the remote cluster. The Hadoop distribution tested here is [Cloudera](http://www.cloudera.com/)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/plain": ["'ConfLongDemo_JSI.txt'"]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["import pyensae\n", "%load_ext pyensae\n", "%load_ext pyenbc\n", "pyensae.download_data(\"ConfLongDemo_JSI.txt\", website=\"https://archive.ics.uci.edu/ml/machine-learning-databases/00196/\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We open a SSH connection to the bridge which can communicate to the cluster."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
credentials\n", "
password \n", "
server \n", "
username \n", "
\n", ""], "text/plain": [""]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import pyquickhelper.ipythonhelper as ipy\n", "params={\"server\":\"\", \"username\":\"\", \"password\":\"\"}\n", "ipy.open_html_form(params=params,title=\"credentials\",key_save=\"ssh_remote_hadoop\")"]}, {"cell_type": "code", "execution_count": 3, "metadata": {"collapsed": true}, "outputs": [], "source": ["password = ssh_remote_hadoop[\"password\"]\n", "server = ssh_remote_hadoop[\"server\"]\n", "username = ssh_remote_hadoop[\"username\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We open the SSH connection:"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": [""]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_open"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check the content of the remote machine:"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "total 3404\n", "-rw-rw-r-- 1 xavierdupre xavierdupre    1043 Jul 14 23:40 centrer_reduire.pig\n", "-rw-r--r-- 1 xavierdupre xavierdupre       2 Jul 15 00:22 diff_cluster\n", "-rw-rw-r-- 1 xavierdupre xavierdupre       0 Sep 27 00:21 dummy\n", "-rw-rw-r-- 1 xavierdupre xavierdupre     290 Jul 14 23:48 init_random.pig\n", "-rw-rw-r-- 1 xavierdupre xavierdupre    1654 Jul 15 00:20 iteration_complete.pig\n", "-rw-rw-r-- 1 xavierdupre xavierdupre     235 Jul 14 23:37 nb_obervations.pig\n", "-rw-rw-r-- 1 xavierdupre xavierdupre    1778 Jul 14 23:57 pig_1436911046432.log\n", "-rw-rw-r-- 1 xavierdupre xavierdupre    4570 Jul 15 00:45 pig_1436913856496.log\n", "-rw-rw-r-- 1 xavierdupre xavierdupre    4570 Jul 15 23:52 pig_1436997076356.log\n", "-rw-rw-r-- 1 xavierdupre xavierdupre     574 Jul 15 23:51 post_traitement.pig\n", "-rw-rw-r-- 1 xavierdupre xavierdupre     659 Sep 27 00:21 pystream.pig\n", "-rw-rw-r-- 1 xavierdupre xavierdupre     382 Sep 27 00:21 pystream.py\n", "-rw-rw-r-- 1 xavierdupre xavierdupre   26186 Jul 15 23:52 redirection.err\n", "-rw-rw-r-- 1 xavierdupre xavierdupre       0 Jul 15 23:51 redirection.out\n", "-rw-rw-r-- 1 xavierdupre xavierdupre 3400818 Jul 15 23:48 Skin_NonSkin.txt\n", "\n", "
"], "text/plain": [""]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd ls -l"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
attributescodealiasfoldersizeunitnameisdir
-rw-rw-r--1xavierduprexavierdupre1043Jul1423:40centrer_reduire.pigFalse
-rw-r--r--1xavierduprexavierdupre2Jul1500:22diff_clusterFalse
-rw-rw-r--1xavierduprexavierdupre0Sep2700:21dummyFalse
1xavierduprexavierdupre290Jul1423:48init_random.pigFalse
1xavierduprexavierdupre1654Jul1500:20iteration_complete.pigFalse
1xavierduprexavierdupre235Jul1423:37nb_obervations.pigFalse
1xavierduprexavierdupre1778Jul1423:57pig_1436911046432.logFalse
1xavierduprexavierdupre4570Jul1500:45pig_1436913856496.logFalse
1xavierduprexavierdupre4570Jul1523:52pig_1436997076356.logFalse
1xavierduprexavierdupre574Jul1523:51post_traitement.pigFalse
1xavierduprexavierdupre659Sep2700:21pystream.pigFalse
1xavierduprexavierdupre382Sep2700:21pystream.pyFalse
1xavierduprexavierdupre26186Jul1523:52redirection.errFalse
1xavierduprexavierdupre0Jul1523:51redirection.outFalse
1xavierduprexavierdupre3400818Jul1523:48Skin_NonSkin.txtFalse
\n", "
"], "text/plain": [" attributes code alias folder size unit \\\n", "-rw-rw-r-- 1 xavierdupre xavierdupre 1043 Jul 14 23:40 \n", "-rw-r--r-- 1 xavierdupre xavierdupre 2 Jul 15 00:22 \n", "-rw-rw-r-- 1 xavierdupre xavierdupre 0 Sep 27 00:21 \n", " 1 xavierdupre xavierdupre 290 Jul 14 23:48 \n", " 1 xavierdupre xavierdupre 1654 Jul 15 00:20 \n", " 1 xavierdupre xavierdupre 235 Jul 14 23:37 \n", " 1 xavierdupre xavierdupre 1778 Jul 14 23:57 \n", " 1 xavierdupre xavierdupre 4570 Jul 15 00:45 \n", " 1 xavierdupre xavierdupre 4570 Jul 15 23:52 \n", " 1 xavierdupre xavierdupre 574 Jul 15 23:51 \n", " 1 xavierdupre xavierdupre 659 Sep 27 00:21 \n", " 1 xavierdupre xavierdupre 382 Sep 27 00:21 \n", " 1 xavierdupre xavierdupre 26186 Jul 15 23:52 \n", " 1 xavierdupre xavierdupre 0 Jul 15 23:51 \n", " 1 xavierdupre xavierdupre 3400818 Jul 15 23:48 \n", "\n", " name isdir \n", "-rw-rw-r-- 1 centrer_reduire.pig False \n", "-rw-r--r-- 1 diff_cluster False \n", "-rw-rw-r-- 1 dummy False \n", " 1 init_random.pig False \n", " 1 iteration_complete.pig False \n", " 1 nb_obervations.pig False \n", " 1 pig_1436911046432.log False \n", " 1 pig_1436913856496.log False \n", " 1 pig_1436997076356.log False \n", " 1 post_traitement.pig False \n", " 1 pystream.pig False \n", " 1 pystream.py False \n", " 1 redirection.err False \n", " 1 redirection.out False \n", " 1 Skin_NonSkin.txt False "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_ls ."]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check the content on the cluster:"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "Found 33 items\n", "drwx------   - xavierdupre xavierdupre          0 2015-09-27 02:00 .Trash\n", "drwx------   - xavierdupre xavierdupre          0 2015-09-27 00:22 .staging\n", "-rw-r--r--   3 xavierdupre xavierdupre     132727 2014-11-16 02:37 ConfLongDemo_JSI.small.example.txt\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-16 02:38 ConfLongDemo_JSI.small.example2.walking.txt\n", "-rw-r--r--   3 xavierdupre xavierdupre    3400818 2015-07-14 23:35 Skin_NonSkin.txt\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:22 diff_cluster\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-14 23:44 donnees_normalisees\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-14 23:43 ecartstypes\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-14 23:49 init_random\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-14 23:41 moyennes\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-14 23:38 nb_obervations\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:05 output_iter1\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:22 output_iter10\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:07 output_iter2\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:09 output_iter3\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:11 output_iter4\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:13 output_iter5\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:15 output_iter6\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:17 output_iter7\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:18 output_iter8\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-07-15 00:20 output_iter9\n", "-rw-r--r--   3 xavierdupre xavierdupre     461444 2014-11-20 01:33 paris.2014-11-11_22-00-18.331391.txt\n", "drwxr-xr-x   - xavierdupre 
xavierdupre          0 2014-11-23 22:03 python_info.txt\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-23 22:07 python_info2.txt\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-12-03 22:55 random\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-20 23:43 unitest2\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-09-27 00:23 unittest\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2015-09-27 00:22 unittest2\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-20 01:53 velib_1hjs\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-21 01:17 velib_py\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-23 21:34 velib_py_results\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-23 21:51 velib_py_results_3days\n", "drwxr-xr-x   - xavierdupre xavierdupre          0 2014-11-21 11:08 velib_several_days\n", "\n", "
"], "text/plain": [""]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd hdfs dfs -ls"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
attributescodealiasfoldersizedatetimenameisdir
0drwx-------xavierduprexavierdupre02015-09-2702:00.TrashTrue
1drwx-------xavierduprexavierdupre02015-09-2700:22.stagingTrue
2-rw-r--r--3xavierduprexavierdupre1327272014-11-1602:37ConfLongDemo_JSI.small.example.txtFalse
3drwxr-xr-x-xavierduprexavierdupre02014-11-1602:38ConfLongDemo_JSI.small.example2.walking.txtTrue
4-rw-r--r--3xavierduprexavierdupre34008182015-07-1423:35Skin_NonSkin.txtFalse
5drwxr-xr-x-xavierduprexavierdupre02015-07-1500:22diff_clusterTrue
6drwxr-xr-x-xavierduprexavierdupre02015-07-1423:44donnees_normaliseesTrue
7drwxr-xr-x-xavierduprexavierdupre02015-07-1423:43ecartstypesTrue
8drwxr-xr-x-xavierduprexavierdupre02015-07-1423:49init_randomTrue
9drwxr-xr-x-xavierduprexavierdupre02015-07-1423:41moyennesTrue
10drwxr-xr-x-xavierduprexavierdupre02015-07-1423:38nb_obervationsTrue
11drwxr-xr-x-xavierduprexavierdupre02015-07-1500:05output_iter1True
12drwxr-xr-x-xavierduprexavierdupre02015-07-1500:22output_iter10True
13drwxr-xr-x-xavierduprexavierdupre02015-07-1500:07output_iter2True
14drwxr-xr-x-xavierduprexavierdupre02015-07-1500:09output_iter3True
15drwxr-xr-x-xavierduprexavierdupre02015-07-1500:11output_iter4True
16drwxr-xr-x-xavierduprexavierdupre02015-07-1500:13output_iter5True
17drwxr-xr-x-xavierduprexavierdupre02015-07-1500:15output_iter6True
18drwxr-xr-x-xavierduprexavierdupre02015-07-1500:17output_iter7True
19drwxr-xr-x-xavierduprexavierdupre02015-07-1500:18output_iter8True
20drwxr-xr-x-xavierduprexavierdupre02015-07-1500:20output_iter9True
21-rw-r--r--3xavierduprexavierdupre4614442014-11-2001:33paris.2014-11-11_22-00-18.331391.txtFalse
22drwxr-xr-x-xavierduprexavierdupre02014-11-2322:03python_info.txtTrue
23drwxr-xr-x-xavierduprexavierdupre02014-11-2322:07python_info2.txtTrue
24drwxr-xr-x-xavierduprexavierdupre02014-12-0322:55randomTrue
25drwxr-xr-x-xavierduprexavierdupre02014-11-2023:43unitest2True
26drwxr-xr-x-xavierduprexavierdupre02015-09-2700:23unittestTrue
27drwxr-xr-x-xavierduprexavierdupre02015-09-2700:22unittest2True
28drwxr-xr-x-xavierduprexavierdupre02014-11-2001:53velib_1hjsTrue
29drwxr-xr-x-xavierduprexavierdupre02014-11-2101:17velib_pyTrue
30drwxr-xr-x-xavierduprexavierdupre02014-11-2321:34velib_py_resultsTrue
31drwxr-xr-x-xavierduprexavierdupre02014-11-2321:51velib_py_results_3daysTrue
32drwxr-xr-x-xavierduprexavierdupre02014-11-2111:08velib_several_daysTrue
\n", "
"], "text/plain": [" attributes code alias folder size date time \\\n", "0 drwx------ - xavierdupre xavierdupre 0 2015-09-27 02:00 \n", "1 drwx------ - xavierdupre xavierdupre 0 2015-09-27 00:22 \n", "2 -rw-r--r-- 3 xavierdupre xavierdupre 132727 2014-11-16 02:37 \n", "3 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-16 02:38 \n", "4 -rw-r--r-- 3 xavierdupre xavierdupre 3400818 2015-07-14 23:35 \n", "5 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:22 \n", "6 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-14 23:44 \n", "7 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-14 23:43 \n", "8 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-14 23:49 \n", "9 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-14 23:41 \n", "10 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-14 23:38 \n", "11 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:05 \n", "12 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:22 \n", "13 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:07 \n", "14 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:09 \n", "15 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:11 \n", "16 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:13 \n", "17 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:15 \n", "18 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:17 \n", "19 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:18 \n", "20 drwxr-xr-x - xavierdupre xavierdupre 0 2015-07-15 00:20 \n", "21 -rw-r--r-- 3 xavierdupre xavierdupre 461444 2014-11-20 01:33 \n", "22 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-23 22:03 \n", "23 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-23 22:07 \n", "24 drwxr-xr-x - xavierdupre xavierdupre 0 2014-12-03 22:55 \n", "25 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-20 23:43 \n", "26 drwxr-xr-x - xavierdupre xavierdupre 0 2015-09-27 00:23 \n", "27 drwxr-xr-x - xavierdupre xavierdupre 0 2015-09-27 00:22 \n", "28 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-20 01:53 \n", "29 drwxr-xr-x - xavierdupre 
xavierdupre 0 2014-11-21 01:17 \n", "30 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-23 21:34 \n", "31 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-23 21:51 \n", "32 drwxr-xr-x - xavierdupre xavierdupre 0 2014-11-21 11:08 \n", "\n", " name isdir \n", "0 .Trash True \n", "1 .staging True \n", "2 ConfLongDemo_JSI.small.example.txt False \n", "3 ConfLongDemo_JSI.small.example2.walking.txt True \n", "4 Skin_NonSkin.txt False \n", "5 diff_cluster True \n", "6 donnees_normalisees True \n", "7 ecartstypes True \n", "8 init_random True \n", "9 moyennes True \n", "10 nb_obervations True \n", "11 output_iter1 True \n", "12 output_iter10 True \n", "13 output_iter2 True \n", "14 output_iter3 True \n", "15 output_iter4 True \n", "16 output_iter5 True \n", "17 output_iter6 True \n", "18 output_iter7 True \n", "19 output_iter8 True \n", "20 output_iter9 True \n", "21 paris.2014-11-11_22-00-18.331391.txt False \n", "22 python_info.txt True \n", "23 python_info2.txt True \n", "24 random True \n", "25 unitest2 True \n", "26 unittest True \n", "27 unittest2 True \n", "28 velib_1hjs True \n", "29 velib_py True \n", "30 velib_py_results True \n", "31 velib_py_results_3days True \n", "32 velib_several_days True "]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["%dfs_ls ."]}, {"cell_type": "markdown", "metadata": {}, "source": ["We upload the file to the bridge (zipping it first would reduce the upload time)."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["'ConfLongDemo_JSI.txt'"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_up ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check it got there:"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "ConfLongDemo_JSI.txt\n", "\n", "
"], "text/plain": [""]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd ls Conf*JSI.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We put it on the cluster:"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "
"], "text/plain": [""]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd hdfs dfs -put ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check it was put on the cluster:"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "Found 1 items\n", "-rw-r--r--   3 xavierdupre xavierdupre   21546346 2015-09-27 11:33 ConfLongDemo_JSI.txt\n", "\n", "
"], "text/plain": [""]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd hdfs dfs -ls Conf*JSI.txt"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
attributescodealiasfoldersizedatetimenameisdir
0-rw-r--r--3xavierduprexavierdupre215463462015-09-2711:33ConfLongDemo_JSI.txtFalse
\n", "
"], "text/plain": [" attributes code alias folder size date time \\\n", "0 -rw-r--r-- 3 xavierdupre xavierdupre 21546346 2015-09-27 11:33 \n", "\n", " name isdir \n", "0 ConfLongDemo_JSI.txt False "]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["dfs_ls Conf*JSI.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We create a simple PIG program:"]}, {"cell_type": "code", "execution_count": 14, "metadata": {"collapsed": true}, "outputs": [], "source": ["%%PIG filter_example.pig\n", "\n", "myinput = LOAD 'ConfLongDemo_JSI.txt' USING PigStorage(',') AS\n", " (index:long, sequence, tag, timestamp:long, dateformat, x:double,y:double, z:double, activity) ;\n", "filt = FILTER myinput BY activity == 'walking' ;\n", "STORE filt INTO 'ConfLongDemo_JSI.walking.txt' USING PigStorage() ;"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "
"], "text/plain": [""]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["%pig_submit filter_example.pig -r=filter_example.redirect"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check the redirected files were created:"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "filter_example.redirect.err\n", "filter_example.redirect.out\n", "\n", "
"], "text/plain": [""]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd ls f*redirect*"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check the tail on a regular basis to see the job running (some other commands can be used to monitor jobs, ``%remote_cmd mapred --help``)."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "Spillable Memory Manager spill count : 0\n", "Total bags proactively spilled: 0\n", "Total records proactively spilled: 0\n", "\n", "Job DAG:\n", "job_1435583503337_0055\n", "\n", "\n", "2015-09-27 11:38:56,436 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 164860 time(s).\n", "2015-09-27 11:38:56,436 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!\n", "\n", "
"], "text/plain": [""]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd tail filter_example.redirect.err"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "Found 2 items\n", "-rw-r--r--   3 xavierdupre xavierdupre          0 2015-09-27 11:38 ConfLongDemo_JSI.walking.txt/_SUCCESS\n", "-rw-r--r--   3 xavierdupre xavierdupre          0 2015-09-27 11:38 ConfLongDemo_JSI.walking.txt/part-m-00000\n", "\n", "
"], "text/plain": [""]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd hdfs dfs -ls Conf*JSI.walking.txt"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
attributescodealiasfoldersizedatetimenameisdir
0-rw-r--r--3xavierduprexavierdupre02015-09-2711:38ConfLongDemo_JSI.walking.txt/_SUCCESSFalse
1-rw-r--r--3xavierduprexavierdupre02015-09-2711:38ConfLongDemo_JSI.walking.txt/part-m-00000False
\n", "
"], "text/plain": [" attributes code alias folder size date time \\\n", "0 -rw-r--r-- 3 xavierdupre xavierdupre 0 2015-09-27 11:38 \n", "1 -rw-r--r-- 3 xavierdupre xavierdupre 0 2015-09-27 11:38 \n", "\n", " name isdir \n", "0 ConfLongDemo_JSI.walking.txt/_SUCCESS False \n", "1 ConfLongDemo_JSI.walking.txt/part-m-00000 False "]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["%dfs_ls Conf*JSI.walking.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["After that, the stream has to downloaded to the bridge and then to the local machine with ``%remote_down``. We finally close the connection."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["True"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_close"]}, {"cell_type": "markdown", "metadata": {}, "source": ["**END**"]}, {"cell_type": "code", "execution_count": 21, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}