"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["%pig_submit solution_groupby_join.pig -r groupby.join.redirection"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "2015-10-29 01:15:15,416 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address\n", "2015-10-29 01:15:15,416 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS\n", "2015-10-29 01:15:15,416 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://nameservice1\n", "2015-10-29 01:15:17,285 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY\n", "2015-10-29 01:15:17,348 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}\n", "2015-10-29 01:15:17,404 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator\n", "2015-10-29 01:15:17,426 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: \n", " Output Location Validation Failed for: 'hdfs://nameservice1/user/xavierdupre/ConfLongDemo_JSI.small.group.join.txt More info to follow:\n", "Output directory hdfs://nameservice1/user/xavierdupre/ConfLongDemo_JSI.small.group.join.txt already exists\n", "Details at logfile: /home/xavierdupre/pig_1446077714461.log\n", "\n", "
"], "text/plain": [""]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd tail groupby.join.redirection.err"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "Found 2 items\n", "-rw-r--r-- 3 xavierdupre xavierdupre 0 2015-10-29 01:13 ConfLongDemo_JSI.small.group.join.txt/_SUCCESS\n", "-rw-r--r-- 3 xavierdupre xavierdupre 144059 2015-10-29 01:13 ConfLongDemo_JSI.small.group.join.txt/part-r-00000\n", "\n", "
"], "text/plain": [""]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd hdfs dfs -ls ConfLongDemo_JSI.small.group.join.txt"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "26262834000\t27.05.2009 14:03:46:283\t3.3038318157196045\t1.938292145729065\t0.7622964978218079\tstanding up from sitting\tstanding up from sitting\t42\n", "652\tA01\t020-000-033-111\t633790226262563704\t27.05.2009 14:03:46:257\t3.2363295555114746\t2.00623106956482\t1.1472841501235962\tstanding up from sitting\tstanding up from sitting\t42\n", "651\tA01\t010-000-030-096\t633790226262293413\t27.05.2009 14:03:46:230\t3.275949239730835\t1.7746492624282837\t0.3117055296897888\tstanding up from sitting\tstanding up from sitting\t42\n", "650\tA01\t010-000-024-033\t633790226262023117\t27.05.2009 14:03:46:203\t3.2498104572296143\t1.878917098045349\t0.13854867219924927\tstanding up from sitting\tstanding up from sitting\t42\n", "649\tA01\t020-000-032-221\t633790226261752823\t27.05.2009 14:03:46:177\t3.352446317672729\t1.950886845588684\t0.8281049728393555\tstanding up from sitting\tstanding up from sitting\t42\n", "648\tA01\t020-000-033-111\t633790226261482530\t27.05.2009 14:03:46:147\t3.2220029830932617\t2.0042579174041752\t1.032345414161682\tstanding up from sitting\tstanding up from sitting\t42\n", "\n", "
"], "text/plain": [""]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["%remote_cmd hdfs dfs -tail ConfLongDemo_JSI.small.group.join.txt/part-r-00000"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Prolongements
\n", "\n", "[PIG](http://pig.apache.org/) is not the only way to run Map/Reduce jobs. [Hive](https://hive.apache.org/) is a language whose syntax is very close to SQL. The article [Comparing Pig Latin and SQL for Constructing Data Processing Pipelines](https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html) spells out the differences between the two approaches.\n", "\n", "**high-level language**\n", "\n", "The key point is that PIG is a high-level language. The program is compiled into a sequence of Map/Reduce operations that is transparent to the user. Development time is much shorter than for the same program written in Java. The compiler builds an execution plan ([a few examples here](http://chimera.labs.oreilly.com/books/1234000001811/ch07.html#explain)) and infers the number of machines required to distribute the job. This is sufficient for most needs.\n", "\n", "**small datasets**\n", "\n", "Some jobs can run for hours, so it is advisable to try them on small datasets before running them on the real data. It is always frustrating to discover that a job crashed after two hours because a string was empty and that case had not been handled.\n", "\n", "With these small datasets, it is possible, and recommended, to test the job on the gateway first ([local execution](http://archive.cloudera.com/cdh/3/pig/tutorial.html#Running+the+Pig+Scripts+in+Local+Mode)) before launching it on the cluster. 
With pyensae, add the ``-local`` option to the [pig_submit](http://www.xavierdupre.fr/app/pyensae/helpsphinx/pyensae/remote/magic_remote_ssh.html?highlight=pig_submit#pyensae.remote.magic_remote_ssh.MagicRemoteSSH.pig_submit) command.\n", "\n", "**concatenating the split files**\n", "\n", "A PIG program does not produce a single file but several files in a directory. The [getmerge](http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/FileSystemShell.html) command downloads these files to the gateway and merges them into one.\n", "\n", "**row order**\n", "\n", "Jobs are distributed: even a job that does nothing (LOAD + STORE) is not guaranteed to preserve the order of the rows. The probability that the order is preserved is close to zero."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}