module filehelper.pig_helper

Short summary

module pyenbc.filehelper.pig_helper

Hadoop uses a Java implementation of Python: Jython. This module provides helpers around it.

Functions

function

truncated documentation

download_pig_standalone

Downloads the standalone Jython. If it does not exist, version HADOOP_VERSION should be used by default …

get_hadoop_jars

Returns the list of jars to include in the command line in order to run Hadoop.

get_hadoop_path

This function assumes that a folder pig hadoopjar is present in this directory and returns that folder.

get_pig_jars

Returns the list of jars to include in the command line in order to run Pig.

get_pig_path

This function assumes that a folder pig pigjar is present in this directory and returns that folder.

run_pig

Runs a Pig script and returns the standard output and error.

Documentation

Hadoop uses a Java implementation of Python: Jython. This module provides helpers around it.

New in version 1.1.

pyenbc.filehelper.pig_helper.download_pig_standalone(pig_version='0.17.0', hadoop_version='3.3.0', fLOG=<function noLOG>)

Downloads the standalone Jython. If it does not exist, version HADOOP_VERSION should be used by default in order to match the cluster's version.

Parameters:
  • pig_version – Pig version to download

  • hadoop_version – Hadoop version to download

  • fLOG – logging function

Returns:

location

This function might need to be run twice if the first try fails; the failure may be due to very long paths when unzipping the downloaded file.

Hadoop is downloaded from one of the websites referenced by the Apache Software Foundation. Check the source to see which one was chosen.
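The advice to run the function a second time can be automated with a small retry wrapper. The sketch below is not part of pyenbc: call_with_retry is a hypothetical helper, and the commented usage at the end only assumes the signature shown above.

```python
import time


def call_with_retry(func, attempts=2, delay=0.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times and return the first successful result.

    Hypothetical helper written for illustration; it is not part of pyenbc.
    The last failure is re-raised unchanged if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                # No attempt left: let the caller see the original error.
                raise
            if delay > 0:
                time.sleep(delay)


# Assumed usage (requires pyenbc to be installed):
# from pyenbc.filehelper.pig_helper import download_pig_standalone
# location = call_with_retry(download_pig_standalone)
```

Two attempts match the note above. If the second attempt fails as well, the exception is propagated so that a path-length problem is not silently hidden.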

pyenbc.filehelper.pig_helper.get_hadoop_jars()

Returns the list of jars to include in the command line in order to run Hadoop.

Returns:

list of jars
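A list of jars is typically joined into a Java classpath before it is passed to the command line. The sketch below illustrates that idea with two hypothetical helpers, list_jars and build_classpath. Whether pyenbc searches the folder recursively or joins the jars this way is an assumption, not something stated in this documentation.

```python
import glob
import os


def list_jars(folder):
    """Return the sorted absolute paths of all .jar files under folder.

    Hypothetical helper: the search is recursive, which is an assumption
    made for this sketch.
    """
    pattern = os.path.join(os.path.abspath(folder), "**", "*.jar")
    return sorted(glob.glob(pattern, recursive=True))


def build_classpath(jars):
    """Join jar paths with the platform separator (';' on Windows, ':' elsewhere)."""
    return os.pathsep.join(jars)


# Assumed usage with the real function (requires pyenbc):
# from pyenbc.filehelper.pig_helper import get_hadoop_jars
# classpath = build_classpath(get_hadoop_jars())
```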

pyenbc.filehelper.pig_helper.get_hadoop_path()

This function assumes that a folder pig hadoopjar is present in this directory and returns that folder.

Returns:

absolute path

pyenbc.filehelper.pig_helper.get_pig_jars()

Returns the list of jars to include in the command line in order to run Pig.

Returns:

list of jars

pyenbc.filehelper.pig_helper.get_pig_path()

This function assumes that a folder pig pigjar is present in this directory and returns that folder.

Returns:

absolute path
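Both path functions follow the same pattern: assume that a folder exists next to the module and return its absolute path. A minimal sketch of that pattern is given below. find_subfolder is a hypothetical helper, and raising FileNotFoundError when the folder is missing is a choice made for this sketch, not documented behaviour of pyenbc.

```python
import os


def find_subfolder(base_dir, name):
    """Return the absolute path of the folder `name` located inside `base_dir`.

    Hypothetical helper, not part of pyenbc. It fails loudly when the
    expected folder is absent instead of returning a path that does not exist.
    """
    path = os.path.abspath(os.path.join(base_dir, name))
    if not os.path.isdir(path):
        raise FileNotFoundError("expected folder %r was not found" % path)
    return path


# Assumed usage inside a module, where __file__ is defined:
# pig_folder = find_subfolder(os.path.dirname(__file__), "pigjar")
```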

pyenbc.filehelper.pig_helper.run_pig(pigfile, argv=None, pig_path=None, hadoop_path=None, jython_path=None, timeout=None, logpath='logs', pig_version='0.17.0', hadoop_version='3.3.0', jar_no_hadoop=True, fLOG=<function noLOG>)

Runs a Pig script and returns the standard output and error.

Parameters:
  • pigfile – pig file

  • argv – arguments to send to the command line

  • pig_path – path to pig 0.XX.0

  • hadoop_path – path to hadoop

  • jython_path – path to Jython

  • timeout – timeout

  • logpath – path to the logs

  • pig_version – PIG version (if pig_path is not defined)

  • hadoop_version – Hadoop version (if hadoop_path is not defined)

  • jar_no_hadoop – use Pig without Hadoop

  • fLOG – logging function

Returns:

out, err

If pig_path is None, the function looks into this directory.
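Returning the pair out, err after running an external command with an optional timeout is a common pattern. The sketch below shows it with the standard subprocess module. run_command is a hypothetical helper, and the command line that run_pig actually builds (Java, the Pig and Hadoop jars, the log path) is not reproduced here.

```python
import subprocess
import sys


def run_command(cmd, timeout=None):
    """Run cmd (a list of arguments) and return (out, err) as text.

    Hypothetical helper, not part of pyenbc. subprocess.TimeoutExpired is
    raised if the process does not finish within `timeout` seconds.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode the output as text
        timeout=timeout,
        check=False,  # a non-zero exit code is not turned into an exception
    )
    return proc.stdout, proc.stderr


# The Python interpreter stands in for the real java command line here.
out, err = run_command([sys.executable, "-c", "print('hello')"], timeout=60)

# Assumed usage with the real function (requires pyenbc; the script name is hypothetical):
# from pyenbc.filehelper.pig_helper import run_pig
# out, err = run_pig("script.pig", timeout=600)
```

Keeping the two streams separate helps when a job fails, since the diagnostic messages of a Java program are usually written to the standard error.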
