module datainc.data_cresus#

Short summary#

module ensae_projects.datainc.data_cresus

Script to process the data from Cresus for the 2016 hackathon

source on GitHub

Functions#

function | truncated documentation
cresus_dummy_file |
prepare_cresus_data | Prepares the data for the challenge.
process_cresus_sql | Processes the database sent by Cresus and produces a list of flat files.
process_cresus_whole_process | Processes the database from Cresus until it splits the data into two sets of files.
split_train_test_cresus_data | Splits the tables into two sets of tables (based on users).
split_XY_bind_dataset_cresus_data | Splits XY for the blind set.

Documentation#

Script to process the data from Cresus for the 2016 hackathon


ensae_projects.datainc.data_cresus.cresus_dummy_file()#
Returns:

local filename


ensae_projects.datainc.data_cresus.prepare_cresus_data(dbfile, outfold=None, fLOG=<function fLOG>)#

Prepares the data for the challenge.

Parameters:
  • dbfile – database file

  • outfold – output folder

  • fLOG – logging function

Returns:

dictionary of table files
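As an illustration of what "dictionary of table files" can look like, here is a minimal sketch that dumps every table of a SQLite file to a tab-separated text file and returns a mapping from table name to file name. It is not the code of prepare_cresus_data: the helper name export_tables, the .txt extension, the tab separator and the toy table are all assumptions made for the example.

```python
import csv
import os
import sqlite3
import tempfile


def export_tables(dbfile, outfold):
    """Write every table of a SQLite database to a tab-separated file.

    Hypothetical helper: returns a dictionary {table name: file name}.
    """
    os.makedirs(outfold, exist_ok=True)
    files = {}
    con = sqlite3.connect(dbfile)
    try:
        # sqlite_master lists the objects stored in the database.
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        for name in names:
            cur = con.execute('SELECT * FROM "{}"'.format(name))
            header = [col[0] for col in cur.description]
            filename = os.path.join(outfold, name + ".txt")
            with open(filename, "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f, delimiter="\t")
                writer.writerow(header)
                writer.writerows(cur)
            files[name] = filename
    finally:
        con.close()
    return files


# Toy database (assumed content), only there to exercise the sketch.
workdir = tempfile.mkdtemp()
toy_db = os.path.join(workdir, "toy.db3")
con = sqlite3.connect(toy_db)
con.execute("CREATE TABLE dossier (id INTEGER, orientation TEXT)")
con.executemany("INSERT INTO dossier VALUES (?, ?)", [(1, "a"), (2, "b")])
con.commit()
con.close()

table_files = export_tables(toy_db, os.path.join(workdir, "out"))
```

The returned dictionary has one entry per table, which is the shape that split_train_test_cresus_data documents for its tables parameter.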


ensae_projects.datainc.data_cresus.process_cresus_sql(infile, out_clean_sql=None, outdb=None, fLOG=<function fLOG>)#

Processes the database sent by Cresus and produces a list of flat files.

Parameters:
  • infile – dump of a sql database

  • out_clean_sql – filename which contains the cleaned sql

  • outdb – sqlite3 file (removed if it exists)

  • fLOG – logging function

Returns:

dataframe with a list
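The outdb behaviour ("removed if it exists") can be sketched with the standard sqlite3 module. The example below is only an assumption about the general approach: it supposes the dump has already been cleaned into SQL that SQLite accepts, and the helper name load_sql_dump and the toy dump are hypothetical.

```python
import os
import sqlite3
import tempfile


def load_sql_dump(infile, outdb):
    """Load a SQL script into a fresh SQLite file and return its table names.

    Hypothetical helper, not the library's implementation.
    """
    if os.path.exists(outdb):
        os.remove(outdb)  # start from an empty database on every run
    with open(infile, "r", encoding="utf-8") as f:
        script = f.read()
    con = sqlite3.connect(outdb)
    try:
        con.executescript(script)
        con.commit()
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [r[0] for r in rows]
    finally:
        con.close()


# Toy dump (assumed content) written to a temporary folder.
workdir = tempfile.mkdtemp()
dump = os.path.join(workdir, "dump.sql")
with open(dump, "w", encoding="utf-8") as f:
    f.write("CREATE TABLE dossier (id INTEGER, nature TEXT);\n"
            "INSERT INTO dossier VALUES (1, 'x');\n"
            "CREATE TABLE budget (id INTEGER, amount REAL);\n")

db = os.path.join(workdir, "toy.db3")
tables = load_sql_dump(dump, db)
# Running it twice works because the existing file is removed first.
tables_again = load_sql_dump(dump, db)
```

Without the removal step, the second call would fail on CREATE TABLE because the tables would already exist.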


ensae_projects.datainc.data_cresus.process_cresus_whole_process(infile, outfold, ratio=0.2, fLOG=<function fLOG>)#

Processes the database from Cresus until it splits the data into two sets of files.


ensae_projects.datainc.data_cresus.split_XY_bind_dataset_cresus_data(filename, fLOG=<function fLOG>)#

Splits XY for the blind set.

Parameters:
  • filename – table to split

  • fLOG – logging function

Returns:

dictionary of created files

It assumes the targets are the columns orientation and nature.
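A minimal sketch of an X/Y split on those two target columns is shown below. It works on in-memory rows rather than on a file, and the helper name split_xy and the other column names (id, age) are assumptions for the example, not part of the documented function.

```python
def split_xy(rows, targets=("orientation", "nature")):
    """Separate the target columns (Y) from the remaining columns (X).

    Hypothetical helper: rows is a list of dictionaries, one per record.
    """
    x_rows = [{k: v for k, v in row.items() if k not in targets}
              for row in rows]
    y_rows = [{k: row[k] for k in targets} for row in rows]
    return x_rows, y_rows


# Toy records (assumed columns besides the two targets).
rows = [
    {"id": 1, "age": 40, "orientation": "A", "nature": "N1"},
    {"id": 2, "age": 55, "orientation": "B", "nature": "N2"},
]
X, Y = split_xy(rows)
```

Rows keep their order in both outputs, so X and Y can be matched by position.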


ensae_projects.datainc.data_cresus.split_train_test_cresus_data(tables, outfold, ratio=0.2, fLOG=<function fLOG>)#

Splits the tables into two sets of tables (based on users).

Parameters:
  • tables – dictionary of tables, see prepare_cresus_data

  • outfold – if not None, output all tables in this folder

  • fLOG – logging function

Returns:

pair of dictionaries of table files
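Splitting "based on users" means that all rows of a given user land in the same set. The sketch below illustrates the idea under stated assumptions: ratio is taken to be the fraction of users sent to the test set, the helper name split_by_user, the user column name and the seed argument are hypothetical, and the function works on in-memory rows rather than on table files.

```python
import random


def split_by_user(rows, user_key, ratio=0.2, seed=0):
    """Split rows into (train, test) so that no user appears in both.

    Hypothetical helper: ratio is assumed to be the share of users
    assigned to the test set.
    """
    users = sorted({row[user_key] for row in rows})
    rng = random.Random(seed)  # local generator, reproducible split
    rng.shuffle(users)
    n_test = int(round(len(users) * ratio))
    test_users = set(users[:n_test])
    train = [r for r in rows if r[user_key] not in test_users]
    test = [r for r in rows if r[user_key] in test_users]
    return train, test


# Toy table: 10 users with 3 rows each.
rows = [{"user": u, "value": v} for u in range(10) for v in range(3)]
train, test = split_by_user(rows, "user", ratio=0.2)
```

Sampling users rather than rows avoids leaking information about a test user through that user's other rows.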
