module sql.file_text_binary_columns


Short summary

module pyensae.sql.file_text_binary_columns

Contains a class which iterates over the rows of a text file structured as a table.

source on GitHub

Classes

  • TextFileColumns – This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted …

Static Methods

  • _store – Stores a list of dictionaries into a file (adds a header).

  • fusion – Merges several files with the same columns (a different column order is allowed).

Methods

  • __init__

  • __iter__

  • __str__ – Returns the header.

  • close – Closes the file and removes all information related to the format; the next time it is opened, the format will be checked …

  • get_columns

  • open – Opens the file and finds out whether there is a header, what the columns are, and what their types are.

  • sort – Sorts a text file, even a big one; one or several columns give the order.

Documentation


class pyensae.sql.file_text_binary_columns.TextFileColumns(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)

Bases: TextFile

This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted as a TSV file or, more generally, as a file containing columns. The separator is found automatically. The column names are assumed to be in the first line, but this is not mandatory. The class walks through the file with an iterator, and every line is automatically converted into a dictionary { column : value }. If the class was able to guess the type of each column, the values are converted automatically.

from pyensae.sql.file_text_binary_columns import TextFileColumns

# filename is the path to a text file; the separator is unknown,
# so the class determines it automatically, as well as the columns and their types
f = TextFileColumns(filename)
f.open()
for d in f:
    print(d)  # d is a dictionary { column: value }
f.close()
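As a rough illustration of what the loop above relies on, the following sketch guesses a separator from a few sample lines and turns each line into a dictionary. It uses only the standard library and is not the implementation used by pyensae: the names guess_separator and iter_rows, the list of candidate separators, and the rule "every sampled line has the same number of fields" are all assumptions made for this example.

```python
def guess_separator(lines, candidates=("\t", ";", ",", "|")):
    """Return the candidate that splits every sampled line into the same
    number of fields (more than one), preferring the largest field count."""
    best, best_count = None, 1
    for sep in candidates:
        counts = {len(line.split(sep)) for line in lines}
        if len(counts) == 1:
            n = next(iter(counts))
            if n > best_count:
                best, best_count = sep, n
    if best is None:
        raise ValueError("no consistent separator found")
    return best


def iter_rows(text, nb_line_guess=100):
    """Yield one dictionary {column: value} per data line.

    Assumption: the first non-empty line is the header.
    """
    lines = [line for line in text.splitlines() if line]
    sep = guess_separator(lines[:nb_line_guess])
    header = lines[0].split(sep)
    for line in lines[1:]:
        yield dict(zip(header, line.split(sep)))


sample = "name\tage\nalice\t30\nbob\t25\n"
rows = list(iter_rows(sample))
```

Values stay as text in this sketch; type conversion is a separate step.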

Attributes:
  • _force_header – there is a header even if none is detected

  • _force_noheader – there is no header even if one is detected

  • _changes – replaces the column names

  • _regexfix – imposes a regular expression to interpret a line instead of the automatically built one

  • _filter_dict – a function which takes a dictionary and returns a boolean telling whether the line must be considered

  • _fields – names of the columns (if there is no header)

Spaces and non-ASCII characters cannot be used in a column name: each name must be a valid named group in a regular expression.
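The constraint on column names and the role of a filter function can be illustrated with the standard re module. This is a minimal sketch, not the expression the class actually builds: the column names, the tab separator, and the helper names parse and keep_frequent are hypothetical.

```python
import re

columns = ["query", "count"]  # hypothetical column names

# One named group per column, separated by tabs (assumed separator).
pattern = re.compile("\t".join("(?P<%s>[^\t]*)" % name for name in columns))


def parse(line):
    """Return {column: value} for a line, or None if it does not match."""
    match = pattern.fullmatch(line)
    return match.groupdict() if match else None


def keep_frequent(row):
    """Hypothetical filter: keep rows whose count is greater than 1."""
    return int(row["count"]) > 1


lines = ["hello world\t3", "rare\t1"]
rows = [row for row in map(parse, lines)
        if row is not None and keep_frequent(row)]

# A space in a column name makes the named group, hence the pattern, invalid.
try:
    re.compile("(?P<my column>.*)")
    space_allowed = True
except re.error:
    space_allowed = False
```

A function with the same shape as keep_frequent (dictionary in, boolean out) is what the filter parameter expects.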


Parameters:
  • filename – filename

  • errors – see str (errors = …)

  • fLOG – LOG function, see fLOG

  • force_header – treats the first line as the column header, whether or not it looks like one

  • changes – renames columns; gives the correspondence, for example {"query": "query___"}. It can be a list if there is no header and you want to name the columns

  • force_noheader – there is no header at all

  • regex – specifies a different regular expression (only if changes is a list); if it is a dictionary, the class replaces the default expression with the one associated with each field in regex

  • filter – None if there is no filter, otherwise a function which takes a dictionary and returns a boolean telling whether the line must be considered

  • fields – when there is no header, these fields name the columns

  • keep_text_when_bad_type – keeps the value as text when the type conversion does not work

  • break_at – if != -1, stop when this limit is reached

  • strip_space – remove space around columns if True

  • force_sep – if not None, imposes this column separator

  • nb_line_guess – number of lines used to guess types

  • mistake – no more than mistake failed conversions to numbers are allowed

  • encoding – encoding

  • strict_separator – enforces a strict number of columns; it assumes the separator never appears inside the content of a column
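To make the interplay of nb_line_guess, mistake and keep_text_when_bad_type more concrete, here is a minimal sketch of one plausible reading of these parameters. It is an assumption, not the logic of the class: the helper names guess_type and convert are hypothetical, and the rule "a numeric type is accepted if at most mistake sampled values fail to convert" is only an interpretation of the parameter description.

```python
def guess_type(values, nb_line_guess=100, mistake=3):
    """Guess int, float or str from the first nb_line_guess values.

    Assumption: a numeric type is accepted when at most `mistake`
    sampled values fail to convert.
    """
    sample = values[:nb_line_guess]
    for typ in (int, float):
        errors = 0
        for value in sample:
            try:
                typ(value)
            except ValueError:
                errors += 1
        if errors <= mistake:
            return typ
    return str


def convert(value, typ, keep_text_when_bad_type=False):
    """Convert value to typ; optionally keep the raw text on failure."""
    try:
        return typ(value)
    except ValueError:
        if keep_text_when_bad_type:
            return value
        raise


ages = ["30", "25", "n/a", "41"]
age_type = guess_type(ages, mistake=1)
converted = [convert(a, age_type, keep_text_when_bad_type=True) for a in ages]
```

With a tolerance of one mistake, the column is still treated as integers and the unparseable cell is kept as text.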


__init__(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)
Parameters:
  • filename – filename

  • errors – see str (errors = …)

  • fLOG – LOG function, see fLOG

  • force_header – treats the first line as the column header, whether or not it looks like one

  • changes – renames columns; gives the correspondence, for example {"query": "query___"}. It can be a list if there is no header and you want to name the columns

  • force_noheader – there is no header at all

  • regex – specifies a different regular expression (only if changes is a list); if it is a dictionary, the class replaces the default expression with the one associated with each field in regex

  • filter – None if there is no filter, otherwise a function which takes a dictionary and returns a boolean telling whether the line must be considered

  • fields – when there is no header, these fields name the columns

  • keep_text_when_bad_type – keeps the value as text when the type conversion does not work

  • break_at – if != -1, stop when this limit is reached

  • strip_space – remove space around columns if True

  • force_sep – if not None, imposes this column separator

  • nb_line_guess – number of lines used to guess types

  • mistake – no more than mistake failed conversions to numbers are allowed

  • encoding – encoding

  • strict_separator – enforces a strict number of columns; it assumes the separator never appears inside the content of a column


__iter__()
Returns:

a dictionary { column_name: value }


__str__()

Returns the header.


static _store(output, la, encoding='utf-8')

Stores a list of dictionaries into a file (adds a header).

Parameters:
  • output – filename

  • la – list of dictionaries { key: value }

  • encoding – encoding

Warning

format is utf-8
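The idea of writing a list of dictionaries with a header line can be sketched with the standard csv module. This is not the code of _store: the function name store_rows, the tab delimiter, and the choice to collect columns in order of first appearance are assumptions made for this example.

```python
import csv


def store_rows(output, rows, encoding="utf-8"):
    """Write a list of dictionaries to a tab-separated file with a header."""
    # Collect column names in order of first appearance.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    with open(output, "w", encoding=encoding, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t",
                                lineterminator="\n", restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

Rows that lack a column get an empty cell for it, so every line has the same number of fields.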


close()

Closes the file and removes all information related to the format; the next time it is opened, the format will be checked again.


static fusion(key, files, output, force_header=False, encoding='utf-8', fLOG=<function noLOG>)

Merges several files with the same columns (a different column order is allowed).

Parameters:
  • key – columns to be compared

  • files – list of files

  • output – output file

  • force_header – impose the first line as a header

  • encoding – encoding

  • fLOG – logging function

Warning

All files are assumed to be sorted by the columns in key.
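The requirement that the inputs are already sorted suggests a k-way merge. The sketch below shows that technique on in-memory rows with heapq.merge. It is one plausible reading of what a fusion of sorted files involves, not the implementation of this method, and merge_sorted is a hypothetical helper.

```python
import heapq


def merge_sorted(key, *tables):
    """Merge tables (lists of dict rows) that are each sorted by `key`.

    Rows are dictionaries, so the order of the columns in each input
    does not matter.
    """
    def row_key(row):
        return tuple(row[name] for name in key)

    return list(heapq.merge(*tables, key=row_key))


first = [{"id": 1, "v": "a"}, {"id": 3, "v": "c"}]
second = [{"v": "b", "id": 2}, {"v": "d", "id": 4}]  # different column order
merged = merge_sorted(["id"], first, second)
```

If an input is not sorted by the key, the merged result is not sorted either, which is why the warning above matters.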


get_columns()
Returns:

the columns


open()

Opens the file and finds out whether there is a header, what the columns are, and what their types are. Any information about the detected format is logged.
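Header detection can be approached in several ways; the documentation does not say which one the class uses. The following sketch shows one simple heuristic, stated purely as an assumption: the first row is taken as a header if some column holds text in the first row and only numbers in the sampled rows below it. The names is_number and looks_like_header are hypothetical.

```python
def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def looks_like_header(rows):
    """Heuristic (assumption): the first row is a header if at least one
    column is non-numeric in the first row and numeric in all other rows."""
    first, rest = rows[0], rows[1:]
    if not rest:
        return False
    for i, cell in enumerate(first):
        if not is_number(cell) and all(is_number(row[i]) for row in rest):
            return True
    return False


with_header = [["name", "age"], ["alice", "30"], ["bob", "25"]]
without_header = [["alice", "30"], ["bob", "25"]]
```

A heuristic of this kind cannot decide for files that contain only text columns, which is the kind of case the force_header and force_noheader parameters cover.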


sort(output, key, maxmemory=268435456, folder=None, fLOG=<function noLOG>)

Sorts a text file, even a big one; one or several columns give the order.

Parameters:
  • output – output file

  • key – the lines are sorted by these columns

  • maxmemory – the file is split into smaller files which contain no more than maxmemory lines

  • folder – the function needs to create temporary files; this folder contains them until they are removed

  • fLOG – logging function

Returns:

Warning

The file is assumed not to be open.
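Splitting a file into sorted chunks and merging them is the classic external merge sort. The sketch below illustrates the technique with the standard library only. It is not the code of this method: external_sort and its max_lines parameter are hypothetical, the key is a function applied to a whole line, and header handling is left out.

```python
import heapq
import os
import tempfile


def external_sort(src, dst, key, max_lines=1000, folder=None):
    """Sort the lines of src into dst, holding at most max_lines lines
    in memory while the sorted chunks are built."""
    chunk_paths = []

    def flush(chunk):
        # Sort one chunk in memory and write it to its own temporary file.
        fd, path = tempfile.mkstemp(dir=folder, text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in sorted(chunk, key=key))
        chunk_paths.append(path)

    chunk = []
    with open(src, encoding="utf-8") as f:
        for line in f:
            chunk.append(line.rstrip("\n"))
            if len(chunk) >= max_lines:
                flush(chunk)
                chunk = []
    if chunk:
        flush(chunk)

    # Merge the sorted chunks lazily, then remove the temporary files.
    files = [open(path, encoding="utf-8") for path in chunk_paths]
    try:
        streams = [(line.rstrip("\n") for line in f) for f in files]
        with open(dst, "w", encoding="utf-8") as out:
            for line in heapq.merge(*streams, key=key):
                out.write(line + "\n")
    finally:
        for f in files:
            f.close()
        for path in chunk_paths:
            os.remove(path)
```

A call such as external_sort("in.tsv", "out.tsv", key=lambda line: line.split("\t")[0]) would order the lines by their first column, where the file names are placeholders.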
