module sql.file_text_binary_columns

Short summary

module pyensae.sql.file_text_binary_columns

Contains a class which iterates over the rows of a text file structured as a table.

source on GitHub



truncated documentation


This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted …

Static Methods


truncated documentation


Stores a list of dictionaries into a file (add a header).


Merges several files that have the same columns (the column order may differ).



truncated documentation




Returns the header.


Closes the file and removes all information related to the format; next time it is opened, the format will be checked …



Opens the file and finds out whether there is a header, what the columns are, and what their types are.


Sorts a text file, even a big one; one or several columns give the order.


class pyensae.sql.file_text_binary_columns.TextFileColumns(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)

Bases: TextFile

This class opens a text file as if it were a binary file, so it can deal with null characters. The file is interpreted as a TSV file or, more generally, a file containing columns. The separator is detected automatically. The column names are assumed to be on the first line, but this is not mandatory. The class walks through the file with an iterator, and every line is automatically converted into a dictionary { column : value }. When the class is able to guess the type of a column, its values are converted to that type automatically.

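from pyensae.sql.file_text_binary_columns import TextFileColumns

# filename is the path of a text file; the separator is unknown, so the
# class determines it automatically, as well as the columns and their types
f = TextFileColumns(filename)
f.open()           # detects the header, the columns and their types
for d in f:
    print(d)       # d is a dictionary { column: value }
f.close()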
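
To give an intuition of what automatic separator detection can look like, here is a minimal standalone sketch. It is not the pyensae implementation: the helper names guess_separator and iter_rows, the candidate list and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical list of separators to try; the real class may use another one.
CANDIDATES = ["\t", ";", ",", "|"]


def guess_separator(lines, candidates=CANDIDATES):
    """Return the candidate whose per-line count is the most consistent."""
    best, best_score = None, 0
    for sep in candidates:
        # How many lines contain the separator once, twice, ...
        counts = Counter(line.count(sep) for line in lines)
        counts.pop(0, None)  # lines without the separator do not vote
        if not counts:
            continue
        # Score = number of lines sharing the most frequent count.
        score = counts.most_common(1)[0][1]
        if score > best_score:
            best, best_score = sep, score
    return best


def iter_rows(lines, sep=None):
    """Yield one dictionary {column: text value} per line after the header."""
    lines = [line.rstrip("\n") for line in lines]
    sep = sep or guess_separator(lines)
    header = lines[0].split(sep)
    for line in lines[1:]:
        yield dict(zip(header, line.split(sep)))


sample = ["name;age", "alice;31", "bob, jr;27"]
rows = list(iter_rows(sample))
```

In this sketch every value stays a string; type conversion is a separate step.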




there is a header even if not detected


there is no header even if detected


replaces the column names


impose a regular expression to interpret a line instead of the automatically built one


a function which takes a dictionary and returns a boolean telling whether the line must be considered or not


name of the columns (if there is no header)

Spaces and non-ASCII characters cannot be used in a column name, because the name must be usable as a named group in a regular expression.
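
The constraint on column names can be illustrated with a short sketch that builds a line-matching regular expression from a list of column names. The function build_line_regex is hypothetical and only mirrors the idea; the expression the class actually builds is not shown in this page.

```python
import re


def build_line_regex(columns, sep="\t"):
    """Build a regular expression with one named group per column."""
    for name in columns:
        # A group name must be a valid identifier (so no spaces); the
        # documentation above also excludes non-ASCII characters.
        if not name.isidentifier() or not name.isascii():
            raise ValueError(f"invalid column name: {name!r}")
    escaped = re.escape(sep)
    field = "[^" + escaped + "]*"  # anything but the separator
    pattern = escaped.join(f"(?P<{name}>{field})" for name in columns)
    return re.compile("^" + pattern + "$")


line_regex = build_line_regex(["name", "age"])
row = line_regex.match("alice\t31").groupdict()
```

Here row maps each column name to the text captured by the group of the same name.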

  • filename – name of the text file to read

  • errors – decoding error handling, see str (errors = …)

  • fLOG – logging function, see fLOG

  • force_header – treats the first line as the column header, whether or not it looks like one

  • changes – renames columns by giving the correspondence, for example {"query": "query___"}; it can be a list if there is no header and you want to name the columns

  • force_noheader – there is no header at all

  • regex – specifies a different regular expression (only if changes is a list); if it is a dictionary, the class replaces the default expression of a field by the one associated with that field in regex

  • filter – None if there is no filter, otherwise a function which takes a dictionary and returns a boolean telling whether the line must be considered or not

  • fields – when there is no header, these fields name the columns

  • keep_text_when_bad_type – keeps the text value when the type conversion does not work

  • break_at – if != -1, stops when this limit is reached

  • strip_space – removes spaces around column values if True

  • force_sep – if != None, imposes a column separator

  • nb_line_guess – number of lines used to guess the types

  • mistake – no more than mistake failed conversions into numbers are allowed

  • encoding – encoding of the file

  • strict_separator – strict number of columns; it assumes no column value contains the separator
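
The interplay between nb_line_guess, mistake and keep_text_when_bad_type can be pictured with the following sketch. It is an assumption about the general idea, not the library's code: guess_type and convert are hypothetical helpers, and returning None for an unconvertible value is a choice made here because the page does not say what the class does in that case.

```python
def guess_type(values, mistake=3):
    """Return int, float or str for a sample of text values.

    A numeric type is accepted if at most `mistake` values fail to convert.
    """
    for candidate in (int, float):
        errors = 0
        for value in values:
            try:
                candidate(value)
            except ValueError:
                errors += 1
        if errors <= mistake:
            return candidate
    return str


def convert(value, target, keep_text_when_bad_type=False):
    """Convert one value; on failure keep the text or return None (assumption)."""
    try:
        return target(value)
    except ValueError:
        return value if keep_text_when_bad_type else None


sample = ["1", "2", "x", "4"]          # plays the role of the first nb_line_guess lines
column_type = guess_type(sample, mistake=1)
converted = [convert(v, column_type, keep_text_when_bad_type=True) for v in sample]
```

With mistake=1 the single bad value does not prevent the column from being treated as integers, and keep_text_when_bad_type=True leaves that value as text.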


a dictionary { column_name: value }


Returns the header.

static _store(output, la, encoding='utf-8')

Stores a list of dictionaries into a file (add a header).

  • output – filename

  • la – list of dictionaries { key: value }

  • encoding – encoding


format is utf-8
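
As an illustration of storing a list of dictionaries with a header line, here is a minimal sketch based on the standard csv module. It does not reproduce _store: the name store_rows, the tab separator and the handling of missing keys (written as empty fields) are assumptions.

```python
import csv


def store_rows(output, rows, encoding="utf-8", sep="\t"):
    """Write a list of dictionaries to a text file, header first."""
    # Collect the column names in order of first appearance.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    with open(output, "w", encoding=encoding, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, delimiter=sep,
                                restval="", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

A call such as store_rows("out.tsv", [{"name": "alice", "age": 31}]) writes a two-line file: the header, then one row.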


Closes the file and removes all information related to the format; the next time the file is opened, the format will be checked again.

static fusion(key, files, output, force_header=False, encoding='utf-8', fLOG=<function noLOG>)

Merges several files that have the same columns (the column order may differ).

  • key – columns to be compared

  • files – list of files

  • output – output file

  • force_header – treats the first line as a header

  • encoding – encoding

  • fLOG – logging function


All files are assumed to be sorted on the columns listed in key.
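
Merging files that are already sorted on the key columns is typically done with a k-way merge. The sketch below shows that technique with heapq.merge from the standard library; it is not the code of fusion. The names read_rows and merge_sorted_files are hypothetical, keys are compared as text, and how fusion treats rows sharing the same key is not described in this page, so the sketch simply keeps them all.

```python
import csv
import heapq


def read_rows(filename, sep="\t", encoding="utf-8"):
    """Yield the rows of a file with a header as dictionaries."""
    with open(filename, "r", encoding=encoding, newline="") as f:
        yield from csv.DictReader(f, delimiter=sep)


def merge_sorted_files(key, files, output, sep="\t", encoding="utf-8"):
    """Merge files sorted on `key` into one sorted file."""
    # The output uses the column order of the first file.
    with open(files[0], "r", encoding=encoding, newline="") as f:
        columns = next(csv.reader(f, delimiter=sep))
    streams = [read_rows(name, sep, encoding) for name in files]
    # heapq.merge only keeps one pending row per input in memory.
    merged = heapq.merge(*streams, key=lambda row: tuple(row[k] for k in key))
    with open(output, "w", encoding=encoding, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, delimiter=sep,
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(merged)
```

Because rows are handled as dictionaries, the inputs may list their columns in different orders.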


the columns


Opens the file and finds out whether there is a header, what the columns are, and what their types are. Any information about the detected format is logged.

sort(output, key, maxmemory=268435456, folder=None, fLOG=<function noLOG>)

Sorts a text file, even a big one; one or several columns give the order.

  • output – output file result

  • key – the lines are sorted on these columns

  • maxmemory – the file is split into smaller files, each containing no more than maxmemory lines

  • folder – the function needs to create temporary files, this folder will contain them before they get removed

  • fLOG – logging function



The file is assumed not to be open.
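
Sorting a file too large for memory by splitting it into sorted temporary files and merging them is known as an external merge sort. The sketch below shows that general technique under simplifying assumptions; it is not the code of sort. The name sort_big_file and the parameters max_lines and sep are hypothetical, the first line is assumed to be a header, and keys are compared as text.

```python
import heapq
import itertools
import os
import tempfile


def _with_newline(line):
    """Make sure a line ends with a newline so merged lines stay separate."""
    return line if line.endswith("\n") else line + "\n"


def sort_big_file(filename, output, key, max_lines=100000, sep="\t",
                  folder=None, encoding="utf-8"):
    """Sort a text file with a header on the columns listed in `key`."""
    chunks = []
    with open(filename, "r", encoding=encoding) as f:
        header = _with_newline(f.readline())
        columns = header.rstrip("\n").split(sep)
        positions = [columns.index(name) for name in key]

        def sort_key(line):
            fields = line.rstrip("\n").split(sep)
            return tuple(fields[i] for i in positions)

        while True:
            # Read at most max_lines lines, sort them, store them in a temp file.
            lines = [_with_newline(x) for x in itertools.islice(f, max_lines)]
            if not lines:
                break
            lines.sort(key=sort_key)
            with tempfile.NamedTemporaryFile("w", delete=False, dir=folder,
                                             suffix=".tmp",
                                             encoding=encoding) as tmp:
                tmp.writelines(lines)
            chunks.append(tmp.name)

    handles = [open(name, "r", encoding=encoding) for name in chunks]
    try:
        with open(output, "w", encoding=encoding) as out:
            out.write(header)
            # Merge the sorted chunks, one pending line per chunk in memory.
            out.writelines(heapq.merge(*handles, key=sort_key))
    finally:
        for handle in handles:
            handle.close()
        for name in chunks:
            os.remove(name)  # temporary files are removed at the end
```

The folder argument plays the same role as in the signature above: it receives the temporary files until they are deleted.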
