module `sql.file_text_binary_columns`¶

Short summary¶

module pyensae.sql.file_text_binary_columns

Contains a class which iterates over the rows of a text file structured as a table.

Classes¶

class	truncated documentation
`TextFileColumns`	This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted …

Static Methods¶

staticmethod	truncated documentation
`_store`	Stores a list of dictionaries into a file (adds a header).
`fusion`	Merges several files with the same columns (a different column order is allowed).

Methods¶

method	truncated documentation
`__init__`
`__iter__`
`__str__`	Returns the header.
`close`	Closes the file and removes all information related to the format; the next time it is opened, the format will be checked …
`get_columns`
`open`	Opens the file and finds out whether there is a header, what the columns are, and what their types are.
`sort`	Sorts a text file, even a big one; one or several columns give the order.

Documentation¶

Contains a class which iterates over the rows of a text file structured as a table.

source on GitHub

class pyensae.sql.file_text_binary_columns.TextFileColumns(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)¶

Bases: TextFile

This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted as a TSV file or, more generally, as a file containing columns. The separator is detected automatically. The column names are assumed to be on the first line, but this is not mandatory. The class walks through the file with an iterator, and every line is automatically converted into a dictionary { column : value }. If the class was able to guess the type of a column, its values are converted automatically.

from pyensae.sql.file_text_binary_columns import TextFileColumns

# filename is the path to a text file; the separator is unknown,
# so the class determines it automatically, as well as the columns and their types
f = TextFileColumns(filename)
f.open()
for d in f:
    print(d)       # d is a dictionary { column : value }
f.close()
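
The behaviour described above (binary read, automatic separator detection, rows returned as dictionaries) can be illustrated without the library. The sketch below is not the pyensae implementation: `guess_separator` and `iter_rows` are hypothetical helpers, and both the list of candidate separators and the "most consistent count per line" heuristic are assumptions made for the example.

```python
import collections


def guess_separator(lines, candidates=("\t", ";", ",", "|")):
    """Return the candidate whose per-line count is non-zero and most consistent."""
    best, best_score = None, 0
    for sep in candidates:
        counts = collections.Counter(line.count(sep) for line in lines)
        count, freq = counts.most_common(1)[0]
        if count > 0 and freq > best_score:
            best, best_score = sep, freq
    return best


def iter_rows(filename, encoding="utf-8", nb_line_guess=100):
    """Yield each data line as a dictionary {column: value} (values stay strings)."""
    # Reading bytes first means a null character is just another byte.
    with open(filename, "rb") as f:
        text = f.read().decode(encoding, errors="replace")
    lines = [line.rstrip("\r") for line in text.split("\n") if line.strip("\r")]
    if not lines:
        return
    # Only the first nb_line_guess lines are used to guess the separator.
    sep = guess_separator(lines[:nb_line_guess])
    if sep is None:
        raise ValueError("no separator found")
    header = lines[0].split(sep)  # assumes the first line holds the column names
    for line in lines[1:]:
        yield dict(zip(header, line.split(sep)))
```

The sketch loads the whole file in memory and does not guess types, which keeps it short; it only shows the shape of the rows an iterator of this kind produces.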

attribute	meaning
_force_header	there is a header even if none is detected
_force_noheader	there is no header even if one is detected
_changes	replaces the column names
_regexfix	imposes a regular expression to interpret a line instead of the automatically built one
_filter_dict	a function which takes a dictionary and returns a boolean telling whether the line must be considered or not
_fields	names of the columns (if there is no header)

Spaces and non-ASCII characters cannot be used in a column name. The name must be usable as a named group in a regular expression.
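
The constraint on column names follows from the way a line can be parsed with one named group per column. The sketch below illustrates the idea only: `is_valid_column_name` and `build_line_regex` are hypothetical helpers, not part of pyensae, and the pattern assumes a single-character separator that never appears inside a cell.

```python
import re


def is_valid_column_name(name):
    """True if name is ASCII-only and usable as a regex group name."""
    return name.isascii() and name.isidentifier()


def build_line_regex(columns, sep="\t"):
    """Compile a pattern with one named group per column."""
    for name in columns:
        if not is_valid_column_name(name):
            raise ValueError("invalid column name: %r" % name)
    # A cell is any run of characters that are not the separator.
    cell = "[^%s]*" % re.escape(sep)
    groups = ["(?P<%s>%s)" % (name, cell) for name in columns]
    return re.compile("^" + re.escape(sep).join(groups) + "$")
```

A match object built this way gives direct access to the row as a dictionary through `groupdict()`, which is why every column name has to be a legal group name.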

Parameters:

filename – filename
errors – see str (errors = …)
fLOG – LOG function, see fLOG
force_header – treats the first line as the column header, whether or not it looks like one
changes – renames columns by giving the correspondence, for example { "query": "query___" }; it can be a list if there is no header and you want to name the columns
force_noheader – there is no header at all
regex – specifies a different regular expression (only if changes is a list); if it is a dictionary, the class replaces the default expression of a field with the one associated with that field in regex
filter – None if there is no filter, otherwise a function which takes a dictionary and returns a boolean telling whether the line must be considered or not
fields – when there is no header, these fields name the columns
keep_text_when_bad_type – keep the value as text when the type conversion does not work
break_at – if != -1, stop when this limit is reached
strip_space – remove spaces around column values if True
force_sep – if != None, impose a column separator
nb_line_guess – number of lines used to guess types
mistake – no more than mistake failed conversions into numbers are allowed
encoding – encoding
strict_separator – strict number of columns; it assumes the separator never appears inside the content of a column
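
The parameters nb_line_guess, mistake and keep_text_when_bad_type suggest a type guess made on a sample of lines with a tolerance for failed conversions. The following sketch shows one way such a rule could work; it is an assumption for illustration, not the logic used by pyensae. `guess_column_type` and `convert_value` are hypothetical names, and returning None for an unconvertible value is a choice made only for this example.

```python
def guess_column_type(values, nb_line_guess=100, mistake=3):
    """Guess int, float or str from the first nb_line_guess values of a column."""
    sample = values[:nb_line_guess]
    for typ in (int, float):
        errors = 0
        for value in sample:
            try:
                typ(value)
            except ValueError:
                errors += 1
        # Accept the type if few conversions failed and at least one succeeded.
        if errors <= mistake and errors < len(sample):
            return typ
    return str


def convert_value(value, typ, keep_text_when_bad_type=False):
    """Convert value to typ; on failure keep the text or return None (hypothetical)."""
    try:
        return typ(value)
    except ValueError:
        return value if keep_text_when_bad_type else None
```

With a tolerance above zero, a column that is numeric except for a few stray cells is still typed as numeric, and keep_text_when_bad_type then decides what happens to those cells.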

__init__(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)¶

Parameters:

filename – filename
errors – see str (errors = …)
fLOG – LOG function, see fLOG
force_header – treats the first line as the column header, whether or not it looks like one
changes – renames columns by giving the correspondence, for example { "query": "query___" }; it can be a list if there is no header and you want to name the columns
force_noheader – there is no header at all
regex – specifies a different regular expression (only if changes is a list); if it is a dictionary, the class replaces the default expression of a field with the one associated with that field in regex
filter – None if there is no filter, otherwise a function which takes a dictionary and returns a boolean telling whether the line must be considered or not
fields – when there is no header, these fields name the columns
keep_text_when_bad_type – keep the value as text when the type conversion does not work
break_at – if != -1, stop when this limit is reached
strip_space – remove spaces around column values if True
force_sep – if != None, impose a column separator
nb_line_guess – number of lines used to guess types
mistake – no more than mistake failed conversions into numbers are allowed
encoding – encoding
strict_separator – strict number of columns; it assumes the separator never appears inside the content of a column
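
The changes and filter parameters both operate on rows seen as dictionaries. The sketch below shows the effect they describe on plain Python dictionaries. `apply_changes_and_filter` is a hypothetical helper, and applying the renaming before the filter is an assumption: the documentation does not state the order.

```python
def apply_changes_and_filter(rows, changes=None, filter=None):
    """Rename columns according to changes, then keep rows accepted by filter."""
    for row in rows:
        if changes:
            # Columns absent from changes keep their original name.
            row = {changes.get(name, name): value for name, value in row.items()}
        if filter is None or filter(row):
            yield row
```

For instance, with changes={"query": "query___"} and a filter such as `lambda row: row["count"] > 0`, only rows with a positive count are produced, and their first column is named query___.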

__iter__()¶

Returns: a dictionary { column_name: value }

__str__()¶

Returns the header.

static _store(output, la, encoding='utf-8')¶

Stores a list of dictionaries into a file (adds a header).

Parameters:

output – filename
la – list of dictionaries { key: value }
encoding – encoding

Warning

format is utf-8
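
Storing a list of dictionaries with a header line can be sketched with the standard csv module. This is an illustration of the operation, not the code of `_store`: `store_rows` is a hypothetical name, and the tab separator, the column order (order of first appearance) and the empty string for missing keys are assumptions.

```python
import csv


def store_rows(output, rows, encoding="utf-8", sep="\t"):
    """Write rows (a list of dictionaries) to output, preceded by a header line."""
    # Collect column names in order of first appearance.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    with open(output, "w", encoding=encoding, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, delimiter=sep,
                                lineterminator="\n", restval="")
        writer.writeheader()
        writer.writerows(rows)
```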

close()¶

Closes the file and removes all information related to the format; the next time it is opened, the format will be checked again.

static fusion(key, files, output, force_header=False, encoding='utf-8', fLOG=<function noLOG>)¶

Merges several files with the same columns (a different column order is allowed).

Parameters:

key – columns to be compared
files – list of files
output – output file
force_header – impose the first line as a header
encoding – encoding
fLOG – logging function

Warning

We assume all files are sorted by the columns in key.
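
Because the inputs are assumed to be sorted on the key columns, one plausible reading of this method is a k-way merge. The sketch below implements that reading with `heapq.merge`; it is not the pyensae code. `merge_sorted_files` is a hypothetical name, the files are assumed to be tab-separated with a header, and keys are compared as strings.

```python
import csv
import heapq


def merge_sorted_files(key, files, output, encoding="utf-8", sep="\t"):
    """Merge files already sorted on the key columns into one sorted output."""
    handles = [open(name, "r", encoding=encoding, newline="") for name in files]
    try:
        readers = [csv.DictReader(handle, delimiter=sep) for handle in handles]
        # The column order of the first file is used for the output.
        columns = readers[0].fieldnames
        for reader in readers[1:]:
            if sorted(reader.fieldnames) != sorted(columns):
                raise ValueError("files do not share the same columns")

        def sort_key(row):
            return tuple(row[name] for name in key)

        with open(output, "w", encoding=encoding, newline="") as out:
            writer = csv.DictWriter(out, fieldnames=columns, delimiter=sep,
                                    lineterminator="\n")
            writer.writeheader()
            # heapq.merge reads the sorted inputs lazily, row by row.
            writer.writerows(heapq.merge(*readers, key=sort_key))
    finally:
        for handle in handles:
            handle.close()
```

Reading rows as dictionaries is what makes a different column order harmless: each value is looked up by name before being written in the output order.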

get_columns()¶

Returns: the columns

open()¶

Opens the file and finds out whether there is a header, what the columns are, and what their types are. Any information about the format that was found is logged.
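
Header detection is necessarily heuristic. The sketch below shows one simple rule, given only as an assumption about how such a check might look, not as the rule pyensae applies: the first row is taken as a header when none of its cells is a number while some later cell is. `looks_like_header` is a hypothetical helper.

```python
def looks_like_header(first_row, other_rows):
    """Guess whether first_row holds column names rather than data."""
    def is_number(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    # A numeric cell in the first row suggests it is data, not a header.
    if any(is_number(cell) for cell in first_row):
        return False
    # Otherwise the first row stands out only if later rows contain numbers.
    return any(is_number(cell) for row in other_rows for cell in row)
```

A file made only of text columns defeats this rule, which is the kind of case where parameters such as force_header and force_noheader are useful.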

sort(output, key, maxmemory=268435456, folder=None, fLOG=<function noLOG>)¶

Sorts a text file, even a big one; one or several columns give the order.

Parameters:

output – output file result
key – the lines are sorted according to these columns
maxmemory – the file is split into smaller files which contain no more than maxmemory lines
folder – the function needs to create temporary files; this folder contains them until they are removed
fLOG – logging function

Returns:

Warning

We assume the file is not open.
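
Splitting a file into sorted pieces stored in temporary files and then merging them is the classic external merge sort. The sketch below illustrates that technique under stated assumptions and is not the pyensae implementation: `external_sort` and its `max_lines` parameter are hypothetical, the file is assumed to be tab-separated with a header, and keys are compared as strings.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(input_path, output_path, key, max_lines=100000,
                  folder=None, sep="\t", encoding="utf-8"):
    """Sort a delimited text file with a header by the columns named in key."""
    chunk_paths = []
    try:
        with open(input_path, "r", encoding=encoding, newline="") as f:
            header = f.readline().rstrip("\r\n")
            positions = [header.split(sep).index(name) for name in key]

            def sort_key(line):
                cells = line.rstrip("\r\n").split(sep)
                return tuple(cells[i] for i in positions)

            while True:
                # Read at most max_lines lines, sort them in memory, spill to disk.
                chunk = [line.rstrip("\r\n") + "\n"
                         for line in itertools.islice(f, max_lines)]
                if not chunk:
                    break
                chunk.sort(key=sort_key)
                fd, path = tempfile.mkstemp(dir=folder, suffix=".chunk")
                chunk_paths.append(path)
                with os.fdopen(fd, "w", encoding=encoding, newline="") as out:
                    out.writelines(chunk)

        # k-way merge: the sorted chunks are read lazily, line by line.
        handles = [open(p, "r", encoding=encoding, newline="") for p in chunk_paths]
        try:
            with open(output_path, "w", encoding=encoding, newline="") as out:
                out.write(header + "\n")
                out.writelines(heapq.merge(*handles, key=sort_key))
        finally:
            for handle in handles:
                handle.close()
    finally:
        # Temporary chunk files are removed whether or not the sort succeeded.
        for path in chunk_paths:
            os.remove(path)
```

Memory use is bounded by the size of one chunk, so the limit on lines per chunk plays the role that maxmemory plays in the signature above.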
