module sql.file_text_binary_columns
Short summary
module pyensae.sql.file_text_binary_columns
Contains a class which iterates over the rows of a text file structured as a table.
Classes
class | truncated documentation |
---|---|
TextFileColumns | This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted … |
Static Methods
staticmethod | truncated documentation |
---|---|
_store | Stores a list of dictionaries into a file (adds a header). |
fusion | Does a fusion between several files with the same columns (different order is allowed). |
Methods
method | truncated documentation |
---|---|
__str__ | Returns the header. |
close | Closes the file and removes all information related to the format; the next time it is opened, the format will be checked … |
open | Opens the file and finds out whether there is a header, what the columns are and what their types are. |
sort | Sorts a text file, even a big one; one or several columns give the order. |
Documentation
Contains a class which iterates over the rows of a text file structured as a table.
- class pyensae.sql.file_text_binary_columns.TextFileColumns(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)
Bases: TextFile
This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted as a TSV file or a file containing columns. The separator is found automatically. The columns are assumed to be named in the first line, but a header is not mandatory. The class walks through the file with an iterator, and every line is automatically converted into a dictionary { column: value }. If the class was able to guess the type of each column, the conversion takes place automatically.

    from pyensae.sql.file_text_binary_columns import TextFileColumns

    f = TextFileColumns(filename)  # filename is a file
    # the separator is unknown --> the class automatically determines it
    # as well as the columns and their type
    f.open()
    for d in f:
        print(d)  # d is a dictionary
    f.close()

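The behaviour described above (guess the separator, read the header, yield one dictionary per line) can be illustrated with a small standard-library sketch. It is not the code of TextFileColumns: the candidate separator list and the helper names guess_separator and iter_rows are assumptions made for the example, and the sketch loads all lines in memory, which the class does not need to do.

```python
import io

# Assumption: a small list of candidate separators, not the library's own list.
CANDIDATE_SEPARATORS = ["\t", ";", ",", "|"]


def guess_separator(lines):
    """Return the candidate that splits every sampled line into the
    same number of fields (more than one), preferring the largest count."""
    best, best_count = None, 1
    for sep in CANDIDATE_SEPARATORS:
        counts = {len(line.split(sep)) for line in lines}
        if len(counts) == 1:
            n = counts.pop()
            if n > best_count:
                best, best_count = sep, n
    return best


def iter_rows(stream):
    """Yield one {column: value} dictionary per data line,
    using the first non-empty line as the header."""
    lines = [line.rstrip("\r\n") for line in stream if line.strip()]
    sep = guess_separator(lines[:100])
    if sep is None:
        raise ValueError("no consistent separator found")
    header = lines[0].split(sep)
    for line in lines[1:]:
        yield dict(zip(header, line.split(sep)))


sample = io.StringIO("name;age\nalice;31\nbob;27\n")
rows = list(iter_rows(sample))
print(rows)
```

Values stay as strings in this sketch; type conversion is a separate step.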
attribute | meaning |
---|---|
_force_header | there is a header even if it is not detected |
_force_noheader | there is no header even if one is detected |
_changes | replaces the column names |
_regexfix | imposes a regular expression to interpret a line instead of the automatically built one |
_filter_dict | a function which takes a dictionary and returns a boolean which tells if the line must be considered or not |
_fields | names of the columns (if there is no header) |
Spaces and non-ASCII characters cannot be used to name a column. Each column name must be usable as a named group in a regular expression.
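The constraint on column names follows from how named groups work in Python's re module. The sketch below is an illustration under assumptions: the tab separator, the column names and the helper is_valid_column_name are hypothetical, and the pattern is not the one the class builds internally. It only demonstrates the case of a name containing a space.

```python
import re

columns = ["query", "count"]

# One named group per column, joined by a tab (assumed separator).
pattern = re.compile("\t".join("(?P<%s>[^\t]*)" % c for c in columns))

match = pattern.fullmatch("python csv\t12")
row = match.groupdict()
print(row)


def is_valid_column_name(name):
    """Return True if `name` can be used as a named group in a pattern."""
    try:
        re.compile("(?P<%s>.*)" % name)
        return True
    except re.error:
        return False


print(is_valid_column_name("query"))      # a plain identifier
print(is_valid_column_name("my column"))  # contains a space
```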
- Parameters:
filename – filename
errors – see str (errors = …)
fLOG – logging function, see fLOG
force_header – treats the first line as the column header, whether or not it looks like one
changes – changes the column names by giving the correspondence, example: {"query": "query___"}; it can be a list if there is no header and you want to name every column
force_noheader – there is no header at all
regex – specifies a different regular expression (only if changes is a list); if it is a dictionary, the class replaces the default expression of a field with the one associated with that field in regex
filter – None if there is no filter, otherwise a function which takes a dictionary and returns a boolean which tells if the line must be considered or not
fields – when there is no header, these fields name the columns
keep_text_when_bad_type – keeps the text value when the type conversion does not work
break_at – if != -1, stops when this limit is reached
strip_space – removes spaces around column values if True
force_sep – if != None, imposes a column separator
nb_line_guess – number of lines used to guess types
mistake – no more than mistake failed conversions to numbers are allowed
encoding – encoding
strict_separator – enforces a strict number of columns; it assumes no column value contains the separator
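How nb_line_guess, mistake and keep_text_when_bad_type could interact is sketched below. This is a guess at the general idea, not the library's implementation: the functions guess_type and convert are hypothetical, and the rule "try int, then float, else str" is an assumption made for the example.

```python
def guess_type(values, nb_line_guess=100, mistake=3):
    """Guess a column type from the first `nb_line_guess` values,
    tolerating up to `mistake` values that fail to convert."""
    sample = values[:nb_line_guess]
    for candidate in (int, float):
        errors = 0
        for value in sample:
            try:
                candidate(value)
            except ValueError:
                errors += 1
        if errors <= mistake:
            return candidate
    return str


def convert(value, typ, keep_text_when_bad_type=False):
    """Convert `value` to `typ`; on failure either keep the raw text
    or re-raise, depending on `keep_text_when_bad_type`."""
    try:
        return typ(value)
    except ValueError:
        if keep_text_when_bad_type:
            return value
        raise


ages = ["31", "27", "unknown", "45"]
age_type = guess_type(ages, mistake=1)
converted = [convert(a, age_type, keep_text_when_bad_type=True) for a in ages]
print(age_type, converted)
```

A filter in the sense of the parameter list would then be any function such as lambda d: d["age"] != "unknown", applied to each row dictionary.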
- __iter__()
- Returns:
a dictionary { column_name: value }
- __str__()
Returns the header.
- static _store(output, la, encoding='utf-8')
Stores a list of dictionaries into a file (adds a header).
- Parameters:
output – filename
la – list of dictionaries { key: value }
encoding – encoding
Warning
format is utf-8
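Writing a list of dictionaries with a header line can be sketched with csv.DictWriter from the standard library. The function store_rows below is a hypothetical stand-in, not _store itself; the tab separator and the choice to write missing keys as empty fields are assumptions.

```python
import csv
import os
import tempfile


def store_rows(output, rows, encoding="utf-8"):
    """Write a list of dictionaries to `output` as tab-separated text,
    preceded by a header line."""
    # Union of all keys in first-seen order, so every row fits the header.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    with open(output, "w", encoding=encoding, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t",
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)  # missing keys are written as empty fields
    return columns


path = os.path.join(tempfile.mkdtemp(), "rows.tsv")
store_rows(path, [{"name": "alice", "age": 31}, {"name": "bob", "city": "Paris"}])
with open(path, encoding="utf-8") as f:
    content = f.read()
print(content)
```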
- close()
Closes the file and removes all information related to the format; the next time it is opened, the format will be checked again.
- static fusion(key, files, output, force_header=False, encoding='utf-8', fLOG=<function noLOG>)
Does a fusion between several files with the same columns (different order is allowed).
- Parameters:
key – columns to be compared
files – list of files
output – output file
force_header – imposes the first line as a header
encoding – encoding
fLOG – logging function
Warning
We assume all files are sorted according to the columns in key.
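The requirement that inputs are already sorted on key is what makes a streaming merge possible. The sketch below shows one plausible reading of such a fusion using heapq.merge; it is not the implementation of fusion, and how the real method treats rows with equal keys may differ. The function merge_sorted and the sample data are hypothetical, and keys are compared as strings.

```python
import csv
import heapq
import io


def merge_sorted(streams, key, sep="\t"):
    """Lazily merge already-sorted tabular streams on the columns in `key`.
    Column order may differ between streams because rows are read as
    dictionaries keyed by column name."""
    readers = [csv.DictReader(s, delimiter=sep) for s in streams]
    return heapq.merge(*readers, key=lambda row: tuple(row[k] for k in key))


# Two inputs with the same columns in a different order, each sorted on "id".
a = io.StringIO("id\tname\n1\talice\n3\tcarol\n")
b = io.StringIO("name\tid\nbob\t2\ndave\t4\n")

merged = list(merge_sorted([a, b], key=["id"]))
names = [row["name"] for row in merged]
print(names)
```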
- get_columns()
- Returns:
the columns
- open()
Opens the file and finds out whether there is a header, what the columns are and what their types are. Any information about the detected format is logged.
- sort(output, key, maxmemory=268435456, folder=None, fLOG=<function noLOG>)
Sorts a text file, even a big one; one or several columns give the order.
- Parameters:
output – output file result
key – lines are sorted according to these columns
maxmemory – the file is split into smaller files which contain no more than maxmemory lines
folder – the function needs to create temporary files, this folder will contain them before they get removed
fLOG – logging function
- Returns:
Warning
We assume this file is not opened.
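Sorting a file too large for memory by splitting it into sorted temporary files and merging them is a standard external merge sort. The sketch below illustrates that general technique under assumptions; it is not the code of sort. The function external_sort, its key_index and max_lines parameters, the tab separator and the string comparison of keys are all choices made for the example.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(input_path, output_path, key_index, max_lines=1000,
                  folder=None, sep="\t"):
    """Sort a text file with a header line on one column, holding at most
    `max_lines` data lines in memory at a time."""
    def sort_key(line):
        return line.rstrip("\n").split(sep)[key_index]

    chunk_paths = []
    with open(input_path, encoding="utf-8") as src:
        header = src.readline()
        while True:
            chunk = list(itertools.islice(src, max_lines))
            if not chunk:
                break
            # Make sure every line ends with a newline before it is rewritten.
            chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(dir=folder, suffix=".chunk")
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    files = [open(p, encoding="utf-8") for p in chunk_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            out.write(header)
            # Merge the sorted chunks lazily, line by line.
            out.writelines(heapq.merge(*files, key=sort_key))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)  # temporary files are removed once merged


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.tsv")
dst_path = os.path.join(workdir, "out.tsv")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("name\tcity\ncarol\tLyon\nalice\tParis\ndave\tNice\nbob\tLille\n")

external_sort(src_path, dst_path, key_index=0, max_lines=2, folder=workdir)
with open(dst_path, encoding="utf-8") as f:
    sorted_text = f.read()
print(sorted_text)
```

The input file is opened and closed inside the function, which matches the expectation that the file is not already open when sorting starts.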