module sql.file_text_binary¶
Short summary¶
module pyensae.sql.file_text_binary
contains a class which opens a text file as a binary file.
Classes¶
class | truncated documentation
---|---
TextFile | This class opens a text file as if it were a binary file. It can deal with null characters which are missed by open …
Methods¶
method | truncated documentation
---|---
__iter__ | Iterator
_build_regex | Builds a regular expression.
_count_s | Returns the number of every character in car.
_get_type | Guesses the type of value s.
_interpret | Splits a line into a list, separator …
_interpret_columns | Interprets the first line which contains the columns name.
_load | load…
close | Closes the file.
count_rejected_lines | Counts the number of rejected lines by regular expression exp.
get_nb_readbytes | Returns the number of read bytes.
get_nb_readlines | Returns the number of read lines.
guess_columns | Guesses the columns type.
join | Joins several files together.
open | Opens the file in reading mode.
readlines | Extracts all the lines, the file must not be opened through method open …
Documentation¶
- class pyensae.sql.file_text_binary.TextFile(filename, errors=None, fLOG=<function noLOG>, buffer_size=1048576, filter=None, separated=False, encoding='utf-8')¶
Bases: object
This class opens a text file as if it were a binary file. It can deal with null characters which are missed by open function.
attribute | meaning
---|---
filename | file name
errors | decoding in UTF-8 can raise errors; see str for the meaning of this parameter
LOG | logging function
_buffer_size | the text file is read _buffer_size bytes at a time
_filter | filter function: None, or a function returning True or False to tell whether a line should be considered
_encoding | encoding
Example:
f = TextFile(filename)
f.open()
for line in f:
    print(line)
f.close()
- Parameters:
filename – filename
errors – see str (errors = …)
fLOG – LOG function, see fLOG
buffer_size – buffer size (mostly used to test the reading function)
filter – None if there is no filter, otherwise a function which takes a list and returns a boolean telling whether the line must be considered
separated – if True, the lines returned by the iterator are split by the most probable separator
- __init__(filename, errors=None, fLOG=<function noLOG>, buffer_size=1048576, filter=None, separated=False, encoding='utf-8')¶
- Parameters:
filename – filename
errors – see str (errors = …)
fLOG – LOG function, see fLOG
buffer_size – buffer size (mostly used to test the reading function)
filter – None if there is no filter, otherwise a function which takes a list and returns a boolean telling whether the line must be considered
separated – if True, the lines returned by the iterator are split by the most probable separator
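The filter parameter is described as a function taking a list and returning a boolean. A minimal sketch of such a function follows. The name keep_complete_rows and the sample rows are hypothetical, and the function is applied to plain lists here rather than through TextFile.

```python
def keep_complete_rows(fields):
    """Hypothetical filter: keep rows with three fields and a non-empty first field."""
    return len(fields) == 3 and fields[0] != ""


# Sample rows, already split into lists of fields (made up for illustration).
rows = [
    ["id", "name", "score"],
    ["1", "ann"],          # too short, rejected
    ["", "bob", "7"],      # empty first field, rejected
    ["2", "cy", "9"],
]
kept = [row for row in rows if keep_complete_rows(row)]
print(kept)
```

According to the parameter description, a function of this shape could be passed as filter=keep_complete_rows when building a TextFile.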
- __iter__()¶
Iterator
f = open('...', 'r')
for line in f:
    ...
f.close()
- Returns:
a str string
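The class is described as reading a text file as binary, buffer_size bytes at a time, which lets it cope with null characters. The following is a minimal sketch of that idea using only the standard library. It is not the pyensae implementation; the function name iter_lines_binary and the default errors="strict" are assumptions made for this example.

```python
import os
import tempfile


def iter_lines_binary(path, buffer_size=1048576, encoding="utf-8", errors="strict"):
    """Yield decoded lines from path, reading buffer_size bytes at a time."""
    pending = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(buffer_size)
            if not chunk:
                break
            pending += chunk
            # Every piece but the last is a complete line; the last may be partial.
            *complete, pending = pending.split(b"\n")
            for raw in complete:
                yield raw.rstrip(b"\r").decode(encoding, errors)
    if pending:
        # Last line of a file that does not end with a newline.
        yield pending.rstrip(b"\r").decode(encoding, errors)


# Usage: a small file containing a null character, read 4 bytes at a time.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\tb\r\nc\x00d\te\nlast")
lines = list(iter_lines_binary(tmp.name, buffer_size=4))
os.remove(tmp.name)
print(lines)
```

Lines are decoded only once they are complete, so a multi-byte UTF-8 character split across two chunks does not cause a decoding error.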
- _build_regex(sep, columns, exp={<class 'int'>: '([-]?[1-9][0-9]*?)|(0?)', <class 'decimal.Decimal'>: '([-]?[1-9][0-9]*?L?)|(0?)', <class 'float'>: '[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?', <class 'str'>: '.*'}, nomore=False, regex=None)¶
Builds a regular expression.
- Parameters:
sep – separator
columns – columns definition
exp – regular expression associated to each type, (see below for the default value)
nomore – private argument, no more try, not possible to simplify
regex – if the default expression for a field is not the expected one, look into regex if there is one
- Returns:
regex
Default value for exp:
{
    int: "([-]?[1-9][0-9]*?)|(0?)",
    decimal.Decimal: "([-]?[1-9][0-9]*?L?)|(0?)",
    float: "[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?",
    str: ".*"
}
- _build_regex_default_value_types = {<class 'int'>: '([-]?[1-9][0-9]*?)|(0?)', <class 'decimal.Decimal'>: '([-]?[1-9][0-9]*?L?)|(0?)', <class 'float'>: '[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?', <class 'str'>: '.*'}¶
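To see what the default per-type expressions accept, each one can be tried on its own with re.fullmatch. The sketch below only checks individual values; how _build_regex combines these expressions with the separator is not shown here, and the helper name matches is hypothetical.

```python
import decimal
import re

# Default expressions, copied from the documentation above.
DEFAULT_EXP = {
    int: "([-]?[1-9][0-9]*?)|(0?)",
    decimal.Decimal: "([-]?[1-9][0-9]*?L?)|(0?)",
    float: "[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?",
    str: ".*",
}


def matches(kind, text):
    """Return True if the whole text matches the default expression for kind."""
    return re.fullmatch(DEFAULT_EXP[kind], text) is not None


for kind, text in [(int, "42"), (int, "007"), (float, "-2.5e-3"), (float, "abc")]:
    print(kind.__name__, repr(text), matches(kind, text))
```

Note that the int expression also accepts the empty string, through its second alternative (0?).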
- _count_s(car)¶
Returns the number of every character in car.
- _get_type(s)¶
Guesses the type of value s.
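One common way to guess the type of a string value is to try conversions from the most specific type to the least specific one. The sketch below follows that approach. It is an assumption about the general technique, not the body of _get_type, and the name guess_type is hypothetical.

```python
def guess_type(value):
    """Return int, float or str depending on which conversion succeeds first."""
    for kind in (int, float):
        try:
            kind(value)
            return kind
        except ValueError:
            continue
    return str


for sample in ["12", "1.5", "1e3", "abc"]:
    print(repr(sample), guess_type(sample).__name__)
```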
- _interpret(line)¶
Splits a line into a list, separator \t.
- Parameters:
line – string
- Returns:
list
- _interpret_columns(line)¶
Interprets the first line which contains the columns name.
- Parameters:
line – string
- Returns:
dictionary { name:position }
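The returned mapping { name:position } can be pictured with a short sketch. It assumes a tab-separated header line; the function below is a standalone illustration with a hypothetical sep argument, not the method itself.

```python
def interpret_columns(line, sep="\t"):
    """Map each column name of a header line to its position."""
    return {name: position for position, name in enumerate(line.split(sep))}


header = "id\tname\tscore"
print(interpret_columns(header))
```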
- _load(filename, this_column, file_column, prefix, **param)¶
load…
- _sep_available = '\t;,| '¶
- _split_expr = re.compile('\\r?\\t')¶
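The expression stored in _split_expr splits on a tab, optionally preceded by a carriage return. The sketch below rebuilds the same pattern with the standard re module to show its effect on a line; the names SPLIT_EXPR and interpret are local to this example.

```python
import re

# Same pattern as _split_expr above: an optional \r followed by a tab.
SPLIT_EXPR = re.compile("\\r?\\t")


def interpret(line):
    """Split a line into fields on tabs, dropping a \\r placed before a tab."""
    return SPLIT_EXPR.split(line)


print(interpret("a\tb\r\tc"))
```

A line without any tab comes back as a single-element list, so other separators such as ; are not handled by this expression.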
- close()¶
Closes the file.
- count_rejected_lines(header, exp, output=None)¶
Counts the number of rejected lines by regular expression exp.
- Parameters:
header – whether the first line is a header
exp – regular expression
output – if not None, output is a stream which receives the unrecognized lines (see below)
- Returns:
nb_accepted, nb_rejected
Format for the file containing the unrecognized lines:
line number line
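The counting logic can be sketched on a list of lines held in memory. Several details are assumptions of this example and are not confirmed by the documentation: the whole line must match (re.fullmatch), line numbers start at 0 and include the header, and the line number and the line are separated by a tab in the output. The function name count_rejected is hypothetical.

```python
import io
import re


def count_rejected(lines, exp, header=False, output=None):
    """Return (nb_accepted, nb_rejected) for lines checked against exp."""
    pattern = re.compile(exp)
    accepted = rejected = 0
    for number, line in enumerate(lines):
        if header and number == 0:
            continue  # the header line is not checked
        if pattern.fullmatch(line):
            accepted += 1
        else:
            rejected += 1
            if output is not None:
                # Assumed format: line number, tab, line.
                output.write(f"{number}\t{line}\n")
    return accepted, rejected


lines = ["id\tscore", "1\t3.5", "2\tn/a", "3\t4"]
out = io.StringIO()
result = count_rejected(lines, "[0-9]+\\t[0-9]+([.][0-9]+)?", header=True, output=out)
print(result, repr(out.getvalue()))
```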
- get_nb_readbytes()¶
Returns the number of read bytes.
- get_nb_readlines()¶
Returns the number of read lines.
- guess_columns(nb=100, force_header=False, changes=None, force_noheader=False, fields=None, regex=None, force_sep=None, mistake=3)¶
Guesses the columns type.
- Parameters:
nb – number of lines to look at in order to find all the necessary elements
force_header – impose a header whether it is detected or not
changes – modify some column names, example { "query": "query___" }
force_noheader – there is no header at all
fields – name of the columns if there is no header (instead of c000, c001…)
regex – if the default expression for a field is not the expected one, change it by looking into regex
force_sep – force the separator to be the one chosen by the user (None by default)
mistake – no more than mistake failed conversions to numbers are allowed
- Returns:
4-tuple, see below
The returned result is a 4-tuple:
True or False: presence of a header (it means there is at least one numerical column)
column definition, { position : (name, type) } or { position : (name, (str, max_length*2)) }
separator
regex which allows the user to extract information from the file
The column separator is searched for among , | ; \t
Warning
The file must not be opened beforehand; it will be opened several times.
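Choosing the separator among the candidates , | ; \t can be approached by counting each candidate on the first lines. The sketch below keeps the candidate that occurs the same non-zero number of times on every line. This heuristic is an assumption chosen for illustration, it is not necessarily the rule guess_columns applies, and the name guess_separator is hypothetical.

```python
CANDIDATES = ",|;\t"


def guess_separator(lines, candidates=CANDIDATES):
    """Return the candidate with a constant, non-zero count on every line, or None."""
    best, best_count = None, 0
    for sep in candidates:
        counts = {line.count(sep) for line in lines}
        if len(counts) == 1:
            (count,) = counts
            if count > best_count:
                best, best_count = sep, count
    return best


sample = ["a;b;c", "1;2,5;3"]
print(repr(guess_separator(sample)))
```

In this sample the comma appears on one line only, so it is discarded and the semicolon is kept.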
- join(definition, output, missing_value='', unique=None, **param)¶
Joins several files together.
- Parameters:
definition – list of tuples (filename, this_column, file_column, prefix)
output – if None, return the results as a list, otherwise save it into output
param – parameter used to open files
missing_value – specify a value for the missing values
unique – if unique is a column name, do not process a line whose value has already been processed, None otherwise
- Returns:
columns, matrix or number of missing values
We assume that every file starts with a header giving the column names. The function associates the this_column value with file_column and appends all the columns from filename with a prefix. We also assume that values in file_column are unique.
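The association described above, matching this_column against file_column and appending prefixed columns, can be sketched on rows held in memory as dictionaries. This illustrates the join logic under the stated uniqueness assumption. It does not read files and is not the method's implementation; the name join_rows and the sample data are hypothetical.

```python
def join_rows(main_rows, other_rows, this_column, file_column, prefix, missing_value=""):
    """Append the columns of other_rows, prefixed, to each row of main_rows."""
    # Values in file_column are assumed unique, so a plain dict index is enough.
    index = {row[file_column]: row for row in other_rows}
    extra_columns = list(other_rows[0]) if other_rows else []
    joined, missing = [], 0
    for row in main_rows:
        match = index.get(row[this_column])
        if match is None:
            missing += 1
        merged = dict(row)
        for name in extra_columns:
            merged[prefix + name] = missing_value if match is None else match[name]
        joined.append(merged)
    return joined, missing


people = [{"id": "1", "name": "ann"}, {"id": "2", "name": "bob"}]
scores = [{"key": "1", "score": "9"}]
joined, missing = join_rows(people, scores, "id", "key", "s_", missing_value="NA")
print(joined, missing)
```

Rows without a match keep their own columns and receive missing_value in every appended column.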
- open()¶
Opens the file in reading mode.
- readlines()¶
Extracts all the lines. The file must not be opened through method open. \n are removed.