module `sql.file_text_binary`¶

Short summary¶

module pyensae.sql.file_text_binary

Contains a class which opens a text file as a binary file.

Classes¶

class	truncated documentation
`TextFile`	This class opens a text file as if it were a binary file. It can deal with null characters which are missed by open …

Methods¶

method	truncated documentation
`__init__`
`__iter__`	Iterator
`_build_regex`	Builds a regular expression.
`_count_s`	Returns the number of every character in car.
`_get_type`	Guesses the type of value s.
`_interpret`	Splits a line into a list, separator `\t`.
`_interpret_columns`	Interprets the first line which contains the columns name.
`_load`	load…
`close`	Closes the file.
`count_rejected_lines`	Counts the number of rejected lines by regular expression exp.
`get_nb_readbytes`	Returns the number of read bytes.
`get_nb_readlines`	Returns the number of read lines.
`guess_columns`	Guesses the columns type.
`join`	Joins several files together.
`open`	Opens the file in reading mode.
`readlines`	Extracts all the lines; the file must not be opened through method open. The `\n` characters are removed.

Documentation¶

Contains a class which opens a text file as a binary file.

source on GitHub

class pyensae.sql.file_text_binary.TextFile(filename, errors=None, fLOG=<function noLOG>, buffer_size=1048576, filter=None, separated=False, encoding='utf-8')¶

Bases: object

This class opens a text file as if it were a binary file. It can deal with null characters, which the open function misses.

attribute	meaning
filename	file name
errors	decoding in utf8 can raise some errors, see str to understand the meaning of this parameter
LOG	logging function
_buffer_size	read a text file _buffer_size bytes each time
_filter	filter function: None, or a function returning True or False to tell whether a line should be considered or not
_encoding	encoding

Example:

f = TextFile(filename)
f.open()
# each iteration yields one line of the file
for line in f:
    print(line)
f.close()
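
To illustrate what reading a text file "as if it were a binary file" can look like, here is a minimal standalone sketch. It is not the code of `TextFile`: the function name `iter_text_lines` is hypothetical, and only the parameter names `buffer_size`, `encoding` and `errors` are borrowed from the signature above. The file is read `buffer_size` bytes at a time, split on the newline byte, and each line is decoded afterwards, so a null character is simply one more byte in the line.

```python
def iter_text_lines(path, buffer_size=1048576, encoding="utf-8", errors="strict"):
    """Yield decoded lines from path, reading raw bytes buffer_size at a time.

    Hypothetical helper, written only to illustrate the idea.
    """
    pending = b""  # bytes read so far that do not yet form a complete line
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(buffer_size)
            if not chunk:
                break
            pending += chunk
            # everything before the last b"\n" is complete, the rest is kept
            *complete, pending = pending.split(b"\n")
            for raw in complete:
                yield raw.rstrip(b"\r").decode(encoding, errors)
    if pending:
        # last line without a trailing end of line
        yield pending.rstrip(b"\r").decode(encoding, errors)
```

Splitting on the byte `\n` before decoding is safe for UTF-8 because that byte never appears inside a multi-byte sequence; other encodings may need a different approach.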

source on GitHub

Parameters:

filename – filename
errors – see str (errors = …)
fLOG – LOG function, see fLOG
buffer_size – buffer_size (mostly used to test the reading function)
filter – None if there is no filter, otherwise it is a function which takes a list and returns a boolean which tells if the line must be considered or not
separated – if True, the lines returned by the iterator are split by the most probable separator
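
As an illustration of the `filter` parameter, the sketch below defines a function with the described shape: it takes a list and returns a boolean. The name `keep_complete_rows` and the sample rows are hypothetical, and the function is applied here to plain Python lists rather than to a `TextFile`, since the exact moment the class calls the filter is not described on this page.

```python
def keep_complete_rows(fields):
    """Return True when every field of the row is non-empty."""
    return all(field.strip() for field in fields)


# hypothetical rows, as they might look once a line has been split
rows = [["1", "a"], ["2", ""], ["3", "c"]]
kept = [row for row in rows if keep_complete_rows(row)]
```

Such a function would presumably be passed as `TextFile(filename, filter=keep_complete_rows)`.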

source on GitHub

__init__(filename, errors=None, fLOG=<function noLOG>, buffer_size=1048576, filter=None, separated=False, encoding='utf-8')¶

Parameters:

filename – filename
errors – see str (errors = …)
fLOG – LOG function, see fLOG
buffer_size – buffer_size (mostly used to test the reading function)
filter – None if there is no filter, otherwise it is a function which takes a list and returns a boolean which tells if the line must be considered or not
separated – if True, the lines returned by the iterator are split by the most probable separator

source on GitHub

__iter__()¶

Iterator

f = open('...', 'r')
for line in f :
    ...
f.close ()

Returns: a str string

source on GitHub

_build_regex(sep, columns, exp={<class 'int'>: '([-]?[1-9][0-9]*?)|(0?)', <class 'decimal.Decimal'>: '([-]?[1-9][0-9]*?L?)|(0?)', <class 'float'>: '[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?', <class 'str'>: '.*'}, nomore=False, regex=None)¶

Builds a regular expression.

Parameters:

sep – separator
columns – columns definition
exp – regular expression associated to each type, (see below for the default value)
nomore – private argument, no more try, not possible to simplify
regex – if the default expression for a field is not the expected one, look into regex if there is one

Returns:

regex

Default value for exp:

{
    int:             "([-]?[1-9][0-9]*?)|(0?)",
    decimal.Decimal: "([-]?[1-9][0-9]*?L?)|(0?)",
    float:           "[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?",
    str:             ".*"
}
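
The per-type expressions above can be combined into one expression per line. The sketch below shows one possible way to do it and is not the body of `_build_regex`: the function name `build_line_regex` is hypothetical, and the `columns` argument is assumed to follow the `{ position : (name, type) }` layout described for `guess_columns`. Each field expression is wrapped in a named group, which also keeps the top-level `|` of the `int` pattern from leaking into the neighbouring fields.

```python
import decimal
import re

# default expressions, copied from the documentation above
DEFAULT_EXP = {
    int: "([-]?[1-9][0-9]*?)|(0?)",
    decimal.Decimal: "([-]?[1-9][0-9]*?L?)|(0?)",
    float: "[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?",
    str: ".*",
}


def build_line_regex(sep, columns, exp=DEFAULT_EXP):
    """Build one compiled regex matching a whole line (hypothetical helper)."""
    parts = []
    for position in sorted(columns):
        name, typ = columns[position]
        if isinstance(typ, tuple):
            # assumed layout (str, max_length * 2): only the type is needed
            typ = typ[0]
        parts.append("(?P<%s>%s)" % (name, exp[typ]))
    return re.compile("^" + re.escape(sep).join(parts) + "$")


regex = build_line_regex("\t", {0: ("id", int), 1: ("price", float), 2: ("label", str)})
match = regex.match("12\t3.5\thello")
```

Column names must be valid Python identifiers for the named groups to compile.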

source on GitHub

_build_regex_default_value_types = {<class 'int'>: '([-]?[1-9][0-9]*?)|(0?)', <class 'decimal.Decimal'>: '([-]?[1-9][0-9]*?L?)|(0?)', <class 'float'>: '[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?', <class 'str'>: '.*'}¶

_count_s(car)¶

Returns the number of every character in car.

source on GitHub

_get_type(s)¶

Guesses the type of value s.
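
A common way to guess the type of a string value is to try conversions from the most restrictive type to the least restrictive one. The sketch below follows that idea; the name `guess_type` is hypothetical and the actual rules applied by `_get_type` may differ.

```python
def guess_type(s):
    """Return int, float or str depending on which conversion succeeds first."""
    for cast in (int, float):
        try:
            cast(s)
            return cast
        except ValueError:
            # conversion failed, try the next candidate type
            pass
    return str
```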

source on GitHub

_interpret(line)¶

Splits a line into a list, separator \t.

Parameters: line – string
Returns: list

source on GitHub

_interpret_columns(line)¶

Interprets the first line which contains the columns name.

Parameters: line – string
Returns: dictionary { name:position }
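
The returned mapping can be pictured with the short sketch below. The function name `interpret_columns` and its `sep` argument are hypothetical; the sketch assumes a tab-separated header line.

```python
def interpret_columns(line, sep="\t"):
    """Map each column name of a header line to its position."""
    names = line.rstrip("\r\n").split(sep)
    return {name: position for position, name in enumerate(names)}
```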

source on GitHub

_load(filename, this_column, file_column, prefix, **param)¶

load…

source on GitHub

_sep_available = '\t;,| '¶

_split_expr = re.compile('\\r?\\t')¶
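
The pattern stored in `_split_expr` matches a tab optionally preceded by a carriage return. The snippet below shows its effect with the standard `re` module; the sample line is made up, and how the class itself uses the attribute is not shown on this page.

```python
import re

split_expr = re.compile('\\r?\\t')

# a stray carriage return before a tab is swallowed together with the tab
fields = split_expr.split("a\tb\r\tc")
```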

close()¶

Closes the file.

source on GitHub

count_rejected_lines(header, exp, output=None)¶

Counts the number of rejected lines by regular expression exp.

Parameters:

header – header or not in the first line
exp – regular expression
output – if not None, output is a stream which will receive the unrecognized lines (see below)

Returns:

nb_accepted, nb_rejected

Format for the file containing the unrecognized lines:

line number           line
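
The counting logic can be sketched on an in-memory list of lines. This is not the method's code: the function name `count_rejected` is hypothetical, line numbers are assumed to start at 0, and a tab is assumed between the line number and the line in the output.

```python
import io
import re


def count_rejected(lines, header, exp, output=None):
    """Return (nb_accepted, nb_rejected) for lines tested against exp."""
    regex = re.compile(exp)
    accepted = rejected = 0
    for number, line in enumerate(lines):
        if header and number == 0:
            # the header line is not tested against the expression
            continue
        if regex.fullmatch(line):
            accepted += 1
        else:
            rejected += 1
            if output is not None:
                # assumed format: line number, tab, line
                output.write("%d\t%s\n" % (number, line))
    return accepted, rejected


stream = io.StringIO()
result = count_rejected(["id\tname", "1\ta", "x\tb", "2\tc"], True, r"[0-9]+\t.*", stream)
```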

source on GitHub

get_nb_readbytes()¶

Returns the number of read bytes.

source on GitHub

get_nb_readlines()¶

Returns the number of read lines.

source on GitHub

guess_columns(nb=100, force_header=False, changes=None, force_noheader=False, fields=None, regex=None, force_sep=None, mistake=3)¶

Guesses the columns type.

Parameters:

nb – number of lines to examine in order to find all the necessary elements
force_header – impose a header whether it is detected or not
changes – modify some column names, example { "query": "query___" }
force_noheader – there is no header at all
fields – name of the columns if there is no header (instead of c000, c001…)
regex – if the default expression for a field is not the expected one, change by looking into regex
force_sep – force the separator to be the one chosen by the user (None by default)
mistake – at most mistake failed conversions to numbers are allowed

Returns:

4-tuple, see below

The returned result is a 4-tuple:

True or False: presence of a header (it means there is at least one numerical column)
column definition { position : (name, type) } or { position : (name, (str, max_length*2)) }
separator
regex which allows the user to extract information from the file

The column separator is searched for among , | ; \t

Warning

The file must not be opened beforehand: the method opens it several times.
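
One simple heuristic for picking the most probable separator is to keep the candidate that appears on every sampled line, preferring the one with the highest guaranteed count. The sketch below implements that heuristic only; the function name `guess_separator` is hypothetical, the candidate list is copied from `_sep_available`, and the rule actually used by `guess_columns` may be different.

```python
def guess_separator(lines, candidates="\t;,| "):
    """Return the candidate present on every line with the highest minimum count."""
    if not lines:
        return None
    best, best_score = None, 0
    for sep in candidates:
        # a separator must appear on every line, so the minimum count is the score
        score = min(line.count(sep) for line in lines)
        if score > best_score:
            best, best_score = sep, score
    return best
```

The function returns None when no candidate appears on every line.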

source on GitHub

join(definition, output, missing_value='', unique=None, **param)¶

Joins several files together.

Parameters:

definition – list of tuples: filename, this_column, file_column, prefix
output – if None, return the results as a list, otherwise save it into output
param – parameter used to open files
missing_value – specify a value for the missing values
unique – if unique is a column name, do not process a line whose value has already been processed, None otherwise

Returns:

columns, matrix or number of missing values

We assume that every file starts with a header giving the column names. The function associates the this_column value to file_column and appends all the columns from filename with a prefix. We also assume values in file_column are unique.
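
The association described above can be sketched on in-memory tables. This is an illustration under stated assumptions, not the method's code: the function name `join_rows` and its table arguments are hypothetical, only `this_column`, `file_column`, `prefix` and `missing_value` come from the documentation, and the handling of `unique` is left out.

```python
def join_rows(main_header, main_rows, other_header, other_rows,
              this_column, file_column, prefix, missing_value=""):
    """Append the prefixed columns of a second table to each row of the first."""
    key_pos = main_header.index(this_column)
    other_key_pos = other_header.index(file_column)
    # index the second table by its key; values in file_column are assumed unique
    lookup = {row[other_key_pos]: row for row in other_rows}
    columns = main_header + [prefix + name for name in other_header]
    matrix, nb_missing = [], 0
    for row in main_rows:
        match = lookup.get(row[key_pos])
        if match is None:
            # no matching key: fill the appended columns with missing_value
            nb_missing += 1
            match = [missing_value] * len(other_header)
        matrix.append(list(row) + list(match))
    return columns, matrix, nb_missing


columns, matrix, nb_missing = join_rows(
    ["id", "city"], [["1", "Paris"], ["2", "Lyon"]],
    ["key", "pop"], [["1", "2.1M"]],
    this_column="id", file_column="key", prefix="x_")
```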

source on GitHub

open()¶

Opens the file in reading mode.

source on GitHub

readlines()¶

Extracts all the lines; the file must not be opened through method open. The \n characters are removed.

source on GitHub
