module sql.file_text_binary

Short summary

module pyensae.sql.file_text_binary

Contains a class which opens a text file as a binary file.

source on GitHub

Classes

Class, with truncated documentation:

  • TextFile – This class opens a text file as if it were a binary file. It can deal with null characters which are missed by open …

Methods

Methods, with truncated documentation:

  • __init__
  • __iter__ – Iterator
  • _build_regex – Builds a regular expression.
  • _count_s – Returns the number of every character in car.
  • _get_type – Guesses the type of value s.
  • _interpret – Splits a line into a list, separator \t.
  • _interpret_columns – Interprets the first line, which contains the column names.
  • _load – load…
  • close – Closes the file.
  • count_rejected_lines – Counts the number of lines rejected by regular expression exp.
  • get_nb_readbytes – Returns the number of read bytes.
  • get_nb_readlines – Returns the number of read lines.
  • guess_columns – Guesses the column types.
  • join – Joins several files together.
  • open – Opens the file in reading mode.
  • readlines – Extracts all the lines. The file must not have been opened through method open. Trailing \n are removed.

Documentation

Contains a class which opens a text file as a binary file.

class pyensae.sql.file_text_binary.TextFile(filename, errors=None, fLOG=<function noLOG>, buffer_size=1048576, filter=None, separated=False, encoding='utf-8')

Bases: object

This class opens a text file as if it were a binary file. It can deal with null characters, which are missed by the open function.

Attributes:
  • filename – file name

  • errors – decoding in UTF-8 can raise errors; see str to understand the meaning of this parameter

  • LOG – logging function

  • _buffer_size – the text file is read _buffer_size bytes at a time

  • _filter – filter function: None, or a function which returns True or False to tell whether a line should be considered or not

  • _encoding – encoding

Example:

from pyensae.sql.file_text_binary import TextFile

f = TextFile(filename)  # filename: path to an existing text file
f.open()
for line in f:
    print(line)
f.close()

Parameters:
  • filename – filename

  • errors – see str (errors = …)

  • fLOG – LOG function, see fLOG

  • buffer_size – size of the read buffer in bytes (mostly used to test the reading function)

  • filter – None if there is no filter, otherwise a function which takes a list and returns a boolean telling whether the line must be considered or not

  • separated – if True, the lines returned by the iterator are split by the most probable separator

  • encoding – encoding of the file ('utf-8' by default)
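The following is a minimal sketch of the general technique the class description refers to: reading a text file in binary mode, buffer_size bytes at a time, and decoding each line afterwards, so that null characters pass through untouched. It is not the code of TextFile; the function name iter_text_lines is hypothetical, and the sketch assumes an ASCII-compatible encoding such as UTF-8 in which the newline is the single byte 0x0A.

```python
import os
import tempfile


def iter_text_lines(path, buffer_size=1048576, encoding="utf-8", errors="strict"):
    """Yield decoded lines read from `path` in binary mode, chunk by chunk."""
    remainder = b""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(buffer_size)
            if not chunk:
                break
            # Split on the newline byte; the last piece may be an incomplete line.
            pieces = (remainder + chunk).split(b"\n")
            remainder = pieces.pop()
            for raw in pieces:
                yield raw.rstrip(b"\r").decode(encoding, errors)
    if remainder:
        # Last line of the file, which has no trailing newline.
        yield remainder.rstrip(b"\r").decode(encoding, errors)


# Demonstration with a null character and a deliberately tiny buffer.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\tb\n1\x002\t3\r\nlast")
demo_lines = list(iter_text_lines(tmp.name, buffer_size=4))
os.remove(tmp.name)
```

Using a very small buffer, as in the demonstration, is a convenient way to check that lines spanning several chunks are reassembled correctly.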

__init__(filename, errors=None, fLOG=<function noLOG>, buffer_size=1048576, filter=None, separated=False, encoding='utf-8')
Parameters:
  • filename – filename

  • errors – see str (errors = …)

  • fLOG – LOG function, see fLOG

  • buffer_size – size of the read buffer in bytes (mostly used to test the reading function)

  • filter – None if there is no filter, otherwise a function which takes a list and returns a boolean telling whether the line must be considered or not

  • separated – if True, the lines returned by the iterator are split by the most probable separator

  • encoding – encoding of the file ('utf-8' by default)

__iter__()

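Iterator over the lines of the file. It is used in the same way as the iterator of a file opened with the built-in open function:

f = open('...', 'r')
for line in f:
    ...
f.close()

Returns:

a str string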

_build_regex(sep, columns, exp={<class 'int'>: '([-]?[1-9][0-9]*?)|(0?)', <class 'decimal.Decimal'>: '([-]?[1-9][0-9]*?L?)|(0?)', <class 'float'>: '[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?', <class 'str'>: '.*'}, nomore=False, regex=None)

Builds a regular expression.

Parameters:
  • sep – separator

  • columns – columns definition

  • exp – regular expression associated with each type (see below for the default value)

  • nomore – private argument: no further attempt, the expression cannot be simplified any more

  • regex – if the default expression for a field is not the expected one, look into regex if there is one

Returns:

regex

Default value for exp:

{
    int:             "([-]?[1-9][0-9]*?)|(0?)",
    decimal.Decimal: "([-]?[1-9][0-9]*?L?)|(0?)",
    float:           "[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?",
    str:             ".*"
}
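As an illustration of how per-type expressions such as these can be combined, here is a minimal sketch which joins one named group per column with the escaped separator. It is not the implementation of _build_regex: the function name build_line_regex is hypothetical, the columns argument is assumed to follow the { position : (name, type) } form, and column names are assumed to be valid Python identifiers.

```python
import decimal
import re

# Default expressions per type, copied from the documentation above.
DEFAULT_EXP = {
    int: "([-]?[1-9][0-9]*?)|(0?)",
    decimal.Decimal: "([-]?[1-9][0-9]*?L?)|(0?)",
    float: "[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?",
    str: ".*",
}


def build_line_regex(sep, columns, exp=DEFAULT_EXP):
    """Build a regex matching one line; `columns` maps position -> (name, type)."""
    fields = []
    for position in sorted(columns):
        name, typ = columns[position]
        # One named group per column (assumption: names are valid identifiers).
        fields.append("(?P<%s>%s)" % (name, exp[typ]))
    # Anchor the pattern so that every field must match entirely.
    return re.compile("^" + re.escape(sep).join(fields) + "$")


pattern = build_line_regex("\t", {0: ("id", int), 1: ("score", float), 2: ("name", str)})
match = pattern.match("12\t3.5\tabc")
```

With this pattern, match.group("id") gives access to the first field, and a line whose first field is not an integer is not matched at all.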

_build_regex_default_value_types = {<class 'int'>: '([-]?[1-9][0-9]*?)|(0?)', <class 'decimal.Decimal'>: '([-]?[1-9][0-9]*?L?)|(0?)', <class 'float'>: '[-]?[0-9]*?([.][0-9]*?)?([eE][-]?[0-9]{0,4})?', <class 'str'>: '.*'}

_count_s(car)

Returns the number of every character in car.

_get_type(s)

Guesses the type of value s.
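One common way to guess the type of a string value is to try conversions from the most specific type to the least specific one. The sketch below shows that idea only; the function name guess_type is hypothetical, and the actual rules applied by _get_type may differ.

```python
def guess_type(s):
    """Return int, float or str depending on which conversion of `s` succeeds."""
    for typ in (int, float):
        try:
            typ(s)
            return typ
        except ValueError:
            # This conversion failed, try the next, less specific type.
            continue
    return str


guessed = [guess_type(v) for v in ("12", "-3.5", "1e3", "abc", "")]
```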

_interpret(line)

Splits a line into a list, separator \t.

Parameters:

line – string

Returns:

list
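The split can be sketched with the same pattern as the class attribute _split_expr listed on this page, which accepts an optional carriage return before each tab. The function name interpret is hypothetical and the real method may do more than this.

```python
import re

# Same pattern as the class attribute `_split_expr` listed on this page.
SPLIT_EXPR = re.compile("\\r?\\t")


def interpret(line):
    """Split `line` on tabs, dropping a carriage return placed just before a tab."""
    return SPLIT_EXPR.split(line)


fields = interpret("a\tb\r\tc")
```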

_interpret_columns(line)

Interprets the first line which contains the columns name.

Parameters:

line – string

Returns:

dictionary { name:position }
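A dictionary of this shape can be built from a header line as in the short sketch below. The function name interpret_columns and its sep argument are assumptions made for the illustration; the tab separator is taken from the description of _interpret.

```python
def interpret_columns(header_line, sep="\t"):
    """Map each column name found in `header_line` to its position."""
    names = header_line.rstrip("\r\n").split(sep)
    return {name: position for position, name in enumerate(names)}


positions = interpret_columns("query\tcount\tdate\n")
```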

_load(filename, this_column, file_column, prefix, **param)

load…

_sep_available = '\t;,| '

_split_expr = re.compile('\\r?\\t')

close()

Closes the file.

count_rejected_lines(header, exp, output=None)

Counts the number of rejected lines by regular expression exp.

Parameters:
  • header – whether the first line is a header or not

  • exp – regular expression

  • output – if not None, output is a stream which receives the unrecognized lines (see below)

Returns:

nb_accepted, nb_rejected

Format for the file containing the unrecognized lines:

line number           line
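The counting logic can be sketched on a plain list of lines as follows. This is an illustration under assumptions, not the method itself: the function name count_rejected is hypothetical, line numbers start at 0, the header line is skipped rather than counted, and the line number and the line are assumed to be separated by a tab in the output.

```python
import io
import re


def count_rejected(lines, exp, header=False, output=None):
    """Return (nb_accepted, nb_rejected) for `lines` checked against regex `exp`."""
    pattern = re.compile(exp)
    nb_accepted = nb_rejected = 0
    for number, line in enumerate(lines):
        if header and number == 0:
            continue  # the header line is neither accepted nor rejected here
        if pattern.match(line):
            nb_accepted += 1
        else:
            nb_rejected += 1
            if output is not None:
                # Assumed layout: line number, a tab, then the line itself.
                output.write("%d\t%s\n" % (number, line))
    return nb_accepted, nb_rejected


rejected = io.StringIO()
counts = count_rejected(["id\tx", "1\t2.5", "oops", "3\t4"], r"^[0-9]+\t[0-9.]+$",
                        header=True, output=rejected)
```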

get_nb_readbytes()

Returns the number of read bytes.

get_nb_readlines()

Returns the number of read lines.

guess_columns(nb=100, force_header=False, changes=None, force_noheader=False, fields=None, regex=None, force_sep=None, mistake=3)

Guesses the columns type.

Parameters:
  • nb – number of lines to inspect in order to find all the necessary elements

  • force_header – impose a header whether it is detected or not

  • changes – modify some column names, example { "query": "query___" }

  • force_noheader – there is no header at all

  • fields – name of the columns if there is no header (instead of c000, c001…)

  • regex – if the default expression for a field is not the expected one, change by looking into regex

  • force_sep – force the separator to be the one chosen by the user (None by default)

  • mistake – no more than mistake failed conversions into numbers are allowed

Returns:

4-tuple, see below

The returned result is a 4-tuple:

  • True or False: presence of a header (it means there is at least one numerical column)

  • column definition { position : (name, type) } or { position : (name, (str, max_length*2)) }

  • separator

  • regex which allows the user to extract information from the file

The column separator is searched for among , | ; \t

Warning

The file must not already be opened; this method opens it several times.
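A simple heuristic for choosing the most probable separator among the candidates , | ; \t is to keep the one which appears in every inspected line. The sketch below scores each candidate by its smallest number of occurrences per line. It is an assumed heuristic written for illustration, with a hypothetical function name guess_separator, and it is not the logic of guess_columns.

```python
def guess_separator(lines, candidates=",|;\t"):
    """Return the candidate found most consistently in every line, or None."""
    if not lines:
        return None
    best, best_score = None, 0
    for sep in candidates:
        # Score a candidate by its smallest number of occurrences per line.
        score = min(line.count(sep) for line in lines)
        if score > best_score:
            best, best_score = sep, score
    return best


sample = ["name;city;amount", "Ann;Paris;1,5", "Bob;Oslo|Bergen;2"]
separator = guess_separator(sample)
```

A candidate which is missing from even one line gets a score of zero, which keeps characters used inside values, such as a decimal comma, from being selected.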

join(definition, output, missing_value='', unique=None, **param)

Joins several files together.

Parameters:
  • definition – list of tuples (filename, this_column, file_column, prefix)

  • output – if None, return the results as a list, otherwise save it into output

  • param – parameter used to open files

  • missing_value – specify a value for the missing values

  • unique – if unique is a column name, do not process a line whose value has already been processed, None otherwise

Returns:

columns, matrix or number of missing values

We assume that every file starts with a header giving the column names. The function associates this_column value to file_column and appends all the columns from filename with a prefix. We also assume that the values in file_column are unique.
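The association described above can be sketched on in-memory rows, each row being a dictionary built from a header. This is a simplified illustration and not the join method: the function name join_rows is hypothetical, a single lookup table is joined, and the key column of the lookup table is not repeated in the result, which is a choice made for this sketch.

```python
def join_rows(rows, this_column, lookup_rows, file_column, prefix, missing_value=""):
    """Append to each row the prefixed columns of the matching lookup row."""
    # Values in file_column are assumed unique, so a plain dict is enough.
    index = {row[file_column]: row for row in lookup_rows}
    first = lookup_rows[0] if lookup_rows else {}
    extra = [column for column in first if column != file_column]
    joined, nb_missing = [], 0
    for row in rows:
        match = index.get(row[this_column])
        if match is None:
            nb_missing += 1
        merged = dict(row)
        for column in extra:
            # Fall back on missing_value when no lookup row matches.
            merged[prefix + column] = match[column] if match is not None else missing_value
        joined.append(merged)
    return joined, nb_missing


users = [{"id": "1", "city": "P"}, {"id": "2", "city": "X"}]
cities = [{"code": "P", "name": "Paris"}]
joined, nb_missing = join_rows(users, "city", cities, "code", "c_")
```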

open()

Opens the file in reading mode.

readlines()

Extracts all the lines. The file must not have been opened through method open. Trailing \n are removed.
