Not so clean text to tables (pandas fails)

Links: notebook, html, PDF, python, slides, GitHub

An example where import_flatfile_into_database does a better job than pandas about converting a flat file to a table.

import random, pandas
text = [ "one","two","three","four","five","six","seven","eight","nine","ten" ]
data = [ { "name": text[random.randint(0,9)], "number": random.randint(0,99)} \
                        for i in range(0,10000) ]
df = pandas.DataFrame(data)
df.head(n=3)
name number
0 one 75
1 two 73
2 three 77
df.to_csv("flatfile.txt", sep="\t", encoding="utf8", header=True, index=False)
dfr = pandas.read_csv("flatfile.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
name number
0 one 75
1 two 73
2 three 77

Le’s assume now we introduce extra tabulations.

datatab = [ {"name": " one\ttab", "number":100 } ] + data
df = pandas.DataFrame(datatab)
df.head(n=3)
name number
0 one\ttab 100
1 one 75
2 two 73
df.to_csv("flatfile_tab.txt", sep="\t", encoding="utf8", header=True, index=False)
dfr = pandas.read_csv("flatfile_tab.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
name number
0 one\ttab 100
1 one 75
2 two 73

It works well because we use pandas to save the dataframe, and we use pandas to restore it. In the file flatfile_tab.txt, it looks like "on\te". pandas interprets the quotes as a delimiter. However most of the times, the flat file is produced in a different way and the quotes are not present.

with open("flatfile_tab.txt", "r", encoding="utf8") as f:
    content = f.read()
content = content.replace('"','')
with open("flatfile_tab2.txt", "w", encoding="utf8") as f:
    f.write(content)
dfr = pandas.read_csv("flatfile_tab2.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
name number
one tab 100.0
one 75 NaN
two 73 NaN

It failed! Data is not aligned and it did not raise an exception. If we move the extra tab in second position, we get:

datatab = data[:1] + [ {"name": " one\ttab", "number":100 } ] + data[1:]
df = pandas.DataFrame(datatab)
df.to_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8", header=True, index=False)
with open("flatfile_tab_pos2.txt","r",encoding="utf8") as f:
    content = f.read()
content = content.replace('"','')
with open("flatfile_tab_pos2.txt","w",encoding="utf8") as f:
    f.write(content)
dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8")
dfr.head(n=3)
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-8-473d60fe4f7d> in <module>()
----> 1 dfr = pandas.read_csv("flatfile_tab_pos2.txt", sep="\t", encoding="utf8")
      2 dfr.head(n=3)

c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654
--> 655         return _read(filepath_or_buffer, kwds)
    656
    657     parser_f.__name__ = name

c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    409
    410     try:
--> 411         data = parser.read(nrows)
    412     finally:
    413         parser.close()

c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1003                 raise ValueError('skipfooter not supported for iteration')
   1004
-> 1005         ret = self._engine.read(nrows)
   1006
   1007         if self.options.get('as_recarray'):

c:\Python36_x64\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1746     def read(self, nrows=None):
   1747         try:
-> 1748             data = self._reader.read(nrows)
   1749         except StopIteration:
   1750             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

As suggested in Python Pandas Error tokenizing data, we could add the parameter error_bad_lines=False or skiprows=N but we would still lose those bad lines. So we use function import_flatfile_into_database.

from pyensae.sql import import_flatfile_into_database
import_flatfile_into_database("flatfile_tab_pos2.db3", "flatfile_tab_pos2.txt")
  TextFile: opening file  flatfile_tab_pos2.txt
  TextFile.guess_columns: processing file  flatfile_tab_pos2.txt
  TextFile: opening file  flatfile_tab_pos2.txt
  TextFile.guess_columns: using  101  lines
  TextFile: closing file  flatfile_tab_pos2.txt
  TextFile.guess_columns: sep  't' nb cols 2  bestnb  100  more  {('t', 1): 100, ('t', 2): 1}
  TextFile.guess_columns: header  True  columns  {0: ('name', <class 'str'>), 1: ('number', <class 'int'>)}
  compiling ^(?P<name>.*)t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
  TextFile.guess_columns: regex  ^(?P<name>.*)t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
  TextFile.guess_columns: header  True  columns  {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
  [_guess_columns] sep=['t']
  TextFile: closing file  flatfile_tab_pos2.txt
  [_guess_columns] columns_name=None
   guess with  1001 lines
   count_types  {0: {<class 'str'>: 1000}, 1: {<class 'int'>: 999, <class 'str'>: 1}}
   columns  {0: ('name', <class 'str'>), 1: ('number', <class 'int'>)}
   guess {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
SQL  'CREATE TABLE flatfile_tab_pos2(name TEXT,'
'       number INTEGER);'
   column_has_space False ['name', 'number']
   changes {}
    TextFileColumns (2): regex:  {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
  TextFile.guess_columns: processing file  flatfile_tab_pos2.txt
  TextFile: opening file  flatfile_tab_pos2.txt
  TextFile.guess_columns: using  101  lines
  TextFile: closing file  flatfile_tab_pos2.txt
  TextFile.guess_columns: sep  't' nb cols 2  bestnb  100  more  {('t', 1): 100, ('t', 2): 1}
  TextFile.guess_columns: header  True  columns  {0: ('name', <class 'str'>), 1: ('number', <class 'int'>)}
  compiling ^(?P<name>.*)t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
  TextFile.guess_columns: regex  ^(?P<name>.*)t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
  TextFile.guess_columns: header  True  columns  {0: ('name', (<class 'str'>, 10)), 1: ('number', <class 'int'>)}
  TextFile: opening file  flatfile_tab_pos2.txt
^(?P<name>.*)t(?P<number>([-]?[1-9][0-9]*?)|(0?))$
error regex 0 unable to interpret line  10002 :  ''
  TextFile: closing file  flatfile_tab_pos2.txt
10001  lines imported
'flatfile_tab_pos2'

We check that we got the inserted line in the dataframe:

from pyensae.sql import Database
db = Database("flatfile_tab_pos2.db3")
db.connect()
df = db.to_df("SELECT * FROM flatfile_tab_pos2")
db.close()
df.head()
SQL  'SELECT * FROM flatfile_tab_pos2'
name number
0 one 75
1 one\ttab 100
2 two 73
3 three 77
4 six 38