Monday, March 23, 2015

Unit Test with Pandas Dataframe to read *.csv files

I often concatenate many *.csv files vertically in Pandas. Every time I do this, I have to check that all the files I am concatenating have the same number of columns. This has become cumbersome, because I have to figure out a way to ignore the files with more or fewer columns than the number I specify. For example, the first 10 files have 4 columns, but then file #11 has 8 columns and file #54 has 7 columns. This means I have to load all the files, even the ones that have the wrong number of columns. I want to avoid loading those files and then trying to concatenate them vertically - I want to skip them completely.


So, I am trying to write a unit test with Pandas that will:
a. check the size of all the *.csv files in some folder
b. ONLY read in the files that have a pre-determined number of columns
c. print a message indicating the names of the *.csv files that have the wrong number of columns
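Steps a to c can be sketched without loading the rejected files by reading only the header row of each file. This is a minimal sketch, not a definitive solution: it assumes every file has a header row, and the helper name `concat_matching_csvs` and its parameters `folder` and `expected_cols` are hypothetical names chosen for illustration.

```python
import os

import pandas as pd


def concat_matching_csvs(folder, expected_cols):
    """Concatenate only the CSV files in `folder` that have `expected_cols` columns.

    Hypothetical helper. Returns (combined DataFrame, list of skipped file names).
    """
    kept, skipped = [], []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".csv"):
            continue
        path = os.path.join(folder, name)
        # nrows=0 parses only the header row, so the data rows are not loaded.
        n_cols = pd.read_csv(path, nrows=0).shape[1]
        if n_cols == expected_cols:
            kept.append(pd.read_csv(path))
        else:
            print("Skipping %s: %d columns, expected %d" % (name, n_cols, expected_cols))
            skipped.append(name)
    # pd.concat raises on an empty list, so fall back to an empty frame.
    combined = pd.concat(kept, ignore_index=True) if kept else pd.DataFrame()
    return combined, skipped
```

Note that a file written with `to_csv` and the default `index=True` contains one extra column for the index, so the expected count has to include it unless the files are written with `index=False`.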


Here is what I have (I am working in the folder C:\Users\Downloads):



import unittest
import numpy as np
import pandas as pd
from os import listdir

# Create csv files:
df1 = pd.DataFrame(np.random.rand(10,4), columns = ['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(10,3), columns = ['A', 'B', 'C'])
df1.to_csv('test1.csv')
df2.to_csv('test2.csv')

class Conct(unittest.TestCase):
    """Tests for `primes.py`."""

    # Class-level setup: list the *.csv files in the working folder.
    TEST_INP_DIR = r'C:\Users\Downloads'
    fns = listdir(TEST_INP_DIR)
    t_fn = [fn for fn in fns if fn.endswith(".csv")]
    print(t_fn)
    dfb = pd.DataFrame()

    def setUp(self):
        for elem in Conct.t_fn:
            print(elem)
            fle = pd.read_csv(elem)
            try:
                pd.concat([Conct.dfb, fle], axis=0, join='outer', ignore_index=True, verify_integrity=False)
            except IOError:
                print('Error: unable to concatenate a file with %s columns.' % fle.shape[1])
                self.err_file = fle

    def tearDown(self):
        del self.err_file

if __name__ == '__main__':
    unittest.main()


Problem: I am getting this output:



['test1.csv', 'test2.csv']

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK


The first print statement works - it is printing a list of *.csv files, as expected. But, for some reason, the second and third print statements do not work.
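The "Ran 0 tests" line is a hint here: `unittest` calls `setUp` only before each test method, and by default it treats only methods whose names start with `test` as tests. A class with a `setUp` but no test method therefore never executes the body of `setUp`. The sketch below illustrates this with two toy classes; the names `NoTests`, `OneTest` and `count_tests` are hypothetical and exist only for this illustration.

```python
import unittest


class NoTests(unittest.TestCase):
    def setUp(self):
        print("setUp in NoTests")  # never printed: the class has no test method


class OneTest(unittest.TestCase):
    def setUp(self):
        print("setUp in OneTest")  # printed once, right before test_something

    def test_something(self):
        self.assertTrue(True)


def count_tests(case_class):
    """Return how many test methods the default loader finds in a class."""
    loader = unittest.TestLoader()
    return loader.loadTestsFromTestCase(case_class).countTestCases()


print(count_tests(NoTests))  # 0
print(count_tests(OneTest))  # 1
```

Running `unittest.main()` on a module that contains only a class like `NoTests` reports "Ran 0 tests", which matches the output shown above.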


Also, the concatenation should not have gone through: the second file has 3 columns, but the first one has 4 columns. The message in the `except IOError` branch is never printed.
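One likely reason the error branch stays silent, even when the code does run, is that `pd.concat` with `join='outer'` does not treat a column mismatch as an error: it keeps the union of the columns and fills the gaps with NaN. `IOError` signals input/output failures such as a missing file, not a shape mismatch. The sketch below shows the behaviour on two small frames; the names `df4`, `df3` and `same_width` are illustrative assumptions, not part of the code above.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame(np.zeros((2, 4)), columns=["A", "B", "C", "D"])
df3 = pd.DataFrame(np.ones((2, 3)), columns=["A", "B", "C"])

# join="outer" keeps the union of the columns; no exception is raised.
combined = pd.concat([df4, df3], axis=0, join="outer", ignore_index=True)

print(combined.shape)              # (4, 4)
print(combined["D"].isna().sum())  # 2: column D is NaN for the rows from df3

# An explicit comparison is one way to detect the mismatch before concatenating.
same_width = df4.shape[1] == df3.shape[1]
print(same_width)                  # False
```

`pd.concat` also returns a new DataFrame instead of modifying its inputs, so a call whose return value is not assigned to anything leaves the existing frames unchanged.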


How can I use a Python unittest to check each of the *.csv files to make sure that they have the same number of columns before concatenation? And how can I print the appropriate error message at the correct time?

