# Split CSV File Into Multiple Files Using Python

A few days ago I wrote about a small Python class I created to merge multiple CSV files into one large file, and I had tried to make it a little bit extensible. Ironically, only a few days later I found myself needing to do the exact opposite: split a large CSV file into smaller chunks. I’m going to walk you through some of the changes, and the thinking I went through, while updating the class. You can find the original article and class posted here.

## Arguments using string formatting

I also felt the need for finer control over how a user passes the base file name to the class. In the first version, a user who wanted to merge a group of CSV files would create the class and pass in a base_name string. The class then looped through the files by concatenating that string with an index value and the “.csv” extension: base_name + str(i) + ".csv". Sort of lame, to be honest. This forced every file to fit a rigid naming convention and gave the user little control if they wanted to work with files named like “file-name (2).csv”, so I decided to upgrade that as well using string formatting.

Alright, there are two ways a developer might choose to format strings in Python, and I chose the less error-prone one: the string type’s [.format()](https://docs.python.org/2/library/functions.html#format) method. The other, stricter choice is classic string formatting with the modulo operator and a tuple containing the values.
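To make the two styles concrete, here is a minimal sketch of each. The file name pattern and the index value are made up for illustration and are not taken from the class itself.

```python
# Hypothetical value, used only for illustration.
index = 2

# Method 1: str.format() with {} replacement fields.
formatted_new = "my-file ({}).csv".format(index)

# Method 2: classic formatting with the modulo operator and a tuple.
formatted_old = "my-file (%d).csv" % (index,)

print(formatted_new)  # my-file (2).csv
print(formatted_old)  # my-file (2).csv
```

Both lines produce the same string here; the differences only show up when the arguments do not match what the pattern asks for.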

With the first method we can pass in more values than the string being formatted asks for without causing errors (the extras are simply ignored), and we can pass in any type of value. Passing in too few values still raises an IndexError. The second method requires us to be explicit about which types of values we expect and how many: %d for an integer, %f for a float, %s for a string… I think of this type of formatting as “declarative string formatting”, and that’s how I differentiate the two in my head. Declarative string formatting may be a little faster, but it raises a TypeError if it isn’t passed exactly the number of values it expects, or if a value doesn’t fit a conversion such as %d.
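To see exactly where each style complains, the sketch below triggers the relevant exceptions on purpose. The patterns and values are hypothetical, and the variable names exist only so the caught exceptions can be inspected afterwards.

```python
# str.format(): a surplus positional argument is silently ignored...
print("file-{}.csv".format(1, "unused"))  # file-1.csv

# ...but too few arguments raise IndexError.
try:
    "file-{}-{}.csv".format(1)
except IndexError as exc:
    format_error = exc

# Modulo formatting: a surplus argument raises TypeError.
try:
    "file-%d.csv" % (1, "unused")
except TypeError as exc:
    modulo_count_error = exc

# Modulo formatting: %d rejects a non-numeric value with TypeError.
try:
    "file-%d.csv" % ("one",)
except TypeError as exc:
    modulo_type_error = exc
```

So format() is forgiving about extra values, while the modulo operator insists on an exact match in both count and type.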

The string type’s format() method is more forgiving, and when writing a module that relies on other people’s input it’s better to be forgiving (unless those values are going to be used in database queries or evaluated statements, in which case you can’t trust nobody no how!).

In this newer version, which is included at the bottom of the post and in the Gist on GitHub, the user can pass in base_name="my-file ({}).csv", which would match files such as “my-file (1).csv”, “my-file (2).csv” and so on.
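As a quick illustration of how such a pattern expands, here is a small sketch. It assumes a pattern with a space before the parenthesis and an index starting at 1; the class itself may number files differently.

```python
# Hypothetical pattern; {} marks where the file index goes.
base_name = "my-file ({}).csv"

# Build the names of the first three files in the sequence.
file_names = [base_name.format(i) for i in range(1, 4)]

print(file_names)
# ['my-file (1).csv', 'my-file (2).csv', 'my-file (3).csv']
```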

## CSV split functionality explained

OK folks, onto the added functionality: the newly added split method.

Besides the implicit self argument, the only argument we have to think about here is chunk_size, which lets the user say how many rows should go in each of the split files. In the next update I think I’ll also let the user decide how many files to generate and have the class figure out how many lines to put in each. As in the original class, which combined multiple files into one large file, we rely on the self.file_h file handle as the current working file. This is the file that gets split into many little files when the split method is called (or maybe just a few little files, who knows).

On each iteration we increment the line_num value to keep track of how many rows we have already read. When reading the first line we grab the CSV file_headers so that we can output them as the first row of each split file. That way every file has its own row of headers, which in most cases is required for the files to stay usable when read independently of each other.

To know when to stop writing to one file and begin another, we check for line_num % chunk_size == 1, which tells us that we’ve read another batch of rows equal to chunk_size. That is the whole splitting functionality: update the file name using the string formatting method explained at the top of the post, next_filename = self.save_name.format(file_num), and carry on writing. Once all the rows have been read we should close both open files, meaning the last file to receive output and the original file the lines were read from, with self.output_file.close() and self.file_h.close().
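The steps above can be sketched roughly as follows. This is not the original listing: the class name CsvSplitter and its constructor are hypothetical stand-ins, and only the names chunk_size, file_h, save_name, output_file, file_headers, line_num, file_num and next_filename come from the description. The rollover check is also a slight variation: it counts data rows only and tests (line_num - 1) % chunk_size == 0, which behaves like the == 1 check for larger chunks but also works when chunk_size is 1.

```python
class CsvSplitter(object):
    """Hypothetical stand-in for the class described in this post."""

    def __init__(self, source_path, save_name):
        self.file_h = open(source_path, "r")  # current working file
        self.save_name = save_name            # pattern such as "part-{}.csv"
        self.output_file = None

    def split(self, chunk_size):
        """Write chunk_size data rows per file; return the number of files."""
        file_headers = self.file_h.readline()  # first line holds the headers
        line_num = 0   # data rows read so far (header not counted)
        file_num = 0   # split files started so far
        for line in self.file_h:
            line_num += 1
            if (line_num - 1) % chunk_size == 0:
                # A full chunk has been written, so roll over to a new file.
                if self.output_file is not None:
                    self.output_file.close()
                file_num += 1
                next_filename = self.save_name.format(file_num)
                self.output_file = open(next_filename, "w")
                self.output_file.write(file_headers)
            self.output_file.write(line)
        # Close the source file and the last file that received output.
        self.file_h.close()
        if self.output_file is not None:
            self.output_file.close()
        return file_num
```

Under these assumptions, CsvSplitter("big.csv", "part-{}.csv").split(1000) would write part-1.csv, part-2.csv and so on, each starting with the header row followed by up to 1000 data rows.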

## The complete CSV Splitting/CSV Merging Class

Alright, without further ado, here is the module to date. It should work fine under 2.7, and I’m pretty sure it will work in Python 3 as well after I switch out the built-in IO functions for the Python csv module: