Split CSV File Into Multiple Files Using Python

A few days ago I wrote about a small Python class which I created to merge multiple CSV files into one large file. I had tried to make it extensible a little bit. Ironically only a few days later I found myself in a situation where I needed to do the exact opposite task and split a large csv file into smaller chunks. I’m going to walk you through some of the changes thinking I went through updating the class. You can find the original article and class posted here

A few days ago I wrote about a small Python class which I created to merge multiple CSV files into one large file. I had tried to make it extensible a little bit. Ironically only a few days later I found myself in a situation where I needed to do the exact opposite task and split a large csv file into smaller chunks. I’m going to walk you through some of the changes thinking I went through updating the class. You can find the original article and class posted here.

Arguments using string formatting

I felt the need to add finer customization in how a user would pass in their configuration for a base file name to the class as well. In the first version if a user wanted to merge a grouping of CSV files they would create the class and pass in a base_name string which the class would use to loop through their files by concatenation of an index value and “.csv” extension to loop through the CSV files to merge, base_name + i + ".csv", sort of lame to be honest. This required all files to fit a ridged naming convention and affords little control to the user should they want to work with files using a convention such as “file-name (2).csv” so I decided to upgrade that as well using string formatting.

Alright, there are 2 ways a developer might choose to format strings in Python and I chose the less error prone method which is using the string types [.format()](https://docs.python.org/2/library/functions.html#format) method. The other more strict choice is using classic string formatting with the modulo operator and a tuple containing values. Just to demonstrate two different ways to format strings:

1
2
3
4
5
for index in range(1, 33):
# (First Method) The format method of a string
filename = "my-file{}.csv".format(index)
# (Second Method) String formatter with modulo operator
filename = "my-file%d.csv" % (index,)

With the first method we can pass in more values or less values than the sting being formatted asks for and not have to worry about causing errors, also we could pass in any type of value. The second method of formatting requires us to be explicit about which types of values we are expecting and how many values will be expected in the string, %d for digit, %f for float, %s for string… I think of this type of formatting as “declarative string formatting” and that’s how I choose to differentiate the two in my head. Declarative string formatting will probably yield more speed but it also raises exceptions if it isn’t passed the exact amount and precisely expected types of input.

The string type format() method is more forgiving and when writing a module which relies on the input of other people it’s better to be forgiving, (unless those values are going to be used in database queries or evaluated statements), in which case you can’t trust nobody no how!

In this newer version which is included at the bottom of the post or the Gist on Github. the user can pass in base_name="my-file({}).csv" which would match files such as “my-file (1).csv, my-file (2).csv …” and so on.

CSV split functionality explained

OK folks, onto the added functionality. This is what the newly added split function looks like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
def split(self, chunk_size=500):
"""Split the current file into files of row size n where n is chunk_size
First use self.open() to open a file to read from and then call self.split()
:chunk_size int: How many lines per file.
"""
line_num = 0
file_num = 1
for line in self.file_h:
line_num += 1
if line_num == 1 or file_headers is None:
file_headers = line
if int(line_num) % int(chunk_size) == 1:
if self.output_file is not None:
self.output_file.close()
del self.output_file
next_filename = self.save_name.format(file_num)
print("About to write next {0} lines into file: {1}".format(chunk_size, next_filename))
self.output_file = open(next_filename, 'w+')
file_num += 1
# Write the file headers as the first line of each file
self.output_file.write(file_headers)
if line_num == 1:
# continue because we dont' want to write headers 2 times.
continue
self.output_file.write(line)
self.file_h.close()
self.output_file.close()

Besides the implicit self argument, the only argument we have to think about here is chunk_size which allows the user to say how many rows should be in each of the split files. I think the next update I’ll also let the user decide how many files to generate and have the class figure out how many lines to put in each. Like in the original class which combined multiple files into 1 large file, we rely on the self.file_h file handle as the current working file. This file will be split into many little files when the split method is called (or it may be split into a few little files maybe, who knows).

Each iteration we have to increment the line_num value to keep track of how many rows we have already read. When reading the first line we will want to get the CSV file_headers and output them as the first row of each split file so that each has its own row of headers which is required in most cases to maintain usability when read independently of each other.

To keep track of when to stop writing to one file and begining another file check for line_num % chunk_size == 1 which will tell us that we’ve read another amount of rows equal to chunk_size. That is the splitting functionality, just update the file name and use the string formatting method explained in the top of the post next_filename = self.save_name.format(file_num) and on completion of reading all the rows we should close both open files (this means the last file to receive output and the original file which all the lines were read from). self.file_h.close() and self.output_file.close().

The complete CSV Splitting/CSV Merging Class

Alright, without further to do here is the module to date. It should work fine under 2.7 - I’m pretty sure that it will work in Python 3 as well after I switch out the built-in IO functions for the Python csv module:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
# I haven't used csv in this class yet, but I am planning to integrate it in soon.
# since csv is just text file the class works fine with default IO functions.
import csv
import sys
class CSV_Monster:
# current filename, file handle, index, whether to maintain headers when merging
current_filename = ""
file_h = None
i = None
with_headers = False
def __init__(self, base_name, last_i, first_i=0, save_name="final-csv-output{}.csv",
skips=set(), zero_index_name=None):
"""
Prepare to write the files. Open the save name and wait for run() to be called on the
utility
:base_name str: The common file path and name of csv files to parse
:last_i int: The highest integer found in file names. such as base/name-csv-file (30).csv
:first_i int: My files start without a number so 1 is default.
:save_name str: The base name of the final output .csv if there is 1 singular file (merging)
:skips set: A set of indexes which should be skipped
:zero_index_name None|str: Pass in a starting name if there is a file to begin with which doesn't fit
convention in base_name
"""
self.base_name = base_name
self.save_name = save_name
self.last_i = last_i
self.i = first_i
# --- Skips are file numbers which should be ignored for any reason .
self.skips = skips
self.zero_index_name = zero_index_name
# --- Open up the output file for writing
print("Preparing to write to file: {}".format(self.save_name.format('')))
self.output_file = open(self.save_name.format(''), 'w')
def open(self, filename=None):
"""Open the self.current_filename for reading As of now this method is overkill,
there is no way to explicitly tell the class to append a file which isn't named
using the self.base_name convention. But I would like to extend the object to
be able to allow the user to pass in specific names.
:filename str: Pass filename to explicitly open. Default - self.current_filename
:{return} file_handler:
"""
if filename is not None:
self.current_filename = filename
self.file_h = open(self.current_filename, 'r')
print("opening up file: {}".format(self.current_filename))
return self.file_h
def next(self):
"""Setup the next file to be read then return file handler by calling self.open()
#Todo: next() should be able to consume a list as a generator I think.
:{return} file_handler:
"""
if self.i > self.last_i:
return False
# Check that this isn't a skip index
if self.skips and len(self.skips):
while self.i in self.skips:
self.i += 1
if self.zero_index_name is not None:
self.current_filename = self.zero_index_name
self.zero_index_name = None
self.with_headers = True
else:
self.current_filename = self.base_name.format(self.i)
self.with_headers = False
# incriment for the next file
self.i += 1
return self.open()
def run(self):
"""Roll through each file and append lines to the output.
"""
while self.next():
# should skip the headers unless self.with_headers is True
if not self.with_headers:
self.file_h.next()
# Append each line in current file to the output file
for line in self.file_h:
self.output_file.write(line)
self.file_h.close()
print("Closing the main file... {}".format(self.save_name.format('')))
self.output_file.close()
def check_headers(self):
# TODO: Add header checks
pass
def split(self, chunk_size=500):
"""Split the current file into files of row size n where n is chunk_size
First use self.open() to open a file to read from and then call self.split()
:chunk_size int: How many lines per file.
"""
line_num = 0
file_num = 1
if self.file_h is None:
# If no file_h Split assumes to split the base_name file.
self.open(self.base_name)
for line in self.file_h:
line_num += 1
if line_num == 1 or file_headers is None:
file_headers = line
if int(line_num) % int(chunk_size) == 1:
if self.output_file is not None:
self.output_file.close()
del self.output_file
next_filename = self.save_name.format(file_num)
print("About to write next {0} lines into file: {1}".format(chunk_size, next_filename))
self.output_file = open(next_filename, 'w+')
file_num += 1
# Write the file headers as the first line of each file
self.output_file.write(file_headers)
if line_num == 1:
# continue because we dont' want to write headers 2 times.
continue
self.output_file.write(line)
self.file_h.close()
self.output_file.close()
if __name__ == '__main__':
# Make split_mode True to split files (for examples).
split_mode = False
if split_mode is True:
# Splitting files example
csv_monster = CSV_Monster("large-files/all-orders.csv", 1, save_name="orders-chunk-{}.csv")
csv_monster.split(400)
elif __name__ == '__main__' and split_mode is not True:
# Merging files example
csv_monster = CSV_Monster("split-files/orders-{}.csv", 34, first_i=1, save_name="orders-completed.csv",
skips=set((2, 9, 10, 22, 27, 28, 29, 33)))
csv_monster.run()