4.7 Counting Lines in a File
Credit: Luther Blissett
4.7.1 Problem
You need to compute the number of lines in
a file.
4.7.2 Solution
The simplest approach, for
reasonably sized files, is to read the file as a list of lines so
that the count of lines is the length of the list. If the
file's path is in a string bound to the
thefilepath variable, that's
just:
count = len(open(thefilepath).readlines( ))
For a truly huge file, this may be very slow or even fail to work. If
you have to worry about humongous files, a loop using the
xreadlines method always works:
count = 0
for line in open(thefilepath).xreadlines( ): count += 1
Here's a slightly tricky
alternative, if the line terminator is '\n' (or
has '\n' as a substring, as happens on Windows):
count = 0
thefile = open(thefilepath, 'rb')
while 1:
    buffer = thefile.read(8192*1024)
    if not buffer: break
    count += buffer.count('\n')
thefile.close( )
Without the 'rb' argument to
open, this will work anywhere, but performance may
suffer greatly on Windows or Macintosh platforms.
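The chunked loop above is written for the Python versions current when this recipe was contributed. As a minimal sketch, assuming a Python 3 interpreter (where a file opened with 'rb' yields bytes, so the terminator must be written b'\n'), the same idea might look like the following. The function name count_newlines, the chunk_size parameter, and the throwaway demonstration file are illustrative assumptions, not part of the original recipe.

```python
import os
import tempfile

def count_newlines(path, chunk_size=1024 * 1024):
    # Count newline bytes, reading the file in fixed-size binary chunks.
    count = 0
    with open(path, 'rb') as thefile:
        while True:
            chunk = thefile.read(chunk_size)
            if not chunk:          # empty bytes object: end of file
                break
            count += chunk.count(b'\n')
    return count

# Demonstration on a throwaway file whose last line has no terminator.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'one\r\ntwo\nthree')

terminators = count_newlines(demo_path, chunk_size=4)   # tiny chunks on purpose
with open(demo_path, 'rb') as f:
    lines = len(f.readlines())
os.remove(demo_path)
```

Note that this approach counts terminators, not lines: for the demonstration file, terminators is 2 while lines is 3, because the final line lacks a trailing '\n'. If such files matter to you, check for that case explicitly.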
4.7.3 Discussion
If you have an external program that counts a file's
lines, such as wc -l on Unix-like platforms, you
can of course choose to use that (e.g., via os.popen(
)). However, it's generally simpler,
faster, and more portable to do the line-counting in your program.
You can rely on almost all text files having a reasonable size, so
that reading the whole file into memory at once is feasible. For all
such normal files, the len of the result of
readlines gives you the count of lines in the
simplest way.
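To make the two options in this paragraph concrete, here is a hedged sketch for Python 3.7 or later. It uses subprocess.run rather than os.popen, which is a substitution on my part, and it assumes a wc program may or may not be on the PATH. The function names and the throwaway file are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile

def linecount_wc(path):
    # Ask the external wc program; its output looks like "3 path".
    result = subprocess.run(['wc', '-l', path],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.split()[0])

def linecount_readlines(path):
    # Read the whole file into memory and count the resulting list.
    with open(path, 'rb') as f:
        return len(f.readlines())

# Throwaway three-line file for the comparison.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'first\nsecond\nthird\n')

in_process = linecount_readlines(demo_path)
# Only call wc where it exists; elsewhere the in-process count still works.
external = linecount_wc(demo_path) if shutil.which('wc') else None
os.remove(demo_path)
```

The guard around the wc call illustrates the portability point: the in-process version runs everywhere Python does, while the external one depends on the platform.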
If the file is larger
than available memory (say, a few hundred megabytes on a typical
PC today), the simplest solution can become slow, as the operating
system struggles to fit the file's contents into
virtual memory. It may even fail, when swap space is exhausted and
virtual memory can't help any more. On a typical PC,
with 256 MB of RAM and virtually unlimited disk space, you should
still expect serious problems when you try to read into memory files
of, say, 1 or 2 GB, depending on your operating system (some
operating systems are much more fragile than others in handling
virtual-memory issues under such overstressed load conditions). In
this case, the xreadlines method of file objects,
introduced in Python 2.1, is generally a good way to process text
files line by line. In Python 2.2, you can do even better, in terms
of both clarity and speed, by looping directly on the file object:
for line in open(thefilepath): count += 1
However, xreadlines does not return a sequence,
and neither does a loop directly on the file object, so you
can't just use len in these cases
to get the number of lines. Rather, you have to loop and count line
by line, as shown in the solution.
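On a current Python 3 interpreter, where xreadlines no longer exists, the loop-and-count idea can be sketched as follows. The with statement, the generator expression, and the name count_lines are not part of the original recipe. Binary mode is an assumption I make here to avoid decoding errors on files of unknown encoding.

```python
import os
import tempfile

def count_lines(path):
    # Iterate over the file object itself; only one line is held at a time.
    with open(path, 'rb') as thefile:
        return sum(1 for line in thefile)

# Small demonstration on a throwaway file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'alpha\nbeta\ngamma\n')

count = count_lines(demo_path)
os.remove(demo_path)
```

The sum over a generator is just the explicit count += 1 loop in compact form; memory use stays flat however large the file is.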
Counting line-terminator characters while reading the file by bytes,
in reasonably sized chunks, is the key idea in the third approach.
It's probably the least immediately intuitive, and
it's not perfectly cross-platform, but you might
hope that it's fastest (for example, by analogy with
Recipe 8.2 in the Perl Cookbook).
However, remember that, in most cases, performance
doesn't really matter all that much. When it does
matter, the time sink might not be what your intuition tells you it
is, so you should never trust your intuition in this
matter—instead, always benchmark and measure. For example, I
took a typical Unix syslog file of middling size, a bit over 18 MB of
text in 230,000 lines:
[situ@tioni nuc]$ wc nuc
231581 2312730 18508908 nuc
and
I set up the following benchmark framework script,
bench.py:
import time
def timeo(fun, n=10):
    start = time.clock( )
    for i in range(n): fun( )
    stend = time.clock( )
    thetime = stend-start
    return fun.__name__, thetime
import os
def linecount_wc( ):
    return int(os.popen('wc -l nuc').read().split( )[0])
def linecount_1( ):
    return len(open('nuc').readlines( ))
def linecount_2( ):
    count = 0
    for line in open('nuc').xreadlines( ): count += 1
    return count
def linecount_3( ):
    count = 0
    thefile = open('nuc')
    while 1:
        buffer = thefile.read(65536)
        if not buffer: break
        count += buffer.count('\n')
    return count
for f in linecount_wc, linecount_1, linecount_2, linecount_3:
    print f.__name__, f( )
for f in linecount_1, linecount_2, linecount_3:
    print "%s: %.2f"%timeo(f)
First, I print the line counts obtained by all methods, thus ensuring
that there is no anomaly or error (counting tasks are notoriously
prone to off-by-one errors). Then, I run each alternative 10 times,
under the control of the timing function timeo,
and look at the results. Here they are:
[situ@tioni nuc]$ python -O bench.py
linecount_wc 231581
linecount_1 231581
linecount_2 231581
linecount_3 231581
linecount_1: 4.84
linecount_2: 4.54
linecount_3: 5.02
As you can see, the performance differences hardly matter: a
difference of 10% or so in one auxiliary task is something that your
users will never even notice. However, the fastest approach (for my
particular circumstances, a cheap but very recent PC running a
popular Linux distribution, as well as this specific benchmark) is
the humble loop-on-every-line technique, while the slowest one is the
ambitious technique that counts line terminators by chunks. In
practice, unless I had to worry about files of many hundreds of
megabytes, I'd always use the simplest approach
(i.e., the first one presented in this recipe).
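If you want to repeat this kind of measurement today, note that time.clock was removed in Python 3.8. The following is a minimal sketch of the same harness for Python 3, using time.perf_counter (which measures wall-clock time, so time.process_time may be closer to what time.clock reported on Unix). The small generated file is a hypothetical stand-in for the syslog file used above. I deliberately show no timings, since they depend entirely on your machine and file.

```python
import os
import tempfile
import time

def timeo(fun, n=10):
    # Call fun n times; return its name and the total elapsed seconds.
    start = time.perf_counter()
    for _ in range(n):
        fun()
    return fun.__name__, time.perf_counter() - start

# Hypothetical stand-in for the syslog file used in the text.
fd, sample = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a line of text\n' * 1000)

def linecount_1():
    with open(sample, 'rb') as f:
        return len(f.readlines())

def linecount_2():
    with open(sample, 'rb') as f:
        return sum(1 for line in f)

def linecount_3():
    count = 0
    with open(sample, 'rb') as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count

candidates = (linecount_1, linecount_2, linecount_3)
counts = [f() for f in candidates]          # check agreement before timing
timings = [timeo(f) for f in candidates]
for name, seconds in timings:
    print("%s: %.4f" % (name, seconds))
os.remove(sample)
```

As in the original script, the counts are compared first, so that a timing result is never reported for a function that returns the wrong answer.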
4.7.4 See Also
The Library Reference section on file objects
and the time module; Perl
Cookbook Recipe 8.2.