Count occurrence of words in a .txt file

Question

I'm taking an intro to programming class and although I've learned some things I didn't know before (I've been using Python for about 1.5 years) I feel like I've not progressed much in writing "beautiful" code. My professor is committed to keeping this as a general intro class, and chose Python for its "friendliness" as an initial language. I can't really tell how much I'm improving (or not) as the grades at this point seem really inflated, so I wanted to get some input from here.

We were assigned an exercise in class to take a .txt file (in our case a .txt of the Gettysburg Address) and count the number of times each word occurs. We were then to output our results in a neatly formatted fashion. We've been getting drilled in writing functions and are starting to work with dictionaries, so I came up with this solution with those things in mind. I want to know how I can improve my code (i.e., make it more efficient, more Pythonic, and better at embracing what Python brings to the table as a language).

from re import split


def process_line(words, word_dict):
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1


def process_dict(word_dict):
    temp_list = []
    for key, value in word_dict.items():
        temp_list.append((value, key))

    temp_list.sort()
    return temp_list


def format_print(input_list, reverse, word_num):
    if reverse:
        input_list.sort(reverse=True)

    print "\n", ("[Unique Words: " + str(word_num) + "]").center(35, "=")
    print "-"*35 + "\n", "%-16s %s %16s" % ("Word", "|", "Count"), "\n", "-"*35
    for count, word in input_list:
        print "%-16s %s %16d" % (word, "|", count)


def word_count(_file, max_to_min=False):
    txt = open(_file, "rU")
    word_dict = {}
    for line in txt:
        if line.replace(" ", "") != ("\n" or None):
            process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

    txt.close()
    final_list = process_dict(word_dict)
    format_print(final_list, max_to_min, len(word_dict))


word_count("Gettysburg.txt", True)

Not working properly on Python or Code Academy.
– Nazar Mohammad, Commented Mar 10, 2015 at 16:01

Community · Accepted Answer · 2017-04-13 12:40:52Z

Let's take a look at word_count, which appears to be the central function:

def word_count(_file, max_to_min=False):
    txt = open(_file, "rU")
    word_dict = {}
    for line in txt:
        if line.replace(" ", "") != ("\n" or None):
            process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

    txt.close()
    final_list = process_dict(word_dict)
    format_print(final_list, max_to_min, len(word_dict))

_file is not a suitable name according to PEP 8. It is also more Pythonic to open the file with with open(_file, "rU") as f (a context manager), which closes the file automatically. With that, I rename _file to filename. These two points are mentioned in vnp's answer; however, I disagree with vnp's suggestion to catch the exception, as there is no need for a graceful exit. The program should crash if the file cannot be opened.

def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        word_dict = {}
        for line in f:
            if line.replace(" ", "") != ("\n" or None):
                process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

    final_list = process_dict(word_dict)
    format_print(final_list, max_to_min, len(word_dict))

Your function calls the process_line function:

def process_line(words, word_dict):
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1

The standard library has a class for this: Counter, in the collections module. It has a dictionary interface too. With that, the process_line function is no longer necessary and we can rewrite this as:

from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            if line.replace(" ", "") != ("\n" or None):
                counter.update(filter(None, split("[^a-zA-Z']+", line.lower())))

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))
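To see that Counter does the same job as process_line, here is a small sketch comparing the two. It is written for Python 3 (an assumption; the code in this thread is Python 2), and the sample words and the count_manually helper are made up for illustration.

```python
from collections import Counter


def count_manually(words):
    """Count words the way process_line does, with a plain dict."""
    word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict


# Hypothetical sample input, not read from the Gettysburg Address file.
sample = ["we", "can", "not", "dedicate", "we", "can", "not", "consecrate"]

counter = Counter()
counter.update(sample)  # update() adds to existing counts; it does not replace them

# Counter is a dict subclass, so it compares equal to the hand-built dict.
print(counter == count_manually(sample))
# Looking up a word that was never seen gives 0 instead of raising KeyError.
print(counter["we"], counter["missing"])
```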

Secondly, you appear to be removing all spaces from the line to find out whether the line is just whitespace and contains no actual words. This can be done more easily with the strip method.

from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(filter(None, split("[^a-zA-Z']+", line)))

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))
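As a side note on the original condition, the following sketch (Python 3 syntax, with hypothetical helper names) shows that ("\n" or None) is simply "\n", and where the two blank-line checks disagree:

```python
# `or` returns its first truthy operand, so the None is never used.
print(("\n" or None) == "\n")


def is_blank_original(line):
    """The question's check: only spaces followed by a newline count as blank."""
    return line.replace(" ", "") == "\n"


def is_blank_strip(line):
    """The strip-based check: any run of whitespace (or nothing) counts as blank."""
    return not line.strip()


# Tabs and a final line without a newline are only caught by the strip version.
for sample in ["\n", "   \n", "\t\n", "", "word\n"]:
    print(repr(sample), is_blank_original(sample), is_blank_strip(sample))
```

Either way the result is the same for this program, because splitting a whitespace-only line yields no words once the empty strings are filtered out.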

filter can be rewritten as a generator expression, which feels more natural to me. That also uses fewer parentheses, making the code more readable.

from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
    with open(filename, "rU") as f:
        counter = Counter()
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(x for x in split("[^a-zA-Z']+", line) if x)

    final_list = process_dict(counter)
    format_print(final_list, max_to_min, len(counter))

Now, let's take a look at process_dict.

def process_dict(word_dict):
    temp_list = []
    for key, value in word_dict.items():
        temp_list.append((value, key))

    temp_list.sort()
    return temp_list

The first few lines can be done with map and a lambda. The new function looks like this:

def process_dict(counter):
    temp_list = map(lambda (a, b): (b, a), counter.items())
    temp_list.sort()
    return temp_list
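Note that the snippet above relies on Python 2 behaviour: tuple-unpacking lambda parameters were removed in Python 3, and map returns an iterator there, which has no sort method. Assuming Python 3, a rough equivalent could look like this (the sample data is made up):

```python
from collections import Counter


def process_dict(counter):
    """Return (count, word) pairs sorted by count, then by word."""
    return sorted((count, word) for word, count in counter.items())


# Hypothetical sample data for illustration.
pairs = process_dict(Counter(["the", "a", "the", "of", "a", "the"]))
print(pairs)  # [(1, 'of'), (2, 'a'), (3, 'the')]
```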

But was there really a need for a function on its own? In fact, I'd argue that since your function is named word_count, the function should only count words. Hence, we should just return the counter object and let the printing be handled elsewhere. Also, we usually name functions as verbs, so I'll change the name to count_words.

The above change affects our whole program structure. Hence, I'll show the final code before explaining the changes I made.

from collections import Counter
from re import split

BANNER = "-" * 35

def format_print(counter, is_reverse=False):
    lst = counter.items()
    lst.sort(key=lambda (a, b): (b, a), reverse=is_reverse)
    print ("[Unique Words: %d]" % len(lst)).center(35, "=")
    print "%-16s | %16s" % ("Word", "Count")
    print BANNER
    for word, count in lst:
        print "%-16s | %16d" % (word, count)

def count_words(filename):
    counter = Counter()
    with open(filename, "rU") as f:
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(x for x in split("[^a-zA-Z']+", line) if x)
    return counter

format_print(count_words("Gettysburg.txt"), is_reverse=False)

I've removed max_to_min=False since we no longer sort the items in count_words.

In format_print, I renamed reverse to is_reverse, made it default to False, and removed word_num.

Afterwards, I rewrote the function that sorts the list such that it would sort by the count, then the word, without affecting the structure of the list. This makes the later loop more intuitive.

I've also separated print statements that had strings separated by commas, since they were confusing. I declared BANNER as a global variable (which is alright in Python, as long as it is used as a constant). In the process, I made a few minor changes to the output; I hope you don't mind!

It took a long time, but the end result is worth it. I hope that I've managed to show you the process of clearing up your code. :)

EDIT: The code here is not yet tested; I am currently checking all of the code I posted here.

EDIT 2: Updated the fixed version.
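The final code above is Python 2 (print statements, a tuple-unpacking lambda, and the "rU" mode). For readers on Python 3, here is a sketch of a port under that assumption. The structure and names are kept from the answer; only the syntax changes, and the "U" flag is dropped because text mode already handles universal newlines in Python 3.

```python
from collections import Counter
from re import split

BANNER = "-" * 35


def format_print(counter, is_reverse=False):
    # sorted() replaces items() + list.sort(); the key orders by count, then word.
    lst = sorted(counter.items(), key=lambda item: (item[1], item[0]),
                 reverse=is_reverse)
    print(("[Unique Words: %d]" % len(lst)).center(35, "="))
    print("%-16s | %16s" % ("Word", "Count"))
    print(BANNER)
    for word, count in lst:
        print("%-16s | %16d" % (word, count))


def count_words(filename):
    counter = Counter()
    with open(filename) as f:
        for line in f:
            line = line.strip().lower()
            if not line:
                continue
            counter.update(x for x in split("[^a-zA-Z']+", line) if x)
    return counter


# Usage, assuming Gettysburg.txt is in the working directory:
# format_print(count_words("Gettysburg.txt"))
```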

This is great. I struggle with incorporating list comprehensions into my code naturally. This certainly looks cleaner and is more readable. I do, however, have one question: Why is it you recommended not catching the exception? It hadn't occurred to me to do so, since I haven't gotten in the habit of it, but it seemed like a reasonable suggestion. I guess what I'm asking is, is there some criteria you used to judge whether catching the exception was suitable in this case or not?
– DamianJ, Commented Nov 12, 2014 at 17:43
A graceful exit is only needed in a more complex application. Since this application is small and simple, I believe that it is fine to let it crash immediately. I'd only catch the exception if 1) the app needs to continue running 2) there is a user interface and user input may be the cause of the problem (hence catching an exception and allowing the user to give a new input would be beneficial). There are a few exceptions to this (for example, Unix-style command line utilities like grep). In general, if there's no real need to gracefully exit or recover, then don't.
– wei2912, Commented Nov 12, 2014 at 17:56
Also, you may want to shift the IO into a main function and call count_words with a list of lines read from the file. Perhaps that'd be cleaner for you.
– wei2912, Commented Nov 12, 2014 at 17:59
Since the update function takes only one argument, I think you can just write counter.update(x for x in split("[^a-zA-Z']+", line) if x) removing the square brackets.
– parchment, Commented Nov 15, 2014 at 7:16
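wei2912's suggestion above, moving the I/O into a main function and passing lines to count_words, might look roughly like this. It is a sketch in Python 3 syntax; the exact split between the two functions is an assumption, not something spelled out in the comment.

```python
from collections import Counter
from re import split


def count_words(lines):
    """Count words in any iterable of strings; no file handling here."""
    counter = Counter()
    for line in lines:
        # Blank lines need no special case: they split into empty strings only.
        counter.update(x for x in split("[^a-zA-Z']+", line.lower()) if x)
    return counter


def main(filename):
    # All file I/O lives here, so count_words can be tested with plain lists.
    with open(filename) as f:
        counter = count_words(f)
    for word, count in sorted(counter.items()):
        print("%-16s | %16d" % (word, count))


# count_words works on in-memory data (hypothetical sample lines):
print(count_words(["Four score and", "", "seven years ago"]))
```

One benefit of this layout is that count_words no longer needs a file on disk to be exercised.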

vnp · Accepted Answer · 2014-11-12 07:02:55Z

Naming
- word_count is a wrong name. The function doesn't just count words: it counts them, sorts them, and prints them. In other words, it completes the assignment. Hence a right name would be exercise_NNN with a proper number.
- _file looks strange. filename seems better because the argument is a file name.
- process_dict is non-descriptive. The function converts the dictionary into a sorted list. Should be to_sorted_list or something along the same line.
The final sort order is decided by a printing routine. I seriously doubt this design. A sorter should sort, a printer should print. For instance, your solution may be penalized by sorting data twice.
Context managers are much more Pythonic than raw open/close calls:
```
with open(_file, "rU") as src:
    ...
```
In any case, open (and reading a file) may throw an exception. It is better to catch such exceptions; a graceful exit is a valuable feature.
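The points above about separating sorting from printing and exiting gracefully could be sketched as follows. This assumes Python 3; to_sorted_list is the name proposed above, while read_lines and the exit message are hypothetical choices made for illustration.

```python
import sys


def to_sorted_list(word_dict, reverse=False):
    """Sort once, in the requested order, so printing code never re-sorts."""
    return sorted(((count, word) for word, count in word_dict.items()),
                  reverse=reverse)


def read_lines(filename):
    """Return the file's lines, or exit with a message if it cannot be read."""
    try:
        with open(filename) as src:
            return src.readlines()
    except OSError as err:
        # Hypothetical policy: report the problem and stop with a non-zero status.
        # Only OS-level errors are handled here; decoding errors would still propagate.
        sys.exit("Cannot read %s: %s" % (filename, err))


print(to_sorted_list({"nation": 5, "people": 3}, reverse=True))
```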

Brythan · Accepted Answer · 2015-03-10 19:24:40Z

This is something that my teacher came up with for a spell-checking task, but I've adapted it to make it work for you, and it adds a bit to my program as well. The variable names are a bit weird, because I've copied them straight from my program.

inputfile=input("Enter the name (with file extension) of the file you would like to spellcheck: ")
fileToCheck = open(inputfile, 'rt') #opens the file
print("File found.")
textToCheck=[]
for line in fileToCheck:
    sentence=line.split() #splits it into words
    for word in sentence:
        textToCheck.append(word) #adds the word to the list
fileToCheck.close()
print("File imported.")
print(str(len(textToCheck))+" words found in input file.") #prints the length of the list (number of words)

As for formatting it well, have you considered outputting the finished file as a HTML? You could use CSS, and maybe even basic Javascript to make it look good. This is how I did it:

(at start of program)

import os
outputText='<html>\n<head>\n<title>Document</title>\n<style>insert stuff here</style>\n<link rel="stylesheet" href="linktocss.css">\n</head>\n<body>\n<h1>Document</h1>'

(at end of program)

filename=inputfile+".html"
outputText+="</body></html>" #finishes off html
outputFile = open(filename, 'wt')
outputFile.write(outputText)
outputFile.close()
os.startfile(filename) #automatically open file (os.startfile is available on Windows only)
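To connect the HTML idea to the word counts themselves, here is a minimal sketch that turns a Counter into an HTML table. It assumes Python 3; counts_to_html and the page layout are hypothetical and not part of the answer's program.

```python
from collections import Counter
from html import escape


def counts_to_html(counter, title="Word counts"):
    """Build a complete HTML page with one table row per word."""
    rows = "\n".join(
        # escape() keeps characters such as < and & from breaking the markup.
        "<tr><td>%s</td><td>%d</td></tr>" % (escape(word), count)
        for word, count in counter.most_common()
    )
    return (
        "<html>\n<head>\n<title>%s</title>\n</head>\n<body>\n"
        "<h1>%s</h1>\n<table>\n<tr><th>Word</th><th>Count</th></tr>\n"
        "%s\n</table>\n</body>\n</html>" % (escape(title), escape(title), rows)
    )


# Hypothetical sample data; rows come out with the highest count first.
page = counts_to_html(Counter(["the", "the", "people"]))
print(page)
```

The resulting string can be written to a file in the same way as outputText above.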

Deepi · Accepted Answer · 2015-09-17 10:58:13Z


This is the shortest and most suitable way to count the occurrence of words in any text file.

import re
from collections import Counter

# Raw string so the backslashes in the Windows path are taken literally.
with open(r'C:\Python27\myfile.txt', 'r') as f:
    passage = f.read()
words = re.findall(r'\w+', passage)
# Convert to uppercase so that words like 'Is' and 'is' are counted as the same word.
cap_words = [word.upper() for word in words]
word_counts = Counter(cap_words)
print(word_counts)

You can check the output here-

http://pythonplanet.blogspot.in/2015/08/python-program-to-find-number-of-times.html
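If a ranked list is wanted rather than the raw Counter output, Counter.most_common may help. A small sketch follows, using a made-up passage in place of a file so that it runs anywhere:

```python
import re
from collections import Counter

# Hypothetical passage standing in for the contents of a text file.
passage = "the cat and the dog and the bird"
words = re.findall(r"\w+", passage)
word_counts = Counter(word.upper() for word in words)

# most_common(n) returns the n highest counts as (word, count) pairs.
top = word_counts.most_common(2)
print(top)  # [('THE', 3), ('AND', 2)]
```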


Welcome to Code Review! You posted some alternate code but didn't really give feedback on what the OP had or explain what about this code is better. On CR answers should mainly serve to help the OP understand new and alternative ways of approaching what they did, and an answer with just code is less useful to show them.
– SuperBiasedMan, Commented Sep 17, 2015 at 11:03
