8
\$\begingroup\$

I have a plain text file with the content of a dictionary (Webster's Unabridged Dictionary) in this format:

A
A (named a in the English, and most commonly ä in other languages).

Defn: The first letter of the English and of many other alphabets.
The capital A of the alphabets of Middle and Western Europe, as also
the small letter (a), besides the forms in Italic, black letter,
etc., are all descended from the old Latin A, which was borrowed from
the Greek Alpha, of the same form; and this was made from the first
letter (Aleph, and itself from the Egyptian origin. The Aleph was a
consonant letter, with a guttural breath sound that was not an
element of Greek articulation; and the Greeks took it to represent
their vowel Alpha with the ä sound, the Phoenician alphabet having no
vowel symbols. This letter, in English, is used for several different
vowel sounds. See Guide to pronunciation, §§ 43-74. The regular long
a, as in fate, etc., is a comparatively modern sound, and has taken
the place of what, till about the early part of the 17th century, was
a sound of the quality of ä (as in far).

2. (Mus.)

Defn: The name of the sixth tone in the model major scale (that in
C), or the first tone of the minor scale, which is named after it the
scale in A minor. The second string of the violin is tuned to the A
in the treble staff.
 -- A sharp (A#) is the name of a musical tone intermediate between A
and B.
 -- A flat (A) is the name of a tone intermediate between A and G.

A per se Etym: (L. per se by itself), one preëminent; a nonesuch.
[Obs.]
O fair Creseide, the flower and A per se Of Troy and Greece. Chaucer.

A
A, prep. Etym: [Abbreviated form of an (AS. on). See On.]

1. In; on; at; by. [Obs.] "A God's name." "Torn a pieces." "Stand a
tiptoe." "A Sundays" Shak. "Wit that men have now a days." Chaucer.
"Set them a work." Robynson (More's Utopia)

2. In process of; in the act of; into; to; -- used with verbal
substantives in -ing which begin with a consonant. This is a
shortened form of the preposition an (which was used before the vowel
sound); as in a hunting, a building, a begging. "Jacob, when he was a
dying" Heb. xi. 21. "We'll a birding together." " It was a doing."
Shak. "He burst out a laughing." Macaulay. The hyphen may be used to
connect a with the verbal substantive (as, a-hunting, a-building) or
the words may be written separately. This form of expression is now
for the most part obsolete, the a being omitted and the verbal
substantive treated as a participle.

MALAY
Ma*lay", n.

Defn: One of a race of a brown or copper complexion in the Malay
Peninsula and the western islands of the Indian Archipelago.

MALAY; MALAYAN
Ma*lay", Ma*lay"an, a.

Defn: Of or pertaining to the Malays or their country.
 -- n.

Defn: The Malay language. Malay apple (Bot.), a myrtaceous tree
(Eugenia Malaccensis) common in India; also, its applelike fruit.

MALAYALAM
Ma"la*ya"lam, n.

Defn: The name given to one the cultivated Dravidian languages,
closely related to the Tamil. Yule.

MALBROUCK
Mal"brouck, n. Etym: [F.] (Zoöl.)

Defn: A West African arboreal monkey (Cercopithecus cynosurus).

I want to convert this file to a different format to make it easier and more efficient to search in it. My idea is to:

  1. Split the file into a collection of entries
  2. Save each entry in its own file
    • Not all in the same directory (100k+ files), but split to multiple subdirs
  3. Create an index file
    • One line per entry, in the format: FILENAME:WORD

This is the script I'm using now:

#!/usr/bin/env python

import re
import os

from optparse import OptionParser

DATA_DIR = 'data'
INDEX_PATH = os.path.join(DATA_DIR, 'index.dat')

re_entry_start = re.compile(r'[A-Z][A-Z0-9 ;\'-.,]*$')
re_nonalpha = re.compile(r'[^a-z]')


def get_filename(term, count):
    slug = re_nonalpha.sub('_', term.lower())
    dirname = slug.ljust(2, '0')[:2]
    filename = slug + '-' + str(count)
    return dirname, filename + '.txt'


def write_entry_file(dirname, filename, content, debug=False):
    basedir = os.path.join(DATA_DIR, dirname)
    if not debug:
        if not os.path.isdir(basedir):
            os.makedirs(basedir)

    path = os.path.join(basedir, filename)
    print '* writing to file', path
    if not debug:
        with open(path, 'w') as fh:
            fh.write(content)


def append_to_index(dirname, filename, term, debug=False):
    if not debug:
        with open(INDEX_PATH, 'a') as fh:
            fh.write('{}/{}:{}\n'.format(dirname, filename, term.lower()))


def parse_file(arg, debug=False):
    prev_line_blank = True
    prev_prefix = '0'
    term = None
    content = ''
    count = 1
    if not debug and os.path.isfile(INDEX_PATH):
        os.remove(INDEX_PATH)
    with open(arg) as fh:
        for line0 in fh:
            line = line0.strip()
            if re_entry_start.match(line) and prev_line_blank and not line.count('  '):
                if term:
                    for term in term.split('; '):
                        dirname, filename = get_filename(term, count)
                        write_entry_file(dirname, filename, content, debug=debug)
                        append_to_index(dirname, filename, term, debug=debug)
                term = line
                content = line0
                if term[0] != prev_prefix:
                    prev_prefix = term[0]
                    count = 1
                count += 1
                if debug and count > 5:
                    break
            else:
                content += line0
            prev_line_blank = not line


def main():
    parser = OptionParser()
    parser.set_usage('%prog [options] file...')
    parser.add_option('--debug', '-d', help="Debug mode, don't write to files", action='store_true')
    parser.set_description('Generate index and entry files from cleaned plain text file')
    (options, args) = parser.parse_args()

    if args:
        for arg in args:
            parse_file(arg, debug=options.debug)
    else:
        parser.print_help()

if __name__ == '__main__':
    main()

It creates an index file like this:

a0/a-2.txt:a
a0/a-3.txt:a
a0/a-4.txt:a
a0/a-5.txt:a
a0/a-6.txt:a
a0/a-7.txt:a
a_/a_-8.txt:a-
a_/a__-9.txt:a 1
aa/aam-10.txt:aam
aa/aard_vark-11.txt:aard-vark
aa/aard_wolf-12.txt:aard-wolf
aa/aaronic-13.txt:aaronic
aa/aaronical-13.txt:aaronical
aa/aaron_s_rod-14.txt:aaron's rod

This works fine, but it's not exactly pretty. I'm wondering if there's a better way of doing this.

I will use the resulting index file and entry files by a simple dictionary tool. The index file is not too big (3.4 MB), so it will load it into memory and use it for searches. The entry files will be loaded as needed, they don't need to be searchable.

The full input file (cleaned data) is here (10 MB download, 27 MB unzipped).

The open-source project is here.

\$\endgroup\$
3
  • \$\begingroup\$ Are you sure the data file can be freely distributed? \$\endgroup\$ Commented Aug 10, 2014 at 20:11
  • \$\begingroup\$ The code looks good to me. \$\endgroup\$ Commented Aug 10, 2014 at 20:13
  • \$\begingroup\$ This is an extract from the license that's in the original file: github.com/janosgyerik/webdict/blob/master/plugins/wud/… -- Correct me if I'm wrong, I think it means I can redistribute as long as I link back to the Gutenberg Project. \$\endgroup\$ Commented Aug 10, 2014 at 20:38

2 Answers 2

7
\$\begingroup\$

I've executed the code using the cleaned file you provided and it worked fine for me.

I have a few comments:

  • Comments and docstrings would make the code more readable.
  • Try to use constants instead of hardcoded values (instead of 2 something like DIR_LENGTH or PREFIX_LENGTH would be nice)
  • Replace print statements (only one at the moment) with logging calls
  • The index file is opened and closed for every entry. That doesn't seem to be efficient.
  • not line.count(' ') seems to be equivalent to ' ' not in line which I find easier to read
  • I see there's a counter for all entries starting with the same character. However, when looking at the directories that only have one character and an underscore, the counter doesn't seem to be right.
  • content should be a list of strings and it should be joined when the term is going to be written to disk. Otherwise += with strings isn't efficient because strings are immutable.

Regarding how to use logging, the very basic to start with would be as follows:

import logging

...

def write_entry_file(dirname, filename, content, debug=False):
    ...
    logging.debug('writing to file %s', path)
    ...

def main():
    ...
    logging.basicConfig(
        format='%(levelname)s: %(message)s',
        level=logging.DEBUG)
    ...

Once you have more logs you can play a little bit setting different levels to each message and adding a command line option to set the desired level and get the desired level of verbosity.

\$\endgroup\$
3
  • \$\begingroup\$ Great tip, thanks a lot! Can you add an example about using logging? \$\endgroup\$ Commented Aug 11, 2014 at 18:14
  • 1
    \$\begingroup\$ Sure, I just added an example. Let me know if you need more detail. \$\endgroup\$ Commented Aug 12, 2014 at 11:41
  • \$\begingroup\$ Thanks again! I took your advices, made some other improvements, and posted the result in a new question: codereview.stackexchange.com/q/59833/12390 \$\endgroup\$ Commented Aug 12, 2014 at 19:12
4
\$\begingroup\$

Looks like a fine start to me. Here's my thoughts:

  • As soon as you say index, I think database. Why make your own filing system (aa,ab,etc...) when a database can provide you with all the views you could ever imagine? Naturally, this would be a big leap, but it sounds like a likely next step.
  • Parsing: make some sort of provision for entries that don't fit the spec. It sounds crazy, because you seem to know the spec, but I've often thought that as well and somewhere in the dataset there was something that didn't fit. So spit out entries that don't work so you can modify the code to suit.
\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.