I'm trying to read a few thousand HTML files stored on disk.

Is there any way to do better than this?

import os

for filename in os.listdir('.'):
    if filename.endswith('.html'):
        with open(filename) as f:
            a = f.read()
            # do more stuff

2 Comments
  • Is this a repetitive task or a one-time operation? If repetitive, are the names of the files and their total amount expected to change or are they constant? Commented Nov 29, 2014 at 12:35
  • It's a repetitive task. The names of the files are expected to change, but not the extension. Commented Nov 29, 2014 at 12:39

2 Answers

For a similar problem I have used this simple piece of code:

import glob
for file in glob.iglob("*.html"):
    with open(file) as f:
        a = f.read()

iglob yields matching names one at a time instead of building the whole list first, which is ideal for a huge directory.
Remember to close each file when you have finished with it; the with open(...) construct does this for you.
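To illustrate the point about laziness, here is a minimal sketch that wraps iglob in a generator so that only one file's contents is held in memory at a time. The helper name iter_html_texts and the choice of UTF-8 with errors="replace" are assumptions for the example, not part of the original answer.

```python
import glob
import os
from typing import Iterator, Tuple


def iter_html_texts(directory: str = ".") -> Iterator[Tuple[str, str]]:
    """Yield (path, text) pairs, one at a time, for each .html file in directory."""
    # glob.escape protects any glob metacharacters in the directory name itself.
    pattern = os.path.join(glob.escape(directory), "*.html")
    for path in glob.iglob(pattern):
        # Assumption: the pages are UTF-8; errors="replace" keeps one badly
        # encoded file from aborting the whole run.
        with open(path, encoding="utf-8", errors="replace") as f:
            yield path, f.read()
```

A caller would loop with something like: for path, text in iter_html_texts('.'), processing each page before the next one is read.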

2 Comments

Thanks for the suggestion. I've made sure to close the file. But I'm looking for a way to read the files faster. So simultaneous reading would not be a bad thing actually.
What are you doing with the files? With more details we might be able to help. Look at this multi-threaded file processing; it's a little tricky.

Here's some code that's significantly faster than with open(...) as f: f.read(), because it skips Python's buffered file object and calls the OS read directly:

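import os


def read_file_bytes(path: str, size=-1) -> bytes:
    # Open a raw file descriptor instead of a buffered Python file object.
    fd = os.open(path, os.O_RDONLY)
    try:
        if size == -1:
            # No size given: ask the OS how large the file is.
            size = os.fstat(fd).st_size
        return os.read(fd, size)
    finally:
        os.close(fd)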

If you know the maximum size of the file, pass it as the size argument so you can skip the fstat call.
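One caveat with read_file_bytes: os.read returns at most the requested number of bytes, so a single call is not guaranteed to return the whole file. Below is a hedged sketch of a more defensive variant that loops until the expected size has been read. The name read_file_bytes_all is hypothetical, and adding os.O_BINARY where it exists (Windows) is my assumption for avoiding newline translation there.

```python
import os


def read_file_bytes_all(path: str) -> bytes:
    """Read a whole file through os.read, looping in case of short reads."""
    # O_BINARY only exists on Windows; getattr falls back to 0 elsewhere.
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        chunks = []
        remaining = os.fstat(fd).st_size
        while remaining > 0:
            chunk = os.read(fd, remaining)
            if not chunk:  # an empty result means end of file
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

For ordinary local files the loop normally runs once, so the extra cost over the single-call version should be small, but that is worth measuring on your own data.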

Here's some all-around faster code:

for entry in os.scandir('.'):
    if entry.name.endswith('.html'):
        # On Windows entry.stat(follow_symlinks=False) is free, but on Unix it requires a syscall.
        file_bytes = read_file_bytes(entry.path, entry.stat(follow_symlinks=False).st_size)
        a = file_bytes.decode()  # if a string is needed rather than bytes
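For the simultaneous reading asked about in the comments, one option is a thread pool: Python releases the GIL while a thread is blocked on file I/O, so several reads can be in flight at once. This is only a sketch under assumptions. The names read_bytes and read_html_files and the default of 8 workers are made up for the example, and whether it actually beats the sequential loop depends on the disk and the OS cache, so measure before adopting it.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def read_bytes(path: str) -> bytes:
    """Read one file fully as bytes."""
    with open(path, "rb") as f:
        return f.read()


def read_html_files(directory: str = ".", max_workers: int = 8) -> Dict[str, bytes]:
    """Read every .html file in directory using a pool of worker threads."""
    with os.scandir(directory) as entries:
        paths = [e.path for e in entries
                 if e.is_file() and e.name.endswith(".html")]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as paths.
        return dict(zip(paths, pool.map(read_bytes, paths)))
```

Note that this holds every file's contents in memory at once; with very large collections it may be better to process each result as it arrives rather than collecting them in a dict.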
