We have seen several answers about how to improve your Python, and why using a regex to detect an email address is a bad idea, but there's one more thing I want to point out:
Reading a complete file in one big "slurp" is almost always a bad idea. It means you need as much RAM as the file's size, and in languages that store characters as 32-bit values internally, reading an ASCII or UTF-8 file can take four times as much RAM as the file's size.
This is OK for data that needs to be present at all times while your program is running, for example a configuration file. It's also OK for a one-shot program with a known, small(-ish) data file. But it can cause lots of problems in any kind of production code.
Running your program on your 16 GB PC against a 100 MB data file will work well. Running it on your 256 MB VPS server against a 4 GB log file will result in one of:
- your kernel killing your program while it's reading the file, to keep the machine working
- your machine grinding to an unresponsive halt while it's desperately trying to swap your memory out
- (worst case) your cloud provider assigning more RAM to your machine "because you needed it" and presenting you with a huge bill for it.
You can avoid this kind of problem by reading the input file line by line, processing each line, until you reach the end of the file.
<rant>
Unfortunately, there are many, many tutorials out there from people who don't really know what they're doing, think they have figured out how to do it, and share their newly acquired wisdom with the world by writing a "tutorial" or making a YouTube video. Don't learn from those people. I've seen many a junior developer fall into this kind of trap and then wonder why the code that worked so well on their machine fails miserably in production.
</rant>
Now, I don't know Python well enough to give you a working example (and I refuse to ask ChatGPT for one), but what you want to do is akin to this pseudo code:
    set the result set to empty
    open the file
    while (reading a line from the file does not hit EOF)
        get the set of emails from this line
        merge the line-set into the result set, eliminating duplicates
    end while
    close the file
    emit the result set

(Note that the loop condition itself reads the next line, so there is no separate "read" step inside the loop body.)
That way, you only ever need RAM for one line plus the set of found emails, which should be a lot less than the input file itself.
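The pseudocode above maps almost directly onto Python, where iterating over a file object yields one line at a time. Here is a minimal sketch; the `EMAIL_RE` pattern is deliberately simplistic and for illustration only (real email matching is much harder, as other answers point out):

```python
import re

# Deliberately naive email pattern -- illustration only, not RFC-compliant.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def collect_emails(path):
    """Read `path` line by line, returning the set of unique email matches."""
    found = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # reads one line per iteration, never the whole file
            found.update(EMAIL_RE.findall(line))
    return found
```

The `with` block closes the file automatically, and because the loop pulls one line at a time, memory stays bounded by the longest line plus the result set.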