Explicit contract
Returning an empty list to indicate "file not found" is probably the only big sin here. Now your function returns the same empty list in both of the following cases:
- File does not exist
- File is OK and contains no email addresses
If the file wasn't found, it also prints to stdout, which is both non-standard (error messages should go to stderr) and inconvenient (what if I want to use this function in a larger application with normal logging configuration?).
This is... surprising. I'd prefer such a function to tell me that a file does not exist. And that's exceptional case, so FileNotFoundError is just the right way to tell that.
Also note that your function still isn't exception-free: files that can't be decoded as UTF8 and files that can't be accessed due to filesystem permissions will still cause exceptions (UnicodeDecodeError and PermissionError, respectively; try running your script on /bin/ls to see the former).
So let's just propagate file exceptions to the caller instead.
with open(filename, 'r', encoding='utf-8') as file:
content = file.read()
Use modern APIs
Standard library has a better way to spell "read this file into memory".
from pathlib import Path
content = Path(filename).read_text(encoding='utf-8')
Nice, huh? Knowing about Path, it's probably better to use it from the very beginning, see my final suggestion below. Doing that would also prevent confusing filename and content by callers (possible if both are strings).
Compile once
You define a pattern as a raw string, good job with that! However, your function still might have to parse the pattern on every invocation. Let's compile it once:
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# ...
emails = EMAIL_PATTERN.findall(content)
As noted by @Booboo, your function has a good chance to reuse the compiled expression anyway (see the note from the docs quoted below), but that depends on other regexes used in the program and is not guaranteed. Since there's no penalty for doing that explicitly once, I'd recommend just extracting the regex anyway. (And performance probably isn't critical here anyway)
The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
Type hints
If it's intended as a small standalone script, skip this section - typing small codebases is not worth the effort.
If it's part of a library, however, consider documenting what your function accepts and returns:
def extract_unique_emails(filename: str) -> list[str]:
...
Usability
If-main guard is good. Consider providing a simple CLI interface that takes filename from arguments.
Correctness
Your regex does not accept all valid emails and accepts some invalid emails. However, it really seems to cover almost all "good" emails encountered in wild and only accept a few invalid emails (e.g. [email protected] is invalid if I'm not mistaken). This should be documented but is most likely fine for your use case.
Rewrite suggestion
import re
import sys
from pathlib import Path
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
def extract_unique_emails(file_path: Path) -> list[str]:
content = file_path.read_text(encoding='utf-8')
emails = EMAIL_PATTERN.findall(content)
return sorted(set(emails))
def _parse_cli_args():
from argparse import ArgumentParser
parser = ArgumentParser(description="Extract unique email addresses from file")
parser.add_argument("filename", help="File to process", type=Path)
return parser.parse_args()
if __name__ == "__main__":
args = _parse_cli_args()
try:
emails = extract_unique_emails(args.filename)
except FileNotFoundError:
# Common case, offer a tailored error message
print(f"File not found at '{args.filename}'", file=sys.stderr)
except OSError as exc:
# Other filesystem errors - permissions, directory, etc.
print(f"Cannot open file at '{args.filename}': {exc}", file=sys.stderr)
except UnicodeDecodeError as exc:
print(
f"File at '{args.filename}' is not valid UTF-8 text: {exc}",
file=sys.stderr
)
else:
for email in emails:
print(email)
And now
$ printf '[email protected]' >/tmp/sample.txt
$ for file in /tmp /tmp/doesnotexist /root/.viminfo /bin/ls /tmp/sample.txt; do
printf '\n%s:\n' "$file"
python /tmp/a.py "$file"
done
/tmp:
Cannot open file at '/tmp': [Errno 21] Is a directory: '/tmp'
/tmp/doesnotexist:
File not found at '/tmp/doesnotexist'
/root/.viminfo:
Cannot open file at '/root/.viminfo': [Errno 13] Permission denied: '/root/.viminfo'
/bin/ls:
File at '/bin/ls' is not valid UTF-8 text: 'utf-8' codec can't decode byte 0xd8 in position 96: invalid continuation byte
/tmp/sample.txt:
[email protected]