I have a text file that mixes names and email addresses, in a comma-separated list.

First Person <[email protected]>, Second Person <[email protected]>, Third <[email protected]>
Fourth Person Long Name <[email protected]>
Fifth <[email protected]>, Sixth Person <[email protected]>

I want to extract a comma-separated list of email addresses for each line:

[email protected], [email protected], [email protected]
[email protected]
[email protected], [email protected]

I tried using grep -o but it printed each email address on a separate line. I want to keep the same number of lines that was in the original file. I'm open to any available regex tool (awk, sed, grep, etc).

7
  • 7
    Related: e-mail.wtf Commented Sep 1 at 12:43
  • 1
    Should the answer ensure that the email addresses are valid and correct the ones that aren't in the expected format (like you seem to do in your example)? Commented Sep 1 at 15:43
  • I don't need to validate the emails Commented Sep 2 at 0:32
  • 3
    fifth.example.com and sixth.example.com are not email addresses, they're domain names. And it would be wrong to just change the first . to an @ because that would break on things like <last.first.example.com> ([email protected]). and changing the second last . to @ would also be wrong because it would break on domains like fifth.example.com.au ([email protected]). These would be valid addresses, but probably not addresses belonging to the users in the input. Anyway, the point is that any such assumption you make is likely to be wrong in at least some cases. Commented Sep 2 at 2:57
  • 2
    yeah, they could be valid local addresses. or, more likely, the input examples were mis-typed (in which case, hugomg should fix the examples). The example output shows them with the first . changed to an @, which is why I explained why that would always be the wrong thing to do. Commented Sep 2 at 6:10

5 Answers

8

With perl, to extract the parts inside <...>:

$ perl -lne 'print join ", ", @emails if @emails = /<(.*?)>/g' <your-file
[email protected], [email protected], [email protected]
[email protected]
fifth.example.com, sixth.example.com

That's skipping the lines that don't have any emails. If you want to have empty lines of output instead for those, that's even simpler:

perl -lpe '$_ = join ", ", /<(.*?)>/g' <your-file
2
  • Note that this assumes the email addresses are well-behaved. Commented Sep 2 at 23:29
  • @Mark, it assumes the email addresses are between < and > and that neither display name nor email address may contain < or > or newline characters. If they may, then either the task is impossible, or the OP needs to specify what escape mechanism is in place to help recognise which of those </> are the email address boundaries or are part of display name/email address. Commented Sep 3 at 3:05
5

Email addresses are bracketed with <...> so it's easy to pick them out. Remove the leading part that isn't an email address, up to and including the first <; remove the trailing part, from the last > onwards; replace each middle part >...< with a comma and a space:

sed -e 's/^[^<]*<//' -e 's/>[^<]*$//' -e 's/>[^<]*</, /g' file

[email protected], [email protected], [email protected]
[email protected]
fifth.example.com, sixth.example.com

This will not handle pathological addresses that, for example, contain angle brackets themselves. (I'd call these valid but stupid.) See the other answers for more robust, but consequently more complex, solutions.
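To see how the three substitutions cooperate, here is a minimal sketch that applies them one at a time to a single line. The sample line and its example.org addresses are made up for illustration; they are not from the question.

```shell
#!/bin/sh
# Hypothetical sample line; names and addresses are placeholders.
line='First Person <first@example.org>, Second <second@example.org>'

# Step 1: drop everything up to and including the first "<".
step1=$(printf '%s\n' "$line" | sed -e 's/^[^<]*<//')
# Step 2: drop the last ">" and anything after it.
step2=$(printf '%s\n' "$step1" | sed -e 's/>[^<]*$//')
# Step 3: turn each remaining ">...<" gap into ", ".
step3=$(printf '%s\n' "$step2" | sed -e 's/>[^<]*</, /g')

printf '%s\n' "$step1" "$step2" "$step3"
# first@example.org>, Second <second@example.org>
# first@example.org>, Second <second@example.org
# first@example.org, second@example.org
```

Note that a line containing no < or > at all matches none of the three patterns, so sed prints it unchanged rather than as an empty line.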

1
  • +1 for handling the oft-rejected mum&[email protected] correctly. Commented Sep 2 at 23:02
5

Using gawk:

$ awk -vFPAT='<.[^>]*>' -vOFS=', ' \
             '{
             $1=$1;            # force record to be reconstituted
             gsub(/[<>]/,"")   # strip the angle brackets
     }1' file

FS describes the data between fields (“what fields are not”) and FPAT describes the fields themselves (“what fields are”). This leads to a subtle difference in how fields are found when using regexps as the value for FS or FPAT.¹

About $1=$1:

Here FPAT defines the fields and the Output Field Separator (OFS) is set, but the full record $0 still holds the old text. To rebuild $0 from the current values of all fields and OFS, $1 is assigned to itself.

But this does not work, because nothing was done to change the record itself. Instead, you must force the record to be rebuilt, typically with a statement such as ‘$1 = $1’.²
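The effect of the self-assignment is easy to check in isolation. This is a minimal sketch using the default whitespace field splitting instead of FPAT, so it should behave the same in any POSIX awk; the input a b c is just a placeholder.

```shell
#!/bin/sh
# Without the assignment, $0 is printed untouched and OFS is never applied.
unchanged=$(printf 'a b c\n' | awk -v OFS=', ' '{ print }')
# Assigning a field (even to itself) makes awk rebuild $0 using OFS.
rebuilt=$(printf 'a b c\n' | awk -v OFS=', ' '{ $1 = $1; print }')

printf '%s\n' "$unchanged" "$rebuilt"
# a b c
# a, b, c
```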


¹See FS versus FPAT.

²See Understanding $0.

3
perl -F',' -lane '@F = map {  s/^.*<(.*?)>.*/$1/r } @F;
   print join(", ", @F)' input.txt 
[email protected], [email protected], [email protected]
[email protected]
fifth.example.com, sixth.example.com

This auto-splits the input on commas and stores it in array @F. Then it uses map() to run a regex over each element of @F to remove everything outside of the angle brackets < and >.

It just assumes that everything inside angle brackets is a valid email address. Much of the time this is a reasonable assumption because that's how email addresses are supposed to be bounded.

This is not always true with real world email addresses and a better version would use Regexp::Common::Email::Address to validate each email address before accepting and printing it.


Here's a better version, written as a standalone script rather than a one-liner because it's easier to read and understand that way:

#!/usr/bin/perl

use strict;
use feature 'say';

use Regexp::Common qw(Email::Address);

while(<<>>) {
  next if m/^\s*$/;            # ignore empty lines
  chomp;                       # strip trailing \n

  my @addr = ();               # reset @addr array for each input line

  foreach (split /\s*,\s*/) {  # split on commas with optional whitespace
    next unless m/\@/;         # ignore elements without an @
    s/^.*<(.*?\@.*?)>.*/$1/;   # strip everything not inside < >

    # Add to @addr array if it's a valid address
    push @addr, $_ if  m/$RE{Email}{Address}/;
  };

  # print the addresses found on the current line, if any.
  say join(", ", @addr) if @addr;
};

Save that as, e.g., extract-addresses.pl and make it executable with chmod +x extract-addresses.pl. Then run it as ./extract-addresses.pl filename or pipe your data into it (it works with either or both).

For example, with an input file (input.txt) that contains this:

First Person <[email protected]>, Second Person <[email protected]>, Third <[email protected]>
Fourth Person Long Name <[email protected]>
Fifth <fifth.example.com>, Sixth Person <sixth.example.com>
bad<user> <[email protected]>, another<<<bad<<<user>>> xyz
<"<ohdear>first"@example.com>, Last, First <[email protected]>

It produces this output:

$ ./extract-addresses.pl input.txt 
[email protected], [email protected], [email protected]
[email protected]
[email protected]
[email protected]

Technically, another<<<bad<<<user>>> xyz is a valid email address even though it has neither an @ nor an FQDN (Fully Qualified Domain Name, i.e. a hostname or domain name). It could be OK as an alias or in a virtual domain. It is probably not a valid user account name, though even that is possible, depending on the kind of server. Anyway, this script is written to ignore addresses without an @ symbol in them.

And, as @StephenKitt points out, <"<ohdear>first"@example.com> is also a valid email address. This script does not handle pathological cases like that - as a judgemental programmer, I have arbitrarily decided that whoever owns that address does not deserve to receive email :-) and more importantly, as the great copout says, handling addresses like that is "left as an exercise for the reader". It probably wouldn't be too hard, but I haven't even finished my first coffee of the day yet.

5
  • Nice pointer to a possibly more robust solution — the issue isn’t so much the use of angle brackets as address boundaries, it’s that angle brackets can appear (quoted) inside either or both of the display name and the address’s local part. The latter is unusual but I do see angle brackets in display names regularly. Commented Sep 1 at 15:26
  • ^.*< is greedy, so should match everything before <(.*?)>. Still, users really like to be difficult so it's not impossible that they have <....> in the display name. or even unlikely. I've seen worse things. Commented Sep 1 at 15:31
  • Ah yes indeed, so that sorts out the display name, which will be fine for all real-world cases. <"<ohdear>first"@example.com> is valid and breaks this but getting that to work in practice is difficult! Commented Sep 1 at 15:39
  • 1
    , is common in display names as well, especially in Microsoft environments. Commented Sep 1 at 15:41
  • i'll look at this again in the morning. it's almost 2am here, time to get some sleep. Commented Sep 1 at 15:55
1

Using any awk, assuming the input is as simple/regular as in the provided example:

$ awk -F'[<>]' -v OFS=', ' '{for (i=2; i<=NF; i+=2) printf "%s%s", $i, (i<(NF-1) ? OFS : ORS)}' file
[email protected], [email protected], [email protected]
[email protected]
[email protected], [email protected]
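Why the loop starts at field 2 and steps by 2 becomes clearer if you dump the fields that -F'[<>]' produces. This is only an illustrative sketch with made-up example.org addresses:

```shell
#!/bin/sh
# Hypothetical sample line; names and addresses are placeholders.
fields=$(printf 'A <a@example.org>, B <b@example.org>\n' |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
# 1=[A ]
# 2=[a@example.org]
# 3=[, B ]
# 4=[b@example.org]
# 5=[]
```

Every < and > ends a field, so the addresses land in the even-numbered fields, and the text after the final > (empty here) becomes the last field. That is why the last address sits at NF-1 and the test i<(NF-1) decides between printing OFS and ORS.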
4
  • 1
    Your awk command, as usual, is simpler and better than mine. Upvoted. Commented Oct 7 at 18:14
  • 1
    @PrabhjotSingh it's just more portable, I'd probably use yours if I could rely on having GNU awk. I had upvoted yours too. Commented Oct 7 at 18:19
  • I want to know: is it okay to go with awk '{gsub(/^[^>]*<|>$/,"");gsub(/>,[^<]*</,", ")}1'? Commented Oct 7 at 19:19
  • @PrabhjotSingh yes, best I can tell that'd be fine given the OPs sample input. Commented Oct 7 at 19:23
