Treat a column with text that has spaces as 1 field

Question

I have a file with the following format:
INTEGER INTEGER TEXT

The text is unicode and can have spaces.
I am trying to use awk to print the first INTEGER and the TEXT to a file in a specific format using printf.
Problem: because TEXT contains spaces on some lines, $3 does not hold the complete TEXT; the line is split into more fields.

Example:

12 42956    Cinema - 3D/Multiplex  
7  12560    Status Update  
5  184   Movie

My approach for this is the following:

awk '{ c=$3; for(i=4; i<=NF; ++i){c=c" "$i}; printf "<tag>%d</tag>\n<tag>%s</tag>\n", $1,c}'

But I thought there might be a better approach.

@don_crissti: The first is 1 space, the second is more than 1 space. But how can I replace them if I can't separate the line properly?
– Jim, Commented Feb 9, 2017 at 20:08
@don_crissti: But does this work only in a script and not in the terminal? Pressing Tab doesn't seem to work. My mode is vi, in case it matters.
– Jim, Commented Feb 9, 2017 at 20:20

Kusalananda · Accepted Answer · 2017-02-10 12:03:32Z

awk is useful if the data comes in well-delimited records. This data does not. However, the data is in the format "integer stuff the_rest", where neither "integer" nor "stuff" contains spaces. This happens to be exactly what the read utility likes to read. It reads whitespace-separated words, as many as you give it variables for, and then puts "the rest" of the line into the last variable.

bash-4.4$ while read -r integer stuff the_rest; do printf '%d\t"%s"\n' "$integer" "$the_rest"; done <data
12      "Cinema - 3D/Multiplex"
7       "Status Update"
5       "Movie"

It will automatically strip off any trailing whitespace.
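
As a minimal sketch, the same loop can emit the two-line <tag> format from the question. The function name to_tags is made up for illustration and the default IFS is assumed; runs of spaces inside the text are kept as they are.

```shell
# Hypothetical helper: reads "INTEGER INTEGER TEXT" lines on stdin and
# prints the first integer and the text in the <tag> format of the question.
to_tags() {
  while read -r first second text; do
    # Skip blank lines, which leave every variable empty
    [ -n "$first" ] || continue
    printf '<tag>%d</tag>\n<tag>%s</tag>\n' "$first" "$text"
  done
}

# Sample lines from the question, including the trailing blanks
printf '12 42956    Cinema - 3D/Multiplex  \n7  12560    Status Update  \n5  184   Movie\n' | to_tags
```

The second variable only soaks up the unused middle column so that "the rest" lands in the third.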

Stéphane Chazelas · Accepted Answer · 2017-02-10 12:57:33Z

To extract fields based on a pattern, perl is generally better than awk:

perl -lne '
  if (/^\s*(\d+)\s*\S+\s*(.*?)\s*$/) {
    print "<tag>$1</tag><tag>$2</tag>"
  }'

which on your input gives:

<tag>12</tag><tag>Cinema - 3D/Multiplex</tag>
<tag>7</tag><tag>Status Update</tag>
<tag>5</tag><tag>Movie</tag>

That means you can do more advanced things, such as proper HTML encoding if needed, for instance:

perl -Mopen=locale -MHTML::Entities -lne '
  if (/^\s*(\d+)\s*\S+\s*(.*?)\s*$/) {
    print map {"<tag>" . encode_entities($_) . "</tag>"} $1, $2
  }'

Or XML encoding:

perl -Mopen=locale -MXML::LibXML -lne '
  if (/^\s*(\d+)\s*\S+\s*(.*?)\s*$/) {
    print map {
      my $e = XML::LibXML::Element->new("tag");
      $e->appendText($_);
      $e->toString} $1, $2
  }'

Daniel Vasconcelos · Accepted Answer · 2017-02-17 10:52:34Z

Replace $2 (which you don't use anyway) with an unused character (one that doesn't occur in your strings). After that, just do:

awk '{$2="+";print}' input-file.txt | awk -F "+" '{printf "<tag>%d</tag>\n<tag>%s</tag>\n",$1,$2}'

Above I used the plus "+" as the separator.

It is not the most elegant solution, but it is simple.
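
A variant that needs neither a pipe nor a spare separator character is to delete the first two fields from the record with sub(). This is a sketch of a different technique, not the command above; the function name strip_two_fields is made up, and the columns are assumed to be separated by spaces or tabs.

```shell
# Hypothetical helper: reads the files given as arguments, or stdin.
strip_two_fields() {
  awk '{
    first = $1
    # Delete the first two blank-delimited fields from the record;
    # sub() returns 0 when the line does not have that shape.
    if (sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, "")) {
      sub(/[ \t]+$/, "")    # drop trailing blanks
      printf "<tag>%d</tag>\n<tag>%s</tag>\n", first, $0
    }
  }' "$@"
}

printf '12 42956    Cinema - 3D/Multiplex  \n' | strip_two_fields
```

Because no field is reassigned, $0 is never rebuilt, so the spacing inside the text is left untouched.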

answered Feb 10, 2017 at 12:25

Seems clever. BTW, you forgot the input file, and you can avoid the pipe with process substitution: awk -F "+" '{printf "<tag>%d</tag>\n<tag>%s</tag>\n",$1,$2}' <(awk '{$2="+";print}' file)
– George Vasiliou, Commented Feb 10, 2017 at 12:28
Thank you @george-vasiliou. I corrected the missing input file. By the way, why is process substitution better than a pipe?
– Daniel Vasconcelos, Commented Feb 10, 2017 at 18:21
Well explained here: mywiki.wooledge.org/ProcessSubstitution
– George Vasiliou, Commented Feb 10, 2017 at 19:23
Welcome. The wooledge site has tons of info about bash. Also check out the whole Bash FAQ: mywiki.wooledge.org/BashFAQ
– George Vasiliou, Commented Feb 14, 2017 at 13:34

Michael Vehrs · Accepted Answer · 2017-02-10 10:44:02Z

I think you might want something like

awk '{$2=""; print;}' input
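
One way this idea could be carried through to the requested output, as a sketch only: blank both leading fields and trim the separators they leave behind. The function name blank_fields is made up for illustration.

```shell
# Hypothetical helper: reads the files given as arguments, or stdin.
blank_fields() {
  awk '{
    first = $1
    $1 = ""; $2 = ""      # assigning to a field rebuilds $0 using OFS
    sub(/^ +/, "")        # remove the separators left by the emptied fields
    printf "<tag>%d</tag>\n<tag>%s</tag>\n", first, $0
  }' "$@"
}

printf '7  12560    Status Update  \n' | blank_fields
```

Note that rebuilding the record joins the remaining fields with single spaces, so any run of spaces inside the text is collapsed to one.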

George Vasiliou · Accepted Answer · 2017-02-10 12:24:17Z

If the file is not huge, and since the text is always at the end, you might consider a classic bash approach as an alternative:

while IFS=' ' read -r int1 int2 text;do
#do your stuff
done <file

As usual with a while read loop, the last variable in the read command ($text) receives all the remaining fields as one field.

Testing:

$ IFS=' ' read -r int1 int2 text <<<"10 5 some text here"
$ echo "$text"
some text here

A bash while read loop can be quite slow on big data files, but you can give it a try in your case.
