0

I have a file with the following format:
INTEGER INTEGER TEXT

The text is Unicode and can contain spaces.
I am trying to use awk to print the first INTEGER and the TEXT of each line in a specific format using printf.
Problem: because TEXT contains spaces in some lines, $3 does not hold the complete TEXT; the rest of it is split into further fields.

Example:

12 42956    Cinema - 3D/Multiplex  
7  12560    Status Update  
5  184   Movie  

My approach for this is the following:

awk '{ c=$3; for(i=4; i<=NF; ++i){c=c" "$i}; printf "<tag>%d</tag>\n<tag>%s</tag>\n", $1,c}'

But I suspect there is a better approach.

6
  • Can you add a sample input/output? Commented Feb 9, 2017 at 10:00
  • @Sundeep: Please see the updated OP. Commented Feb 9, 2017 at 19:05
  • @don_crissti: The first is 1 space, the second is more than 1 space. But how can I replace them if I can't separate the line properly? Commented Feb 9, 2017 at 20:08
  • @don_crissti: What is the /2;? Commented Feb 9, 2017 at 20:15
  • @don_crissti: But does this work only in a script and not in the terminal? Pressing tab doesn't seem to work. My mode is vi, in case it matters. Commented Feb 9, 2017 at 20:20

5 Answers

1

awk is useful if the data comes in well-delimited records. This data does not. However, the data is in the format "integer stuff the_rest", where neither "integer" nor "stuff" contains spaces. This happens to be exactly what the read utility likes to read. It reads whitespace-separated words, as many as you give it variables for, and then puts "the rest" of the line into the last variable.

bash-4.4$ while read -r integer stuff the_rest; do printf '%d\t"%s"\n' "$integer" "$the_rest"; done <data
12      "Cinema - 3D/Multiplex"
7       "Status Update"
5       "Movie"

It will automatically strip off any trailing whitespace.

1

To extract fields based on a pattern, perl is generally better than awk:

perl -lne '
  if (/^\s*(\d+)\s*\S+\s*(.*?)\s*$/) {
    print "<tag>$1</tag><tag>$2</tag>"
  }'

which on your input gives:

<tag>12</tag><tag>Cinema - 3D/Multiplex</tag>
<tag>7</tag><tag>Status Update</tag>
<tag>5</tag><tag>Movie</tag>

That also lets you do more advanced things, such as proper HTML encoding if needed, for instance:

perl -Mopen=locale -MHTML::Entities -lne '
  if (/^\s*(\d+)\s*\S+\s*(.*?)\s*$/) {
    print map {"<tag>" . encode_entities($_) . "</tag>"} $1, $2
  }'

Or XML encoding:

perl -Mopen=locale -MXML::LibXML -lne '
  if (/^\s*(\d+)\s*\S+\s*(.*?)\s*$/) {
    print map {
      my $e = XML::LibXML::Element->new("tag");
      $e->appendText($_);
      $e->toString} $1, $2
  }'
1

Replace $2 (which you don't use anyway) with an unused character (one that doesn't occur in your strings). After that, just do:

awk '{$2="+";print}' input-file.txt | awk -F "+" '{printf "<tag>%d</tag>\n<tag>%s</tag>\n",$1,$2}'

Above I used the plus "+" as the separator.

It is not the most elegant solution, but it is simple.

4
  • Seems clever. BTW, the input file is missing, and you can avoid the pipe with process substitution: awk -F "+" '{printf "<tag>%d</tag>\n<tag>%s</tag>\n",$1,$2}' <(awk '{$2="+";print}' file) Commented Feb 10, 2017 at 12:28
  • Thank you @george-vasiliou. I added the missing input file. By the way, why is process substitution better than a pipe? Commented Feb 10, 2017 at 18:21
  • It is well explained here: mywiki.wooledge.org/ProcessSubstitution Commented Feb 10, 2017 at 19:23
  • Welcome. The wooledge website has tons of info about bash. Also check out the whole bash FAQ: mywiki.wooledge.org/BashFAQ Commented Feb 14, 2017 at 13:34
0

I think you might want something like

awk '{$2=""; print;}' input
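
As a rough sketch of how this idea could be extended to the output format asked for in the question: the function name extract_tags is hypothetical, and the code assumes the first two fields never contain spaces. Note that assigning to a field makes awk rebuild the record with single spaces, so runs of spaces inside TEXT are collapsed and trailing whitespace is dropped.

```shell
# Hypothetical helper (name is an assumption, not from the answer above):
# prints field 1 and everything after field 2 in the <tag> layout.
extract_tags() {
  awk '{
    n = $1           # keep the first integer before the record is modified
    $1 = ""; $2 = "" # blank both integers; awk rebuilds $0 joined by single spaces
    sub(/^ +/, "")   # drop the leading spaces left behind by the blanked fields
    printf "<tag>%d</tag>\n<tag>%s</tag>\n", n, $0
  }' "$@"
}

# Usage: reads the named files, or standard input when no file is given.
printf '12 42956    Cinema - 3D/Multiplex  \n5  184   Movie  \n' | extract_tags
```

On the two sample lines this should print <tag>12</tag> and <tag>Cinema - 3D/Multiplex</tag>, then <tag>5</tag> and <tag>Movie</tag>, each on its own line.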
0

If this is not a huge file, and since the text is always at the end, you might consider a classic bash approach as an alternative:

while IFS=' ' read -r int1 int2 text; do
  # do your stuff
done <file

As usual with while read, the last variable in the read command ($text here) receives all the remaining fields as a single value.

Testing:

$ IFS=' ' read -r int1 int2 text <<<"10 5 some text here"
$ echo "$text"
some text here

A bash while read loop can be quite slow on big data files, but you can give it a try in your case.
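
To make the "do your stuff" placeholder concrete, here is a minimal sketch that fills the loop body with the printf format from the question. The function name print_tags is hypothetical, and the sketch assumes every line starts with two space-separated integers.

```shell
# Hypothetical helper (name is an assumption): reads "INTEGER INTEGER TEXT"
# lines from standard input and prints the first integer and the text as tags.
print_tags() {
  # int2 is read only so that the second integer does not end up in $text.
  while IFS=' ' read -r int1 int2 text; do
    printf '<tag>%d</tag>\n<tag>%s</tag>\n' "$int1" "$text"
  done
}

# Usage: feed it a file or a pipe.
printf '12 42956    Cinema - 3D/Multiplex  \n7  12560    Status Update  \n' | print_tags
```

Because a space is in IFS, read also strips the trailing spaces from the last variable, so no extra trimming is needed.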
