
I have a file as below:

This is an _PLUTO_
This is _PINEAPPLE_
This is _ORANGE_
This is _RICE_

I'm using the code below to produce the output:

awk '{ print "Country: "  $NF }'  report.txt   

Output:

Country: _PLUTO_
Country: _PINEAPPLE_
Country: _ORANGE_
Country: _RICE_

How do I remove all the underscores so that my output looks like this:

Country: PLUTO
Country: PINEAPPLE
Country: ORANGE
Country: RICE
3

4 Answers

8

You can use this snippet:

$ awk '{ gsub("_", "", $NF); print "Country: " $NF }' report.txt
Country: PLUTO
Country: PINEAPPLE
Country: ORANGE
Country: RICE

Note that gsub() will perform the modification in place, so it will store the result of the substitution back to $NF, in your case.
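As a small sketch of that in-place behaviour (the two sample lines are taken from the question; the variable names `input` and `result` are only for illustration), the following prints the return value of gsub(), the modified $NF, and the rebuilt $0:

```shell
# Two sample lines from the question's report.txt.
input='This is an _PLUTO_
This is _PINEAPPLE_'

# gsub() returns the number of replacements and rewrites $NF itself.
# Assigning to a field also rebuilds $0 from the fields.
result=$(printf '%s\n' "$input" |
    awk '{ n = gsub("_", "", $NF); print n, $NF, "|", $0 }')
printf '%s\n' "$result"
```

Because $0 is rebuilt as well, a plain print after the gsub() call would output the whole line without the underscores in its last field.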

If you're using GNU awk, you can use gensub() instead, which is slightly simpler:

$ gawk '{ print "Country: " gensub("_", "", "g", $NF) }' report.txt
Country: PLUTO
Country: PINEAPPLE
Country: ORANGE
Country: RICE

See GNU awk documentation for gsub() and gensub() for more details.

1
  • 1
    awk '{gsub("_", "", $0); print}' report.txt works as well. When print is called with no arguments, it prints the whole record, AKA $0. Also, if you are using Solaris by any chance, you need to use nawk for gsub to be available. On Red Hat Linux 6.x, nawk is a link to gawk, which also supports gsub. Commented Jan 3, 2019 at 21:35
1

Try:

awk -F_ '{ print "Country: " $(NF-1) }' infile

You could try sed instead.

sed -r 's/[^_]*_([^_]*)_.*/Country: \1/' infile
  • [^_]*_ matches everything up to and including the first _.
  • ([^_]*)_ matches everything after that up to the next _, and .* matches the rest of the line; only the (...) part is kept as a captured group.
  • \1 is the back-reference to the ([^_]*) captured group.
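To see why $(NF-1) is the field to print with -F_, and what the sed expression produces, here is a minimal sketch on one line from the question. It assumes a sed that accepts -E for extended regular expressions (used here in place of -r); the variable names are only for illustration.

```shell
line='This is an _PLUTO_'

# With "_" as the field separator the line splits into three fields;
# the last one is empty because the line ends in "_".
fields=$(printf '%s\n' "$line" |
    awk -F_ '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"

# The same sed expression as above, with -E instead of -r.
country=$(printf '%s\n' "$line" | sed -E 's/[^_]*_([^_]*)_.*/Country: \1/')
printf '%s\n' "$country"
```

The word sits in field 2 of 3, which is why $(NF-1) selects it.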
1

Using sed instead:

$ sed -E 's/^This is (an? )?/Country: /; s/\<_//; s/_\>//' file
Country: PLUTO
Country: PINEAPPLE
Country: ORANGE
Country: RICE

This applies three substitutions:

  1. Replaces the text "This is ", optionally followed by "a " or "an ", with "Country: ".
  2. Removes _ at the start of a word.
  3. Removes _ at the end of a word.

The last two substitutions allow for data of the form

This is a _big_blue_ball_

which would be transformed into

Country: big_blue_ball

and not

Country: big blue ball
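The difference between trimming only the flanking underscores and deleting every underscore can be checked with a short sketch. It does not use the \< and \> word-boundary operators from the command above, which not every sed supports; instead it assumes words are separated by single spaces and anchors on a space or on the line edge. The variable names are only for illustration.

```shell
line='This is a _big_blue_ball_'

# Remove "_" only after a space (or line start) and before a space (or line end).
flanking=$(printf '%s\n' "$line" | sed -E 's/(^| )_/\1/; s/_( |$)/\1/')

# Remove every underscore on the line.
global=$(printf '%s\n' "$line" | sed 's/_//g')

printf '%s\n' "$flanking" "$global"
```

The first command keeps the inner underscores, while s/_//g joins the words together.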

An awk alternative that just ignores the first part of each line and trims the first and last characters off of the last whitespace-delimited field:

awk '{ printf("Country: %s\n", substr($NF, 2, length($NF)-2)) }' file
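One thing to keep in mind with the substr() approach is that it drops the first and last character of the field whether or not they are underscores. A minimal sketch, where `trim` is a hypothetical helper wrapping the awk command above:

```shell
# Hypothetical helper: run the substr()-based awk command on one line of input.
trim() {
    printf '%s\n' "$1" |
        awk '{ printf("Country: %s\n", substr($NF, 2, length($NF)-2)) }'
}

wrapped=$(trim 'This is _RICE_')   # last field is flanked by underscores
bare=$(trim 'This is RICE')        # last field has no underscores

printf '%s\n' "$wrapped" "$bare"
```

The second call loses the "R" and the "E", so this variant is only safe when every line is known to have the flanking underscores.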
3
  • With sed, you can also simply use this: sed 's/_//g' report.txt to delete all underscores. If you want to change the file itself, you can do an in-line replace: sed -i 's/_//g' report.txt Commented Jan 3, 2019 at 20:58
  • @Larry Sure, but the point I was making is that one may only want to delete the flanking underscores, and the rest of that field could contain underscores that should be preserved. Commented Jan 3, 2019 at 21:07
  • Indeed, that can be quite useful. If you want to restrict the regular expression so that it does not touch other lines, it is also a good idea to first filter for lines of the required format (containing "This is a/an", etc.). Kudos. Commented Jan 3, 2019 at 21:14
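A rough sketch of that filtering idea, using an awk pattern so that only lines starting with "This is " are processed. The extra input line `some_other_line` and the variable names are made up for illustration:

```shell
# Sample input: two lines in the expected format and one unrelated line.
mixed='This is _ORANGE_
some_other_line
This is _RICE_'

# The /^This is / pattern limits the action to lines of the expected format;
# other lines are skipped and left out of the output.
filtered=$(printf '%s\n' "$mixed" |
    awk '/^This is / { gsub("_", "", $NF); print "Country: " $NF }')
printf '%s\n' "$filtered"
```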
0

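Using Python:

#!/usr/bin/env python3
import re

# Match from the first underscore to the end of the line.
pattern = re.compile(r'_.*')

with open('file.txt') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            # '_PLUTO_' split on '_' gives ['', 'PLUTO', '']; take the middle part.
            print("Country:", match.group().split('_')[-2])

Output: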

Country: PLUTO
Country: PINEAPPLE
Country: ORANGE
Country: RICE
