Changing the case of a string with awk

Question

I'm an awk newbie, so please bear with me.

The goal is to change the case of a string such that the first letter of every word is uppercase and the remaining letters are lowercase. (To keep the example simple, "word" is defined here as strictly alphabetic characters; all others are considered separators.)

I learned a nice way to make the first letter of every word uppercase from another post on this website using the following awk command:

echo 'abcd efgh ijkl mnop' | awk '{for (i=1;i <= NF;i++) {sub(".",substr(toupper($i),1,1),$i)} print}' --> Abcd Efgh Ijkl Mnop

Making the remaining letters lowercase is easily accomplished by preceding the awk command with a tr command:

echo 'aBcD EfGh ijkl MNOP' | tr 'A-Z' 'a-z' | awk '{for (i=1;i <= NF;i++) {sub(".",substr(toupper($i),1,1),$i)} print}' --> Abcd Efgh Ijkl Mnop

However, in the interest of learning more about awk, I wanted to change the case of all but the first letter to lowercase with a similar awk construct. I used the regular expression \B[A-Za-z]+ to match all letters of a word but the first, and the awk command substr(tolower($i),2) to provide those same letters in lowercase, as follows:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++) {sub("\B[A-Za-z]+",substr(tolower($i),2),$i)} print}' --> Abcd EFGH IJKL MNOP

Notice that the first word converted properly, but the remaining words are left unchanged. I would be very grateful for an explanation of why the remaining words did not convert properly and how to get them to do so.

Although I was trying to solve the problem with awk, thank you for the link to a nice Perl solution.
– scolfax, Commented Jan 3, 2013 at 14:20

Anders Johansson · Accepted Answer · 2013-01-03 13:55:45Z

10

The issue is that \B (zero-width non-word boundary) only seems to match at the beginning of the line, so $1 is converted, but $2 and the following fields do not match the regex, so they are not substituted and remain uppercase. I'm not sure why \B fails to match outside the first field; \B should match anywhere within any word:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1; i<=NF; ++i) { print match($i, /\B/); }}'
2   # \B matches ABCD at 2nd character as expected
0   # no match for EFGH
0   # no match for IJKL
0   # no match for MNOP
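
One possible explanation, offered as an assumption rather than something confirmed in this thread: an awk without support for \B may read it as a plain B (gawk prints a warning to that effect when \B appears in a string constant). ABCD is the only field that contains the letter B, and that letter sits at position 2. A minimal sketch that matches a literal B reproduces the same numbers on any awk; the function name first_B_positions is made up for illustration:

```shell
# Hypothetical helper: print the position of the first literal "B" in each field
# (0 means the field contains no "B").
first_B_positions() {
    awk '{ for (i = 1; i <= NF; ++i) print match($i, /B/) }'
}

echo 'ABCD EFGH IJKL MNOP' | first_B_positions
```

If the output is again 2, 0, 0, 0, the earlier result says nothing about word boundaries; it only shows where the letter B occurs.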

Anyway, to achieve your result (capitalize only the first character of the line), you can operate on $0 (the whole line) instead of using a for loop:

echo 'ABCD EFGH IJKL MNOP' | awk '{print toupper(substr($0,1,1)) tolower(substr($0,2)) }'

Or if you still wanted to capitalize each word separately but with awk only:

awk '{for (i=1; i<=NF; ++i) { $i=toupper(substr($i,1,1)) tolower(substr($i,2)); } print }'
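
The loop above splits on whitespace, so a field such as foo-bar counts as one word. For the question's stricter definition (a word is a run of letters, everything else is a separator), a sketch along the following lines may help. The names titlecase and titlecase_words are hypothetical, and [A-Za-z] assumes plain ASCII text:

```shell
# Hypothetical helper: capitalize every run of ASCII letters, copying all
# other characters through unchanged.
titlecase_words() {
    awk '
    function titlecase(s,    out, word) {
        out = ""
        # Find the next run of letters; keep whatever precedes it as-is.
        while (match(s, /[A-Za-z]+/)) {
            word = substr(s, RSTART, RLENGTH)
            out = out substr(s, 1, RSTART - 1)
            out = out toupper(substr(word, 1, 1)) tolower(substr(word, 2))
            s = substr(s, RSTART + RLENGTH)
        }
        return out s    # append any trailing separators
    }
    { print titlecase($0) }'
}

echo 'aBcD-EfGh ijkl_MNOP x' | titlecase_words
```

Because this version never assigns to $i, awk does not rebuild the record, so the original spacing between words is preserved.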

edited Jan 3, 2013 at 13:55

answered Jan 3, 2013 at 13:43

Anders Johansson


3 Comments

scolfax Over a year ago

Thank you for showing that \B unfortunately only operates at the beginning of the line. With your suggested construct without the for loop, the first word changed case as desired but the remainder became lowercase: Abcd efgh ijkl mnop

Anders Johansson Over a year ago

@scolfax, yes at first I interpreted your request "I wanted to change the case of all but the first letter to lowercase" as all letters of the line. But I also provided a second construct below, in case you wanted to uppercase the first letter of each word instead.

scolfax Over a year ago

That did it! Simpler than using the sub command. Thank you very much. I've been a grep and sed person for some time, but it's finally time to dive into awk!

Steve · Answer · 2013-01-03 14:10:25Z

4

When matching a regex with sub() or related functions such as gsub(), it's best to pass the pattern as a regex literal:

sub(/regex/, replacement, target)

This is different from what you have:

sub("regex", replacement, target)

So your command becomes:

awk '{ for (i=1;i<=NF;i++) sub(/\B\w+/, substr(tolower($i),2), $i) }1'

Results:

Abcd Efgh Ijkl Mnop

This article on String Functions may be worth a read. HTH.
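
As a small illustration of the difference between the two forms (a sketch, not taken from the linked article): a string constant is unescaped once before it is used as a regex, so a backslash meant for the regex has to be doubled, while a regex literal needs only one. The function name dot_to_dash is made up for this example:

```shell
# Hypothetical demo: the same "literal dot" pattern written both ways.
dot_to_dash() {
    awk '{
        a = $0; b = $0
        sub(/\./, "-", a)     # regex literal: a single backslash
        sub("\\.", "-", b)    # string constant: backslash doubled
        print a, b
    }'
}

echo 'a.b.c' | dot_to_dash
```

Both substitutions should replace only the first dot, printing a-b.c twice. What a single backslash in the string form does (for example "\.") varies between awk implementations, which is one reason the literal form is easier to reason about.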

I should say that there are easier ways to accomplish what you want, for example using GNU sed:

sed -r 's/\B\w+/\L&/g'

edited Jan 3, 2013 at 14:10

answered Jan 3, 2013 at 13:34

Steve


1 Comment

scolfax Over a year ago

The link you provided describes nicely why regex literals /.../ are much preferred over strings "...", and I will make that change henceforth. However, for some reason, I got the same result as before, with only the first word converting case: echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++) sub(/\B[A-Za-z]+/, substr(tolower($i),2),$i)}1' --> Abcd EFGH IJKL MNOP. I wonder if this is due to an awk version difference (mine is OS X, which by the way doesn't support \L in the sed command, unfortunately).

Pilou · Answer · 2013-01-03 14:01:00Z

3

My solution is to build the first argument of sub with a substr call instead of your regex:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1 ; i <= NF ; i++) {sub(substr($i,2),tolower(substr($i,2)),$i)} print }'
Abcd Efgh Ijkl Mnop

answered Jan 3, 2013 at 14:01

Pilou


6 Comments

scolfax Over a year ago

Using substr($i,1) itself as the regex is very creative. At first I thought I could run into problems if that sequence of characters was repeated later in the same word, but I believe sub matches only the first occurrence, so your solution should work fine. Thank you for this nice solution.

scolfax Over a year ago

Sorry, I meant substr($i,2). And cancel my comment about a second occurrence, since substr($i,2) consists of the entire word from the second character on.

scolfax Over a year ago

Actually, this construct does run into a problem if the 2nd-through-last characters match the 1st-through-(last - 1) characters: echo 'AAAA' | awk '{for (i=1 ; i <= NF ; i++) {sub(substr($i,2),tolower(substr($i,2)),$i)} print }' --> aaaA

Pilou Over a year ago

Indeed, I had thought about repeated sequences but not about the problem you pointed out. I'll probably look for another, improved solution. But since you have already chosen an answer, probably not immediately...

Pilou Over a year ago

You can apply sub to the whole of $i instead, but that ends up equivalent to the accepted solution:

echo 'ABCD EFGH IJKL MNOP AAAA' | awk '{for (i=1 ; i <= NF ; i++) {sub($i,substr($i,1,1)tolower(substr($i,2)),$i)} print }'
Abcd Efgh Ijkl Mnop Aaaa
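
Another way to handle the AAAA case from the comments, offered only as a sketch and not something proposed in the thread, is to anchor the pattern to the end of the field with $, so the match can no longer start at the first character. This assumes the fields contain only letters, because the field text itself is used as a regex; the function name lower_tail is hypothetical:

```shell
# Hypothetical helper: lowercase everything after the first character of each
# field by matching the tail anchored at the end of the field.
lower_tail() {
    awk '{
        for (i = 1; i <= NF; i++)
            sub(substr($i, 2) "$", tolower(substr($i, 2)), $i)
    } 1'
}

echo 'ABCD EFGH AAAA' | lower_tail
```

With the anchor in place, AAA$ can only match the last three characters of AAAA, so the first letter is left alone.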

coelhudo · Answer · 2013-01-03 13:29:22Z

1

You have to add another \ character before \B:

 echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++)
 {sub("\\B[A-Za-z]+",substr(tolower($i),2),$i)} print}'

With just \B, awk gave me this warning:

awk: cmd. line:1: warning: escape sequence `\B' treated as plain `B'

answered Jan 3, 2013 at 13:29

coelhudo


1 Comment

scolfax Over a year ago

I understand the use of the second \ character in the regex string: to make sure it is interpreted as \B rather than an escaped B. However, for some reason, my version of awk (OS X) seemed to interpret it as \B anyway, and adding the second \ character didn't change the result. But thanks for that good reminder.
