11

I'm an awk newbie, so please bear with me.

The goal is to change the case of a string such that the first letter of every word is uppercase and the remaining letters are lowercase. (To keep the example simple, "word" is defined here as strictly alphabetic characters; all others are considered separators.)

I learned a nice way to make the first letter of every word uppercase from another post on this website using the following awk command:

echo 'abcd efgh ijkl mnop' | awk '{for (i=1;i <= NF;i++) {sub(".",substr(toupper($i),1,1),$i)} print}' --> Abcd Efgh Ijkl Mnop

Making the remaining letters lowercase is easily accomplished by preceding the awk command with a tr command:

echo 'aBcD EfGh ijkl MNOP' | tr 'A-Z' 'a-z' | awk '{for (i=1;i <= NF;i++) {sub(".",substr(toupper($i),1,1),$i)} print}' --> Abcd Efgh Ijkl Mnop

However, in the interest of learning more about awk, I wanted to change the case of all but the first letter to lowercase with a similar awk construct. I used the regular expression \B[A-Za-z]+ to match all letters of a word but the first, and the awk command substr(tolower($i),2) to provide those same letters in lowercase, as follows:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++) {sub("\B[A-Za-z]+",substr(tolower($i),2),$i)} print}' --> Abcd EFGH IJKL MNOP

Notice that the first word converted properly, but the remaining words are left unchanged. I would be very grateful for an explanation of why the remaining words did not convert properly and how to get them to do so.

2
  • you can find the solution here Commented Jan 3, 2013 at 13:40
  • Although I was trying to solve the problem with awk, thank you for the link to a nice perl solution. Commented Jan 3, 2013 at 14:20

4 Answers

10

The issue is that \B (zero-width non-word boundary) only seems to match at the beginning of the line: $1 works, but $2 and the following fields do not match the regex, so they are not substituted and remain uppercase. I'm not sure why \B doesn't match except in the first field; \B should match anywhere within any word:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1; i<=NF; ++i) { print match($i, /\B/); }}'
2   # \B matches ABCD at 2nd character as expected
0   # no match for EFGH
0   # no match for IJKL
0   # no match for MNOP
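
A possible explanation, offered as an assumption rather than a confirmed diagnosis: \B is a GNU regexp operator, and an awk that does not implement it (or any awk given "\B" inside a string constant) may read it as a plain B. ABCD is the only field that contains a B, which would account for the output above. The sketch below uses a literal B on purpose and reproduces the asker's result; the function name literal_b_demo is made up for illustration.

```shell
# Hypothetical helper: same loop as in the question, but with a literal B
# where the question had \B.
literal_b_demo() {
  awk '{
    # /B[A-Za-z]+/ can only match in a field that contains an uppercase B
    for (i = 1; i <= NF; i++)
      sub(/B[A-Za-z]+/, substr(tolower($i), 2), $i)
    print
  }'
}

echo 'ABCD EFGH IJKL MNOP' | literal_b_demo
```

If this reading is right, the fields after the first stay uppercase simply because they contain no B, not because of where they sit on the line.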

Anyway, to achieve your result (capitalize only the first character of the line), you can operate on $0 (the whole line) instead of using a for loop:

echo 'ABCD EFGH IJKL MNOP' | awk '{print toupper(substr($0,1,1)) tolower(substr($0,2)) }'

Or if you still wanted to capitalize each word separately but with awk only:

awk '{for (i=1; i<=NF; ++i) { $i=toupper(substr($i,1,1)) tolower(substr($i,2)); } print }'
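
Note that this loop treats whitespace-separated fields as words, and assigning to $i makes awk rebuild the line with single spaces between fields. If "word" should mean a run of letters with everything else acting as a separator, as the question defines it, a match()-based loop is one option. This is a minimal sketch assuming ASCII letters only; titlecase is an illustrative name, not something from the answer.

```shell
# Hypothetical helper: title-case every run of ASCII letters, leaving all
# other characters (spaces, hyphens, commas, ...) exactly as they were.
titlecase() {
  awk '{
    out = ""; rest = $0
    # Peel off the next run of letters; the text before it is separator material.
    while (match(rest, /[A-Za-z]+/)) {
      word = substr(rest, RSTART, RLENGTH)
      out = out substr(rest, 1, RSTART - 1) toupper(substr(word, 1, 1)) tolower(substr(word, 2))
      rest = substr(rest, RSTART + RLENGTH)
    }
    # Whatever remains contains no letters, so append it unchanged.
    print out rest
  }'
}

printf '%s\n' 'aBcD-EfGh ijkl,MNOP' | titlecase
```

Because the line is rebuilt by hand rather than through field assignment, the original spacing and punctuation are kept.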

3 Comments

Thank you for showing that \B unfortunately only operates at the beginning of the line. With your suggested construct without the for loop, the first word changed case as desired but the remainder became lowercase: Abcd efgh ijkl mnop
@scolfax, yes at first I interpreted your request "I wanted to change the case of all but the first letter to lowercase" as all letters of the line. But I also provided a second construct below, in case you wanted to uppercase the first letter of each word instead.
That did it! Simpler than using the sub command. Thank you very much. I've been a grep and sed person for some time, but it's finally time to dive into awk!
4

When matching regex using the sub() function or others (like gsub() etc), it's best used in the following form:

sub(/regex/, replacement, target)

This is different from what you have:

sub("regex", replacement, target)

So your command becomes:

awk '{ for (i=1;i<=NF;i++) sub(/\B\w+/, substr(tolower($i),2), $i) }1'

Results:

Abcd Efgh Ijkl Mnop

This article on String Functions may be worth a read. HTH.
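
One practical difference between the two forms is backslash handling: a string constant is processed for escape sequences before it is used as a regex, so a backslash that should reach the regex has to be doubled. Here is a small sketch with a literal dot, which should behave the same in any POSIX awk; the names dot_literal and dot_string are made up for this example.

```shell
# Regex literal: one backslash is enough to escape the dot.
dot_literal() { awk '{ sub(/\./, "-"); print }'; }

# String constant: the backslash is doubled so that one survives string processing.
dot_string() { awk '{ sub("\\.", "-"); print }'; }

echo 'a.b.c' | dot_literal
echo 'a.b.c' | dot_string
```

Both calls replace only the first dot, since sub() substitutes a single match.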


I should say that there are easier ways to accomplish what you want, for example using GNU sed:

sed -r 's/\B\w+/\L&/g'

1 Comment

The link you provided describes nicely why regex literals /.../ are much preferred over strings "...", and I will make that change henceforth. However, for some reason, I got the same result as before, with only the first word converting case: echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++) sub(/\B[A-Za-z]+/, substr(tolower($i),2),$i)}1' --> Abcd EFGH IJKL MNOP. I wonder if this is due to an awk version difference (mine is OS X, which by the way doesn't support \L in the sed command, unfortunately).
3

My solution is to build the first argument of sub with substr instead of your regex:

echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1 ; i <= NF ; i++) {sub(substr($i,2),tolower(substr($i,2)),$i)} print }'
Abcd Efgh Ijkl Mnop

6 Comments

Using substr($i,1) itself as the regex is very creative. At first I thought I could run into problems if that sequence of characters was repeated later in the same word, but I believe sub matches only the first occurrence, so your solution should work fine. Thank you for this nice solution.
Sorry, I meant substr($i,2). And cancel my comment about a second occurrence, since ($i,2) consists of the entire word from the second character on.
Actually, this construct does run into a problem if the 2nd-through-last characters match the 1st-through-(last - 1) characters: echo 'AAAA' | awk '{for (i=1 ; i <= NF ; i++) {sub(substr($i,2),tolower(substr($i,2)),$i)} print }' --> aaaA
Indeed, I thought about repeated sequences but not about the problem you pointed out. I'll probably look for another, improved solution, but since you have already chosen an answer, probably not immediately...
You can substitute the whole of $i to be safe, but that ends up equivalent to the accepted solution: echo 'ABCD EFGH IJKL MNOP AAAA' | awk '{for (i=1 ; i <= NF ; i++) {sub($i,substr($i,1,1)tolower(substr($i,2)),$i)} print }' --> Abcd Efgh Ijkl Mnop Aaaa
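
The two approaches discussed in these comments can be compared side by side. This is only a sketch with made-up function names (via_sub, via_assign). Beyond the AAAA case, using field text as a regex also means that characters such as . or + in a word could be read as regex operators, which direct assignment avoids.

```shell
# Uses the tail of each word as the regex, as in the answer above.
via_sub() {
  awk '{ for (i = 1; i <= NF; i++) sub(substr($i, 2), tolower(substr($i, 2)), $i); print }'
}

# Builds each word directly, with no regex involved.
via_assign() {
  awk '{ for (i = 1; i <= NF; i++) $i = toupper(substr($i, 1, 1)) tolower(substr($i, 2)); print }'
}

echo 'ABCD AAAA' | via_sub
echo 'ABCD AAAA' | via_assign
```

In via_sub the regex AAA matches at the leftmost position of AAAA, so the first three letters are lowercased instead of the last three.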
1

You have to add another \ character before \B:

 echo 'ABCD EFGH IJKL MNOP' | awk '{for (i=1;i <= NF;i++)
 {sub("\\B[A-Za-z]+",substr(tolower($i),2),$i)} print}'

With just \B awk gave me this warning:

awk: cmd. line:1: warning: escape sequence `\B' treated as plain `B'

1 Comment

I understand the use of the second \ character in the regex string: to make sure it is interpreted as \B rather than an escaped B. However, for some reason, my version of awk (OS X) seemed to interpret it as \B anyway, and adding the second \ character didn't change the result. But thanks for that good reminder.
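
Since the results in this thread differ between awk versions, it may help to probe how a given awk treats \B. The check below is a rough sketch resting on an assumption: an awk that implements \B reports the first match inside EFGH at position 2, while an awk that reads \B as a plain B finds nothing and reports 0. The function name is invented for illustration.

```shell
# Hypothetical probe: prints "yes" if this awk appears to implement \B as a
# non-word-boundary operator, "no" otherwise (including when awk rejects the regex).
awk_supports_nonword_boundary() {
  pos=$(printf 'EFGH\n' | awk '{ print match($0, /\B/) }' 2>/dev/null) || pos=""
  if [ "$pos" = "2" ]; then
    echo yes
  else
    echo no
  fi
}

awk_supports_nonword_boundary
```

A "no" here would suggest avoiding \B altogether and using the substr-based assignment from the accepted answer.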
