MAIN QUESTION:

Is there a regex for preserving case pattern in the vein of \U and \L?
Ideally, it would also respect word boundaries and anchors.

Example

Assume we have a large body of text where we want to convert one word to another, while preserving the capitalization of the word. For example, replacing all instances of "date" with "month"

 Input: `"This Date is a DATE that is daTe and date."`
Output: `"This Month is a MONTH that is moNth and month."`

 input     output
------     -------
"date" ~~> "month"
"Date" ~~> "Month"
"DATE" ~~> "MONTH"
"daTe" ~~> "moNth"   ## This example might be asking for too much.

Preserving word boundaries

I'd be interested in a solution that preserves word boundaries (i.e., is able to match "whole word" only). In the given example, "date" would be changed, but not "dated".
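
As a rough illustration of what "whole word only" means in regex terms, here is a minimal Python sketch (Python rather than R, purely for illustration). The sample sentence is made up; `\b` is the standard word-boundary assertion.

```python
import re

# Hypothetical sample text: only the standalone word should match.
text = "The date is set, but dated and sedate must not match. DATE too."

# \b asserts a word boundary on each side; IGNORECASE catches every casing.
whole_word = re.compile(r"\bdate\b", re.IGNORECASE)

matches = whole_word.findall(text)
print(matches)
```

Only the standalone occurrences are returned; "dated" and "sedate" fail the boundary check on one side.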


Existing Workaround in R:

I currently use three nested calls to sub to accomplish this.

input <- c("date", "Date", "DATE")
expected.out <- c("month", "Month", "MONTH")

sub("date", "month", 
  sub("Date", "Month", 
    sub("DATE", "MONTH", input)
  )
)

The goal is to have a single pattern and a single replace such as

gsub("(date)", "\\Umonth", input, perl=TRUE) 

which will yield the desired output


Notes (updated 2023)

  1. The motivation behind the question is to expand knowledge about the capabilities of RegEx. The below example is given only as an illustration. The purpose of this question is not to find alternate workarounds.
  2. The question was asked with the R tag, but I would accept answers that invoke flavors of RegEx not currently available in R.

Comments

  • Why not just use a map via a named vector: map <- setNames(expected.output, input). Then do month <- map[date]. Commented Oct 3, 2014 at 0:11
  • @flodel - smart thinking - there's really no need for any regex here. Commented Oct 3, 2014 at 0:14
  • @flodel -- I suspect Ricardo is also wanting a solution that'll work for inputs like input <- "Here are a date, a Date, and a DATE" Commented Oct 3, 2014 at 0:17
  • My gut says you can't do it with a single regex; use a for loop or get fancy with a Reduce. Commented Oct 3, 2014 at 0:31
  • @r2evans To the best of my knowledge, there is no way to do what I was asking for with RegEx. The answers given simply offer alternate workarounds. The core question is: "Is there a regex for preserving case pattern in the vein of \U and \L?" AFAIK, the answer is "No," although I have just added a bounty to the question. Thanks for the ping on this :) Commented Jun 16, 2023 at 20:57

6 Answers

This is one of those occasions when I think a for loop is justified:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

for(i in seq_along(pat)) { input <- gsub(pat[i],ret[i],input) }
input
#[1] "Here are a month, a Month, and a MONTH" 
#[2] "Here are a month, a Month, and a MONTH"

And an alternative courtesy of @flodel implementing the same logic as the loop through Reduce:

Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input)
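
For readers coming from other languages, the same fold over (pattern, replacement) pairs can be sketched in Python with `functools.reduce`. This is only an illustrative equivalent of the loop and Reduce() call above, not part of the original answer.

```python
import re
from functools import reduce

# Same sample data as the R loop above.
text = "Here are a date, a Date, and a DATE"
pairs = [("date", "month"), ("Date", "Month"), ("DATE", "MONTH")]

# Fold over the pairs, feeding the result of each pass into the next one.
# re.sub is case-sensitive by default, so each pass touches one casing only.
result = reduce(lambda s, pair: re.sub(pair[0], pair[1], s), pairs, text)
print(result)
```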

For some benchmarking of these options, see @TylerRinker's answer.

Here's a qdap approach. Pretty straightforward but not the fastest:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")


library(qdap)
mgsub(pat, ret, input)

## [1] "Here are a month, a Month, and a MONTH"
## [2] "Here are a month, a Month, and a MONTH"

Benchmarking:

input <- rep("Here are a date, a Date, and a DATE",1000)

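library(microbenchmark)
library(gsubfn)  # for gsubfn(); qdap is already loaded above for mgsub()

(op <- microbenchmark( 
    GSUBFN = gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), 
             input, ignore.case=TRUE),
    QDAP = mgsub(pat, ret, input),
    REDUCE = Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input),
    # Time the loop itself, working on a copy so `input` is not overwritten;
    # wrapping it in function() would only time the creation of a closure.
    FOR = {
       out <- input
       for(i in seq_along(pat)) { 
          out <- gsub(pat[i],ret[i],out) 
       }
       out
    },
    times=100L))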

## Unit: milliseconds
##    expr        min         lq     median         uq        max neval
##  GSUBFN 682.549812 815.908385 847.361883 925.385557 1186.66743   100
##    QDAP  10.499195  12.217805  13.059149  13.912157   25.77868   100
##  REDUCE   4.267602   5.184986   5.482151   5.679251   28.57819   100
##     FOR   4.244743   5.148132   5.434801   5.870518   10.28833   100

2 Comments

I want to select this as the answer, simply for the benchmarks :)
The qdap approach is slower because it reorders the patterns so that longer patterns are substituted first, which makes it less likely that a shorter replacement overwrites part of a longer one. If that doesn't make sense, just realize there are built-in protections.
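
The ordering concern described in the comment above can be illustrated with a small Python sketch. This is not qdap's implementation, only the general idea; the patterns and the helper name `replace_all` are made up for the example.

```python
# Hypothetical patterns where one is a prefix of the other.
pairs = {"date": "month", "dates": "weeks"}
text = "one date, two dates"

def replace_all(text, pairs, longest_first):
    # Apply literal replacements one after another in the chosen order.
    order = sorted(pairs, key=len, reverse=longest_first)
    for old in order:
        text = text.replace(old, pairs[old])
    return text

naive = replace_all(text, pairs, longest_first=False)
safe = replace_all(text, pairs, longest_first=True)
print(naive)  # the shorter pattern has already consumed part of "dates"
print(safe)
```

With the shorter pattern first, "dates" never gets its own replacement; sorting longest first avoids that at the cost of some extra bookkeeping.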

Using the gsubfn package, you could avoid using nested sub functions and do this in one call.

> library(gsubfn)
> x <- 'Here we have a date, a different Date, and a DATE'
> gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), x, ignore.case=T)
# [1] "Here we have a month, a different Month, and a MONTH"

2 Comments

The replacement argument of the gsubfn() call is the list of three substitutions that depend on the capitalization of 'date'. But can you explain how R understands list(...) as something that makes the substitutions? Sorry if that is not clear. Maybe you could explain what that one call is doing. Thank you
@lawyeR - since ignore.case=TRUE the function matches the pattern date to (date or Date or DATE) and then looks up what the match was in the replacement list(...). So, if Date was matched, it grabs list(..)[["Date"]] which is Month in this case.
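
The lookup mechanism described in the comment above (match case-insensitively, then use the matched text as a key) can be sketched in Python with a dictionary and a callback. This only mirrors the idea; it is not how gsubfn is implemented, and leaving unlisted casings untouched is a choice made for this sketch.

```python
import re

# Lookup table keyed by the exact text that was matched.
lookup = {"date": "month", "Date": "Month", "DATE": "MONTH"}

text = "Here we have a date, a different Date, and a DATE"

# The callback receives each match and returns its replacement;
# casings missing from the table (e.g. "daTe") are left as they are.
result = re.sub(
    r"date",
    lambda m: lookup.get(m.group(0), m.group(0)),
    text,
    flags=re.IGNORECASE,
)
print(result)
```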

AFAIK there is no way to do what you have asked with a pure regex and a single(*) find and replace. The problem is that the replacement part can only use capturing group matches as-is: it can't process them, derive information from them or do conditionals without a function being involved. So even if you use something like \b(?:(d)|(D))(?:(a)|(A))(?:(t)|(T))(?:(e)|(E))\b in a case-sensitive find (so that even-numbered captures are upper-case and odd-numbered captures are lower-case; see "MATCH INFORMATION" in the right pane of regex101), the replacement part still needs a function to act on this captured information.

(*) Am assuming you don't want to perform separate find and replaces for every single combination of uppers and lowers!
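
To make the point concrete, here is a minimal Python sketch using the paired-group pattern quoted above. The groups do record the case of each letter, but a function (here the hypothetical `recase`) is still needed to turn that record into a replacement. Keeping letters beyond the match length lower-case is an assumption of this sketch.

```python
import re

# The pattern from the paragraph above: each letter has a lower-case group
# (odd number) and an upper-case group (even number).
pattern = re.compile(r"\b(?:(d)|(D))(?:(a)|(A))(?:(t)|(T))(?:(e)|(E))\b")

def recase(m, repl="month"):
    # An even-numbered group that participated marks an upper-case position.
    upper_mask = [m.group(2 * i + 2) is not None for i in range(4)]
    # Assumption: positions past the end of the match stay lower-case.
    upper_mask += [False] * (len(repl) - len(upper_mask))
    return "".join(c.upper() if up else c for c, up in zip(repl, upper_mask))

result = pattern.sub(recase, "This Date is a daTe and date, not dated.")
print(result)
```

Without `recase`, a plain replacement string could only paste the captured letters back in, which is exactly the limitation described above.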

Appendix

I could stop there given you've made it very clear you aren't interested in other solutions... but just for fun thought I'd try a JavaScript solution (which includes the function processing as part of the regex replacement):

const text = `This Date is a DATE that is daTe and date.
But dated should not be replaced, and nor should sedate.`;

const find = "date", replace = "month";
// For the general case, could apply a regex escaping function to `find` here.
// See https://stackoverflow.com/questions/3561493

const result = text.replace(new RegExp(`\\b${find}\\b`, "gi"), match => {
  let rep = "", pos = 0, upperCase = false;
  for (; pos < find.length && pos < replace.length; pos++) {
    const matchChar = match.charAt(pos);
    upperCase = matchChar.toUpperCase() === matchChar;
    const repChar = replace.charAt(pos);
    rep += upperCase ? repChar.toUpperCase() : repChar.toLowerCase();
  }
  const remaining = replace.substring(pos);
  rep += upperCase ? remaining.toUpperCase() : remaining.toLowerCase();
  return rep;
});

console.log(result);

2 Comments

This is a nice explanation of why you need some logic for the replacement - and I like that you can do this fairly elegantly in JS. FYI the snippet you posted doesn't seem to make the replacements (the first sentence of the output is still "This Date is a DATE that is daTe and date.") If you remove the word boundaries from the pattern it works (though obviously it then makes the unwanted replacements in the second sentence).
Oops! Bad final edit, it was working earlier - the problem was backslashes not being escaped, have now amended.

You have to write some logic

You will not find a pure regex solution for this. Similar SO questions in C# and JS contain extensive logical flow to determine which characters are capitals.

Furthermore, these questions have additional constraints which make them considerably simpler than your question:

  1. The pattern and the replacement are the same length.
  2. Each character in the pattern has a unique replacement character, e.g. "abcd" => "wxyz".

As a response to a similar question on the Rust reddit states:

There's a lot of possible ways that this could go wrong. For example, what should happen if you try to replace with a different number of characters ("abc" -> "wxyz")? What if you have a mapping with multiple outgoing links ("aaa" -> "xyz")?

This is precisely what you are trying to do. Where the pattern and replacement are a different length, in general you want the index of each capital in the pattern to be mapped to the index in the replacement, e.g. "daTe" => "moNth". However, sometimes you do not, e.g. "DATE" => "MONTH", and not "MONTh". Even if there were a regex flavour with some sort of \U equivalent (which is a nice question), to cope with patterns and replacements with different lengths, regex cannot be enough.

Another complication is the letters in the pattern or replacement are not guaranteed to be unique: you want to be able to replace "WEEK" with "MONTH" and vice versa. This rules out character hash map approaches like the Rust answer. The Perl response linked in comments can cope with different length replacements. However, to generalise it to more than just the first letter would require a pattern setting out all possible permutations of capitals and lower case letters. This would be at least 2^n patterns, where n is the number of letters in the word being replaced. This doesn't get you much further than doing the same in R or any language.
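
The 2^n growth mentioned above is easy to see by enumerating the case variants of a word. The following Python sketch (the function name `case_variants` is made up) assumes every character is a cased letter.

```python
from itertools import product

def case_variants(word):
    # Each letter can be lower or upper, giving 2 ** len(word) combinations
    # (assuming every character is a cased letter).
    choices = [(c.lower(), c.upper()) for c in word]
    return ["".join(combo) for combo in product(*choices)]

variants = case_variants("date")
print(len(variants))  # 16 variants for a four-letter word
```

A pattern that spells out each variant as its own alternative therefore doubles in size with every extra letter.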

Python solution

You can do this in Python in a similar way to the excellent JS answer by Steve Chambers, as re.sub() can be supplied with a callback function. While the logic is basically similar, the nature of iteration in Python makes the syntax a little nicer, in my view. Rather than explicitly checking whether there are any letters left over if the inputs are different sizes, we can use itertools.cycle() to recycle different length inputs in a similar way to R.

import re
from itertools import cycle
def change_case(m, repl = "month"):
    word_match = m.group(1)
    if len(word_match) > len(repl):
        letters_itr = zip(word_match, cycle(repl))
    else:
        letters_itr = zip(cycle(word_match), repl)

    out_l = [out_letter.upper() if in_letter.isupper() else out_letter for in_letter, out_letter in letters_itr]
    return "".join(out_l)

text = 'This Date is a DATE that is daTe and date. But dated should not be replaced, and nor should sedate.'
re.sub(r'\b(date)\b', change_case, text, flags = re.IGNORECASE)
# 'This MontH is a MONTH that is moNth and month. But dated should not be replaced, and nor should sedate.'

An advantage of using itertools.cycle() is that, without any additional logic, if all caps go in then we get all caps out, so for example "DATE" => "MONTH" rather than "MONTh".
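
The output above also shows a side effect of recycling: the capital wraps around, so "Date" becomes "MontH". If that is not wanted, one possible variant (a sketch, not the answer's code; `change_case_padded` is a hypothetical name) pads with lower case instead of cycling and handles the all-caps case explicitly.

```python
import re

def change_case_padded(m, repl="month"):
    word = m.group(0)
    # All-caps in, all-caps out, e.g. "DATE" -> "MONTH".
    if word.isupper():
        return repl.upper()
    # Otherwise copy case position by position; positions past the end of
    # the match stay lower-case, so "Date" -> "Month" rather than "MontH".
    mask = [ch.isupper() for ch in word] + [False] * max(len(repl) - len(word), 0)
    return "".join(r.upper() if up else r for r, up in zip(repl, mask))

text = "This Date is a DATE that is daTe and date. But dated should not be replaced, and nor should sedate."
result = re.sub(r"\bdate\b", change_case_padded, text, flags=re.IGNORECASE)
print(result)
```

The trade-off is one extra branch for the all-caps case, which the cycling version gets for free.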

R solution

stringr approach

You can supply a callback function to stringr::str_replace_all(). This does the same thing as the Python or JS approach, though we can use cbind() to recycle to the same length in R. We can turn off the warning about recycling different length vectors using withCallingHandlers(), as that is in fact the intended purpose of this approach.

change_case  <- function(x, repl = "month") {
    # cbind so we can iterate over strings of different lengths
    withCallingHandlers(
        {
            letters_df <- cbind(x = strsplit(x, "")[[1]], repl = strsplit(repl, "")[[1]])
        },
        warning = function(w) {
            if (grepl("number of rows of result is not a multiple of vector length", conditionMessage(w))) {
                invokeRestart("muffleWarning")
            }
        }
    )

    out  <- apply(letters_df, 1, \(letter) 
        if(grepl("^[[:upper:]]+$", letter["x"])) toupper(letter["repl"]) else letter["repl"] 
    )

    paste(out, collapse = "")
}

This then can be used similarly to the other solutions:

x <- "This Date is a DATE that is daTe and date. But dated should not be replaced, and nor should sedate."
stringr::str_replace_all(x, "(?i)\\bdate\\b", change_case)

base R solution

You cannot supply gsub() a callback function that has access to the groups. Instead, I have written a function swap(), which will do this for you with two strings, even with different numbers of letters:

x <- "This Date is a DATE that is daTe and date."
swap("date", "month", x)
# [1] "This Month is a MONTH that is moNth and month."

How it works

The swap() function uses Reduce() in a pretty similar way to this answer:

swap <- function(old, new, str, preserve_boundaries = TRUE) {
    l <- create_replacement_pairs(old, new, str, preserve_boundaries)
    Reduce(\(x, l) gsub(l[1], l[2], x, fixed = TRUE), l, init = str)
}

The workhorse function is create_replacement_pairs(), which creates a list of pairs of patterns that actually appear in the string, e.g. c("daTe", "DATE"), and generates replacements with the correct case, e.g. c("moNth", "MONTH"). The function logic is:

  1. Find all matches in the string, e.g. "Date" "DATE" "daTe" "date".
  2. Create a boolean mask indicating whether each letter is a capital.
  3. If all letters are capitals, the replacement should also be all caps, e.g. "DATE" => "MONTH". Otherwise make the letter at each index in the replacement a capital if the letter at the corresponding index in the pattern is a capital.

create_replacement_pairs <- function(old = "date", new = "month", str, preserve_boundaries) {
    if (preserve_boundaries) {
        pattern <- paste0("\\b", old, "\\b")
    } else {
        pattern <- old
    }

    matches <- unique(unlist(
        regmatches(str, gregexpr(pattern, str, ignore.case = TRUE))
    )) # e.g. "Date" "DATE" "daTe" "date"

    capital_shift <- lapply(matches, \(x) {
        out_length <- nchar(new)
        # Boolean mask if <= capital Z
        capitals <- utf8ToInt(x) <= 90

        # If e.g. DATE, replacement should be
        # MONTH and not MONTh
        if (all(capitals)) {
            shift <- rep(32, out_length)
        } else {
            # If not all capitals replace corresponding
            # index with capital e.g. daTe => moNth

            # Pad with lower case if replacement is longer
            length_diff <- max(out_length - nchar(old), 0)
            shift <- c(
                ifelse(capitals, 32, 0),
                rep(0, length_diff)
            )[1:out_length] # truncate if replacement shorter than pattern
        }
    })

    replacements <- lapply(capital_shift, \(x) {
        paste(vapply(
            utf8ToInt(new) - x,
            intToUtf8,
            character(1)
        ), collapse = "")
    })

    replacement_list <- Map(\(x, y) c(old = x, new = y), matches, replacements)

    replacement_list
}

Use cases

This approach is not subject to the same constraints as the Rust and C# answers linked at the start of this answer. We have already seen this works where the replacement is longer than the pattern. The converse is also true:

swap("date", "day", x)
# [1] "This Day is a DAY that is daY and day."

Furthermore, as it does not use a hash map, it works in cases where the letters in the replacement are not unique.

swap("date", "week", x)
# [1] "This Week is a WEEK that is weEk and week."

It also works where the letters in the pattern are not unique:

swap("that", "which", x)
# [1] "This Date is a DATE which is daTe and date."

Edit: Thanks to @shs for pointing out in the comments that this did not preserve word boundaries. It now does by default, but you can disable this with preserve_boundaries = FALSE:

swap("date", "week", "this dAte is dated", preserve_boundaries = FALSE)
# [1] "this wEek is weekd"
swap("date", "week", "this dAte is dated")
# [1] "this wEek is dated"

Performance

Dynamically generating matches from the lower case argument is likely faster than calling a function every time, particularly in long strings with many replacements that are identical. It will not be quite as fast as hardcoding list(c("Date", "Month"), c("DATE", "MONTH"), c("daTe", "moNth"), c("date", "month")). However a fair comparison should probably include the time it takes to type that list, which I doubt could be done in less than the ten-thousandth of a second the function takes to return, even by the most committed vim user.

I had the benefit of seeing the benchmarks in Tyler Rinker's answer so have used Reduce() and gsub(), which is the fastest of the methods for replacement tested. Additionally the approach in this answer generates pairs of exact matches and replacements, so we can set fixed = TRUE in gsub(), which with a five character pattern takes about a quarter of the time to make a replacement compared with fixed = FALSE.

This does make several passes over the string, rather than some other answers which make one pass to look for a match. However those answers then apply logic after the match is found, whereas this has a one-to-one mapping of matches to replacements, so no logic is required. I suspect which is faster depends on the data, specifically how many variants of the pattern you have, and the language (it's generally quicker in R to do the regex several times, which is written in C, rather than the capital shift logic, which is written in R).

Is this still a workaround? Yes. But as a pure regex solution cannot exist, I like a solution that abstracts away the unseemly character level iteration, so I can forget it is a bit of a hack.

4 Comments

This solution also does not respect word boundaries. Otherwise nice work though.
It should also be stated that this use of utf8ToInt() only works for the standard latin alphabet. For example, ÄÖÜ in German words would be problematic.
@shs thanks re str and x - fixed. Re locale, yes you're right. I thought standard Latin alphabet was implied (the other alphabets I know don't have the concept of capitals), but good point about accented letters. That can be fixed if a character set is specified (though it becomes a bit inelegant as intToUtf8(utf8ToInt("Ä") - 32) is not "ä"). Re word boundaries, I'm not sure exactly what you mean. Can you give me an example?
It's explained in the question under the heading "Preserving word boundaries". If you do perl = T a pattern like "\bdate\b" would ensure that. With the default POSIX patterns I don't know how it would work
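
On the accented-letter point raised in these comments: one way to avoid code-point arithmetic altogether is to rely on Unicode-aware case methods. The Python sketch below (with a made-up helper `copy_case`) illustrates the idea; it assumes source and target have the same length and ignores characters whose upper-case form is longer than one character.

```python
def copy_case(source, target):
    # str.isupper()/str.upper() follow Unicode case rules, so no fixed
    # code-point offset (such as 32) has to be assumed.
    return "".join(
        t.upper() if s.isupper() else t.lower()
        for s, t in zip(source, target)
    )

print(copy_case("Über", "oben"))
print(copy_case("äRger", "ÄRGER"))
```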

edit: Note that the way shown here is a single regex, single pass solution.
It avoids re-searching the string for individual separate forms.
It should represent the fastest method to do such a thing.
The point of this question is speed, which is otherwise trivial.

This is written in Perl.
It has a function that takes the find word, replace word, default replace word,
and the string to do the replacing on.

This is fairly simple.
The function generates the four forms of each word, puts them into arrays,
constructs the regex based on the find word forms, then does a string replacement
of the passed in string.

The replacement is based on the capture group that matched.
The group number is used as an index into the replacement array to fetch the
equivalent form word.

There is a default replacement passed into this function that will be used
when the find word matches in a case insensitive way, the last group.

Even though done in Perl here it is easy to port to any language/regex engine.

use strict;
use warnings;


sub CreateForms{
   my ($wrd) = @_;
   my $w1 =  lc($wrd);                 # 1. lower case
   (my $w2 = $w1) =~ s/^(.)/uc($1)/e;  # 2. upper first letter only
   my $w3 =  uc($w1);                  # 3. upper case
   my $w4 = $w1;                       # 4. default (all the rest)
   my @forms = ("", $w1, $w2, $w3, $w4);
   return( @forms );
}

sub ReplaceForms{
   my ($findwrd, $replwrd, $replDefault, $input) = @_;

   my @ff = CreateForms($findwrd);
   my $Rx = "\\b(?:(" . $ff[1] . ")|(" . $ff[2] . ")|(" . $ff[3] . ")|((?i)" . $ff[4] . "))\\b";

   my @rr = CreateForms($replwrd);
   $rr[4] = $replDefault;

   $input =~ s/$Rx/ $rr[defined($1) ? 1 : defined($2) ? 2 : defined($3) ? 3 : 4]/eg;
   return $input;

};
 
print "\n";
print ReplaceForms( "date", "month", "monTh", "this is the date of the year" ), "\n";
print ReplaceForms( "date", "month", "monTh", "this is the Date of the year" ), "\n";
print ReplaceForms( "date", "month", "monTh", "this is the DATE of the year" ), "\n";
print ReplaceForms( "date", "month", "monTh", "this is the DaTe of the year" ), "\n";

Output

this is the month of the year
this is the Month of the year
this is the MONTH of the year
this is the monTh of the year
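
As a hedged illustration of the claim that this ports easily, here is a rough Python equivalent of the two Perl subs. The function names are adapted, not taken from any library; `m.lastindex` plays the role of the chain of `defined($1)` checks, and `re.escape` is an addition so that the words are treated literally.

```python
import re

def create_forms(word):
    lower = word.lower()
    # Index 0 is a placeholder so list indices line up with group numbers.
    return ["", lower, lower[:1].upper() + lower[1:], lower.upper(), lower]

def replace_forms(find, repl, default, text):
    ff = [re.escape(f) for f in create_forms(find)]
    rr = create_forms(repl)
    rr[4] = default
    # Groups 1-3 are case-sensitive forms; group 4 is the
    # case-insensitive fallback for every other casing.
    rx = re.compile(rf"\b(?:({ff[1]})|({ff[2]})|({ff[3]})|((?i:{ff[4]})))\b")
    # m.lastindex is the number of the capture group that matched.
    return rx.sub(lambda m: rr[m.lastindex], text)

for sample in ("date", "Date", "DATE", "DaTe"):
    print(replace_forms("date", "month", "monTh", f"this is the {sample} of the year"))
```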
