
In C# I have a string containing whitespace, carriage returns and/or line breaks. Is there a simple way to normalize large strings (100,000 to 1,000,000 characters) imported from text files as efficiently as possible?

To clarify what I mean: let's say my string looks like string1, but I want it to look like string2:

string1 = " ab c\r\n de.\nf";
string2 = "abcde.f";
  • @MikeBeaton Yes, I know, but thanks for the info and your answer, which helped me with another problem I needed to solve as well ;-) Commented Jun 26, 2020 at 15:08
  • It matters a lot how long your strings actually are. I did some benchmarking, and around the 10_000-character mark a parallel method started outperforming the NewString method; at the 100_000_000-character mark the parallel version was a little over 15x as fast (also benchmarked with BenchmarkDotNet). Commented Jun 26, 2020 at 16:20
  • @Knoop Do you want me to specify this in the question? The strings I have are between 10,000 and 100,000 characters long. Commented Jun 26, 2020 at 16:22
  • With those lengths you have a good chance a parallel method will outperform the accepted answer, but it also depends on the hardware it runs on. Optimization and efficiency problems are seldom straightforward, so if the accepted answer is fast enough I would just go with that (also from the point of view that it's best to avoid premature optimization). Commented Jun 26, 2020 at 17:35
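For strings in that size range, one option not shown in any answer below is a two-pass approach that writes directly into the final string's buffer, avoiding the intermediate `char[]`-to-string copy that the `NewString` approach discussed above pays for. A sketch, assuming .NET Core 2.1+ for `string.Create`; the method name is illustrative:

```csharp
using System;

static class StripSketch
{
    // Pass 1 counts the surviving characters; pass 2 writes them straight
    // into the buffer of the newly allocated string via string.Create.
    public static string StripTwoPass(string text)
    {
        int kept = 0;
        foreach (char c in text)
            if (c != ' ' && c != '\r' && c != '\n')
                kept++;

        return string.Create(kept, text, (span, source) =>
        {
            int next = 0;
            foreach (char c in source)
                if (c != ' ' && c != '\r' && c != '\n')
                    span[next++] = c;
        });
    }
}
```

Whether the second scan over the input costs more than the saved copy depends on the data, so as with everything here, benchmark against your actual workload.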

4 Answers


What counts as "efficient" can depend heavily on your actual strings and how many of them you have. I've come up with the following benchmark (for BenchmarkDotNet):

using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class Replace
{
    private static readonly string S = " ab c\r\n de.\nf";
    private static readonly Regex Reg = new Regex(@"\s+", RegexOptions.Compiled);

    [Benchmark]
    public string SimpleReplace() => S
       .Replace(" ", "")
       .Replace("\r", "")
       .Replace("\n", "");

    [Benchmark]
    public string StringBuilder() => new StringBuilder().Append(S)
       .Replace(" ", "")
       .Replace("\r", "")
       .Replace("\n", "")
       .ToString();

    [Benchmark]
    public string RegexReplace() => Reg.Replace(S, "");

    [Benchmark]
    public string NewString()
    {
        var arr = new char[S.Length];
        var cnt = 0;
        for (int i = 0; i < S.Length; i++)
        {
            switch (S[i])
            {
                case ' ':
                case '\r':
                case '\n':
                    break;

                default:
                    arr[cnt] = S[i];
                    cnt++;
                    break;
            }
        }

        return new string(arr, 0, cnt);
    }

    [Benchmark]
    public string NewStringForeach()
    {
        var validCharacters = new char[S.Length];
        var next = 0;

        foreach(var c in S)
        {
            switch(c)
            {
                case ' ':
                case '\r':
                case '\n':
                    // Ignore these
                    break;

                default:
                    validCharacters[next++] = c;
                    break;
            }
        }

        return new string(validCharacters, 0, next);
    }
} 

This gives on my machine:

|           Method |        Mean |     Error |    StdDev |
|----------------- |------------:|----------:|----------:|
|    SimpleReplace |   122.09 ns |  1.273 ns |  1.063 ns |
|    StringBuilder |   311.28 ns |  6.313 ns |  8.850 ns |
|     RegexReplace | 1,194.91 ns | 23.376 ns | 34.265 ns |
|        NewString |    52.26 ns |  1.122 ns |  1.812 ns |
| NewStringForeach |    40.04 ns |  0.877 ns |  1.979 ns |

6 Comments

Thanks for your awesome answer! If I interpret this the right way, it means that @Sean's approach is the quickest so far. Am I right?
@Daniel Yep. But again, you should test against your actual workload.
Since the OP has stated that the desired optimization is specifically for large strings, IMO a good benchmark should reflect that. Some solutions might have more startup overhead but be faster on a per-character basis, making them lose out on short strings but come out ahead on very long strings.
@Knoop The term "large" is very broad, which is why I suggested testing (benchmarking) against the actual workload (and actual hardware). Even the frequency of specific characters can potentially affect the performance of one solution or another.
@sTrenat added as you requested.

To do this efficiently you want to avoid regex and keep memory allocations to a minimum. Here I've used a raw character buffer (rather than a StringBuilder) and for rather than foreach to optimize access to each character:

string Strip(string text)
{
    var validCharacters = new char[text.Length];
    var next = 0;

    for(int i = 0; i < text.Length; i++)
    {
        char c = text[i];

        switch(c)
        {
            case ' ':
            case '\r':
            case '\n':
                // Ignore these
                break;

            default:
                validCharacters[next++] = c;
                break;
        }
    }

    return new string(validCharacters, 0, next);
}

8 Comments

Perhaps better to use something like char.IsWhiteSpace and char.IsControl, since I'm assuming they want all whitespace/non-printable characters removed.
I cannot agree that for will be more efficient. I would even say that for will be slower, due to optimizations the compiler can do with foreach.
@sTrenat - foreach will involve a call to get an enumerator (probably a struct), then a call to MoveNext for each iteration and a call to Current to get the actual value. The implementation is then just grabbing the character from the specified index. All of this might get optimized away, but it might not.
I think you underestimate what compiler optimization can do. Check it yourself: dotnetfiddle.net/Gr6knH
@sTrenat Switch them around and the results are reversed in favor of for, check it yourself: dotnetfiddle.net/M12Lwh. In other words, not a good benchmark. Do you have any actual theory/documentation to back this up? Last I knew, the optimization the compiler actually does in the case of foreach over an array is indexing instead of using an enumerator, making it almost as fast as for.
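To sketch the first comment's `char.IsWhiteSpace` suggestion: the switch collapses to a single predicate. Note the behavior differs slightly from the answer's three-character switch, since this also strips tabs, form feeds, and other Unicode whitespace:

```csharp
using System;

static class WhitespaceStrip
{
    // Same buffer-and-trim pattern as the answer above, but the filter is
    // generalized: anything char.IsWhiteSpace considers whitespace is dropped.
    public static string Strip(string text)
    {
        var buffer = new char[text.Length];
        int next = 0;

        foreach (char c in text)
        {
            if (!char.IsWhiteSpace(c))
                buffer[next++] = c;
        }

        return new string(buffer, 0, next);
    }
}
```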
using System.Text.RegularExpressions;

var input = " ab c\r\n de.\nf";
var result = Regex.Replace(input, @"\s+", "");

// result is now "abcde.f"

You can see it in action here
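If you go the regex route and call it repeatedly on large inputs, it helps to construct the pattern once rather than paying the static `Regex.Replace` lookup each call. A sketch (names are illustrative; per the benchmark answer above, regex is still the slowest option here):

```csharp
using System.Text.RegularExpressions;

static class Normalizer
{
    // Built once, reused for every call; RegexOptions.Compiled trades
    // startup cost for faster matching on repeated use.
    private static readonly Regex Whitespace =
        new Regex(@"\s+", RegexOptions.Compiled);

    public static string Strip(string input) => Whitespace.Replace(input, "");
}
```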

2 Comments

Thanks for your answer! But is using regular expressions the most efficient way?
It's efficient enough!

You can do it like this. You can define which special characters you want to allow in a config file; in my case I have defined them in the appsettings.json file.

private string RemoveUnnecessaryChars(string firstName)
{
    StringBuilder sb = new StringBuilder();
    string allowedCharacters = _configuration["AllowedChars"];
    foreach (char ch in firstName)
    {
        if (char.IsLetterOrDigit(ch))
        {
            sb.Append(ch);
        }
        else if (allowedCharacters.Contains(ch))
        {
            sb.Append(ch);
        }
    }

    return sb.ToString();
}

3 Comments

Thanks for your answer; however, I hoped to find a simpler way to do this.
Probably better to make a HashSet out of allowedCharacters (depending on what's in it), but also probably better to define a blacklist rather than a whitelist.
@pinkfloydx33 thanks, that is a valid point
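A sketch of the HashSet suggestion from the comment above. The configuration lookup is replaced by a plain parameter here to keep the example self-contained; the allowed characters shown are illustrative:

```csharp
using System.Collections.Generic;
using System.Text;

static class AllowedCharsFilter
{
    public static string RemoveUnnecessaryChars(string input, string allowed)
    {
        // HashSet gives O(1) membership checks instead of the
        // linear scan that string.Contains performs per character.
        var allowedSet = new HashSet<char>(allowed);
        var sb = new StringBuilder(input.Length);

        foreach (char ch in input)
        {
            if (char.IsLetterOrDigit(ch) || allowedSet.Contains(ch))
                sb.Append(ch);
        }

        return sb.ToString();
    }
}
```

For the original input sizes (100,000+ characters), pre-sizing the StringBuilder with `input.Length` also avoids repeated internal buffer growth.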
