
In C# I have a string containing whitespace, carriage returns and/or line breaks. Is there a simple way to normalize large strings (100,000 to 1,000,000 characters) imported from text files as efficiently as possible?

To clarify what I mean: let's say my string looks like string1, but I want it to look like string2:

string1 = " ab c\r\n de.\nf";
string2 = "abcde.f";
  • @MikeBeaton Yes, I know, but thanks for the info and your answer, which helped me with another problem I needed to solve as well ;-) Commented Jun 26, 2020 at 15:08
  • It matters a lot how long your strings actually are. I did some benchmarking, and around the 10_000-character mark a parallel method started outperforming the NewString method; at the 100_000_000-character mark the parallel version was a little over 15x as fast (also benchmarked with BenchmarkDotNet). Commented Jun 26, 2020 at 16:20
  • @Knoop Do you want me to specify this in the question? The strings I have are between 10,000 and 100,000 characters long. Commented Jun 26, 2020 at 16:22
  • With those lengths you have a good chance a parallel method will outperform the accepted answer, but it also depends on the hardware it runs on. Optimization and efficiency problems are seldom straightforward, so if the accepted answer is fast enough I would just go with that (also from the point of view that it's best to avoid premature optimization). Commented Jun 26, 2020 at 17:35
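For strings in that size range, one option not shown in any answer below is a two-pass approach that writes directly into the final string's buffer, avoiding the intermediate `char[]`-to-string copy that the `NewString` approach discussed above pays for. A sketch, assuming .NET Core 2.1+ for `string.Create`; the method name is illustrative:

```csharp
using System;

static class StripSketch
{
    // Pass 1 counts the surviving characters; pass 2 writes them straight
    // into the buffer of the newly allocated string via string.Create.
    public static string StripTwoPass(string text)
    {
        int kept = 0;
        foreach (char c in text)
            if (c != ' ' && c != '\r' && c != '\n')
                kept++;

        return string.Create(kept, text, (span, source) =>
        {
            int next = 0;
            foreach (char c in source)
                if (c != ' ' && c != '\r' && c != '\n')
                    span[next++] = c;
        });
    }
}
```

Whether the second scan over the input costs more than the saved copy depends on the data, so as with everything here, benchmark against your actual workload.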

4 Answers


What counts as "efficient" can depend heavily on your actual strings and how many of them you have. I've come up with the following benchmark (for BenchmarkDotNet):

using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class Replace
{
    private static readonly string S = " ab c\r\n de.\nf";
    private static readonly Regex Reg = new Regex(@"\s+", RegexOptions.Compiled);

    [Benchmark]
    public string SimpleReplace() => S
       .Replace(" ", "")
       .Replace("\r", "")
       .Replace("\n", "");

    [Benchmark]
    public string StringBuilder() => new StringBuilder().Append(S)
       .Replace(" ", "")
       .Replace("\r", "")
       .Replace("\n", "")
       .ToString();

    [Benchmark]
    public string RegexReplace() => Reg.Replace(S, "");

    [Benchmark]
    public string NewString()
    {
        var arr = new char[S.Length];
        var cnt = 0;
        for (int i = 0; i < S.Length; i++)
        {
            switch (S[i])
            {
                case ' ':
                case '\r':
                case '\n':
                    break;

                default:
                    arr[cnt] = S[i];
                    cnt++;
                    break;
            }
        }

        return new string(arr, 0, cnt);
    }

    [Benchmark]
    public string NewStringForeach()
    {
        var validCharacters = new char[S.Length];
        var next = 0;

        foreach(var c in S)
        {
            switch(c)
            {
                case ' ':
                case '\r':
                case '\n':
                    // Ignore these
                    break;

                default:
                    validCharacters[next++] = c;
                    break;
            }
        }

        return new string(validCharacters, 0, next);
    }
} 

This gives on my machine:

|           Method |        Mean |     Error |    StdDev |
|----------------- |------------:|----------:|----------:|
|    SimpleReplace |   122.09 ns |  1.273 ns |  1.063 ns |
|    StringBuilder |   311.28 ns |  6.313 ns |  8.850 ns |
|     RegexReplace | 1,194.91 ns | 23.376 ns | 34.265 ns |
|        NewString |    52.26 ns |  1.122 ns |  1.812 ns |
| NewStringForeach |    40.04 ns |  0.877 ns |  1.979 ns |

6 Comments

Thanks for your awesome answer! If I interpret this the right way, it means that @Sean's approach is the quickest so far. Am I right?
@Daniel Yep. But again, you should test against your actual workload.
Since the OP has stated that the desired optimization is specifically for large strings, IMO a good benchmark should reflect that. Some solutions might have more startup overhead but be faster on a per-character basis, making them lose out on short strings but come out ahead on very long strings.
@Knoop The term "large" is very broad, which is why I suggested testing (benchmarking) against the actual workload (and actual hardware). Even the frequency of specific characters can potentially affect the performance of one solution or another.
@sTrenat added as you requested.

To do this efficiently you want to avoid regex and keep memory allocations to a minimum. Here I've used a raw character buffer (rather than a StringBuilder) and for rather than foreach to optimize access to each character:

string Strip(string text)
{
    var validCharacters = new char[text.Length];
    var next = 0;

    for(int i = 0; i < text.Length; i++)
    {
        char c = text[i];

        switch(c)
        {
            case ' ':
            case '\r':
            case '\n':
                // Ignore these
                break;

            default:
                validCharacters[next++] = c;
                break;
        }
    }

    return new string(validCharacters, 0, next);
}

8 Comments

Perhaps better to use something like char.IsWhiteSpace and char.IsControl, since I'm assuming they want all whitespace/non-printable characters removed.
I cannot agree that for will be more efficient. I would even say that for will be slower, due to optimizations the compiler can do with foreach.
@sTrenat - foreach will involve a call to get an enumerator (probably a struct), then a call to MoveNext for each iteration and a call to Current to get the actual value. The implementation is then just grabbing the character from the specified index. All of this might get optimized away, but it might not.
I think you underestimate what compiler optimization can do. Check it yourself: dotnetfiddle.net/Gr6knH
@sTrenat Switch them around and the results are reversed in favor of for, check it yourself: dotnetfiddle.net/M12Lwh. In other words, not a good benchmark. Do you have any actual theory/documentation to back this up? Last I knew, the optimization the compiler actually does in the case of foreach over an array is indexing instead of using an enumerator, making it almost as fast as for.
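To sketch the first comment's `char.IsWhiteSpace` suggestion: the switch collapses to a single predicate. Note the behavior differs slightly from the answer's three-character switch, since this also strips tabs, form feeds, and other Unicode whitespace:

```csharp
using System;

static class WhitespaceStrip
{
    // Same buffer-and-trim pattern as the answer above, but the filter is
    // generalized: anything char.IsWhiteSpace considers whitespace is dropped.
    public static string Strip(string text)
    {
        var buffer = new char[text.Length];
        int next = 0;

        foreach (char c in text)
        {
            if (!char.IsWhiteSpace(c))
                buffer[next++] = c;
        }

        return new string(buffer, 0, next);
    }
}
```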
using System.Text.RegularExpressions;

var input = " ab c\r\n de.\nf";
var result = Regex.Replace(input, @"\s+", "");

// result is now "abcde.f"

You can see it in action here
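If you go the regex route and call it repeatedly on large inputs, it helps to construct the pattern once rather than paying the static `Regex.Replace` lookup each call. A sketch (names are illustrative; per the benchmark answer above, regex is still the slowest option here):

```csharp
using System.Text.RegularExpressions;

static class Normalizer
{
    // Built once, reused for every call; RegexOptions.Compiled trades
    // startup cost for faster matching on repeated use.
    private static readonly Regex Whitespace =
        new Regex(@"\s+", RegexOptions.Compiled);

    public static string Strip(string input) => Whitespace.Replace(input, "");
}
```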

2 Comments

Thanks for your answer! But is using regular expressions the most efficient way?
It's efficient enough!

You can do it like this. You can define which special characters you want to allow in a config file; in my case I have defined them in the appsettings.json file.

private string RemoveUnnecessaryChars(string firstName)
{
    StringBuilder sb = new StringBuilder();
    string allowedCharacters = _configuration["AllowedChars"];
    foreach (char ch in firstName)
    {
        if (char.IsLetterOrDigit(ch))
        {
            sb.Append(ch);
        }
        else if (allowedCharacters.Contains(ch))
        {
            sb.Append(ch);
        }
    }

    return sb.ToString();
}

3 Comments

Thanks for your answer; however, I hoped to find a simpler way to do this.
Probably better to make a HashSet out of allowedCharacters (depending on what's in it), but also probably better to define a blacklist rather than a whitelist.
@pinkfloydx33 thanks, that is a valid point
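A sketch of the HashSet suggestion from the comment above. The configuration lookup is replaced by a plain parameter here to keep the example self-contained; the allowed characters shown are illustrative:

```csharp
using System.Collections.Generic;
using System.Text;

static class AllowedCharsFilter
{
    public static string RemoveUnnecessaryChars(string input, string allowed)
    {
        // HashSet gives O(1) membership checks instead of the
        // linear scan that string.Contains performs per character.
        var allowedSet = new HashSet<char>(allowed);
        var sb = new StringBuilder(input.Length);

        foreach (char ch in input)
        {
            if (char.IsLetterOrDigit(ch) || allowedSet.Contains(ch))
                sb.Append(ch);
        }

        return sb.ToString();
    }
}
```

For the original input sizes (100,000+ characters), pre-sizing the StringBuilder with `input.Length` also avoids repeated internal buffer growth.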
