0

I have code that relies heavily on the String.Substring method, to the point where the Substring method is slowing down my application. I know that the range I want to retrieve is within the String (It is not out of bounds.) Is there something that I could use instead of Substring that would be faster? Could I write my own Substring method that could forgo any bounds checks?

sample code:

public String get_element(int element_number) {
        int count = 0;
        int start_index = 0;
        int end_index = 0;
        int current_index = 0;

        while (count < element_number && current_index != -1) {
            current_index = line_text.IndexOf(x12_reader.element_delimiter, start_index);
            start_index = current_index + 1;
            count++;
        }

        if (current_index != -1) {
            end_index = line_text.IndexOf(x12_reader.element_delimiter, start_index);
            if (end_index == -1) end_index = line_text.Length;
            return line_text.Substring(start_index, end_index - start_index); ;
        } else {
            return "";
        }
    }

I see lots of comments asking if Substring is really the problem. I know that Substring is the problem. I have run profiling in Visual Studio and It has pointed to Substring as the culprit. Also I cannot call this function any less than I currently am. The only place I have left to optimize is the Substring function. I know this is the case.

15
  • 6
    How did you determine that Substring is the bottle neck? Commented Aug 31, 2016 at 20:13
  • 1
    Can you afford storing your data as array instead of stringified DSV. For example can you at start split line_text to an array and simply access element at element_number? Commented Aug 31, 2016 at 20:14
  • 3
    I'd be surprised that the bounds checking is the culprit for your performance issue. Commented Aug 31, 2016 at 20:14
  • 4
    Something makes me think your problem is elsewhere. Have you used a profiler? Substring actually uses unsafe code to operate at the pointer level. That is going to be WAY faster than anything you write in straight C#. Also, I highly doubt the bounds checking, which equates to a handful of CPU instructions, is the culprit. Commented Aug 31, 2016 at 20:14
  • 2
    Hmm it would be good to know wider context, as I have feeling that your performance issue might be solved by using some better data structures adn/or architecture. Can you say what is purpose of this function in your domain? Commented Aug 31, 2016 at 20:31

3 Answers 3

1

The problem with your function is that you are returning the substring... Therefore, I don't see the way to avoid constructing a new string.

The next question is what exactly are you doing with the result? Maybe you could change the signature of the get_element method, e.g. to receive a StringBuilder and then to copy characters from the target string instead of building the new string.

public void get_element(int element_number, StringBuilder buffer)
{
    ...
    //  instead of: return line_text.Substring(...); 
    buffer.Append(line_text, start_index, end_index - start_index);
}

Anyway, cost of constructing new strings is not too high. Maybe there are other reasons why performance is poor in your case. Maybe you are doing too many concatenations of strings returned by this method?

Sign up to request clarification or add additional context in comments.

6 Comments

I have tried using a stringbuilder instead and the performance is slower. I am calling this method millions of times.
StringBuilder will lead to improved performance only if it is used many times in many calls to get_element. Can you add more code to the question? In particular, I would like to see what you are doing at the calling end, i.e. what is happening to the string once it has been returned from the get_element method.
The returned string is simply stored in most cases. To be used later. The overhead of Stringbuilder is too much given the size of the strings returned.
A see, but then again - how do you know that returning the substring is the problem? Substring method itself has only O(n) complexity, where n is the substring length, and it is optimized as much as you can get in .NET.
The profiler will only tell you that substring is the "problem" by saying it's being called the most number of times or is taking the most CPU cycles. It cannot tell you whether your code is going to be better by rethinking the algorithm or changing it in some other fundamental way.
|
1

I doubt the problem is caused by the bounds checking inside the Substring method.

But if you want to be sure and unsafe code is allowed, you can try the following string constructor:

public unsafe String(
    char* value,
    int startIndex,
    int length
)

like this:

public static class StringUtils
{
    public static unsafe string UnsafeSubstring(this string source, int startIndex, int length)
    {
        fixed (char* chars = source)
            return new string(chars, startIndex, length);
    }
}

Then replace Substring calls with UnsafeSubstring and see if there is any noticeable difference.

2 Comments

this is (almost) identical to the normal substring without safety calls.
@Dispersia Correct. As I mentioned at the beginning, I doubt the issue is caused by the safety calls. So the above is just to answer OP "Could I write my own Substring method that could forgo any bounds checks?"
-2

A string is only an array of char. You could loop through the char array between your known bounds and replace the char.

7 Comments

That's not accurate at all. A string is an object, it is not an array of characters. On top of that, a string is an immutable object which means you can't modify it without constructing a new object.
@dman2306 rick's answer on a string being only an array of char is correct. Internally it is an unsafe class that uses the CLR to hold a char* array. You can dostring test = "test"; unsafe { fixed (char* p = test) { p[0] = 'C'; } } and that is perfectly legal, and it does not reinstantiate the string, it just uses unsafe.
You can modify the string from unsafe code but... that is a big no-no in .NET because everything around a string object assumes it will not change. Anyway, the question is about a different thing - here part of the string needs to be transferred elsewhere, rather than modified.
@ZoranHorvat as long as you call fixed the GC will not hurl, it's like locking for threading, in which as long as you're sure the index exists, it is an ok operation.
@Dispersia This discussion is off topic, but anyway let's clarify that. You can modify the string in a way that will not crash. Anyway, different string objects may reuse the same character buffer because internally they know that buffer content cannot change. Now if you do change it, you have potentially changed several other string variables you never knew about. And that behavior would appear non-deterministically. Most notably, string literals are always reused by the compiler and you can still modify them from the unsafe code, causing completely unknown consequences.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.