51

I need to remove all line breaks from my strings (which come from a database). I currently do it with the code below:

value.Replace("\r\n", "").Replace("\n", "").Replace("\r", "")

However, at least one character that acts like a line ending survives this. Its char code is 8232.

I have to admit this is the first time I have come across this character. Obviously I could just replace it directly, but I would rather extend my current approach (replacing combinations of "\r" and "\n") into something more robust, so that it covers not only char 8232 but also any others I have not run into yet.

Do you have a bullet-proof approach for such a problem?

EDIT#1:

It seems to me that there are several possible solutions:

  1. use Regex.Replace
  2. remove all chars if it's IsSeparator or IsControl
  3. replace with " " if it's IsWhiteSpace
  4. create a list of all possible line endings ("\r\n", "\r", "\n", LF, VT, FF, CR, CR+LF, NEL, LS, PS) and just replace each of them with an empty string. That's a lot of replaces.

I would say the 1st and 4th approaches will give the best results, but I cannot decide which will be faster. Which one do you think is the most complete?

EDIT#2

I posted an answer below.

4
  • 1
    For what it's worth, the character you're running into is U+2028, 'LINE SEPARATOR'. fileformat.info/info/unicode/char/2028/index.htm Commented Jul 19, 2011 at 15:57
  • I have deleted my answer but what about the following:stackoverflow.com/questions/238002/… Commented Jul 19, 2011 at 16:02
  • It just asks about line breaks, not about special cases of them. In the context of this old question, the answer is correct, because the OP obviously doesn't care about such special cases, otherwise he would have mentioned them. Commented Jul 19, 2011 at 16:03
  • 1
    thanks for the explanation I have never got -3 in less than 1 minute. is there a badge for that? :-))) Commented Jul 19, 2011 at 16:04

12 Answers

75

Below is the extension method that solved my problem. lineSeparator and paragraphSeparator can of course be defined somewhere else, for example as static values.

public static string RemoveLineEndings(this string value)
{
    if(String.IsNullOrEmpty(value))
    {
        return value;
    }
    string lineSeparator = ((char) 0x2028).ToString();
    string paragraphSeparator = ((char)0x2029).ToString();

    return value.Replace("\r\n", string.Empty)
                .Replace("\n", string.Empty)
                .Replace("\r", string.Empty)
                .Replace(lineSeparator, string.Empty)
                .Replace(paragraphSeparator, string.Empty);
}

2 Comments

This misses only one offending char that I have in my strings: \f (form feed).
I used it successfully to solve problems during CSV file creation. Some strings contained line separators, which caused rows to be split incorrectly. The code above solved the problem.
25

According to Wikipedia, there are numerous line terminators you may need to handle (including the one you mention).

LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029

1 Comment

In regex form: Regex.Replace(str, @"[\u000A\u000B\u000C\u000D\u2028\u2029\u0085]+", String.Empty)
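The list above can also be handled in a single pass without a regex. What follows is only a minimal sketch, not code from the answer: the class name LineEndingStripper and the method StripLineTerminators are hypothetical, and it assumes the characters listed above are the complete set you care about.

```csharp
using System.Text;

public static class LineEndingStripper
{
    // Hypothetical helper: true for the terminators listed in the answer above.
    private static bool IsLineTerminator(char c)
    {
        switch (c)
        {
            case '\n':      // LF,  U+000A
            case '\v':      // VT,  U+000B
            case '\f':      // FF,  U+000C
            case '\r':      // CR,  U+000D
            case '\u0085':  // NEL, U+0085
            case '\u2028':  // LS,  U+2028
            case '\u2029':  // PS,  U+2029
                return true;
            default:
                return false;
        }
    }

    // Copies every character that is not a line terminator into a new string.
    // CR+LF needs no special case because both halves are dropped individually.
    public static string StripLineTerminators(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }

        var result = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (!IsLineTerminator(c))
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

Compared with chaining one Replace call per terminator, this walks the string once and allocates a single result, which may matter if the strings are large.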
13

8232 (0x2028) and 8233 (0x2029) are the only other ones you might want to eliminate. See the documentation for char.IsSeparator.

3 Comments

Well, no -- what's "already implemented in the language" doesn't actually solve the original problem. Read the docs for char.IsSeparator -- it won't return true for the "normal" newline characters, because Unicode classifies those as "control characters".
@Joe - yes, but I was showing the OP that there is an official list of what character points he wants to get rid of, and it's in the documentation.
I think he meant just to look at the documentation, not to actually use char.IsSeparator.
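To see why neither char.IsSeparator nor char.IsControl alone covers every line break, it can help to print how .NET classifies each character. A small sketch, with the hypothetical names LineBreakCategories and Describe:

```csharp
using System;

public static class LineBreakCategories
{
    // Builds a one-line description of how .NET classifies the given character.
    public static string Describe(char c)
    {
        return string.Format(
            "U+{0:X4} category={1} IsControl={2} IsSeparator={3} IsWhiteSpace={4}",
            (int)c,
            char.GetUnicodeCategory(c),
            char.IsControl(c),
            char.IsSeparator(c),
            char.IsWhiteSpace(c));
    }
}
```

Calling Describe on '\n', '\r', '\u0085', '\u2028' and '\u2029' should show that the first three are control characters while the last two are separators, and that char.IsWhiteSpace returns true for all five.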
11

Props to Yossarian on this one, I think he's right. Replace all whitespace with a single space:

data = Regex.Replace(data, @"\s+", " ");

2 Comments

Uh... won't that insert spaces everywhere? As that not only matches all whitespace, it also matches the empty string. You'd want to use "\s+" instead.
Yes, "\s*" will match at every character, and insert a space after each one. Great if you're coding a spiffy Geocities site for 1995!
7

I'd recommend replacing ALL whitespace (anything for which char.IsWhiteSpace returns true) with a single space. IsWhiteSpace takes care of all the weird Unicode whitespace characters.
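One way this suggestion could look in code is sketched below. The names WhitespaceCollapser and CollapseWhitespace are hypothetical, and the sketch assumes that a run of consecutive whitespace characters should become exactly one space.

```csharp
using System.Text;

public static class WhitespaceCollapser
{
    // Replaces every run of whitespace (as defined by char.IsWhiteSpace)
    // with a single ' ' character.
    public static string CollapseWhitespace(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }

        var result = new StringBuilder(value.Length);
        bool inWhitespaceRun = false;

        foreach (char c in value)
        {
            if (char.IsWhiteSpace(c))
            {
                // Emit one space for the first character of the run only.
                if (!inWhitespaceRun)
                {
                    result.Append(' ');
                    inWhitespaceRun = true;
                }
            }
            else
            {
                result.Append(c);
                inWhitespaceRun = false;
            }
        }
        return result.ToString();
    }
}
```

Note that this also rewrites ordinary spaces and tabs, so it only fits if the original spacing inside the string does not need to be preserved.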


4

This is my first attempt at this, but I think this will do what you want....

var controlChars = from c in value.ToCharArray() where Char.IsControl(c) select c;
foreach (char c in controlChars)  
   value = value.Replace(c.ToString(), "");

Also, see this link for details on other methods you can use: Char Methods

1 Comment

Slightly shorter: value = new string(value.Where(c => !char.IsControl(c)).ToArray())
4

Have you tried string.Replace(Environment.NewLine, "") ? That usually gets a lot of them for me.

2 Comments

I've read somewhere here that it doesn't cover ALL situations.
it definitely doesn't cover ALL situations, just tested it.
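A short sketch of why this falls short: Environment.NewLine is a single, platform-dependent string ("\r\n" on Windows, "\n" on Unix-like systems), so only that exact sequence is removed. The wrapper name NewLineOnlyDemo is hypothetical.

```csharp
using System;

public static class NewLineOnlyDemo
{
    // Removes only the current platform's newline sequence, nothing else.
    public static string RemoveEnvironmentNewLine(string value)
    {
        return value.Replace(Environment.NewLine, "");
    }
}
```

On either platform, a lone "\r" and the U+2028 character from the question are left in place, which matches what the commenters observed.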
1

Check out this link: http://msdn.microsoft.com/en-us/library/844skk0h.aspx

You will have to play around and build a regular expression that works for you. But here's the skeleton...

using System;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Build a sample containing LF, CR, tabs and U+2028 (char code 8232).
        StringBuilder txt = new StringBuilder();
        txt.Append("Hello \n\n\r\t\t");
        txt.Append(Convert.ToChar(8232));

        Console.WriteLine("Original: <" + txt.ToString() + ">");
        Console.WriteLine("Cleaned: <" + CleanInput(txt.ToString()) + ">");

        Console.Read();
    }

    static string CleanInput(string strIn)
    {
        // Remove every character that is not a word character, '.', '@' or '-'.
        return Regex.Replace(strIn, @"[^\w\.@-]", "");
    }
}

1 Comment

I've tried all the other solutions, and this is the only one that worked for me: Regex.Replace(strIn, @"[^\w\.@-]", ""). Thank you @BBC
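One caveat worth checking before adopting this pattern: [^\w\.@-] is a whitelist, so it removes everything that is not a word character, a dot, an at sign or a hyphen. That includes ordinary spaces and most punctuation, not just line breaks. A small sketch that reuses the answer's pattern (the class name WhitelistDemo is hypothetical):

```csharp
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // Same pattern as in the answer: keep only \w, '.', '@' and '-'.
    public static string CleanInput(string strIn)
    {
        return Regex.Replace(strIn, @"[^\w\.@-]", "");
    }
}
```

For example, CleanInput("Hello, world!\n") returns "Helloworld": the line break is gone, but so are the comma, the space and the exclamation mark. That is fine for values such as e-mail addresses, but probably too aggressive for free text.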
0

Assuming that 8232 is the decimal Unicode code point (U+2028), you can do this (Replace returns a new string, so assign the result):

value = value.Replace("\u2028", string.Empty);


0

Personally, I'd go with:

    public static String RemoveLineEndings(this String text)
    {
        StringBuilder newText = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            if (!char.IsControl(text, i))
                newText.Append(text[i]);
        }
        return newText.ToString();
    }
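Be aware that char.IsControl alone does not catch the character from the question: U+2028 and U+2029 belong to the Unicode categories LineSeparator and ParagraphSeparator, not Control. A hedged variant that also checks those two categories might look like this (the names ControlAndSeparatorStripper and RemoveControlAndLineSeparators are hypothetical):

```csharp
using System.Globalization;
using System.Text;

public static class ControlAndSeparatorStripper
{
    // Drops control characters plus the Unicode line and paragraph separators.
    public static string RemoveControlAndLineSeparators(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        var newText = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            bool isLineBreakLike =
                char.IsControl(c) ||                               // CR, LF, VT, FF, NEL, tab, ...
                category == UnicodeCategory.LineSeparator ||       // U+2028
                category == UnicodeCategory.ParagraphSeparator;    // U+2029

            if (!isLineBreakLike)
            {
                newText.Append(c);
            }
        }
        return newText.ToString();
    }
}
```

As with the original loop, this also removes tabs and every other control character, so it is broader than "line endings only".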


0

If you have a string named, say, "theString", call its Replace method with the arguments shown below:

theString = theString.Replace(System.Environment.NewLine, "");


0

Here are some quick solutions with .NET regex:

  • To remove any whitespace from a string: s = Regex.Replace(s, @"\s+", ""); (\s matches any Unicode whitespace chars)
  • To remove all whitespace BUT CR and LF: s = Regex.Replace(s, @"[\s-[\r\n]]+", ""); ([\s-[\r\n]] is a character class containing a subtraction construct, it matches any whitespace but CR and LF)
  • To remove any vertical whitespace, subtract \p{Zs} (any horizontal whitespace but tab) and \t (tab) from \s: s = Regex.Replace(s, @"[\s-[\p{Zs}\t]]+", "");.

Wrapping the last one into an extension method:

public static string RemoveLineEndings(this string value)
{
    return Regex.Replace(value, @"[\s-[\p{Zs}\t]]+", "");
}

See the regex demo.
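If the method is called often, the pattern could be compiled once and reused, and null input could be passed through the same way the accepted answer does. This is only a sketch of that idea; the names VerticalWhitespace and RemoveVerticalWhitespace are hypothetical, and whether RegexOptions.Compiled actually helps depends on the workload.

```csharp
using System.Text.RegularExpressions;

public static class VerticalWhitespace
{
    // Same pattern as above: all \s characters except \p{Zs} and tab.
    private static readonly Regex VerticalWhitespacePattern =
        new Regex(@"[\s-[\p{Zs}\t]]+", RegexOptions.Compiled);

    public static string RemoveVerticalWhitespace(string value)
    {
        // Regex.Replace throws on null input, so return early instead.
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }
        return VerticalWhitespacePattern.Replace(value, "");
    }
}
```

Spaces and tabs are kept, while CR, LF, VT, FF, NEL, U+2028 and U+2029 are removed.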
