8

In our application, user is entering data from MS word to asp.net textarea control and finally data is saved in SQL Server. For some reason, there are few junk characters which looks like small squares when viewed from SQL Server Management Studio.

This is causing error while generating Crystal Reports.

I need a regex which will strip all such characters along with bullets. The only valid input is

A-Z, a-z , 0-9, ~ ! @ # % $ ^ & *  ( ) _ + | ` - = \ {}:">? < [ ] ; ' , . /

Also, tab spaces should be replaced with single space. Enter key or New line is allowed.

Currently I am using

Regex.Replace(data, @"[^\u0000-\u007F]", " ");

but it won't work to remove bullets or tab spaces.

Can any regex ninja help me with this problem? Thanks in advance.

4
  • I know it's not exactly what you're after, but how about enabling a WYSIWYG editor on the textarea that gives an option to paste from Word. TinyMCE has a plugin for just this task? Commented Feb 13, 2013 at 12:54
  • That is not an option for me as of now. Commented Feb 13, 2013 at 12:55
  • Use the ASCII SQL function to find out exactly what character it is, my guess is that you are looking for the wrong one: msdn.microsoft.com/en-us/library/ms177545.aspx Commented Feb 13, 2013 at 13:01
  • Have a look at regex - Finding tabs or 4 spaces. Bullets are ASCII characters, so you can just copy them into your RegEx pattern and replace it. Commented Feb 13, 2013 at 13:03

2 Answers 2

3

You can use two regexes. The first, with the pattern "\t|<bullet>" (where <bullet> stands for the representation of the bullet) is used first, to replace TABs and bullets with spaces (" "). The second, the pattern of which is a negated character set containing your list of valid characters, is used second, to replace invalid characters with the empty string (""), that is, to get rid of them. Since you need to keep CR and LF characters (and space), these must be added to the set of valid characters:

using System;
using System.Text.RegularExpressions;

static class Program
{
  public static void Main()
  {
    string pattern1 = @"\t";
    Regex regex1 = new Regex(pattern1, RegexOptions.Compiled);
    string pattern2 = @"[^A-Za-z0-9~!#$^&*()_+|`\-=\\{}:"">?<\[\];',./ \r\n]";
    Regex regex2 = new Regex(pattern2, RegexOptions.Compiled);

    string input = "ABZabz09~!#$^&*()_+|`-=\\{}:\">?<[];',./ \r\nárvíztűrő\ttükörfúrógép";
    string temp = regex1.Replace(input, " ");
    string output = regex2.Replace(temp, "");
    Console.WriteLine(input);
    Console.WriteLine(output);
    Console.ReadKey(true);
  }
}

Output:

ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
árvíztűrő       tükörfúrógép
ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
rvztr tkrfrgp

Note that the TAB after árvíztűrő was replaced with a single space.

About bullets:

I made a bulleted list in Word and copied it into a textarea in a webpage. Then I saved the HTML and figured out that the bullets are saved as the UTF-8-encoded character E280A2. This is what I called above the "representation of the bullet". You should figure out the binary representation of possible bullet characters and add them to the first pattern: either OR them to the TAB character, or put all of them into a character set:

using System;
using System.Text;
using System.Text.RegularExpressions;

static class Program
{
  public static void Main()
  {
    byte[] bulletBytes = new byte[] { 0xE2, 0x80, 0xA2 };
    string bullet= Encoding.UTF8.GetString(bulletBytes);

    string pattern1 = @"[\t" + bullet + "]";
    Regex regex1 = new Regex(pattern1, RegexOptions.Compiled);
    string pattern2 = @"[^A-Za-z0-9~!#$^&*()_+|`\-=\\{}:"">?<\[\];',./ \r\n]";
    Regex regex2 = new Regex(pattern2, RegexOptions.Compiled);

    string input = 
      bullet + "ABZabz09~!#$^&*()_+|`-=\\{}:\">?<[];',./ \r\n" + 
      bullet + "árvíztűrő\ttükörfúrógép";
    string temp = regex1.Replace(input, " ");
    string output = regex2.Replace(temp, "");
    Console.OutputEncoding = Encoding.UTF8;
    Console.WriteLine(input);
    Console.WriteLine(output);
    Console.ReadKey(true);
  }
}

Output (you should change the console font to Lucida Console to see the bullet):

•ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
•árvíztűrő      tükörfúrógép
 ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
 rvztr tkrfrgp

Now in addition to TAB, the bullet at the beginning of each line has also been replaced with a space.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks a lot @kol , as users are copying data from MS word, I would like to support “ and ‘ chars. How do i add it to above regex. I tried adding these chars as it is, but it wont filter as expected. Any help on this is highly appriciated. Thanks again.
You are welcome. I think " and ' could be added to the set of valid characters as Unicode character escape sequences. I would try \u2018-\u201f, see here: techabulary.com/u/unicode
0

I think that you should be diagnosing the underlying problem, rather than saying "some reason" and trying to fix the symptom, but that's not your question:

You'll want this regex:

 [^A-Za-z0-9~!\#\$\^&\*\(\)_\+\|`\-\=\\\{\};"\>\?\<\[\]:',\.\/]

To match any character that isn't allowed, and then you'll want to replace it with String.Empty. Then you'll just want to match:

\t

And replace it with " ".

2 Comments

you dont need to escape anything within [] except []
Absolutely untrue. Try the regex [\s] against "s". By your logic the regular expression matches either backslash or s, but you'll find that it actually matches any whitespace character. Also, when you edit someone's post make sure your edits don't change the answer and render it invalid.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.