0

I have a sentence that may contain URLs. I need to take any uppercase URL that starts with WWW. and prepend HTTP:// to it. I have tried the following:

    private string ParseUrlInText(string text)
    {
        string currentText = text;

        foreach (string word in currentText.Split(new[] { "\r\n", "\n", " ", "</br>" }, StringSplitOptions.RemoveEmptyEntries))
        {
            string thing;
            if (word.ToLower().StartsWith("www."))
            {
                if (IsAllUpper(word))
                {
                    thing = "HTTP://" + word;

                    currentText = ReplaceFirst(currentText, word, thing);
                }
            }
        }

        return currentText;
    }

    public string ReplaceFirst(string text, string search, string replace)
    {
        int pos = text.IndexOf(search);
        if (pos < 0)
        {
            return text;
        }
        return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }

    private static bool IsAllUpper(string input)
    {
        return input.All(t => !Char.IsLetter(t) || Char.IsUpper(t));
    }

However, it only prepends HTTP:// multiple times to the first URL. I want to turn the following:

WWW.GOOGLE.CO.ZA
WWW.GOOGLE.CO.ZA WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA
there are a lot of domains (This shouldn't be parsed)

to

HTTP:// WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA HTTP:// WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA
there are a lot of domains (This shouldn't be parsed)

Could someone please show me the proper way to do this?

Edit: I need to keep the format of the string (spaces, newlines, etc.).
Edit2: A URL might already have HTTP:// prepended. I've updated the demo.

5
  • You shouldn't use your ReplaceFirst method. Instead, save the position of the word before modifying it, delete it, and insert the new word (with http://) at the position you saved. Your ReplaceFirst will obviously replace the first occurrence found, which is a problem if the word appears multiple times, and that is exactly your issue here. Commented May 15, 2014 at 7:36
  • Would be great if you showed :) Commented May 15, 2014 at 7:37
  • The StringBuilder in the first function is never used. Commented May 15, 2014 at 7:37
  • @Codor It's a remnant of previous attempts. Commented May 15, 2014 at 7:38
  • The easiest way might be using the URI class... see here: stackoverflow.com/questions/15713542/elegant-way-parsing-url Commented May 15, 2014 at 7:44

2 Answers

2

The issue with your code: you're using a ReplaceFirst method, which does exactly what its name says: it replaces the first occurrence, which is not always the one you want to replace. This is why only your first WWW.GOOGLE.CO.ZA gets all the HTTP:// prefixes.

One method would be to use a StreamReader or something similar and, each time you reach a new word, check whether its first four characters are "WWW." and insert the string "HTTP://" at that position. But that is pretty long-winded for something that can be done in far less code...
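To illustrate that word-by-word scan, here is a minimal sketch. It makes a few assumptions that are not in the question: words are delimited by whitespace only (the "</br>" separator is not handled), the names UrlScanner and PrefixUpperCaseUrls are hypothetical, and a StringBuilder is used in place of a reader.

```csharp
using System;
using System.Linq;
using System.Text;

public static class UrlScanner
{
    // Hypothetical helper: copies the text into a StringBuilder in one pass and
    // prefixes every whitespace-delimited word that starts with "WWW." and
    // contains no lowercase letters. All whitespace is copied unchanged.
    public static string PrefixUpperCaseUrls(string text)
    {
        var result = new StringBuilder(text.Length + 16);
        int i = 0;
        while (i < text.Length)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                result.Append(text[i]); // keep spaces and newlines as they are
                i++;
                continue;
            }

            // Read one word: everything up to the next whitespace character.
            int start = i;
            while (i < text.Length && !char.IsWhiteSpace(text[i]))
            {
                i++;
            }
            string word = text.Substring(start, i - start);

            bool isUpperCaseUrl =
                word.StartsWith("WWW.", StringComparison.Ordinal) &&
                word.All(c => !char.IsLetter(c) || char.IsUpper(c));

            if (isUpperCaseUrl)
            {
                result.Append("HTTP://");
            }
            result.Append(word);
        }
        return result.ToString();
    }
}
```

Because each word is handled at its own position, repeated URLs are prefixed independently, and a word that already begins with HTTP:// is left alone since it does not start with "WWW.".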

So let's go Regex!

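How to insert characters before a match with Regex (the match has to be captured in a group so that $1 can refer to it):

Regex.Replace(input, @"([abc])", "adding_text_before_match$1");

How to match a word that is not preceded by another word (negative lookbehind):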

(?<!wont_start_with_that)word_to_match

Which leads us to:

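// Requires: using System.Text.RegularExpressions;
private string ParseUrlInText(string text)
{
    // (?<!HTTP://) skips matches that already have the prefix;
    // $1 re-inserts the captured URL after the new "HTTP://".
    return Regex.Replace(text, @"(?<!HTTP://)(WWW\.[A-Za-z0-9_\.]+)",
        @"HTTP://$1");
}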
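The pattern above also accepts lowercase letters after WWW., while the question asks for uppercase URLs only. A stricter variant could look like the sketch below. The class name UrlPrefixer, the method name AddScheme, and the tightened pattern are assumptions of mine, not part of the original answer, and only the HTTP:// scheme is taken into account.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPrefixer
{
    // (?<!HTTP://)  the URL must not already be preceded by HTTP://
    // \b            WWW must start a word (so "XWWW.A" is skipped)
    // [A-Z0-9_.-]+  uppercase letters, digits, underscore, dot, hyphen
    // (?![\w.-])    the token must end here, so mixed-case URLs such as
    //               WWW.GOOGLE.com are not partially matched
    private static readonly Regex UpperCaseUrl = new Regex(
        @"(?<!HTTP://)\b(WWW\.[A-Z0-9_.-]+)(?![\w.-])");

    // Hypothetical helper: prefixes every qualifying URL exactly once and
    // leaves all other text, including whitespace, unchanged.
    public static string AddScheme(string text)
    {
        return UpperCaseUrl.Replace(text, "HTTP://$1");
    }
}
```

Called on the sample from the question (with HTTP://WWW.GOOGLE.CO.ZA written without a space on the third line), AddScheme should prefix the three bare URLs once each and leave the third line and the plain sentence untouched.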

3 Comments

I like this. Unfortunately, if one of those URLs happens to start with HTTP://, it would still prepend another HTTP://, doubling it.
Indeed... That gets a bit more complicated, but I'm looking into it :p
There you go, just needed a negative lookbehind. Worked on regexhero.net/tester
0

I'd go for the following approach:

1) Don't handle the same element twice.
2) Replace all instances of each element in one go.

private string ParseUrlInText(string text)
{
    string currentText = text;
    // .Distinct() (from System.Linq) gives us just the unique entries
    var workingText = currentText.Split(new[] { "\r\n", "\n", " ", "</br>" },
                          StringSplitOptions.RemoveEmptyEntries).Distinct();
    foreach (string word in workingText)
    {
        string thing;
        if (word.ToLower().StartsWith("www."))
        {
            if (IsAllUpper(word))
            {
                thing = "HTTP://" + word;

                // A word at the very start of the text has no separator in front of it
                if (currentText.StartsWith(word))
                {
                    currentText = thing + currentText.Substring(word.Length);
                }

                // Replace every occurrence that directly follows a separator
                currentText = currentText.Replace("\r\n" + word, "\r\n" + thing)
                                         .Replace("\n" + word, "\n" + thing)
                                         .Replace(" " + word, " " + thing)
                                         .Replace("</br>" + word, "</br>" + thing);
            }
        }
    }

    return currentText;
}

1 Comment

if (word.ToLower().StartsWith("www.")) { if (IsAllUpper(word)) is pointless; test directly for StartsWith("WWW.").
