C#: Parse URL in text

Question

I have a sentence that may contain URLs. I need to take any uppercase URL that starts with WWW. and prepend HTTP:// to it. I have tried the following:

    private string ParseUrlInText(string text)
    {
        string currentText = text;

        foreach (string word in currentText.Split(new[] { "\r\n", "\n", " ", "</br>" }, StringSplitOptions.RemoveEmptyEntries))
        {
            string thing;
            if (word.ToLower().StartsWith("www."))
            {
                if (IsAllUpper(word))
                {
                    thing = "HTTP://" + word;

                    currentText = ReplaceFirst(currentText, word, thing);
                }
            }
        }

        return currentText;
    }

    public string ReplaceFirst(string text, string search, string replace)
    {
        int pos = text.IndexOf(search);
        if (pos < 0)
        {
            return text;
        }
        return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }

    private static bool IsAllUpper(string input)
    {
        return input.All(t => !Char.IsLetter(t) || Char.IsUpper(t));
    }

However, it only prepends multiple HTTP:// prefixes to the first URL. I am trying to convert the following:

WWW.GOOGLE.CO.ZA
WWW.GOOGLE.CO.ZA WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA
there are a lot of domains (This shouldn't be parsed)

to

HTTP:// WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA HTTP:// WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA
there are a lot of domains (This shouldn't be parsed)

Could someone please show me the proper way to do this?

Edit: I need to keep the format of the string (spaces, newlines, etc.).
Edit2: A URL might already have HTTP:// prepended. I've updated the demo.

You shouldn't use your ReplaceFirst method. Instead, save the position of the word before modifying it, delete it, and insert the new word (with http://) at the position you saved. Your ReplaceFirst will obviously replace the first occurrence found, which is annoying if the word appears multiple times, and that is exactly your issue here.
– Kilazur, Commented May 15, 2014 at 7:36
The easiest way might be to use the Uri class; see here: stackoverflow.com/questions/15713542/elegant-way-parsing-url
– damaltor, Commented May 15, 2014 at 7:44
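One way to sketch the idea from the first comment, handling each word where it sits instead of searching for it again, is to split the text while keeping the delimiters and then rebuild it. This is a variation on the suggestion rather than a literal save-delete-insert implementation. The class and method names are hypothetical, and the sketch assumes an existing HTTP:// prefix is attached directly to the URL.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenPrefixer
{
    // Hypothetical helper: splits on the question's delimiters but keeps them,
    // because a capturing group makes Regex.Split return the separators too.
    public static string PrefixUpperCaseUrls(string text)
    {
        string[] parts = Regex.Split(text, @"(\r\n|\n| |</br>)");
        for (int i = 0; i < parts.Length; i++)
        {
            string part = parts[i];
            // Same test as IsAllUpper in the question: every letter is upper case.
            bool allUpper = part.All(c => !char.IsLetter(c) || char.IsUpper(c));
            if (part.StartsWith("WWW.", StringComparison.Ordinal) && allUpper)
            {
                parts[i] = "HTTP://" + part;
            }
        }
        // Words and delimiters are concatenated in their original order,
        // so spaces and newlines are preserved.
        return string.Concat(parts);
    }
}
```

Because each token is rewritten in place, repeated URLs are each prefixed exactly once, and a token such as HTTP://WWW.A.COM is left alone since it does not start with WWW.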

Community · Accepted Answer · 2017-05-23 12:05:38Z

The issue with your code: you're using a ReplaceFirst method, which does exactly what it's meant to: it replaces the first occurrence, which is obviously not always the one you want to replace. This is why only your first WWW.GOOGLE.CO.ZA gets all the HTTP:// prefixes.
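The accumulation can be reproduced in isolation. The following is a minimal sketch: ReplaceFirst follows the logic of the question's helper, while the class name, PrefixTwice, and the sample domain are made up for illustration.

```csharp
using System;

public static class ReplaceFirstDemo
{
    // Same logic as the ReplaceFirst helper in the question.
    public static string ReplaceFirst(string text, string search, string replace)
    {
        int pos = text.IndexOf(search, StringComparison.Ordinal);
        if (pos < 0)
        {
            return text;
        }
        return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }

    // Mimics two iterations of the question's loop when the same word
    // appears twice in the text.
    public static string PrefixTwice(string text, string word)
    {
        text = ReplaceFirst(text, word, "HTTP://" + word);
        // The second call finds the word inside the already prefixed
        // first occurrence, so the prefix is added there again.
        text = ReplaceFirst(text, word, "HTTP://" + word);
        return text;
    }
}
```

Calling ReplaceFirstDemo.PrefixTwice("WWW.A.COM WWW.A.COM", "WWW.A.COM") returns HTTP://HTTP://WWW.A.COM WWW.A.COM: the second occurrence is never touched.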

One method would be to use a StreamReader or something similar: each time you reach a new word, check whether its first four characters are "WWW." and, if so, insert the string "HTTP://" at that position. But that's pretty long-winded for something that can be much shorter...

So let's go Regex!

How to insert characters before a word with Regex:

    // The parentheses create capture group 1, which $1 refers to in the replacement.
    Regex.Replace(input, @"([abc])", "adding_text_before_match$1");

How to match a word that is not preceded by another word (negative lookbehind):

    (?<!wont_start_with_that)word_to_match
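As a small illustration of the lookbehind on its own, here is a sketch with placeholder words; the class name, method name, and the foo/bar strings are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Wraps "bar" in brackets only when it is not immediately preceded by "foo".
    public static string MarkBar(string input)
    {
        return Regex.Replace(input, @"(?<!foo)bar", "[bar]");
    }
}
```

LookbehindDemo.MarkBar("foobar bar") returns foobar [bar]: the first bar follows foo and is skipped, the second one is replaced. The lookbehind itself consumes no characters, so only the match is rewritten.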

Which leads us to:

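    // Requires: using System.Text.RegularExpressions;
    private string ParseUrlInText(string text)
    {
        // Prepend HTTP:// to each WWW. URL that is not already preceded by HTTP://
        return Regex.Replace(text, @"(?<!HTTP://)(WWW\.[A-Za-z0-9_\.]+)",
            @"HTTP://$1");
    }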

I like this. Unfortunately, if one of those URLs already starts with HTTP://, it would still prepend another HTTP://, doubling it.
Indeed... That gets a bit more complicated, but I'm looking into it :p
There you go, it just needed a negative lookbehind. It worked on regexhero.net/tester
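The pattern above also accepts lowercase letters after WWW. and can match in the middle of a longer word. A stricter variant might look like the sketch below. It is not the answerer's code: the class name, the field name, and the extra lookarounds are assumptions, and it presumes an existing HTTP:// prefix is attached directly to the URL.

```csharp
using System.Text.RegularExpressions;

public static class UrlPrefixer
{
    // (?<![\w/.])  : not preceded by a word character, '/' or '.', so
    //                "HTTP://WWW.X.COM" and "AWWW.X.COM" are left alone.
    // [A-Z0-9_.]+  : a run of upper-case letters, digits, '_' and '.'.
    // (?![\w.])    : the run must not continue with another word character
    //                or '.', so "WWW.Google.com" is skipped.
    private static readonly Regex UpperCaseWww =
        new Regex(@"(?<![\w/.])(WWW\.[A-Z0-9_.]+)(?![\w.])");

    public static string ParseUrlInText(string text)
    {
        // $1 is the captured URL; everything outside the match is untouched,
        // so spaces and newlines keep their original layout.
        return UpperCaseWww.Replace(text, "HTTP://$1");
    }
}
```

With the input "WWW.GOOGLE.CO.ZA WWW.GOOGLE.CO.ZA\nHTTP://WWW.GOOGLE.CO.ZA", both bare URLs on the first line get one prefix each and the already prefixed URL on the second line is unchanged.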

Nefarion · Accepted Answer · 2014-05-15 08:28:46Z

I'd go for the following:

1) You don't handle the same element twice.
2) You replace all instances at once.

    // Requires: using System.Linq; (for Distinct)
    private string ParseUrlInText(string text)
    {
        string currentText = text;
        var workingText = currentText.Split(new[] { "\r\n", "\n", " ", "</br>" },
                              StringSplitOptions.RemoveEmptyEntries).Distinct(); // .Distinct() gives us just unique entries!
        foreach (string word in workingText)
        {
            string thing;
            if (word.ToLower().StartsWith("www."))
            {
                if (IsAllUpper(word))
                {
                    thing = "HTTP://" + word;

                    // Replace every occurrence that directly follows a delimiter.
                    // Note: a word at the very start of the text has no preceding
                    // delimiter, so it is not replaced here.
                    currentText = currentText.Replace("\r\n" + word, "\r\n" + thing)
                                             .Replace("\n" + word, "\n" + thing)
                                             .Replace(" " + word, " " + thing)
                                             .Replace("</br>" + word, "</br>" + thing);
                }
            }
        }

        return currentText;
    }

The nested check if (word.ToLower().StartsWith("www.")) { if (IsAllUpper(word)) is pointless; test directly for StartsWith("WWW.").
