14

How can I remove the URLs present in a text, for example

String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";

using a regular expression?

I want to remove all the URLs in the text, but my code is not working:

String pattern = "(http(.*?)\\s)";
Pattern pt = Pattern.compile(pattern);
Matcher namemacher = pt.matcher(input);
if (namemacher.find()) {
  str=input.replace(namemacher.group(0), "");
}
5
  • 1
    I don't understand your question... provide some examples Commented Sep 11, 2012 at 9:22
  • I want to remove the URLs that come with the text. Commented Sep 11, 2012 at 9:26
  • You could maybe check this post - stackoverflow.com/questions/8694984/remove-part-of-string Commented Sep 11, 2012 at 9:29
  • @Rohwedder This does not work if my text ends with a URL, because I don't have the index of the URL. Commented Sep 11, 2012 at 9:32
  • @Philipp I have a string like #AssamRiots: Situation calm in Dhubri; curfew relaxed for 2 hours - Daily Bhaskar t.co/ocq6RNFI Commented Sep 11, 2012 at 9:36

8 Answers

22

Pass in the String that contains the URL:

private String removeUrl(String commentstr)
    {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        int i = 0;
        while (m.find()) {
            commentstr = commentstr.replaceAll(m.group(i),"").trim();
            i++;
        }
        return commentstr;
    }

1 Comment

After 3 to 4 hours I realized that your code is not working.
5

Well, you haven't provided any info about your text, so with the assumption of your text looking like this: "Some text here http://www.example.com some text there", you can do this:

String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");

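This replaces every sequence that starts with "http" and runs up to and including the next whitespace character with a single space. A URL at the very end of the text, with no whitespace after it, is not matched.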

You should read the Javadoc on String class. It will make things clear for you.
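
To make the behaviour of that pattern concrete, here is a minimal sketch. The class and method names are made up for illustration, and the `http\\S+` variant is only an assumption about what the asker wants, not part of the answer above.

```java
final class LazyHttpDemo {
    // Pattern from the answer: "http", then as little as possible up to one whitespace character.
    static String stripWithLazyPattern(String text) {
        return text.replaceAll("http.*?\\s", " ");
    }

    // Hypothetical variant: "http" followed by a run of non-whitespace characters,
    // so a URL at the very end of the text is matched as well.
    static String stripWithNonWhitespace(String text) {
        return text.replaceAll("http\\S+", "");
    }

    public static void main(String[] args) {
        String middle = "Some text here http://www.example.com some text there";
        String atEnd = "Some text here http://www.example.com";
        System.out.println(stripWithLazyPattern(middle));  // URL in the middle is removed
        System.out.println(stripWithLazyPattern(atEnd));   // unchanged: no whitespace follows the URL
        System.out.println(stripWithNonWhitespace(atEnd)); // URL at the end is removed
    }
}
```

Both variants still treat any word starting with "http" as a URL, so they are only suitable for text where that is acceptable.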

1 Comment

It must be yourText.replaceAll("http.*?\\s", "");
4

How do you define URL? You might not just want to filter http:// but also https:// and other protocols like ftp://, rss:// or custom protocols.

Maybe this regular expression would do the job:

[\S]+://[\S]+

Explanation:

  • one or more non-whitespaces
  • followed by the string "://"
  • followed by one or more non-whitespaces
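
As a rough sketch of how this expression could be used from Java (the class name `SchemeUrlStripper` and the method `strip` are invented for this example):

```java
import java.util.regex.Pattern;

final class SchemeUrlStripper {
    // One or more non-whitespace characters, then "://", then one or more non-whitespace characters.
    private static final Pattern URL = Pattern.compile("\\S+://\\S+");

    // Removes every match of the pattern; surrounding whitespace is left in place.
    static String strip(String text) {
        return URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        System.out.println(strip(str).trim());
    }
}
```

Because the pattern requires "://", a link written without a scheme, such as t.co/ocq6RNFI, is left untouched.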

3 Comments

I have the string #AssamRiots: Situation calm in Dhubri; curfew relaxed for 2 hours - Daily Bhaskar t.co/ocq6RNFI
The regular expression I posted should also work when the URL is at the end of the message. When there are no whitespaces after the URL, it matches until the end of the message. At least it does on regexpal.com
Why are you asking me when you went with the solution by svz?
3

Note that if your URL contains characters like ? and &, the answers above may not work, because replaceAll treats its first argument as a regular expression, in which characters such as ? have a special meaning. What worked for me was to strip those characters from a copy of the input, strip them from each match returned by m.find() as well, and then call replaceAll on the copy.

private String removeUrl(String commentstr)
{
    // get rid of ? and & in urls since replaceAll can't deal with them
    String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");

    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    while (m.find()) {
        // group(0) is the whole matched URL; strip ? and & from it the same way
        String url = m.group(0).replaceAll("\\?", "").replaceAll("\\&", "");
        // keep working on commentstr1 so that earlier removals are not lost
        commentstr1 = commentstr1.replaceAll(url, "").trim();
    }
    return commentstr1;
}

1 Comment

You can simply call replace instead of multiple replaceAll calls.
1

As @Ev0oD mentioned, the code works perfectly except on the following tweet I'm working on: RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)

At the line where the token is removed, commentstr = commentstr.replaceAll(m.group(i),"").trim();

I get the following error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22

where m.group(i) is https://t.co /k9nYBu3QHu)
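
This kind of exception arises because String.replaceAll compiles its first argument as a regular expression, so a matched token containing an unbalanced ) is not a valid pattern. A minimal sketch of two ways to treat the token as literal text follows; the class and method names are hypothetical, and the sample token is a shortened, made-up stand-in for the tweet above.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class LiteralReplaceDemo {
    // String.replace treats the target as literal text, not as a regex.
    static String removeLiteral(String text, String token) {
        return text.replace(token, "");
    }

    // Pattern.quote escapes the token so that replaceAll matches it literally.
    static String removeQuoted(String text, String token) {
        return text.replaceAll(Pattern.quote(token), "");
    }

    // Returns true if the token cannot be compiled as a regex on its own.
    static boolean failsAsRegex(String token) {
        try {
            Pattern.compile(token);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "#MarkRuffalo (https://t.co/k9nYBu3QHu)"; // made-up sample
        String token = "https://t.co/k9nYBu3QHu)";               // token with an unbalanced ')'
        System.out.println(failsAsRegex(token));
        System.out.println(removeLiteral(text, token));
        System.out.println(removeQuoted(text, token));
    }
}
```

Either approach avoids the exception; whether the closing parenthesis should be part of the removed token at all is a separate question about the URL pattern.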


0

m.group(0) should be replaced with an empty string rather than m.group(i) where i is incremented with every call to m.find() as mentioned in one of the answers above.

private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        // copies the text before the match and replaces the match itself with ""
        m.appendReplacement(sb, "");
    }
    // copies the text after the last match, which would otherwise be lost
    m.appendTail(sb);
    return sb.toString();
}


0
"Hello https://www.google.com/hello - visit us here!".replaceAll("((https?|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)", "");

will print:

Hello  - visit us here!

Optionally add a space before 'https' and 'http' in the regex to strip the space before URL as well.
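
A small sketch of that suggestion, assuming a single optional space in front of the scheme is what is meant (the class name `LeadingSpaceDemo` and the constant name are made up for this example):

```java
final class LeadingSpaceDemo {
    // The pattern from this answer, prefixed with " ?" so that one space
    // directly before the URL is removed together with it.
    private static final String URL_WITH_LEADING_SPACE =
        " ?((https?|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";

    static String strip(String text) {
        return text.replaceAll(URL_WITH_LEADING_SPACE, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("Hello https://www.google.com/hello - visit us here!"));
    }
}
```

With the optional space in place, the sample sentence keeps only a single space between "Hello" and the dash.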


-3

If you can move to Python, you can find a much better solution using this code:

import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
result = re.sub(r"ftp\S+", "", text)
print(result)
