Removing the URL from text using Java

Question

How can I remove the URLs present in a text, for example

String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";

using a regular expression?

I want to remove all the URLs in the text, but my code is not working:

String pattern = "(http(.*?)\\s)";
Pattern pt = Pattern.compile(pattern);
Matcher namemacher = pt.matcher(input);
if (namemacher.find()) {
  str=input.replace(namemacher.group(0), "");
}

You could maybe check this post: stackoverflow.com/questions/8694984/remove-part-of-string
– Martin Rohwedder, Commented Sep 11, 2012 at 9:29
@Rohwedder This does not work if my text ends with a URL, because I don't have the index of the URL.
– NLP JAVA, Commented Sep 11, 2012 at 9:32
@Philipp I have a string like: #AssamRiots: Situation calm in Dhubri; curfew relaxed for 2 hours - Daily Bhaskar t.co/ocq6RNFI
– NLP JAVA, Commented Sep 11, 2012 at 9:36

Ev0oD · Accepted Answer · 2014-06-19 12:09:55Z

22

Pass in the String that contains the URL:

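import java.util.regex.Matcher;
import java.util.regex.Pattern;

private String removeUrl(String commentstr)
    {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        while (m.find()) {
            // group(0) is the whole matched URL; Pattern.quote keeps replaceAll
            // from interpreting characters such as ? or ) in it as regex syntax
            commentstr = commentstr.replaceAll(Pattern.quote(m.group(0)),"").trim();
        }
        return commentstr;
    }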

edited Jun 19, 2014 at 12:09

Ev0oD

answered Oct 18, 2012 at 9:02

NLP JAVA

1 Comment

Shubham Sharma Over a year ago

After 3 to 4 hours I realized that your code is not working.

Favonius · Accepted Answer · 2017-03-24 18:26:02Z

5

Well, you haven't provided any info about your text, so assuming it looks like this: "Some text here http://www.example.com some text there", you can do this:

String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");

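This replaces every sequence that starts with "http" and runs up to the next whitespace character with a single space. A URL at the very end of the text, with no whitespace after it, is not matched.

You should read the Javadoc for the String class. It will make things clear for you.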
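
A minimal sketch of the same idea, assuming every URL starts with http:// or https:// and ends at the next whitespace. Matching `\S+` rather than `.*?\s` also covers a URL at the very end of the string. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

final class HttpUrlStripper {
    // Assumption: a URL is "http://" or "https://" followed by non-whitespace characters.
    private static final Pattern HTTP_URL = Pattern.compile("https?://\\S+");

    static String strip(String text) {
        // Remove every match, then collapse the whitespace runs left behind and trim the ends.
        String withoutUrls = HTTP_URL.matcher(text).replaceAll("");
        return withoutUrls.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String str = "Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        System.out.println(strip(str)); // prints: Fear psychosis after #AssamRiots -
    }
}
```

Text without a scheme, such as t.co/ocq6RNFI, is left untouched by this pattern.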

edited Mar 24, 2017 at 18:26

Favonius

answered Sep 11, 2012 at 9:29

svz

1 Comment

Jaec Over a year ago

It must be yourText.replaceAll("http.*?\\s", "");

Philipp · Accepted Answer · 2012-09-11 09:34:22Z

4

How do you define URL? You might not just want to filter http:// but also https:// and other protocols like ftp://, rss:// or custom protocols.

Maybe this regular expression would do the job:

[\S]+://[\S]+

Explanation:

one or more non-whitespaces
followed by the string "://"
followed by one or more non-whitespaces
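
In Java, the expression above could be applied roughly as follows. This is a sketch, not a complete URL matcher; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class SchemeUrlStripper {
    // The pattern from the answer: non-whitespace, then "://", then non-whitespace.
    private static final Pattern ANY_SCHEME_URL = Pattern.compile("\\S+://\\S+");

    static String strip(String text) {
        // Delete each match and trim leading/trailing whitespace.
        return ANY_SCHEME_URL.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("see ftp://example.com/file now"));
        System.out.println(strip("ends with a link https://example.com"));
    }
}
```

Because the leading `\S+` is not restricted to scheme characters, anything attached directly to the front of the URL, such as an opening parenthesis, is removed with it. Text with no "://", such as t.co/ocq6RNFI, does not match.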

answered Sep 11, 2012 at 9:34

Philipp

3 Comments

NLP JAVA Over a year ago

I have the string: #AssamRiots: Situation calm in Dhubri; curfew relaxed for 2 hours - Daily Bhaskar t.co/ocq6RNFI

Philipp Over a year ago

The regular expression I posted should also work when the URL is at the end of the message. When there are no whitespaces after the URL, it matches until the end of the message. At least it does on regexpal.com

Philipp Over a year ago

Why are you asking me when you went with the solution by svz?

John81 · Accepted Answer · 2016-01-19 18:28:12Z

Note that if your URL contains characters like ?, & or \ then the answers above may not work, because replaceAll treats its first argument as a regular expression rather than as literal text. What worked for me was to strip those characters into a new string variable, strip the same characters from each match returned by m.find(), and then call replaceAll on the new string variable.

private String removeUrl(String commentstr)
{
    // get rid of ? and & up front, since replaceAll would treat ? as regex syntax
    String result = commentstr.replaceAll("\\?", "").replaceAll("&", "");

    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    while (m.find()) {
        // group(0) is the whole match; strip the same characters so it lines up with result
        String url = m.group(0).replaceAll("\\?", "").replaceAll("&", "");
        // accumulate into result so every URL is removed, not just the last one;
        // Pattern.quote covers the remaining metacharacters such as ( ) + $
        result = result.replaceAll(Pattern.quote(url), "").trim();
    }
    return result;
}

Mir Saman · Accepted Answer · 2018-09-09 13:27:27Z

As @Ev0oD mentioned, the code works perfectly, except for the following tweet I'm working on: RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)

At the line where the token is removed, commentstr = commentstr.replaceAll(m.group(i),"").trim();

I get the following error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22

where m.group(i) is https://t.co /k9nYBu3QHu)
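
The exception is consistent with how String.replaceAll works: its first argument is compiled as a regular expression, so a matched URL that ends in ) is rejected as an invalid pattern. Below is a minimal sketch of two ways to treat the match as literal text instead, using String.replace or Pattern.quote. The class name, method names and sample string are made up for illustration.

```java
import java.util.regex.Pattern;

final class LiteralReplaceDemo {
    // String.replace takes literal text, so no regex parsing happens at all.
    static String removeLiteral(String text, String token) {
        return text.replace(token, "").trim();
    }

    // replaceAll still compiles a regex, but Pattern.quote makes every character literal.
    static String removeQuoted(String text, String token) {
        return text.replaceAll(Pattern.quote(token), "").trim();
    }

    public static void main(String[] args) {
        String text = "see (https://example.com/a)";
        String token = "https://example.com/a)"; // contains an unmatched ')'
        System.out.println(removeLiteral(text, token));
        System.out.println(removeQuoted(text, token));
    }
}
```

Both methods remove the token without throwing, whereas text.replaceAll(token, "") would fail on the unmatched parenthesis.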

tick_tack_techie · Accepted Answer · 2015-07-23 03:38:18Z

m.group(0) should be replaced with an empty string, rather than m.group(i) where i is incremented with every call to m.find(), as in the removeUrl loop discussed above. m.group(i) returns capture group i of the current match, not the i-th match.

private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        // copy the text before this match into sb and drop the match itself
        m.appendReplacement(sb, "");
    }
    // copy whatever follows the last match
    m.appendTail(sb);
    return sb.toString();
}
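
A shorter way to get the same result as a find/appendReplacement/appendTail loop is Matcher.replaceAll, which replaces every match in one call. The sketch below reuses the URL pattern from this thread; the class name is hypothetical.

```java
import java.util.regex.Pattern;

final class MatcherReplaceAllDemo {
    // URL pattern copied from the answers in this thread, compiled once.
    private static final Pattern URL = Pattern.compile(
        "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    static String removeUrl(String text) {
        // Replace every match with the empty string; text around the matches is kept.
        return URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeUrl("Hello https://www.google.com/hello - visit us here!"));
    }
}
```

The whitespace on either side of a removed URL stays in place, so call trim() or collapse repeated spaces afterwards if that matters.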

Oleg · Accepted Answer · 2022-10-21 08:20:12Z

0

"Hello https://www.google.com/hello - visit us here!".replaceAll("((https?|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)", "");

evaluates to:

Hello  - visit us here!

Optionally add a space before 'https' and 'http' in the regex to strip the space before the URL as well.

answered Oct 21, 2022 at 8:20

Oleg

Shubham Sharma · Accepted Answer · 2017-09-14 10:59:25Z

-3

If you can switch to Python, you can do the same thing with this code:

import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
# remove every token that starts with "ftp" and continues with non-whitespace characters
result = re.sub(r"ftp\S+", "", text)
print(result)

answered Sep 14, 2017 at 10:59

Shubham Sharma
