Regular expression not parsing URL properly Java

Question

I have a text file that contains URLs of varying complexity. Here is a sample:

https://www.google.com/?gws_rd=ssl
http://www.cs.jhu.edu/news-events/news-articles/
maps.google.com
http://www.cnn.com/WORLD/?hpt=sitenav
http://www.cnn.com/JUSTICE/?hpt=sitenav
http://www.cs.jhu.edu/course-info/
http://e-catalog.jhu.edu/departments-program-requirements-and-courses/engineering/computer-science/
http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html
http://mexico.cnn.com/?hpt=ed_Mexico
cnn.com

From these lines, I only want to get the "X.Y" part. In other words, from the first 4 lines, I want to get:

google.com
jhu.edu
google.com
cnn.com

In order to do this, I made a regular expression and I am attempting to match it:

public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\Me\\Desktop\\homework4file.txt"));
    String line = null;
    Pattern pattern = Pattern.compile("^[a-zA-Z0-9\\-\\.]+\\.(com)$");
    Matcher matcher;
    while ((line = reader.readLine()) != null) {
        matcher = pattern.matcher(line);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

My regular expression is just returning "com" for each line. I don't see what's wrong with what I've written. Could someone explain the logic error in my expression?

group(1) returns whatever is in the first capture group, i.e. the first thing you have in parentheses () in your regex. The only thing you have in parentheses is (com). Therefore, group(1) returns com.
– ajb, Commented Nov 4, 2014 at 1:51
maps.google.com is a relative URL with one path component, which is maps.google.com.
– Mike Samuel, Commented Nov 4, 2014 at 2:10

Avinash Raj · Accepted Answer · 2014-11-04 02:04:45Z

You don't need the anchors. ^ asserts that the match begins at the start of the line, but the part you want (the label just before .com) usually isn't at the start. [a-zA-Z0-9\\-\\.]+ greedily matches everything before .com, back to the nearest character outside the class, such as a /. So for the string http://mexico.cnn.com/?hpt=ed_Mexico, the regex [a-zA-Z0-9\\-\\.]+\\.(com) would match mexico.cnn.com, not cnn.com. Putting com and edu into a non-capturing group separated by | also lets the pattern match the part before .edu.

[^.\\n]+\\.(?:com|edu)

DEMO

String input = "https://www.google.com/?gws_rd=ssl\n" +
        "http://www.cs.jhu.edu/news-events/news-articles/\n" +
        "maps.google.com\n" +
        "http://www.cnn.com/WORLD/?hpt=sitenav\n" +
        "http://www.cnn.com/JUSTICE/?hpt=sitenav\n" +
        "http://www.cs.jhu.edu/course-info/\n" +
        "http://e-catalog.jhu.edu/departments-program-requirements-and-courses/engineering/computer-science/\n" +
        "http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html\n" +
        "http://mexico.cnn.com/?hpt=ed_Mexico\n" +
        "cnn.com";
Pattern regex = Pattern.compile("[^.\\n]+\\.(?:com|edu)");
Matcher matcher = regex.matcher(input);
while (matcher.find()) {
    // group(0) is the whole match; the pattern has no capturing groups
    System.out.println(matcher.group(0));
}

Output:

google.com
jhu.edu
google.com
cnn.com
cnn.com
jhu.edu
jhu.edu
oracle.com
cnn.com
cnn.com
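
To make the point about group(1) versus group(0) concrete, here is a small sketch that runs the question's pattern and the pattern above over a few of the sample lines. The class name GroupDemo and the helpers originalGroup1 and firstDomain are made up for illustration; they are not from the question or the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // The question's pattern: anchored, and the only capturing group is (com).
    static final Pattern ORIGINAL = Pattern.compile("^[a-zA-Z0-9\\-\\.]+\\.(com)$");
    // The answer's pattern: unanchored, with a non-capturing group for the suffix.
    static final Pattern FIXED = Pattern.compile("[^.\\n]+\\.(?:com|edu)");

    /** Returns group(1) of the question's pattern, or null if the line does not match. */
    static String originalGroup1(String line) {
        Matcher m = ORIGINAL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the first whole match (group 0) of the answer's pattern, or null. */
    static String firstDomain(String line) {
        Matcher m = FIXED.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "https://www.google.com/?gws_rd=ssl",
            "http://www.cs.jhu.edu/news-events/news-articles/",
            "maps.google.com",
            "cnn.com"
        };
        for (String line : lines) {
            System.out.println(line
                    + " | original group(1): " + originalGroup1(line)
                    + " | fixed group(0): " + firstDomain(line));
        }
    }
}
```

For cnn.com the original pattern yields com, because that is all its capture group holds. Lines that start with http:// do not match the original pattern at all, since ^ pins the match to the start of the line and : and / are outside the character class.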

Glad it worked out. You could also use [a-zA-Z0-9\-]+\.\w{2,3}(?=\/|$); see regex101.com/r/xY3sK0/7
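
A rough sketch of how that lookahead variant might be used line by line, next to the pattern from the answer. One difference worth knowing: [^.\\n]+ also accepts : and /, so on a URL with no subdomain the answer's pattern would include the scheme in the match. The input http://cnn.com/ below is a hypothetical line, not one from the sample file, and the class name LookaheadDemo and helper first are illustrative only. In a Java string the / needs no escaping, so the lookahead is written (?=/|$).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Pattern from the answer: any run of non-dot characters before .com or .edu.
    static final Pattern ANSWER = Pattern.compile("[^.\\n]+\\.(?:com|edu)");
    // Variant from the comment: a hostname label, a dot, a 2-3 character suffix,
    // and then either a slash or the end of the input.
    static final Pattern LOOKAHEAD = Pattern.compile("[a-zA-Z0-9\\-]+\\.\\w{2,3}(?=/|$)");

    /** Returns the first whole match of the pattern in the line, or null if none. */
    static String first(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "http://mexico.cnn.com/?hpt=ed_Mexico",
            "http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html",
            "http://cnn.com/" // hypothetical input with no subdomain
        };
        for (String line : lines) {
            System.out.println(line
                    + " | answer: " + first(ANSWER, line)
                    + " | lookahead: " + first(LOOKAHEAD, line));
        }
    }
}
```

The lookahead variant is not limited to com and edu, but it only accepts a suffix of two or three word characters, so longer suffixes would need the quantifier adjusted.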
