I have a text file that contains URLs of varying complexity. Here is a sample:

https://www.google.com/?gws_rd=ssl
http://www.cs.jhu.edu/news-events/news-articles/
maps.google.com
http://www.cnn.com/WORLD/?hpt=sitenav
http://www.cnn.com/JUSTICE/?hpt=sitenav
http://www.cs.jhu.edu/course-info/
http://e-catalog.jhu.edu/departments-program-requirements-and-courses/engineering/computer-science/
http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html
http://mexico.cnn.com/?hpt=ed_Mexico
cnn.com

From these lines, I only want to get the "X.Y" part. In other words, from the first 4 lines, I want to get:

google.com
jhu.edu
google.com
cnn.com

In order to do this, I made a regular expression and I am attempting to match it:

public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\Me\\Desktop\\homework4file.txt"));
        String line = null;
        Pattern pattern = Pattern.compile("^[a-zA-Z0-9\\-\\.]+\\.(com)$");
        Matcher matcher;
        while((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            while(matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }

My regular expression is just returning "com" for each line. I don't see what's wrong with what I've written. Could someone explain the logic error in my expression?

  • It's not O to 9 . It's zero to 9. Commented Nov 4, 2014 at 1:40
  • group(1) returns whatever is in the first capture group, i.e. the first thing you have in parentheses () in your regex. The only thing you have in parentheses is (com). Therefore, group(1) returns com. Commented Nov 4, 2014 at 1:51
  • maps.google.com is a relative URL with one path component which is maps.google.com. Commented Nov 4, 2014 at 2:10
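
To make the point about group(1) concrete, here is a minimal sketch that runs the pattern from the question unchanged and prints both group(0) and group(1). The class name GroupDemo and the helper groups are made up for illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("^[a-zA-Z0-9\\-\\.]+\\.(com)$");

    // Hypothetical helper: returns {group(0), group(1)} for the first match,
    // or null when the line does not match at all.
    static String[] groups(String line) {
        Matcher m = QUESTION_PATTERN.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("maps.google.com");
        System.out.println("group(0) = " + g[0]); // the whole match
        System.out.println("group(1) = " + g[1]); // only the text inside (com)

        // With ^ and $ in place, a line containing ':' or '/' cannot match.
        System.out.println(groups("http://www.cnn.com/WORLD/?hpt=sitenav") == null);
    }
}
```

For maps.google.com, group(0) is the whole line and group(1) is just com. The anchored pattern finds nothing at all in the http:// lines, so those lines print nothing in the original loop.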

1 Answer

You don't need the anchors. ^ asserts that the match starts at the beginning of the line, but in most of your lines the part you want isn't at the beginning. [a-zA-Z0-9\\-\\.]+ greedily matches everything before .com, back to the nearest character outside the class (such as /). Because the class includes the dot, for the string http://mexico.cnn.com/?hpt=ed_Mexico the regex [a-zA-Z0-9\\-\\.]+\\.(com) would match mexico.cnn.com, not cnn.com. Excluding the dot from the class fixes this, and putting com and edu into a non-capturing group separated by | also matches the name before .edu:

[^.\\n]+\\.(?:com|edu)
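
As a quick check of the difference, the following sketch runs both character classes against the mexico.cnn.com URL. The class name GreedyDemo and the helper firstMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: returns the first match of regex in text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String url = "http://mexico.cnn.com/?hpt=ed_Mexico";

        // The class contains '.', so the match swallows the subdomain too.
        System.out.println(firstMatch("[a-zA-Z0-9\\-\\.]+\\.(com)", url));

        // The class excludes '.', so only the label right before .com is kept.
        System.out.println(firstMatch("[^.\\n]+\\.(?:com|edu)", url));
    }
}
```

The first call yields mexico.cnn.com and the second yields cnn.com.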

DEMO

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainDemo {
    public static void main(String[] args) {
        String input = "https://www.google.com/?gws_rd=ssl\n" +
                "http://www.cs.jhu.edu/news-events/news-articles/\n" +
                "maps.google.com\n" +
                "http://www.cnn.com/WORLD/?hpt=sitenav\n" +
                "http://www.cnn.com/JUSTICE/?hpt=sitenav\n" +
                "http://www.cs.jhu.edu/course-info/\n" +
                "http://e-catalog.jhu.edu/departments-program-requirements-and-courses/engineering/computer-science/\n" +
                "http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html\n" +
                "http://mexico.cnn.com/?hpt=ed_Mexico\n" +
                "cnn.com";
        // [^.\n]+ matches the label right before the dot; (?:com|edu) is the suffix.
        Pattern regex = Pattern.compile("[^.\\n]+\\.(?:com|edu)");
        Matcher matcher = regex.matcher(input);
        while (matcher.find()) {
            // group(0) is the entire match, e.g. google.com
            System.out.println(matcher.group(0));
        }
    }
}

Output:

google.com
jhu.edu
google.com
cnn.com
cnn.com
jhu.edu
jhu.edu
oracle.com
cnn.com
cnn.com
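
The question reads the file line by line, so here is a sketch of the same regex applied per line through a BufferedReader. A StringReader holding four of the sample lines stands in for the file on disk. The class name LineByLineDemo and the method domains are hypothetical, and taking only the first match per line is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineByLineDemo {
    static final Pattern DOMAIN = Pattern.compile("[^.\\n]+\\.(?:com|edu)");

    // Hypothetical helper: collects the first X.Y match from every line.
    static List<String> domains(BufferedReader reader) {
        List<String> result = new ArrayList<>();
        reader.lines().forEach(line -> {
            Matcher m = DOMAIN.matcher(line);
            if (m.find()) {
                result.add(m.group()); // group() is the same as group(0)
            }
        });
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file; swap in a FileReader to read from disk.
        String sample = "https://www.google.com/?gws_rd=ssl\n"
                + "http://www.cs.jhu.edu/news-events/news-articles/\n"
                + "maps.google.com\n"
                + "cnn.com";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            for (String domain : domains(reader)) {
                System.out.println(domain);
            }
        }
    }
}
```

For these four lines it prints google.com, jhu.edu, google.com and cnn.com, one per line.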

1 Comment

Glad it worked out. You could also use this: [a-zA-Z0-9\-]+\.\w{2,3}(?=\/|$) (regex101.com/r/xY3sK0/7)
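
A sketch of how the lookahead variant from this comment might be used in Java, applied to one line at a time. The class name LookaheadDemo and the method firstDomain are hypothetical. Because $ only matches at the very end of the input without the MULTILINE flag, this assumes each line is matched separately, as in the question's loop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // The regex from the comment, written as a Java string literal.
    // (?=\/|$) requires the suffix to be followed by '/' or the end of input.
    static final Pattern DOMAIN =
            Pattern.compile("[a-zA-Z0-9\\-]+\\.\\w{2,3}(?=\\/|$)");

    // Hypothetical helper: returns the first match in a single line, or null.
    static String firstDomain(String line) {
        Matcher m = DOMAIN.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDomain("http://www.cnn.com/WORLD/?hpt=sitenav"));
        System.out.println(firstDomain("http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html"));
        System.out.println(firstDomain("cnn.com"));
    }
}
```

This variant does not hard-code com or edu: it accepts any suffix of two or three word characters that is followed by / or the end of the line. It would therefore also pick up suffixes such as .org.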
