I have a text file containing URLs of varying complexity. Here is a sample:
https://www.google.com/?gws_rd=ssl
http://www.cs.jhu.edu/news-events/news-articles/
maps.google.com
http://www.cnn.com/WORLD/?hpt=sitenav
http://www.cnn.com/JUSTICE/?hpt=sitenav
http://www.cs.jhu.edu/course-info/
http://e-catalog.jhu.edu/departments-program-requirements-and-courses/engineering/computer-science/
http://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html
http://mexico.cnn.com/?hpt=ed_Mexico
cnn.com
From these lines, I only want to get the "X.Y" part. In other words, from the first 4 lines, I want to get:
google.com
jhu.edu
google.com
cnn.com
In order to do this, I made a regular expression and I am attempting to match it:
public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\Me\\Desktop\\homework4file.txt"));
    String line = null;
    Pattern pattern = Pattern.compile("^[a-zA-Z0-9\\-\\.]+\\.(com)$");
    Matcher matcher;
    while((line = reader.readLine()) != null) {
        matcher = pattern.matcher(line);
        while(matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}
My regular expression is just returning "com" for each line. I don't see what's wrong with what I've written. Could someone explain the logic error in my expression?
group(1) returns whatever is in the first capture group, i.e. the first thing you have in parentheses () in your regex. The only thing you have in parentheses is (com). Therefore, group(1) returns com.

maps.google.com is a relative URL with one path component, which is maps.google.com.
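
One possible fix, as a rough sketch rather than the definitive answer: put the parentheses around the part you actually want, and let the pattern skip an optional scheme, any leading subdomains, and whatever follows the host. The class name DomainExtractor and the helper extract are made up for illustration. The pattern assumes that "X.Y" always means the last two dot-separated labels of the host, which does not hold for names such as example.co.uk. Note also that the ^ and $ anchors in the original pattern mean it can only match lines that consist of nothing but a host name ending in .com, so lines with a scheme or a path never match at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class DomainExtractor {

    // Assumption: the wanted "X.Y" part is the last two labels of the host.
    private static final Pattern LAST_TWO_LABELS = Pattern.compile(
            "^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?"      // optional scheme such as http://
            + "(?:[a-zA-Z0-9-]+\\.)*"               // any leading subdomains, not captured
            + "([a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+)"     // group 1: the last two labels
            + "(?:[/:?#].*)?$");                    // optional port, path, query or fragment

    // Returns the captured "X.Y" part, or null if the line does not match.
    static String extract(String line) {
        Matcher m = LAST_TWO_LABELS.matcher(line.trim());
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The original pattern: its only capture group is (com).
        Matcher original = Pattern.compile("^[a-zA-Z0-9\\-\\.]+\\.(com)$")
                .matcher("maps.google.com");
        if (original.find()) {
            System.out.println(original.group());   // whole match: maps.google.com
            System.out.println(original.group(1));  // first capture group: com
        }

        // The first four sample lines from the question.
        String[] samples = {
            "https://www.google.com/?gws_rd=ssl",
            "http://www.cs.jhu.edu/news-events/news-articles/",
            "maps.google.com",
            "http://www.cnn.com/WORLD/?hpt=sitenav"
        };
        for (String s : samples) {
            System.out.println(extract(s));  // google.com, jhu.edu, google.com, cnn.com
        }
    }
}
```

In the question's loop, this would amount to replacing the pattern and keeping matcher.group(1), since group 1 now wraps the "X.Y" part instead of just com.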