
I need to write a tool in C# that reports broken URLs. A URL should only be reported as broken if the user would see a 404 error in the browser. I believe there might be tricks needed to handle web servers that do URL rewriting. Here's what I have. As you can see, some URLs validate incorrectly.

string url = "";

// TEST CASES
//url = "http://newsroom.lds.org/ldsnewsroom/eng/news-releases-stories/local-churches-teach-how-to-plan-for-disasters";   //Prints "BROKEN", although this is getting re-written to good url below.
//url = "http://beta-newsroom.lds.org/article/local-churches-teach-how-to-plan-for-disasters";  // Prints "GOOD"
//url = "http://";     //Prints "BROKEN"
//url = "google.com";     //Prints "BROKEN", although this should be good.
//url = "www.google.com";     //Prints "BROKEN", although this should be good.
//url = "http://www.google.com";     //Prints "GOOD"

try
{

    if (url != "")
    {
        WebRequest Irequest = WebRequest.Create(url);
        WebResponse Iresponse = Irequest.GetResponse();
        if (Iresponse != null)
        {
            _txbl.Text = "GOOD";
        }
    }
}
catch (Exception ex)
{
    _txbl.Text = "BROKEN";
}
  • No tricks needed for rewriting. Rewriting is a server-side technique to override another server-side technique. Outside of the server's black box, there's no such thing as rewriting. Commented Sep 29, 2010 at 22:48
  • It should be noted that many websites nowadays return 404 pages with a 200 OK status code. While this is blatantly incorrect, it is a matter of fact and something that should be taken into consideration when writing your application. Commented Sep 30, 2010 at 0:47
  • @Jared. Do they really? By default, Apache, IIS, and most other web servers will do the right thing and return a 404. It was something I used to see a bit in the past when people implemented custom 404 pages buggily, but it seems to be much rarer these days. There are still a lot of buggy sites that redirect to a 404 page instead of just serving a 404 ("yep, I found it over here, success... oh, not found" when it should say "no, not found"), but that's much easier to catch. Commented Sep 30, 2010 at 2:25
  • @Jon Hanna, I think you make a good point. Developers have gotten better about using the correct status codes and the problem has likely lessened with time, however there is still an interesting problem of automatically deciding if a resource cannot be located when you cannot trust the status code. Commented Sep 30, 2010 at 3:57
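The "soft 404" problem the comments describe (an error page served with a 200 OK status) can only be caught heuristically, by sniffing the response body. A minimal sketch; the marker strings and the helper name are illustrative assumptions, not a standard list:

```csharp
using System;

static class Soft404Heuristic
{
    // Heuristic for "soft 404s": a 200 OK response whose body reads like an
    // error page. The marker strings below are illustrative, not exhaustive,
    // and will produce false positives on pages that merely discuss 404s.
    public static bool LooksLikeSoft404(string body)
    {
        string lower = body.ToLowerInvariant();
        return lower.Contains("page not found")
            || lower.Contains("404 error")
            || lower.Contains("could not be found");
    }
}
```

A check like this would run only after the status code comes back 200, as a second line of defense against servers that cannot be trusted to report errors correctly.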

3 Answers


For one, Irequest and Iresponse shouldn't be named like that. They should just be webRequest and webResponse, or even just request and response. The capital "I" prefix is generally only used for interface naming, not for instance variables.

To do your URL validity checking, use UriBuilder to get a Uri. Then you should use HttpWebRequest and HttpWebResponse so that you can check the strongly typed status code of the response. Note that GetResponse throws a WebException for error status codes such as 404, so the broken-link check belongs in the catch block. Finally, you should be a bit more informative about what was broken.

See the MSDN documentation for the additional .NET types introduced here: UriBuilder, HttpWebRequest, and HttpWebResponse.

Sample:

try
{
    if (!string.IsNullOrEmpty(url))
    {
        UriBuilder uriBuilder = new UriBuilder(url);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uriBuilder.Uri);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        if (response.StatusCode == HttpStatusCode.OK)
        {
            _txbl.Text = "URL appears to be good.";
        }
        else // There are a lot of other status codes you could check for...
        {
            _txbl.Text = string.Format("URL might be ok. Status: {0}.",
                                       response.StatusCode);
        }
    }
}
catch (WebException we)
{
    // GetResponse throws for error status codes such as 404,
    // so the status has to be read off the exception's response.
    HttpWebResponse errorResponse = we.Response as HttpWebResponse;
    if (errorResponse != null && errorResponse.StatusCode == HttpStatusCode.NotFound)
    {
        _txbl.Text = "Broken - 404 Not Found";
    }
    else
    {
        _txbl.Text = string.Format("Broken - {0}", we.Message);
    }
}
catch (Exception ex)
{
    _txbl.Text = string.Format("Broken - Other error: {0}", ex.Message);
}

1 Comment

This code is good, however I'd use the UriBuilder (msdn.microsoft.com/en-us/library/system.uribuilder.aspx) to create the Uri instead. That removes the string manipulation around the scheme (ie "http://") as it will accept with or without & then you can check the "Scheme" property, setting it as appropriate.
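As the comment notes, UriBuilder accepts input with or without a scheme and exposes it through the Scheme property. A quick illustration of that behavior:

```csharp
using System;

class UriBuilderDemo
{
    static void Main()
    {
        // UriBuilder defaults the scheme to http when none is supplied,
        // so a bare host like "google.com" becomes a usable absolute URI.
        UriBuilder bare = new UriBuilder("google.com");
        Console.WriteLine(bare.Scheme);  // http
        Console.WriteLine(bare.Uri);     // http://google.com/

        // An explicit scheme is preserved and can be inspected.
        UriBuilder explicitScheme = new UriBuilder("https://www.google.com");
        Console.WriteLine(explicitScheme.Scheme);  // https
    }
}
```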

Prepend http:// or https:// to the URL and pass it to the WebClient.OpenRead method. It will throw a WebException if the URL is malformed or the request fails.

private WebClient webClient = new WebClient();

try
{
    Stream strm = webClient.OpenRead(URL);
}
catch (WebException)
{
    throw; // rethrow without resetting the stack trace
}



The problem is that most of those 'should be good' cases are actually dealt with at the browser level, I believe. If you omit the 'http://', it's an invalid request, but the browser puts it in for you.

So maybe you could do a similar check that the browser would do:

  • Ensure there is an 'http://' at the beginning
  • Ensure there is a 'www.' at the beginning
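The first of those checks can be sketched as a small helper (the method name is mine, for illustration; prepending 'www.' is left out, since not every host uses that subdomain):

```csharp
using System;

static class UrlNormalizer
{
    // Mimics the browser behavior described above: if no scheme is
    // present, assume http://. Existing http/https prefixes are kept
    // as-is, compared case-insensitively.
    public static string Normalize(string url)
    {
        if (!url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) &&
            !url.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
        {
            return "http://" + url;
        }
        return url;
    }
}
```

Running the question's inputs through a normalizer like this before WebRequest.Create would turn the "google.com" and "www.google.com" cases into requests the framework can actually make.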

2 Comments

Not really. There are plenty of reasons to use URLs outside of a browser (like this question does, for example), and browsers are really the only forgiving pieces of software that will kindly allow you to omit the protocol prefix. Furthermore, the subdomain (if any at all) doesn't need to start with a 'www.' to be valid.
I know, but I'm basing that on his assumptions of what 'should be good' means, because obviously they aren't the same as the request/response objects'. By the looks of things, omitting the 'www.' would return 'good' so long as it has the 'http://', which makes sense, as you need to tell the request object which protocol you plan on using.
