Forgiving URI / URL normalization method in Ruby

I am trying to find a method that takes a URI / URL string entered by a user and determines its working, canonical form (or raises an error if the resource is not valid). It should also check at the same time that the URL exists. In other words, it should check both the syntax and the existence of the resource.

For example, a string like google.com should be converted to http://www.google.com, and a string like google.com/insights should be converted to http://www.google.com/insights. A string like http://thiswebsitedoesntexistatall.com should return some error or raise an exception.

I believe that part of the solution will most likely involve calling the Net::HTTP get_response() method and following redirects until I get a 200 OK status.

It seems that the URI.parse() method is not forgiving when the http scheme is omitted. I understand that I can write something simple that tries adding http in front, etc., but I was hoping there was some existing gem or little-known library function that would be really forgiving with URLs and canonicalize them for me.

Both the built-in net/http and HTTParty seem too strict for what I'm looking for. Is there a good way to do this?

1 answer

There are some problems with what you ask for:

  • A URL parser should not assume that the value passed in is HTTP when FTP and many other protocols are equally valid. If you know that the protocol is likely to be HTTP, you need to add the scheme yourself.
  • If you try to connect to the site and follow redirects until you get a 200 response, you only prove that the URL resolves to a valid page of some kind. That 200 may be returned for an error page because the page you want is a dead link or is invalid, or because the site is temporarily unavailable. To prove that the page is the one you want, you need some intimate knowledge of the page you are looking for, for example, specific content to search for.
  • Assuming the URL is good after you follow the redirects is not entirely safe. Many sites add all kinds of session data to a URL, so what starts as a simple and clean URL can end up long and confusing.
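To make the first two points concrete, here is a minimal sketch using only the standard library. The helper names with_default_scheme and resolve, the default of http, and the limit of five redirects are my own assumptions for illustration, not an existing API. The fetch: parameter exists only so the redirect logic can be tried without touching the network.

```ruby
require "uri"
require "net/http"

# Hypothetical helper: prepend a default scheme only when the input has none.
# Assumes the caller already knows the input is most likely an HTTP URL.
def with_default_scheme(input, scheme = "http")
  text = input.strip
  text = "#{scheme}://#{text}" unless text.match?(%r{\A[a-z][a-z0-9+.\-]*://}i)
  URI.parse(text) # raises URI::InvalidURIError on malformed input
end

# Hypothetical helper: follow redirects up to `limit` requests and return the
# final URI. A 2xx response only shows that *some* page was served; it does
# not prove the page is the one the user meant.
def resolve(input, limit = 5, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = with_default_scheme(input)
  limit.times do
    response = fetch.call(uri)
    case response
    when Net::HTTPSuccess
      return uri
    when Net::HTTPRedirection
      # URI#merge also handles a relative Location header.
      uri = uri.merge(response["location"])
    else
      raise "unexpected response #{response.code} for #{uri}"
    end
  end
  raise "too many redirects for #{input}"
end
```

Calling resolve("google.com/insights") would perform real requests and return whatever URI the server finally answers with a 2xx status, or raise otherwise. A host that does not resolve surfaces as an exception from Net::HTTP itself (typically a SocketError).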

I would recommend you look at the Addressable::URI gem. It is much more fully featured than Ruby's URI. It will not make decisions for you, but at least it will give you a more complete API and can rewrite / normalize URLs. Cleaning them up and / or determining whether they are good is still an exercise left to you.
