blog@mcZen

Software, life, and leisure

Url Validation with Regular Expressions

October 16, 2007 3:26 AM

mike

URL validation with regular expressions gives us a quick and easy way to validate form input when we expect or require a valid URL. Today we are going to write a regular expression and break it down into its simplest parts so we can understand it. We will also use RFC 3986 as the reference for validating both domain names and the URI as a whole.

Validating the Moniker

The first part of any URI is the moniker, aka the "scheme:". The moniker tells the computer what kind of resource we want and how to reach it. Common monikers include http, https, ftp, mailto, file, gopher, and telnet.
Since we are going to validate just http(s), our regular expression for this is: /^https?:/gi This says that we will accept a string starting with http, where "s?" says we may have an "s" (? means 0 or 1 occurrences), followed by a colon (:).
If we wanted to allow more monikers, we could simply OR the various monikers together like: /^(https?|file|mailto):/gi
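
To see the moniker check in action, here is a minimal sketch. The helper name hasAllowedScheme is made up for illustration, and dropping the g flag is my own adjustment rather than part of the expression above.

```javascript
// Hypothetical helper: checks only the moniker (scheme) prefix of a string.
// The g flag is left off on purpose: a global regex remembers its position
// in lastIndex between test() calls, so reusing it can give alternating
// true/false results for the same input.
const SCHEME = /^(https?|file|mailto):/i;

function hasAllowedScheme(input) {
  return SCHEME.test(input);
}

console.log(hasAllowedScheme("http://example.com"));  // true
console.log(hasAllowedScheme("HTTPS://example.com")); // true
console.log(hasAllowedScheme("mailto:someone"));      // true
console.log(hasAllowedScheme("ftp://example.com"));   // false
console.log(hasAllowedScheme("xhttp://example.com")); // false
```

If you only ever call String.prototype.match with the expression, the g flag is harmless, so treat the change as a precaution for test()-style checks.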

Domain validation

Domain validation can be tricky. We can have anything from an IP address to a machine name.
IPv4 Address validation
First we will make a regular expression that validates only IP addresses in the form XXX.XXX.XXX.XXX, where XXX can be any number from 0 to 255. To validate the number range 0-255, we use:
/([01]?\d{1,2}|2[0-4]\d|25[0-5])/
To match the pattern XXX.XXX.XXX.XXX, we use:
/(pattern\.){3}(pattern)/
Putting these two together, we get:
/(([01]?\d{1,2}|2[0-4]\d|25[0-5])\.){3}([01]?\d{1,2}|2[0-4]\d|25[0-5])/gi
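
As a quick check of the IPv4 expression, the sketch below builds it from the octet pattern. The names OCTET, IPV4, and isIPv4 are illustrative, and the ^ and $ anchors are an assumption on my part so that the whole string has to be an address.

```javascript
// Octet pattern from the text: 0-199 (leading zeros allowed), 200-249, 250-255.
const OCTET = "([01]?\\d{1,2}|2[0-4]\\d|25[0-5])";

// Three "octet plus dot" groups followed by a final octet.
// The ^ and $ anchors are added here; without them a substring match
// (for example the "56.1.1.1" inside "256.1.1.1") would be enough.
const IPV4 = new RegExp("^(" + OCTET + "\\.){3}" + OCTET + "$");

function isIPv4(input) {
  return IPV4.test(input);
}

console.log(isIPv4("192.168.0.1"));     // true
console.log(isIPv4("255.255.255.255")); // true
console.log(isIPv4("01.002.3.4"));      // true (leading zeros are accepted)
console.log(isIPv4("256.1.1.1"));       // false
console.log(isIPv4("1.2.3"));           // false
console.log(isIPv4("1.2.3.4.5"));       // false
```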
Domain Name Validation
Referring to RFC 3986, domain names can take many different forms. When it comes to validating these, we will try to account for as many possibilities as is practical. Let's start with valid alphanumeric characters in a dot-separated sequence:
([a-zA-Z0-9]{1,255}\.?)+
This gives names in the format xxxx.xx.xxxxxxx.xxx. Now complicate the matter by allowing dashes:
([a-zA-Z0-9](-[a-zA-Z0-9]+)?\.?)+
Complicate it further by allowing escaped characters in the form % HEXDIGIT HEXDIGIT (like %2d):
(([a-zA-Z0-9]|%[a-fA-F0-9]{2})(-([a-zA-Z0-9]|%[a-fA-F0-9]{2})+)?\.?)+
This will allow us to accept valid domain names as long as any non-regular characters are encoded (watch out for phishers!). Also note that this validator allows a dot at the end. This is valid syntax.
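
Here is a small sketch that exercises the last domain name expression. UNIT, HOSTNAME, and isHostname are names I made up, and the ^ and $ anchors are added so the entire string is checked.

```javascript
// One "unit" is a plain alphanumeric character or a %XX escape.
const UNIT = "([a-zA-Z0-9]|%[a-fA-F0-9]{2})";

// A unit, optionally followed by a dash and more units, optionally followed
// by a dot, repeated. Anchors are added so the whole string must match.
const HOSTNAME = new RegExp("^(" + UNIT + "(-" + UNIT + "+)?\\.?)+$");

function isHostname(input) {
  return HOSTNAME.test(input);
}

console.log(isHostname("example.com"));           // true
console.log(isHostname("my-site.example.com."));  // true (trailing dot allowed)
console.log(isHostname("caf%C3%A9.example"));     // true (escaped characters)
console.log(isHostname("-bad.example"));          // false (leading dash)
console.log(isHostname("bad-.example"));          // false (trailing dash)
console.log(isHostname("xn--bcher-kva.example")); // false (double dash)
```

The last case is worth knowing about: because a dash must be followed by a non-dash character, names containing "--" (such as punycode-encoded names) are rejected by this expression.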

If you need URLs with a valid TLD (e.g., com, net, org, gov, etc.), then you'll have to change the above expression a bit. The dot after each label becomes required, so each label also has to be allowed to run longer than one character:
(([a-zA-Z0-9]|%[a-fA-F0-9]{2})+(-([a-zA-Z0-9]|%[a-fA-F0-9]{2})+)*\.)+([a-zA-Z]{2,7}\.?)
The end of this expression allows for many of the TLDs; however, it is not an exact science. It is somewhat impractical to keep a list of all TLDs, but if you only allow a known list, you can OR the values together like:
/(com|net|org|gov|info|ws|us|cn|eu|mil)/gi
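
The two TLD approaches can be compared side by side with the sketch below. The helper names are hypothetical, and the label part is written as "one or more units, then any number of dash-plus-units groups", which is my own adjustment so that multi-character labels such as "example" match once the dot is required. The named group (?<tld>...) needs a reasonably modern JavaScript engine.

```javascript
const UNIT = "([a-zA-Z0-9]|%[a-fA-F0-9]{2})";
// Assumption: a label is one or more units, with optional "-units" groups.
const LABEL = UNIT + "+(-" + UNIT + "+)*";

// One or more "label." groups followed by a 2-7 letter TLD and optional dot.
const WITH_TLD = new RegExp(
  "^(" + LABEL + "\\.)+(?<tld>[a-zA-Z]{2,7})\\.?$"
);

// The short allow-list from the text, anchored so it must match the whole TLD.
const KNOWN_TLDS = /^(com|net|org|gov|info|ws|us|cn|eu|mil)$/i;

function hasPlausibleTld(host) {
  return WITH_TLD.test(host);
}

function hasKnownTld(host) {
  const m = WITH_TLD.exec(host);
  return m !== null && KNOWN_TLDS.test(m.groups.tld);
}

console.log(hasPlausibleTld("www.example.co.uk"));  // true
console.log(hasPlausibleTld("localhost"));          // false (no TLD)
console.log(hasPlausibleTld("example.technology")); // false (longer than 7)
console.log(hasKnownTld("EXAMPLE.COM."));           // true
console.log(hasKnownTld("example.io"));             // false (not in the list)
```

The "technology" case shows the trade-off mentioned above: a length-based guess is easy to write but will reject any TLD longer than seven letters.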

Where are we at?

General URLs are in the format http://[username:password@]domain:port. I don't suggest allowing the [username:password@] portion, but you may need it, so here is the format: /[a-zA-Z0-9._ -]+(:.*)?@/ Note that the dash goes last in the character class so it is read as a literal dash rather than as a range. You can adjust the allowable username characters if need be. This is simply for reference.

Now let's put the whole thing together and add the port validation. (It's getting longer!)
/^https?:\/\/((([01]?\d{1,2}|2[0-4]\d|25[0-5])\.){3}([01]?\d{1,2}|2[0-4]\d|25[0-5])|(([a-zA-Z0-9]|%[a-fA-F0-9]{2})(-([a-zA-Z0-9]|%[a-fA-F0-9]{2})+)?\.?)+)(:\d{1,5})?/g
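
To make the combined expression easier to follow, this sketch assembles it from the earlier pieces. All names are illustrative. Two things here are my own additions: the $ anchor, and a numeric check on the port, since \d{1,5} on its own accepts values up to 99999 while valid ports stop at 65535.

```javascript
const OCTET = "([01]?\\d{1,2}|2[0-4]\\d|25[0-5])";
const IPV4 = "(" + OCTET + "\\.){3}" + OCTET;
const UNIT = "([a-zA-Z0-9]|%[a-fA-F0-9]{2})";
const NAME = "(" + UNIT + "(-" + UNIT + "+)?\\.?)+";

// scheme, then an IPv4 address or a domain name, then an optional port.
// The port is captured in a named group so it can be range-checked below.
const ORIGIN = new RegExp(
  "^https?:\\/\\/(" + IPV4 + "|" + NAME + ")(:(?<port>\\d{1,5}))?$",
  "i"
);

function isHttpOrigin(input) {
  const m = ORIGIN.exec(input);
  if (m === null) return false;
  const port = m.groups.port; // undefined when no port was given
  return port === undefined || Number(port) <= 65535;
}

console.log(isHttpOrigin("http://example.com"));        // true
console.log(isHttpOrigin("https://192.168.0.1:8080"));  // true
console.log(isHttpOrigin("http://example.com:99999"));  // false (port too large)
console.log(isHttpOrigin("ftp://example.com"));         // false (wrong moniker)
console.log(isHttpOrigin("http://999.999.999.999"));    // true (see note)
```

The last result is a quirk of combining the two host patterns: a string that fails the IPv4 check can still pass as a domain name, because all-digit labels satisfy the domain name expression.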

Path Validation

Paths are relatively easy to validate, as we allow just about anything as long as most non-regular characters are encoded. Digits are included in the allowed set, and the dash goes last in the character class so it is read as a literal dash:
/((\/|\?)([a-zA-Z0-9.~_+=&,# -]|%[a-fA-F0-9]{2})*)*/
So the final regular expression is:
/^https?:\/\/((([01]?\d{1,2}|2[0-4]\d|25[0-5])\.){3}([01]?\d{1,2}|2[0-4]\d|25[0-5])|(([a-zA-Z0-9]|%[a-fA-F0-9]{2})(-([a-zA-Z0-9]|%[a-fA-F0-9]{2})+)?\.?)+)(:\d{1,5})?((\/|\?)([a-zA-Z0-9.~_+=&,# -]|%[a-fA-F0-9]{2})*)*$/g
If you have any improvements or comments, please leave them!
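
Finally, a minimal sketch of how the finished expression might be wrapped for form validation. The function name isValidHttpUrl is hypothetical. The literal below assumes the path character class includes digits and ends with the dash, and it uses the i flag without g so that repeated test() calls stay independent.

```javascript
// Full expression: moniker, IPv4 address or domain name, optional port,
// then any number of "/" or "?" segments made of allowed or escaped characters.
const HTTP_URL = /^https?:\/\/((([01]?\d{1,2}|2[0-4]\d|25[0-5])\.){3}([01]?\d{1,2}|2[0-4]\d|25[0-5])|(([a-zA-Z0-9]|%[a-fA-F0-9]{2})(-([a-zA-Z0-9]|%[a-fA-F0-9]{2})+)?\.?)+)(:\d{1,5})?((\/|\?)([a-zA-Z0-9.~_+=&,# -]|%[a-fA-F0-9]{2})*)*$/i;

function isValidHttpUrl(input) {
  return HTTP_URL.test(input);
}

const samples = [
  "http://example.com/",                           // true
  "http://example.com/a/b.html?x=1&y=2",           // true
  "https://example.com:8080/search?q=regex%20url", // true
  "http://10.0.0.1/status",                        // true
  "http://example.com/<script>",                   // false (unencoded < and >)
  "http://",                                       // false (no host)
  "example.com",                                   // false (no moniker)
];

for (const url of samples) {
  console.log(url, isValidHttpUrl(url));
}
```

Keep in mind that this only checks the shape of the string. It says nothing about whether the host exists or whether the URL is safe to follow.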
