Dec 9 2009

When you write something that connects to a web server, what user agent do you use?

Far too often have I seen things like:

curl_setopt($c, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

In my opinion you should always use a descriptive user agent that tells the server what your client is and what it is trying to do, or at the very least a unique identifier. Unless you’re building some kind of web scraping client (which probably contravenes a terms of service agreement somewhere, so I certainly don’t advocate that!), there is no reason not to provide a useful and descriptive UA string.

A good UA string from a little-known client should provide some way of contacting you. When I say little-known, I mean something like your new web app that you’ve just made that queries Last.fm for user data. In this instance, I’d give a nice descriptive UA string with contact e-mail, e.g.:

curl_setopt($c, CURLOPT_USERAGENT, "MyLastFmClient (v0.1) myemail@address.com");

As your client becomes more widely used, or if your website already offers a decent way of contacting you, perhaps just put a URL:

curl_setopt($c, CURLOPT_USERAGENT, "MyLastFmClient (v1.2) www.address.com");
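If you set the UA string in more than one place, it may be worth building it once and reusing it. Below is a minimal sketch, assuming a hypothetical helper called buildUserAgent() (my own name, not part of cURL or PHP) and a placeholder URL; it produces the same shape of string as the examples above.

```php
<?php
// Hypothetical helper (not part of PHP or cURL): builds a UA string of the
// form "Name (vVersion) contact", matching the examples above.
function buildUserAgent(string $name, string $version, string $contact): string
{
    // Strip control characters (including CR/LF) so the value cannot
    // break out of the User-Agent header line.
    $clean = function (string $s): string {
        return trim(preg_replace('/[\x00-\x1F\x7F]/', '', $s));
    };

    return sprintf('%s (v%s) %s', $clean($name), $clean($version), $clean($contact));
}

$ua = buildUserAgent('MyLastFmClient', '0.1', 'myemail@address.com');
echo $ua, "\n"; // prints: MyLastFmClient (v0.1) myemail@address.com

// Only touch cURL if the extension is actually loaded.
if (function_exists('curl_init')) {
    $c = curl_init('http://www.example.com/'); // placeholder URL, an assumption
    curl_setopt($c, CURLOPT_USERAGENT, $ua);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    // $response = curl_exec($c); // uncomment to actually send the request
}
```

Swapping the e-mail address for a URL later on is then a one-line change wherever the helper is called.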

Of course, when you’re Google, everyone knows who you are. Searching for the UA string “Mediapartners-Google”, for example, yields 200k-odd results, revealing that this is the AdSense content bot.

Why do I think this is important? In most instances it helps servers identify you and help you. If your client goes wrong and gets itself stuck in a loop because you forgot to increment $i, say, the server’s administrators can see that MyLastFmClient is spamming them with 1,000+ requests a minute. They can then read your UA string and contact you about it.

Another reason is that some servers might actually block access, or provide different content, depending on the client. I know that Google serves up a completely different search results page if you’re on IE5 than if you’re on IE8. Another server might block all known browsers from accessing a web service (e.g. with the message “this page cannot be accessed using a web browser”). I’m not saying this is good or bad practice, as that is a WHOLE other kettle of fish, but it can happen, and if your client is sending a borrowed browser UA string that sort of thing can be pretty hard to track down.
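To make that concrete, here is a rough sketch of how a server might decide that a request comes from a browser. The function name looksLikeBrowser() and the "starts with Mozilla/" rule are purely my assumptions for illustration; real servers use all sorts of rules that you generally can't see from the outside.

```php
<?php
// Hypothetical server-side check (an assumption, not any real server's rule):
// treat any UA string that begins with "Mozilla/" as a web browser.
function looksLikeBrowser(string $userAgent): bool
{
    return strncmp($userAgent, 'Mozilla/', 8) === 0;
}

// The borrowed browser string from the top of this post is caught...
var_dump(looksLikeBrowser('Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)')); // bool(true)

// ...while an honest, descriptive string is not.
var_dump(looksLikeBrowser('MyLastFmClient (v0.1) myemail@address.com')); // bool(false)
```

With a check like this in place, a client pretending to be IE would be turned away for no obvious reason, whereas one that identifies itself honestly would get through.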

Although this all might seem pretty trivial, it is useful, and I think any HTTP(S) client should identify itself properly using a clear and descriptive user agent string. It’s no harder to do and it just makes everyone’s lives easier!
