How to Maintain Your Blog’s URL Consistency

The problem

A URL (also known as a web address) is short for Uniform Resource Locator. It is a string that references a web resource.

The following URLs may all lead to the same web page:

http://www.example.com

http://example.com

http://www.example.com/index.php

http://www.example.com/index.php?goback=somewhere

This is bad practice. Each reference to the same resource (web page) should use an identical URL.

If more than one URL leads to the same page, you may:

  • Create duplicate entries in the index of Google (or other search engines), which increases the chance of split page rank
  • Lose Facebook likes, Tweets and similar social media rankings
  • Lose Disqus (or similar services) comments

This situation may affect the functionality of any service which uses the URL to identify a web page (resource). That’s why URL Consistency is so important.
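
To make the effect concrete, here is a minimal Python sketch. It assumes a hypothetical service that counts likes per exact URL string (the `likes` counter is an illustration, not any real service's API) and feeds it the four URLs listed earlier:

```python
from collections import Counter

# Four spellings of the same page, taken from the list above.
shared_urls = [
    "http://www.example.com",
    "http://example.com",
    "http://www.example.com/index.php",
    "http://www.example.com/index.php?goback=somewhere",
]

# Hypothetical service: one counter entry per exact URL string.
likes = Counter()
for url in shared_urls:
    likes[url] += 1

print(len(likes))           # 4 separate entries for one page
print(max(likes.values()))  # 1: no entry collected more than a single like
```

Four visitors liked the same page, yet no single URL shows more than one like.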

The solution

It is a complex problem and possible solutions vary case by case. Here are some available solutions:

  • Use a single Canonical HostName
  • Strip unwanted query strings from incoming URLs
  • Use simple Feedburner URLs

Use a single Canonical HostName

Most websites respond to requests whether or not the hostname contains www. That is fine. However, it is recommended to redirect the www hostname to the non-www one, or the opposite. Which one should you select? There are arguments for each choice. See http://no-www.org and http://www.yes-www.org.

  • redirect non-WWW to WWW: google, bing, baidu, qq, amazon, alexa, youtube, wikipedia, blogger, reddit, mozilla, facebook, linkedin, stumbleupon, microsoft, apple, tumblr, paypal, bbc
  • redirect WWW to non-WWW: twitter, wordpress, vimeo, github, jquery, sourceforge, pinterest, instagram, delicious

You can choose whichever you prefer, but you have to stick with it permanently.

I prefer the non-WWW to WWW redirection. Here is how non-WWW is redirected to WWW using Apache configuration files (in Debian):

In addition to the main configuration file, which looks like:

<VirtualHost 95.211.47.207:80>
        ServerName  www.pontikis.net
        DocumentRoot /var/www/pontikis.net
</VirtualHost>

another configuration file is created:

nano /etc/apache2/sites-available/pontikis.net

with the following content:

<VirtualHost 95.211.47.207:80>
        ServerName  pontikis.net
        Redirect permanent / http://www.pontikis.net/
</VirtualHost>

If you don’t want to change the Apache configuration directly, you may use mod_rewrite. In order to redirect non-WWW to WWW, create an .htaccess file in the document root with the following content:

RewriteEngine On
# Redirect every non-empty host other than www.example.com to the www host
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^/?(.*) http://www.example.com/$1 [L,R=301,NE]
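
The decision these rules make can be sketched in a few lines of Python. This is only an illustration of the logic, not a replacement for the server configuration; `CANONICAL_HOST` and `canonical_redirect` are names made up for this example:

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumption: the hostname you chose to keep


def canonical_redirect(url):
    """Return the URL to redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # lowercased, without the port
    # Roughly mirrors the two RewriteCond lines: leave empty hosts and
    # the canonical host alone.
    if host == "" or host == CANONICAL_HOST:
        return None
    # Keep path and query string, swap only the hostname.
    return urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
    )


print(canonical_redirect("http://example.com/about?id=1"))
print(canonical_redirect("http://www.example.com/about"))
```

The first call returns http://www.example.com/about?id=1, while the second returns None because the host is already the canonical one.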

Strip unwanted query strings from incoming URLs

A query string is the part of a URL that contains data to be passed to web applications such as CGI programs. For example:

http://www.example.com/?id=1&category=sales

The part of the URL after the question mark (“?”) is the Query String (id=1&category=sales).

Some websites (among them great services such as Linkedin and Feedburner) append query strings to incoming URLs of your website for tracking purposes (like “?goback=” etc). In most cases these query strings are considered “unwanted” and can be stripped.

Here is a solution for the Apache web server. Similar solutions are available for the Microsoft IIS and NGINX web servers.

I use an .htaccess file in the document root with the following content:

RewriteEngine On
RewriteCond %{QUERY_STRING} !=""
RewriteCond %{REQUEST_URI} !^/search.*
RewriteCond %{REQUEST_URI} !^/wiki.*
RewriteCond %{REQUEST_URI} !^/bbs.*
RewriteCond %{REQUEST_URI} !^/admin.*
RewriteRule ^(.*)$ /$1? [R=301,L]

  • Line 2: If a query string exists
  • Lines 3-6: Exclude the directories search, wiki, bbs, admin
  • Line 7: Remove the query string

Using RewriteCond you can exclude any QUERY_STRING or REQUEST_URI, according to your needs. Of course, you should never strip query strings that your own website relies on.
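
For readers who prefer to see the logic outside of mod_rewrite syntax, here is a hedged Python sketch of the same decision. `EXCLUDED_PREFIXES` and `strip_query` are illustrative names, and the prefixes are simply the ones used in the rules above:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the same excluded path prefixes as in the .htaccess rules above.
EXCLUDED_PREFIXES = ("/search", "/wiki", "/bbs", "/admin")


def strip_query(url, excluded=EXCLUDED_PREFIXES):
    """Return the URL without its query string, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.query == "":
        return None  # no query string, nothing to strip
    if parts.path.startswith(excluded):
        return None  # this part of the site needs its query strings
    # Drop the query string (and fragment), keep everything else.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_query("http://www.example.com/index.php?goback=somewhere"))
print(strip_query("http://www.example.com/search?q=apache"))
```

The first call returns http://www.example.com/index.php; the second returns None, because /search is excluded and keeps its query string.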

WARNING: There are no universally valid solutions. You should carefully read the Apache mod_rewrite documentation and create an .htaccess file that suits your own environment.

For example, WordPress users might need the following lines:

RewriteCond %{QUERY_STRING} !^p=.*
RewriteCond %{REQUEST_URI} !^/wp-admin.*

  • Line 1: Allow post permalinks
  • Line 2: Exclude the admin directory

Use simple Feedburner URLs

If you use Feedburner to track detailed statistics for your feed, the URLs pointing to your website contain query strings like utm_source and utm_medium. A FeedBurner URL looks like http://feedproxy.google.com/~r/YourFeedName/…
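
If such tracking parameters do reach your site, an alternative to stripping the whole query string is to remove only the tracking parameters and keep the rest. The following Python sketch illustrates the idea; the parameter lists and the `drop_tracking_params` name are assumptions for this example, so adjust them to your own site:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters treated as tracking noise on this site.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"goback"}


def drop_tracking_params(url):
    """Return the URL with tracking parameters removed, other parameters kept."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(drop_tracking_params(
    "http://www.example.com/?id=1&utm_source=feedburner&utm_medium=feed"
))
```

Here the id parameter survives and the result is http://www.example.com/?id=1.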

To make the Feedburner URLs exactly match your site URLs, navigate to the Analyze tab, click on Configure Stats and deselect the checkbox for Item link clicks.

Share your experience with other web servers (e.g. Microsoft IIS). Do you prefer WWW or non-WWW? Leave us a comment.