Creating Dynamic XML Sitemap With PHP

What is an XML Sitemap

Almost all the information you need about Sitemaps is available at http://www.sitemaps.org.

From this site:

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

A typical Sitemap (usually written with a capital S, and also known as a Google Sitemap) is an XML file with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
                <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
                <lastmod>2004-12-23T18:00:15+00:00</lastmod>
        </url>
</urlset>

Of course, the <url></url> section is repeated for each URL you want to include in the Sitemap file.

This example is a static XML file. A dynamic site (based on a database) has to create a dynamic Sitemap (generated and updated by a script, not manually).
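
As a minimal sketch of what "generated by a script" means, the function below builds the same kind of XML from a plain PHP array. The function name build_sitemap and the array keys loc and lastmod are assumptions made for illustration (PHP 7 or later is assumed for the type hints); in a real script the entries would come from a database.

```php
<?php
// Hypothetical helper: builds a Sitemap string from an array of entries.
// Each entry is ['loc' => absolute URL, 'lastmod' => optional W3C date string].
function build_sitemap(array $entries): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($entries as $entry) {
        $xml .= "\t<url>\n";
        // Entity-escape the value (for example, & becomes &amp;)
        $xml .= "\t\t<loc>" . htmlspecialchars($entry['loc'], ENT_QUOTES | ENT_XML1, 'UTF-8') . "</loc>\n";
        if (!empty($entry['lastmod'])) {
            $xml .= "\t\t<lastmod>" . htmlspecialchars($entry['lastmod'], ENT_QUOTES | ENT_XML1, 'UTF-8') . "</lastmod>\n";
        }
        $xml .= "\t</url>\n";
    }
    return $xml . "</urlset>\n";
}

// Usage: the second entry has no lastmod, so only <loc> is written for it.
echo build_sitemap([
    ['loc' => 'http://www.example.com/catalog?item=74&desc=vacation_newfoundland',
     'lastmod' => '2004-12-23T18:00:15+00:00'],
    ['loc' => 'http://www.example.com/about'],
]);
```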

Once the Sitemap is created, it must be submitted to search engines (see below). Sitemaps matter for Search Engine Optimization (SEO): creating and maintaining an accurate Sitemap helps search engines index your website pages properly. More here from Google.

Simple rules:

  • Generally, the Sitemap file is named “sitemap.xml” and is located in the website root folder.
  • If possible, use gzip to compress your Sitemaps.
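  • A Sitemap file must contain no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes) uncompressed (earlier versions of the protocol set the limit at 10MB, or 10,485,760 bytes). If you exceed these limits, split the URLs into several Sitemaps and list them in a Sitemap index file.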
  • In the <url></url> section, only the <loc></loc> (location) tag is required. It is recommended to include the <lastmod></lastmod> tag, but not required. Other optional tags are <changefreq></changefreq> and <priority></priority> . Read more here.
  • Sitemap.xml must be a UTF-8 encoded file. All values (between tags) must be entity escaped.
  • Date or datetime values inside <lastmod></lastmod> should be in W3C Datetime format, for example YYYY-MM-DDThh:mm:ss+02:00 (where T is the delimiter between date and time and +02:00 is the UTC offset). The time part is optional: a plain YYYY-MM-DD date is also valid.
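
The size and compression rules above can be checked in a few lines. This is only a sketch: the function name sitemap_within_limits is hypothetical, the byte limit is passed in as a parameter rather than hard-coded, and gzencode requires PHP's zlib extension.

```php
<?php
// Hypothetical check: does an (uncompressed) Sitemap respect the given limits?
function sitemap_within_limits(string $xml, int $maxBytes, int $maxUrls = 50000): bool
{
    // '<url>' does not match '<urlset', so this counts URL entries only
    $urlCount = substr_count($xml, '<url>');
    return strlen($xml) <= $maxBytes && $urlCount <= $maxUrls;
}

// Build a sample Sitemap with 1,000 identical entries
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n"
     . str_repeat("<url><loc>http://www.example.com/</loc></url>\n", 1000)
     . '</urlset>';

// Size limit in bytes; check the current protocol for the exact value
$maxBytes = 10485760;
var_dump(sitemap_within_limits($xml, $maxBytes));

// Compress with gzip (level 9 = smallest output), e.g. to serve as sitemap.xml.gz
$gz = gzencode($xml, 9);

printf("%d URLs, %d bytes raw, %d bytes gzipped\n",
    substr_count($xml, '<url>'), strlen($xml), strlen($gz));
```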

When you need to create a custom Sitemap

If you use the popular WordPress platform or any similar software, there are plugins available that create the XML Sitemap for you.

So, when do you need to create a custom Sitemap? When your blogging platform does not support Sitemap creation, or when your site is more than a typical blog and contains several sections. This is my case, and I will describe it in this post.

pontikis.net consists of several sections (main site, blog, labs, bbs).

So, I use partial Sitemap files, one for each site section. The main sitemap.xml file is actually a sitemapindex with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <sitemap>
                <loc>http://www.pontikis.net/sitemap-main.xml</loc>
        </sitemap>
        <sitemap>
                <loc>http://www.pontikis.net/blog/sitemap.php</loc>
        </sitemap>
        <sitemap>
                <loc>http://www.pontikis.net/labs/sitemap-labs.xml</loc>
        </sitemap>
        <sitemap>
                <loc>http://www.pontikis.net/bbs/sitemap.php</loc>
        </sitemap>
</sitemapindex>
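
A Sitemap index like the one above can also be generated by a script. The sketch below splits a long URL list into chunks of at most 50,000 and builds an index with one entry per chunk; the function name build_sitemap_index and the sitemap-N.xml naming scheme are assumptions made for illustration.

```php
<?php
// Hypothetical helper: builds a Sitemap index from a list of Sitemap URLs.
function build_sitemap_index(array $sitemapUrls): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($sitemapUrls as $url) {
        $xml .= "\t<sitemap>\n\t\t<loc>"
              . htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8')
              . "</loc>\n\t</sitemap>\n";
    }
    return $xml . "</sitemapindex>\n";
}

// Split 120,000 page URLs into partial Sitemaps of at most 50,000 URLs each.
$pageUrls = [];
for ($i = 1; $i <= 120000; $i++) {
    $pageUrls[] = "http://www.example.com/page/$i";
}
$chunks = array_chunk($pageUrls, 50000);

// One index entry per chunk (the file naming is an assumption).
$sitemapUrls = [];
foreach (array_keys($chunks) as $n) {
    $sitemapUrls[] = 'http://www.example.com/sitemap-' . ($n + 1) . '.xml';
}
echo build_sitemap_index($sitemapUrls);
```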

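The partial Sitemaps are either static XML files (sitemap-main.xml, sitemap-labs.xml) or PHP scripts that generate the XML dynamically (blog/sitemap.php, bbs/sitemap.php).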

So, let’s see how to create a dynamic Sitemap of blog posts which are stored in a database:

The code


<?php
        header('Content-type: application/xml');

        require_once '../common/settings.php'; // database settings
        require_once PROJECT_PATH . '/lib/php_adodb_v5.18/adodb.inc.php';
        require_once PROJECT_PATH . '/lib/small_blog_v0.8.0/smallblog.php'; // custom blogging engine
        require_once PROJECT_PATH . '/lib/utils/utils.php'; // utility functions: date_decode, now

// configuration
        $url_prefix = 'http://www.pontikis.net/blog/';
        $blog_timezone = 'UTC';
        $timezone_offset = '+00:00';
        $W3C_datetime_format_php = 'Y-m-d\TH:i:s'; // 24-hour "H", not 12-hour "h". See http://www.w3.org/TR/NOTE-datetime
        $null_sitemap = '<urlset><url><loc></loc></url></urlset>';

        $blog = new smallblog();  // custom blogging engine
        $res = $blog->db_connect($blog_db_settings);
        if($res === false) {
                echo $null_sitemap;
                exit; // Database connection error...
        } else {

                // get all posts meta-data
                $posts = $blog->getPosts(0, 0, '', '', '', now($blog_timezone));
                if($posts === false) {
                        echo $null_sitemap;
                        exit; // Error retrieving posts...
                }

                $len = count($posts);
                for($i = 0; $i < $len; $i++) {
                        // entity-encode the URL according to http://www.sitemaps.org/protocol.html#escaping
                        $posts[$i]['url'] = $url_prefix . htmlspecialchars($posts[$i]['url']);
                        // convert dates to W3C datetime format http://www.sitemaps.org/protocol.html#xmlTagDefinitions
                        $posts[$i]['date_updated'] = date_decode($posts[$i]['date_updated'], $blog_timezone, $W3C_datetime_format_php) . $timezone_offset;
                }

                // retrieve max date (assumes $posts is sorted with the most recent post first)
                $max_date = $posts[0]['date_updated'];
        }

        $output = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
        $output .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
        echo $output;
?>
<url>
        <loc>http://www.pontikis.net/blog/</loc>
        <lastmod><?php print $max_date ?></lastmod>
        <changefreq>daily</changefreq>
</url>
<url>
        <loc>http://www.pontikis.net/blog/archive/</loc>
        <lastmod><?php print $max_date ?></lastmod>
        <changefreq>daily</changefreq>
</url>
<?php for($i = 0; $i < $len; $i++) { ?>
<url>
        <loc><?php print $posts[$i]['url'] ?></loc>
        <lastmod><?php print $posts[$i]['date_updated'] ?></lastmod>
</url>
<?php } ?>
</urlset>

Code explanation

Inform browser that an XML file will be created

Line 2: header('Content-type: application/xml');

Connect to database and get an array of posts URLs and publish dates

Lines 16 – 28: In this example, a custom blogging engine and PHP ADOdb are used, but the same result (the $posts array) can be obtained with ordinary MySQL queries.
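
For readers without the custom blogging engine, a similar $posts array could be fetched with PDO. This is a sketch under assumptions: the table name posts, its columns url, date_updated and date_published, the function name fetch_posts and the connection settings are all hypothetical, so adapt them to your own schema.

```php
<?php
// Hypothetical query: posts published up to $nowUtc, most recently updated first.
// Table and column names are assumptions, not the author's actual schema.
function fetch_posts(PDO $pdo, string $nowUtc): array
{
    $stmt = $pdo->prepare(
        'SELECT url, date_updated FROM posts
          WHERE date_published <= :now
          ORDER BY date_updated DESC'
    );
    $stmt->execute([':now' => $nowUtc]);
    // Each row is an associative array: ['url' => ..., 'date_updated' => ...]
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Usage (the DSN and credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=blog;charset=utf8', 'user', 'secret');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// $posts = fetch_posts($pdo, gmdate('Y-m-d H:i:s'));
```

Because the rows are ordered by date_updated descending, $posts[0] holds the most recently updated post.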

Convert URL and dates according to Sitemap protocol specifications

Line 33: Convert URLs using htmlspecialchars (see specification).

Line 35: Convert dates to W3C datetime format. In this example the function date_decode is used (see this post for details), but any PHP code that produces this format will do. In my case, publish dates are stored in UTC ($timezone_offset = '+00:00';).
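
If you do not have a date_decode function, PHP's built-in date classes can produce the same format. A minimal sketch, assuming dates are stored as 'Y-m-d H:i:s' strings in a known timezone; the function name to_w3c_datetime is hypothetical.

```php
<?php
// Hypothetical stand-in for date_decode(): converts a stored datetime
// string to W3C Datetime format, including the UTC offset.
function to_w3c_datetime(string $stored, string $timezone = 'UTC'): string
{
    $dt = new DateTimeImmutable($stored, new DateTimeZone($timezone));
    // DATE_W3C is the built-in format 'Y-m-d\TH:i:sP' (24-hour H, offset P)
    return $dt->format(DATE_W3C);
}

echo to_w3c_datetime('2004-12-23 18:00:15'), "\n";                  // 2004-12-23T18:00:15+00:00
echo to_w3c_datetime('2012-07-01 09:30:00', 'Europe/Athens'), "\n"; // 2012-07-01T09:30:00+03:00
```

Letting the P format character emit the offset avoids appending a fixed string such as '+00:00' by hand, which matters once dates are not stored in UTC.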

Output XML sitemap

Lines 46 – 55: Include the blog home URL and the blog archive URL, using the date of the most recent post ($max_date) as <lastmod>.

Lines 56 – 61: Finally, iterate over the $posts array and create the rest of the Sitemap.

Sitemap validation

It is recommended to validate the Sitemap file before submitting it to search engines. Many online tools are available for this.
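
Besides online tools, a quick local sanity check is possible with PHP's DOM extension. The sketch below only checks that the document is well-formed XML with the expected root element and namespace; it is not full schema validation, and the function name looks_like_sitemap is hypothetical.

```php
<?php
// Hypothetical sanity check: well-formed XML, <urlset> or <sitemapindex> root,
// and the Sitemap protocol namespace. This is NOT full XSD validation.
function looks_like_sitemap(string $xml): bool
{
    $previous = libxml_use_internal_errors(true); // collect parser errors silently
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok || $doc->documentElement === null) {
        return false;
    }
    $root = $doc->documentElement;
    return in_array($root->localName, ['urlset', 'sitemapindex'], true)
        && $root->namespaceURI === 'http://www.sitemaps.org/schemas/sitemap/0.9';
}

$good = '<?xml version="1.0" encoding="UTF-8"?>'
      . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
      . '<url><loc>http://www.example.com/?a=1&amp;b=2</loc></url></urlset>';
$bad = str_replace('&amp;', '&', $good); // unescaped ampersand breaks the XML

var_dump(looks_like_sitemap($good)); // bool(true)
var_dump(looks_like_sitemap($bad));  // bool(false)
```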

How to submit Sitemap to search engines

Once you have created the Sitemap file, you need to inform the search engines (which support this protocol). You can do this by:

  1. using the search engine’s submission interface (known as Webmaster Tools)
  2. sending an HTTP request (see more)
  3. specifying the location in your site’s robots.txt file (strongly recommended)

    To do this, include a statement like: Sitemap: http://www.yoursite.com/sitemap.xml

How to re-submit Sitemap when content changes

In theory, you have to re-submit a Sitemap whenever site content changes, either manually or automatically, using a script to “ping” the search engine.

But once a Sitemap has been submitted, search engines will regularly come back and reload it looking for new URLs, whether you re-submit it or not. This is even more likely if the Sitemap location is specified in the robots.txt file, since search engines look at robots.txt first.

In conclusion:

  • submit the Sitemap using the search engine’s interface the first time.
  • specify the Sitemap location in robots.txt
  • optionally re-submit Sitemap at infrequent intervals.