What is a sitemap?

A sitemap is an XML file that lists the URLs of a website. It tells search engines that these URLs exist and that you want them crawled and shown in search results. It also lets website owners or webmasters attach additional information to each URL: lastmod, changefreq, and priority. lastmod tells search engines when that particular resource was last modified on the server. changefreq tells search engines how frequently the resource's content is likely to change; possible values are always, hourly, daily, weekly, monthly, yearly, and never. priority tells search engines how important a URL is relative to the other URLs of the same website for crawling purposes; values range from 0.0 to 1.0, where 1.0 marks the most important URLs. A minimal XML sitemap looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> 
    <url>
        <loc>http://www.example.com/</loc> 
    </url>
    <url>
        <loc>http://www.example.com/file-url.html</loc> 
    </url>
</urlset>
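
The optional lastmod, changefreq, and priority tags described above sit next to loc inside each url element. As a rough sketch of how such a file could be generated, the Python snippet below uses only the standard library; the function name build_sitemap and the dictionary layout of each entry are assumptions made for this illustration, not part of the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
OPTIONAL_TAGS = ("lastmod", "changefreq", "priority")


def build_sitemap(entries):
    """Return sitemap XML for a list of dicts.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq' and 'priority'
    are optional (hypothetical input layout chosen for this sketch).
    """
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = entry["loc"]
        for tag in OPTIONAL_TAGS:
            if tag in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(entry[tag])
    # ElementTree escapes special characters such as & in element text.
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://www.example.com/", "lastmod": "2024-01-15",
         "changefreq": "weekly", "priority": 0.8},
        {"loc": "http://www.example.com/file-url.html"},
    ]))
```

Entries without the optional keys produce the same minimal structure as the example above.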

Other sitemap formats:

A sitemap can be submitted to search engines in three formats: XML, RSS feed, and text file. If a website or blog has an RSS or Atom feed, that feed can be submitted to search engines as a sitemap. Most open source blog software, such as WordPress, can generate an RSS feed that can serve as a sitemap. A text sitemap is the simplest form: it contains one URL of the website per line, for example:

http://www.example.com/url-1
http://www.example.com/url-2
http://www.example.com/my-file-1.html
http://www.example.com/my-file-2.html
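
Producing such a text sitemap from a list of URLs takes only a few lines. The sketch below is one possible way to do it in Python; the function name build_text_sitemap is made up for this example, and dropping blank and duplicate entries is a choice of this sketch rather than a requirement stated above.

```python
def build_text_sitemap(urls):
    """Return the contents of a text sitemap: one URL per line."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Skip blank entries and duplicates, keeping the first occurrence.
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_text_sitemap([
        "http://www.example.com/url-1",
        "http://www.example.com/url-2",
    ])
    # Write the result as UTF-8 text; the file name is only an example.
    with open("sitemap.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
```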

Sitemap limitations:

A single sitemap file may contain at most 50,000 URLs and may be at most 50 megabytes in size. A sitemap file can be compressed with gzip, though the size limit applies to the uncompressed file. A website with more than 50,000 URLs should provide a sitemap index file that lists several sitemap files, each of which stays within the limits of 50,000 URLs and 50 megabytes. Another requirement is that all data in a sitemap, including the URLs, must be entity-escaped: an ampersand (&) is written as &amp;, a single quote (') as &apos;, a double quote (") as &quot;, a less-than sign (<) as &lt;, and a greater-than sign (>) as &gt;.
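
To make these limits concrete, here is a hedged Python sketch that splits a long URL list into gzip-compressed sitemap files, escapes the five special characters, and builds a sitemap index that points at the parts. The helper names (escape_xml, render, split_sitemaps) and the file naming scheme sitemap-N.xml.gz are assumptions of this example only.

```python
import gzip
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000
MAX_BYTES_PER_SITEMAP = 50 * 1024 * 1024  # limit on the uncompressed file
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_xml(value):
    """Escape &, <, > and both quote characters."""
    return escape(value, QUOTE_ENTITIES)


def render(root_tag, item_tag, locations):
    """Render a urlset or sitemapindex document with one <loc> per item."""
    items = "".join(
        f"  <{item_tag}><loc>{escape_xml(loc)}</loc></{item_tag}>\n"
        for loc in locations
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<{root_tag} xmlns="{SITEMAP_NS}">\n{items}</{root_tag}>\n'
    )


def split_sitemaps(urls, base_url, max_urls=MAX_URLS_PER_SITEMAP):
    """Return (index_xml, files) where files maps file name to gzipped bytes."""
    files = {}
    for start in range(0, len(urls), max_urls):
        name = f"sitemap-{len(files) + 1}.xml.gz"  # hypothetical naming scheme
        data = render("urlset", "url", urls[start:start + max_urls]).encode("utf-8")
        if len(data) > MAX_BYTES_PER_SITEMAP:
            raise ValueError(f"{name} exceeds the 50 MB limit; use a smaller max_urls")
        files[name] = gzip.compress(data)
    index_xml = render("sitemapindex", "sitemap", [base_url + name for name in files])
    return index_xml, files
```

Calling split_sitemaps(urls, "http://www.example.com/") returns the index document as a string and the compressed parts as bytes, ready to be written to the web server's document root.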

Generate a sitemap for search engines (Google, Yahoo, Bing and others)

Creating a sitemap for a small website with a few static pages is very simple. But what if a website is built on a custom platform, uses no open source software, and has hundreds of dynamic resources/URLs on the server? There are many online sitemap generators that work as follows: you supply the URL of your website, and the generator crawls that page and scans it for other URLs, usually found in anchor tags such as <a href="http://www.example.com/another-page.html">Another Page</a>, collecting every URL that belongs to the main domain. The generator then visits each collected URL to look for more URLs of the same domain, down to some depth level. Finally, it compiles the collected URLs into a list and lets the site owner or webmaster download an XML sitemap containing them. So creating an XML sitemap for a custom website is not really hard today.
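
The crawling approach described above can be sketched in a few dozen lines of Python. This is only an illustration of the general idea, not how any particular online generator is implemented: the names LinkParser, fetch_html, and crawl are invented for this example, the fetcher assumes pages are UTF-8 HTML, and a real crawler would also respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page; assumes UTF-8 and replaces undecodable bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, fetch=fetch_html):
    """Breadth-first crawl that returns same-domain URLs up to max_depth links away."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # keep the URL but do not follow its links
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that cannot be downloaded
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragment parts.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return sorted(seen)
```

The fetch parameter is there so the crawler can be tried against canned pages without touching the network. The returned list can then be written out as an XML or text sitemap.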