Understanding XML for RSS
Learn the rules of XML so that you can write a syndicated feed by hand.
What is XML?
While there are a great many syndicated feeds that are automatically generated — and depending on the purpose of your feed you might want to look into that — there’s nothing in a feed that can’t be written by hand.
But first you need to understand the basic rules of XML. It’ll be okay. In terms of writing it, it’s not a radical departure from HTML. It’s just another Markup Language.
RSS and Atom syndicated feeds are written using XML. XML is an open, non-proprietary, data-driven file structure standard designed to make it easier to publish different types of electronic media easier. The design goals were to emphasize simplicity, generality, and usability across the Internet while remaining both human and machine readable. As a simplified subset of SGML, XML describes a group of technologies and specifications: XML, XSLT, XPath, XML Schema, DTDs, and more. With XML, data can be exchanged without any prior communication as long as both parties use a common vocabulary, through using Schemas and/or Document Type Definitions.
Much like HTML — both of which are markup languages based on SGML — XML uses the angle bracket markup syntax for its elements. That’s about where the similarities end.
While in web development we like to say that HTML defines the structure of a web page, HTML has styles built into it that changes the presentation of those elements. Headings (h1, h2, etc.) come pre-sized. Lists come with markers. Tables present columns and rows. XML, however, is pure structure. It has no predefined presentation. When you write an XML file you even get to choose what to call your elements. You can share those definitions (DTDs) with others, and then everyone can use that as a point of reference for a common language.
So, in those terms, RSS is an XML application with elements that have been agreed on for the purpose of creating syndicated feeds to share web content.
Basic Rules of XML
XML on its own is a vast topic. We’re restricting ourselves to just how to use XML and its associated technologies needed to create an RSS feed. I’m only going to cover enough XML so that you can use it to write your syndicated feed. I won’t be talking in depth about XML Schemas or DTDs. There are plenty of other resources on this thing called The World Wide Web if you’re interested in learning extensive, core, foundational XML knowledge.
The two requirements for XML documents are that they must be well-formed and valid.
A well-formed document meets the following criteria:
- XML elements are case sensitive.
PubDate
≠pubDate
- XML elements must be properly nested.
- All XML elements must have a closing tag or be self-closed.
- Attributes in XML elements must have a value.
hidden="true"
never justhidden
- Attributes values in XML must always use quotes.
hidden="true"
neverhidden=true
A valid document is one that adheres to the rules provided in a DTD. For RSS 2.0 this means that we’re using only those elements defined in the RSS 2.0 specification, and that if we use elements from another standard (definition or module), that they’re properly namespaced.
Unlike HTML, if your document is malformed in any way and it isn’t valid then it simply won’t work. You’ll just get a blank page.
File Extensions for Syndicated Feeds
You can use either .rss
or .xml
for your feed, though they can be handled differently by the browser. The .xml
extension seems to be the more popular choice. In many cases browsers will recognize and display a document tree when presented with an XML file.
As already mentioned, the elements you use in an XML file don’t have any inherent styles applied to them. So this, at best, is the view most people will get from a plain XML file in a modern browser.
Escaping Characters in XML
Because we’re using XML, there are certain characters that need to be escaped in order to display properly. Using XML entities is one way to do that. They’re similar to HTML entities. The XML escape characters are:
Name | Character | Code |
---|---|---|
Ampersand | & | & |
Single quote | ' | ' |
Greater than | > | > |
Less than | < | < |
Double quote | " | " |
Any other characters, unfortunately, require a numbered entity, either as decimal code or hexadecimal code.
Name | Character | Code | Hex Code |
---|---|---|---|
Copyright | © | © |
© |
Trademark | ™ | ™ |
™ |
Right Arrow | → | → |
→ |
Hearts | ♥ | ♥ |
♥ |
Not Equals | ≠ | ≠ |
≠ |
The TopTal HTML5 Character Code Reference is the most complete list of character entity codes that I’ve found (for HTML). The codes should work for XML as well, but I won’t swear to it.
CDATA Section
Rather than escaping all of those characters, which can be tedious, you can use the CDATA Section to wrap the entire content of the element: <![CDATA[
and ]]>
.
XML Comments
XML comments look and act just like comments in HTML.
<!-- Just like HTML! -->
Since you’ll be writing this by hand and XML is more strict it could be very helpful to have comments in your feed for gentle reminders to yourself. How else will you remember the RFC-822 date-time format?
XML Namespaces
XML is meant to describe data, and namespaces are important because it’s possible to describe information in different ways. Namespaces also let you shared data structures that have already been defined by other organizations. The namespaces help give elements with similar names context for the data that they try to describe.
For example, suppose you were trying to write an XML file that described the contents of your bookshelf. A book list might have a name element: <name>The Hitchhiker’s Guide to the Galaxy</name>
. A list of publishers might also have a name element: <name>Megadodo Publications</name>
. And an author list would most likely have a name element: <name>Douglas Adams</name>
.
This would all be perfectly fine as long as they were all different XML files. If you tried to put the data all into your bookshelf XML file, you’d most likely have a conflict.
The solution to that problem would be namespaces.
Likewise, you couldn’t just use an element that didn’t exist in the current definition. For example, if you wanted to use a penname
element. You could include it through the author namespace, if it existed.
This was an abstract example to discuss the general concept of namespaces. The rules around namespaces are pretty clear for syndicated feeds. You’ll see namespaces applied when working with Atom, Dublin Core, iTunes, Podcasts, and other content in a syndicated feed. The hard work of defining the namespaces to use has already been done.
When you see a namespace in the code, look at the URL:
xmlns:atom="http://www.w3.org/2005/Atom"
That URL exists to give the namespace a unique identity in your document, but some organizations use the URL to provide documentation about the specification and how to use it. Technically it’s not required, and due to the fact that some of these URLs are over 20 years old, some have been subject to linkrot.
- http://www.w3.org/2005/Atom
- http://purl.org/dc/elements/1.1/
- http://purl.org/rss/1.0/modules/content/
In any case, once you know the name and purpose of a namespace it becomes much easier to find more information and tutorials. Not that it makes it any easier to understand some specs, but it’s self-documenting, if it’s done right.
Feed Validation
That might seem like a lot, and XML can be much more finicky than HTML, but you can use feed validators that will make sure your XML for your feed is valid.
Publishing an RSS feed without validating it is like publishing a web page without checking it in a browser. You’ll save yourself some heartache and time if you validate your feed often:
- as you develop it,
- when you make changes,
- before you add it to your server,
- and once again after you’ve made it live on your server.
Two recommended validators are:
- W3C Feed Validation Service, which can validate from copied code or from a URL.
- Feed Validator, which only validates from a URL.
A good validation service will tell you what and where the error is, by line number, and give you tips on how you might fix it.