Flat File vs. Database CMS
The idea of a flat file CMS is intriguing. There’s a lot of overhead in a database and CMS content doesn’t change much, and isn’t highly structured. CMS content more closely resembles a document than data, and the filesystem is very fast at dealing with documents in the form of files. Let’s explore some benefits of a CMS and flat file systems.
When using a file per content item, basic operations can be performed with a single contiguous read or write. A database still needs to store it’s data in persistent storage, so these operations cannot be avoided. In theory, a database’s caching could improve performance, but in practice data will need to be gathered from multiple tables which will require multiple reads and writes, and slow down performance. And caching can be done by the CMS or memcache with either approach, you can’t beat flat files for when the disk access is absolutely required.
Winner: Flat File
In a database CMS, a basic design approach would be to create columns for each data element (or child tables for multi-valued fields), but this has some major drawbacks. The needs of any system change over time, and adding columns to data tables with millions of records may be completely unfeasible.
Drupal sidestepped this problem in version 7 by creating a separate table for nearly every field, but this means that performance will suffer as the database has to join many tables together, for every single piece of content. This is further complicated by the fact that Drupal is designed to work with MySQL, a database which has very poor join optimization, and limits the amount of joins that can be performed in one query to 61. It’s not uncommon for this limit to be exceeded in larger sites, and although drupal does handle this, it’s never going to perform well as it’s going to necessitate accessing the index on every table involved, and then doing a dictionary lookup, requiring hundreds of disk accesses to just get the data for a single content item (or node in Drupal terminology).
Drupal solves this with many layers of caching but that also adds complexity and new problems.
Another common approach, also employed by Drupal, is to use a ‘data’ column which simply stores a serialized dataset which just throws away any of the potential benefits of a database.
Another possible approach would be to use a single table to store all values, as a key-value store (columns: id, delta, field value), but this once again throws away many benefits of a database, and is inferior to a proper key-value store.
A flat file based CMS simplifies this - we can easily use a markup language like YML or XML, to store multiple fields in a document. At any time, we can add a new field simply by adding it to the markup, we can add informational metadata, or fields for things like an image, post date, multiple content selections, and make the document as complex as we want.
The one downside is that we can’t update a field or change data types, but it’s not that difficult to write a script to loop over all the files in a directory and update them.
Winner: Flat File
Version control system is something that has never really worked well. You will need both the data and the code to restore to any particular version but there’s no good way to keep both together. You could take a full database backup at each point you check into source control but it’s still not versioning the data, and none of the versioning tools will work.
With a flat file system, you can just check in the content alongside the code. This seems like the obvious winner, but the problem is that you may still need to include binary resources for each content item, like images and media, which also don’t work well with revision systems.
Another question is whether or not this actually makes sense. Including content will bloat the repository size, and you have to check in your content changes periodically which creates a bunch of new problems.
Either system can do revisions. A database system will typically just store them outside the data table for the “live” revision (as Drupal does). A flat file system can store them in a separate place, appending version numbers to each file for each new revision.
For dynamic content, including lists of content (ex. List of blog articles for a certain category), comments, related or popular articles, or personalized content the database makes things simple. These things can typically be boiled down to some kind of query, which will perform well because that’s what databases are designed for.
To do this with flat files requires a lot of cleverness. Naive approaches, such as reading a directory of files to find the newest ten articles for a certain category will always perform badly. A better approach would be to update any aggregated views when saving a content item, but this would have to be baked into the CMS and may also perform badly if for example, there are many views that need to be updated.
Comments on a popular site could change too often that the file system can’t update the content page fast enough and cause problems, but at the same time, comments are often structured as a tree, something hard to represent in a database. Reddit for example, uses a document store for comments instead of storing them in a database or flat file.
Data vs. Content
There’s a difference between data and content. Content is unstructured content that’s only for presenting to your visitors, but data is structured, and includes things like store orders, product information, content metadata that may need to be filtered, or referenced in a variety of ways, aggregated, or modified by many users at once in many ways like comments. These kinds of data will be best served by the database. It could be possible to use some kind of hybrid data store where the content is stored on the filesystem and data is stored in the database, or that the database is simply used as an index for the content. Ultimately, not using a database for this data just means replicating database features where they’re needed.
Perhaps in this case, a flat file based CMS is just not the right choice if you need functionality of a database CMS.
Cache data, like content, is not well suited for a database in most cases. The whole point of cache is to make access fast, and the database overhead causes a lot of lag. Drupal uses this method by default, and I’ve seen on some sites, it takes over a second to add cache entries to the database. All of the overhead of a database make it the clearly wrong architectural choice in this case. Using a memory cache such as memcached or redis could be ideal, or simply using the built-in caching apis in .Net languages, or use data structures to cache data in the application where it makes sense. Scripting languages don’t allow this, so PHP-based CMS’s should use either memcached, redis or a flat file store.
Search engines work based on an index, not by actually scanning all of the content on the site for your search terms, therefore in some ways this doesn’t matter. An external search engine such as Google site search will not have any difference. A built-in search platform such as Apache Solr will simply go page by page and index your content, it doesn’t matter how that content is stored o the back end. It will need to store its data somewhere, but could use a flat file so you can still avoid a database.
With a database, there are several options for scaling. You can create a cluster and share the data across multiple servers. You can use sharding to split up the data across multiple instances or designated different instances to handles different types of data.
With a blog or something where one person is managing all of the content, there’s no problems, but as you increase the requirements you’ll soon run into problems. If you want to scale across multiple servers, synchronization will become a problem.
If you have multiple users, the CMS will have to take steps to ensure you don’t overwrite the changes of other users. This may not be too difficult for smaller systems, because the CMS can easily check that the current revision number hasn’t changed between editing and saving a content item.
Once you want to scale across multiple servers, the synchronization problem becomes more complex. You now need to make sure that changes from different servers don’t overwrite each other, and while with few users, rsync may be okay (or designated that a certain system is to be used for adding and editing content), for many users, or systems where the public can create content, you’ll need to use a distributed file system, or use a system designed to handle the synchronization for you: a database.
For site with comments or where a lot of people could be changing the content at once, a flat file will not be the right choice. But, for simple sites, where content isn’t changing, commenting isn’t needed, and there’s minimal dynamic content, it’s possible to use a forgo the database, and use a flat file CMS.