In my discussion of RethinkDB, I mentioned that ReQL is inspired by functional languages.
What I neglected to mention is that this will serve as an initial stumbling block for some developers. Being functional, it makes use of chaining, and results from one function are fed into the next. This means that order matters in ReQL.
This does open up some power you don’t have in SQL (and some non-SQL query languages too), but simultaneously introduces a potential for error.
Let’s take a look at a simple use case: We want the results ordered, but we only want 5 of them.
In standard SQL, this is unambiguous. There’s only one allowable way to write the query. If you try to LIMIT before ORDERing, your DB (PostgreSQL, in this case) will drop the hammer:
SELECT avg_max_f FROM heat_index_data WHERE state = 'Arizona' LIMIT 5 ORDER BY date DESC;
ERROR: syntax error at or near "ORDER"
LINE 1: ...M heat_index_data WHERE state = 'Arizona' LIMIT 5 ORDER BY d...
Write it correctly, with LIMIT at the end, and we’re good:
SELECT avg_max_f FROM heat_index_data WHERE state = 'Arizona' ORDER BY date DESC LIMIT 5;
weather_dev=# SELECT avg_max_f FROM heat_index_data WHERE state = 'Arizona' ORDER BY date DESC LIMIT 5;
avg_max_f
-----------
59.16
60.57
58.51
54.11
51.8
(5 rows)
But ReQL is functional. It doesn’t care where you stick your limits or orderBys and it doesn’t care how many of them you have. You can orderBy(r.asc(date)).orderBy(r.desc(date)) for all it cares. Data comes out of one function and is fed into the next.
…and as is plainly visible here, you get different results because they’re different queries.
Wanting to limit before ordering is certainly an edge case (which is why SQL doesn’t allow for it), but it — along with many other unconventional queries — is perfectly possible with the limited constraints of ReQL. If you run into unexpected results working with RethinkDB, take a step back and ask yourself: “What am I really asking for here? Does it make sense?”
So some developers walk into Y Combinator with an alternate MySQL storage engine optimized for SSDs, and… No, wait! This gets better than it starts out!
Along the way, MySQL is cut out of the picture and something new and more interesting springs up. RethinkDB the SSD-optimized storage engine becomes RethinkDB the human-optimized NoSQL database. It attracts some brainy talent, conquers some nifty problems, and uses MongoDB’s list of pain points as a checklist of things to do better. What emerges is a newer, smarter source of document store. One that’s agile while valuing data integrity and stability. (Novel idea, right?)
RethinkDB may not be the first NoSQL database out of the gate, but that’s exactly why it’s so good. It’s the much-overlooked advantage of not being the first mover; Apple’s been succeeding on this basis for decades. AppleĀ built an empire around the iPod, which was an extremely refined product in a market analysts had declared “mature” and “crowded.” They did it again with iPad, kickstarting a market Microsoft had been trying and failing to create for two decades. Now RethinkDB has come into a “mature” and “crowded” niche to build on the shortcomings of the prevalent NoSQL solutions.
But as a developer, what’s the draw of yet another document store? Don’t we have enough? It’s been discussed to death that PostgreSQL is frequently better than MongoDB at being MongoDB. Postgres features a pedigree and modern performance numbers that are the stuff of dreams. It’s taken years to get there. So why in the hell bother with a database that’s a tender 3 years from its 1.0 release?
Because it’s awesome. Not perfect for all applications, and perhaps not perfect yet for any application, but awesome.
Admin Interface
The first feature of RethinkDB that most users will encounter is its beautiful, modern administrative web interface. At a glance, you can see the status of your server or cluster. With a few clicks, you can perform nearly any administrative task you like. There’s even a live query console.
I’m not going to replicate effort here, so if you want to see the admin interface in action without installing RethinkDB, give their screencast a spin. Sharding, replication, and multi-datacenter management are truly only a few clicks away. Even many grotesquely expensive commercial products don’t offer cluster management this easy.
For the small team that’s dev and ops rolled into one, this level of control and accessibility is breathtaking. While I have a copy of Panic’s Prompt SSH client on every mobile device I own, there’s power in being able to administer your database from literally any browser.
Query Language
RethinkDB dispenses with SQL-style query languages in favor of its own language, ReQL. While distinct from SQL, it has most of your familiar features — including joins — and can be picked up enough for basic use in all of 5 minutes.
Assuming you have some grounding in set theory and databases, it’s simple and well-documented enough that you can get started even without an SQL background.
ReQL is functional in its construction, building queries through chaining rather than monolithic command strings. This, among other design decisions, also helps minimize (but not eliminate!) injection and similar attack vectors; unless you’re doing something very foolish, like passing user input to RethinkDB’s JavaScript engine, it would be quite difficult for a typical attacker to achieve the kinds of malfeasance easily accomplished with SQL.
By now there has to be a thought lurking in the back of every reader’s mind: What’s the catch? And, in fact, I purposely helped plant that seed with the title.
There are a couple catches and it’s important to be aware of them. For some users they’ll be unimportant; for others they may be a deal-breaker. But by being aware of them, you can make an educated decision and focus your testing when evaluating RethinkDB. You want to choose a data store on its merits and weaknesses, not whether or not it’s webscale.
Performance
While performance is a priority for the RethinkDB team, you may find it’s not quite where you need it right now. Thus the “next next database.” I’ll touch on this in some detail in a later post, but performance ranges from as fast as or faster than Postgres (on very simple queries) to dreadfully slower than Postgres (on more complex queries).
Tolerance for RethinkDB’s performance will vary with your expectations; based on my own testing and experience, many MongoDB or MySQL users will likely find it acceptable or even an improvement. Pragmatic Postgres users with the right usage patterns may even find response times perfectly acceptable (more so if you can take advantage of caching). It’s frequently slower than Postgres to be sure, but there’s a sizable gulf between “slower” and “slow”. And Postgres itself has seen some massive improvements in performance over the past several releases, so certainly it’s not fair to act as if RethinkDB will never see a speed increase.
Query Optimization
RethinkDB doesn’t currently have a query optimizer. If you want to use an index, you need to explicitly request the use of that index in your ReQL. If you’re doing a series of expensive operations, it’s up to you to determine the best order (and identify any redundancies or stupid code). This is a significant difference from modern SQL servers, most of which have extremely intelligent query optimizers, allowing excellent performance from even half-baked queries.
A developer today can adopt a popular SQL ORM — or even raw SQL — with no or very little knowledge of SQL and still get great performance.
A developer today using RethinkDB has to put some thought into his queries and develop an understanding of functions’ performance through experience.
Indexing
There are also some gotchas (for SQL users, anyway) surrounding secondary index use; as RethinkDB functions can only utilize an index when operating on a table (which you generally cease to have after the first function in the query chain), a good understanding of your application’s usage pattern is required for both index creation and querying. Whereas SQL will generally gleefully assemble an optimized query across multiple conditions and multiple single-column indexes, RethinkDB’s functions can only use an index when acting directly on a table. This means that, in many cases, you get to use one index and it has to be in your first function. It thus becomes important to both create appropriate compound indexes and determine what portion of your query will actually benefit from using your one available index-enabled call.
To be fair, ReQL and RethinkDB are well-designed and this isn’t nearly as painful as it might first sound. Coming from SQL, though, it does initially feel a bit clumsy and dated.
Circling back to “a good understanding of your application,” index creation times are — at least currently — quite long. While index creation is not a blocking operation, the expense does mean you want to get it right the first time.
The RethinkDB team is also super friendly and super accessible, so if you have any questions or suggestions, link up with the community.
Maybe it’ll work for you, maybe it won’t, but in either case it may start to change your mind about what your database needs are and how a NoSQL solution should work.