# SYS1: Build a simple key value store
## Overview
This is a general database design question that is purposefully open-ended. The primary goal is to provide a vehicle for discussing
design tradeoffs when building a durable storage system. Typically, you'll want to encourage the interviewee to start with a simple but
correct solution, and then make it harder by adding increasingly difficult requirements. There is no one correct answer, but instead
you are looking for their ability to:
- come up with a simple design and implement it in pseudo-code (it is important to force them to write out code so that later
you can probe where locks are needed and other ordering concerns).
- explain trade-offs and why they chose the path they did (along the way, ask for alternatives and why they did not choose them).
- compose the basic primitives of systems programming (files, os buffering, hash tables, logs, spinning disks, SSDs,
locks, etc.) to build a complex system (it's fine if they're not already familiar with all of
these concepts as long as they can reason about how to use them once you explain how they work).
Since this question is open-ended, you can use it for a wide range of skill levels (from fresh graduate to seasoned
database implementor). Based on their skill level, you can skip ahead, or lay out more of the basic building blocks and only
ask them to assemble them to meet the requirements.
**NOTE:** Because this question is so open-ended, you should mostly skip the resume questions and
get directly to the technical problem. Aim to give your candidate at least 50 minutes with this
question out of a one hour interview slot.
## Introduction
Since this question is so open-ended and since we ask it to candidates with different levels of
systems experience, it's important that we give a consistent introduction to all of our candidates.
Please make sure to mention the following points when you are introducing the question:
1. This is a high-level design question: it's an excuse for us to have a discussion about trade-offs
when building software
1. They may use any language or pseudocode as long as they are sufficiently precise about what
operations they're performing, on what inputs, and in what order
1. There is a simple, fully correct solution that will fit easily on the whiteboard (at least for
the first few levels of the problem) - you shouldn't need to do anything too complicated
1. They should NOT implement a hashtable, although they may use one
1. They should approach this problem as if they are writing a library that will be used by
other people's projects (this is especially important for candidates who are more used to calling
systems code than writing it)
1. If they forget the exact semantics of any low-level APIs that they need, they should discuss with
you and you can come up with reasonable pseudocode APIs
1. They can assume that they have a disk and that it is perfect and atomic (for candidates that do
not have a lot of systems experience, it is often helpful to be direct about the fact that they
should write directly to disk rather than using e.g. the network or an off-the-shelf atomic store.)
## Question
Implement in pseudocode a key/value store with the following interface:
```scala
/**
 * Writes the given key/value pair into the store. If `key` is already present, replace the
 * current `value` with the one given. After this function returns, the value must be *durable*,
 * meaning that even in the presence of failures (program crashing or power going out), subsequent
* `get()` calls for this key must continue to return the result of the most recent put.
*/
def put(key, value)
/**
* Returns the most recently `put` value for the given `key`.
 * Returns null if no value has been `put` for this key.
*/
def get(key) => value
/**
* Called when the system first comes up. You can assume that no `get()` or `put()` calls will be made
* before this function has completed. For candidates with an OO background, you can think of this
* as the constructor for your key-value store.
*/
def init()
```

```java
class Store {
  public Store() {
    // Called when the system first comes up. You can assume that no `get` or `put`
    // calls will be made before this function has completed.
  }

  public void put(String key, String value) {
    // Writes the given key/value pair into the store. If `key` is already present,
    // replace the current `value` with the one given. After this function returns,
    // the value must be *durable*, meaning that even in the presence of failures
    // (program crashing or power going out), subsequent `get` calls for this key
    // must continue to return the result of the most recent `put`.
  }

  public String get(String key) {
    // Returns the most recently `put` value for the given `key`.
    // Returns null if no value has been `put` for this key.
  }
}
```

```python
class Store:
    def __init__(self):
        # Called when the system first comes up. You can assume that no `get` or
        # `put` calls will be made before this function has completed.
        pass

    def put(self, key, value):
        # Writes the given key/value pair into the store. If `key` is already
        # present, replace the current `value` with the one given. After this
        # function returns, the value must be *durable*, meaning that even in the
        # presence of failures (program crashing or power going out), subsequent
        # `get` calls for this key must continue to return the result of the most
        # recent `put`.
        pass

    def get(self, key):
        # Returns the most recently `put` value for the given `key`.
        # Returns None if no value has been `put` for this key.
        pass
```

```cpp
class Store {
 public:
  Store() {
    // Called when the system first comes up. You can assume that no `get` or `put`
    // calls will be made before this function has completed.
  }

  void put(std::string key, std::string value) {
    // Writes the given key/value pair into the store. If `key` is already present,
    // replace the current `value` with the one given. After this function returns,
    // the value must be *durable*, meaning that even in the presence of failures
    // (program crashing or power going out), subsequent `get` calls for this key
    // must continue to return the result of the most recent `put`.
  }

  std::string get(std::string key) {
    // Returns the most recently `put` value for the given `key`.
    // Returns an empty string if no value has been `put` for this key.
  }
};
```

```go
type Store struct {
}

// NewStore is called when the system first comes up. You can assume that no
// Get or Put calls will be made before this function has returned.
func NewStore() *Store {
}

// Put writes the given key/value pair into the store. If key is already
// present, replace the current value with the one given. After this function
// returns, the value must be *durable*, meaning that even in the presence of
// failures (program crashing or power going out), subsequent Get calls for
// this key must continue to return the result of the most recent Put.
func (s *Store) Put(key string, value string) {
}

// Get returns the most recently Put value for the given key.
// Returns an empty string if no value has been Put for this key.
func (s *Store) Get(key string) string {
}
```
Additionally, they should be optimizing for the following workload:

For candidates who are not as familiar with filesystem operations (or who know the super-clunky Java API and don't want to use that), it's often helpful to provide the following interface:
```scala
/** Reads/writes the file at the given path */
class File(path: String) {
  /** Writes to the end of the file and synchronously flushes to disk */
  def append(contents: String): Unit

  /** Reads all of the lines of the file sequentially */
  def lines(): Iterator[String]

  /** Writes at the given offset from the start of the file and synchronously flushes to disk */
  def write(offset: Long, contents: String): Unit

  /** Reads the given number of bytes starting at the given offset from the start of the file */
  def read(offset: Long, length: Long): String
}
```
You can also be flexible and have the `append()` and `write()` methods take a `(key, value)` pair, and have the `lines()` and `read()` methods return such a pair. We don't really care about the nitty gritty of serialization.
**NOTE:** However the candidate chooses to use this API, you should make sure that they understand the effect of their choice on performance. For example, the `write()` and `read()` methods perform random I/O, which for most designs will probably be less efficient than the sequential I/O performed by `append()` and `lines()`. We should accept solutions that use either set of methods as long as the final system will work (and as long as the candidate can explain why).
One simple solution is to use an in-memory hash table for lookups along with a write-ahead log for durability. I usually try to steer them away from getting into the details of serialization, etc., as that's not really the point of the question. Also, you can initially make simplifying assumptions, such as that there is only one thread accessing the database at a time.
```
def init():
    # Replay the log to rebuild the in-memory hash table.
    log = new File("/db/log")
    db = new HashMap
    log.lines().foreach { (k, v) =>
        db.put(k, v)
    }

def get(key):
    return db.get(key)

def put(key, value):
    # Append to the log first, so the update is durable before it becomes
    # visible in the hash table.
    log.append(key, value)
    db.put(key, value)
```
Allow them to use any libraries or interfaces they want, but ask them to state their assumptions. Specifically look to see if they understand how the operating system buffers writes before flushing them to the disk.
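To make the buffering point concrete, here is a minimal runnable sketch of the simple solution in Python. The class name, file path, and tab-separated line format are arbitrary illustrative choices, and keys/values are assumed not to contain tabs or newlines (we don't care about serialization details):

```python
import os

class SimpleStore:
    def __init__(self, path="db.log"):
        # Replay the log to rebuild the in-memory table.
        self.path = path
        self.db = {}
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    key, value = line.rstrip("\n").split("\t", 1)
                    self.db[key] = value
        self.log = open(path, "a")

    def put(self, key, value):
        # Append to the log and force it to the device *before* updating the
        # in-memory table. write() alone only reaches user-space and OS
        # buffers; fsync() is what actually makes the put durable.
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()             # user-space buffer -> OS page cache
        os.fsync(self.log.fileno())  # OS page cache -> disk
        self.db[key] = value

    def get(self, key):
        return self.db.get(key)
```

The detail worth probing is the `flush()`/`fsync()` pair: without them, an acknowledged `put` can sit in a buffer and be lost when the power goes out.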
There are several common missteps that people hit when trying to come up with the simple solution. If they start going down one of these paths, then ask them to contrast their design with a simpler solution.
- Using a tree (e.g. a B-tree) instead of a hash table, which makes lookups `O(log n)` instead of `O(1)`. Since the API only requires point lookups rather than range queries (i.e. `getRange(startKey, endKey)`), this is unnecessary. Using trees can be a good follow-up question though: "How would your answer change if I added `getRange` to the API?"

Once they get the basic structure on the whiteboard, there are several different avenues you can go down to probe deeper. Start by asking them if there is anything they would improve about their implementation. If they don't have any ideas, present them with problems and let them try to solve them.
If the simple solution is used for a long time in production, eventually the log will get very large. This is true even though the amount of data still fits in memory (because you may be updating the same key over and over again). There are two problems with a very large log: `init()` must replay the entire log, so recovery gets slower and slower, and the log wastes disk space on stale values that have since been overwritten.
A solution to this problem is to checkpoint the current state of the hashtable to disk and truncate the log. This should be done periodically (i.e. every X number of `put` operations), but it's fine if they just write an explicit `checkpoint()` function that must be invoked manually. At a high level there are two approaches. Depending on the candidate, you can iterate or jump directly to the harder solution.
The simpler approach is to stop the world:

1. Block new `get` and `put` operations (and wait for in-flight ones to finish).
2. Write the entire contents of the hashtable out to a checkpoint file on disk.
3. Truncate the log.
4. Allow `get` and `put` operations to resume.

This solution is simple and correct. Even so, there are a couple of missteps to watch out for:

- Truncating the log before the checkpoint has been completely and durably written; a crash in between loses acknowledged puts.
- Overwriting the previous checkpoint in place, which leaves no consistent copy to recover from if the crash happens mid-checkpoint.
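A minimal sketch of this stop-the-world variant, building on the Python `SimpleStore` above (the temp-file-then-rename publication and the file names are illustrative choices; `init()` would correspondingly need to load the checkpoint before replaying the log):

```python
import os

def checkpoint(store):
    # Assumes no get/put calls run while this executes (stop the world).
    tmp = store.path + ".ckpt.tmp"
    with open(tmp, "w") as f:
        for key, value in store.db.items():
            f.write(f"{key}\t{value}\n")
        f.flush()
        os.fsync(f.fileno())       # the checkpoint must be durable *before*...
    os.rename(tmp, store.path + ".ckpt")  # atomic publish on POSIX
    store.log.close()
    store.log = open(store.path, "w")     # ...the log is truncated
```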
The solution above works, but incurs downtime while the checkpoint is being written out. Ideally, we would lift this restriction and allow operations to continue during checkpointing. In order to make this tractable for whiteboard programming, I usually provide them with the following assumptions about the hashtable:
1. The hashtable is internally thread-safe: it prevents concurrent `get`/`put` calls from corrupting its internal state (you can ask more about this part later).
2. The hashtable provides a weakly consistent iterator, `Iterator[(Key, Value)]`, that returns all of the keys that were present in the hashtable when it was first opened. Keys that are added after the iterator was opened may or may not be returned. Values produced by the iterator may or may not reflect updates that happened after the iterator was opened.

Using the above, you can implement checkpointing as follows:
1. Stop appending to the current log and direct all new writes to a fresh log file (`log.{n + 1}`).
2. Open an iterator over the hashtable and write every pair it returns to a new checkpoint file (`checkpoint.{n + 1}`).
3. Once the new checkpoint has been durably written, delete `log.{n}` and `checkpoint.{n}`.
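A sketch of this algorithm, assuming a `store` with a `dir` attribute holding these files and approximating the weakly consistent iterator by snapshotting the key set; the very first generation (where `log.{n}` and `checkpoint.{n}` do not exist yet) is glossed over:

```python
import os

def checkpoint(store, n):
    # 1. Freeze the old log; all new puts go to log.{n+1}.
    store.log.close()
    store.log = open(os.path.join(store.dir, f"log.{n + 1}"), "a")

    # 2. Scan the live hashtable into checkpoint.{n+1}. Snapshotting the key
    #    set stands in for the weakly consistent iterator: the values written
    #    may or may not reflect concurrent updates, which is fine because
    #    log.{n+1} is replayed after the checkpoint on recovery.
    with open(os.path.join(store.dir, f"checkpoint.{n + 1}"), "w") as f:
        for key in list(store.db.keys()):
            f.write(f"{key}\t{store.db[key]}\n")
        f.flush()
        os.fsync(f.fileno())

    # 3. Only once checkpoint.{n+1} is durable, delete the old generation.
    os.remove(os.path.join(store.dir, f"log.{n}"))
    os.remove(os.path.join(store.dir, f"checkpoint.{n}"))
```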
It is important to read from these files in the correct order during the `init()` function:
- If only `log.0` is present, apply it.
- If `log.{n}` and `checkpoint.{n}` are present, checkpoint `n+1` has not completed successfully. Apply `checkpoint.{n}` first, then `log.{n}`, and finally `log.{n+1}`.
- If `log.{n}` or `checkpoint.{n}` is missing, apply `checkpoint.{n+1}` first, then `log.{n+1}`.

Ask why the ordering is important: applying the checkpoint first and then the log allows updates that occurred during checkpointing to overwrite the potentially stale values that are present in the checkpoint itself.
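A sketch of the corresponding recovery logic, assuming `n` is the highest generation for which a checkpoint was started (how `n` is discovered, e.g. by listing the directory, and the initial `log.0`-only state are glossed over):

```python
import os

def replay_into(db, path):
    # Apply every (key, value) line in the file; later entries win.
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            db[key] = value

def init(store, n):
    if os.path.exists(os.path.join(store.dir, f"checkpoint.{n}")):
        # checkpoint.{n+1} did not complete; generation n is authoritative.
        names = [f"checkpoint.{n}", f"log.{n}", f"log.{n + 1}"]
    else:
        # Generation n was cleaned up, so checkpoint.{n+1} must have completed.
        names = [f"checkpoint.{n + 1}", f"log.{n + 1}"]
    # Checkpoint before logs: later log entries must overwrite the potentially
    # stale values captured in the checkpoint.
    for name in names:
        path = os.path.join(store.dir, name)
        if os.path.exists(path):   # e.g. log.{n+1} may not exist yet
            replay_into(store.db, path)
```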
The simple solution only works when a single thread is issuing `get` and `put` calls. If we want to lift this restriction, we'll need to use locks to avoid race conditions that could corrupt the hash table or violate our durability guarantee.
One possible issue is that writes to the same key can be applied in a different order to the log and the hashtable. When this happens, it is possible that the result returned by `get(key)` could change due to a failure, even if no `put` calls have occurred. This phenomenon would violate the durability requirements of our system.
If this problem is not obvious to the candidate, help them to write out possible incorrect orderings of operations:
| thread 1 | thread 2 |
| --- | --- |
| put(k, v1) | |
| | put(k, v2) |
| | log(k, v2) |
| log(k, v1) | |
At the end of these two concurrent operations, the hashtable holds `v2`, but the log will recover `v1` if replayed. In order to prevent the inconsistency, one must use a lock to ensure that hash table and log updates do not interleave with operations from other threads.
```
l = new Lock

def put(key, value):
    l.lock()
    log.append(key, value)
    db.put(key, value)
    l.unlock()
```
Ask why performing the log-append operation needs to happen before the hashtable-update operation, even within the lock. Locks provide mutexing, but not transactionality. Make sure they understand that performing hashtable-update before the log-append can cause us to read an uncommitted update, and thus result in a violation of isolation:
writer thread | reader thread |
---|---|
db.put(k, v1) | Reads v1 |
log.append(k, v1) | Reads v1 |
db.put(k, v2) | Reads v2 |
[CRASH] | [CRASH] |
init() | Reads v1 |
**Note:** If the candidate uses a mutex in the implementation of `get`, the log-append and hashtable-update operations can happen in either order. With this mutex, a reader thread will be blocked from reading `v2` until after the write finishes, so the key-value store can never return a value that has not yet been logged, and durability is preserved.
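For example, a sketch of that variant using Python's `threading.Lock` in place of the pseudocode's `Lock`:

```python
import threading

l = threading.Lock()
db = {}

def get(key):
    # Taking the same lock as put() means a reader can never observe a
    # hashtable update whose log append has not yet completed, so the order
    # of the append and the update inside put() no longer matters.
    with l:
        return db.get(key)
```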
We also need to protect the internals of the hashtable from concurrent operations. If the candidate suggests using a library like `java.util.ConcurrentHashMap`, that is great, but see if they can explain how it works internally. There are a couple of levels of performance / efficiency possible here:

- A single global lock around every `get`/`put`: only one operation happening at a time. Correct, but limits throughput.
- Lock striping: hash each key to one of `N` locks, so that up to `N` different writes can happen concurrently. With reader/writer locks, unlimited reads can happen at the same time. (A sketch follows this list.)
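A minimal sketch of the striped variant in Python (the stripe count of 16 and the use of the built-in `hash` are arbitrary illustrative choices, and the `log` argument is assumed to be an object with a thread-safe `append`, like the `File` interface above):

```python
import threading

N_STRIPES = 16  # arbitrary; more stripes means less write contention

class StripedStore:
    def __init__(self, log):
        self.log = log
        self.db = {}
        self.stripes = [threading.Lock() for _ in range(N_STRIPES)]

    def _lock_for(self, key):
        # Two operations contend only if their keys hash to the same stripe.
        return self.stripes[hash(key) % N_STRIPES]

    def put(self, key, value):
        with self._lock_for(key):
            self.log.append(key, value)  # log-append still comes first
            self.db[key] = value

    def get(self, key):
        with self._lock_for(key):
            return self.db.get(key)
```

Swapping each `threading.Lock` for a reader/writer lock would additionally allow unlimited concurrent reads within a stripe.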
If they make it this far, there are other things I like to talk about:

- Adding `getRange(startKey, endKey)` to the API.
## Grading

All candidates must implement stop-the-world checkpointing to pass ("Yes" or "Strong Yes" in Greenhouse). All candidates who are applying for a position above L3 (a position without "New Grad" in its name) must implement basic locking to pass.
Candidates with little/no systems experience should be able to get to that point with the filesystem and/or lock APIs (and can still pass even if they start by using the random access I/O APIs, as long as they realize why that doesn't work).
Candidates with experience in systems should be able to get to that point without being provided the filesystem or lock APIs (they don't need to use any specific language's APIs, but they should be able to create their own reasonable pseudocode API).
Candidates with lots of systems experience (i.e. database PhDs or senior candidates from other systems companies) should be able to go further with less help.