Key Value Store

# SYS1: Build a simple key value store

## Overview

This is a general database design question that is purposefully open-ended.  The primary goal is to provide a vehicle for discussing
design tradeoffs when building a durable storage system. Typically, you'll want to encourage the interviewee to start with a simple but
correct solution, and then make it harder by adding increasingly difficult requirements. There is no one correct answer, but instead 
you are looking for their ability to:
 - come up with a simple design and implement it in pseudo-code (it is important to force them to write out code so that later 
   you can probe where locks are needed and other ordering concerns).
 - explain trade-offs and why they chose the path they did (along the way ask for alternatives and why did not choose them).
 - compose the basic primitives of systems programming (files, os buffering, hash tables, logs, spinning disks, SSDs,
   locks, etc.) to build a complex system (it's fine if they're not already familiar with all of
   these concepts as long as they can reason about how to use them once you explain how they work).
 
Since this question is open ended, you can use it for a wide range of skill levels (i.e. fresh graduate to seasoned 
database implementor). Based on their skill level, you can skip ahead, or lay out more of the basic building blocks and only 
ask them to assemble them to meet the requirements.

**NOTE:** Because this question is so open-ended, you should mostly skip the resume questions and
get directly to the technical problem. Aim to give your candidate at least 50 minutes with this
question out of a one hour interview slot.

## Introduction

Since this question is so open-ended and since we ask it to candidates with different levels of
systems experience, it's important that we give a consistent introduction to all of our candidates.
Please make sure to mention the following points when you are introducing the question:
1. This is a high-level design question: it's an excuse for us to have a discussion about trade-offs
when building software
1. They may use any language or pseudocode as long as they are sufficiently precise about what
operations they're performing, on what inputs, and in what order
1. There is a simple, fully correct solution that will fit easily on the whiteboard (at least for
the first few levels of the problem) - you shouldn't need to do anything too complicated
1. They should NOT implement a hashtable, although they may use one
1. They should approach this problem as if they are writing a library that will be used by
other peoples' projects (this is especially important for candidates who are more used to calling
systems code than writing it)
1. If they forget the exact semantics of any low-level APIs that they need, they should discuss with
you and you can come up with reasonable pseudocode APIs
1. They can assume that they have a disk and that it is perfect and atomic (for candidates that do
not have a lot of systems experience, it is often helpful to be direct about the fact that they
should write directly to disk rather than using e.g. the network or an off-the-shelf atomic store.)

## Question

Implement in pseudo code a key/value store with the following interface:

```scala
/**   scala
 * Writes the given key/value pair into the store.  If `key` is already present replace the
 * current `value` with one given.  After this function returns, the value must be *durable*
 * meaning that even in the presense of failures (program crashing or power going out), subsequent 
 * `get()` calls for this key must continue to return the result of the most recent put.
 */
def put(key, value)

/**
 * Returns the most recently `put` value for the given `key`.
 * Returns null, if no value has been `put` for this key.
 */
def get(key) => value

/**
 * Called when the system first comes up. You can assume that no `get()` or `put()` calls will be made
 * before this function has completed. For candidates with an OO background, you can think of this
 * as the constructor for your key-value store.
 */
 def init()

class Store {  // java
  public Store() {
    // Called when the system first comes up. You can assume that no `get` or `put` calls will be made
    // before this function has completed. 
  }

  public void put(String key, String value) {
    // Writes a given key/value pair into the store.  If `key` is already present replace the
    // current `value` with one given.  After this function returns, the value must be *durable*
    // meaning that even in the presence of failures (program crashing or power going out), subsequent
    // `get` calls for this key must continue to return the result of the most recent `put`.
  }

  public String get(String key) {
    // Returns the most recently `put` value for the given `key`.
    // Returns NULL, if no value has been `put` for this key.
  }
};

class store:  # python
  # Called when the system first comes up. You can assume that no `get` or `put` calls will be made
  # before this function has completed. 
  def __init__(self):

  # Writes a given key/value pair into the store.  If `key` is already present replace the
  # current `value` with one given.  After this function returns, the value must be *durable*
  # meaning that even in the presence of failures (program crashing or power going out), subsequent
  # `get` calls for this key must continue to return the result of the most recent `put`.
  def put(self, key, value):

  # Returns the most recently `put` value for the given `key`.
  # Returns None, if no value has been `put` for this key.
  def get(self, key):

class Store {  // C++
 public:
  Store() {
    // Called when the system first comes up. You can assume that no `get` or `put` calls will be made
    // before this function has completed.
  }

  void put(std::string key, std::string value) {
    // Writes a given key/value pair into the store.  If `key` is already present replace the
    // current `value` with one given.  After this function returns, the value must be *durable*
    // meaning that even in the presence of failures (program crashing or power going out), subsequent
    // `get` calls for this key must continue to return the result of the most recent `put`.
  }

  std::string get(std::string key) {
    // Returns the most recently `put` value for the given `key`.
    // Returns an empty string, if no value has been `put` for this key.
  }
};

type Store struct { // go
}

func NewStore() *Store {
// Called when the system first comes up. You can assume that no `get` or `put` calls will be made
// before this function has completed.
}

func (s *Store) Put(key string, value string) {
// Writes a given key/value pair into the store.  If `key` is already present replace the
// current `value` with one given.  After this function returns, the value must be *durable*
// meaning that even in the presence of failures (program crashing or power going out), subsequent
// `get` calls for this key must continue to return the result of the most recent `put`.
}

func (s *Store) Get(key string) string {
// Returns the most recently `put` value for the given `key`.
// Returns an empty string, if no value has been `put` for this key.
}

Additionally, they should be optimizing for the following workload:

The total number of key/value pairs at any time will be small enough to fit in memory. Any single key may get updated over and over again, but the total number of keys will remain small enough to fit in RAM.

Helpful APIs

File

For candidates who are not as familiar with filesystem operations (or who know the super-clunky Java API and don't want to use that), it's often helpful to provide the following interface:

/** Reads/writes the file at the given path */ 
class File(path: String) {
  /** Writes to the end of the file and synchronously flushes to disk */
  def append(contents: String): Unit
  
  /** Reads all of the lines of the file sequentially */
  def lines(): Iterator[String]
  
  /** Writes at the given offset from the start of the file and synchronously flushes to disk */
  def write(offset: Long, contents: String): Unit
  
  /** Reads the given number of bytes starting at the given offset from the start of the file */
  def read(offset: Long, length: Long): String
}

You can also be flexible and have the append() and write() methods take a (key, value) pair and have the lines() and read() methods return such a pair. We don't really care about the nitty gritty of serialization.

NOTE: However the candidate chooses to use this API, you should make sure that they understand the effect of their choice on performance. For example, the write() and read() methods perform random I/O, which for most designs will probably be less efficient than the sequential I/O performed by the append() and lines(). We should accept solutions that use either set of methods as long as the final system will work (and as long as the candidate can explain why).

Simple Solution

One simple solution is to use an in-memory hash table for lookups along with a write-ahead log for durability. I usually try and steer them away from getting into the details of serialization, etc as thats not really the point of the question. Also, you can intially make assumptions, such as there is only one thread accessing the database at a time.


def init():
  # A simple log file. 
  log = new File("/db/log")
  db = new HashMap

  log.lines.foreach { (k, v) =>
    db.put(k, v)
  }
  
def get(key):
  db.get(key)
  
def put(key, value):
  log.append(key, value)
  db.put(key, value)

Allow them to use any libraries or interfaces they want, but ask them to state their assumptions. Specifically look to see if they understand how the operating system buffers writes before flushing them to the disk.

Common Missteps

There are several common missteps that people hit when trying to come up with the simple solution. If they start going down one of these paths, then ask them to contrast their design with a simpler solution.

keeping the index on disk - Some interviewees will try and do lookups directly from the disk, either by building an on-disk hash table, or someting like a b-tree. While this might work, it's probably not going to be as efficent, given the contraints we put on the workload. Specifically, a write ahead log is very friendly to spinning disks (long sequential writes, that don't require you to seek the disk head). If they start going down this path, remind them that data fits in memory and see if they adjust their tactic.
using a tree instead of a hashtable - This data structure would work, but look-ups/insertions are O(log n) instead O(1). Since the API only requires point-lookups instead of range queries (i.e. getRange(startKey, endKey)), this is uncessary. Using trees can be a good follow-up question though: "How would your answer change if I added getRange to the API?"

Follow-up questions

Once they get the basic structure on the whiteboard, there are several different avenues you can go down to probe deeper. Start by asking them if there is anything they would improve about their implementation. If they don't have any ideas, present them with problems and let them try and solve them.

Checkpointing

If the simple solution is used for a long time in production, eventually the log will get very large. This is true, even though the amount of data still fits in memory (because you may be updating the same key over and over again). There are two problems with a very large log:

initialization becomes more and more expensive
eventually even the largest disk will fill up

A solution to this problem is to checkpoint the current state of the hashtable to disk and truncate the log. This should be done periodically (i.e. every X number of put operations), but its fine if they just write an explicit checkpoint() function that must be invoked manually. At a high level there are two approaches. Depending on the candidate you can iterate or jump directly to the harder solution.

Stop the world checkpoint (simple)

Stop all get and put operations
Write the contents of the hashtable out to new log
Replace the old log with the new log
Allow get and put operations to resume.

This solution is simple and correct. Even so, there are a couple of missteps to watch out for:

Reading in the contents of the log, de-duplicating it with another hash table, and then writting it out. While you can do this, it is incuring unecessary disk operations. The deduplicated set of log entries is already present in the hashtable.
Overwriting the log before you complete the checkpoint. This soluton would fail to provide durability in the case of a failure during checkpointing.

Concurrent checkpointing (better)

The solution above works, but incurs downtime while the checkpoint is being written out. Ideally, we would lift this restriction and allow operation to continue during checkpointing. In order to make this tractable for whiteboard programming, I usually provide them with some assumptions about the hashtable.

Internally, the hashtable uses locks to protect concurrent get/put calls from corrupting its internal state (you can ask more about this part later).
The hashtable provides an Iterator[(Key, Value)] that returns all of the keys that were present in the hashtable when it was first opened. Keys that are added after the iterator was opened, may or may not be returned. Values produced by the iterator may or not reflect updates that happend after the iterator was opened.

Using the above, you can implement checkpointing as follows:

Switch the logfile to a new logfile (i.e. log.{n + 1})
Iterate through the elements of the hashtable, writing them to a new checkpoint file. (i.e. checkpoint.{n + 1})
After the checkpoint is complete, you can finalize it by deleting log.{n} and checkpoint.{n}

It is important to read from these files in the correct order during the init() function.

if only log.0 is present, apply it.
if log.{n} and checkpoint.{n} are present, n+1 has not completed sucessfully. Apply the checkpoint.{n} first then log.{n}, and finally log.{n+1}.
if either log.{n} or checkpoint.{n} are missing, apply checkpoint.{n+1} first then log.{n+1}.

Ask why the ordering is important: Applying the checkpoint first and then the log, allows updates that occured during checkpointing to overwrite the potentiatlly stale values that are present in the checkpoint itself.

Concurrent access

The simple solution only works when a single thread is issuing get and put calls. If we want to lift this restriction we'll need to use locks to avoid race conditions that could corrupt the hash table, or violate our durability gurantee.

Synchronization between the log and the hashtable

One possible issue is that writes to the same key can be applied in a different order to the log and the hashtable. When this ordering happens, it is possible that the result returned by get(key) could change due to a failure, even if no put call have occured. This phenomenon would violate the durability requirments of our system.

If this problem is not obvious to the candidate, help them to write out possible incorrect orderings of operations:

thread 1	thread2
put(k,v1)
	put(k, v2)
	log(k, v2)
log(k,v1)

At the end of these two concurrent operations, the hashtable holds v2, but the log will recover v1 if replayed. In order to prevent the inconsistency one must use a lock to ensure that hash table and log updates do not interleave with operations from other threads.

l = new Lock
def put():
  l.lock()
  log.append(key, value)
  db.put(key, value)
  l.unlock()

Ask why performing the log-append operation needs to happen before the hashtable-update operation, even within the lock. Locks provide mutexing, but not transactionality. Make sure they understand that performing hashtable-update before the log-append can cause us to read an uncommitted update, and thus result in a violation of isolation:

writer thread	reader thread
db.put(k, v1)	Reads v1
log.append(k, v1)	Reads v1
db.put(k, v2)	Reads v2
[CRASH]	[CRASH]
init()	Reads v1

Note: If the candidate uses a mutex in the implementation of get, the log-append and hashtable-update operations can happen in any order. With this mutex, a reader thread will be blocked from reading v2 until after the write finishes, so the key-value store will not return v2, so durability is preserved.

Hashtable concurrency

We also need to protect the internals of the hashtable from concurrent operations. If the candidate suggests using a library like java.util.ConcurrentHashMap, that is great, but see if if they can explain how that works internally. There are a couple of levels of performance / efficiency possible here:

single lock for get/put - only one operation happening at a time. correct, but limits throughput.
reader/writer lock - allows multiple reads to occur at the same time, but only a single write is allowed better throughput for read operations, still serializes writes.
lock striping - hash partition the keyspace and keep a different reader/writer locks for each partition. Now, given N locks, up to N different writes can happen concurrently. Unlimited reads can happen at the same time.

Other optimizations

If they make it this far, there are other things I like to talk about:

batching - you can trade off latency for throughput by committing multiple records in a single update to the log. The trick is you need to block returning unill it has actually been written to disk.
range queries - how does the system change if you add getRange(startKey, endKey)

Grading Guidelines

All candidates must implement stop-the-world checkpointing to pass ("Yes" or "Strong Yes" in Greenhouse). All candidates who are applying for a position above L3 (a position without "New Grad" in its name) must implement basic locking to pass.

Candidates with little/no systems experience should be able to get to that point with the filesystem and/or lock APIs (and can still pass even if they start by using the random access I/O APIs, as long as they realize why that doesn't work).

Candidates with experience in systems should be able to get to that point without being provided the filesystem or lock APIs (they don't need to use any specific language's APIs, but they should be able to create their own reasonable pseudocode API).

Candidates with lots of systems experience (i.e. database PhDs or senior candidates from other systems companies) should be able to go further and with less help.