# SYS Chat Server
Owner: Marek Brysa <[email protected]>
Changelog:
* 2021-03-09 - Add a guideline about using Python's defaultdict.
* 2021-02-23 - Move to prod. Added sample implementation.
* 2021-02-03 - Made `handleNewConnection` optional followup. Implementing this method was not strictly necessary,
and it seemed to confuse candidates as a result.
Added Python- and Golang-specific considerations.
Clarified grading a bit.
Interview primer recording: <https://drive.google.com/file/d/1kX5XnouhTUL69x0PqAoCXDOW5yMmg8L1/view>
### Main goal
Design and partially implement a single-process chat server (IRC-like).
Make it clear to the candidate that knowledge of IRC is not required, and we will not be talking about the details of the actual protocol.
### Environment
Single, reasonably powerful server on a good Internet connection. Our code will run as a process on that server.
We want to run everything on that server, i.e. not rely on any cloud services etc.
We're looking for simple client-server architecture. We're only focusing on the server, not the client.
### Required functionality
* Users must be able to establish a connection to the server.
* Users must be able to join chat channels at will. Users can be in multiple chat channels.
* Messages sent by a user to a channel are received by all users in the channel.
* If the candidate asks about latency SLA (extra points): Initially we can aim for p99 1s between message received by server and propagated to all recepients. (Note that precisely defining good SLAs is a bit of a rabbit hole, so be careful to not spend too much time on this point)
Out of scope if the candidate asks:
* Keeping chat history - users can expect to only receive messages sent after they joined a channel.
* Any persistence - if the process crashes, all is lost and we start over.
* Private messages (this can be achieved with a 2-user channel in any case, no need for special functionality).
* Messages are only best-effort ordered, it's an informal chat system. (In other words, if the candidate talks about SLA: there is no SLA on message ordering) (This might come up in Stage 2 or 3)
* Authentication - any user may freely connect with any username.
* Channels "appear" as the first user joins.
### Discussion points before coding
* How is incoming connection handled? (E.g. socket is listening, when a new connection is established, create a new thread; or epoll/kqueue)
If the candidate does not elaborate, try asking more detailed question: *What exactly happens in the server when a new user connects?*
Or in other words: *How do we get from "packets received by socket" to "handler function called"?*
If the candidate is not familiar with this, tell them there will be a thread pool handling the incoming connections.
Optional for candidates with expected deeper networking knowledge:
* UDP or TCP? (Steer to TCP)
* What should the protocol look like? (E.g. text based, GRPC, json - discuss pros/cons)
Make it clear we don't need to implement the protocol details later.
## Stage 1 - Basic functionality
Ask the candidate to implement the interface below with pseudocode
(i.e. Use constructs from any language or any library, as long as we understand each other. Do not spend time on getter/setters, beautiful OOP etc.)
It is a good approach to start with a single-threaded implementation to get the business logic right, then worry about concurrency.
You may suggest this to the candidate.
Unless the candidate does it on their own, bring up the fact that these methods must be able to handle multiple concurrent connections.
We expect the implementation to be thread-safe (if traditional threads are involved).
If the candidate talks about having 1 thread per connection (which is not good), you may let that be for now,
but make it clear that it is not guaranteed that there is 1:1 mapping between connection and threads.
It could lead the candidate relying on that fact (or using thread-locals), which could lead to undesired implementation.
Some candidates may choose to implement this in Node.js or use async/coroutine features of a language instead of traditional threads.
That is also a good solution, but in that case you need to dig a little deeper to ensure they understand how these mechanisms work,
e.g. are able to explain how they ultimately map to processes or threads.
Specifically for **Python**: they should explain how threads work in Python and that due to
[GIL](<https://en.wikipedia.org/wiki/Global_interpreter_lock>) (in CPython), operations over e.g. a `dict`
are safe. Race condition in `handleJoinChannels` is still an issue and should be handled with a lock. If the candidate
chooses to use [native coroutines](<https://docs.python.org/3/library/asyncio-task.html#coroutines>),
they should be able to explain how it works at a high level and maps to threads.
The candidate may use [defaultdict](<https://docs.python.org/3/library/collections.html#collections.defaultdict>),
this is idomatic (appreciate it), but it hides the race condition in `handleJoinChannels`, so ask them to use a regular `dict`.
Specifically for **Golang**: Golang exposes [goroutines](<https://medium.com/technofunnel/understanding-golang-and-goroutines-72ac3c9a014d>),
not threads, but goroutines are still backed by threads and the candidate should explain how.
Golangs's `map` [is not goroutine-safe](<https://medium.com/a-journey-with-go/go-concurrency-access-with-maps-part-iii-8c0a0e4eb27e>)
, it's necessary to lock over any operation.
```java
/**
* We're assuming connections are established outside of this class.
* The methods below are called by existing underlying server code.
*/
interface ChatServer {
/** The process will call this when a user wants to join a list of channels */
void handleJoinChannels(Connection conn, List<String> channelsToJoin);
/** The process will call this when a user wants to send a message */
void handleRequestToSendMessage(Connection conn, Message message);
}
/** Think of this as an abstraction of a socket, file descriptor, Stream etc. */
class Connection {
/** Synchronously write a message to be received by the client. Assume this is already implemented. */
void writeMessage(Message message);
}
/** Reference Message data class. Feel free to modify. */
class Message {
String body;
String channelName;
}
A good implementation should:
ConcurrentHashMap
in Java or equivalent elsewhere).
The candidate should be able to explain what would go wrong if we used regular data structures.ConcurrentHashMap.computeIfAbsent()
if appropriate),
alternatively use granular locking to keep individual data structures in sync
(plus points for "sharded" locking on users or channels).O(1)
.A bad implementation might:
handleNewConnection
is called concurrently with the same username
(e.g. leaving conn <-> username
mapping in inconsistent state).handleJoinChannels
is called concurrently for the same conn
(e.g. overwriting/recreating the list of channels assigned to a given user).O(n)
.class SampleChatServer implements ChatServer {
ConcurrentMap<String, Set<Connection>> channelsToConnections = new ConcurrentHashMap<>();
/** The process will call this when a user wants to join a list of channels */
void handleJoinChannels(Connection conn, List<String> channelsToJoin) {
for(String chan: channelsToJoin) {
// Note: ConcurrentSet does not exist, let's assume it does for simplicity
// Option 1
channelsToConnections.computeIfAbsent(chan, chan -> new ConcurrentSet<>());
channelsToConnections.get(chan).add(conn);
// Option 2
channelsToConnections.compute(chan, (chan, m) -> (m == null) ? new ConcurrentSet<>(conn) : m.add(conn));
// Option 3
synchronized (channelsToConnections) {
if(!channelsToConnections.containsKey(chan)) {
channelsToConnections.put(new ConcurrentSet<>());
}
}
channelsToConnections.get(chan).add(conn);
// Wrong - if two invocations happens simultaneously with the same channel that does not exist,
// both could create a new set, one overriding the other
if(!channelsToConnections.containsKey(chan)) {
channelsToConnections.put(new ConcurrentSet<>());
}
channelsToConnections.get(chan).add(conn);
}
}
/** The process will call this when a user wants to send a message */
void handleRequestToSendMessage(Connection conn, Message message) {
// Basic solution which does not work well if e.g. one connection is very slow
for(Connection receiverConnection: channelsToConnections.get(message.channel)) {
receiverConnection.writeMessage(message);
}
}
}
This is just in case things go super quickly. Extend the interface to include:
/** The process will call this function when a new connection is established to the server */
void handleNewConnection(Connection conn, String username);
Here you can explore how to handle duplicate user connections - e.g. drop the old connection if a user with the same name connects again.
The candidate might initially do something like this in handleSendMessage()
:
for each connection in channel:
connection.writeMessage(message)
Ask: What could go wrong with this? The candidate should talk about that:
This should lead the candidate to adding some kind of queue mechanism, ideally with limited retry with back-off, and if last retry fails, disconnect the user.
This does not need to be fully implemented.
Ask: How many users can this server handle? The candidate should talk about:
Ask: What is the worst case scenario for our chat server in normal use? (Or in more detail, with a slight hint: What happens if a particular user is a part of hundreds of channels?) This should lead to something along the lines of: Many users, many channels, each user on each channel and each user sending messages on each channel frequently.
One possible solution is to add batching of outgoing messages - e.g. buffer messages in a queue, flush the queue every few seconds, sending a batch per user containing all messages in all channels they're part of.
The candidate should realize that even like this, our server could run out of resources. Then talk about some kind of load shedding - if we can't handle all user requests, something has to give to maintain at least partial service instead of being totally overwhelmed. Even if the candidate decides to use an event-based mechanism (epoll/kqueue), the queue might get so large the system runs out of resources.
The candidate should describe the various resources we may start running out of an how to load-shed against them e.g.:
This is our tentative guidance as of 2020-10-13, feel free to propose any changes or reach out to [email protected] with any thoughts/questions.
For grade 3, the candidate should complete stages 1 and 2 with a correct and reasonably performant solution, requiring few hints. Intuition about concurrency and being able to use basic concurrency primitives (locks) matters more than the implementation details in the chosen base programming language.
One of the signs of grade 4 is that they see performance issues upfront and later propose solutions on their own.
For grade 3, the candidate should explore the requirements and explicitly state some of the non-requirements. They should complete stages 1 and 2 with a correct and reasonably performant solution, requiring few hints. They should be able to correctly apply concurrency tools of the chosen base language. In stage 3, they should be able to describe and reason about the worst-case load scenario and propose some solutions.