Crawler | Notion

# SYS2: Single machine web crawler
Question Owners: Hossein Falaki, Nick Lanham

**Warning**: This question has leaked and is being phased out: <https://leetcode.com/problems/web-crawler-multithreaded/>

## Overview

* This question tests candidate’s system knowledge in areas of:
  * Multithreading and synchronization
  * Network and disk I/O
* Candidates usually need the full hour to answer this question. You
  should not spend much time on anything other than the question.
* The question has multiple stages:
  * [Mandatory] Single threaded crawling algorithm (BFS or DFS), 5-15 minutes
  * [Mandatory] Concurrent version
  * [Optional] Followups for intermediate, senior, advanced candidates (e.g. error handling)
* Notes for new-grads and/or junior candidates
  * The initial graph algorithm is not the focus of the problem and should be completed within 15
    minutes. Do not hesitate to give hints after 10 minutes.
  * These candidates will often have limited exposure to threading. It's okay to give them lots of
    hints and help them arrive at a sensible threading API (see API section below).
  * We do expect them to at least realize they need to do some amount of concurrency to produce code
    that is reasonably performant.
  * This question can be entirely solved only knowing concurrency locks (and some notion of worker threads).
* This is a question in which candidates can sometimes get really *stuck*. If you've burned about
  half of the interview time and feel the candidate is spinning their wheels and/or simply can't
  get on track for a good solution, we encourage you to tell them to scrap what they've done and
  start from a clean slate, likely giving them the threading API below.
* We have found that Google and Amazon ask this question from their candidates.
  * It is suggested to verify at the beginning if the candidate had seen or done this problem before. However, note that we have our special flavor of the question.
* Ask the candidate to write code (or pseudocode) so that you can point to specific lines and ask the follow-up questions.
* If they choose Python, you can provide a simple hypothetical abstraction for threads and thread pools.
* The solution provided here is just one example of a correct answer. There can be multiple ways of implementing the answer.

## Notes on setup
There are some useful things to emphasize at the start of this question to keep candidates on
track. You should say the following things in one way or another:
* The algorithm for this question isn't hard. We're not looking at what algorithm you come up with,
  but rather how you design a system, think about it's components and how they hang together, think
  about debugging, performance, etc.
* We're running on a single big machine.
  * No need for a distributed crawler.
  * Avoid saying "multi-core" straight away. In the second phase, if they are
    struggling to realize that the solution should be multi-threaded, mention that.
* Remind them again to write pseudo-code, but emphasize that it's just so we can be explicit about
  the order things happen in, not so we can nit-pick on syntax, or anything of that nature.
* For remote interviews that use Coderpad, you can ask them to set the language to 'text' so that
  it's absolutely clear that this is pseudocode. Alternatively, you may give them the option to
  choose a language that will highlight their particular style of pseudocode well, but emphasize
  that it doesn't need to be compilable code.
* Feel free to use any standard library functions/object/structures/abstractions from any programming
  language, even if it is not the language that they chose to use for pseudocode.

## Question
You are given a single URL `start_url` and depth value, `K`. Write a program that downloads all the
web pages at `start_url` and the pages it points to up to `K` levels.
The result is a directory with all the content written into it (as files or sub-directories).

You can assume the functions are available to you:

```Scala
// This syntax is in Scala, but using Scala is not at all required

/**
 * Synchronously fetches the content of a page identified by
 * input URL from the internet and returns the HTML text.
 */
 def fetch(url: URL): HTML

/**
 * Parses the html page and returns all the links inside it.
 */
 def parse(html: HTML): Seq[URL]

/**
 * Writes the content to a file corresponding to the given URL.
 */
 def save(url: URL, content: HTML): Unit

Points to emphasize:

Imagine the Internet is much simpler than it is in reality, there are no dynamic pages, scripts, images, etc., only plain HTML and links.
If the candidate asks: we expect all the URLs that are visited to fit comfortably in memory, but not the web pages.
Some senior candidates say immediately that fetch() should be asynchronous. Make it clear to them that the API is synchronous and cannot be modified. If you feel comfortable with their ability to do so, you should allow senior candidates to jump to the multi-threaded version immediately, but the API must remain synchronous.

[Mandatory] Stage 1: Single-thread Solution

Let the candidate start with the simple case of the single-threaded version. Do not stop them. The purpose of the single threaded solution is to make sure they fully understand the problem and its requirements. We expect all candidates to finish the single-threaded solution in 5-15 minutes.

The single threaded solution should avoid fetching the same URL twice. A good candidate would start thinking about this from the beginning, but if they do not, the interviewer should raise this issue and ask them to think about it. Ideally the candidate should articulate why this is an issue in terms of resource consumption.

Possible Hints

Think of this as a graph traversal problem.

Example solutions

Here is a sketch of a simple, unoptimized solution in python-ish language using breadth-first search (BFS):

visited = set()
to_visit = [(start_url, 0)]
while to_visit:
  (url, depth) = to_visit.pop()
  if depth < K and url not in visited:
    visited.add(url)
    html = fetch(url)
    save(url, html)
    candidate_urls = parse(html)
    for next_url in candidate_urls:
      to_visit.append_tail((next_url, depth + 1))

Other common solution variants and optimizations include:

Explicit iteration over levels. Iterate K times, each time processing the URLs at level K and producing the list of URLs at level K+1. A set of visited nodes is still required, but no need to track the depth for each URL.
Recursive depth-first search (DFS). This approach has a nasty pitfall, see below. Also, it is less natural to convert this approach into a parallel version.
[Optmization] Marking URLs as visited when inserting into the queue. This prevents duplicate URLs from entering the queue. This is more correct, as we can fit all unique URLs in memory but not necessarily all links to the same URL.

[Extra credit] The DFS Pitfall

Candidates may try to solve this problem with DFS, avoiding duplicate work by tracking a set of visited nodes. This does not work because it cuts the exploration short. E.g. if a node is first seen at level 5, it can later be seen at level 3, but at that point all of its transitive links have already been processed as if they were at level 6, not level 4. BFS exploration does not have this problem because the nodes are visited in an optimal order. Depending on the seniority of the candidate, you can provide the following hints:

This is a graph traversal, not a tree.
Do we revisit nodes?
How do we deal with cycles in the graph?
Finally, give an example. Usually, candidates at this point draw a tree, you just need to add an extra edge from the root to a leaf

start -> b -> c -> d
  |           ^
  |           |
  +-----------+

Noticing or solving this problem is not required for a passing grade. The candidates get extra credit if they do notice and/or solve it.

A possible solution is to track not only that a URL is visited, but also the level at which it was visited. A URL could then be re-visited even if it has already been visited, if it is seen at a lower level than the one at which it was visited before. Re-visiting can fetch and parse the URL again, or it can be optimized by reading the HTML from disk or by tracking the links graph explicitly in memory. Re-visiting may trigger re-visiting of links from that page as well. For those pages, the same consideration applies: the visits should only happen if the linked pages were previously seen only at a higher level.

[Mandatory] Stage 2: Multithread Solution

Ask the candidate to identify the most obvious performance problem with their single-threaded solution. We expect candidates to identify the network I/O bottleneck as the primary problem, and propose multi-threading (or parallelizing) as a solution.

You can formulate this as 'You run this on a big server and it takes 24 hours to run. What takes the longest in this program? What is your guess?'. Just asking about performance can trip the candidate into trying to optimize for memory. It is good to ask numbers here:

inexperienced candidates struggle to say anything
decent candidates mention ~ 10ms for writes, ~ 5ms for parsing and ~ 10ms-100ms for fetching
good candidates identify that the write is stable and that we are fetching from the internet -> the p90 of fetching can be much higher

Many candidates will home in on the disk writes as the main problem, and will try to parallelize those. That is a much simpler problem, but not the one that the candidate needs to tackle. You can steer them away from this by mentioning that (a) the disk is a super fast SSD, and/or (b) the writes go into the buffer cache of the OS, and the writes will happen in the background anyway.

Good candidates will also suggest profiling to identify the problem. If they do, or if your candidate does not notice the I/O bottleneck, you may hint that the CPU utilization is really low.

When candidates have identified the fetch bottleneck and have decided to use parallelism, ask them to modify their code to implement it. (If they realize that their changes are going to give incorrect results on the fringe, tell them that we will consider that later.)

When candidate starts using threads, ask them to identify their desired threading API. We have found that different people come up with different abstractions, this will tell us about the way they think.

Possible Hints

Big Hint: Use a driver thread to schedule work, and workers to actually fetch/parse/save the data. (Use this only as a last resort, and withdraw points when you do.)
If they can't get the termination right, e.g. it terminates right away: Print "Done." after the last URL has been fully processed. Can you verify that it actually prints that at the right time?

Example Solution

Most candidates come up with something for which the work is taken from a queue by workers, and a main thread monitors the work. This version is slightly incorrect (see the "DFS Pitfall" above), but it is accepted for this exercise. Here is a sample solution that makes minimalistic use of concurrency primitives and that an undergrad can code.

// Shared State
queue: ConcurrentQueue<(url, levels)>
seen: Set<url>
K: const int = <X>
pool = new Pool(threadCount)  // should talk about what this should be set to (likely higher than # of cores in the machine, due to blocking operations)

class ScrapeJob(url, level) extends Runnable {
  // scrape a url, this should be about the same in the single vs. multithreaded case
  def run() {
    if (level < K) {
      val html: String = fetch(url)
      save(url, html)
      val links: Seq[URL] = parse(html)
      links.foreach { link =>
        queue.enqeue((link, level+1))
      }
    }
  }
}

// main loop to start threads for each url encountered
def main(url) {
  queue.enqueue((start_url, 0))
  while pool.active() > 0 || !queue.empty() {
    url, level = queue.dequeue()
    if (!seen.contains(url)) {
      seen.put(url)
      pool.enqueue(new ScrapeJob(url, level))
    }
  }
  println("Done")
}

Threading API

For new-grads, junior candidates, candidates who are stuck and can't figure out threading, and/or any other cases where you feel the candidate shouldn't reasonably have deep knowledge, you can give them the following threading API to (hopefully) set them on the right track.

pool(threads: Int) => Pool     // return a new thread pool
Pool.enqueue(func: Runnable)   // enqueue the given runnable to be run on the pool (thread-safe)
Pool.active()                  // return the number of active threads in the pool (thread-safe)

You should explain that Runnable can be any object they like (so they can use their own object to store whatever state the thread needs to access to perform its task)

[Advanced] The DFS Pitfall Revisited

In the parallel version the same problem can appear as well. For instance when worker threads may work on URLs from multiple levels at a time, they may "race" to insert the same link URLs at different levels, and the first one will win. Error handling, which typically involves re-processing URLs at a later time, makes this problem worse.

[Not for Junior candidates] Stage 3: Follow-Up Questions

This question has various possible follow-ups. These are all optional (even for senior+ candidates), as they are used primarily to further explore and identify strengths in the candidate's system knowledge depth and thought process.

[Intermediate] Ask about I/O failure modes and how the solution handles those

Note: every candidate should get to at least talk about this part if not writing code or coming up with a solution.

What if fetch() deterministically fails for some pages (they are not reachable)?
What if save() fails nondeterministically?
What if fetch() is very slow?

A common solution for errors is to push the pages on a separate queue for later re-processing. This solution is susceptible to the "DFS Pitfall" described earlier.

[Senior] How can we benchmark and optimize this in different conditions?

You can often work the first question into the preamble to the multithreaded solution, when the candidate is figuring out what the bottleneck is. It is then easy to come back to benchmarking / profiling later and discuss different scenarios.

We benchmark this implementation and find that CPU utilization is 5%, and network utilization is 10%. How can we improve performance? (The first step is multithreading, the second step is tuning the level of parallelism.)
The crawler sometimes runs out of memory. How do you figure out what to improve? When the candidate suggests memory profiling, you may describe profiles that point towards different issues such as huge raw HTML sizes, large URLs, large URL counts, or the common inefficiency where the crawler pushes duplicate URLs onto the queue and only deduplicates them when they are pulled from the queue.

[Senior] Ask to implement a resumable version

Implement the solution such that any point in time the machine can shutdown and restart. For clarity provide the following interface:

/**
 * This function is called by the OS just before your process is terminated.
 * You have 2 seconds before your process is shut down.
 */
def pause(): Unit

/**
 * This function is called by the OS to resume the crawler after pause() has been called
 */
def resume(): Unit

[Advanced] How to mitigate stragglers

Definition: A straggler is a thread that takes much longer to complete its task than the average thread.

Identify cases where stragglers could arise.
For each case propose solutions (no need to write code although if it translates to code changes it is a bonus point).

Common Missteps

Terminal case
- Not thinking about the terminal case
- Main thread exiting before all children threads have finished (child thread hasn’t enqueued new threads but parent exits because queue is empty)
Thread starvation
- If they use the same thread to process children, they will prevent the other threads from being able to surface them. This could lead to one thread with a long backlog and the other thread starved.
Thread exhaustion
- Creating a new thread for each link (not considering system limits).
Not handling failures
- Network failure handling
  - What if fetch() fails?
  - What if fetch() times out or is too slow?
  - Retry
    - In the processing thread or pushed to parent to be rescheduled?
  - Write to Disk
  - Parsing
Duplicates
- Do they handle the case when there could be duplicate sub pages?
Checkpointing
- When saving to disk, do they mark saved before/after saved?

Grading Guidelines

For a 3 or 4 the candidate must come up with the multithreaded solution that handles not re-scraping the same url twice.
If you give the threading api, new-grad/junior can get a 3, but higher levels should be failed if you have to get that explicit.
Things that are "extra credit":
- Pointing out the trade-off between distributed vs. central state management.
- Noticing and solving the "DFS Pitfall" described above.
- Noticing that save() does not specify that it includes an fsync, i.e., that it is effectively asynchronous after data is copied into OS buffers.
- Any of the other followups.