# SYS2: Single machine web crawler
Question Owners: Hossein Falaki, Nick Lanham
**Warning**: This question has leaked and is being phased out: <https://leetcode.com/problems/web-crawler-multithreaded/>
## Overview
* This question tests the candidate’s systems knowledge in the areas of:
* Multithreading and synchronization
* Network and disk I/O
* Candidates usually need the full hour to answer this question. You
should not spend much time on anything other than the question.
* The question has multiple stages:
* [Mandatory] Single threaded crawling algorithm (BFS or DFS), 5-15 minutes
* [Mandatory] Concurrent version
* [Optional] Followups for intermediate, senior, advanced candidates (e.g. error handling)
* Notes for new-grads and/or junior candidates
* The initial graph algorithm is not the focus of the problem and should be completed within 15
minutes. Do not hesitate to give hints after 10 minutes.
* These candidates will often have limited exposure to threading. It's okay to give them lots of
hints and help them arrive at a sensible threading API (see API section below).
* We do expect them to at least realize they need to do some amount of concurrency to produce code
that is reasonably performant.
* This question can be solved entirely with locks as the only concurrency primitive (plus some notion of worker threads).
* This is a question in which candidates can sometimes get really *stuck*. If you've burned about
half of the interview time and feel the candidate is spinning their wheels and/or simply can't
get on track for a good solution, we encourage you to tell them to scrap what they've done and
start from a clean slate, likely giving them the threading API below.
* We have found that Google and Amazon ask their candidates this question.
* It is suggested to verify at the beginning whether the candidate has seen or done this problem before. However, note that we have our own special flavor of the question.
* Ask the candidate to write code (or pseudocode) so that you can point to specific lines and ask the follow-up questions.
* If they choose Python, you can provide a simple hypothetical abstraction for threads and thread pools.
* The solution provided here is just one example of a correct answer. There can be multiple ways of implementing the answer.
## Notes on setup
There are some useful things to emphasize at the start of this question to keep candidates on
track. You should say the following things in one way or another:
* The algorithm for this question isn't hard. We're not looking at what algorithm you come up with,
but rather how you design a system, think about its components and how they hang together, think
about debugging, performance, etc.
* We're running on a single big machine.
* No need for a distributed crawler.
* Avoid saying "multi-core" straight away. In the second phase, if they are
struggling to realize that the solution should be multi-threaded, mention that.
* Remind them again to write pseudo-code, but emphasize that it's just so we can be explicit about
the order things happen in, not so we can nit-pick on syntax, or anything of that nature.
* For remote interviews that use Coderpad, you can ask them to set the language to 'text' so that
it's absolutely clear that this is pseudocode. Alternatively, you may give them the option to
choose a language that will highlight their particular style of pseudocode well, but emphasize
that it doesn't need to be compilable code.
* Feel free to use any standard library functions/objects/structures/abstractions from any programming
language, even if it is not the language that they chose to use for pseudocode.
## Question
You are given a single URL, `start_url`, and a depth value, `K`. Write a program that downloads the
web page at `start_url` and the pages it links to, up to `K` levels deep.
The result is a directory with all the content written into it (as files or sub-directories).
You can assume the following functions are available to you:
```Scala
// This syntax is in Scala, but using Scala is not at all required
/**
* Synchronously fetches the content of a page identified by
* input URL from the internet and returns the HTML text.
*/
def fetch(url: URL): HTML
/**
* Parses the html page and returns all the links inside it.
*/
def parse(html: HTML): Seq[URL]
/**
* Writes the content to a file corresponding to the given URL.
*/
def save(url: URL, content: HTML): Unit
```

**Points to emphasize:**
* Let the candidate start with the simple case of the single-threaded version. Do not stop them. The purpose of the single-threaded solution is to make sure they fully understand the problem and its requirements. We expect all candidates to finish the single-threaded solution in 5-15 minutes.
* The single-threaded solution should avoid fetching the same URL twice. A good candidate will start thinking about this from the beginning, but if they do not, the interviewer should raise this issue and ask them to think about it. Ideally the candidate should articulate why this is an issue in terms of resource consumption.
Here is a sketch of a simple, unoptimized solution in Python-ish code using breadth-first search (BFS):

```python
from collections import deque

visited = set()
to_visit = deque([(start_url, 0)])    # FIFO queue of (url, depth) pairs
while to_visit:
    url, depth = to_visit.popleft()   # take from the head to keep BFS order
    if depth < K and url not in visited:
        visited.add(url)
        html = fetch(url)
        save(url, html)
        for next_url in parse(html):
            to_visit.append((next_url, depth + 1))
```
Other common solution variants and optimizations include:

* Explicit iteration over levels. Iterate `K` times, each time processing the URLs at the current level and producing the list of URLs for the next level (see the sketch after this list). A set of visited nodes is still required, but there is no need to track the depth of each URL.
* Recursive depth-first search (DFS). This approach has a nasty pitfall (see below), and it is less natural to convert into a parallel version.
* [Optimization] Marking URLs as visited when inserting into the queue. This prevents duplicate URLs from entering the queue. This is more robust: we can fit all unique URLs in memory, but not necessarily all links to the same URL.
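A minimal Python-style sketch of the level-by-level variant, assuming `fetch`, `parse`, and `save` are available as described in the problem statement:

```python
# Level-by-level BFS: fully process one level before moving on to the next.
# fetch(), parse(), and save() are assumed to be provided by the problem statement.
def crawl_by_levels(start_url, K):
    visited = set()
    current_level = [start_url]
    for _ in range(K):                  # levels 0 .. K-1, matching `depth < K` above
        next_level = []
        for url in current_level:
            if url in visited:
                continue
            visited.add(url)
            html = fetch(url)
            save(url, html)
            next_level.extend(parse(html))
        current_level = next_level
```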
Candidates may try to solve this problem with DFS, avoiding duplicate work by tracking a set of visited nodes. This does not work because it can cut the exploration short. For example, if a node is first seen at level 5, it can later be seen again at level 3, but by that point all of its transitive links have already been processed as if they started at level 6 rather than level 4. BFS exploration does not have this problem because every node is first visited at its minimum depth. Depending on the seniority of the candidate, you can provide the following hint:
```
start -> b -> c -> d
  |           ^
  |           |
  +-----------+
```
Noticing or solving this problem is not required for a passing grade. The candidates get extra credit if they do notice and/or solve it.
A possible solution is to track not only that a URL is visited, but also the level at which it was visited. A URL could then be re-visited even if it has already been visited, if it is seen at a lower level than the one at which it was visited before. Re-visiting can fetch and parse the URL again, or it can be optimized by reading the HTML from disk or by tracking the links graph explicitly in memory. Re-visiting may trigger re-visiting of links from that page as well. For those pages, the same consideration applies: the visits should only happen if the linked pages were previously seen only at a higher level.
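One possible Python-style sketch of the level-tracking idea (illustrative only; re-fetching on a re-visit is shown here, but reading the HTML back from disk or tracking the link graph in memory works too):

```python
# Track the lowest level at which each URL has been processed; re-visit a URL
# only if it is now seen at a strictly lower level than before.
from collections import deque

best_level = {}                        # url -> lowest level at which it was processed
to_visit = deque([(start_url, 0)])
while to_visit:
    url, level = to_visit.popleft()
    if level >= K:
        continue
    if url in best_level and best_level[url] <= level:
        continue                       # already processed at this level or lower
    best_level[url] = level
    html = fetch(url)                  # on a re-visit this could re-read from disk instead
    save(url, html)
    for link in parse(html):
        to_visit.append((link, level + 1))
```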
Ask the candidate to identify the most obvious performance problem with their single-threaded solution. We expect candidates to identify the network I/O bottleneck as the primary problem, and propose multi-threading (or parallelizing) as a solution.
You can formulate this as: 'You run this on a big server and it takes 24 hours to run. What takes the longest in this program? What is your guess?' Just asking about performance can trip the candidate into trying to optimize for memory. It is good to ask for concrete numbers here (for example, roughly how long a network fetch takes compared to a disk write).
Many candidates will home in on the disk writes as the main problem, and will try to parallelize those. That is a much simpler problem, but not the one that the candidate needs to tackle. You can steer them away from this by mentioning that (a) the disk is a super fast SSD, and/or (b) the writes go into the buffer cache of the OS, and the writes will happen in the background anyway.
Good candidates will also suggest profiling to identify the problem. If they do, or if your candidate does not notice the I/O bottleneck, you may hint that the CPU utilization is really low.
When candidates have identified the fetch bottleneck and have decided to use parallelism, ask them to modify their code to implement it. (If they realize that their changes are going to give incorrect results on the fringe, tell them that we will consider that later.)
When the candidate starts using threads, ask them to describe their desired threading API. We have found that different people come up with different abstractions; this tells us about the way they think.
Most candidates come up with a design in which work is taken from a queue by worker threads while a main thread monitors the work. This version is slightly incorrect (see the "DFS Pitfall" above), but it is accepted for this exercise. Here is a sample solution that makes minimal use of concurrency primitives and that an undergrad can code.
```
// Shared state
queue: ConcurrentQueue<(url, level)>
seen: Set<url>
K: const int = <X>
pool = new Pool(threadCount) // discuss what threadCount should be (likely higher than the
                             // number of cores, because the jobs are dominated by blocking I/O)

class ScrapeJob(url, level) extends Runnable {
  // Scrape one URL; roughly the same body as the single-threaded loop.
  def run() {
    if (level < K) {
      val html: String = fetch(url)
      save(url, html)
      val links: Seq[URL] = parse(html)
      links.foreach { link =>
        queue.enqueue((link, level + 1))
      }
    }
  }
}

// Main loop: pull URLs off the queue and hand unseen ones to the pool.
def main(start_url) {
  queue.enqueue((start_url, 0))
  while (pool.active() > 0 || !queue.empty()) {
    val item = queue.dequeue()   // assume this returns null if the queue is momentarily empty
    if (item != null && !seen.contains(item.url)) {
      seen.add(item.url)
      pool.enqueue(new ScrapeJob(item.url, item.level))
    }
  }
  println("Done")
}
```
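For reference, here is one way the same queue-plus-pool structure could look in runnable Python using the standard library's `concurrent.futures`. `fetch`, `parse`, and `save` are assumed to be provided as in the problem statement, and this sketch still has the level-race caveat discussed further below.

```python
# A Python sketch of the same design: workers fetch/save/parse, the main thread
# owns the `seen` set and decides what to submit. fetch/parse/save are assumed given.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def crawl(start_url, K, thread_count=32):
    work = queue.Queue()          # (url, level) pairs discovered by workers
    seen = set()                  # only touched by the main loop below
    outstanding = 0               # jobs submitted but not yet finished
    lock = threading.Lock()

    def scrape(url, level):
        nonlocal outstanding
        try:
            html = fetch(url)
            save(url, html)
            for link in parse(html):
                work.put((link, level + 1))
        finally:
            with lock:
                outstanding -= 1

    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        work.put((start_url, 0))
        while True:
            try:
                url, level = work.get(timeout=0.1)
            except queue.Empty:
                with lock:
                    if outstanding == 0 and work.empty():
                        break     # nothing queued and nothing running: done
                continue
            if level < K and url not in seen:
                seen.add(url)
                with lock:
                    outstanding += 1
                pool.submit(scrape, url, level)
```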
For new-grads, junior candidates, candidates who are stuck and can't figure out threading, and/or any other cases where you feel the candidate shouldn't reasonably have deep knowledge, you can give them the following threading API to (hopefully) set them on the right track.
```
pool(threads: Int) => Pool   // return a new thread pool
Pool.enqueue(func: Runnable) // enqueue the given runnable to be run on the pool (thread-safe)
Pool.active()                // return the number of active threads in the pool (thread-safe)
```
You should explain that `Runnable` can be any object they like (so they can use their own object to store whatever state the thread needs to access to perform its task).
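If the candidate is working in Python, one hypothetical way to back this abstraction with the standard `threading` module (purely illustrative; the interview only needs the three signatures above) might look like:

```python
# A hypothetical, minimal implementation of the Pool API above, built on the
# standard threading module. Only meant to make the abstraction concrete.
import queue
import threading

class Pool:
    def __init__(self, threads):
        self._tasks = queue.Queue()
        self._outstanding = 0            # tasks enqueued but not yet finished
        self._lock = threading.Lock()
        for _ in range(threads):
            threading.Thread(target=self._worker, daemon=True).start()

    def enqueue(self, func):
        """Schedule a callable to run on a worker thread (thread-safe)."""
        with self._lock:
            self._outstanding += 1
        self._tasks.put(func)

    def active(self):
        """Tasks enqueued or still running (thread-safe). Slightly broader than
        'active threads', which conveniently avoids a shutdown race in the main loop."""
        with self._lock:
            return self._outstanding

    def _worker(self):
        while True:
            func = self._tasks.get()
            try:
                func()
            finally:
                with self._lock:
                    self._outstanding -= 1
```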
The level-tracking pitfall described above can appear in the parallel version as well. For instance, when worker threads are processing URLs from multiple levels at a time, they may "race" to insert the same link URL at different levels, and the first one wins. Error handling, which typically involves re-processing URLs at a later time, makes this problem worse.
This question has various possible follow-ups. These are all optional (even for senior+ candidates), as they are used primarily to further explore and identify strengths in the candidate's system knowledge depth and thought process.
Note: every candidate should get to at least talk about this part, even if they do not write code or come up with a full solution.
* What happens if `fetch()` deterministically fails for some pages (they are not reachable)?
* What happens if `save()` fails nondeterministically?
* What happens if `fetch()` is very slow?

A common solution for errors is to push the failing pages onto a separate queue for later re-processing. This solution is susceptible to the "DFS Pitfall" described earlier.
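A sketch of the retry-queue idea, reusing the `work` queue and `scrape` shape from the Python sketch above (hypothetical names; the bounded retry count is an assumption for illustration):

```python
# Sketch of error handling via a separate retry queue: failed fetches are retried
# a bounded number of times instead of being dropped. MAX_ATTEMPTS is illustrative.
from collections import deque

retry_queue = deque()                  # (url, level, attempts) tuples awaiting a retry
MAX_ATTEMPTS = 3

def scrape_with_retry(url, level, attempts=0):
    try:
        html = fetch(url)
    except Exception:
        if attempts + 1 < MAX_ATTEMPTS:
            retry_queue.append((url, level, attempts + 1))
        return
    save(url, html)
    for link in parse(html):
        work.put((link, level + 1))    # `work` is the shared queue from the sketch above
```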
You can often work the first question into the preamble to the multithreaded solution, when the candidate is figuring out what the bottleneck is. It is then easy to come back to benchmarking / profiling later and discuss different scenarios.
Implement the solution such that at any point in time the machine can shut down and restart. For clarity, provide the following interface:
```Scala
/**
 * This function is called by the OS just before your process is terminated.
 * You have 2 seconds before your process is shut down.
 */
def pause(): Unit

/**
 * This function is called by the OS to resume the crawler after pause() has been called.
 */
def resume(): Unit
```
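One illustrative Python sketch of `pause()`/`resume()`, assuming the crawler state is the pending `work` queue plus the `seen` set from the sketch above (it deliberately ignores jobs that are mid-fetch when `pause()` fires; discussing that gap is part of the follow-up):

```python
# Illustrative pause()/resume(): snapshot the pending queue and the visited set to
# disk, then rebuild them on startup. STATE_FILE and the JSON format are assumptions.
import json

STATE_FILE = "crawler_state.json"

def pause():
    pending = []
    while not work.empty():            # drain pending (url, level) pairs
        pending.append(work.get())
    with open(STATE_FILE, "w") as f:
        json.dump({"pending": pending, "seen": list(seen)}, f)

def resume():
    with open(STATE_FILE) as f:
        state = json.load(f)
    seen.update(state["seen"])
    for url, level in state["pending"]:
        work.put((url, level))
```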
Definition: A straggler is a thread that takes much longer to complete its task than the average thread.
Possible follow-up questions here include:

* What happens if a `fetch()` fails?
* What happens if a `fetch()` times out or is too slow?

You can tell the candidate that `save()` does not call `fsync`, i.e., that it is effectively asynchronous after data is copied into OS buffers.
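One illustrative way to bound a slow or straggling `fetch()` is to run it on a helper pool and give up after a timeout (a hypothetical wrapper; the candidate only needs to discuss the idea):

```python
# Hypothetical wrapper that bounds how long a single fetch may take. A timed-out
# fetch is treated as a failure and can be pushed onto the retry queue above.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_fetch_pool = ThreadPoolExecutor(max_workers=8)   # helper pool just for fetches

def fetch_with_timeout(url, timeout_s=10):
    future = _fetch_pool.submit(fetch, url)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()            # best effort; the underlying fetch may keep running
        return None                # caller treats None as a failed fetch
```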