Design the backend of the facebook posts search cake .
eminence that given the argument we should immediately notice that :
- at facebook level, the number of posts is too large to fit in a single machine.
- the number of posts is increasing constantly, our design needs to take this into consideration.
- as a design goal, we probably want to minimize the latency, while a relaxed consistency model might be acceptable.
typically, an industrial-level search backend might look like this :
Reading: Design facebook posts search
an architecture for a search backend A set details involved in the diagram above. I will explain each of them below .
The question node layer is typically a homeless layer. Its independent job is to translate the question request to the calls to the exponent layer. The raw keyword typed by drug user might be processed through a crowd of steps, like text standardization, question expansion, etc. The question node layer may besides contact early services, like the service that provides drug user preferences, geo-location, or search history, which are all signals that help the transformation. The response from each exponent node below may contain a list of rank mail ids. A question node is besides responsible for aggregating the partial derivative results to present a concluding result to the users .
An index node stores index data locally. Each position has a singular mail idaho, 1, 2, 3… An index maps a keyword to a tilt of post ids, with a ranking score. The index layer answers the doubt : given a list of keywords, and a rank algorithm, what ’ s the peak K mail ids ? democratic libraries like Lucene does this work for you.
flush the size of an index might be much smaller than the original posts ( the index only stores post id, it doesn ’ metric ton store the integral post ), it hush may exceed the capacity of a unmarried machine. For example, assume each position has a size of 1KB, a billion of posts cost about 1TB data, a typical size of index could be about 30 % of the mail size, so approximately it may cost 300GB. An index file of 300GB can be easily stored in a one storage optimized AWS EC2 machine ; however, applying sharding besides opens the possibility of accelerating the queries. For exemplar, if we partition the index data to 30 shards, each of them holds 10GB of data, then we can use the memory-mapped charge to make most data stay in memory. Given that, in the remaining part of the post we assume a partition ( or sharding ) is necessary. When it comes to sharding, some typical questions may be raised, as we design other distribute systems, like : how to partition the data ? how do we query each shard ? what if a shard fails ? etc. As common, we can partition the data by hash ( post id ) % number_of_shards. The question layer is responsible for sending the requests to all shards and aggregate the results. besides, to tolerate shard failure, a shard should be replicated on other nodes ( replica ). Depends on the consistency model we choose, we besides need to choose the right replication algorithm to keep all replica in synchronize. Each index node relies on the stream index data, and a custom-make rank model to answer a question. As shown in the diagram, ideally the index data should be updated to reflect the latest change in real-time. We will explore this character in more details below. Before returning the results to the question node, the index node besides uses its own ranking model to give a ranking score to each post. The results from all index nodes will be aggregated and ranked again on the question layer, before returning a final result to drug user. popular solutions, like the elasticsearch, uses similar architecture above. By default option, it uses a hash based sharding, and a primary-secondary rejoinder ( each update is forwarded to a primary replica first, then to all other replica ).
One challenge we have to address is, keep the index data up-to-date since new posts are creating, old posts are updating all the meter. We assume the very posts data are stored in a key-value memory such as Apache Cassandra, which used to be used widely in Facebook. The posts can be queried / updated by a wrapper Post military service ( a side notice : in industry it ’ randomness common to build a data service on lead of the database, even though the data service itself does nothing more than forwarding the request to DB. by doing thus we don ’ t have to affect users if we decide to switch to another DB ). A coarse techniques used in industry to export the data from DB incrementally is called the Change Data Capture. Using this proficiency, any change on DB could be captured by the real-time updater, and translated to index update request, and sent to the index layer, in about real-time. As a companion to the real-time updater, we often need an offline update pipeline, to update the exponent data in batch mode. This is chiefly because 1. it ’ s normally hard to make the real-time updater 100 % dependable, there might be significant data loss or incorrect update, due to hardware failure or race condition ( ex., two updates are sent to the index data in turn back holy order, where the cold data is updated belated ) ; 2. sometimes we merely want to exponent the doctor differently, so we need to re-compute the index and roll it out to all index nodes. In general, having such offline grapevine makes the index layer more robust — if we have a daily offline subcontract to update the index as a whole, any mistake we made wouldn ’ thyroxine stopping point for more than one sidereal day .
finally, since the question layer returns post ids only, the question frontend should query the Post service by the ids, and do proper collection. It ’ randomness much called the presentation layer .