Scaling LinkedIn's User Restriction System to Handle 5 Million Queries Per Second

Disclaimer: This article is based on content from the LinkedIn Engineering Blog and ByteByteGo. Full acknowledgment goes to the LinkedIn engineering team for the technical insights; for the original sources, please refer to the links in the references section at the end of this article. We have analyzed the information and added our own insights. Should you notice any inaccuracies or missing information, kindly leave a comment and we will address it as promptly as possible.

A key objective of LinkedIn is to maintain a secure and professional atmosphere for its users.

Central to this endeavor is a system known as CASAL.

CASAL, short for Community Abuse and Safety Application Layer, serves as the initial barrier against malicious entities and adversarial threats. This system integrates technological solutions with human insight to detect and thwart detrimental actions.


The system comprises several key components, each playing a vital role in maintaining safety and integrity:

ML Models: Machine learning models analyze user behavior patterns to identify unusual or suspicious activities. For example, if a user suddenly sends hundreds of connection requests to strangers or repeatedly posts harmful content, the system flags these actions for further review.

Rule-Based Systems: These systems operate on predefined rules, acting as guidelines to determine acceptable behavior. Actions or content that violate LinkedIn’s policies—such as hate speech or spam—automatically trigger alerts for immediate attention.

Human Review Processes: While automation is powerful, human expertise is essential. A dedicated team of reviewers evaluates flagged activities and makes decisions on borderline cases, ensuring nuanced and accurate judgments.

Multi-Faceted Restrictions: Harmful activities vary in nature, so LinkedIn employs a range of restrictions tailored to the severity of the issue. For instance, a user might be temporarily restricted from sending connection requests, while more severe cases, such as posing a significant threat, could result in permanent account suspension.

Together, these elements create a multi-layered defense system, safeguarding LinkedIn’s community from abuse and preserving its professional and trusted environment for networking.

In this article, we’ll explore the design and evolution of LinkedIn’s enforcement infrastructure in greater detail.


Figure 1: LinkedIn CASAL & restriction management integration

Evolution of Enforcement Infrastructure

LinkedIn’s restriction enforcement system has undergone three major generations of development. Let’s delve into each phase in detail.

First Generation

In its early stages, LinkedIn relied on a relational database (Oracle) to store and manage restriction data.

Restriction details were stored in Oracle tables, with distinct types of restrictions separated into individual tables for improved organization and manageability. CRUD (Create, Read, Update, Delete) workflows were implemented to manage the lifecycle of restriction records, ensuring accurate updates and timely removal when needed.

Refer to the diagram below for a visual representation:


Figure 2: Relational database schema
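
For illustration, a lookup in this design might have resembled the JDBC sketch below. The table and column names (connection_request_restrictions, member_id, expires_at) are hypothetical stand-ins, not LinkedIn's actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

/** Hypothetical DAO for one restriction type, stored in its own table. */
public class ConnectionRequestRestrictionDao {
    private final Connection conn;

    public ConnectionRequestRestrictionDao(Connection conn) {
        this.conn = conn;
    }

    /** Create: record a restriction with an expiry time. */
    public void create(long memberId, Instant expiresAt) throws SQLException {
        String sql = "INSERT INTO connection_request_restrictions (member_id, expires_at) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, memberId);
            ps.setTimestamp(2, Timestamp.from(expiresAt));
            ps.executeUpdate();
        }
    }

    /** Read: is this member currently restricted? */
    public boolean isRestricted(long memberId) throws SQLException {
        String sql = "SELECT 1 FROM connection_request_restrictions WHERE member_id = ? AND expires_at > ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, memberId);
            ps.setTimestamp(2, Timestamp.from(Instant.now()));
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    /** Delete: lift a restriction early. */
    public void delete(long memberId) throws SQLException {
        String sql = "DELETE FROM connection_request_restrictions WHERE member_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, memberId);
            ps.executeUpdate();
        }
    }
}
```

Every enforcement check in this design costs a round trip to the database, which is exactly the pressure point described next.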

Nonetheless, this method encountered several obstacles:

As LinkedIn expanded and shifted towards a microservices architecture, the relational database strategy struggled to meet the escalating demands.

The system became burdensome because of Oracle's limitations in managing high volumes of queries while maintaining low latency.

Server-Side Cache Implementation

To tackle the scaling issues, the team implemented server-side caching, which drastically reduced latency by cutting down the need for repeated database queries.

A cache-aside strategy was adopted, functioning as follows:

1. When restriction data was requested, the system first checked the in-memory cache.

2. If the data was available in the cache (a cache hit), it was delivered immediately.

3. If the data was not found (a cache miss), it was retrieved from the database and asynchronously updated in the cache for future requests.

Refer to the diagram below for an illustration of the server-side cache approach:


Figure 3: Server-side cache for restrictions data

Restrictions were given predefined TTL (Time-to-Live) values to ensure that cached data was periodically refreshed.
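
Here is a minimal sketch of this cache-aside flow with TTL expiry. The Database interface and Restriction type are hypothetical stand-ins, not LinkedIn's actual code:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class RestrictionCache {
    /** Hypothetical stand-ins for the real data layer. */
    interface Database {
        Optional<Restriction> fetchRestriction(long memberId);
    }
    record Restriction(long memberId, String type) {}

    private record Entry(Optional<Restriction> value, long expiresAtMillis) {}

    private final ConcurrentHashMap<Long, Entry> cache = new ConcurrentHashMap<>();
    private final Database db;
    private final long ttlMillis;

    public RestrictionCache(Database db, long ttlMillis) {
        this.db = db;
        this.ttlMillis = ttlMillis;
    }

    public Optional<Restriction> get(long memberId) {
        long now = System.currentTimeMillis();
        Entry entry = cache.get(memberId);
        if (entry != null && entry.expiresAtMillis() > now) {
            return entry.value();                                 // cache hit: serve from memory
        }
        Optional<Restriction> fromDb = db.fetchRestriction(memberId); // cache miss: query the database
        // Populate the cache asynchronously so the caller is not blocked on the write;
        // the TTL forces stale entries to be periodically refreshed.
        CompletableFuture.runAsync(() -> cache.put(memberId, new Entry(fromDb, now + ttlMillis)));
        return fromDb;
    }
}
```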

However, this method had its drawbacks:

The server-side cache was not distributed, so each host had to manage its own cache.

While this system performed well under low traffic, it struggled to sustain the required cache hit rates as query volumes grew, prompting the need for further enhancements.

Introduction of Client-Side Caching

Expanding upon the server-side caching, LinkedIn incorporated client-side caching to further improve performance. This method allowed upstream applications (such as LinkedIn Feed and Talent Solutions) to hold their own local caches.

Refer to the diagram below:


Figure 4: Client-side cache combined with server-side cache for restrictions data

To support this, a client-side library was created to cache restriction data directly on application hosts, reducing reliance on the server-side caches.

However, this method also introduced its own set of challenges:

  • Client-side caching added operational complexity, as engineers had to ensure consistency between client and server caches.
  • Refresh operations placed additional strain on the database, particularly during updates or restarts when caches needed to be reloaded.


Adopting Full Refresh-Ahead Caching

To address the limitations of client-side caching, the team implemented a full refresh-ahead cache model.

In this model, each client stored all restriction data locally, thereby eliminating the need for repeated database queries. A polling system was established to regularly check and update the cache to ensure its accuracy.

This approach significantly enhanced response times, primarily because all necessary data was immediately accessible on the client side, avoiding the need for network calls.
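
A sketch of the refresh-ahead idea follows. The loadAllRestrictions() bulk-load call is a hypothetical placeholder, not LinkedIn's actual client library:

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class FullRefreshAheadCache {
    /** Hypothetical bulk-load interface: memberId -> restriction type. */
    interface Database {
        Map<Long, String> loadAllRestrictions();
    }

    private final AtomicReference<Map<Long, String>> snapshot = new AtomicReference<>(Map.of());
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public FullRefreshAheadCache(Database db, long refreshIntervalSeconds) {
        snapshot.set(db.loadAllRestrictions());   // expensive bootstrap on startup
        scheduler.scheduleAtFixedRate(
                () -> snapshot.set(db.loadAllRestrictions()), // periodic full refresh via polling
                refreshIntervalSeconds, refreshIntervalSeconds, TimeUnit.SECONDS);
    }

    /** Reads never touch the network: lookups hit the local snapshot. */
    public boolean isRestricted(long memberId) {
        return snapshot.get().containsKey(memberId);
    }
}
```

Note how the costs listed below fall directly out of this design: the full snapshot must fit in memory on every client, and each refresh re-reads the entire dataset from the database.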

However, this method came with its own set of limitations and compromises:

  • The memory requirement was considerable, as each client had to have sufficient memory to house the entire dataset.
  • Application restarts or deployments led to high resource usage as the system reloaded all data, creating performance bottlenecks.
  • Refresh operations periodically increased the load on the database, causing latency peaks and putting additional strain on the Oracle database infrastructure.

Utilization of Bloom Filters

To tackle challenges related to scalability and efficiency, LinkedIn integrated Bloom Filters into their system.

A Bloom Filter is a probabilistic data structure optimized for managing large data sets efficiently. Rather than storing the entire dataset, it utilizes a compact, memory-efficient encoding to check the presence of a restriction record. If a query aligns with a record in the Bloom Filter, the system then applies the restriction.

The primary benefit of employing a Bloom Filter is resource conservation. It reduces the memory footprint compared to conventional caching methods and speeds up query processing, enhancing system responsiveness.
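
The article does not specify which implementation LinkedIn used; as an illustration, Guava's BloomFilter expresses the membership check concisely:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class RestrictionBloomFilter {
    // Sized here for an assumed 100M records at a 1% false-positive rate (illustrative numbers).
    private final BloomFilter<Long> filter =
            BloomFilter.create(Funnels.longFunnel(), 100_000_000, 0.01);

    public void addRestriction(long memberId) {
        filter.put(memberId);
    }

    /**
     * A "false" answer is definitive (no restriction exists); a "true" answer
     * may be a false positive, so a positive result can be confirmed against
     * the authoritative store before enforcement.
     */
    public boolean mightBeRestricted(long memberId) {
        return filter.mightContain(memberId);
    }
}
```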

Nonetheless, Bloom Filters come with certain trade-offs:

Bloom Filters are probabilistic, which means there is a minor chance of false positives—situations where the filter mistakenly suggests the presence of a restriction.

Despite this limitation, the trade-off was considered worthwhile for LinkedIn's objectives, as it promoted enhanced performance and scalability.

Advancements in the Second Generation

As LinkedIn expanded and its platform increased in complexity, the engineering team recognized the limitations of the first-generation system, which was heavily dependent on relational databases and caching strategies.

To meet the needs of a billion-member platform, the second generation of LinkedIn’s restriction enforcement system was developed.

The objectives of the second-generation system included:

High QPS (4-5 million): The system was required to manage restriction enforcement for every request across LinkedIn’s broad range of products.

Ultra-Low Latency (<5 ms): The upgraded system utilized in-memory lookups to retrieve restriction data, eliminating the dependence on repetitive database queries. This significantly cut down response times, providing a smooth experience for LinkedIn members and applications.

High Availability (Five 9’s): The system was engineered to achieve 99.999% availability, a critical level of reliability necessary to enforce restrictions continuously. By distributing data across several nodes and data centers, the system reduced the likelihood of downtime.

Horizontal Scaling: To accommodate the increasing number of restrictions and requests, the system was capable of horizontal scaling by adding more nodes or servers.

Data Freshness and Synchronization: Real-time updates through Kafka ensured that all restriction data was kept synchronized across the platform, preventing any inconsistencies.


Transition to NoSQL Distributed Systems

A pivotal development in this phase was the shift of restriction data management to LinkedIn’s Espresso, a bespoke distributed NoSQL document store.

The move was necessitated by the limitations of relational databases like Oracle, which struggled with the high query throughput and latency demands of LinkedIn’s expanding platform. Espresso, as a distributed NoSQL system, offered enhanced scalability and performance while ensuring data consistency.

Espresso was seamlessly integrated with Kafka, LinkedIn’s real-time data streaming platform. Whenever a new restriction record was created, Espresso would generate Kafka messages that included the data and metadata of the record. These messages facilitated the real-time synchronization of restriction data across various servers and data centers, keeping the system updated with the most current information.
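
A sketch of what such a consumer might look like with the standard Kafka Java client is shown below. The topic name, serialization, and the tombstone convention for lifted restrictions are assumptions; the source states only that Espresso emitted Kafka messages for new records:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RestrictionUpdateConsumer {
    private final ConcurrentHashMap<Long, String> localCache = new ConcurrentHashMap<>();

    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // placeholder broker address
        props.put("group.id", "restriction-cache");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<Long, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("restriction-updates"));   // hypothetical topic name
            while (true) {
                for (ConsumerRecord<Long, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if (record.value() == null) {
                        localCache.remove(record.key());          // tombstone: restriction lifted
                    } else {
                        localCache.put(record.key(), record.value()); // new or updated restriction
                    }
                }
            }
        }
    }
}
```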

Refer to the diagram below to visualize the architecture of the 2nd generation restriction enforcement system.


Figure 5: Architecture of the second-generation restriction enforcement system

Challenges Faced by the Second-Generation System

Despite significant improvements, the second-generation system encountered operational hurdles, particularly under certain conditions:

Bootstrapping Data During Restarts:

  • Whenever a server or application was rebooted, it necessitated the reloading of all restriction records from the database back into memory.
  • This bootstrapping procedure was both resource-intensive and lengthy, often extending beyond 30 minutes for extensive datasets.
  • The intense load during this bootstrapping phase could overburden the Espresso database, adversely affecting overall system performance.

Managing Large-Scale Growth:

Although the system was adept at horizontal scaling, the enormous quantity of restriction records and the influx of requests during peak adversarial activities pushed the boundaries of the existing infrastructure.

Navigating CAP Theorem Trade-offs

While designing the second-generation system, LinkedIn faced critical architectural decisions involving trade-offs based on the CAP Theorem.

The CAP Theorem states that a system can only guarantee two out of the following three properties at any given time:

- Consistency (C): Every read operation retrieves the most recent write or returns an error.

- Availability (A): Every request receives a non-error response, even if some nodes in the system are unavailable.

- Partition Tolerance (P): The system continues to function despite network partitions or communication failures.

These trade-offs played a pivotal role in shaping the system's architecture.


Figure 6: CAP theorem trade-offs

LinkedIn chose to prioritize consistency (C) and availability (A) over partition tolerance (P), a decision driven by their goals for low latency and high reliability.

The restriction enforcement system required accurate and up-to-date data across the platform. Inaccurate or outdated restriction records could compromise security or degrade user experiences. Additionally, high availability was crucial to ensure seamless enforcement of restrictions, even during periods of peak activity.

Past experiences with partitioned databases had shown that network partitions could introduce latencies, conflicting with LinkedIn’s strict performance requirements, such as achieving ultra-low latency (<5 ms).

By leveraging Espresso, LinkedIn was able to manage consistency and availability more effectively within their system design. Integration with Kafka ensured real-time synchronization of restriction records across servers, maintaining consistency without introducing significant delays.

Third Generation

As LinkedIn continued to expand, the second-generation restriction enforcement system, despite being robust, started to falter under the growing pressure of data volume and adversarial threats.

In response, the LinkedIn engineering team rolled out a new generation of the restriction enforcement system. Refer to the diagram below:


Figure 7: Third-generation restriction enforcement system architecture

The third generation brought innovations aimed at optimizing memory usage, enhancing resilience, and speeding up the bootstrap process.

1. Utilizing Off-Heap Memory

A significant bottleneck in the second-generation system was the use of in-heap memory for data storage. This method led to complications during Garbage Collection (GC) cycles, which caused latency spikes and reduced overall system performance.

To resolve these issues, the third-generation system transitioned to using off-heap memory for data storage.

Unlike in-heap memory, off-heap memory sits outside the Java Virtual Machine’s (JVM) garbage-collected heap. By relocating data storage to off-heap memory, the system reduced the frequency and severity of GC events.

The advantages of this strategy included:

  • Off-heap memory allowed for more extensive storage of restriction data without burdening the JVM heap.
  • This adjustment lessened GC disruptions, yielding smoother and more consistent system performance.
  • Hosts were able to manage larger datasets without the increased risk of reaching memory limits.
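
As a rough illustration of the idea (not LinkedIn's implementation), restriction flags can be kept in a direct ByteBuffer, which is allocated outside the JVM heap and therefore never scanned or moved by the garbage collector:

```java
import java.nio.ByteBuffer;

/**
 * Sketch: one bit per member ID, stored off-heap. The single-buffer layout is
 * illustrative; a real system would shard beyond the 2 GB per-buffer limit
 * and handle concurrency.
 */
public class OffHeapRestrictionFlags {
    private final ByteBuffer buffer;

    public OffHeapRestrictionFlags(long maxMemberId) {
        // Direct allocation: this memory lives outside the GC-managed heap.
        buffer = ByteBuffer.allocateDirect((int) (maxMemberId / 8 + 1));
    }

    public void setRestricted(long memberId) {
        int byteIndex = (int) (memberId >>> 3);   // which byte holds this member's bit
        int bit = (int) (memberId & 7);           // which bit within that byte
        buffer.put(byteIndex, (byte) (buffer.get(byteIndex) | (1 << bit)));
    }

    public boolean isRestricted(long memberId) {
        int byteIndex = (int) (memberId >>> 3);
        int bit = (int) (memberId & 7);
        return (buffer.get(byteIndex) & (1 << bit)) != 0;
    }
}
```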

2. Venice and DaVinci Framework

To further enhance the system, the LinkedIn engineering team introduced DaVinci, an advanced client library, and integrated it with Venice, a scalable derived data platform.

Here’s how these tools work together:

- DaVinci functions as an eager cache, proactively loading all restriction data into memory at startup. This eliminates the need for frequent on-demand lookups, improving efficiency.

- Restriction data was stored in bitset-like data structures, which are highly memory-efficient. These structures enabled the system to handle large datasets while keeping memory usage to a minimum (see the sketch after this list).

- Venice, a distributed platform for derived data, facilitated seamless integration and synchronization of restriction data. It allowed DaVinci to fetch and store data efficiently, ensuring high-speed performance even during peak activity periods.
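
The bitset idea can be illustrated with java.util.BitSet; this shows the data layout only and is not the DaVinci/Venice API:

```java
import java.util.BitSet;

public class BitsetRestrictionStore {
    // One BitSet per restriction type; the bit index is the member ID.
    private final BitSet messagingRestricted = new BitSet();
    private final BitSet accountSuspended = new BitSet();

    /** Eagerly populate at startup, e.g. from a bulk feed of restricted IDs. */
    public void bootstrap(int[] messagingIds, int[] suspendedIds) {
        for (int id : messagingIds) messagingRestricted.set(id);
        for (int id : suspendedIds) accountSuspended.set(id);
    }

    /** Enforcement check is two in-memory bit lookups: no network, no GC pressure. */
    public boolean canMessage(int memberId) {
        return !messagingRestricted.get(memberId) && !accountSuspended.get(memberId);
    }
}
```

At one bit per member, a billion-member ID space costs roughly 125 MB per restriction type, which is what makes eagerly holding the full dataset in memory feasible.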

The innovations in the third-generation system addressed many limitations of its predecessors, offering several key benefits:

- Faster Bootstrapping Processes: With DaVinci’s eager caching, restriction data was loaded into memory more quickly during server restarts or deployments. This reduced downtime and ensured the system could enforce restrictions almost immediately after initialization.

- Greater Resilience: The system became more capable of handling both organic growth (increased users and data) and adversarial data growth (spikes in restrictions due to malicious activity). By leveraging memory-efficient data structures and off-heap storage, the system scaled effectively without encountering performance bottlenecks.


Conclusion

LinkedIn's evolution from early dependence on relational databases to embracing sophisticated NoSQL systems like Espresso, and incorporating state-of-the-art frameworks like DaVinci and Venice, illustrates a consistent dedication to innovation throughout the development of the restriction enforcement system.

Yet, this progression wasn't solely driven by innovation. It was also shaped by fundamental principles:

Start Simple, Scale Thoughtfully: The initial designs were streamlined, aimed at addressing immediate challenges without overcomplicating solutions.

Proactive Problem Identification: This strategy enabled the engineering team to foresee potential issues, such as memory constraints or latency spikes, and tackle them through well-planned interventions.

Collaboration Across Teams: By facilitating cooperation between different teams, efficiency was boosted through shared insights and minimized duplicated efforts.

Benchmarking and Testing: The team employed time-boxed experiments and performance evaluations to rapidly assess different strategies.

Continuous Improvement: Each new version of the restriction enforcement system built on the learnings and limitations of its predecessors.

References:

  • The Evolution of Enforcing Our Professional Community Policies at Scale
  • Building Trust and Combating Abuse on our Platform
