QCon London 2026

The Future of Observability: Netflix Pioneers the “Knowledge Graph” Approach

Netflix is pushing the boundaries of observability, moving beyond traditional monitoring to a system built on interconnected knowledge. Engineers Prasanna Vijayanathan and Renzo Sanchez-Silva recently presented their work at QCon London 2026, detailing how a knowledge graph is changing the way the streaming giant understands and responds to issues across its vast infrastructure.

From Siloed Data to a Unified View: The Challenge of E2E Observability

Traditional observability often struggles with fragmented data. Metrics, events, logs and traces exist in silos, making it difficult to correlate information and pinpoint root causes. This is the core challenge of End-to-End (E2E) Observability: the ability to monitor a complex system from the user interface down to the underlying infrastructure. Netflix’s approach directly addresses these issues.

The MELT Layer: A Foundation for Unified Observability

Central to Netflix’s strategy is the MELT Layer (Metrics, Events, Logs, Traces). This unified layer aims to improve incident resolution time by consolidating observability data. It’s a crucial step towards breaking down silos and providing a more holistic view of system health.
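To make the idea of consolidation concrete, here is a minimal sketch of a unified record for the four MELT signal types. It is not Netflix's schema: the `Signal` fields, the `correlate` helper, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    kind: str        # "metric", "event", "log", or "trace"
    service: str     # hypothetical field: the service that emitted the signal
    timestamp: int   # seconds since epoch
    payload: str


def correlate(signals, window_seconds=60):
    """Group signals by (service, time window) so related records sit together."""
    buckets = defaultdict(list)
    for s in signals:
        buckets[(s.service, s.timestamp // window_seconds)].append(s)
    return dict(buckets)


# Made-up sample data covering all four signal types.
signals = [
    Signal("metric", "api-gateway", 1000, "p99_latency_ms=840"),
    Signal("log", "api-gateway", 1012, "upstream timeout"),
    Signal("trace", "api-gateway", 1015, "span checkout took 900ms"),
    Signal("event", "billing", 1300, "deploy finished"),
]

grouped = correlate(signals)
kinds = sorted(s.kind for s in grouped[("api-gateway", 16)])
print(kinds)
```

Once every signal shares the same envelope, a single lookup returns the metric, log, and trace that belong to the same service and time window, which is the kind of correlation that siloed stores make difficult.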

Ontology: Encoding Knowledge for Machine Understanding

But simply collecting data isn’t enough. Netflix leverages the power of Ontology – a formal specification of types, properties, and relationships – to encode knowledge about its systems. This isn’t just about the data itself, but about understanding the connections between data points. The fundamental unit of this knowledge is the Triple: (Subject | Predicate | Object), representing a single fact within the knowledge graph.

For example, a triple might state: “api-gateway | rdf:type | ops:Application,” defining the api-gateway as an application. Another could be: “INC-5377 | ops:affects | api-gateway,” indicating that incident INC-5377 impacts the api-gateway.
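The two triples above can be stored and queried with very little code. The following is a sketch only: the `Triple` type and the `match` helper are hypothetical names, and a production system would more likely use a dedicated graph or RDF store.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two facts quoted in the text above.
GRAPH = {
    Triple("api-gateway", "rdf:type", "ops:Application"),
    Triple("INC-5377", "ops:affects", "api-gateway"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# Which incidents affect the api-gateway?
incidents = [t.subject for t in match(GRAPH, predicate="ops:affects", obj="api-gateway")]
print(incidents)  # ['INC-5377']
```

Because every fact has the same three-part shape, one generic pattern-matching function can answer questions about applications, incidents, or any other entity type.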

12 Operational Namespaces: Connecting the Netflix Universe

To manage the complexity of its infrastructure, Netflix utilizes 12 Operational Namespaces – including Slack, Alerts, Metrics, Logs, and Incidents – to categorize and connect all elements. The ontology captures, structures, and preserves this information in a machine-readable format, transforming operational chaos into a structured understanding.

The Knowledge Flywheel: Continuous Learning and Adaptation

Netflix’s system isn’t static. The Knowledge Flywheel embodies a continuous learning loop. It operates through three states (Observe, Enrich, and Infer), constantly adapting and improving its understanding of the system. This flywheel is integrated with a development process utilizing Claude, where the AI proposes code changes (pull requests) that are then reviewed and merged by human engineers.
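The three-step loop can be sketched over a set of triples. This is an illustration under stated assumptions, not Netflix's implementation: the predicates `ops:dependsOn` and `ops:mayAffect`, the `checkout-ui` service, and the inference rule itself are hypothetical.

```python
def observe(graph, raw_facts):
    """Observe: add newly seen facts to the graph."""
    return graph | set(raw_facts)


def enrich(graph, known_types):
    """Enrich: attach type information to entities we recognise."""
    entities = {s for (s, _, _) in graph} | {o for (_, _, o) in graph}
    typed = {(e, "rdf:type", known_types[e]) for e in entities if e in known_types}
    return graph | typed


def infer(graph):
    """Infer: derive new facts from a rule.

    Hypothetical rule: if an incident affects X and Y depends on X,
    then the incident may also affect Y.
    """
    derived = {
        (incident, "ops:mayAffect", dependent)
        for (incident, p1, target) in graph if p1 == "ops:affects"
        for (dependent, p2, dep_target) in graph
        if p2 == "ops:dependsOn" and dep_target == target
    }
    return graph | derived


graph = set()
graph = observe(graph, [
    ("INC-5377", "ops:affects", "api-gateway"),
    ("checkout-ui", "ops:dependsOn", "api-gateway"),  # hypothetical service
])
graph = enrich(graph, {"api-gateway": "ops:Application"})
graph = infer(graph)
```

Each pass only adds facts, so the loop can be rerun whenever new observations arrive, and facts inferred in one pass become inputs to the next.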

This integration of AI and human expertise is a key element, allowing for automated improvements while maintaining control and oversight.

Future Trends: Automation and Self-Healing Infrastructure

Netflix’s vision extends beyond simply understanding incidents. They aim to automate root cause analysis, provide auto-remediation, and ultimately create a self-healing infrastructure. This represents a significant leap forward in operational efficiency and reliability.

The Rise of AI-Powered Observability

The integration of AI, as demonstrated by the use of Claude, is a major trend. Expect to see more AI-powered tools that can automatically analyze observability data, identify anomalies, and even suggest solutions. This will free up engineers to focus on more strategic tasks.

Knowledge Graphs as the New Standard

Netflix’s knowledge graph approach is likely to become a standard practice. By representing infrastructure as interconnected entities, organizations can gain a deeper understanding of their systems and improve their ability to respond to incidents.

Shift Towards Proactive Observability

The goal is to move beyond reactive monitoring to proactive observability – predicting and preventing issues before they impact users. This requires sophisticated analytics and machine learning algorithms that can identify patterns and anomalies.

FAQ

What is an ontology in the context of observability?
An ontology is a formal specification of types, properties, and relationships, used to encode knowledge about a system and its components.

What is the MELT layer?
The MELT layer (Metrics, Events, Logs, Traces) is a unified observability layer designed to consolidate data and improve incident resolution time.

What is a Triple?
A Triple is a tuple (Subject | Predicate | Object) that defines one fact in a knowledge graph.

How does Netflix use AI in its observability system?
Netflix uses AI, specifically Claude, to propose code changes and automate parts of the observability workflow.

What are the 12 Operational Namespaces?
These are categories used by Netflix to organize and connect all elements of its infrastructure, including Slack, Alerts, Metrics, Logs, and Incidents.

Did you know? The concept of a knowledge graph isn’t new, but its application to large-scale observability, as demonstrated by Netflix, is a significant advancement.

Pro Tip: Start small when implementing observability solutions. Focus on identifying key metrics and events, and gradually expand your coverage as you gain experience.


Booking.com’s AI Journey: Lessons for the Future of Data-Driven Platforms

Booking.com’s evolution from Perl scripts and MySQL databases to a sophisticated AI platform, as detailed at QCon London 2026 by Senior Principal Engineer Jabez Eliezer Manuel, offers valuable insights into the challenges and triumphs of scaling AI within a large organization. The presentation, “Behind Booking.com’s AI Evolution: The Unpolished Story,” highlighted a 20-year journey marked by pragmatic experimentation and a willingness to adapt.

The Power of Data-Driven DNA

Booking.com began extensive A/B testing in 2005 and has since run more than 1,000 experiments concurrently, accumulating 150,000 experiments in total. Despite a success rate of less than 25%, the company prioritized rapid learning over immediate results, fostering a “Data-Driven DNA” that continues to shape its approach to innovation. This early commitment to experimentation laid the groundwork for future AI initiatives.

From Hadoop to a Unified Platform: A Migration Story

Booking.com initially leveraged Apache Hadoop for distributed storage and processing, building two on-premise clusters with approximately 60,000 cores and 200 PB of storage by 2011. However, limitations such as noisy neighbors, lack of GPU support, and capacity issues eventually led to a seven-year migration away from Hadoop. The migration strategy involved mapping the entire ecosystem, analyzing usage to reduce scope, applying the PageRank algorithm, migrating in waves, and finally phasing out Hadoop. A unified command center proved crucial to this complex undertaking.
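The text does not say what PageRank was applied to. One plausible reading, assumed here, is that it ranked datasets or jobs by how heavily the rest of the ecosystem depended on them. The sketch below runs a plain power-iteration PageRank over a hypothetical dependency graph; the node names and edges are invented for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of node -> list of out-links."""
    nodes = set(edges) | {n for targets in edges.values() for n in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = edges.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / len(nodes)
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical dependency graph: an edge A -> B means "A reads from B".
depends_on = {
    "report_daily": ["bookings_clean"],
    "ml_features": ["bookings_clean", "searches_clean"],
    "bookings_clean": ["bookings_raw"],
    "searches_clean": ["searches_raw"],
}

ranks = pagerank(depends_on)
most_depended_on = max(ranks, key=ranks.get)
print(most_depended_on)  # bookings_raw
```

In this toy graph `bookings_raw` ranks highest because everything else ultimately reads from it. A ranking like this could inform which assets to treat with the most care when planning migration waves, though how Booking.com actually used the scores is not stated in the source.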

The Evolution of the Machine Learning Stack

The company’s machine learning stack has undergone significant transformation, evolving from Perl and MySQL in 2005 to agentic systems in 2025. Key technologies along the way included Apache Oozie with Python, Apache Spark with MLlib, and H2O.ai. 2015 marked a turning point with the resolution of challenges in real-time predictions and feature engineering. As of 2024, the platform handles over 400 billion predictions daily with a latency of less than 20 milliseconds, powered by more than 480 machine learning models.
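To put the daily figure in perspective, a quick calculation converts it to an average per-second rate. It uses 400 billion as a round number from the text; real traffic is unlikely to be spread evenly across the day, so peak rates would be higher.

```python
PREDICTIONS_PER_DAY = 400e9      # "over 400 billion predictions daily" (from the text)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

avg_per_second = PREDICTIONS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_per_second:,.0f} predictions per second on average")
# 4,629,630 predictions per second on average
```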

Domain-Specific AI Platforms

Booking.com has developed four distinct domain-specific machine learning platforms:

GenAI: Used for trip planning, smart filters, and review summaries.
Content Intelligence: Focused on image and review analysis, and text generation for detailed hotel content.
Recommendations: Delivering personalized content to customers.
Ranking: A complex platform optimizing for choice and value, exposure and growth, and efficiency and revenue.

The initial ranking formula, a simple function of bookings, views, and a random number, proved surprisingly resilient to machine learning replacements due to infrastructure limitations. The company adopted an interleaving technique for A/B testing, allowing for more variants with less traffic, followed by validation with traditional A/B testing.
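The text does not say which interleaving variant Booking.com used. As an illustration, the sketch below implements team-draft interleaving, one well-known variant: two rankers take turns contributing their best remaining result to a single list shown to the user, and clicks are credited to the ranker that supplied the clicked item. The hotel identifiers and function names are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; return the merged list and which team supplied each item."""
    merged, team_of = [], {}
    count = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))
    while len(merged) < total:
        # The team with fewer picks goes next; ties are broken by coin flip.
        if count["A"] != count["B"]:
            team = "A" if count["A"] < count["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        # That team contributes its highest-ranked item not yet shown.
        pick = next((x for x in rankings[team] if x not in team_of), None)
        if pick is None:
            # This team has nothing left, so the other team picks instead.
            team = "B" if team == "A" else "A"
            pick = next(x for x in rankings[team] if x not in team_of)
        merged.append(pick)
        team_of[pick] = team
        count[team] += 1
    return merged, team_of


def score(clicked_items, team_of):
    """Credit each click to the team that contributed the clicked item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        wins[team_of[item]] += 1
    return wins


rng = random.Random(7)
merged, team_of = team_draft_interleave(
    ["hotel_1", "hotel_2", "hotel_3"],
    ["hotel_3", "hotel_1", "hotel_4"],
    rng,
)
wins = score(["hotel_4"], team_of)  # a click on hotel_4 credits ranker B
```

Because every user sees results from both rankers in the same list, each session yields a direct comparison, which is why interleaving can detect a preference with less traffic than splitting users into separate groups.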

Future Trends: What Lies Ahead?

Booking.com’s journey highlights several key trends likely to shape the future of AI-powered platforms:

Unified Orchestration Layers: The convergence of domain-specific AI platforms into a unified orchestration layer, as demonstrated by Booking.com, will become increasingly common. This allows for greater synergy and efficiency.
Pragmatic AI Adoption: The emphasis on learning from failures and iterating quickly, rather than striving for perfection, will be crucial for successful AI implementation.
Infrastructure as a Limiting Factor: Infrastructure limitations can significantly impact the effectiveness of even the most sophisticated algorithms. Investing in scalable and robust infrastructure is paramount.
The Importance of Data Management: Effective data management, including strategies for handling large datasets and ensuring data quality, remains a foundational element of any successful AI initiative.

FAQ

Q: What was the biggest challenge Booking.com faced during its AI evolution?
A: Migrating away from Hadoop proved to be a significant undertaking, requiring a seven-year phased approach.

Q: What is the current latency of Booking.com’s machine learning inference platform?
A: Less than 20 milliseconds.

Q: What is “interleaving” in the context of A/B testing?
A: A technique where 50% of experiments are interwoven into a single experiment, allowing for more variants with less traffic.

Q: What technologies did Booking.com use in its machine learning stack?
A: Perl, MySQL, Apache Oozie, Python, Apache Spark, MLlib, H2O.ai, deep learning, and GenAI.

Did you know? Booking.com’s A/B testing experiments had a success rate of less than 25%, but the focus was on learning, not immediate results.

Pro Tip: Don’t be afraid to experiment and fail fast. A culture of learning from mistakes is essential for successful AI adoption.


QCon London 2026: Ontology‐Driven Observability: Building the E2E Knowledge Graph at Netflix Scale