<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:webfeeds="http://webfeeds.org/rss/1.0" version="2.0">
  <channel>
    <atom:link href="http://pubsubhubbub.appspot.com/" rel="hub"/>
    <atom:link href="https://f43.me/netflix-tech-blog.xml" rel="self" type="application/rss+xml"/>
    <title>Netflix Tech Blog</title>
    <description>Learn about Netflix’s world class engineering efforts, company culture, product developments and more. - Medium</description>
    <link>http://medium.com</link>
    <webfeeds:icon>https://s2.googleusercontent.com/s2/favicons?alt=feed&amp;domain=medium.com</webfeeds:icon>
    <generator>f43.me</generator>
    <lastBuildDate>Fri, 13 Mar 2026 05:20:00 +0100</lastBuildDate>
    <item>
      <title><![CDATA[Scaling Global Storytelling: Modernizing Localization Analytics at Netflix]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/valentingeffrier/">Valentin Geffrier</a>, <a href="https://www.linkedin.com/in/tanguycornuau/">Tanguy Cornuau</a></p><p><em>Each year, we bring the Analytics Engineering community together for an Analytics Summit — a multi-day internal conference to share analytical deliverables across Netflix, discuss analytic practice, and build relationships within the community. This post is one of several topics presented at the Summit highlighting the breadth and impact of Analytics work across different areas of the business.</em></p><p>At Netflix, our goal is to entertain the world, which means we must speak the world’s languages. Given the company’s growth to serving 300 million+ members in 190+ countries and 50+ languages, the Localization team has had to scale rapidly in creating more dubs and subtitle assets than ever before. However, this growth created technical debt within our systems: a fragmented landscape of analytics workflows, duplicated pipelines, and siloed dashboards that we are now actively modernizing.</p><h4>The Challenge: “Who Made This Dub?”</h4><p>Historically, business logic for localization metrics was replicated across isolated domains. A question as simple as “<em>Who made this dub/subtitle?</em>” is actually complex — it requires mapping multiple data sources through intricate and constantly changing logic, which varies depending on the specific language asset type and creation workflow.</p><p>When this logic is copied into isolated pipelines for different use cases, it creates two major risks: inconsistency in reporting and a massive maintenance burden whenever upstream logic changes. We realized we needed to move away from these vertical silos.</p><h4>Our Modernization Strategy</h4><p>To address this, we defined a vision centered on consolidation, standardization, and trust, executed through three strategic pillars:</p><p>1.
The Audit and Consolidation Playbook</p><p>We initiated a comprehensive audit of over 40 dashboards and tools to assess usage and code quality. Our focus has shifted from patching frontend visualizations to consolidating backend pipelines. For example, we are currently merging three legacy dashboards related to dubbing partner KPIs (around operational performance, capacity, and finances), focusing first on a unified data and backend layer that can support a variety of future frontend iterations.</p><p>2. Reducing “Not-So-Tech” Debt</p><p>Technical debt isn’t just about code; it is also about the user experience. We define “Not-So-Tech Debt” as the friction stakeholders feel when tools are hard to interpret or can benefit from better storytelling. To fix this, we revamped our Language Asset Consumption tool — instead of reporting dub and subtitle metrics independently, we combine audio and text languages into one consumption language that helps differentiate Original Language versus Localized Consumption and measure member preferences between subtitles, dubs, or a combination of both for a given language. This unlocks more intuitive insights based on actual recurring stakeholder use cases.</p><p>3. Investing in Core Building Blocks</p><p>We are shifting to a <em>write once, read many</em> architecture. By centralizing business logic into unified tables — such as a “Language Asset Producer” table — we solve the “<em>Who made this dub?”</em> problem once. This centralized source now feeds into multiple downstream domains, including our Dub Quality and Translation Quality metrics, ensuring that any logic update propagates instantly across the ecosystem.</p><h4>The Future: Event-Level Analytics</h4><p>Looking ahead, we are moving beyond asset-level metrics to event-level analytics. We are building a generic data model to capture granular timed-text events, such as individual subtitle lines. This data helps us understand how subtitle characteristics (e.g. 
reading speed) affect member engagement and, in turn, refine the style guidelines we provide to our subtitle linguists to improve the member experience with localized content.</p><p>Ultimately, this modernization effort is about scaling our ability to measure and enhance the joy and entertainment we deliver to our diverse global audience, ensuring that every member, regardless of their language, has the best possible Netflix experience.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=816f47290641" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/scaling-global-storytelling-modernizing-localization-analytics-at-netflix-816f47290641">Scaling Global Storytelling: Modernizing Localization Analytics at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/scaling-global-storytelling-modernizing-localization-analytics-at-netflix-816f47290641</link>
      <guid>https://netflixtechblog.com/scaling-global-storytelling-modernizing-localization-analytics-at-netflix-816f47290641</guid>
      <pubDate>Fri, 06 Mar 2026 16:01:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Optimizing Recommendation Systems with JDK’s Vector API]]></title>
      <description><![CDATA[<p><em>By </em><a href="https://www.linkedin.com/in/harshad-sane-56711a11/"><em>Harshad Sane</em></a></p><p>Ranker is one of the largest and most complex services at Netflix. Among other things, it powers the personalized rows you see on the Netflix homepage, and runs at an enormous scale. When we looked at CPU profiles for this service, one feature kept standing out: <strong>video serendipity scoring</strong> — the logic that answers a simple question:</p><p><em>“How different is this new title from what you’ve been watching so far?”</em></p><p>This single feature was consuming about 7.5% of total CPU on each node running the service. What started as a simple idea — “just batch the video scoring feature” — turned into a deeper optimization journey. Along the way we introduced batching, re-architected the memory layout, and tried various libraries to handle the scoring kernels.</p><p>Read on to learn how we achieved the same serendipity scores, but at a meaningfully lower CPU per request, resulting in a reduced cluster footprint.</p><h3><strong>Problem: The Hotspot in Ranker</strong></h3><p>At a high level, serendipity scoring works like this: A candidate title and each item in a member’s viewing history are represented as embeddings in a vector space. For each candidate, we compute its similarity against the history embeddings, find the maximum similarity, and convert that into a “novelty” score. That score becomes an input feature to the downstream recommendation logic.</p><p>The original implementation was straightforward but expensive. For each candidate we fetch its embedding, loop over the history to compute cosine similarity one pair at a time, and track the maximum similarity score. Although it is easy to reason about, at Ranker’s scale, this results in significant sequential work, repeated embedding lookups, scattered memory access, and poor cache locality. 
Profiling confirmed this.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/655/1*w8Z7CwTNc4dW84n8-S-CMw.png"><figcaption>Flamegraph showing inefficient scoring</figcaption></figure><p>A flamegraph made it clear: One of the top hotspots in the service was Java dot products inside the serendipity encoder. Algorithmically, the hotspot was a nested loop structure of M candidates × N history items, where each pair generates its own cosine similarity, i.e. O(M×N) separate dot product operations.</p><h3><strong>Solution</strong></h3><h4>The Original Implementation: Single video cosine loop</h4><p>In simplified form the code looked like this:</p><pre>for (Video candidate : candidates) {<br>  Vector c = embedding(candidate); // D-dimensional<br>  double maxSim = -1.0;<br><br>  for (Video h : history) {<br>    Vector v = embedding(h); // D-dimensional<br>    double sim = cosine(c, v); // dot(c, v) / (||c|| * ||v||)<br>    maxSim = Math.max(maxSim, sim);<br>  }<br><br>  double serendipity = 1.0 - maxSim;<br>  emitFeature(candidate, serendipity);<br>}</pre><p>The nested for loop with O(M×N) separate dot products brought its own overheads. One interesting detail we learned by instrumenting traffic shapes: most requests (about 98%) were single-video, but the remaining 2% were large batch requests. Because those batches were so large, the total volume of videos processed ended up being roughly 50:50 between single and batch jobs. This made batching worth pursuing even if it didn’t help the median request.</p><h4>Step 1: Batching, from Nested Loops to Matrix Multiply</h4><p>The first idea was to stop thinking in terms of “many small dot products” and instead treat the work as a matrix operation: for batched candidates, lay the data out so that the math collapses into a single matrix multiply. 
If D is the embedding dimension:</p><ol><li>Pack all candidate embeddings into a matrix A of shape M x D.</li><li>Pack all history embeddings into a matrix B of shape N x D.</li><li>Normalize all rows to unit length.</li><li>Compute cosine similarities as<br>[ C = A x B^T ], where C is an M x N matrix of cosine similarities.</li></ol><p>In pseudo‑code:</p><pre>// Build matrices<br>double[][] A = new double[M][D]; // candidates<br>double[][] B = new double[N][D]; // history<br><br>for (int i = 0; i &lt; M; i++) {<br>  A[i] = embedding(candidates[i]).toArray();<br>}<br>for (int j = 0; j &lt; N; j++) {<br>  B[j] = embedding(history[j]).toArray();<br>}<br><br>// Normalize rows to unit vectors<br>normalizeRows(A);<br>normalizeRows(B);<br><br>// Compute C = A * B^T, so that<br>// C[i][j] == cosine(candidates[i], history[j])<br>double[][] C = matmul(A, transpose(B));<br><br>// Derive serendipity<br>for (int i = 0; i &lt; M; i++) {<br>  double maxSim = max(C[i][0..N-1]);<br>  double serendipity = 1.0 - maxSim;<br>  emitFeature(candidates[i], serendipity);<br>}</pre><p>This turns <strong>M×N separate dot products into a single matrix multiply</strong>, which is exactly what CPUs and optimized kernels are built for. We integrated this into the existing framework by supporting both encode() for single videos and batchEncode() for batches, while maintaining backward compatibility. At this point it seemed like we were “done”, but we weren’t.</p><h4>Step 2: When Batching Isn’t Enough</h4><p>Once we had a batched implementation, we ran canaries and saw something surprising: about a 5% performance regression. The algorithm wasn’t the issue — turning M×N separate dot products into a matrix multiplication is mathematically sound. The problem was the overhead we introduced in the first implementation.</p><ol><li>Our initial version built double[][] matrices for candidates, history, and results on every batch. 
Those large, short-lived allocations created GC pressure, and the double[][] layout itself is non-contiguous in memory, which meant extra pointer chasing and worse cache behavior.</li><li>On top of that, the first-cut Java matrix multiply was a straightforward scalar implementation, so it couldn’t take advantage of SIMD. In other words, we paid the cost of batching without getting the compute efficiency we were aiming for.</li></ol><p>The lesson was immediate: algorithmic improvements don’t matter if the implementation details—memory layout, allocation strategy, and the compute kernel—work against you. That set up the next step for making the data layout cache-friendly and eliminating per-batch allocations before revisiting the matrix multiply kernel.</p><h4><strong>Step 3: Flat Buffers &amp; ThreadLocal Reuse</strong></h4><p>We reworked the data layout to be cache-friendly and allocation-light. Instead of double[m][n], we moved to flat double[] buffers in row-major order. That gave us contiguous memory and predictable access patterns. Then we introduced a ThreadLocal&lt;BufferHolder&gt; that owns reusable buffers for candidates, history, and any other scratch space. Buffers grow as needed but never shrink, which avoids per-request allocation while keeping each thread isolated (no contention). 
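</p><p>The flat, row-major indexing just described can be illustrated with a toy example. This is our own hedged sketch, not the production code; names like packRowMajor and dotRows are ours:</p>

```java
// Illustrative sketch of the flat, row-major layout described above.
// All names here are ours, not Netflix's production code.
public class FlatLayoutDemo {
    // Copy M embeddings of dimension d into one contiguous buffer:
    // element k of embedding i lands at index i * d + k.
    static void packRowMajor(double[][] embeddings, double[] flat, int d) {
        for (int i = 0; i < embeddings.length; i++) {
            System.arraycopy(embeddings[i], 0, flat, i * d, d);
        }
    }

    // Dot product of row i of one flat buffer with row j of another.
    static double dotRows(double[] a, int i, double[] b, int j, int d) {
        double sum = 0.0;
        for (int k = 0; k < d; k++) {
            sum += a[i * d + k] * b[j * d + k];
        }
        return sum;
    }

    public static void main(String[] args) {
        int d = 3;
        double[][] candidates = { {1, 0, 0}, {0, 1, 0} };
        double[][] history = { {1, 0, 0} };
        double[] candFlat = new double[candidates.length * d];
        double[] histFlat = new double[history.length * d];
        packRowMajor(candidates, candFlat, d);
        packRowMajor(history, histFlat, d);
        System.out.println(dotRows(candFlat, 0, histFlat, 0, d)); // 1.0
        System.out.println(dotRows(candFlat, 1, histFlat, 0, d)); // 0.0
    }
}
```

Because every row is contiguous, the hot loop walks memory sequentially instead of chasing per-row pointers.
<p>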
A simplified sketch:</p><pre>class BufferHolder {<br>  double[] candidatesFlat = new double[0];<br>  double[] historyFlat = new double[0];<br><br>  double[] getCandidatesFlat(int required) {<br>    if (candidatesFlat.length &lt; required) {<br>      candidatesFlat = new double[required];<br>    }<br>    return candidatesFlat;<br>  }<br><br>  double[] getHistoryFlat(int required) {<br>    if (historyFlat.length &lt; required) {<br>      historyFlat = new double[required];<br>    }<br>    return historyFlat;<br>  }<br>}<br><br>private static final ThreadLocal&lt;BufferHolder&gt; threadBuffers =<br>    ThreadLocal.withInitial(BufferHolder::new);</pre><p>This change alone made the batched path far more predictable: fewer allocations, less GC pressure, and better cache locality.</p><p>Now the remaining question was the one we originally thought we were answering: what’s the best way to do the matrix multiply?</p><h4><strong>Step 4: BLAS, Great in Tests but Not in Production</strong></h4><p>The obvious next step was BLAS (Basic Linear Algebra Subprograms). In isolation, microbenchmarks looked promising. But once integrated into the real batch scoring path, the gains didn’t materialize. A few things were working against us:</p><ul><li>The default netlib-java path was using F2J (Fortran-to-Java) BLAS rather than a truly native implementation.</li><li>Even with native BLAS, we paid overhead for setup and JNI transitions.</li><li>Java’s row-major layout doesn’t match the column-major expectations of many BLAS routines, which can introduce conversion and temporary buffers.</li><li>Those extra allocations and copies mattered in the full pipeline, especially alongside TensorFlow embedding work.</li></ul><p>BLAS was still a useful experiment — it clarified where time was being spent, but it wasn’t the drop-in win we wanted. 
What we needed was something that stayed pure Java, fit our flat-buffer architecture, and could still exploit SIMD.</p><h4><strong>Step 5: JDK Vector API to the Rescue</strong></h4><p><strong><em>A Short Note on the JDK Vector API: </em></strong>The JDK Vector API is an <em>incubating</em> feature that provides a portable way to express data-parallel operations in Java — think “SIMD without intrinsics”. You write in terms of vectors and lanes, and the JIT maps those operations to the best SIMD instructions available on the host CPU (SSE/AVX2/AVX-512), with a scalar fallback when needed. More crucially for us, it’s pure Java: no native dependencies, no JNI transitions, and a development model that looks like normal Java code rather than platform-specific assembly or intrinsics.</p><p>This was a particularly good match for our workload because we had already moved embeddings into flat, contiguous double[] buffers, and the hot loop was dominated by large numbers of dot products. The final step was to replace BLAS with a pure-Java SIMD implementation using the JDK Vector API. By this point we already had the right shape for high performance — batching, flat buffers, and ThreadLocal reuse. So the remaining work was to swap out the compute kernel without introducing JNI overhead or platform-specific code. We did that behind a small factory. 
At class load time, MatMulFactory selects the best available implementation:</p><ul><li>If jdk.incubator.vector is available, use a Vector API implementation.</li><li>Otherwise, fall back to a scalar implementation with a highly optimized loop-unrolled dot product (implemented by my colleague Patrick Strawderman, inspired by patterns used in <a href="https://github.com/apache/lucene/blob/6d4314d46fd69ca16edce0cd1c8507aa0e66ccd6/lucene/core/src/java/org/apache/lucene/util/VectorUtilDefaultProvider.java#L26">Lucene</a>).</li></ul><p>In the Vector API implementation, the inner loop computes a dot product by accumulating a * b into a vector accumulator using fma() (fused multiply-add). DoubleVector.SPECIES_PREFERRED lets the runtime pick an appropriate lane width for the machine. Here’s a simplified sketch of the inner loop:</p><pre>// Vector API path (simplified)<br>for (int i = 0; i &lt; M; i++) {<br>  for (int j = 0; j &lt; N; j++) {<br>    DoubleVector acc = DoubleVector.zero(SPECIES);<br>    int k = 0;<br>    // Step by SPECIES.length() lanes (often 4 doubles on AVX2, 8 on AVX-512).<br>    for (; k + SPECIES.length() &lt;= D; k += SPECIES.length()) {<br>      DoubleVector a = DoubleVector.fromArray(SPECIES, candidatesFlat, i*D + k);<br>      DoubleVector b = DoubleVector.fromArray(SPECIES, historyFlat,   j*D + k);<br>      acc = a.fma(b, acc); // fused multiply-add<br>    }<br>    double dot = acc.reduceLanes(VectorOperators.ADD);<br>    // handle tail k..D-1<br>    similaritiesFlat[i*N + j] = dot;<br>  }<br>}</pre><p>The figure below shows how the Vector API utilizes SIMD hardware to process multiple doubles per instruction (e.g., 4 lanes on AVX2 and 8 lanes on AVX‑512). 
What used to be many scalar multiply-adds becomes a smaller number of vector fma() operations plus a reduction—same algorithm, much better use of the CPU’s vector units.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xU5DvGS8SVz3hfdrOfC1KQ.png"><figcaption>Vectorization with SIMD</figcaption></figure><h4>Fallbacks &amp; Safety: When the Vector API Isn’t Available</h4><p>Because the Vector API is still incubating, it requires a runtime flag (--add-modules=jdk.incubator.vector). We didn’t want correctness or availability to depend on that flag. So we designed the fallback behavior explicitly: at startup, we detect Vector API support and use the SIMD batched matmul when available; otherwise we fall back to an optimized scalar path, with single-video requests continuing to use the per-item implementation.</p><p>That gives us a clean operational story: services can opt in to the Vector API for maximum performance, but the system remains safe and predictable without it.</p><h4><strong>Results in Production</strong></h4><p>With the full design in place (batching, flat buffers, ThreadLocal reuse, and the Vector API), we ran canaries carrying production traffic. We observed a ~7% drop in CPU utilization and a ~12% drop in average latency. To normalize across any small throughput differences, we also tracked CPU/RPS (CPU consumed per request-per-second). That metric improved by roughly 10%, meaning we could handle the same traffic with about 10% less CPU, and we saw similar numbers hold after full production rollout.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*qV3G1oUywCDqssuz6oTaJw.png"><figcaption>CPU/RPS on Ranker</figcaption></figure><p>At the function operator level, we saw the CPU drop from the initial 7.5% to merely ~1% with the optimization in place. 
At the assembly level, the shift was clear: from loop-unrolled scalar dot products to a vectorized matrix multiply on AVX-512 hardware.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/742/1*2oKKZ-jKZrHTZP_vU-ppxg.png"><figcaption>Assembly snippet from batchEncode</figcaption></figure><h3><strong>Closing Thoughts</strong></h3><p>This optimization ended up being less about finding the “fastest library” and more about getting the fundamentals right: choosing the right computation shape, keeping data layout cache-friendly, and avoiding overheads that can erase theoretical wins. Once those pieces were in place, the JDK Vector API was a great fit, as it let us express SIMD-style math in pure Java, without JNI, while still keeping a safe fallback path. Another bonus was the low developer overhead: compared to lower-level approaches, the Vector API let us replace a much larger, more complex implementation with a relatively small amount of readable Java code, which made it easier to review, maintain, and iterate on.</p><p>Have you tried the Vector API in a real service yet? 
I’d love to hear what workloads it helped (or didn’t), and what you learned about benchmarking and rollout in production.</p><p><em>Special thanks to </em><a href="https://www.linkedin.com/in/jason-koch-5692172/"><em>Jason Koch</em></a><em>, </em><a href="https://www.linkedin.com/in/patrickstrawderman/"><em>Patrick Strawderman</em></a><em>, </em><a href="https://www.linkedin.com/in/yuewangh/"><em>Daniel Huang</em></a><em>, </em><a href="https://www.linkedin.com/in/fan-yang-a15ba249/"><em>Fan Yang</em></a><em>, and the Performance Engineering team at Netflix</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=30d2830401ec" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/optimizing-recommendation-systems-with-jdks-vector-api-30d2830401ec">Optimizing Recommendation Systems with JDK’s Vector API</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/optimizing-recommendation-systems-with-jdks-vector-api-30d2830401ec</link>
      <guid>https://netflixtechblog.com/optimizing-recommendation-systems-with-jdks-vector-api-30d2830401ec</guid>
      <pubDate>Tue, 03 Mar 2026 02:36:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Mount Mayhem at Netflix: Scaling Containers on Modern CPUs]]></title>
      <description><![CDATA[<p>Authors: <a href="https://www.linkedin.com/in/harshad-sane-56711a11/">Harshad Sane</a>, <a href="https://www.linkedin.com/in/andrew-halaney/">Andrew Halaney</a></p><p>Imagine this — you click play on Netflix on a Friday night and behind the scenes hundreds of containers spring into action in a few seconds to answer your call. At Netflix, scaling containers efficiently is critical to delivering a seamless streaming experience to millions of members worldwide. To keep up with responsiveness at this scale, we modernized our container runtime, only to hit a surprising bottleneck: the CPU architecture itself.</p><p>Let us walk you through the story of how we diagnosed the problem and what we learned about scaling containers at the hardware level.</p><h3>The Problem</h3><p>When application demand requires that we scale up our servers, we get a new instance from AWS. To use this new capacity efficiently, pods are assigned to the node until its resources are considered fully allocated. A node can go from no applications running to being maxed out within moments of being ready to receive these applications.</p><p>As we migrated more and more from our old container platform to our new container platform, we started seeing some concerning trends. Some nodes were stalling for long periods of time, with a simple health check timing out after 30 seconds. An initial investigation showed that the mount table length was increasing dramatically in these situations, and reading it alone could take upwards of 30 seconds. Looking at systemd’s stack, it was clear that it too was busy processing these mount events, which could lead to a complete system lockup. The kubelet also frequently timed out talking to containerd during this period. 
Examining the mount table made it clear that these mounts were related to container creation.</p><p>The affected nodes were almost all r5.metal instances, and were starting applications whose container image contained many layers (50+).</p><h3>Challenge</h3><h4>Mount Lock Contention</h4><p>The flamegraph in Figure 1 clearly shows where containerd spent its time. Almost all of the time is spent trying to grab a kernel-level lock as part of the various mount-related activities when assembling the container’s root filesystem!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1w_zMp68xhQI0W1bTXG0Dw.png"><figcaption>Figure 1: Flamegraph depicting lock contention</figcaption></figure><p>Looking closer, containerd executes the following calls for each layer if using user namespaces:</p><ol><li>open_tree() to get a reference to the layer / directory</li><li>mount_setattr() to set the idmap to match the container’s user range, shifting the ownership so this container can access the files</li><li>move_mount() to create a bind mount on the host with this new idmap applied</li></ol><p>These bind mounts are owned by the container’s user range and are then used as the lowerdirs to create the overlayfs-based rootfs for the container. Once the overlayfs rootfs is mounted, the bind mounts are then unmounted since they are not necessary to keep around once the overlayfs is constructed.</p><p>If a node is starting many containers at once, every CPU ends up busy trying to execute these mounts and umounts. The kernel VFS has various global locks related to the mount table, and each of these mounts requires taking that lock as we can see in the top of the flamegraph. Any system trying to quickly set up many containers is prone to this, and this is a function of the number of layers in the container image.</p><p>For example, assume a node is starting 100 containers, each with 50 layers in its image. 
Each container will need 50 bind mounts to do the idmap for each layer. The container’s overlayfs mount will be created using those bind mounts as the lower directories, and then all 50 bind mounts can be cleaned up via umount. Containerd actually goes through this process twice, once to determine some user information in the image and once to create the actual rootfs. This means the total number of mount operations on the startup path for our 100 containers is 100 * 2 * (1 + 50 + 50) = 20200 mounts, all of which require grabbing various global mount-related locks!</p><h3>Diagnosis</h3><h4>What’s Different In The New Runtime?</h4><p>As alluded to in the introduction, Netflix has been undergoing a modernization of its container runtime. In the past a virtual kubelet + docker solution was used, whereas now a kubelet + containerd solution is being used. Both the old runtime and the new runtime used user namespaces, so what’s the difference here?</p><ol><li>Old Runtime:<br>All containers shared a single host user range. UIDs in image layers were shifted at untar time, so file permissions matched when containers accessed files. This worked because all containers used the same host user.</li><li>New Runtime:<br>Each container gets a unique host user range, improving security — if a container escapes, it can only affect its own files. To avoid the costly process of untarring and shifting UIDs for every container, the new runtime uses the kernel’s idmap feature. This allows efficient UID mapping per container without copying or changing file ownership, which is why containerd performs many mounts.</li></ol><p>Figure 2 below shows a simplified example of what this idmap feature looks like:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TYCK-IO0aLp0jhS8KUM6QQ.png"><figcaption>Figure 2: idmap feature</figcaption></figure><h4>Why Does Instance Type Matter?</h4><p>As noted earlier, the issue was predominantly occurring on r5.metal instances. 
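</p><p>As a quick sanity check on the mount arithmetic from the previous section, the count can be captured in a tiny helper (our sketch; the name totalMountOps is ours):</p>

```java
// Sanity check for the mount-operation arithmetic described earlier.
// totalMountOps is our own helper name, used purely for illustration.
public class MountMath {
    // Per pass and per container: 1 overlayfs mount, plus one bind mount
    // and one corresponding umount per layer. Containerd makes two passes.
    static long totalMountOps(long containers, long layers, long passes) {
        return containers * passes * (1 + layers + layers);
    }

    public static void main(String[] args) {
        System.out.println(totalMountOps(100, 50, 2)); // 20200
    }
}
```

Note how the count scales linearly with the number of image layers, which is why many-layer images were the trigger.
<p>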
Once we identified the root issue, we could easily reproduce it by creating a container image with many layers and sending hundreds of workloads using the image to a test node.</p><p>To better understand why this bottleneck was more pronounced on some instances than others, we benchmarked container launches on different AWS instance types:</p><ul><li>r5.metal (5th gen Intel, dual-socket, multiple NUMA domains)</li><li>m7i.metal-24xl (7th gen Intel, single-socket, single NUMA domain)</li><li>m7a.24xlarge (7th gen AMD, single-socket, single NUMA domain)</li></ul><h4>Baseline Results</h4><p>Figure 3 shows the baseline results from scaling containers on each instance type.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/648/1*KcdXLeml8rDOdnk2aukv1w.png"></figure><ul><li>At low concurrency (≤ ~20 containers), all platforms performed similarly</li><li>As concurrency increased, r5.metal began to fail around 100 containers</li><li>7th generation AWS instances maintained lower launch times and higher success rates as concurrency grew</li><li>m7a instances showed the most consistent scaling behavior with the lowest failure rates even at high concurrency</li></ul><h3>Deep Dive</h3><p>Using perf record and custom microbenchmarks, we saw that the hottest code path was in the Linux kernel’s Virtual Filesystem (VFS) path lookup code — specifically, a tight spin loop waiting on a sequence lock in path_init(). 
The CPU spent most of its time executing the pause instruction, indicating many threads were spinning, waiting for the global lock, as shown in the disassembly snippet below.</p><pre>path_init():<br>…<br>mov mount_lock,%eax<br>test $0x1,%al<br>je 7c<br>pause<br>…</pre><p>Using Intel’s Topdown Microarchitecture Analysis (<a href="https://dyninst.github.io/scalable_tools_workshop/petascale2018/assets/slides/TMA%20addressing%20challenges%20in%20Icelake%20-%20Ahmad%20Yasin.pdf">TMA</a>), we observed:</p><ul><li>95.5% of pipeline slots were stalled on contested accesses (tma_contested_accesses).</li><li>57% of slots were due to false sharing (multiple cores accessing the same cache line).</li><li>Cache line bouncing and lock contention were the primary culprits.</li></ul><p>Given the large share of time spent in contested accesses, we next investigated how two hardware characteristics, NUMA topology and hyperthreading, contributed to the contention.</p><h4>NUMA Effects</h4><p>Non-Uniform Memory Access (NUMA) is a system design where each processor has its own local memory for faster access but relies on an interconnect to access the memory attached to a remote processor. Introduced in the 1990s to improve scalability in multiprocessor systems, NUMA boosts performance but also introduces higher latency when a CPU needs to access memory attached to another processor. Figure 4 illustrates the local vs. remote access patterns of a NUMA architecture.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/802/1*sDCMnrdke_7aAOI-wf5rGQ.png"><figcaption>Figure 4:<em> Source: </em><a href="https://pmem.io/images/posts/numa_overview.png"><em>https://pmem.io/images/posts/numa_overview.png</em></a></figcaption></figure><p>AWS instances come in a variety of shapes and sizes. 
To obtain the largest core count, we tested the 2-socket 5th generation metal instances (r5.metal), on which containers were orchestrated by the Titus agent. Modern dual-socket architectures implement a NUMA design, leading to faster local but higher remote access latencies. Although container orchestration can maintain locality, global locks can easily run into high latency effects due to remote synchronization. To test the impact of NUMA, we compared a 48xl-sized AWS instance (two sockets, i.e. two NUMA nodes) against a 24xl-sized instance (a single socket and NUMA node). As seen in Figure 5, the extra hop introduces high latencies and hence failures very quickly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TXZhTqAQHDu3NffCO15JrA.png"><figcaption>Figure 5: NUMA Impact</figcaption></figure><h4>Hyperthreading Effects</h4><ul><li>Hyperthreading (HT): Disabling HT on m7i.metal-24xl (Intel) improved container launch latencies by 20–30% as seen in Figure 6, since hyperthreads compete for shared execution resources, worsening the lock contention. When hyperthreading is enabled, each physical CPU core is split into two logical CPUs (hyperthreads) that share most of the core’s execution resources, such as caches, execution units, and memory bandwidth. While this can improve throughput for workloads that are not fully utilizing the core, it introduces significant challenges for workloads that rely heavily on global locks. By disabling hyperthreading, each thread runs on its own physical core, eliminating this competition for shared resources between hyperthreads. 
As a result, threads can acquire and release global locks more quickly, reducing overall contention and improving latency for operations that generally share underlying resources.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/623/1*9I-QZB66dEnBtLLuu23Ydw.png"><figcaption>Figure 6: Hyperthreading impact</figcaption></figure><h3>Why Does Hardware Architecture Matter?</h3><h4>Centralized Cache Architectures</h4><p>Some modern server CPUs use a mesh-style interconnect to link cores and cache slices, with each intersection managing cache coherence for a subset of memory addresses. In these designs, all communication passes through a central queueing structure, which can only handle one request for a given address at a time. When a global lock (like the mount lock) is under heavy contention, all atomic operations targeting that lock are funneled through this single queue, causing requests to pile up and resulting in memory stalls and latency spikes.</p><p>In some well-known mesh-based architectures as shown in Figure 7 below, this central queue is called the “Table of Requests” (TOR), and it can become a surprising bottleneck when many threads are fighting for the same lock. 
If you’ve ever wondered why certain CPUs seem to “pause for breath” under heavy contention, this is often the culprit.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/975/1*kmbGU_M8AK_VHgh4QLJp2A.png"><figcaption>Figure 7: <em>Public document from one of the major CPU vendors. Source:</em><a href="https://www.intel.com/content/dam/developer/articles/technical/ddio-analysis-performance-monitoring/Figure1.png">https://www.intel.com/content/dam/developer/articles/technical/ddio-analysis-performance-monitoring/Figure1.png</a></figcaption></figure><h4>Distributed Cache Architectures</h4><p>Some modern server CPUs use a distributed, chiplet-based architecture (Figure 8), where multiple core complexes, each with its own local last-level cache, are connected via a high-speed interconnect fabric. In these designs, cache coherence is managed within each core complex, and traffic between complexes is handled by a scalable control fabric. Unlike mesh-based architectures with centralized queueing structures, this distributed approach spreads contention across multiple domains, making severe stalls from global lock contention less likely. For those interested in the technical details, public documentation from major CPU vendors provides deeper insight into these distributed cache and chiplet designs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0hQAquZ36l8_7Ib9aAnH0g.png"><figcaption>Figure 8: <em>Public document from one of the major CPU vendors, Source: (</em><a href="https://www.servethehome.com/amd-epyc-genoa-gaps-intel-xeon-in-stunning-fashion/amd-epyc-9004-genoa-chiplet-architecture-8x-ccd/">AMD EPYC 9004 Genoa Chiplet Architecture 8x CCD — ServeTheHome</a>)</figcaption></figure><p>Here is a comparison of the same workload run on m7i (centralized cache architecture) vs m7a (distributed cache architecture). 
Note that, to make the comparison fair, Hyperthreading (HT) was disabled on m7i, given the regression seen in Figure 6, and experiments were run with the same core counts. The result shows a consistent performance difference of approximately 20%, as shown in Figure 9.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*85aavnaRC6TAcBlCZgN6xA.png"><figcaption>Figure 9: Architectural impact between m7i and m7a</figcaption></figure><h4>Microbenchmark Results</h4><p>To validate the above observations about NUMA, HT, and microarchitecture, we developed a small <a href="https://github.com/Netflix/global-lock-bench">microbenchmark</a> that spawns a given number of threads, each spinning on a globally contended lock. Running the benchmark at increasing thread counts reveals the latency characteristics of the system under different scenarios. For example, Figure 10 below shows the microbenchmark results with NUMA, HT, and different microarchitectures.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bs0MDU5xNs9VIguhHhl0zA.png"><figcaption>Figure 10: Global lock contention benchmark results</figcaption></figure><p>Results from this custom synthetic benchmark (pause_bench) confirmed:</p><ul><li>On r5.metal, eliminating NUMA effects by using only a single socket significantly drops latency at high thread counts.</li><li>On m7i.metal-24xl, disabling hyperthreading further improves scaling.</li><li>On m7a.24xlarge, performance scales the best, demonstrating that a distributed cache architecture handles the cache-line contention of global locks more gracefully.</li></ul><h3>Improving Software Architecture</h3><p>While understanding the impacts of the hardware architecture is important for assessing possible mitigations, the root cause here is contention over a global lock. 
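The core idea of such a global-lock microbenchmark can be sketched in a few lines of Python. This is a simplified, hypothetical illustration using a userspace lock; the real pause_bench spins on a contended lock in native code, and CPython's threading semantics make this only a rough analogue of hardware-level contention:

```python
import threading
import time

def spin_worker(lock, iters, latencies):
    """Each thread repeatedly acquires a single, globally shared lock,
    recording how long each acquisition takes under contention."""
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        with lock:
            pass  # empty critical section; we only measure lock handoff cost
        latencies.append(time.perf_counter_ns() - t0)

def run_benchmark(num_threads, iters=200):
    """Return the mean lock-acquisition latency (ns) for num_threads threads
    all contending on one shared lock, analogous to the global mount lock."""
    lock = threading.Lock()
    latencies = []  # list.append is thread-safe in CPython
    threads = [threading.Thread(target=spin_worker, args=(lock, iters, latencies))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(latencies) / len(latencies)

for n in (1, 4, 16):
    print(f"{n:>2} threads: {run_benchmark(n):10.0f} ns mean lock acquisition")
```

As in the real benchmark, sweeping the thread count upward exposes how acquisition latency degrades as more threads contend for the same lock.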
Working with containerd upstream, we arrived at two possible solutions:</p><ol><li>Use the newer kernel mount API’s fsconfig() lowerdir+ support to supply the idmapped lowerdirs as file descriptors instead of filesystem paths. This avoids the move_mount() syscall mentioned earlier, which takes the global mount lock to attach each layer to the mount table.</li><li>Map the common parent directory of all the layers. This reduces the number of mount operations from O(n) to O(1) per container, where n is the number of layers in the image.</li></ol><p>Since the newer API requires a newer kernel, we opted to make the latter <a href="https://github.com/containerd/containerd/pull/12092">change</a> to benefit more of the community. With that in place, containerd’s flamegraph is no longer dominated by mount-related operations. In fact, as seen in Figure 11 below, we had to highlight them in purple to see them at all!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ingCksvvc5HrLd-LvMX7Dw.png"><figcaption>Figure 11: Optimized solution</figcaption></figure><h3><strong>Conclusion</strong></h3><p>Our journey migrating to a modern kubelet + containerd runtime at Netflix revealed just how deeply intertwined software and hardware architecture can be when operating at scale. While kubelet/containerd’s usage of unique container users brought significant security gains, it also surfaced new bottlenecks rooted in kernel and CPU architecture — particularly when launching hundreds of many-layered container images in parallel. Our investigation highlighted that not all hardware is created equal for this workload: a centralized cache design amplified cache contention, while a distributed cache design scaled smoothly under load.</p><p>Ultimately, the best solution combined hardware awareness with software improvements. For an immediate mitigation, we chose to route these workloads to CPU architectures that scaled better under these conditions. 
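The "map the common parent" idea can be illustrated with a small sketch: instead of one mount operation per layer, resolve the layers' common parent once and express each lowerdir relative to it. This is hypothetical Python, not the actual containerd change, and the paths are made up:

```python
import os

def overlay_lowerdir_options(layer_dirs):
    """Sketch of the 'mount the common parent' optimization: rather than
    attaching each overlayfs lowerdir to the mount table individually
    (O(n) acquisitions of the global mount lock), map the common parent
    directory once (O(1)) and reference layers by relative path beneath it."""
    parent = os.path.commonpath(layer_dirs)  # the single directory to mount
    relative = [os.path.relpath(d, parent) for d in layer_dirs]
    return parent, "lowerdir=" + ":".join(relative)

# Hypothetical snapshot layout for a 3-layer image
layers = [f"/var/lib/containerd/snapshots/{i}/fs" for i in range(3)]
parent, opts = overlay_lowerdir_options(layers)
print(parent)  # /var/lib/containerd/snapshots
print(opts)    # lowerdir=0/fs:1/fs:2/fs
```

Regardless of how many layers the image has, only the parent needs a mount operation, which is what takes the per-layer work off the globally locked mount path.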
By changing the software design to minimize per-layer mount operations, we eliminated the global lock as a launch-time bottleneck — unlocking faster, more reliable scaling regardless of the underlying CPU architecture. This experience underscores the importance of holistic performance engineering: understanding and optimizing both the software stack and the hardware it runs on is key to delivering seamless user experiences at Netflix scale.</p><p>We trust these insights will assist others in navigating the evolving container ecosystem, transforming potential challenges into opportunities for building robust, high-performance platforms.</p><p><em>Special thanks to the Titus and Performance Engineering teams at Netflix.</em></p><hr><p><a href="https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac">Mount Mayhem at Netflix: Scaling Containers on Modern CPUs</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac</link>
      <guid>https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac</guid>
      <pubDate>Sat, 28 Feb 2026 23:55:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/avneesh/">Avneesh Saluja</a>, <a href="https://www.linkedin.com/in/santiagocastroserra/">Santiago Castro</a>, <a href="https://www.linkedin.com/in/bowei-yan-0080a326/">Bowei Yan</a>, <a href="https://www.linkedin.com/in/ashish-rastogi-11362a/">Ashish Rastogi</a></p><h4>Introduction</h4><p>Netflix’s core mission is to connect millions of members around the world with stories they’ll love. This requires not just an incredible catalog, but also a deep, machine-level understanding of every piece of content in that catalog, from the biggest blockbusters to the most niche documentaries. As we onboard new types of content such as live events and podcasts, the need to scalably understand these nuances becomes even more critical to our productions and member-facing experiences.</p><p>Many of these media-related tasks require sophisticated long-form video understanding e.g., identifying subtle narrative dependencies and emotional arcs that span entire episodes or films. <a href="https://netflixtechblog.com/detecting-scene-changes-in-audiovisual-content-77a61d3eaad6">Previous work</a> has found that to truly grasp the content’s essence, our models must leverage the full multimodal signal. For example, the audio soundtrack is a crucial, non-visual modality that can help more precisely identify clip-level tones or when a new scene starts. Can we use our collection of shows and movies to learn how to a) fuse modalities like audio, video, and subtitle text together and b) develop robust representations that leverage the narrative structure that is present in long form entertainment? 
Consisting of tens of millions of individual <a href="https://en.wikipedia.org/wiki/Shot_(filmmaking)">shots</a> across multiple titles, our diverse yet entertainment-specific dataset provides the perfect foundation to train multimodal media understanding models that enable many capabilities across the company such as <a href="https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d#befc">ads relevancy, clip popularity prediction, and clip tagging</a>.</p><p>For these reasons, we developed the <strong>Netflix Media Foundational Model (MediaFM)</strong>, our new, in-house, multimodal content embedding model. MediaFM is the first tri-modal (audio, video, text) model pretrained on portions of the Netflix catalog. Its core is a multimodal, Transformer-based encoder designed to generate rich, contextual embeddings¹ for shots from our catalog by learning the temporal relationships between them through integrating visual, audio, and textual information. The resulting shot-level embeddings are powerful representations designed to create a deeper, more nuanced, and machine-readable understanding of our content, providing the critical backbone for effective cold start of newly launching titles in recommendations, optimized promotional assets (like art and trailers), and internal content analysis tools.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y5IOwk46Pu82512T00ywGA.png"><figcaption>Figure 1: MediaFM Architecture</figcaption></figure><h4>Input Representation &amp; Preprocessing</h4><p>The model’s fundamental unit of input is a shot, derived by segmenting a movie or episode (collectively referred to as “title”) using a <a href="https://arxiv.org/abs/2008.04838">shot boundary detection</a> algorithm. 
For each shot, we generate three distinct embeddings from its core modalities:</p><ul><li><strong>Video</strong>: an internal model called <a href="https://netflixtechblog.com/building-in-video-search-936766f0017c">SeqCLIP</a> (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots</li><li><strong>Audio</strong>: the audio samples from the same shots are embedded using Meta FAIR’s <a href="https://arxiv.org/abs/2006.11477">wav2vec2</a></li><li><strong>Timed Text</strong>: OpenAI’s text-embedding-3-large <a href="https://openai.com/index/new-embedding-models-and-api-updates/">model</a> is used to encode the corresponding timed text (e.g., closed captions, audio descriptions, or subtitles) for each shot</li></ul><p>For each shot, the three embeddings² are concatenated and unit-normed to form a single 2304-dimensional fused embedding vector. The transformer encoder is trained on sequences of shots, so each example in our dataset is a temporally ordered sequence of these fused embeddings from the same movie or episode (up to 512 shots per sequence). We also have access to title-level metadata, which is used to provide global context for each sequence (via the [GLOBAL] token). The title-level embedding is computed by passing title-level metadata (such as synopses and tags) through the text-embedding-3-large model.</p><h4>Model Architecture and Training Objective</h4><p>The core of our model is a transformer encoder, architecturally similar to BERT. 
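The fusion step described above can be sketched with NumPy. The per-modality sizes below are placeholders chosen to sum to 2304; the post specifies only the fused dimension, not the split:

```python
import numpy as np

def fuse_shot_embedding(video, audio, text):
    """Concatenate the three per-modality shot embeddings and unit-norm the
    result, yielding a single fused vector (2304-dimensional in MediaFM)."""
    fused = np.concatenate([video, audio, text])
    return fused / np.linalg.norm(fused)

# Illustrative per-modality sizes only (768 each, summing to 2304).
rng = np.random.default_rng(0)
video = rng.normal(size=768)  # e.g., from a SeqCLIP-style video encoder
audio = rng.normal(size=768)  # e.g., from a wav2vec2-style audio encoder
text = rng.normal(size=768)   # e.g., from a text-embedding model

fused = fuse_shot_embedding(video, audio, text)
print(fused.shape)            # (2304,)
print(np.linalg.norm(fused))  # 1.0 (unit-normed)
```

A training example is then a temporally ordered stack of such fused vectors, one per shot, from the same title.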
A sequence of preprocessed shot embeddings is passed through the following stages:</p><ol><li><strong>Input Projection</strong>: The fused shot embeddings are first projected down to the model’s hidden dimension via a linear layer.</li><li><strong>Sequence Construction &amp; Special Tokens</strong>: Before entering the Transformer, two special embeddings are prepended to the sequence:<br>• a learnable [CLS] embedding is added at the very beginning.<br>• the title-level embedding is projected to the model’s hidden dimension and inserted after the [CLS] token as the [GLOBAL] token, providing title-level context to every shot in the sequence and participating in the self-attention process.</li><li><strong>Contextualization</strong>: The sequence is enhanced with positional embeddings and fed through the Transformer stack to provide shot representations based on their surrounding context.</li><li><strong>Output Projection</strong>: The contextualized hidden states from the Transformer are passed through a final linear layer, projecting them from the hidden dimension back up to the 2304-dimensional fused embedding space for prediction.</li></ol><p>We train the model using a <strong>Masked Shot Modeling (MSM)</strong> objective. In this self-supervised task, we randomly mask 20% of the input shot embeddings in each sequence by replacing them with a learnable [MASK] embedding. The model’s objective is to predict the original, unmasked fused embedding for these masked positions. The model is optimized by minimizing the <strong>cosine distance</strong> between its predicted embedding and the ground-truth embedding for each masked shot.</p><p>We optimized the hidden parameters with Muon and the remaining parameters with AdamW. It’s worth noting that the switch to Muon resulted in noticeable improvements.</p><h4>Evaluation</h4><p>To evaluate the learned embeddings, we learn task-specific linear layers on top of frozen representations (i.e., linear probes). 
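The MSM objective can be sketched as follows. This is a minimal NumPy illustration of the masking and cosine-distance scoring, not the production training loop, and it stands in for the learnable [MASK] embedding with a zero vector:

```python
import numpy as np

def msm_loss(shot_embs, predict_fn, mask_ratio=0.2, rng=None):
    """Masked Shot Modeling sketch: replace ~20% of shot embeddings with a
    [MASK] stand-in, then score the model's reconstructions at the masked
    positions by cosine distance to the ground-truth fused embeddings."""
    rng = rng or np.random.default_rng(0)
    seq_len, dim = shot_embs.shape
    n_mask = max(1, int(mask_ratio * seq_len))
    masked_idx = rng.choice(seq_len, size=n_mask, replace=False)

    inputs = shot_embs.copy()
    inputs[masked_idx] = 0.0  # stand-in for a learnable [MASK] embedding

    preds = predict_fn(inputs)        # model output, one vector per position
    target = shot_embs[masked_idx]
    pred = preds[masked_idx]
    cos = np.sum(pred * target, axis=1) / (
        np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1))
    return float(np.mean(1.0 - cos))  # cosine distance, averaged over masks

# Sanity check: a "perfect" model that reproduces the targets has ~0 loss.
embs = np.random.default_rng(1).normal(size=(10, 2304))
print(msm_loss(embs, predict_fn=lambda _: embs))  # ~0.0
```

In the real model, `predict_fn` is the full projection + Transformer + output-projection stack, and the loss is minimized with Muon/AdamW as described above.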
Most of the tasks are clip-level, i.e., each example is a short clip, ranging from a few seconds to a minute, that is often presented to our members while recommending a title to them. When embedding these clips, we find that “embedding in context”, namely extracting the embeddings from within a larger sequence (e.g., the episode containing the clip), does much better than embedding only the shots from a clip.</p><h4><a href="https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d#befc">Tasks</a></h4><p>Our embeddings are foundational and we find that they bring value to applications across Netflix. Here are a few:</p><ul><li><strong>Ad Relevancy:</strong> A multilabel classification task to categorize Netflix clips for relevant ad placement, measured by <strong>Average Precision</strong>. In this task, these representations operate at the retrieval stage, where they help in identifying the candidate set and in turn are fed into the ad serving system for relevance optimization.</li><li><strong>Clip Popularity Ranking:</strong> A ranking task to predict the performance (in click-through rate, <a href="https://en.wikipedia.org/wiki/Click-through_rate">CTR</a>) of a media clip relative to other clips from the same show or movie, measured via ten-fold cross-validation with <strong>Kendall’s tau correlation coefficient</strong>.</li><li><strong>Clip Tone:</strong> A multi-label classification of hook clips into 100 tone categories (e.g., creepy, scary, humorous) from our internal Metadata &amp; Ratings team, measured by <strong>micro Average Precision </strong>(averaged across tone categories).</li><li><strong>Clip Genre:</strong> A multi-label classification of clips into eleven core genres (Action, Anime, Comedy, Documentary, Drama, Fantasy, Horror, Kids, Romance, Sci-fi, Thriller) derived from the genre of the parent title, measured by <strong>macro Average Precision </strong>(averaged across 
genres).</li><li><strong>Clip Retrieval: </strong>a binary classification of clips from movies or episodes into “clip-worthy” (i.e., a good clip to showcase the title) or not, as determined by human annotators, and as measured by <strong>Average Precision</strong>. The positive to negative clip ratio is 1:3, and for each title we select 6–10 positive clips and the corresponding number of negatives.</li></ul><p>It’s worth noting that for the tasks above (as well as other tasks that use our model), the model outputs are utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion. Many of the improvements are also in various stages of deployment.</p><h4>Results</h4><p>Figure 2³ compares MediaFM to several strong baselines:</p><ul><li>The previously mentioned SeqCLIP, which also provides the video embedding input for MediaFM</li><li>Google’s <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings">VertexAI multimodal embeddings</a></li><li>TwelveLabs’ <a href="https://www.twelvelabs.io/blog/introducing-marengo-2-7">Marengo 2.7 embeddings</a></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/984/1*aydFwELbXQSMBRWauB3IKg.png"><figcaption>Figure 2: Performance of MediaFM vs. external and internal models.</figcaption></figure><p>On all tasks, MediaFM is better than the baselines. Improvements seem to be larger for tasks that require more detailed narrative understanding e.g., predicting the most relevant ads for an ad break given the surrounding context. We look further into this next.</p><h4>Ablations</h4><p>MediaFM’s primary improvements over previous Netflix work stem from two key areas: combining multiple modalities and learning to contextualize shot representations. To determine the contribution of each factor across different tasks, we compared MediaFM to a baseline. 
This baseline concatenates the three input embeddings, essentially providing the same complete, shot-level input as MediaFM but without the contextualization step. This comparison allows us to isolate which tasks benefit most from the contextualization aspect.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sLEMAVa5NAra9h5WvG4D2g.png"></figure><p>Additional modalities help somewhat for tone, but the main improvement comes from contextualization.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mWYcJIXT_lPBkcTHpvQwGA.png"></figure><p>Oddly, multiple uncontextualized modalities <strong>hurt </strong>the clip popularity ranking model, but adding contextualization significantly improves performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y0MEiCxlpGmvIIgxjmarVg.png"></figure><p>For clip retrieval, we see a steady gain of around 15% from each improvement.</p><h4>Next Steps</h4><p>MediaFM presents a way to learn how to fuse and/or contextualize shot-level information by leveraging Netflix’s catalog in a self-supervised manner. With this perspective, we are actively investigating how pretrained multimodal (audio, video/image, text) LLMs like <a href="https://qwen.ai/blog?id=65f766fc2dcba7905c1cb69cc4cab90e94126bf4&amp;from=research.latest-advancements-list">Qwen3-Omni</a>, where the modality fusion has already been learned, can provide an even stronger starting point for subsequent model generations.</p><p>Next in this series of blog posts, we will present our method to embed title-level metadata and adapt it to our needs. Stay tuned!</p><h4>Footnotes</h4><ol><li>We chose embeddings over generative text outputs to prioritize modular design. This provides a tighter, cleaner abstraction layer: we generate the representation once, and it is consumed across our entire suite of services. 
This avoids the architectural fragility of fine-tuning, allowing us to enhance our existing embedding-based workflows with new modalities more flexibly.</li><li>All of our data has audio and video; we zero-pad for missing timed text data, which is relatively likely to occur (e.g., in shots without dialogue).</li><li>The title-level tasks couldn’t be evaluated with the VertexAI MM and Marengo embedding models as the videos exceed the length limit set by the APIs.</li></ol><hr><p><a href="https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d">MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d</link>
      <guid>https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d</guid>
      <pubDate>Mon, 23 Feb 2026 19:24:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Scaling LLM Post-Training at Netflix]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/baolin-li-659426115/">Baolin Li</a>, <a href="https://www.linkedin.com/in/lingyi-liu-4b866016/">Lingyi Liu</a>, <a href="https://www.linkedin.com/in/binh-tang-3b76557b/">Binh Tang</a>, <a href="https://www.linkedin.com/in/shaojingli/">Shaojing Li</a></p><h3>Introduction</h3><p>Pre-training gives Large Language Models (LLMs) broad linguistic ability and general world knowledge, but post-training is the phase that actually aligns them to concrete intents, domain constraints, and the reliability requirements of production environments. At Netflix, we are exploring how LLMs can enable new member experiences across recommendation, personalization, and search, which requires adapting generic foundation models so they can better reflect our catalog and the nuances of member interaction histories. At Netflix scale, post-training quickly becomes an engineering problem as much as a modeling one: building and operating complex data pipelines, coordinating distributed state across multi-node GPU clusters, and orchestrating workflows that interleave training and inference. This blog describes the architecture and engineering philosophy of our internal <strong>Post-Training Framework</strong>, built by the AI Platform team to hide infrastructure complexity so researchers and model developers can focus on model innovation — not distributed systems plumbing.</p><h3>A Model Developer’s Post-Training Journey</h3><p>Post-training often starts deceptively simply: curate proprietary domain data, load an open-weight model from Hugging Face, and iterate batches through it. At the experimentation scale, that’s a few lines of code. But when fine-tuning production-grade LLMs at scale, the gap between “running a script” and “robust post-training” becomes an abyss of engineering edge cases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/873/1*at60qZXd0j6SphkjzdmazQ.png"><figcaption>Figure 1. 
Simple steps to post-train an open-weight model.</figcaption></figure><h4>Getting the data right</h4><p>On paper, post-training is straightforward: choose a tokenizer, preprocess the dataset, and build a dataloader. In practice, data preparation is where things break. High-quality post-training — instruction following, multi-turn dialogue, Chain-of-Thought — depends on precisely controlling which tokens contribute to the loss. Hugging Face chat templates serialize conversations, but don’t specify what to train on versus ignore. The pipeline must apply explicit loss masking so only assistant tokens are optimized; otherwise the model learns from prompts and other non-target text, degrading quality.</p><p>Variable sequence length is another pitfall. Padding within a batch can waste compute, and uneven shapes across FSDP workers can cause GPU synchronization overhead. A more GPU-efficient approach is to pack multiple samples into fixed-length sequences and use a “document mask” to prevent tokens from attending across sample boundaries, reducing padding and keeping shapes consistent.</p><h4>Setting up the model</h4><p>Loading an open-source checkpoint sounds simple until the model no longer fits on one GPU. At that point you need a sharding strategy (e.g., FSDP, TP) and must load partial weights directly onto the device mesh to avoid ever materializing the full model on a single device.</p><p>After loading, you still need to make the model trainable: choose full fine-tuning vs. LoRA, and apply optimizations like activation checkpointing, compilation, and correct precision settings (often subtle for RL, where rollout and policy precision must align). Large vocabularies (&gt;128k) add a further memory trap: logits are<em> [batch, seq_len, vocab] </em>and can spike peak memory. 
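One way to defuse this trap is to avoid ever materializing the full [seq_len, vocab] logits tensor, computing the loss one sequence chunk at a time. A minimal NumPy sketch with toy projection and vocabulary sizes (production code operates on GPU tensors):

```python
import numpy as np

def chunked_ce_loss(hidden, proj, labels, chunk=128):
    """Cross-entropy over a large vocabulary without materializing the full
    [seq_len, vocab] logits tensor: project and reduce one sequence chunk at
    a time, so peak memory scales with `chunk` rather than seq_len."""
    total, count = 0.0, 0
    for start in range(0, hidden.shape[0], chunk):
        h = hidden[start:start + chunk]              # [chunk, d_model]
        logits = h @ proj                            # [chunk, vocab] (transient)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        y = labels[start:start + chunk]
        total += float(np.sum(logsumexp - logits[np.arange(len(y)), y]))
        count += len(y)
    return total / count

rng = np.random.default_rng(0)
hidden = rng.normal(size=(512, 64))
proj = rng.normal(size=(64, 1000))  # toy vocab; real vocabs exceed 128k
labels = rng.integers(0, 1000, size=512)

loss_chunked = chunked_ce_loss(hidden, proj, labels, chunk=128)
loss_full = chunked_ce_loss(hidden, proj, labels, chunk=512)  # one chunk == full
print(abs(loss_chunked - loss_full) < 1e-6)  # True: same value, lower peak memory
```

The same chunking pattern applies to any per-position loss; dropping loss-masked positions before the projection shrinks the transient logits further.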
Common mitigations include dropping ignored tokens before projection and computing logits/loss in chunks along the sequence dimension.</p><h4>Starting the training</h4><p>Even with data and models ready, production training is not a simple “for loop”. The system must support everything from SFT’s forward/backward pass to on-policy RL workflows that interleave rollout generation, reward/reference inference, and policy updates.</p><p>At Netflix scale, training runs as a distributed job. We use Ray to orchestrate workflows via actors, decoupling modeling logic from hardware. Robust runs also require experiment tracking (model quality metrics like loss and efficiency metrics like MFU) and fault tolerance via standardized checkpoints to resume cleanly after failures.</p><p>These challenges motivate a post-training framework that lets developers focus on modeling rather than distributed systems and operational details.</p><h3>The Netflix Post-Training Framework</h3><p>We built Netflix’s LLM post-training framework so Netflix model developers can turn ideas like those in Figure 1 into scalable, robust training jobs. It addresses the engineering hurdles described above, and also constraints that are specific to the Netflix ecosystem. Existing tools (e.g., Thinking Machines’ <a href="https://thinkingmachines.ai/tinker/">Tinker</a>) work well for standard chat and instruction-tuning, but their structure can limit deeper experimentation. In contrast, our internal use cases often require architectural variation (for example, customizing output projection heads for task-specific objectives), expanded or nonstandard vocabularies driven by semantic IDs or special tokens, and even transformer models pre-trained from scratch on domain-specific, non-natural-language sequences. 
Supporting this range requires a framework that prioritizes flexibility and extensibility over a fixed fine-tuning paradigm.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z2IFH-4iIQ5qxARTur-wQA.png"><figcaption>Figure 2. The post-training library within Netflix stack</figcaption></figure><p>Figure 2 shows the end-to-end stack from infrastructure to trained models. At the base is Mako, Netflix’s internal ML compute platform, which provisions GPUs on AWS. On top of Mako, we run robust open-source components — PyTorch, Ray, and vLLM — largely out of the box. Our post-training framework sits above these foundations as a library: it provides reusable utilities and standardized training recipes for common workflows such as Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Reinforcement Learning (RL), and Knowledge Distillation. Users typically express jobs as configuration files that select a recipe and plug in task-specific components.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RzZzr3wADPwism5CQP8mYw.png"><figcaption>Figure 3. Main components developed for the post-training framework</figcaption></figure><p>Figure 3 summarizes the modular components we built to reduce complexity across four dimensions. As with most ML systems, training success hinges on three pillars — <strong>Data</strong>, <strong>Model</strong>, and <strong>Compute</strong> — and the rise of RL fine-tuning adds a fourth pillar: <strong>Workflow</strong>, to support multi-stage execution patterns that don’t fit a simple training loop. 
Below, we detail the specific abstractions and features the framework provides for each of these dimensions:</p><ul><li><strong>Data:</strong> Dataset abstractions for SFT, reward modeling, and RL; high-throughput streaming from cloud and disk for datasets that exceed local storage; and asynchronous, on-the-fly sequence packing to overlap CPU-heavy packing with GPU execution and reduce idle time.</li><li><strong>Model:</strong> Support for modern architectures (e.g., Qwen3, Gemma3) and Mixture-of-Experts variants (e.g., Qwen3 MoE, GPT-OSS); LoRA integrated into model definitions; and high-level sharding APIs so developers can distribute large models across device meshes without writing low-level distributed code.</li><li><strong>Compute:</strong> A unified job submission interface that scales from a single node to hundreds of GPUs; MFU (Model FLOPS Utilization) monitoring that remains accurate under custom architectures and LoRA; and comprehensive checkpointing (states of trained parameters, optimizer, dataloader, data mixer, etc.) to enable exact resumption after interruptions.</li><li><strong>Workflow:</strong> Support for training paradigms beyond SFT, including complex online RL. In particular, we extend Single Program, Multiple Data (SPMD) style SFT workloads to run online RL with a hybrid single-controller + SPMD execution model, which we’ll describe next.</li></ul><p>Today, this framework supports research use cases ranging from post-training large-scale foundation models to fine-tuning specialized expert models. By standardizing these workflows, we’ve lowered the barrier for teams to experiment with advanced techniques and iterate more quickly.</p><h3>Learnings from Building the Post-Training Framework</h3><p>Building a system of this scope wasn’t a linear implementation exercise. 
It meant tracking a fast-moving open-source ecosystem, chasing down failure modes that only appear under distributed load, and repeatedly revisiting architectural decisions as the post-training frontier shifted. Below are three engineering learnings and best practices that shaped the framework.</p><h4>Scaling from SFT to RL</h4><p>We initially designed the library around Supervised Fine-Tuning (SFT): relatively static data flow, a single training loop, and a Single Program, Multiple Data (SPMD) execution model. That assumption stopped holding in 2025. With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line. Staying close to the frontier required infrastructure that could move from “offline training loop” to “multi-stage, on-policy orchestration.”</p><p>SFT’s learning signal is dense and immediate: for each token position we compute logits over the full vocabulary and backpropagate a differentiable loss. Infrastructure-wise, this looks a lot like pre-training and maps cleanly to SPMD — every GPU worker runs the same step function over a different shard of data, synchronizing through Pytorch distributed primitives.</p><p>On-policy RL changes the shape of the system. The learning signal is typically sparse and delayed (e.g., a scalar reward at the end of an episode), and the training step depends on data generated by the current policy. 
Individual sub-stages — policy updates, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination: you’re constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle.</p><p>In our original SFT architecture, the driver node was intentionally “thin”: it launched N identical Ray actors, each encapsulating the full training loop, and scaling meant launching more identical workers. That model breaks down for RL. RL required us to decompose the system into distinct roles — Policy, Rollout Workers, Reward Model, Reference Model, etc. — and evolve the driver into an active controller that encodes the control plane: when to generate rollouts, how to batch and score them, when to trigger optimization, and how to manage cluster resources across phases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5lhd3rDmexD0KGoHy78CZQ.png"><figcaption>Figure 4. Architectural differences of SFT and RL framework</figcaption></figure><p>Figure 4 highlights this shift. To add RL support without reinventing distributed orchestration from scratch, we integrated the core infrastructure from the open-source <a href="https://github.com/verl-project/verl"><strong>Verl</strong></a> library to manage Ray actor lifecycle and GPU resource allocation. Leveraging Verl’s backend let us focus on the “modeling surface area” — our Data/Model/Compute abstractions and internal optimizations — while keeping orchestration concerns decoupled. The result is a hybrid design: a unified user interface where developers can move between SFT and RL workflows without adopting an entirely different mental model or API set.</p><h4>Hugging Face-Centric Experience</h4><p>The Hugging Face Hub has effectively become the default distribution channel for open-weight LLMs, tokenizers, and configs. 
We designed the framework to stay close to that ecosystem rather than creating an isolated internal standard. Even when we use optimized internal model representations for speed, we load and save checkpoints in standard Hugging Face formats. This avoids “walled garden” friction and lets teams pull in new architectures, weights, and tokenizers quickly.</p><p>This philosophy also shaped our tokenizer story. Early on, we bound directly to low-level tokenization libraries (e.g., SentencePiece, tiktoken) to maximize control. In practice, that created a costly failure mode: silent training–serving skew. Our inference stack (vLLM) defaults to Hugging Face AutoTokenizer, and tiny differences in normalization, special token handling, or chat templating can yield different token boundaries — exactly the kind of mismatch that shows up later as inexplicable quality regressions. We fixed this by making Hugging Face AutoTokenizer the single source of truth. We then built a thin compatibility layer (BaseHFModelTokenizer) to handle post-training needs — setting padding tokens, injecting generation markers to support loss masking, and managing special tokens / semantic IDs — while ensuring the byte-level tokenization path matches production.</p><p>We do take a different approach for model implementations. Rather than training directly on transformers model classes, we maintain our own optimized, unified model definitions that can still load/save Hugging Face checkpoints. This layer is what enables framework-level optimizations — e.g., FlexAttention, memory-efficient chunked cross-entropy, consistent MFU accounting, and uniform LoRA extensibility — without re-implementing them separately for every model family. 
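</p><p>As an illustration of one such optimization, memory-efficient chunked cross-entropy amounts to accumulating the softmax normalizer chunk by chunk (a scalar Python sketch of the math only; the real implementation works on sharded GPU tensors):</p>

```python
import math

def chunked_cross_entropy(logits, target_idx, chunk_size=4):
    """Sketch of memory-efficient chunked cross-entropy: accumulate the
    softmax normalizer over vocabulary chunks so the full probability
    vector is never materialized at once."""
    m = max(logits)                              # subtract max for stability
    normalizer = 0.0
    for i in range(0, len(logits), chunk_size):
        normalizer += sum(math.exp(x - m) for x in logits[i:i + chunk_size])
    log_sum_exp = m + math.log(normalizer)
    return log_sum_exp - logits[target_idx]      # = -log softmax(target)

# matches the unchunked computation to floating-point precision:
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0]
direct = math.log(sum(math.exp(x) for x in logits)) - logits[4]
assert abs(chunked_cross_entropy(logits, target_idx=4) - direct) < 1e-12
```

<p>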
A unified module naming convention also makes it feasible to programmatically locate and swap components (Attention, MLP, output heads) across architectures, and provides a consistent surface for Tensor Parallelism and FSDP wrapping policies.</p><p>The trade-off is clear: supporting a new model family requires building a bridge between the Hugging Face reference implementation and our internal definition. To reduce that overhead, we use AI coding agents to automate much of the conversion work, with a strict <strong>logit verifier</strong> as the gate: given random inputs, our internal model must match the Hugging Face logits within tolerance. Because the acceptance criterion is mechanically checkable, agents can iterate autonomously until the implementation is correct, dramatically shortening the time-to-support for new architectures.</p><p>Today, this design means we can only train architectures we explicitly support — an intentional constraint shared by other high-performance systems like <a href="https://huggingface.co/docs/transformers/main/transformers_as_backend">vLLM, SGLang</a>, and <a href="https://github.com/pytorch/torchtitan/pull/2048">torchtitan</a>. To broaden coverage, we plan to add a fallback Hugging Face backend, similar to the compatibility patterns these projects use: users will be able to run training directly on native transformers models for rapid exploration of novel architectures, with the understanding that some framework optimizations and features may not apply in that mode.</p><h4>Providing Differential Value</h4><p>A post-training framework is only worth owning if it delivers clear value beyond assembling OSS components. We build on open source for velocity, but we invest heavily where off-the-shelf tools tend to be weakest: performance tuned to our workload characteristics, and integration with Netflix-specific model and business requirements. 
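</p><p>The verifier's acceptance criterion can be sketched as a simple tolerance check (plain Python lists here; the actual gate compares torch tensors over batches of random inputs):</p>

```python
def logits_match(reference, candidate, rtol=1e-4, atol=1e-5):
    """Sketch of a logit-verifier acceptance gate: an internal port is
    accepted only if, for the same inputs, every logit matches the
    reference implementation within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))

reference = [0.12, -3.40, 7.70]
assert logits_match(reference, [0.12, -3.40, 7.70000001])   # within tolerance
assert not logits_match(reference, [0.12, -3.30, 7.70])     # drifted logit
```

<p>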
Here are some concrete examples:</p><p>First, we optimize training efficiency for our real use cases. A representative example is extreme variance in sequence length. In FSDP-style training, long-tail sequences create stragglers: faster workers end up waiting at synchronization points for the slowest batch, lowering utilization. Standard bin-packing approaches help, but doing them offline at our data scale can add substantial preprocessing latency and make it harder to keep datasets fresh. Instead, we built on-the-fly sequence packing that streams samples from storage and dynamically packs them in memory. Packing runs asynchronously, overlapping CPU work with GPU compute. Figure 5 shows the impact: for our most skewed dataset, on-the-fly packing improved the effective token throughput by up to 4.7x.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Z2mdtXVFJsr764NihkguIA.png"><figcaption>Figure 5. Training throughput on two of our internal datasets on A100 and H200 GPUs</figcaption></figure><p>We also encountered subtler performance cliffs around vocabulary expansion. Our workloads frequently add custom tokens and semantic IDs. We found that certain vocabulary sizes could cause the language model head to fall back from a highly optimized cuBLAS kernel to a much slower CUTLASS path, tripling that layer’s execution time. The framework now automatically pads vocabulary sizes to multiples of 64 so the compiler selects the fast kernel, preserving throughput without requiring developers to know these low-level constraints.</p><p>Second, owning the framework lets us support “non-standard” transformer use cases that generic LLM tooling rarely targets. For example, some internal models are trained on member interaction event sequences rather than natural language, and may require bespoke RL loops that integrate with highly-customized inference engines and optimize business-defined metrics. 
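</p><p>Returning to the first example: the core of on-the-fly packing is a greedy streaming bin-fill, and the kernel-friendly vocabulary fix is a round-up. Both are sketched below (simplified logic, not the framework's code):</p>

```python
def pack_stream(samples, max_len):
    """Greedy on-the-fly sequence packing: concatenate streamed samples
    into bins of at most max_len tokens, emitting a packed sequence as
    soon as the next sample would overflow. (The real implementation
    runs asynchronously on CPU, overlapped with GPU compute, and also
    tracks per-segment boundaries for attention masking.)"""
    buf = []
    for seq in samples:
        if buf and len(buf) + len(seq) > max_len:
            yield buf
            buf = []
        buf.extend(seq[:max_len])     # clip pathological outliers
    if buf:
        yield buf

def pad_vocab(vocab_size, multiple=64):
    """Round the vocabulary up to a multiple of 64 so the LM head hits
    the fast GEMM kernel path rather than the slow fallback noted above."""
    return -(-vocab_size // multiple) * multiple

packed = list(pack_stream([[1] * 3, [2] * 4, [3] * 2, [4] * 6], max_len=8))
assert [len(p) for p in packed] == [7, 8]
assert pad_vocab(151_937) == 152_000
```

<p>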
These workflows demand custom environments, reward computation, and orchestration patterns — while still needing the same underlying guarantees around performance, tracking, and fault tolerance. The framework is built to accommodate these specialized requirements without fragmenting into one-off pipelines, enabling rapid iteration.</p><h3>Wrap up</h3><p>Building the Netflix Post-Training Framework has been a continual exercise in balancing standardization with specialization. By staying anchored to the open-source ecosystem, we’ve avoided drifting into a proprietary stack that diverges from where the community is moving. At the same time, by owning the core abstractions around Data, Model, Compute, and Workflow, we’ve preserved the freedom to optimize for Netflix-scale training and Netflix-specific requirements.</p><p>In the process, we’ve moved post-training from a loose collection of scripts into a managed, scalable system. Whether the goal is maximizing SFT throughput, orchestrating multi-stage on-policy RL, or training transformers over member interaction sequences, the framework provides a consistent set of primitives to do so reliably and efficiently. As the field shifts toward more agentic, reasoning-heavy, and multimodal architectures, this foundation will help us translate new ideas into scalable GenAI prototypes — so experimentation is constrained by our imagination, not by operational complexity.</p><h3>Acknowledgements</h3><p>This work builds on the momentum of the broader open-source ML community. We’re especially grateful to the teams and contributors behind Torchtune, Torchtitan, and Verl, whose reference implementations and design patterns informed many of our training framework choices — particularly around scalable training recipes, distributed execution, and RL-oriented orchestration. 
We also thank our partner teams in Netflix AI for Member Systems for close collaboration, feedback, and shared problem-solving throughout the development and rollout of the Post-Training Framework, and the Training Platform team for providing the robust infrastructure and operational foundation that makes large-scale post-training possible.</p><hr><p><a href="https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194">Scaling LLM Post-Training at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194</link>
      <guid>https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194</guid>
      <pubDate>Fri, 13 Feb 2026 09:05:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Automating RDS Postgres to Aurora Postgres Migration]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/ramsrivatsa/">Ram Srivatsa Kannan</a>, <a href="https://www.linkedin.com/in/wale-akintayo-30782a82/">Wale Akintayo</a>, <a href="https://www.linkedin.com/in/jay-bharadwaj-4b310ab8/">Jay Bharadwaj</a>, <a href="https://www.linkedin.com/in/john-crimmins-39730b3a/">John Crimmins</a>, <a href="https://www.linkedin.com/in/shengwei4721/">Shengwei Wang</a>, <a href="https://www.linkedin.com/in/zhitao-cathy-zhu/">Zhitao Zhu</a></p><h3>Introduction</h3><p>In 2024, the Online Data Stores team at Netflix conducted a comprehensive review of the relational database technologies used across the company. This evaluation examined functionality, performance, and total cost of ownership across our database ecosystem. Based on this analysis, we decided to standardize on <strong>Amazon Aurora PostgreSQL as the primary relational database</strong> offering for Netflix teams.</p><p>Several key factors influenced this decision:</p><ul><li><strong>PostgreSQL already underpinned</strong> the majority of our relational workloads, which made it a natural foundation for standardization. 
Internal evaluations revealed that Aurora PostgreSQL could support over 95% of the applications and workloads running on other relational databases across our internal services.</li><li><strong>Industry momentum had continued to shift toward PostgreSQL</strong>, driven by its open ecosystem, strong community support, and broad adoption across modern data platforms.</li><li><strong>Aurora’s cloud-native, distributed architecture</strong> provided clear advantages in scalability, high availability, and elasticity compared to traditional single-node PostgreSQL deployments.</li><li>Aurora PostgreSQL offered a <strong>rich feature set</strong>, along with a <strong>strong, forward-looking roadmap</strong> aligned with the needs of large-scale, globally distributed applications.</li></ul><h3>A Clear Migration Path Forward</h3><p>As part of this strategic shift, one of our key initiatives for 2024/2025 was migrating existing users to Aurora PostgreSQL. This effort began with RDS PostgreSQL migrations and will expand to include migrations from other relational systems in subsequent phases.</p><p>As a data platform organization, our goal is to make this evolution predictable, well-supported, and minimally disruptive. This allows teams to adopt Aurora PostgreSQL at a pace that aligns with their product and operational roadmaps, while we move toward a unified and scalable relational data platform across the organization.</p><h3>Database Migration: More Than a Simple Transfer</h3><p>Migrating a database involves far more than copying rows from one system to another. It is a coordinated process of transitioning both data and database functionality while preserving correctness, availability, and performance. 
At scale, a well-designed migration must minimize disruption to applications and ensure a clean, deterministic handoff from the old system to the new one.</p><p>Most database migrations follow a common set of high-level steps:</p><ol><li><strong>Data Replication</strong>: Data is first copied from the source database to the destination, typically using replication, so that ongoing changes are continuously captured and applied.</li><li><strong>Quiescence</strong>: Write traffic to the source database is halted, allowing the destination to fully catch up and eliminate any remaining divergence.</li><li><strong>Validation</strong>: The system verifies that the source and destination databases are fully synchronized and contain identical data.</li><li><strong>Cutover</strong>: Client applications are reconfigured to point to the destination database, which becomes the new primary source of truth.</li></ol><h3>Challenges</h3><h4>Operational Challenges</h4><p>Migrating to a new relational database at Netflix scale presents substantial operational challenges. With a fleet approaching 400 PostgreSQL clusters, manually migrating each one is simply not scalable for the data platform team. Such an approach would require a significant amount of time, introduce the risk of human error, and necessitate considerable hands-on engineering effort. Compounding the problem, coordinating downtime across the many interconnected services that depend on each database is extremely cumbersome at this scale.</p><p>To address these challenges, we designed a self-service migration workflow that enables service owners to run their own RDS PostgreSQL to Aurora PostgreSQL migrations. 
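</p><p>The four steps above can be sketched as a linear state machine (a toy skeleton, not our production workflow code):</p>

```python
from enum import Enum, auto

class Phase(Enum):
    REPLICATION = auto()   # copy data + stream ongoing changes
    QUIESCENCE = auto()    # halt write traffic to the source
    VALIDATION = auto()    # confirm source and destination match
    CUTOVER = auto()       # repoint clients at the destination
    DONE = auto()

def run_migration(actions):
    """Run the four phases in order; `actions` maps each phase to a
    callable returning True on success. Any failure halts the flow
    before cutover, leaving the source database authoritative."""
    for phase in (Phase.REPLICATION, Phase.QUIESCENCE,
                  Phase.VALIDATION, Phase.CUTOVER):
        if not actions[phase]():
            return phase            # report where we stopped
    return Phase.DONE

always_ok = {p: (lambda: True) for p in Phase if p is not Phase.DONE}
assert run_migration(always_ok) is Phase.DONE

failing = dict(always_ok)
failing[Phase.VALIDATION] = lambda: False   # divergence detected
assert run_migration(failing) is Phase.VALIDATION
```

<p>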
The workflow automatically handles orchestration, safety checks, and correctness guarantees end-to-end, resulting in lower operational overhead and a predictable, reliable migration experience.</p><h4>Technical Challenges</h4><ul><li><strong>Zero data loss</strong> — We must guarantee that all data from the source cluster is fully and safely migrated to the destination within a very tight window, with no possibility of data loss.</li><li><strong>Minimal downtime</strong> — Some downtime is unavoidable during migration, as applications must briefly pause write traffic while cutting over to Aurora PostgreSQL. For higher-tier services that power critical parts of the Netflix ecosystem, this window must be kept extremely short to prevent user-facing impact and maintain service reliability.</li><li><strong>No control over client applications</strong> — As the platform team, we manage the databases, but application teams handle the read and write operations. We cannot assume that they have the ability to pause writes on demand, nor do we want to expose such controls to them, as mistakes could lead to data inconsistencies post-migration. Therefore, building a self-service migration pipeline requires creative control-plane solutions to halt traffic, ensuring that no writes occur during the validation and cutover phases.</li><li><strong>No direct access to RDS credentials</strong> — The migration automation must perform replication, quiescence, and validation without requesting database credentials from users or relying on manual authentication. Source databases are often tightly secured, allowing access only from client applications, but more importantly, requiring credential access — even if it were possible — would significantly increase operational overhead and risk. 
At the same time, the migration platform may operate in environments without direct access to the source database, making traditional verification or parity checks impossible.</li><li><strong>No Degradation in Performance</strong> — The migration process must not impact the performance or stability of production databases once they are running in the Aurora PostgreSQL ecosystem.</li><li><strong>Full Ecosystem Parity</strong> — Beyond migrating the core database, associated components such as parameter groups, read replicas, and replication slots must also be migrated to ensure functional equivalence.</li><li><strong>Minimal User Effort</strong> — Since we rely on teams who are not database experts to perform migrations, the process must be simple, intuitive, and fully self-guided.</li></ul><h3>AWS-Recommended Migration Techniques</h3><h4>Using a snapshot</h4><p>One of the simplest AWS-recommended approaches for migrating from RDS PostgreSQL to Aurora PostgreSQL is based on snapshots. In this model, write traffic to the source PostgreSQL database is first stopped. A manual snapshot of the RDS PostgreSQL instance is then taken and migrated to Aurora, where AWS converts it into an Aurora-compatible format.<br> <br>Once the conversion completes, a new Aurora PostgreSQL cluster is created from the snapshot. After the cluster is brought online and validated, application traffic is redirected to the Aurora endpoint, completing the migration.</p><p><a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Migrating.RDSPostgreSQL.Import.Console.html">Reference</a></p><h4>Using an Aurora read replica</h4><p>In the read-replica–based approach, an Aurora PostgreSQL read replica is created from an existing RDS PostgreSQL instance. 
AWS establishes continuous, asynchronous replication from the RDS source to the Aurora replica, allowing ongoing changes to be streamed in near real time.</p><p>Because replication runs continuously, the Aurora replica remains closely synchronized with the source database. This enables teams to provision and validate the Aurora environment — including configuration, connectivity, and performance characteristics — while production traffic continues to flow to the source.</p><p>When the replication lag is sufficiently low, write traffic is briefly paused to allow the replica to fully catch up. The Aurora read replica is then promoted to a standalone Aurora PostgreSQL cluster, and application traffic is redirected to the new Aurora endpoint. This approach significantly reduces downtime compared to snapshot-based migrations and is well-suited for production systems that require minimal disruption.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/647/1*_xydKq_VWpaxyqcwCwQyuA.png"><figcaption>Migration Strategy Trade-Offs</figcaption></figure><p>These differences represent the key considerations when choosing a migration strategy from RDS PostgreSQL to Aurora PostgreSQL. 
For our automation, we opted for the Aurora Read Replica approach, trading increased implementation complexity for a significantly shorter downtime window for client applications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/682/1*SALMQ9wXHsJY9kBTmZ1D5w.png"><figcaption>Netflix RDS PostgreSQL Deployment Architecture</figcaption></figure><p>In Netflix’s RDS setup, a <a href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6">Data Access Layer</a> (DAL) sits between applications and backend databases, acting as middleware that centralizes database connectivity, security, and traffic routing on behalf of client applications.</p><p>On the client side, applications connect through a forward proxy that manages mutual TLS (mTLS) authentication and establishes a secure tunnel to the Data Gateway service. The Data Gateway, acting as a reverse proxy for database servers, terminates client connections, enforces centralized authentication and authorization, and forwards traffic to the appropriate RDS PostgreSQL instance.</p><p>This layered design ensures that applications never handle raw database credentials, provides a consistent and secure access pattern across all datastore types, and delivers isolated, transparent connectivity to managed PostgreSQL clusters. While the primary goal of this architecture is to enforce strong security controls and standardize how applications access external AWS data stores, it also allows backend databases to be switched transparently via configuration, enabling controlled, low-downtime migrations.</p><h3>Migration Process</h3><p>The Platform team’s goal is to deliver a fully automated, self-service workflow that helps with the migration of customer RDS PostgreSQL instances to Aurora PostgreSQL clusters. 
This migration tool orchestrates the entire process — from preparing the source environment, initializing the Aurora read replica, and maintaining continuous synchronization, all the way through to cutover — without requiring any database credentials or manual intervention from the customer.</p><p>Designed for minimal downtime and seamless user experience, the workflow ensures full ecosystem parity between RDS and Aurora, preserving performance characteristics and operational behavior while enabling customers to benefit from Aurora’s improved scalability, resilience, and cost efficiency.</p><h3>Data Replication Phase</h3><h4>Enable Automated Backups</h4><p>Automated backups must be enabled on the source database because the Aurora <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.Replication.ReadReplicas.Configuration.html">read replica</a> is initialized from a consistent snapshot of the source and then kept in sync through continuous replication. Automated backups provide the stable snapshot required to bootstrap the replica, along with the continuous streaming of write-ahead log (WAL) records needed to keep the read replica closely synchronized with the source.</p><h4>Port RDS parameters to an Aurora parameter group</h4><p>We create a dedicated Aurora parameter group for each cluster and migrate all RDS-compatible parameters from the source RDS instance. This ensures that the Aurora cluster inherits the same configuration settings — such as memory configuration, connection limits, query planner behavior, and other PostgreSQL engine parameters that have equivalents in Aurora. Parameters that are unsupported or behave differently in Aurora are either omitted or adjusted according to Aurora best practices.</p><h4>Create an Aurora read replica cluster and instance</h4><p>Creating an Aurora read replica cluster is a critical step in migrating from RDS PostgreSQL to Aurora PostgreSQL. 
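</p><p>The parameter-porting step just described reduces to filtering plus targeted overrides. A sketch of that logic (plain dicts and sets here; a real boto3 workflow would source them from describe_db_parameters on the RDS instance and describe_db_cluster_parameters on the Aurora side):</p>

```python
def port_parameters(source_params, aurora_supported, overrides=None):
    """Keep source parameters that have an Aurora equivalent, drop the
    rest, then apply Aurora-specific best-practice adjustments."""
    ported = {k: v for k, v in source_params.items() if k in aurora_supported}
    ported.update(overrides or {})    # adjusted values for Aurora
    return ported

rds_params = {
    "max_connections": "500",
    "archive_timeout": "300",
    "some_rds_only_knob": "on",   # no Aurora equivalent in this toy example
}
aurora_params = port_parameters(
    rds_params,
    aurora_supported={"max_connections", "archive_timeout"},
)
assert aurora_params == {"max_connections": "500", "archive_timeout": "300"}
```

<p>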
At this stage, the Aurora cluster is created and attached to the RDS PostgreSQL primary as a replica, establishing continuous replication from the source RDS PostgreSQL instance. These Aurora read replicas stay nearly in sync with ongoing changes by streaming write-ahead logs (WAL) from the source, enabling minimal downtime during cutover. The cluster is fully operational for validation and performance testing, but it is not yet writable — RDS remains the authoritative primary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a3JN0YW1wuYTuez8oWw5SA.png"></figure><h4>Quiescence Phase</h4><p>The goal of the quiescence phase is to transition client applications from the source RDS PostgreSQL instance to the Aurora PostgreSQL cluster as the new primary database, while preserving data consistency during cutover.</p><p>The first step in this process is to stop all write traffic to the source RDS PostgreSQL instance to guarantee consistency. To achieve this, we instruct users to halt application-level traffic, which helps prevent issues such as retry storms, queue backlogs, or unnecessary resource consumption when connectivity changes during cutover. This coordination also gives teams time to prepare operationally, for example, by suppressing alerts, notifying downstream consumers, or communicating planned maintenance to their customers.</p><p>However, relying solely on application-side controls is unreliable. Operational gaps, misconfigurations, or lingering connections can still modify the source database state, potentially resulting in changes that are not replicated to the destination and leading to data inconsistency or loss. To enforce a clean and deterministic cutover, we also block traffic at the infrastructure layer. This is done by detaching the RDS instance’s security groups to prevent new inbound connections, followed by a reboot of the instance. 
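</p><p>Sketched with boto3-style calls (the client, instance identifier, and deny-all security group below are hypothetical stand-ins, and real automation would add waiters and error handling):</p>

```python
def quiesce_source(rds, instance_id, lockdown_sg):
    """Block traffic at the infrastructure layer: move the instance onto
    a deny-all security group (a stand-in for the security-group detach
    described above), then reboot to terminate existing sessions.
    `rds` is a boto3-style RDS client."""
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        VpcSecurityGroupIds=[lockdown_sg],   # no inbound rules => no new sessions
        ApplyImmediately=True,
    )
    rds.reboot_db_instance(DBInstanceIdentifier=instance_id)

class StubRDS:
    """Recording stub standing in for boto3.client("rds")."""
    def __init__(self):
        self.calls = []
    def modify_db_instance(self, **kwargs):
        self.calls.append(("modify_db_instance", kwargs))
    def reboot_db_instance(self, **kwargs):
        self.calls.append(("reboot_db_instance", kwargs))

stub = StubRDS()
quiesce_source(stub, "prod-db-1", "sg-deny-all")
assert [name for name, _ in stub.calls] == [
    "modify_db_instance", "reboot_db_instance"]
```

<p>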
With security groups removed, no new SQL sessions can be established, and the reboot forcibly terminates any existing connections.</p><p>This approach intentionally avoids requiring database credentials or logging into the PostgreSQL server to manually terminate connections. While it may be slower than application- or database-level intervention, it provides a reliably automated and repeatable mechanism to fully quiesce the source RDS PostgreSQL instance before Aurora promotion, eliminating the risk of divergent writes or an inconsistent WAL state.</p><h4>Validation Phase</h4><p>To determine whether the Aurora read replica has fully caught up with the source RDS PostgreSQL instance, we track replication progress using Aurora’s OldestReplicationSlotLag metric. This metric represents how far the Aurora replica is behind the source in applying write-ahead log (WAL) records.</p><p>Once client traffic is halted during quiescence, the source RDS PostgreSQL instance stops producing meaningful WAL entries. At that point, the replication lag should converge to zero, indicating that all WAL records corresponding to real writes have been fully replayed on Aurora.</p><p>However, in practice, our experiments show that the metric never settles at a steady zero. Instead, it briefly drops to <strong>0</strong>, then quickly returns to <strong>64 MB</strong>, repeating this pattern every few minutes as shown in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/674/1*0pgUo0X6RiwoeATejg6NfQ.png"><figcaption>OldestReplicationSlotLag</figcaption></figure><p>This behavior stems from how OldestReplicationSlotLag is calculated. 
Internally, the lag is derived using the following query:</p><pre>SELECT<br>  slot_name,<br>  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS slot_lag_bytes<br>FROM pg_replication_slots;</pre><p>Conceptually, this translates to:</p><pre>OldestReplicationSlotLag = current_WAL_position_on_RDS <br>                           – restart_lsn </pre><p>See AWS references <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PostgreSQL.Replication.ReadReplicas.Monitor.html">here</a> and <a href="https://repost.aws/knowledge-center/rds-postgresql-use-logical-replication">here</a>.</p><p>The <a href="https://www.morling.dev/blog/postgres-replication-slots-confirmed-flush-lsn-vs-restart-lsn/#:~:text=And%20this%20is%20exactly%20the,has%20a%20few%20important%20implications:"><em>restart_lsn</em></a> represents the oldest write-ahead log (WAL) record that PostgreSQL must retain to ensure a replication consumer can safely resume replication.</p><p>When PostgreSQL performs a WAL segment switch, Aurora typically catches up almost immediately. At that moment, the restart_lsn briefly matches the source’s current WAL position, causing the reported lag to drop to 0. During idle periods, PostgreSQL performs an empty WAL segment rotation approximately every five minutes, driven by the archive_timeout = 300s setting in the database parameter group.</p><p>Immediately afterward, PostgreSQL begins writing to the new WAL segment. Since this new segment has not yet been fully flushed or consumed by Aurora, the WAL position in source RDS PostgreSQL advances ahead of the restart_lsn of Aurora PostgreSQL by exactly one segment. As a result, OldestReplicationSlotLag jumps to 64 MB, which corresponds to the configured WAL segment size at database initialization, and remains there until the next segment switch occurs.</p><p>Because idle PostgreSQL performs an empty WAL rotation approximately every five minutes, this zero-then-64 MB oscillation is expected. 
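</p><p>This observation yields a simple catch-up test once the source is quiesced (a sketch of the decision logic, not production monitoring code):</p>

```python
WAL_SEGMENT_BYTES = 64 * 1024 * 1024   # the 64 MB segment size discussed above

def replica_caught_up(lag_samples_bytes):
    """Given OldestReplicationSlotLag samples collected *after* the
    source is quiesced: idle segment rotations make the metric oscillate
    between 0 and exactly one WAL segment, so observing any 0 sample
    means all meaningful WAL has been replayed. A lag that never reaches
    0 means the replica is still catching up."""
    return any(lag == 0 for lag in lag_samples_bytes)

# expected idle oscillation: caught up
assert replica_caught_up([WAL_SEGMENT_BYTES, 0, WAL_SEGMENT_BYTES, 0])
# replica still replaying a multi-segment backlog: keep waiting
assert not replica_caught_up([3 * WAL_SEGMENT_BYTES, 2 * WAL_SEGMENT_BYTES])
```

<p>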
Importantly, the moment when the lag drops to 0 indicates that all meaningful WAL records have been fully replicated, and the Aurora read replica is fully caught up with the source.</p><h4>Cutover Phase</h4><p>Once the Aurora read replica has fully caught up with the source RDS PostgreSQL instance — as confirmed through replication lag analysis — the final step is to promote the replica and redirect application traffic. Promoting the Aurora read replica converts it into an independent, writable Aurora PostgreSQL cluster with its own writer and reader endpoints. At this point, the source RDS PostgreSQL instance is no longer the authoritative primary and is made inaccessible.</p><p>Because Netflix’s RDS ecosystem is fronted by a Data Access Layer (DAL), consisting of client-side forward proxies and a centralized Data Gateway, switching databases does not require application code changes or database credential access. Instead, traffic redirection is handled entirely through configuration updates in the reverse-proxy layer. Specifically, we update the runtime configuration of the Envoy-based Data Gateway to route traffic to the newly promoted Aurora cluster. Once this configuration change propagates, all client-initiated database connections are transparently routed through the DAL to the Aurora writer endpoint, completing the migration without requiring any application changes.</p><p>This proxy-level cutover, combined with Aurora promotion, enables a seamless transition for service owners, minimizes downtime, and preserves data consistency throughout the migration process.</p><h3>Customer Experience: Migrating a Business-Critical Partner Platform</h3><p>One of the critical teams to adopt the RDS PostgreSQL to Aurora PostgreSQL migration workflow was the Enablement Applications team. This team owns a set of databases that model Netflix’s entire ecosystem of partner integrations, including device manufacturers, discovery platforms, and distribution partners. 
These databases power a suite of enterprise applications that partners worldwide rely on to build, test, certify, and launch Netflix experiences on their devices and services.</p><p>Because these databases sit at the center of Netflix’s partner enablement and certification workflows, they are consumed by a diverse set of client applications across both internal and external organizations. <strong>Internally</strong>, reliability teams use this data to identify streaming failures for specific devices and configurations, supporting quality improvements across the device ecosystem. At the same time, these databases directly serve <strong>external</strong> partners operating across many regions. Device manufacturers rely on them to configure, test, and certify new hardware, while payment partners use them to set up and launch bundled offerings with Netflix.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DJHayui83MkdSMYYd4TIQw.png"><figcaption>Simplified Enablement Applications Overview</figcaption></figure><p><strong>Device Lifecycle Management</strong></p><p>Netflix works with a wide range of device partners to ensure Netflix streams seamlessly across a diverse ecosystem of consumer devices. A core responsibility of Device Lifecycle Management is to provide tools and workflows that allow partners to develop, test, and certify Netflix integrations on their devices.</p><p>As part of the device lifecycle, partners run Netflix-provided test suites against their NRDP implementation. We store <strong>signals that represent the current stage for each device in the certification process</strong>. This certification data forms the backbone of Netflix’s device enablement program, ensuring that only validated devices can launch Netflix experiences.</p><p><strong>Partner Billed Integrations</strong></p><p>In addition to device enablement, the same partner metadata is also consumed by Netflix’s Partner Billed Integrations organization. 
This group enables external partners to offer Netflix as part of bundled subscription and billing experiences.</p><p>Any disruption in these databases affects partner integration workflows. If the database is unavailable, partners may be unable to configure or launch service bundles with Netflix. Maintaining high availability and data correctness is essential to preserving smooth integration operations.</p><p>The global nature of these workflows makes it difficult to schedule downtime windows. Any disruption would impact partner productivity and risk eroding trust in Netflix’s integration and certification processes.</p><h3>Preparation</h3><p>Given the criticality of the Enablement Applications databases, thorough preparation was essential before initiating the migration. The team invested significant effort upfront to understand traffic patterns, identify all consumers, and establish clear communication channels.</p><p><strong>Understand Client Fan-Out and Traffic Patterns<br></strong>The first step was to gain a complete view of how the databases were being used in production. Using observability tools like CloudWatch metrics, the team analyzed PostgreSQL connection counts, read and write patterns, and overall load characteristics. This helped establish a baseline for normal behavior and ensured there were no unexpected traffic spikes or hidden dependencies that could complicate the migration.</p><p>Just as importantly, this baseline gave the Enablement Applications team a rough idea of the post-migration behavior on Aurora. 
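</p><p>As an illustration of this baselining step, the sketch below pulls the standard <code>DatabaseConnections</code> CloudWatch metric for an RDS instance and reduces it to a simple baseline. This is not Netflix’s actual tooling; the helper names are illustrative assumptions.</p>

```python
import datetime


def fetch_connection_datapoints(instance_id, hours=24):
    """Fetch DatabaseConnections datapoints for an RDS instance (requires AWS credentials)."""
    import boto3  # only needed for the live fetch, not for summarize() below

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - datetime.timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    return response["Datapoints"]


def summarize(datapoints):
    """Reduce CloudWatch datapoints to a baseline for pre/post-migration comparison."""
    averages = [d["Average"] for d in datapoints]
    maximums = [d["Maximum"] for d in datapoints]
    return {
        "avg_connections": sum(averages) / len(averages),
        "peak_connections": max(maximums),
    }
```

<p>Running the same summary against the promoted Aurora cluster after cutover gives a direct before/after comparison of connection counts and load.</p><p>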
For example, they expected to see a similar number of active database connections and comparable traffic patterns after cutover, making it easier to validate that the migration had preserved operational characteristics.</p><p><strong>Identify and Enumerate All Database Consumers<br></strong>Unlike most databases, where the set of consumers is well known to the owning team, these databases were accessed by a wide range of internal services and external-facing systems that were not fully enumerated upfront. To address this, we leveraged flowlogs, an eBPF-based network attribution tool that captures TCP flow data, to identify the services and applications establishing connections to the database (<a href="https://netflixtechblog.com/how-netflix-accurately-attributes-ebpf-flow-logs-afe6d644a3bc">link</a>).<br> <br>This approach allowed the team to enumerate active consumers, including those that were not previously documented, ensuring no clients were missed during migration planning.</p><p><strong>Establish Dedicated Communication Channels<br></strong>Once all consumers were identified, a dedicated communication channel was created to provide continuous updates throughout the migration process. This channel was used to share timelines, readiness checks, status updates, and cutover notifications, ensuring that all stakeholders remained aligned and could respond quickly if issues arose.</p><h3>Migration Process</h3><p>After completing application-side preparation, the Enablement Applications team initiated the data replication phase of the migration workflow.
The automation successfully provisioned the Aurora read replica cluster and ported the RDS PostgreSQL parameter group to a corresponding Aurora parameter group, bringing the destination environment up with equivalent configuration.</p><h4><strong>Unexpected Replication Slot Behavior</strong></h4><p>However, shortly after replication began, we observed that the OldestReplicationSlotLag metric was unexpectedly high. This was counterintuitive, as Aurora read replicas are designed to remain closely synchronized with the source database by continuously streaming write-ahead logs (WAL).</p><p>Further investigation revealed the presence of an inactive logical replication slot on the source RDS PostgreSQL instance. An inactive replication slot can cause elevated OldestReplicationSlotLag because PostgreSQL must retain all WAL records required by the slot’s last known position (restart_lsn), even if no client is actively consuming data from it. Replication slots are intentionally designed to prevent data loss by ensuring that a consumer can resume replication from where it left off. As a result, PostgreSQL will not recycle or delete WAL segments needed by a replication slot until the slot advances. When a slot becomes inactive — such as when a client migration task is stopped or abandoned — the slot’s position no longer moves forward. Meanwhile, the database continues to generate WAL, forcing PostgreSQL to retain increasingly older WAL files. This growing gap between the current WAL position and the slot’s restart_lsn manifests as a high OldestReplicationSlotLag.</p><p>Identifying and addressing these inactive replication slots was a critical prerequisite to proceeding safely with the migration and ensuring accurate replication state during cutover.</p><p><strong>Successful Migration After Remediation<br> </strong>After identifying the inactive logical replication slot, the team safely cleaned it up on the source RDS PostgreSQL instance and resumed the migration workflow. 
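</p><p>For reference, this investigation can be reproduced with standard PostgreSQL catalog queries. The SQL strings below use real catalog views and functions; the Python helper simply mirrors the byte arithmetic of <code>pg_wal_lsn_diff()</code> for illustration, and any slot names involved are hypothetical.</p>

```python
# Standard PostgreSQL catalog query: list inactive slots and the WAL they pin.
FIND_INACTIVE_SLOTS = """
SELECT slot_name, slot_type, active, restart_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
FROM pg_replication_slots
WHERE NOT active;
"""

# Once a slot is confirmed stale, it is removed with pg_drop_replication_slot.
DROP_SLOT = "SELECT pg_drop_replication_slot('<slot_name>');"


def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '1A/2B3C4D50' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """WAL the server must retain because a slot is pinned at restart_lsn."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)
```

<p>The growing gap reported by <code>retained_wal_bytes</code> is exactly what surfaces as a rising OldestReplicationSlotLag when a slot stops advancing.</p><p>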
With the stale slot removed, replication progressed as expected, and the Aurora read replica quickly converged with the source. The migration then proceeded smoothly through the quiescence phase, with no unexpected behavior or replication anomalies observed.</p><p>Following promotion, application traffic transitioned seamlessly to the newly writable Aurora PostgreSQL cluster. Through the Data Access Layer, new client connections were automatically routed to Aurora, and observability metrics confirmed healthy behavior — connection counts, read/write patterns, and overall load closely matched pre-migration baselines. From the application and partner perspective, the cutover was transparent, validating both the correctness of the migration workflow and the effectiveness of the preparation steps.</p><h3>Open questions</h3><h4>How do we select target Aurora PostgreSQL instance types based on the existing production RDS PostgreSQL instance?</h4><p>When selecting the target Aurora PostgreSQL instance type for a production migration, our guidance is intentionally conservative. We prioritize stability and performance first, and optimize for cost only after observing real workload behavior on Aurora.</p><p>In practice, the recommended approach is to adopt Graviton2-based instances (particularly the <em>r6g</em> family) whenever possible, maintain the same instance family and size where feasible, and — at minimum — preserve the memory footprint of the existing RDS instance.</p><p>Unlike RDS PostgreSQL, Aurora does not support the <em>m</em>-series, making a direct family match impossible for those instances. In such cases, simply keeping the same “size” (e.g., 2xlarge → 2xlarge) is not meaningful because the memory profiles differ across families. Instead, we map instances by memory equivalence. For example, an Aurora <em>r6g.xlarge</em> provides a memory footprint comparable to an RDS <em>m5.2xlarge</em>, making it a practical replacement. 
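</p><p>The memory-equivalence mapping can be sketched as a lookup over instance specs. The memory figures below match public AWS specifications, but the catalog is a small illustrative subset, not an authoritative sizing table.</p>

```python
# Memory (GiB) for a few instance types, per public AWS specs (illustrative subset).
RDS_MEMORY_GIB = {"db.m5.xlarge": 16, "db.m5.2xlarge": 32, "db.m5.4xlarge": 64}
AURORA_R6G_MEMORY_GIB = {
    "db.r6g.large": 16,
    "db.r6g.xlarge": 32,
    "db.r6g.2xlarge": 64,
    "db.r6g.4xlarge": 128,
}


def pick_aurora_instance(rds_instance: str) -> str:
    """Smallest r6g instance whose memory is at least the source RDS instance's memory."""
    needed = RDS_MEMORY_GIB[rds_instance]
    for name, memory in sorted(AURORA_R6G_MEMORY_GIB.items(), key=lambda kv: kv[1]):
        if memory >= needed:
            return name
    raise ValueError(f"no r6g instance large enough for {rds_instance}")
```

<p>For example, this maps <code>db.m5.2xlarge</code> (32 GiB) to <code>db.r6g.xlarge</code> (32 GiB), matching the memory-aligned guidance above.</p><p>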
This memory-aligned strategy offers a safer and more predictable baseline for production migrations.</p><h4><strong>Downtime During RDS → Aurora Cutover?</strong></h4><p>To achieve minimal downtime during an RDS PostgreSQL → Aurora PostgreSQL migration, we front-load as much work as possible into the preparation phase. By the time we reach cutover, the Aurora read replica is already provisioned and continuously replicating WAL from the source RDS instance. Before initiating downtime, we ensure that the replication lag between Aurora and RDS has stabilized within an acceptable threshold. If the lag is large or fluctuating significantly, forcing a cutover will only inflate downtime.</p><p>Downtime begins the moment we remove the security groups from the source RDS instance, blocking all inbound traffic. We then reboot the instance to forcibly terminate existing connections, which typically takes up to a minute. From this point forward, no writes can be performed.</p><p>After traffic is halted, the next objective is to verify that Aurora has fully replayed all meaningful WAL records from RDS. We track this using <strong>OldestReplicationSlotLag</strong>. We first wait for the metric to drop to <strong>0</strong>, indicating that Aurora has consumed all WAL with real writes. Under normal idle behavior, PostgreSQL triggers an empty WAL switch every five minutes. After observing one data point at 0, we wait for an additional idle WAL rotation and confirm that the lag oscillates within the expected <strong>0 → 64 MB</strong> pattern — signifying that the only remaining WAL segments are empty ones produced during idle time. At this point, we know the Aurora replica is fully caught up and can be safely promoted.</p><p>While these validation steps run, we perform the configuration updates on the Envoy reverse proxy in parallel. 
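</p><p>The lag validation just described can be condensed into a small check over successive OldestReplicationSlotLag samples: the lag must first touch 0, after which it may only oscillate within one empty 64 MB WAL segment. The helper below is an illustrative sketch of that rule, not our actual automation.</p>

```python
WAL_SEGMENT_BYTES = 64 * 1024 * 1024  # the 0 -> 64 MB idle pattern described above


def safe_to_promote(lag_samples_bytes):
    """True once lag has hit 0 and every later sample stays within one idle WAL segment."""
    try:
        first_zero = lag_samples_bytes.index(0)
    except ValueError:
        return False  # replica never fully caught up
    after = lag_samples_bytes[first_zero:]
    # Require at least one additional sample after the first 0 (one more idle rotation).
    return len(after) >= 2 and all(0 <= s <= WAL_SEGMENT_BYTES for s in after)
```

<p>Feeding this check samples taken after traffic is halted reproduces the wait-for-zero-then-idle-rotation decision described above.</p><p>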
Once promotion completes and Envoy is restarted with the new runtime configuration, all client-initiated connections begin routing to the Aurora cluster. In practice, the total write-downtime observed across services averages <strong>around 10 minutes</strong>, dominated largely by the RDS reboot and the idle WAL switch interval.</p><p><strong>Optimization: Reducing Idle-Time Wait</strong></p><p>For services requiring stricter downtime budgets, waiting the full five minutes for an idle WAL switch can be prohibitively expensive. In such cases, we can force a WAL rotation immediately after traffic is cut off by issuing:</p><p>SELECT pg_switch_wal();</p><p>Once the switch occurs, OldestReplicationSlotLag will drop to 0 again as Aurora consumes the new (empty) WAL segment. This approach eliminates the need to wait for the default archive_timeout interval, which can significantly reduce overall downtime.</p><h4>How do we migrate CDC consumers?</h4><p>As part of the data platform organization at Netflix, we provide a managed Change Data Capture (CDC) service across a variety of datastores. For PostgreSQL, logical replication slots are the mechanism for implementing change data capture. At Netflix, we build a managed abstraction on top of these replication slots, called <strong>datamesh</strong>, to manage the consumers that leverage them (<a href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873">link</a>).</p><p>Each logical replication slot tracks a consumer’s position in the write-ahead log (WAL), ensuring that WAL records are retained until the consumer has successfully processed them. This guarantees ordered and reliable delivery of row-level changes to downstream systems.
At the same time, it tightly couples the lifecycle of replication slots to database operations, making their management a critical consideration during database migrations.</p><p>A key challenge in migrating from RDS PostgreSQL to Aurora PostgreSQL is transitioning these CDC consumers safely — without data loss, stalled replication, or extended downtime — while ensuring that replication slots are correctly managed throughout the cutover process.</p><p>Each row-level change in PostgreSQL is emitted as a CDC event with an operation type of INSERT, UPDATE, DELETE, or REFRESH. REFRESH events are generated during backfills by querying the database directly and emitting the current state of rows in chunks. Downstream consumers are designed to be idempotent and eventually consistent, allowing them to safely process retries, replays, and backfills.</p><p><strong>Handling Replication Slots During Migration</strong></p><p>Before initiating database cutover, we temporarily pause CDC consumption by stopping the infrastructure responsible for consuming from PostgreSQL replication slots and writing into the datamesh source. This also drops the replication slot from the database and cleans up our internal state around replication slot offsets. This essentially resets the connector to the state of a brand-new one.</p><p>This step is critical for two reasons. First, it prevents replication slots from blocking WAL recycling during migration. Second, it ensures that no CDC consumers are left pointing at the source database once traffic is quiesced and cutover begins. While CDC consumers are paused, downstream systems temporarily stop receiving new change events, but remain stable.
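</p><p>The slot teardown before cutover, and the re-creation afterwards, rely on standard PostgreSQL replication-slot functions. The sketch below is illustrative: the slot name is hypothetical, and <code>pgoutput</code> (PostgreSQL’s built-in logical decoding plugin) stands in for whatever plugin the connector actually uses.</p>

```python
# Hypothetical slot name; the real one is managed by the CDC connector.
SLOT_NAME = "datamesh_cdc_slot"

# Before cutover: drop the slot on the source so it cannot pin WAL on RDS.
DROP_SLOT_SQL = f"SELECT pg_drop_replication_slot('{SLOT_NAME}');"

# After cutover: create a fresh logical slot on the promoted Aurora cluster.
CREATE_SLOT_SQL = (
    f"SELECT pg_create_logical_replication_slot('{SLOT_NAME}', 'pgoutput');"
)
```

<p>Running the drop on RDS before quiescence and the create on Aurora after promotion matches the pause-and-reinitialize sequence described in this section.</p><p>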
Once CDC consumers are paused, we proceed with stopping other client traffic and executing the RDS-to-Aurora cutover.</p><p><strong>Reinitializing CDC After Cutover</strong></p><p>After the Aurora PostgreSQL cluster has been promoted and traffic has been redirected, CDC consumers are reconfigured to point to the Aurora endpoint and restarted. Because their previous state was intentionally cleared, consumers initialize as if they are starting fresh.</p><p>On startup, new logical replication slots are created on Aurora, and a full backfill is performed by querying the database and emitting REFRESH events for all existing rows. These events signal to consumers that a manual refresh was performed from Aurora and should be treated as upsert operations. This establishes a clean and consistent baseline from which ongoing CDC can resume. Consumers are expected to handle these refresh events correctly as part of normal operation.</p><p>By explicitly managing PostgreSQL replication slots as part of the migration workflow, we are able to migrate CDC consumers safely and predictably, without leaving behind stalled slots, retained WAL, or consumers pointing to the wrong database. This approach allows CDC pipelines to be cleanly re-established on Aurora while preserving correctness and operational simplicity.</p><h4>How do we roll back in the middle of the process?</h4><p><strong>Pre-quiescence<br></strong>Rolling back before the quiescence phase is straightforward. At this stage, the primary RDS PostgreSQL instance continues to serve as the sole source of truth, and no client traffic has been redirected.</p><p>If a rollback is required, the migration can be safely aborted by deleting the newly created Aurora PostgreSQL cluster along with its associated parameter groups.
No changes are needed on the application side, and normal operations on RDS PostgreSQL can continue without impact.</p><p><strong>During-quiescence<br></strong>Rolling back during the quiescence phase is more involved. At this point, client traffic to the source RDS PostgreSQL instance has already been stopped by detaching its security groups. To roll back safely, access must first be restored by reattaching the original security groups to the RDS instance, allowing client connections to resume. In addition, any logical replication slots removed during the migration must be recreated so that CDC consumers can continue processing changes from the source database.</p><p>Once connectivity and replication slots are restored, the RDS PostgreSQL instance can safely resume its role as the primary source of truth.</p><p><strong>Post-quiescence <br></strong>Rolling back after cutover, once the Aurora PostgreSQL cluster is serving production traffic, is significantly more complex. At this stage, Aurora has become the primary source of truth, and client applications may already have written new data to it.</p><p>In this scenario, rollback requires setting up replication in the opposite direction, with Aurora as the source and RDS PostgreSQL as the destination. This can be achieved using a service such as AWS Database Migration Service (DMS). AWS provides detailed guidance for setting up this reverse replication flow, which can be followed to migrate data back to RDS if necessary.</p><h3>Conclusion</h3><p>Standardizing and reducing the surface area of data technologies is crucial for any large-scale platform. For the Netflix platform team, this strategy allows us to concentrate engineering effort, deliver deeper value on a smaller set of well-understood systems, and significantly cut the operational overhead of running multiple database technologies that serve similar purposes. 
Within the relational database ecosystem, Aurora PostgreSQL has become the paved-path datastore — offering strong scalability, resilience, and consistent operational patterns across the fleet.</p><p>Migrations of this scale demand solutions that are reliable, low-touch, and minimally disruptive for service owners. Our automated RDS PostgreSQL → Aurora PostgreSQL workflow represents a major step forward, providing predictable cutovers, strong correctness guarantees, and a migration experience that works uniformly across diverse workloads.</p><p>As we continue this journey, the Relational Data Platform team is building higher-level abstractions and capabilities on top of Aurora, enabling service owners to focus less on the complexities of database internals and more on delivering product value. More to come — stay tuned.</p><h3>Acknowledgements</h3><p>Special thanks to our other stunning colleagues/customers who contributed to the success of the RDS PostgreSQL to Aurora PostgreSQL migration. <a href="mailto:spasupuleti@netflix.com">Sumanth Pasupuleti</a>, <a href="mailto:coleantoniop@netflix.com">Cole Perez</a>, <a href="mailto:akhaku@netflix.com">Ammar Khaku</a></p><hr><p><a href="https://netflixtechblog.com/automating-rds-postgres-to-aurora-postgres-migration-261ca045447f">Automating RDS Postgres to Aurora Postgres Migration</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/automating-rds-postgres-to-aurora-postgres-migration-261ca045447f</link>
      <guid>https://netflixtechblog.com/automating-rds-postgres-to-aurora-postgres-migration-261ca045447f</guid>
      <pubDate>Thu, 12 Feb 2026 15:07:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[The AI Evolution of Graph Search at Netflix]]></title>
      <description><![CDATA[<h3>The AI Evolution of Graph Search at Netflix: From Structured Queries to Natural Language</h3><p>By <a href="https://www.linkedin.com/in/ahutter/">Alex Hutter</a> and <a href="https://www.linkedin.com/in/bartosz-balukiewicz/">Bartosz Balukiewicz</a></p><p>Our previous blog posts (<a href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf">part 1</a>, <a href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-part-2-49348511c06c">part 2</a>, <a href="https://netflixtechblog.com/reverse-searching-netflixs-federated-graph-222ac5d23576">part 3</a>) detailed how Netflix’s Graph Search platform addresses the challenges of searching across federated data sets within Netflix’s enterprise ecosystem. Although highly scalable and easy to configure, it still relies on a structured query language for input. Natural language based search has been possible for some time, but the level of effort required was high. The emergence of readily-available AI, specifically Large Language Models (LLMs), has created new opportunities to integrate AI search features, with a smaller investment and improved accuracy.</p><p>While Text-to-Query and Text-to-SQL are established problems, the complexity of distributed Graph Search data in the GraphQL ecosystem necessitates innovative solutions. This is the first in a three-part series where we will detail our journey: how we implemented these solutions, evaluated their performance, and ultimately evolved them into a self-managed platform.</p><h3>The Need for Intuitive Search: Addressing Business and Product Demands</h3><p>Natural language search is the ability to use everyday language to retrieve information as opposed to complex, structured query languages like the Graph Search Filter Domain Specific Language (DSL). 
When users interact with hundreds of UIs within the suite of Content and Business Products applications, a frequent task is filtering a data table like the one below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qujlm9wJ_fZTdOrG-EiBcA.png"><figcaption>Example Content and Business Products application view</figcaption></figure><p>Ideally, a user simply wants to satisfy a query like <strong>“I want to see all movies from the 90s about robots from the US.”</strong> Because the underlying platform operates on the Graph Search Filter DSL, the application acts as an intermediary. Users input their requirements through UI elements — toggling facets or using query builders — and the system programmatically converts these interactions into a valid DSL query to filter the data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/612/1*vCXVNhRXGbjLjseNY2vmlQ.png"><figcaption>The Complexity of filtering and DSL generation</figcaption></figure><p>This process presents a few issues.</p><p>Today, many applications have bespoke components for collecting user input — the experience varies across them and they have inconsistent support for the DSL. Users need to “learn” how to use each application to achieve their goals.</p><p>Additionally, some domains have hundreds of fields in an index that could be faceted or filtered by. A <em>subject matter expert</em> (SME) may know exactly what they want to accomplish, but be bottlenecked by the inefficient pace of filling out a large-scale UI form and translating their questions into the representation Graph Search requires.</p><p>Most importantly, users think and operate using natural language, not technical constructs like query builders, components, or DSLs.
By requiring them to switch contexts, we introduce friction that slows them down or even prevents their progress.</p><p>With readily-available AI components, our users can now interact with our systems through natural language. The challenge now is to make sure our offering, searching Netflix’s complex enterprise state with natural language, is an intuitive and trustworthy experience.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/615/1*EklLSPYRp9yrPP3wLW5qfg.png"><figcaption>Natural language queries translated into Graph Search Filter DSL</figcaption></figure><p>We’ve made a decision to pursue generating Graph Search Filter statements from natural language to meet this need. Our intention is to augment and not replace existing applications with <a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation">retrieval augmented generation</a> (RAG), providing tooling and capabilities so that applications in our ecosystem have newly accessible means of processing and presenting their data in their distinct domain flavours. It should be noted that all the work here has direct application to building a RAG system on top of Graph Search in the future.</p><h3>Under the Hood: Our Approach to Text-to-Query</h3><p>The core function of the text-to-query process is converting a user’s (often ambiguous) natural language question into a structured query. We primarily achieve this through the use of an LLM.</p><p>Before we dive deeper, let’s quickly revisit the structure of Graph Search Filter DSL. Each Graph Search index is <a href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf#:~:text=of%20configuration%20required.-,Configuration,-For%20collecting%20the">defined by a GraphQL query</a>, made up of a collection of fields. Each field has a type e.g. 
boolean, string, and some have their permitted values governed by controlled vocabularies — a standardized and governed list of values (like an enumeration, or a foreign key). The names of those fields can be used to construct expressions using comparison (e.g. &gt; or ==) or inclusion/exclusion operators (e.g. IN). In turn those expressions can be combined using logical operators (e.g. AND) to construct complex statements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/615/1*UbP1eIXDDqt5Q8nlCdqIKQ.png"><figcaption>Graph Search Filter DSL</figcaption></figure><p>With that understanding, we can now more rigorously define the conversion process. We need the LLM to generate a Graph Search Filter DSL statement that is syntactically, semantically, and pragmatically correct.</p><p><strong>Syntactic correctness</strong> is easy — does it parse? To be syntactically correct, the generated statement must be well formed<strong> </strong>i.e. follow the grammar of the Graph Search Filter DSL.</p><p><strong>Semantic correctness </strong>adds some additional complexity as it requires more knowledge of the index itself. To be semantically correct:</p><ul><li>it must respect the field types i.e. only use comparisons that make sense given the underlying type;</li><li>it must only use fields that are actually present in the index, i.e. does not <em>hallucinate;</em></li><li>when the values of a field are constrained to a controlled vocabulary, any comparison must only use values from that controlled vocabulary.</li></ul><p><strong>Pragmatic correctness</strong> is much more difficult. 
It asks the question: does the generated filter actually capture the intent of the user’s query?</p><p>The following sections will detail how we pre-process the user’s question to create appropriate context for the instructions that we will provide to the LLM — both of <a href="https://developers.google.com/machine-learning/resources/intro-llms">which are fundamental to LLM interaction</a> — as well as post-processing we perform on the generated statement to validate it, and help users understand and trust the results they receive.</p><p>At a high level that process looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/284/1*s34kKg5FI3TDd6UeXaQWOA.png"><figcaption>Graph Search Filter DSL generation process</figcaption></figure><h3>Context Engineering</h3><p>Preparation for the filter generation task is predominantly engineering the appropriate context. The LLM will need access to the fields of an index and their metadata in order to construct semantically correct filters. As the indices are defined by GraphQL queries, we can use the type information from the GraphQL schema to derive much of the required information. For some fields, there is additional information we can provide beyond what’s available in the schema as well, in particular permissible values that pull from controlled vocabularies.</p><p>Each field in the index is associated with metadata as seen below, and that metadata is provided as part of the context.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/655/1*Z34Ek0ib34dd600ZH9iT0Q.png"><figcaption>Graph Search index representation</figcaption></figure><ul><li>The <strong>field</strong> is derived from the document path as characterized by the GraphQL query.</li><li>The <strong>description</strong> is the comment from the GraphQL schema for the field.</li><li>The <strong>type</strong> is derived from the GraphQL schema for the field e.g. Boolean, String, enum.
We also support an additional controlled vocabulary type we will discuss more of shortly.</li><li>The <strong>valid values</strong> are derived from enum values for the enum type or from a controlled vocabulary as we will now discuss.</li></ul><p>A <em>controlled vocabulary</em> is a specific field type that consists of a finite set of allowed values, which are defined by SMEs or domain owners. Index fields can be associated with a particular controlled vocabulary, e.g. countries with members such as Spain and Thailand, and any usage of that field within a generated statement must refer to values from that vocabulary.</p><p>Naively providing all the metadata as context to the LLM worked for simple cases but did not scale. Some indices have hundreds of fields and some controlled vocabularies have thousands of valid values. Providing all of those, especially the controlled vocabulary values and their accompanying metadata, expands the context; this proportionally increases latency and decreases the correctness of generated filter statements. Not providing the values wasn’t an option as we needed to ground the LLM’s generated statements; without them, the LLM would frequently hallucinate values that did not exist.</p><p>Curating the context to an appropriate subset was a problem we addressed using the well-known RAG pattern.</p><h4>Field RAG</h4><p>As mentioned previously, some indices have hundreds of fields; however, most users’ questions typically refer only to a handful of them. If there were no cost in including them all, we would, but as mentioned prior, there is a cost in terms of the latency of query generation as well as the correctness of the generated query (e.g. 
needle-in-the-haystack problem) and non-deterministic results.</p><p>To determine which subset of fields to include in the context, we “match” them against the intent of the user’s question.</p><ul><li>Embeddings are created for index fields and their metadata (name, description, type) and are indexed in a vector store.</li><li>At filter generation time, the user’s question is chunked with an overlapping strategy. For each chunk, we perform a vector search to identify the top K most relevant values and the fields to which they belong.</li><li><strong>Deduplication:</strong> The top K fields from each chunk are both consolidated and deduplicated before being provided as context to the system instructions.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/599/1*_fkaUvg0ljRYJnws1xu_Rw.png"><figcaption>Field RAG process (chunking, merge, deduplicate)</figcaption></figure><h4>Controlled Vocabularies RAG</h4><p>Index fields of the controlled vocabulary type are associated with a particular controlled vocabulary; countries, again, are one example. Given a user’s question, we can infer whether or not it refers to values of a particular controlled vocabulary. In turn, by knowing which controlled vocabulary values are present, we can identify additional, related index fields that should be included in the context that may not have been identified by the field RAG step.</p><p>Each controlled vocabulary value has:</p><ul><li>a unique<strong> identifier</strong> within its type;</li><li>a human-readable <strong>display name;</strong></li><li>a <strong>description</strong> of the value;</li><li>also-known-as values or <strong>AKA</strong> display names, e.g. 
“romcom” for “Romantic Comedy”.</li></ul><p>To determine which subset of values to include in the context for controlled vocabulary fields (and also possibly infer additional fields), we “match” them against the user’s question.</p><ul><li>Embeddings are created for controlled vocabulary values and their metadata, and these are indexed in a vector store. The controlled vocabularies are available via GraphQL and are regularly fetched and reindexed so this system stays up to date with any changes in the domain.</li><li>At filter generation time, the user’s question is chunked. For each chunk, we perform a vector search to identify the top K most relevant values (but only for the controlled vocabularies that are associated with fields in the index)</li><li>The top K values from each chunk are deduplicated by their controlled vocabulary type. The associated field definition is then injected into the context along with the matched values.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/461/1*N70dw5GEqNDVDYPmpRmqUQ.png"><figcaption>Controlled Vocabularies RAG</figcaption></figure><p><strong>Combining both approaches, the RAG of fields and controlled vocabularies, we end up with the solution that each input question resolves in available and matched fields and values:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*hIALZVLkkOjZ5ePgsdVWIA.png"><figcaption>Field and CV RAG</figcaption></figure><p>The quality of results generated by the RAG tool can be significantly enhanced by tuning its various parameters, or “levers.” These include strategies for reranking, chunking, and the selection of different embedding generation models. The careful and systematic evaluation of these factors will be the focus of the subsequent parts of this series.</p><h3>The Instructions</h3><p>Once the context is constructed, it is provided to the LLM with a set of instructions and the user’s question. 
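</p><p>Constructing that context follows the chunk, match, and deduplicate flow from the previous sections. A minimal sketch is shown below; the chunking parameters and the pluggable vector-search callback are illustrative assumptions, not the production implementation.</p>

```python
def overlapping_chunks(question: str, size: int = 4, overlap: int = 2):
    """Split a question into overlapping word windows (illustrative parameters)."""
    words = question.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def retrieve_context(question, vector_search, top_k=5):
    """Take the top-K matches per chunk, then merge and deduplicate, preserving order."""
    seen, fields = set(), []
    for chunk in overlapping_chunks(question):
        for field in vector_search(chunk, top_k):  # caller supplies the vector store
            if field not in seen:
                seen.add(field)
                fields.append(field)
    return fields
```

<p>The same shape applies to both the field RAG and controlled-vocabulary RAG steps; only the vector store being queried differs.</p><p>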
The instructions can be summarised as follows: <strong>“<em>Given a natural language question, generate a syntactically, semantically, and pragmatically correct filter statement given the availability of the following index fields and their metadata</em>.”</strong></p><ul><li>In order to generate a <em>syntactically</em> correct filter statement, the instructions include the syntax rules of the DSL.</li><li>In order to generate a <em>semantically</em> correct filter statement, the instructions tell the LLM to ground the generated statement in the provided context.</li><li>In order to generate a <em>pragmatically</em> correct filter statement, we have so far focused on better context engineering to ensure that only the most relevant fields and values are provided. We haven’t identified any instructions that make the LLM just “do better” at this aspect of the task.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/893/1*gCfHD-yVJhpcj6zPYRhzqQ.png"><figcaption>Graph Search Filter DSL generation</figcaption></figure><p>After the filter statement is generated by the LLM, we deterministically validate it prior to returning the values to the user.</p><h3>Validation</h3><h4>Syntactic Correctness</h4><p>Syntactic correctness ensures the LLM output is a parsable filter statement. We utilize an Abstract Syntax Tree (AST) parser built for our custom DSL. If the generated string fails to parse into a valid AST, we know immediately that the query is malformed and there is a fundamental issue with the generation.</p><p>Another approach to this problem is to use the <a href="https://platform.openai.com/docs/guides/structured-outputs">structured outputs</a> modes provided by some LLMs. 
However, our initial evaluation yielded mixed results, as the custom DSL is not natively supported and requires further work.</p><h4>Semantic Correctness</h4><p>Despite careful context engineering using the RAG pattern, the LLM sometimes hallucinates both fields and available values in the generated filter statement. The most straightforward way of preventing this is to validate the generated filters against the available index metadata. This approach does not impact the overall latency of the system, as we are already working with an AST of the filter statement, and the metadata is freely available from the context engineering stage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/553/1*0LUUAK1G7CtwFDBZsAjgDA.png"><figcaption>DSL verification &amp; hallucinations</figcaption></figure><p>If a hallucination is detected, it can be returned as an error to the user, indicating the need to refine the query, or it can be provided back to the LLM in the form of a feedback loop for self-correction.</p><p>The latter increases the filter generation time, so it should be used cautiously, with a limited number of retries.</p><h3>Building Confidence</h3><p>You probably noticed we are not validating the generated filter for pragmatic correctness. That is the hardest challenge: the filter parses (<em>syntactic</em>) and uses real fields (<em>semantic</em>), but is it what the user meant? When a user searches for <strong>“Dark”</strong>, do they mean <strong>the specific German sci-fi series <em>Dark</em>,</strong> or are they browsing for the mood category <strong>“dark TV shows”</strong>?</p><p>The gap between what a user intended and the generated filter statement is often caused by ambiguity. Ambiguity stems from the <a href="https://en.wikipedia.org/wiki/Semantic_compression">compression of natural language</a>. 
A user says <strong>“German time-travel mystery with the missing boy and the cave”</strong> but the index contains <strong>discrete metadata fields</strong> like <strong>releaseYear</strong>, <strong>genreTags</strong>, and <strong>synopsisKeywords</strong>.</p><p>How do we ensure users aren’t inadvertently led to wrong answers or to answers for questions they didn’t ask?</p><h4>Showing Our Work</h4><p>One way we are handling ambiguity is by <em>showing our work</em>. We visualise the generated filters in the UI in a user-friendly way, allowing users to see clearly whether the answer we return is what they were looking for, so they can trust the results.</p><p>We cannot show a raw DSL string (e.g., <em>origin.country == ‘Germany’ AND genre.tags CONTAINS ‘Time Travel’ AND synopsisKeywords LIKE ‘*cave*’</em>) to a non-technical user. Instead, we reflect its underlying AST into UI components.</p><p>After the LLM generates a filter statement, we parse it into an AST, and then map that AST to the existing “Chips” and “Facets” in our UI (see below). If the LLM generates a filter for <em>origin.country == ‘Germany’</em>, the user sees the “Country” dropdown pre-selected to “Germany.” This gives users immediate visual feedback and the ability to easily fine-tune the query using standard UI controls when the results need improvement or further experimentation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_Gx0THjlWW9_jwb8KwCrYw.png"><figcaption>Generated filters visualisation</figcaption></figure><h4>Explicit Entity Selection</h4><p>Another strategy we’ve developed to remove ambiguity happens at query time. We give users the ability to constrain their input to refer to known entities using “@mentions”. 
Similar to Slack, typing @ lets them search for entities directly from our specialized UI Graph Search component, giving them easy access to multiple controlled vocabularies (plus other identifying metadata like launch year) so they can feel confident they’re choosing the entity they intend.</p><p>If a user types “When was <em>@dark</em> produced”, we explicitly know they are referring to the <em>Series</em> controlled vocabulary, allowing us to bypass the RAG inference step and hard-code that context, significantly increasing pragmatic correctness (and building user trust in the process).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u3YrDuaYJr4LlVB1286Xzg.png"><figcaption>Example @mentions usage in the UI</figcaption></figure><h3>End-to-end architecture</h3><p>As mentioned previously, the solution architecture is divided into <em>pre-processing</em>, filter statement generation, and <em>post-processing</em> stages. The pre-processing stage handles context building and involves a RAG pattern for similarity search, while the post-processing validation stage checks the correctness of the LLM-generated filter statements and provides visibility into the results for end users. 
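</p><p>The deterministic validation stage can be sketched as follows, using a simplified stand-in for the DSL grammar and hypothetical metadata; the real system parses the custom DSL into a full AST.</p>

```python
# Sketch of post-processing validation: parse a generated filter statement
# into (field, value) clauses, then check each field and controlled-vocabulary
# value against the index metadata gathered during context engineering.
# The grammar here is a simplified, hypothetical stand-in for the real DSL.

import re

CLAUSE = re.compile(r"(\w+(?:\.\w+)*)\s*==\s*'([^']*)'")

def parse(statement):
    """Parse `a == 'x' AND b == 'y'` into [(field, value), ...] clauses."""
    clauses = [CLAUSE.match(part.strip()) for part in statement.split("AND")]
    if not all(clauses):
        raise ValueError("malformed filter statement")  # syntactic failure
    return [(m.group(1), m.group(2)) for m in clauses]

def validate(statement, index_metadata):
    """Return hallucinated (field, reason) pairs; an empty list means valid."""
    errors = []
    for field, value in parse(statement):
        allowed = index_metadata.get(field)
        if allowed is None:
            errors.append((field, "unknown field"))
        elif allowed and value not in allowed:  # empty set = free-text field
            errors.append((field, f"unknown value '{value}'"))
    return errors

# Toy metadata: one controlled-vocabulary field, one free-text field.
metadata = {"origin.country": {"Germany", "France"}, "genre.tags": set()}
```

<p>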
This design strategically balances LLM involvement with more deterministic strategies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lw47a7h67i4EUmvhlRKcYQ.png"><figcaption>End-to-end architecture</figcaption></figure><p>The end-to-end process is as follows:</p><ol><li>A user’s natural language question (with optional “@mentions” statements) is provided as input, along with the Graph Search index context</li><li>The context is scoped by using the RAG pattern on both fields and possible values</li><li>The pre-processed context and the question are fed into the LLM with an instruction asking for <em>a syntactically and semantically correct filter statement</em></li><li>The generated filter statement DSL is verified and checked for hallucinations</li><li>The final response contains the related AST in order to build “Chips” and “Facets”</li></ol><h3>Summary</h3><p>By combining our existing Graph Search infrastructure with the power and flexibility of LLMs, we’ve bridged the gap between complex filter statements and user intent. We moved from requiring users to speak our language (DSL) to our systems understanding theirs.</p><p>The initial challenge for our users was successfully addressed. However, our next steps involve transforming this system into a comprehensive and expandable platform, rigorously evaluating its performance in a live production environment, and expanding its capabilities to support GraphQL-first user interfaces. These topics, and others, will be the focus of the subsequent installments in this series. Be sure to follow along!</p><p>You may have noticed that we have a lot more to do on this project, including named entity recognition and extraction, intent detection so we can route questions to the appropriate indices, and query rewriting, among others. If this kind of work interests you, reach out! 
We’re hiring in our Warsaw office, check for open roles <a href="https://explore.jobs.netflix.net/careers?location=Warsaw%2C%20Masovian%20Voivodeship%2C%20Poland&amp;pid=790302168096&amp;domain=netflix.com&amp;sort_by=relevance&amp;triggerGoButton=false">here</a>.</p><h3>Credits</h3><p>Special thanks to <a href="https://www.linkedin.com/in/quesadaalejandro/">Alejandro Quesada</a>, <a href="https://www.linkedin.com/in/yevgeniya-li-9877ba160/">Yevgeniya Li</a>, <a href="https://www.linkedin.com/in/dkyrii/">Dmytro Kyrii</a>, <a href="https://www.linkedin.com/in/razvan-gabriel-gatea/">Razvan-Gabriel Gatea</a>, <a href="https://www.linkedin.com/in/milodorif/">Orif Milod</a>, <a href="https://www.linkedin.com/in/michal-krol-45973411a/">Michal Krol</a>, <a href="https://www.linkedin.com/in/jeffbalis/">Jeff Balis</a>, <a href="https://www.linkedin.com/in/czhao/">Charles Zhao</a>, <a href="https://www.linkedin.com/in/shilpamotukuri/">Shilpa Motukuri</a>, <a href="https://www.linkedin.com/in/shervineamidi/">Shervine Amidi</a>, <a href="https://www.linkedin.com/in/aborysov/">Alex Borysov</a>, <a href="https://www.linkedin.com/in/mike-azar-7064883b/">Mike Azar</a>, <a href="https://www.linkedin.com/in/bernardo-g-4414b41/">Bernardo Gomez Palacio</a>, <a href="https://www.linkedin.com/in/haoyuan-h-98b587134/">Haoyun He</a>, <a href="https://www.linkedin.com/in/edyr96/">Eduardo Ramirez</a>, <a href="https://www.linkedin.com/in/yujiaxie2019/">Cynthia Xie</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=d416ec5b1151" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151">The AI Evolution of Graph Search at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151</link>
      <guid>https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151</guid>
      <pubDate>Mon, 26 Jan 2026 20:01:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How Temporal Powers Reliable Cloud Operations at Netflix]]></title>
      <description><![CDATA[<p>By <a href="https://www.linkedin.com/in/jacobmeyers35/">Jacob Meyers</a> and <a href="https://www.linkedin.com/in/robzienert/">Rob Zienert</a></p><p><a href="https://temporal.io/">Temporal</a> is a <a href="https://docs.temporal.io/evaluate/understanding-temporal#durable-execution">Durable Execution</a> platform which allows you to write code “as if failures don’t exist”. It’s become increasingly critical to Netflix since its initial adoption in 2021, with users ranging from the operators of our <a href="https://about.netflix.com/en/news/how-netflix-works-with-isps-around-the-globe-to-deliver-a-great-viewing-experience">Open Connect</a> global CDN to our <a href="https://medium.com/netflix-techblog/behind-the-streams-live-at-netflix-part-1-d23f917c2f40">Live</a> reliability teams now depending on Temporal to operate their business-critical services. In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the number of transient deployment failures at Netflix from <strong>4% to 0.0001%</strong>.</p><h3>A Crash Course on (some of) Spinnaker</h3><p><a href="https://netflixtechblog.com/global-continuous-delivery-with-spinnaker-2a6896c23ba7">Spinnaker</a> is a multi-cloud continuous delivery platform that powers the vast majority of Netflix’s software deployments. It’s composed of several (mostly nautical themed) microservices. Let’s double-click on two in particular to understand the problems we were facing that led us to adopting Temporal.</p><p>In case you’re completely new to Spinnaker, Spinnaker’s fundamental tool for deployments is the <em>Pipeline</em>. A Pipeline is composed of a sequence of steps called <em>Stages</em>, which themselves can be decomposed into one or more <em>Tasks</em>, or other Stages. 
An example deployment pipeline for a production service may consist of these stages: Find Image -&gt; Run Smoke Tests -&gt; Run Canary -&gt; Deploy to us-east-2 -&gt; Wait -&gt; Deploy to us-east-1.</p><figure><img alt="An example Spinnaker Pipeline" src="https://cdn-images-1.medium.com/max/1024/1*7sGhc8LhyqQlW9Uiq76TWQ.png"><figcaption>An example Spinnaker Pipeline for a Netflix service</figcaption></figure><p>Pipeline configuration is extremely flexible. You can have Stages run completely serially, one after another, or you can have a mix of concurrent and serial Stages. Stages can also be executed conditionally based on the result of previous stages. This brings us to our first Spinnaker service: <em>Orca</em>. Orca is the <a href="https://raw.githubusercontent.com/spinnaker/orca/refs/heads/master/logo.jpg">orca-stration</a> engine of Spinnaker. It’s responsible for managing the execution of the Stages and Tasks that a Pipeline unrolls into and coordinating with other Spinnaker services to actually execute them.</p><p>One of those collaborating services is called <em>Clouddriver</em>. In the example Pipeline above, some of the Stages will require interfacing with cloud infrastructure. For example, the canary deployment involves creating ephemeral hosts to run an experiment, and a full deployment of a new version of the service may involve spinning up new servers and then tearing down the old ones. We call these sorts of operations that mutate cloud infrastructure <em>Cloud Operations</em>. Clouddriver’s job is to decompose and execute Cloud Operations sent to it by Orca as part of a deployment. 
Cloud Operations sent from Orca to Clouddriver are relatively high level (for example: createServerGroup), so Clouddriver is responsible for translating them into lower-level cloud provider API calls.</p><p>Pain points in the interaction between Orca and Clouddriver and the implementation details of Cloud Operation execution in Clouddriver are what led us to look for new solutions and ultimately migrate to Temporal, so we’ll next look at the anatomy of a Cloud Operation. Cloud Operations in the OSS version of Spinnaker still work as described below, so motivated readers can follow along in <a href="https://github.com/spinnaker/clouddriver">source code</a>; however, our migration to Temporal is entirely closed-source, following a fork from OSS in 2020 to allow Netflix to make larger pivots to the product such as this one.</p><h4><strong>The Original Cloud Operation Flow</strong></h4><p>A Cloud Operation’s execution goes something like this:</p><ol><li>Orca, in orchestrating a Pipeline execution, decides a particular Cloud Operation needs to be performed. It sends a POST request to Clouddriver’s /ops endpoint with an untyped bag-of-fields.</li><li>Clouddriver attempts to resolve the operation Orca sent into a set of AtomicOperations — internal operations that only Clouddriver understands.</li><li>If the payload was valid and Clouddriver successfully resolved the operation, it will immediately return a Task ID to Orca.</li><li>Orca will immediately begin polling Clouddriver’s GET /task/&lt;id&gt; endpoint to keep track of the status of the Cloud Operation.</li><li>Asynchronously, Clouddriver begins executing AtomicOperations using <em>its own</em> internal orchestration engine. Ultimately, the AtomicOperations resolve into cloud provider API calls. 
As the Cloud Operation progresses, Clouddriver updates an internal state store to surface progress to Orca.</li><li>Eventually, if all went well, Clouddriver will mark the Cloud Operation complete, which surfaces to Orca in its polling. Orca considers the Cloud Operation finished, and the deployment can progress.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y57y00EsM2YGRph9IRNmLQ.png"><figcaption>A sequence diagram of a Cloud Operation execution</figcaption></figure><p>This works well enough on the happy path, but veer off the happy path and dragons begin to emerge:</p><ol><li>Clouddriver has its own internal orchestration system, independent of Orca, to allow Orca to query the progress of a Cloud Operation. This is largely undifferentiated lifting relative to Clouddriver’s goal of actuating cloud infrastructure changes, and ultimately adds complexity and surface area for bugs to the application. Additionally, Orca is tightly coupled to Clouddriver’s orchestration system — it must understand how to poll Clouddriver, interpret the status, and handle errors returned by Clouddriver.</li><li>Distributed systems are messy — networks and external services are unreliable. While executing a Cloud Operation, Clouddriver could experience transient network issues, or the cloud provider it’s attempting to call into may be having an outage, or any number of issues in between. Despite all of this, Clouddriver must be as reliable as reasonably possible as a core platform service. To deal with this shape of issue, Clouddriver internally evolved complex retry logic, further adding cognitive complexity to the system.</li><li>Remember how a Cloud Operation gets decomposed by Clouddriver into AtomicOperations? Sometimes, if there’s a failure in the middle of a Cloud Operation, we need to be able to roll back what was done in AtomicOperations prior to the failure. This led to a homegrown Saga framework being implemented inside Clouddriver. 
While this resulted in a big step forward in the reliability of Cloud Operations in the face of transient failures (the Saga framework <em>also</em> allowed replaying partially-failed Cloud Operations), it added yet more undifferentiated lifting inside the service.</li><li>The task state kept by Clouddriver was <em>instance-local</em>. In other words, if the Clouddriver instance carrying out a Cloud Operation crashed, that Cloud Operation state was lost, and Orca would eventually time out polling for the task status. The Saga implementation mentioned above mitigated this for certain operations, but was not widely adopted across all cloud providers supported by Spinnaker.</li></ol><p>We introduced a <em>lot</em> of incidental complexity into Clouddriver in an effort to keep Cloud Operation execution reliable, and despite all this, deployments still failed around 4% of the time due to transient Cloud Operation failures.</p><p>Now, I can already hear you saying: “So what? Can’t people re-try their deployments if they fail?” While true, some pipelines take <em>days</em> to complete for complex deployments, and a failed Cloud Operation mid-way through requires re-running the <em>whole</em> thing. This was detrimental to engineering productivity at Netflix in a non-trivial way. Rather than continue trying to build a faster horse, we began to look elsewhere for our reliable orchestration requirements, which is where Temporal comes in.</p><h3>Temporal: Basic Concepts</h3><p>Temporal is an open source product that offers a durable execution platform for your applications. Durable execution means that the platform will ensure your programs run to completion despite adverse conditions. With Temporal, you organize your business logic into <em>Workflows</em>, which are a deterministic series of steps. 
The steps inside of Workflows are called <em>Activities</em>, which is where you encapsulate all your non-deterministic logic that needs to happen in the course of executing your Workflows. As your Workflows execute in processes called <em>Workers</em>, the Temporal server durably stores their execution state so that in the event of failures your Workflows can be retried or even migrated to a different Worker. This makes Workflows incredibly resilient to the sorts of transient failures Clouddriver was susceptible to. Here’s a simple example Workflow in Java that runs an Activity to send an email once every 30 days:</p><pre>@WorkflowInterface<br>public interface SleepForDaysWorkflow {<br>    @WorkflowMethod<br>    void run();<br>}<br><br>public class SleepForDaysWorkflowImpl implements SleepForDaysWorkflow {<br><br>    private final SendEmailActivities emailActivities =<br>            Workflow.newActivityStub(<br>                    SendEmailActivities.class,<br>                    ActivityOptions.newBuilder()<br>                            .setStartToCloseTimeout(Duration.ofSeconds(10))<br>                            .build());<br><br>    @Override<br>    public void run() {<br>        while (true) {<br>            // Activities already carry retries/timeouts via options.<br>            emailActivities.sendEmail();<br><br>            // Pause the workflow for 30 days before sending the next email.<br>            Workflow.sleep(Duration.ofDays(30));<br>        }<br>    }<br>}<br><br>@ActivityInterface<br>public interface SendEmailActivities {<br>    void sendEmail();<br>}</pre><p>There are some interesting things to note about this Workflow:</p><ol><li>Workflows and Activities are just code, so you can test them using the same techniques and processes as the rest of your codebase.</li><li>Activities are automatically retried by Temporal with configurable exponential backoff.</li><li>Temporal manages all the execution state of the Workflow, including timers (like the 
one used by Workflow.sleep). If the Worker executing this workflow were to have its power cable unplugged, Temporal would ensure another Worker continues to execute it (even during the 30 day sleep).</li><li>Workflow sleeps are not compute-intensive, and they don’t tie up the process.</li></ol><p>You might already begin to see how Temporal solves a lot of the problems we had with Clouddriver. Ultimately, we decided to pull the trigger on migrating Cloud Operation execution to Temporal.</p><h3>Cloud Operations with Temporal</h3><p>Today, we execute Cloud Operations as Temporal workflows. Here’s what that looks like.</p><ol><li>Orca, using a Temporal client, sends a request to Temporal to execute an UntypedCloudOperationRunner Workflow. The contract of the Workflow looks something like this:</li></ol><pre>@WorkflowInterface<br>interface UntypedCloudOperationRunner {<br>  /**<br>   * Runs a cloud operation given an untyped payload.<br>   *<br>   * WorkflowResult is a thin wrapper around OutputType providing a standard contract for<br>   * clients to determine if the CloudOperation was successful and fetching any errors.<br>   */<br>  @WorkflowMethod<br>  fun &lt;OutputType : CloudOperationOutput&gt; run(stageContext: Map&lt;String, Any?&gt;, operationType: String): WorkflowResult&lt;OutputType&gt;<br>}</pre><p>2. The Clouddriver Temporal worker is constantly polling Temporal for work. A worker will eventually see a task for an UntypedCloudOperationRunner Workflow and start executing it.</p><p>3. 
Similar to before with resolution into AtomicOperations, Clouddriver does some pre-processing of the bag-of-fields in stageContext and resolves it to a strongly typed implementation of the CloudOperation Workflow interface based on the operationType input and the stageContext:</p><pre>interface CloudOperation&lt;I : CloudOperationInput, O : CloudOperationOutput&gt; {<br>  @WorkflowMethod<br>  fun operate(input: I, credentials: AccountCredentials&lt;out Any&gt;): O<br>}</pre><p>4. Clouddriver starts a <a href="https://docs.temporal.io/child-workflows">Child Workflow</a> execution of the CloudOperation implementation it resolved. The child workflow will execute Activities which handle the actual cloud provider API calls to mutate infrastructure.</p><p>5. Orca uses its Temporal Client to await completion of the UntypedCloudOperationRunner Workflow. Once it’s complete, Temporal notifies the client and sends the result, and Orca can continue progressing the deployment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*leM3bH8iyb65_cmtl3vm4A.png"><figcaption>Sequence diagram of a Cloud Operation execution with Temporal</figcaption></figure><h3>Results and Lessons Learned from the Migration</h3><p>A shiny new architecture is great, but equally important is the non-glamorous work of refactoring legacy systems to fit the new architecture. How did we integrate Temporal into critical dependencies of all Netflix engineers transparently?</p><p>The answer, of course, is a combination of abstraction and dynamic configuration. We built a CloudOperationRunner interface in Orca to encapsulate whether the Cloud Operation was being executed via the legacy path or Temporal. At runtime, <a href="https://netflixtechblog.com/announcing-archaius-dynamic-properties-in-the-cloud-bc8c51faf675">Fast Properties</a> (Netflix’s dynamic configuration system) determined which path a stage that needed to execute a Cloud Operation would take. 
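</p><p>Such runtime routing can be sketched as follows. The names and property keys are hypothetical, since the actual implementation is closed-source; the point is the most-specific-key-wins lookup against dynamic configuration.</p>

```python
# Sketch of routing Cloud Operations between the legacy path and Temporal
# via dynamic configuration. Hypothetical names; at Netflix this decision was
# driven by Fast Properties behind a CloudOperationRunner-style abstraction.

class LegacyRunner:
    def run(self, operation):
        return f"legacy:{operation}"

class TemporalRunner:
    def run(self, operation):
        return f"temporal:{operation}"

class CloudOperationRouter:
    """Pick a runner per operation, consulting dynamic properties from the
    most specific key (application + operation type) down to a global default."""

    def __init__(self, properties):
        self.properties = properties  # dynamic config, refreshable at runtime
        self.legacy, self.temporal = LegacyRunner(), TemporalRunner()

    def runner_for(self, application, operation_type):
        for key in (f"temporal.enabled.{application}.{operation_type}",
                    f"temporal.enabled.{application}",
                    "temporal.enabled"):
            if key in self.properties:
                return self.temporal if self.properties[key] else self.legacy
        return self.legacy  # no property set: default to the legacy path

# Example: Temporal globally off, but enabled for one operation of one app.
props = {"temporal.enabled": False, "temporal.enabled.myapp.createServerGroup": True}
router = CloudOperationRouter(props)
```

<p>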
We could set these properties quite granularly — by Stage type, cloud provider account, Spinnaker application, Cloud Operation type (createServerGroup), and cloud provider (either AWS or <a href="https://netflix.github.io/titus/">Titus</a> in our case). The Spinnaker services themselves were the first to be deployed using Temporal, and within two quarters, all applications at Netflix were onboarded.</p><h4>Impact</h4><p>What did we have to show for it all? With Temporal as the orchestration engine for Cloud Operations, the percentage of deployments that failed due to transient Cloud Operation failures dropped from 4% to 0.0001%. For those keeping track at home, that’s a four and a half order of magnitude reduction. Virtually eliminating this failure mode for deployments was a huge win for developer productivity, especially for teams with long and complex deployment pipelines.</p><p>Beyond the improvement in deployment success metrics, we saw a number of other benefits:</p><ol><li>Orca no longer needs to directly communicate with Clouddriver to start Cloud Operations or poll their status with Temporal as the intermediary. The services are less coupled, which is a win for maintainability.</li><li>Speaking of maintainability, with Temporal doing the heavy lifting of orchestration and retries inside of Clouddriver, we got to remove a lot of the homegrown logic we’d built up over the years for the same purpose.</li><li>Since Temporal manages execution state, Clouddriver instances became stateless and Cloud Operation execution can bounce between instances with impunity. We can treat Clouddriver instances more like cattle and enable things like <a href="https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116">Chaos Monkey</a> for the service which we were previously prevented from doing.</li><li>Migrating Cloud Operation steps into Activities was a forcing function to re-write the logic to be idempotent. 
Since Temporal retries activities by default, it’s generally recommended they be idempotent. This alone fixed a number of issues that existed previously when operations were retried in Clouddriver.</li><li>We set the retry timeout for Activities in Clouddriver to be two hours by default. This gives us a long leash to fix-forward or rollback Clouddriver if we introduce a regression before customer deployments fail — to them, it might just look like a deployment is taking longer than usual.</li><li>Cloud Operations are much easier to introspect than before. Temporal ships with a great UI to help visualize Workflow and Activity executions, which is a huge boon for debugging live Workflows executing in production. The Temporal SDKs and server also emit a lot of useful metrics.</li></ol><figure><img alt="A Cloud Operation Workflow as seen from the Temporal UI. This operation executes 3 Activities: DescribeAutoScalingGroup, GetHookConfigurations, and ResizeServerGroup" src="https://cdn-images-1.medium.com/max/1024/1*zmCyjwzTXji921mulJjmTw.png"><figcaption>Execution of a resizeServerGroup Cloud Operation as seen from the Temporal UI. This operation executes 3 Activities: DescribeAutoScalingGroup, GetHookConfigurations, and ResizeServerGroup</figcaption></figure><h4>Lessons Learned</h4><p>With the benefit of hindsight, there are also some lessons we can share from this migration:</p><p>1. <strong>Avoid unnecessary Child Workflows</strong>: Structuring Cloud Operations as an UntypedCloudOperationRunner Workflow that starts Child Workflows to actually execute the Cloud Operation’s logic was unnecessary and the indirection made troubleshooting more difficult. There are <a href="https://community.temporal.io/t/purpose-of-child-workflows/652">situations</a> where Child Workflows are appropriate, but in this case we were using them as a tool for code organization, which is generally unnecessary. 
We could’ve achieved the same effect with class composition in the top-level parent Workflow.</p><p>2. <strong>Use single argument objects</strong>: At first, we structured Workflow and Activity functions with variable arguments, much as you’d write normal functions. This can be problematic for Temporal because of Temporal’s <a href="https://community.temporal.io/t/workflow-determinism/4027">determinism constraints</a>. Adding or removing an argument from a function signature is <strong>not</strong> a backward-compatible change, and doing so can break long-running workflows — and it’s not immediately obvious in code review your change is problematic. The preferred pattern is to use a single serializable class to host all your arguments for Workflows and Activities — these can be more freely changed without breaking determinism.</p><p>3. <strong>Separate business failures from workflow failures</strong>: We like the pattern of the WorkflowResult type that UntypedCloudOperationRunner returns in the interface above. It allows us to communicate business process failures without failing the Workflow itself and have more overall nuance in error handling. This is a pattern we’ve carried over to other Workflows we’ve implemented since.</p><h3>Temporal at Netflix Today</h3><p>Temporal adoption has skyrocketed at Netflix since its initial introduction for Spinnaker. Today, we have hundreds of use cases, and we’ve seen adoption double in the last year with no signs of slowing down.</p><p>One major difference between initial adoption and today is that Netflix migrated from an on-prem Temporal deployment to using <a href="https://temporal.io/cloud">Temporal Cloud</a>, which is Temporal’s SaaS offering of the Temporal server. This has let us scale Temporal adoption while running a lean team. We’ve also built up a robust internal platform around Temporal Cloud to integrate with Netflix’s internal ecosystem and make onboarding for our developers as easy as possible. 
Stay tuned for a future post digging into more specifics of our Netflix Temporal platform.</p><h3>Acknowledgement</h3><p>We all stand on the shoulders of giants in software. I want to call out that I’m retelling the work of my two stunning colleagues <a href="https://www.linkedin.com/in/chris-smalley/">Chris Smalley</a> and <a href="https://www.linkedin.com/in/robzienert/">Rob Zienert</a> in this post, who were the two aforementioned engineers who introduced Temporal and carried out the migration.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=73c69ccb5953" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/how-temporal-powers-reliable-cloud-operations-at-netflix-73c69ccb5953">How Temporal Powers Reliable Cloud Operations at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/how-temporal-powers-reliable-cloud-operations-at-netflix-73c69ccb5953</link>
      <guid>https://netflixtechblog.com/how-temporal-powers-reliable-cloud-operations-at-netflix-73c69ccb5953</guid>
      <pubDate>Tue, 16 Dec 2025 00:51:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Netflix Live Origin]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/xiaomei-liu-b475711/">Xiaomei Liu</a>, <a href="https://www.linkedin.com/in/joseph-lynch-9976a431/">Joseph Lynch</a>, <a href="https://www.linkedin.com/in/chrisnewton2/">Chris Newton</a></p><h3>Introduction</h3><p><a href="https://netflixtechblog.com/building-a-reliable-cloud-live-streaming-pipeline-for-netflix-8627c608c967">Behind the Streams: Building a Reliable Cloud Live Streaming Pipeline for Netflix</a> introduced the architecture of the streaming pipeline. This blog post looks at the custom Origin Server we built for Live — the Netflix Live Origin. It sits at the demarcation point between the cloud live streaming pipelines on its upstream side and the distribution system, Open Connect, Netflix’s in-house Content Delivery Network (CDN), on its downstream side, and acts as a broker managing what content makes it out to Open Connect and ultimately to the client devices.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*44sJszKXEHZvSHnEQgYiiw.png"><figcaption><strong>Live Streaming Distribution and Origin Architecture</strong></figcaption></figure><p>Netflix Live Origin is a multi-tenant microservice operating on EC2 instances within the AWS cloud. We lean on standard HTTP protocol features to communicate with the Live Origin. The Packager pushes segments to it using PUT requests, which place a file into storage at the particular location named in the URL. The storage location corresponds to the URL that is used when the Open Connect side issues the corresponding GET request.</p><p>Live Origin architecture is influenced by key technical decisions of the live streaming architecture. First, resilience is achieved through redundant regional live streaming pipelines, with failover orchestrated at the server-side to reduce client complexity. 
The implementation of <a href="https://netflixtechblog.com/building-a-reliable-cloud-live-streaming-pipeline-for-netflix-8627c608c967">epoch locking at the cloud encoder</a> enables the origin to select a segment from either encoding pipeline. Second, Netflix adopted a manifest design with <a href="https://netflixtechblog.com/behind-the-streams-live-at-netflix-part-1-d23f917c2f40">segment templates and constant segment duration</a> to avoid frequent manifest refresh. The constant duration templates enable Origin to predict the segment publishing schedule.</p><h3>Multi-pipeline and multi-region aware origin</h3><p>Live streams inevitably contain defects due to the non-deterministic nature of live contribution feeds and strict real-time segment publishing timelines. Common defects include:</p><ul><li><strong>Short segments:</strong> Missing video frames and audio samples.</li><li><strong>Missing segments:</strong> Entire segments are absent.</li><li><strong>Segment timing discontinuity:</strong> Issues with the Track Fragment Decode Time.</li></ul><p>Communicating segment discontinuity from the server to the client via a segment template-based manifest is impractical, and these defective segments can disrupt client streaming.</p><p>The redundant cloud streaming pipelines operate independently, encompassing distinct cloud regions, contribution feeds, encoder, and packager deployments. This independence substantially mitigates the probability of simultaneous defective segments across the dual pipelines. Owing to its strategic placement within the distribution path, the live origin naturally emerges as a component capable of intelligent candidate selection.</p><p>The Netflix Live Origin features multi-pipeline and multi-region awareness. When a segment is requested, the live origin checks candidates from each pipeline in a deterministic order, selecting the first valid one. Segment defects are detected via lightweight media inspection at the packager. 
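The first-valid-candidate selection described above can be sketched as follows. The types and field names are illustrative assumptions, not the actual origin implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SegmentCandidate:
    pipeline: str    # e.g. "pipeline-A" / "pipeline-B" (illustrative labels)
    defective: bool  # set from the packager's media-inspection metadata
    data: bytes

def select_segment(candidates: list) -> Optional[SegmentCandidate]:
    """Check candidates in a deterministic order and return the first valid
    one; if every candidate is defective, fall back to the first so its
    defect metadata can be passed downstream for error concealment."""
    ordered = sorted(candidates, key=lambda c: c.pipeline)
    for candidate in ordered:
        if not candidate.defective:
            return candidate
    return ordered[0] if ordered else None
```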
This defect information is provided as metadata when the segment is published to the live origin. In the rare case of concurrent defects at the dual pipeline, the segment defects can be communicated downstream for intelligent client-side error concealment.</p><h3>Open Connect streaming optimization</h3><p>When the Live project started, Open Connect had become highly optimised for VOD content delivery — <a href="https://freenginx.org/en/">nginx</a> had been chosen many years ago as the Web Server since it is highly capable in this role, and a number of enhancements had been added to it and to the underlying operating system (BSD). Unlike traditional CDNs, Open Connect is more of a distributed origin server — VOD assets are pre-positioned onto carefully selected server machines (OCAs, or Open Connect Appliances) rather than being filled on demand.</p><p>Alongside the VOD delivery, an on-demand fill system has been used for non-VOD assets — this includes artwork and the downloadable portions of the clients, etc. These are also served out of the same <a href="https://freenginx.org/en/">nginx</a> workers, albeit under a distinct server block, using a distinct set of hostnames.</p><p>Live didn’t fit neatly into this ‘small object delivery’ model, so we extended the proxy-caching functionality of <a href="https://freenginx.org/en/">nginx</a> to address Live-specific needs. We will touch on some of these here related to optimized interactions with the Origin Server. Look for a future blog post that will go into more details on the Open Connect side.</p><p>The segment templates provided to clients are also provided to the OCAs as part of the Live Event Configuration data. Using the Availability Start Time and Initial Segment number, the OCA is able to determine the legitimate range of segments for each event at any point in time — requests for objects outside this range can be rejected, preventing unnecessary requests going up through the fill hierarchy to the origin. 
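The range check derived from the Availability Start Time, Initial Segment number, and constant segment duration can be sketched in Python. The class and field names are illustrative assumptions; the real check lives inside the OCA's web server, not in Python:

```python
from dataclasses import dataclass

@dataclass
class SegmentTemplate:
    """Hypothetical stand-in for the per-event segment template carried in
    the Live Event Configuration data (names are illustrative)."""
    availability_start_time: float  # epoch seconds
    initial_segment_number: int
    segment_duration: float         # constant duration, e.g. 2.0 seconds

    def publish_time(self, segment_number: int) -> float:
        """Expected wall-clock time at which this segment is published."""
        offset = segment_number - self.initial_segment_number
        return self.availability_start_time + offset * self.segment_duration

    def is_legitimate(self, segment_number: int, now: float) -> bool:
        """True if the segment number falls in the valid range so far;
        out-of-range requests can be rejected without filling from origin."""
        if segment_number < self.initial_segment_number:
            return False
        return self.publish_time(segment_number) <= now

tmpl = SegmentTemplate(availability_start_time=1000.0,
                       initial_segment_number=1,
                       segment_duration=2.0)
```

The same `publish_time` prediction is what lets a cached 404 carry an expiry just short of the segment's expected publishing time.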
If a request makes it through to the origin, and the segment isn’t available yet, the origin server will return a 404 Status Code (indicating File Not Found) with the expiration policy of that error so that it can be cached within Open Connect until just before that segment is expected to be published.</p><p>Because the Live Origin knows when segments are being pushed to it and where the live edge is, it can handle a request for the immediately next object differently: rather than handing back another 404 error (which would go all the way back through Open Connect to the client), the Live Origin can ‘hold open’ the request and service it once the segment has been published to it. By doing this, the degree of chatter within the network handling requests that arrive early has been significantly reduced. As part of this, millisecond-grain caching was added to <a href="https://freenginx.org/en/">nginx</a> to enhance the standard HTTP Cache Control, which only works at second granularity, a long time when segments are generated every 2 seconds.</p><h4>Streaming metadata enhancement</h4><p>The HTTP standard allows for the addition of request and response headers that can be used to provide additional information as files move between clients and servers. The HTTP headers provide notifications of events within the stream in a highly scalable way that is independently conveyed to client devices, regardless of their playback position within the stream.</p><p>These notifications are provided to the origin by the live streaming pipeline and are inserted by the origin in the form of headers, appearing on the segments generated at that point in time (and persisting to future segments — they are cumulative). 
Whenever a segment is received at an OCA, this notification information is extracted from the response headers and used to update an in-memory data structure, keyed by event ID; and whenever a segment is served from the OCA, the latest such notification data is attached to the response. This means that, given any flow of segments into an OCA, it will always have the most recent notification data, even if all clients requesting it are behind the live edge. In fact, the notification information can be conveyed on any response, not just those supplying new segments.</p><h4>Cache invalidation and origin mask</h4><p>An invalidation system has been available since the early days of the project. It can be used to “flush” all content associated with an event by altering the key used when looking up objects in cache — this is done by incorporating a version number into the cache key that can then be bumped on demand. This is used during pre-event testing so that the network can be returned to a pristine state for the test with minimal fuss.</p><p>Each segment published by the Live Origin conveys the encoding pipeline it was generated by, as well as the region it was requested from. Any issues that are found after segments make their way into the network can be remedied by an enhanced invalidation system that takes such variants into account. It is possible to invalidate (that is, cause to be considered expired) segments in a range of segment numbers, restricted to those sourced from encoder A, or further restricted to those sourced from encoder A and retrieved from region X.</p><p>In combination with Open Connect’s enhanced cache invalidation, the Netflix Live Origin allows <em>selective encoding pipeline masking</em> to exclude a range of segments from a particular pipeline when serving segments to Open Connect. 
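The version-bump invalidation and pipeline masking described above can be sketched as follows. This is an illustrative assumption about the mechanism, not the actual nginx cache-key implementation:

```python
# Illustrative sketch: the cache key embeds a version number, so bumping
# the version makes every previously cached object a miss, "flushing" the
# event without touching cache contents directly.
def cache_key(event_id: str, pipeline: str, segment: int, version: int) -> str:
    return f"{event_id}/v{version}/{pipeline}/{segment}"

# Hypothetical origin mask: a list of (pipeline, segment-range) exclusions,
# e.g. hide segments 100-109 when they were sourced from pipeline "A".
def is_masked(pipeline: str, segment: int, masks) -> bool:
    return any(pipeline == p and segment in r for p, r in masks)
```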
The enhanced cache invalidation and origin masking enable live streaming operations to hide known problematic segments (e.g., segments causing client playback errors) from streaming clients once the bad segments are detected, protecting millions of streaming clients during the DVR playback window.</p><h3>Origin storage architecture</h3><p>Our original storage architecture for the Live Origin was simple: just use <a href="https://aws.amazon.com/s3/">AWS S3</a> like we do for SVOD. This served us well initially for our low-traffic events, but as we scaled up we discovered that Live streaming has unique latency and workload requirements that differ significantly from on-demand, where we have ample lead time to pre-position content. While S3 met its stated uptime guarantees, our strict 2-second retry budget inherent to Live events (where every write is critical) led us to explore optimizations specifically tailored for real-time delivery at scale. AWS S3 is an amazing object store, but our Live streaming requirements were closer to those of a global low-latency highly-available database. So, we went back to the drawing board and started from the requirements. The Origin required:</p><ol><li>[HA Writes] Extremely high <em>write</em> availability, ideally as close as possible to full write availability within a single AWS region, with low, second-scale replication delay to other regions. 
Any failed write operation within 500ms is considered a bug that must be triaged and prevented from re-occurring.</li><li>[Throughput] High write throughput, with hundreds of MiB replicating across regions</li><li>[Large Partitions] Efficiently support O(MiB) writes that accumulate to O(10k) keys per partition with O(GiB) total size per event.</li><li>[Strong Consistency] Within the same region, we needed read-your-write semantics to hit our &lt;1s read delay requirements (must be able to read published segments)</li><li>[Origin Storm] During worst-case load involving Open Connect edge cases, we may need to handle O(<strong>GiB</strong>) of read throughput <em>without affecting writes</em>.</li></ol><p>Fortunately, Netflix had previously invested in building a <a href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">KeyValue Storage Abstraction</a> that cleverly leveraged <a href="https://youtu.be/sQ-_jFgOBng?t=1061">Apache Cassandra</a> to provide chunked storage of MiB or even GiB values. This abstraction was initially built to support cloud saves of Game state. The Live use case would push the boundaries of this solution, however, in terms of availability for writes (#1), cumulative partition size (#3), and read throughput during Origin Storm (#5).</p><h4>High Availability for Writes of Large Payloads</h4><p>The <a href="https://youtu.be/paTtLhZFsGE?t=1077">KeyValue Payload Chunking and Compression Algorithm</a> breaks O(MiB) work down so each part can be idempotently retried and hedged to maintain strict latency service level objectives, as well as spreading the data across the full cluster. 
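The core idea of the chunking algorithm, deterministic chunk keys that make per-chunk retries idempotent, can be sketched as follows. This is an illustrative assumption, not the actual KeyValue protocol; the real algorithm also compresses and hedges, which is not shown:

```python
import hashlib

def chunk_payload(key: str, payload: bytes, chunk_size: int = 64 * 1024):
    """Split a large value into deterministically keyed parts. Rewriting any
    chunk is idempotent, so a slow or failed chunk write can be retried or
    hedged independently, and distinct chunk keys let the data spread
    across the full cluster. (Chunk size here is illustrative.)"""
    chunks = []
    for index, start in enumerate(range(0, len(payload), chunk_size)):
        part = payload[start:start + chunk_size]
        chunk_key = f"{key}:{index}"                 # stable per-chunk key
        checksum = hashlib.sha256(part).hexdigest()  # integrity check per part
        chunks.append((chunk_key, checksum, part))
    return chunks

def reassemble(chunks) -> bytes:
    return b"".join(part for _, _, part in chunks)
```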
When we combine this algorithm with Apache Cassandra’s local-quorum consistency model, which allows write availability even with an entire Availability Zone outage, plus a write-optimized <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">Log-Structured Merge Tree</a> (LSM) storage engine, we could meet the first four requirements. After iterating on the performance and availability of this solution, we were not only able to achieve the write availability required, but did so with a P99 <em>tail</em> latency that was similar to the status quo’s P50 <em>average </em>latency while also handling cross-region replication behind the scenes for the Origin. This new solution was significantly more expensive (as expected, databases backed by SSD cost more), but minimizing cost was <em>not</em> a key objective and low latency with high availability was:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bUPc4gC-mSDcybayhBJQ8g.png"><figcaption><strong>Storage System Write Performance</strong></figcaption></figure><h4>High Availability Reads at Gbps Throughputs</h4><p>Now that we solved the write reliability problem, we had to handle the Origin Storm failure case, where potentially dozens of Open Connect top-tier caches could be requesting multiple O(MiB) video segments at once. Our back-of-the-envelope calculations showed worst-case read throughput in the O(100Gbps) range, which would normally be extremely expensive for a strongly-consistent storage engine like Apache Cassandra. With careful tuning of chunk access, we were able to respond to reads at network line rate (100Gbps) from Apache Cassandra, but we observed unacceptable performance and availability degradation on concurrent writes. To resolve this issue, we introduced write-through caching of chunks using our distributed caching system <a href="https://github.com/Netflix/EVCache">EVCache</a>, which is based on Memcached. 
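The read-write separation can be illustrated with a toy in-process model of the write-through cache pattern. The dictionaries are stand-ins for EVCache and Apache Cassandra, which are distributed systems in reality:

```python
class WriteThroughStore:
    """Toy model of the pattern: writes go to durable storage and then the
    cache; reads are served from the cache and fall back to storage only on
    a miss, so read storms do not contend with the write path."""

    def __init__(self):
        self.storage = {}       # stand-in for Apache Cassandra
        self.cache = {}         # stand-in for EVCache (memcached)
        self.storage_reads = 0  # counts reads that reached storage

    def put(self, key, value):
        self.storage[key] = value  # durable write first
        self.cache[key] = value    # then write through to the cache

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.storage_reads += 1    # cache miss: fall back to storage
        value = self.storage.get(key)
        if value is not None:
            self.cache[key] = value  # repopulate so later reads hit cache
        return value
```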
This allows almost all reads to be served from a highly scalable cache, allowing us to easily hit 200Gbps and beyond without affecting the write path, achieving read-write separation.</p><h4>Final Storage Architecture</h4><p>In the final storage architecture, the Live Origin writes and reads to KeyValue, which manages a write-through cache to EVCache (memcached) and implements a safe chunking protocol that spreads large values and partitions them out across the storage cluster (Apache Cassandra). This allows almost all read load to be handled from cache, with only misses hitting the storage. This combination of cache and highly available storage has met the demanding needs of our Live Origin for over a year now.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yA9H-BFemM_-99FXBlVOMg.png"><figcaption><strong>Storage System High Level Architecture</strong></figcaption></figure><p>Delivering this consistent low latency for large writes with cross-region replication and consistent write-through caching to a distributed cache required solving numerous hard problems with novel techniques, which we plan to share in detail during a future post.</p><h3>Scalability and scalable architecture</h3><p>Netflix’s live streaming platform must handle a high volume of diverse stream renditions for each live event. This complexity stems from supporting various video encoding formats (each with multiple encoder ladders), numerous audio options (across languages, formats, and bitrates), and different content versions (e.g., with or without advertisements). The combination of these elements, alongside concurrent event support, leads to a significant number of unique stream renditions per live event. This, in turn, necessitates a high Requests Per Second (RPS) capacity from the multi-tenant live origin service to ensure publishing-side scalability.</p><p>In addition, Netflix’s global reach presents distinct challenges to the live origin on the retrieval side. 
During the Tyson vs. Paul fight event in 2024, a historic peak of 65 million concurrent streams was observed. Consequently, a scalable architecture for live origin is essential for the success of large-scale live streaming.</p><h4>Scaling architecture</h4><p>We chose to build a highly scalable origin instead of relying on the traditional origin shields approach for better end-to-end cache consistency control and simpler system architecture. The live origin in this architecture directly connects with top-tier Open Connect nodes, which are geographically distributed across several sites. To minimize the load on the origin, only designated nodes per stream rendition at each site are permitted to directly fill from the origin.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jW7eBCQtjlna0VKaWrKz_A.png"><figcaption><strong>Netflix Live Origin Scalability Architecture</strong></figcaption></figure><p>While the origin service can autoscale horizontally using EC2 instances, there are other system resources that are not autoscalable, such as storage platform capacity and AWS to Open Connect backbone bandwidth capacity. Since in live streaming, not all requests to the live origin are of the same importance, the origin is designed to prioritize more critical requests over less critical requests when system resources are limited. The table below outlines the request categories, their identification, and protection methods.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dYKFJkq22KI8sDBW_Njmog.png"></figure><h4>Publishing isolation</h4><p>Publishing traffic, unlike potentially surging CDN retrieval traffic, is predictable, making path isolation a highly effective solution. As shown in the scalability architecture diagram, the origin utilizes separate EC2 publishing and CDN stacks to protect the latency and failure-sensitive origin writes. 
In addition, the storage abstraction layer features distinct clusters for key-value (KV) read and KV write operations. Finally, the storage layer itself separates read (EVCache) and write (Cassandra) paths. This comprehensive path isolation facilitates independent cloud scaling of publishing and retrieval, and also prevents CDN-facing traffic surges from impacting the performance and reliability of origin publishing.</p><h4>Priority rate limiting</h4><p>Given Netflix’s scale, managing incoming requests during a traffic storm is challenging, especially considering non-autoscalable system resources. The Netflix Live Origin implements priority-based rate limiting when the underlying system is under stress. This approach ensures that requests with greater user impact are prioritized to succeed, while requests with lower user impact are allowed to fail in order to protect the streaming infrastructure, and can be retried later.</p><p>Leveraging Netflix’s microservice platform priority rate limiting feature, the origin prioritizes live edge traffic over DVR traffic during periods of high load on the storage platform. The live edge vs. DVR traffic detection is based on the predictable segment template. The template is further cached in memory on the origin node to enable priority rate limiting without access to the datastore, which is especially valuable during periods of high datastore stress.</p><p>To mitigate traffic surges, TTL cache control is used alongside priority rate limiting. When the low-priority traffic is impacted, the origin instructs Open Connect to slow down and cache identical requests for 5 seconds by returning an HTTP 503 error code with max-age = 5s. This strategy effectively dampens traffic surges by preventing repeated requests to the origin within that 5-second window.</p><p>The following diagrams illustrate origin priority rate limiting with simulated traffic. 
The nliveorigin_mp41 traffic is the low-priority traffic and is mixed with other high-priority traffic. In the first row, the 1st diagram shows the request RPS, and the 2nd diagram shows the percentage of request failures. In the second row, the 1st diagram shows datastore resource utilization, and the 2nd diagram shows the origin retrieval P99 latency. The results clearly show that only the low-priority traffic (nliveorigin_mp41) is impacted at high datastore utilization, and the origin request latency remains under control.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-_YP6H3sEDaw1lS8prH4mQ.png"><figcaption><strong>Origin Priority Rate Limiting</strong></figcaption></figure><h4>404 storm and cache optimization</h4><p>Publishing isolation and priority rate limiting successfully protect the live origin from DVR traffic storms. However, the traffic storm generated by requests for non-existent segments presents further challenges and opportunities for optimization.</p><p>The live origin structures metadata hierarchically as event &gt; stream rendition &gt; segment, and the segment publishing template is maintained at the stream rendition level. This hierarchical organization allows the origin to preemptively reject requests with an HTTP 404 (Not Found) or 410 (Gone) error, leveraging highly cacheable event and stream rendition level metadata and avoiding unnecessary queries to the segment level metadata:</p><ul><li>If the event is unknown, reject the request with 404</li><li>If the event is known, but the segment request timing does not match the expected publishing timing, reject the request with 404 and a cache control TTL matching the expected publishing time</li><li>If the event is known, but the requested segment was never generated or missed its retry deadline, reject the request with a 410 error, preventing the client from repeatedly requesting it</li></ul><p>At the storage layer, metadata is stored separately from media data in the control plane datastore. 
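The preemptive rejection rules above can be sketched as a small decision function. The names and TTL handling are hypothetical; the real origin expresses the TTL through HTTP cache-control headers:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rejection:
    status: int                 # 404 or 410
    cache_ttl: Optional[float]  # seconds to cache the error, if applicable

def classify_request(event_known: bool,
                     seconds_until_publish: Optional[float],
                     missed_retry_deadline: bool) -> Optional[Rejection]:
    """Apply the rejection rules using only event/rendition-level metadata.
    Returns a Rejection for requests that can be refused preemptively, or
    None when the request may proceed to segment-level metadata."""
    if not event_known:
        return Rejection(404, None)
    if missed_retry_deadline:
        # The segment will never exist: 410 Gone stops client retries.
        return Rejection(410, None)
    if seconds_until_publish is not None and seconds_until_publish > 0:
        # Too early: cache the 404 until the expected publishing time.
        return Rejection(404, cache_ttl=seconds_until_publish)
    return None
```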
Unlike the media datastore, the control plane datastore does not use a distributed cache to avoid cache inconsistency. Event and rendition level metadata benefits from a high cache hit ratio when in-memory caching is utilized at the live origin instance. During traffic storms involving non-existent segments, the cache hit ratio for control plane access easily exceeds 90%.</p><p>The use of in-memory caching for metadata effectively handles 404 storms at the live origin without causing datastore stress. This metadata caching complements the storage system’s distributed media cache, providing a complete solution for traffic surge protection.</p><h3>Summary</h3><p>The Netflix Live Origin, built upon an optimized storage platform, is specifically designed for live streaming. It incorporates advanced media and segment publishing scheduling awareness and leverages enhanced intelligence to improve streaming quality, optimize scalability, and improve Open Connect live streaming operations.</p><h3>Acknowledgement</h3><p>Many teams and stunning colleagues contributed to the Netflix live origin. 
Special thanks to <a href="https://www.linkedin.com/in/flavioribeiro/?originalSubdomain=br">Flavio Ribeiro</a> for advocacy and sponsorship of the live origin project; to <a href="https://www.linkedin.com/in/rummadis/">Raj Ummadisetty</a>, <a href="https://www.linkedin.com/in/prudhviraj9/">Prudhviraj Karumanchi</a> for the storage platform; to <a href="https://www.linkedin.com/in/rosanna-lee-197920/">Rosanna Lee</a>, <a href="https://www.linkedin.com/in/hunterford/">Hunter Ford</a>, and <a href="https://www.linkedin.com/in/thiagopnts/">Thiago Pontes</a> for storage lifecycle management; to <a href="https://www.linkedin.com/in/ameya-vasani-8904304/">Ameya Vasani</a> for e2e test framework; <a href="https://www.linkedin.com/in/thomas-symborski-b4216728/">Thomas Symborski</a> for orchestrator integration; to <a href="https://www.linkedin.com/in/jschek/">James Schek</a> for Open Connect integration; to <a href="https://www.linkedin.com/in/kzwang/">Kevin Wang</a> for platform priority rate limit; to <a href="https://www.linkedin.com/in/di-li-09663968/">Di Li</a>, <a href="mailto:nhubbard@netflix.com">Nathan Hubbard</a> for origin scalability testing.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=41f1b0ad5371" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/netflix-live-origin-41f1b0ad5371">Netflix Live Origin</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/netflix-live-origin-41f1b0ad5371</link>
      <guid>https://netflixtechblog.com/netflix-live-origin-41f1b0ad5371</guid>
      <pubDate>Mon, 15 Dec 2025 18:38:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[AV1 — Now Powering 30% of Netflix Streaming]]></title>
      <description><![CDATA[<h3><strong>AV1 — Now Powering 30% of Netflix Streaming</strong></h3><p><a href="https://www.linkedin.com/in/liwei-guo/">Liwei Guo</a>, <a href="https://www.linkedin.com/in/henryzhili/">Zhi Li</a>, <a href="https://www.linkedin.com/in/sheldon-radford/">Sheldon Radford</a>, <a href="https://www.linkedin.com/in/jeffrwatts/">Jeff Watts</a></p><p>Streaming video has become an integral part of our daily lives. At Netflix, our top priority is delivering the best possible entertainment experience to our members, regardless of their devices or network conditions. One of the key technologies enabling this is <a href="https://aomedia.org/specifications/av1/">AV1</a>, a modern, open video codec that is rapidly transforming both how we stream content and how users experience it. Today, AV1 powers approximately 30% of all Netflix viewing, marking a major milestone in our efforts to bring more efficient and higher-quality streaming to our members.</p><p>In this post, we’ll revisit Netflix’s AV1 journey to date, highlight emerging use cases, and share adoption trends across the device ecosystem. Having witnessed AV1’s significant impact, and with <a href="https://aomedia.org/press%20releases/AOMedia-Announces-Year-End-Launch-of-Next-Generation-Video-Codec-AV2-on-10th-Anniversary/">AV2 on the horizon</a>, we’re more excited than ever about how open codecs will continue to revolutionize streaming for everyone.</p><h3>AV1: A Modern, Open Codec</h3><p>Since entering the streaming business in 2007, Netflix has primarily relied on H.264/AVC as its streaming format. However, we quickly recognized that a modern, open codec would benefit not only Netflix, but the entire multimedia industry. In 2015, together with a group of like-minded industry leaders, Netflix co-founded the <a href="https://aomedia.org/">Alliance for Open Media (AOMedia)</a> to develop and promote next generation, open source media technologies. 
The AV1 codec became the first major project of this collaboration, with ambitious goals: to deliver significant improvements in compression efficiency over state-of-the-art codecs, and to introduce rich features that enable new use cases. After three years of collaborative development, AV1 was officially released in 2018.</p><h3>Netflix’s AV1 Journey: From Android to TVs and Beyond</h3><h4><strong>Piloting on Android Mobile</strong></h4><p>When we first set out to bring AV1 streaming to Netflix members, Android was the ideal starting point. Android’s flexibility allowed us to quickly integrate a software AV1 decoder using the efficient <a href="https://code.videolan.org/videolan/dav1d">dav1d</a> library, which was already optimized for ARM chipsets in mobile devices.</p><p>AV1’s superior compression efficiency was especially valuable for mobile users, many of whom are mindful of their data usage and network conditions. By adopting AV1, we were able to deliver noticeably better video quality at lower bitrates. For members relying on cellular data, this meant crisper images with fewer compression artifacts, even when bandwidth was limited. <a href="https://netflixtechblog.com/netflix-now-streaming-av1-on-android-d5264a515202">Launching AV1 support on Android</a> in 2020 marked a significant step forward for Netflix on mobile, making high-quality streaming more accessible and enjoyable for members everywhere.</p><h4><strong>Front-and-Center for Netflix VOD Streaming</strong></h4><p>The success of our AV1 launch on Android proved its value for Netflix streaming, motivating us to expand support to smart TVs and other large-screen devices, where most of our members watch their favorite shows.</p><p>Smart TVs depend on hardware decoders for efficient high-quality playback. We worked closely with device manufacturers and SoC vendors to certify these devices, ensuring they are both conformant and performant. 
This collaborative effort enabled our AV1 streaming to TV devices in <a href="https://netflixtechblog.com/bringing-av1-streaming-to-netflix-members-tvs-b7fc88e42320">late 2021</a>. Shortly thereafter, we expanded AV1 streaming to web browsers (in 2022) and continued to broaden device support. In 2023, this included Apple devices with the introduction of AV1 hardware support in the new M3 and A17 Pro chips.</p><p>As more devices began shipping with AV1 hardware support, a rapidly growing share of our members could enjoy the benefits of this advanced codec. Combined with our investment in adding AV1 streams across the entire catalog, AV1 viewing share has been consistently increasing in recent years. Today, AV1 accounts for approximately 30% of all Netflix streaming, making it our second most-used codec — and it’s on track to become number one very soon. The payoff has been substantial.</p><ul><li><strong>Elevating Streaming Experience Across the Board</strong>: Large-screen TVs and other devices demand higher bitrates to deliver stunning 4K, high frame rate (HFR) experiences. AV1’s superior compression efficiency has allowed us to provide these experiences using less data, making high-quality streaming more accessible and reliable. On average, AV1 streaming sessions achieve VMAF scores¹ that are 4.3 points higher than AVC and 0.9 points higher than HEVC sessions. At the same time, AV1 sessions use one-third less bandwidth than both AVC and HEVC, resulting in 45% fewer buffering interruptions. Moreover, Netflix’s diverse content catalog benefits universally from AV1, with improvements across all content types.</li><li><strong>Driving Network Efficiency Worldwide</strong>: Netflix streams are delivered through our own content delivery network (<a href="https://openconnect.netflix.com/en/?utm_referrer=https://www.google.com/">Open Connect</a>), in partnership with local ISPs around the globe. 
With more than 300 million members, Netflix streaming constitutes a non-trivial portion of global internet traffic. Because AV1 is a more efficient codec, its streams are smaller in size (while providing even better visual quality). By shifting a substantial share of our streaming to AV1, we reduce overall internet bandwidth consumption, and lessen system and network load for both Netflix and our partners.</li></ul><h4>Unlocking Advanced Experiences</h4><p>In addition to its superior compression efficiency, AV1 was designed to support a rich set of features. Once we established a robust framework for the continuous expansion of AV1 streaming, we quickly shifted our focus towards exploring AV1’s unique features to unlock even more advanced and immersive experiences for our members.</p><p><strong>High-Dynamic-Range (HDR)<br></strong>HDR brings enhanced detail, vivid colors, and greater clarity to images. As a premium streaming service, Netflix has been a pioneer in adopting HDR, offering HDR streaming since 2016. In March 2025, we launched <a href="https://netflixtechblog.com/hdr10-now-streaming-on-netflix-c9ab1f4bd72b">AV1 HDR streaming</a>. We chose HDR10+ as the HDR format for its use of dynamic metadata, which enabled us to adapt the tone mapping per device in a scene-dependent manner.</p><p>As anticipated, the combination of AV1 and HDR10+ allows us to deliver images with greater detail, more vibrant colors, and an overall heightened sense of immersion for our members. At the moment, 85% of our HDR catalog (by view-hours) has AV1-HDR10+ coverage, and this number is expected to reach 100% in the next couple of months.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/759/1*Ubhj9prgqb0zuTHt6oOx0g.png"><figcaption><strong><em>Photographs of devices displaying the same (cropped) frame with HDR10 metadata (left) and HDR10+ metadata (right). 
Notice the preservation of the flashlight detail in the HDR10+ capture, and the over-exposure of the region under the flashlight in the HDR10 one.</em></strong></figcaption></figure><p><strong>Cinematic Film Grain<br></strong>Film grain is a hallmark of the cinematic experience, widely used in the movie industry to enhance a film’s depth, texture, and realism. However, because film grain is inherently random, faithfully representing it in digital video requires a significant amount of data. This presents a unique challenge for streaming: restricting the bitrate can result in grain that appears unnatural or distorted, while increasing the bitrate to accurately preserve cinematic grain almost inevitably leads to elevated rebuffering. The AV1 specification incorporates a unique solution called Film Grain Synthesis (FGS). Instead of encoding grain as part of every frame, the grain is stripped out before encoding and then resynthesized at the decoder using parameters sent in the bitstream, delivering a realistic cinematic film grain experience without the usual data costs.</p><p>This approach represents a significant shift from traditional compression and streaming techniques. Our team invested substantial effort in fine-tuning the media processing pipeline, ensuring FGS delivers robust performance at scale. In July 2025, we successfully <a href="https://netflixtechblog.com/av1-scale-film-grain-synthesis-the-awakening-ee09cfdff40b">productized AV1 FGS</a>, and the results were astonishing: AV1 with FGS could deliver videos with cinematic film grain at a bitrate well within the capabilities of typical household internet connections. 
Non-FGS AV1 encodings, even at much higher bitrates, may not achieve comparable quality.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1016/1*9fB5xuoFpbpN8ZQIzTHDtg.png"><figcaption><strong><em>The same (cropped) frame from source (left), regular AV1 stream encoded at 8274 kbps (middle) and AV1 FGS stream encoded at 2804 kbps (right). The AV1 FGS stream reduces the bitrate by 66% while delivering clearly better quality.</em></strong></figcaption></figure><h4><strong>Beyond VOD Streaming</strong></h4><p>So far, our AV1 journey has been mainly on VOD, but we see significant opportunities for AV1 beyond traditional VOD streaming. On a mission to entertain the world, Netflix has constantly explored and established other ways to bring joy to our members, and we believe AV1 could contribute to the success of these new products.</p><p><strong>Live Streaming<br></strong>Debuting in 2023, live streaming has experienced <a href="https://help.netflix.com/en/node/129840">rapid growth</a> at Netflix, becoming a key part of our streaming offerings in just two short years. We are actively evaluating the use of AV1 in live streaming, as we believe it could help further scale Netflix’s live programming:</p><ul><li><strong>Hyper-scale concurrent viewership: </strong>Live streaming at Netflix means delivering content to <a href="https://www.netflix.com/tudum/articles/jake-paul-vs-mike-tyson-live-release-date-news">tens of millions</a> of viewers simultaneously. AV1’s superior compression efficiency could significantly reduce the required bandwidth, enabling us to deliver high-quality live experiences to large audiences without compromising video quality.</li><li><strong>Customizable graphics overlay</strong>: for live sports events such as football, tennis and boxing, graphics overlays have become an integral part of the member experience — from embedding game statistics to delivering sponsorships. 
AV1 offers an opportunity to make the graphics highly customizable: layered coding is supported in AV1’s main profile, allowing the main content to be encoded in the base layer and the graphics in the enhancement layer, with one version of the enhancement layer easily swapped out for another. We envision that the use of AV1’s layered coding can greatly simplify the live streaming workflow and reduce delivery costs.</li></ul><p><strong>Cloud Gaming<br></strong>Cloud gaming is a new Netflix offering that is currently in the <a href="https://help.netflix.com/en/node/132197">beta phase</a> and is available to members in select countries. The game engines run on cloud servers, while the rendered graphics are streamed directly to members’ devices. By removing barriers and transforming every Netflix-enabled device into a game console, cloud gaming aims to deliver a seamless, “play anywhere” experience for our members. For a glimpse of this in action, <a href="https://www.linkedin.com/feed/update/urn:li:activity:7382077927875825664/">watch as Co-CEO Greg Peters and CTO Elizabeth Stone play a round of Boggle Party — powered entirely by Netflix’s cloud gaming platform</a>!</p><p>Unlike traditional video streaming, cloud gaming requires that every player action be reflected instantly on the screen to ensure a responsive and immersive experience. This makes delivering high-quality video frames with extremely low latency, despite fluctuating network conditions, one of the biggest challenges in cloud gaming.</p><p>Our team is actively working on productizing AV1 for cloud gaming. Given AV1’s high compression efficiency, we can reduce frame sizes, helping video frames get through even when network conditions become challenging. 
This positions AV1 as a promising technology for enabling a high-quality, low-latency gaming experience across a wide range of devices.</p><h3>A Device Ecosystem United for AV1</h3><p>Netflix is a streaming company, and we have worked diligently to create highly efficient and standards-conformant AV1 streams for our catalog. However, an equally, if not more, important factor in AV1’s success is the widespread support from device manufacturers. Throughout our AV1 journey, we have been impressed by the unprecedented pace at which the device ecosystem has embraced AV1.</p><p>Just six months after the AV1 specification was finalized, the open-source AV1 decoder library sponsored by AOM, dav1d, was released. Small, performant, and highly resource-efficient, dav1d bridged the gap for early adopters like Netflix while hardware solutions were still in development. Continuous improvements to its performance and compatibility have made dav1d the preferred choice for a wide range of platforms and practical applications. Today, it serves as <a href="https://aomedia.org/av1-adoption-showcase/google-story/">Android’s default software decoder</a>. Additionally, it plays a key role in web browsers — for Netflix, it powers approximately 40% of our browser playback. This broad adoption has significantly expanded access to high-quality AV1 streaming, even in the absence of dedicated hardware decoders.</p><p>Netflix maintains a close working relationship with device manufacturers and SoC vendors, and we have witnessed first-hand their enthusiasm for adopting AV1. To ensure optimal streaming performance, Netflix has a rigorous certification process to verify proper support for our streaming formats on devices. AV1 was added to this certification process in 2019, and since then, we have seen a steady increase in the number of devices with full AV1 decoding capabilities. 
Over the past five years (2021–2025), 88% of the large-screen devices (TVs, set-top boxes, and streaming sticks) submitted for Netflix certification have supported AV1, with the vast majority offering full 4K@60fps capability. Notably, since 2023, almost all devices we have received for certification are AV1-capable.</p><p>We have also been impressed by the robustness of AV1 implementations across these devices. As mentioned earlier, FGS is an innovative tool that departs from traditional codec architectures and was not included in our initial full-scale AV1 streaming rollout. When we launched FGS this July, we worked closely with our partners to ensure broad device compatibility. We are pleased with the progress made, and AV1 with FGS is now supported across a significant and growing number of in-field devices.</p><h3>Looking Ahead: AV1 Today, AV2 Tomorrow</h3><p>As we reflect on our AV1 journey, it’s clear that the codec has already transformed the streaming experience for hundreds of millions of Netflix members worldwide. Thanks to industry-wide collaboration and rapid device adoption, AV1 is delivering higher quality, greater efficiency, and new cinematic features to more screens than ever before.</p><p>Looking ahead, we are excited about the forthcoming release of AV2, announced by the Alliance for Open Media for the end of 2025. <a href="https://www.youtube.com/watch?v=RUMwMe_2Dqo">AV2 is poised to set a new benchmark for compression efficiency and streaming capabilities, building on the solid foundation laid by AV1</a>. At Netflix, we remain committed to adopting the best open technologies to delight our members around the globe. 
While AV2 represents the future of streaming, AV1 is very much the present — serving as the backbone of our platform and powering exceptional entertainment experiences across a vast and ever-expanding ecosystem of devices.</p><h3>Acknowledgement</h3><p>The success of AV1 at Netflix is the result of the dedication, expertise, and collaboration of many teams across the company — including Encoding, Clients, Device Certification, Partner Engineering, Data Science &amp; Engineering, Infra, Platform, etc.</p><p>We would also like to thank <a href="https://www.linkedin.com/in/artemdanylenko/">Artem Danylenko</a>, <a href="https://www.linkedin.com/in/aditya-mavlankar-7139791/">Aditya Mavlankar</a>, <a href="https://www.linkedin.com/in/anne-aaron/">Anne Aaron</a>, <a href="https://www.linkedin.com/in/cyril-concolato-567a522/">Cyril Concolato</a>, <a href="https://www.linkedin.com/in/allanzp/">Allan Zhou</a> and <a href="https://www.linkedin.com/in/anush-moorthy-b8451142/">Anush Moorthy</a> for their valuable comments and feedback on earlier drafts of this post.</p><h3>Footnotes</h3><ol><li>These numbers represent a snapshot of data from November 13, 2025. Actual values may vary slightly from day to day and across different regions, depending on the mix of content, devices, and internet connectivity.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xuLP8glDDcj-DBYO8djmNA.png"></figure><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=02f592242d80" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/av1-now-powering-30-of-netflix-streaming-02f592242d80">AV1 — Now Powering 30% of Netflix Streaming</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/av1-now-powering-30-of-netflix-streaming-02f592242d80</link>
      <guid>https://netflixtechblog.com/av1-now-powering-30-of-netflix-streaming-02f592242d80</guid>
      <pubDate>Thu, 04 Dec 2025 21:09:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Supercharging the ML and AI Development Experience at Netflix]]></title>
      <description><![CDATA[<h3>Supercharging the ML and AI Development Experience at Netflix with Metaflow</h3><p><a href="https://www.linkedin.com/in/shashanksrikanth/"><em>Shashank Srikanth</em></a>, <a href="https://www.linkedin.com/in/romain-cledat-4a211a5/"><em>Romain Cledat</em></a></p><p><a href="https://docs.metaflow.org/">Metaflow</a> — a framework we started and <a href="https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9">open-sourced</a> in 2019 — now powers <a href="https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d">a wide range of ML and AI systems across Netflix</a> and at <a href="https://github.com/Netflix/metaflow/blob/master/ADOPTERS.md">many other companies</a>. It is well loved by users for helping them take their ML/AI workflows from <a href="https://docs.metaflow.org/introduction/what-is-metaflow#how-does-metaflow-support-prototyping-and-production-use-cases">prototype to production</a>, allowing them to focus on building cutting-edge systems that bring joy and entertainment to audiences worldwide.</p><p>Metaflow allows users to:</p><ol><li><strong>Iterate and ship quickly </strong>by minimizing friction.</li><li><strong>Operate systems reliably</strong> in production with minimal overhead, at Netflix scale.</li></ol><p>Metaflow works with many battle-hardened tools to address the second point — among them <a href="https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041">Maestro</a>, our newly open-sourced workflow orchestrator that powers nearly every ML and AI system at Netflix and serves as a backbone for Metaflow itself.</p><p>In this post, we focus on the first point and introduce a new Metaflow functionality, <strong>Spin</strong>, that helps users <strong>accelerate their iterative development process</strong>. 
By the end, you’ll have a solid understanding of Spin’s capabilities and learn how to try it out yourself with <strong>Metaflow 2.19</strong>.</p><h3>Iterative development in ML and AI workflows</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0I8DAvXCQEN1RpTTC0ZyaQ.jpeg"><figcaption>Developing a Metaflow flow with cards in VSCode</figcaption></figure><p>To understand our approach to improving the ML and AI development experience, it helps to consider how these workflows differ from traditional software engineering.</p><p>ML and AI development revolves not just around code but also around data and models, which are large, mutable, and computationally expensive to process. Iteration cycles can involve long-running data transformations, model training, and stochastic processes that yield slightly different results from run to run. These characteristics make fast, stateful iteration a critical part of productive development.</p><p>This is where notebooks — such as Jupyter, <a href="https://observablehq.com/documentation/notebooks/">Observable</a>, or <a href="https://marimo.io/">Marimo</a> — shine. Their ability to preserve state in memory allows developers to load a dataset once and iteratively explore, transform, and visualize it without reloading or recomputing from scratch. This persistent, interactive environment turns what would otherwise be a slow, rigid loop into a fluid, exploratory workflow — perfectly suited to the needs of ML and AI practitioners.</p><p>Because ML and AI development is computationally intensive, stochastic, and data- and model-centric, tools that optimize iteration speed must treat state management as a first-class design concern. 
Any system aiming to improve the development experience in this domain must therefore enable quick, incremental experimentation without losing continuity between iterations.</p><h3>New: rapid, iterative development with spin</h3><p>At first glance, Metaflow code looks like a workflow — similar to <a href="https://airflow.apache.org/">Airflow</a> — but there’s another way to look at it: each Metaflow @step serves as <a href="https://docs.metaflow.org/metaflow/basics#what-should-be-a-step">a checkpoint boundary</a>. At the end of every step, Metaflow automatically persists all instance variables as <em>artifacts</em>, allowing the execution to <a href="https://docs.metaflow.org/metaflow/debugging#how-to-use-the-resume-command">resume</a> seamlessly from that point onward. The below animation shows this behavior in action:</p><figure><img alt="An animated GIF showing how resume can be used in Metaflow. The GIF shows how using `flow.py resume join` makes Metaflow clone previously executed steps and resumes the computation from the `join` step and continues executing till the end of the flow." src="https://cdn-images-1.medium.com/max/1024/1*AEDpnt-YULYV4mcyrwLk7g.gif"><figcaption>Using resume in Metaflow</figcaption></figure><p>In a sense, we can consider a @step similar to a notebook cell: it is the smallest unit of execution that updates state upon completion. It does have a few differences that address the issues with notebook cells:</p><ul><li><strong>The execution order is explicit and deterministic: </strong>no surprises due to out-of-order cell execution;</li><li><strong>The state is not hidden: </strong>state is explicitly stored as self. 
variables, which can be <a href="https://docs.metaflow.org/metaflow/client">discovered and inspected</a>;</li><li><strong>The state is versioned and persisted</strong>, making results more reproducible.</li></ul><p>While <strong>Metaflow</strong>’s resume feature can approximate the incremental and iterative development approach of notebooks, it restarts execution from the selected step onward, introducing more latency between iterations. In contrast, a <strong>notebook</strong> allows near-instant feedback by letting users tweak and rerun individual cells while seamlessly reusing data from earlier cells held in memory.</p><p>The new spin command in Metaflow 2.19 addresses this gap. Similar to executing a single notebook cell, it quickly executes a single Metaflow @step — with all the state carried over from the parent step. As a result, users can develop and debug Metaflow steps as easily as a cell in a notebook.</p><p>The effect becomes clear when considering the three complementary execution modes — run, resume, and spin — side by side, mapping them to the corresponding notebook behavior:</p><figure><img alt="Diagram showing the various modes of execution in Metaflow: Run, Resume and Spin" src="https://cdn-images-1.medium.com/max/1024/1*DgRIxOu-7keiFoHia9JMrg.png"><figcaption>Run, Resume and Spin “modes”</figcaption></figure><p>Another major difference isn’t just what gets executed, but what gets recorded. Both run and resume create a full, versioned run with complete metadata and artifacts, while spin skips tracking altogether. It’s built for fast, throw-away iterations during development.</p><p>The one-minute clip below illustrates a typical iterative development workflow that alternates between run and spin. 
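To make the distinction between the three modes concrete, here is a toy, pure-Python emulation — a conceptual sketch only, not Metaflow's implementation, and every name in it is hypothetical. Artifacts from a prior run are modeled as a plain dict, and each "step" is a function that reads and writes that dict:

```python
# Toy emulation of run / resume / spin (hypothetical sketch, not Metaflow).

def step_load(state):
    state["data"] = [1, 2, 3, 4]          # stand-in for loading a dataset

def step_train(state):
    state["model"] = sum(state["data"])   # stand-in for training a model

STEPS = [("load", step_load), ("train", step_train)]

def run():
    """Execute every step from scratch and keep all resulting artifacts."""
    state = {}
    for _, fn in STEPS:
        fn(state)
    return state

def resume(prior, from_step):
    """Reuse artifacts from a prior run, re-executing from_step onward."""
    state = dict(prior)
    names = [name for name, _ in STEPS]
    for _, fn in STEPS[names.index(from_step):]:
        fn(state)
    return state

def spin(prior, step):
    """Execute a single step against prior state; nothing downstream runs."""
    state = dict(prior)
    dict(STEPS)[step](state)
    return state
```

In this sketch, spin reuses the prior run's artifacts and re-executes only the one step — the notebook-cell-like loop described above — while resume re-executes everything from the chosen step onward.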
In this example, we are building a flow that reads a dataset from a Parquet file and trains a separate model for each product category, focusing on computer-related categories.</p><a href="https://medium.com/media/36b60cb7c79bf5c77ddd1f60cc97ae8b/href">https://medium.com/media/36b60cb7c79bf5c77ddd1f60cc97ae8b/href</a><p>As shown in the video, we start by creating a flow from scratch and running a minimal version of it to persist test artifacts — in this case, a Parquet dataset. From there, we can use spin to iterate on one step at a time, incrementally building out the flow, for example, by adding the parallel training steps demonstrated in the clip.</p><p>Once the flow has been iterated on locally, it can be seamlessly deployed to production orchestrators like Maestro or <a href="https://docs.metaflow.org/production/scheduling-metaflow-flows/scheduling-with-argo-workflows">Argo</a>, and <a href="https://docs.metaflow.org/scaling/remote-tasks/requesting-resources">scaled up</a> on compute platforms such as AWS Batch, Titus, Kubernetes and more. Thus, the experience is as smooth as developing in a notebook, but the outcome is a production-ready, scalable workflow, implemented as an idiomatic Python project!</p><h3>Spin up smooth development in VSCode/Cursor</h3><p>Instead of typing run and spin manually in the terminal, we can bind them to keyboard shortcuts. For example, <a href="https://github.com/outerbounds/metaflow-dev-vscode">the simple metaflow-dev VS Code extension</a> (works with Cursor as well) maps Ctrl+Opt+R to run and Ctrl+Opt+S to spin. Just hack away, hit Ctrl+Opt+S, and the extension will save your file and spin the step you are currently editing.</p><p>One area where spin truly shines is in creating mini-dashboards and reports with <a href="https://docs.metaflow.org/metaflow/visualizing-results">Metaflow Cards</a>. 
Visualization is another strong point of notebooks, but the combination of spin and cards makes Metaflow a very compelling alternative for developing real-time and post-execution visualizations. Developing cards is inherently iterative and visual (much like building web pages), where you want to tweak code and see the results instantly. This workflow is readily available with the combination of VSCode/Cursor, which includes a built-in web-view, <a href="https://docs.metaflow.org/metaflow/visualizing-results/effortless-task-inspection-with-default-cards#using-local-card-viewer">the local card viewer</a>, and spin.</p><p>To see the trio of tools — along with the VS Code extension — in action, in this short clip we add observability to the train step that we built in the earlier example:</p><a href="https://medium.com/media/e2596ec88a7e65b4bad1b6ba7a2a167c/href">https://medium.com/media/e2596ec88a7e65b4bad1b6ba7a2a167c/href</a><p>A major benefit of Metaflow Cards is that we don’t need to deploy any extra services, data streams, or databases for observability. Just develop visual outputs as above, deploy the flow, and we have a complete system in production with reporting and visualizations included.</p><h3>Spin to the next level: injecting inputs, inspecting outputs</h3><p>Spin does more than just run code — it also lets us take full control of a spun @step’s inputs and outputs, enabling a range of advanced patterns.</p><p>In contrast to notebooks, we can spin any arbitrary @step in a flow using state from any past run, making it easy to test functions with different inputs. For example, if we have multiple models produced by separate runs, we could spin an inference step, supplying a different model run each time.</p><p>We can also override artifact values or inject arbitrary Python objects — similar to a notebook cell — for spin. 
Simply specify a Python module with an ARTIFACTS dictionary:</p><pre>ARTIFACTS = {<br>  "model": "kmeans",<br>  "k": 15<br>}</pre><p>and point spin at the module:</p><pre>spin train --artifacts-module artifacts.py</pre><p>By default, spin doesn’t persist artifacts, but we can easily change this by adding --persist. Even in this case, artifacts are not persisted in the usual Metaflow datastore but to a directory-specific location, which you can easily clean up after testing. We can access the results with <a href="https://docs.metaflow.org/metaflow/client">the Client API</a> as usual — just specify the directory you want to inspect with inspect_spin:</p><pre>from metaflow import Flow, inspect_spin<br><br>inspect_spin(".")<br>Flow("TrainingFlow").latest_run["train"].task["model"].data</pre><p>Being able to inspect and modify a step’s inputs and outputs on the fly unlocks a powerful use case:<strong> unit testing individual steps</strong>. We can use spin programmatically through <a href="https://docs.metaflow.org/metaflow/managing-flows/runner">the Runner API</a> and assert the results:</p><pre>from metaflow import Runner<br><br>with Runner("flow.py").spin("train", persist=True) as spin:<br>  assert spin.task["model"].data == "kmeans"</pre><h3>Making AI agents spin</h3><p>In addition to speeding up development for humans, spin turns out to be surprisingly handy for coding agents too. There are two major advantages to teaching AI how to spin:</p><ol><li><strong>It accelerates the development loop</strong>. Agents don’t naturally understand what’s slow, or why speed matters, so they need to be nudged to favor faster tools over slower ones.</li><li><strong>It helps surface errors faster </strong>and contextualizes them to a specific piece of code, increasing the chance that the agent is able to fix errors by itself.</li></ol><p>Metaflow users are already <a href="https://claude.com/product/claude-code">using</a> Claude Code; spin makes this even easier. 
In the example below, we added the following section in a CLAUDE.md file:</p><pre>## Developing Metaflow code<br>Follow this incremental development workflow that ensures quick iterations<br>and correct results. You must create a flow incrementally, step by step<br>following this process:<br>1. Create a flow skeleton with empty `@step`s.<br>2. Add a data loading step.<br>3. `run` the flow.<br>4. Populate the next step and use `spin` to test it with the correct inputs.<br>5. `run` the flow to record outputs from the new step.<br>6. Iterate on (4–5) until all steps have been implemented and work correctly.<br>7. `run` the whole flow to ensure final correctness.<br><br>To test a flow, run the flow as follows<br>```<br>python flow.py --environment=pypi run<br>```<br><br>Do this once before running `spin`.<br>As you are building the flow, use `spin` to test steps quickly.<br>For instance<br>```<br>python flow.py --environment=pypi spin train<br>```</pre><p>Just based on these quick instructions, the agent is able to use spin effectively. Take a look at the following inspirational example, in which Claude one-shots a flow, along the lines of our earlier examples, that trains a classifier to predict product categories:</p><a href="https://medium.com/media/706e83950e73a05ca41b4fb463702690/href">https://medium.com/media/706e83950e73a05ca41b4fb463702690/href</a><p>In the video, we can see Claude using spin around the 45-second mark to test a preprocess step. The step initially fails due to a classic data science pitfall: during testing, Claude samples only a small subset of data, causing some classes to be underrepresented. 
The first spin surfaces the issue, which Claude then fixes by switching to stratified sampling — and finally does another spin to confirm the fix, before proceeding to complete the task.</p><h3>The inner loop of end-to-end ML/AI</h3><p>To circle back to where we started, our motivation for adding spin — and for creating Metaflow in the first place — is to accelerate development cycles so we can deliver more joy to our subscribers, faster. Ultimately, we believe there’s no single magic feature that makes this possible. It takes all parts of an ML/AI platform working together coherently — spin included.</p><p>From this perspective, it’s useful to place spin in the context of other Metaflow features. It’s designed for the innermost loop of model and business-logic development, with the added benefit of supporting unit testing during deployment, as shown in the overall blueprint of the Metaflow toolchain below.</p><figure><img alt="Metaflow tool-chain." src="https://cdn-images-1.medium.com/max/1024/1*9cd6SHFrW7A4iWMZHYkVkw.png"><figcaption>Metaflow tool-chain</figcaption></figure><p>In this diagram, the solid blue boxes represent different Metaflow commands, while the blue text denotes decorators and other features. In particular, note the <em>Shared Functionality</em> box — another key focus area for us over the past year — which includes <a href="https://netflixtechblog.com/introducing-configurable-metaflow-d2fb8e9ba1c6">configuration management</a> and <a href="https://docs.metaflow.org/metaflow/composing-flows/introduction">custom decorators</a>. These capabilities let domain-specific teams and platform providers tailor Metaflow to their own use cases. Following our ethos of composability, all of these features integrate seamlessly with spin as well.</p><p>Another key design philosophy of Metaflow is to let projects start small and simple, adding complexity only when it becomes necessary. So don’t be overwhelmed by the diagram above. 
To get started, install Metaflow easily with</p><pre>pip install metaflow</pre><p>and take your first baby @steps for a spin! Check out the <a href="https://docs.metaflow.org/metaflow/authoring-flows/introduction">docs</a> and for questions, support, and feedback, join the friendly <a href="http://chat.metaflow.org/">Metaflow Community Slack</a>.</p><h3>Acknowledgments</h3><p>We would like to thank our partners at <a href="https://outerbounds.com/">Outerbounds</a>, and particularly <a href="https://www.linkedin.com/in/villetuulos/">Ville Tuulos</a>, <a href="https://www.linkedin.com/in/savingoyal/">Savin Goyal</a>, and <a href="https://www.linkedin.com/in/madhur-tandon/">Madhur Tandon</a>, for their collaboration on this feature, from initial ideation to review, testing and documentation. We would also like to acknowledge the rest of the Model Development and Management team (<a href="https://www.linkedin.com/in/maria-alder/">Maria Alder</a>, <a href="https://www.linkedin.com/in/david-j-berg/">David J. 
Berg</a>, <a href="https://www.linkedin.com/in/shaojingli/">Shaojing Li</a>, <a href="https://www.linkedin.com/in/rui-lin-483a83111/">Rui Lin</a>, <a href="https://www.linkedin.com/in/nissanpow/">Nissan Pow</a>, <a href="https://www.linkedin.com/in/chaoying-wang/">Chaoying Wang</a>, <a href="https://www.linkedin.com/in/reginalw/">Regina Wang</a>, <a href="https://www.linkedin.com/in/shuishiyang/">Seth Yang</a>, <a href="https://www.linkedin.com/in/zitingyu/">Darin Yu</a>) for their input and comments.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=b2d5b95c63eb" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/supercharging-the-ml-and-ai-development-experience-at-netflix-b2d5b95c63eb">Supercharging the ML and AI Development Experience at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/supercharging-the-ml-and-ai-development-experience-at-netflix-b2d5b95c63eb</link>
      <guid>https://netflixtechblog.com/supercharging-the-ml-and-ai-development-experience-at-netflix-b2d5b95c63eb</guid>
      <pubDate>Tue, 04 Nov 2025 21:33:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning]]></title>
      <description><![CDATA[<p>Authors: <a href="https://keertanavc.github.io/">Keertana Chidambaram</a>, <a href="https://www.linkedin.com/in/qiuling-xu-a445b815a">Qiuling Xu</a>, <a href="https://www.linkedin.com/in/markhsiao/">Ko-Jen Hsiao</a>, <a href="https://www.linkedin.com/in/moumitab/">Moumita Bhattacharya</a></p><p>(*The work was done when Keertana interned at Netflix.)</p><h3>Introduction</h3><p>This blog focuses on post-training generative recommender systems. Generative recommenders (GRs) represent a new paradigm in the field of recommendation systems (e.g. <a href="https://github.com/meta-recsys/generative-recommenders">HSTU</a>, <a href="https://arxiv.org/abs/2502.18965">OneRec</a>). These models draw inspiration from recent advancements in transformer architectures used for language and vision tasks. They approach the recommendation problem, including both ranking and retrieval, as a sequential transduction task. This perspective enables generative training, where the model learns by imitating the next event in a sequence of user activities, thereby effectively modeling user behavior over time.</p><p>However, a key challenge with simply replicating observed user patterns is that it may not always lead to the best possible recommendations. User interactions are influenced by a variety of factors — such as trends or external suggestions — and the system’s view of these interactions is inherently limited. For example, if a user tries a popular show but later indicates it wasn’t a good fit, a model that only imitates this behavior might continue to recommend similar content, missing the chance to enhance the user’s experience.</p><p>This highlights the importance of incorporating user preferences and feedback, rather than solely relying on observed behavior, to improve recommendation quality. 
In the context of recommendation systems, we benefit from a wealth of user feedback, which includes explicit signals such as ratings and reviews, as well as implicit signals like watch time, click-through rates, and overall engagement. This abundance of feedback serves as a valuable resource for improving model performance.</p><p>Given the recent success of reinforcement learning techniques in post-training large language models, such as DPO and GRPO, this study investigates whether similar methods can be applied to generative recommenders. Ultimately, our goal is to identify both the opportunities and challenges in using these techniques to enhance the quality and relevance of recommendations.</p><p>Unlike language models, post-training generative recommenders presents unique challenges. One of the most significant is the difficulty of obtaining counterfactual feedback in recommendation scenarios. The recommendation feedback is generated on-policy — that is, it reflects users’ real-time interactions with the system as they naturally use it. Since a typical user sequence can span weeks or even years of activity, it is impractical to ask users to review or provide feedback on hypothetical, counterfactual experiences. As a result, the absence of counterfactual data makes it challenging to apply post-training methods such as PPO or DPO, which require feedback from counterfactual user sequences.</p><p>Furthermore, post-training methods typically rely on a reward model — either implicit or explicit — to guide optimization. The quality of reward models heavily influences the effectiveness of post-training. In the context of recommendation systems, however, reward signals tend to be much noisier. 
For instance, if we use watch time as an implicit reward, it may not always accurately reflect user satisfaction: a viewer might stop watching a favorite show simply due to time constraints, while finishing a lengthy show doesn’t necessarily indicate genuine enjoyment.</p><p>To address these post-training challenges, we introduce a novel algorithm called Advantage-Weighted Supervised Fine-tuning (A-SFT). Our analysis first demonstrates that reward models in recommendation systems often exhibit higher uncertainty due to the issues discussed above. Rather than relying solely on these uncertain reward models, A-SFT combines supervised fine-tuning with the advantage function to more effectively guide post-training optimization. This approach proves especially effective when the reward model has high variance but still provides valuable directional signals. We benchmark A-SFT against four other representative methods, and our results show that A-SFT achieves better alignment between the pre-trained generative recommendation model and the reward model.</p><p>In Figure 1, we conceptualize the pros and cons of different post-training paradigms. For example, Online Reinforcement Learning is most useful when the reward model generalizes well, and behavior cloning is suitable when no reward models are available. Matching each algorithm to the use case it fits is the key to successful post-training: over-exploiting a noisy reward model will hurt task performance, as its guidance can simply be noise, while failing to leverage a good reward model leaves potential improvements on the table. 
We find A-SFT hits the sweet spot between offline reinforcement learning and behavior cloning, benefiting from the directional signals in noisy reward estimates while depending less on reward accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N9QNNLspEJJCxQjtNl9HZw.png"></figure><p>Figure 1: The landscape of RL algorithms based on the reward models’ accuracy</p><h3>Challenges in Post-training for Recommendation</h3><p>Reinforcement Learning from Human Feedback (RLHF) is the most popular framework for post-training large language models. In this framework, human annotators evaluate and rank different outputs generated by a model. This feedback is then used to train a reward model that predicts how well a model output aligns with human preferences. This reward model then serves as a proxy for human judgment during reinforcement learning, guiding the model to generate outputs that are more likely to be preferred by humans.</p><p>While traditional RLHF methods like PPO or DPO are effective for aligning LLMs, there are several challenges in applying them directly to large-scale recommendation systems:</p><ol><li>Lack of Counterfactual Observations</li></ol><p>As in typical RLHF settings, collecting real-time feedback from a diverse user base across a wide range of items is both costly and impractical. Recommendation data are generated by users’ real-time interests, and neither third-party annotators nor the users themselves have any practical means of evaluating an alternative reality. For example, it is impractical to ask Netflix users to evaluate hundreds of unseen movies. Consequently, we lack a live environment in which to perform reinforcement learning.</p><p>2. Noisy Reward Models</p><p>Beyond the scarcity of counterfactual data, the recommendation task is inherently more random, and recommendation data has less structure than language data. 
Users do not choose shows because a grammar rule dictates that verbs must follow nouns. In fact, users’ choices usually exhibit a level of permutation invariance, where swapping the order of events in the user history still makes a valid activity sequence. This randomness in behavior makes learning a good reward model extremely difficult, and the reward models we learn often still carry a large margin of error.</p><p>Here is an ablation study of reward model performance on O(millions) of users and O(billions) of tokens. The reward model uses the open-source HSTU architecture for ease of reproducing this study. We adopt the standard RLHF approach of training a reward model using offline, human-collected feedback. We start by creating a proxy reward, scored on a scale from 1 to 5 for ease of interpretation. This reward model is co-trained as a shallow reward head on top of the generative recommender. It predicts the reward for the most recently selected title based on a user’s interaction history. To evaluate its effectiveness, we compare the model’s performance against two simple baselines: (1) predicting the next reward as the average reward the user has given in their past interactions, and (2) predicting it as the average reward that all users have assigned to that particular title in previous interactions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tmYdEXqVjSkne__-k6gSig.png"></figure><p>Table 1: Reward model performance metrics</p><p>We observe that the model’s predictions do not significantly outperform the simple baselines. This result is intuitive, as a user’s historical interactions typically cover only a small subset of titles, making it difficult to accurately predict their responses to the vast number of unexplored titles in the catalogue. 
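</p><p>To make these baselines concrete, here is a minimal Python sketch (toy data and names of our own, not Netflix code) of the two trivial predictors the reward model is compared against:</p>

```python
# Toy sketch of the two reward-model baselines described above:
# (1) the user's own historical mean reward, and
# (2) the title's mean reward across all users.
from collections import defaultdict

# (user, title, reward) interactions observed so far -- hypothetical data.
history = [
    ("u1", "tA", 4), ("u1", "tB", 2),
    ("u2", "tA", 5), ("u2", "tC", 3),
]

user_rewards = defaultdict(list)
title_rewards = defaultdict(list)
for user, title, reward in history:
    user_rewards[user].append(reward)
    title_rewards[title].append(reward)

def user_avg_baseline(user):
    """Predict the next reward as this user's historical mean."""
    r = user_rewards[user]
    return sum(r) / len(r)

def title_avg_baseline(title):
    """Predict the next reward as the title's mean across all users."""
    r = title_rewards[title]
    return sum(r) / len(r)

print(user_avg_baseline("u1"))   # 3.0
print(title_avg_baseline("tA"))  # 4.5
```

<p>A learned reward head has to clearly beat predictors this trivial to justify its use; Table 1 shows that, in this setting, it does not.</p><p>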
We expect this to be a potential issue for any large recommendation system where the ratio of explored to unexplored titles is very small.</p><p>3. Lack of Logged Policy</p><p>In recommendation systems, the policy that generated the logged data is typically unknown and cannot be directly estimated. Offline reinforcement learning methods often rely on Inverse Propensity Scoring (IPS) to debias such data by reweighting interactions according to the logging policy’s action probabilities. However, estimating the logging policy accurately is challenging and prone to error, which can introduce additional biases, and IPS itself is known to suffer from high variance. Consequently, offline RL approaches that depend on IPS are ill-suited for our setting.</p><h3>Advantage Weighted Supervised Fine Tuning</h3><p>Given the three challenges outlined above, we propose a new algorithm, Advantage-Weighted SFT (A-SFT), which combines supervised fine-tuning with advantage reweighting from reinforcement learning. The key observation is that, although the reward estimate for any individual event is highly uncertain, the estimates still contain directional signals separating high-reward from low-reward events. These signals can help better align the model during post-training.</p><p>A central factor in this study is the generalization ability of the reward model. Better generalization enables more accurate predictions of user preferences for unseen titles, thereby making exploration more effective. For reward models with moderate to high generalization power, both online RL methods such as PPO and offline RL methods such as CQL can perform effectively. However, in our setting, reward model generalization is worse than that of its language counterparts, which makes these algorithms less appropriate. 
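</p><p>As a rough illustrative sketch (our own simplification with hypothetical names, not the authors’ implementation, which is given in the algorithm figure below), A-SFT can be viewed as a behavior-cloning loss whose per-event weight grows with the advantage, i.e. the estimated reward minus a baseline, with a temperature controlling how far the weights deviate from plain imitation:</p>

```python
import math

def a_sft_loss(log_probs, rewards, beta=1.0):
    """Simplified advantage-weighted SFT loss (illustrative only).

    log_probs: model log-likelihood of each observed (chosen) item.
    rewards:   (noisy) estimated reward for each observed item.
    beta:      temperature; as beta -> infinity the weights flatten
               and the loss reduces to plain behavior cloning.
    """
    n = len(rewards)
    baseline = sum(rewards) / n                      # simple mean baseline
    weights = [math.exp((r - baseline) / beta) for r in rewards]
    total = sum(weights)
    weights = [w * n / total for w in weights]       # normalize to mean 1
    # Weighted negative log-likelihood: only the *direction* of the
    # advantage matters, so noisy-but-directional rewards still help.
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / n

# With a huge beta the loss is just mean NLL (behavior cloning) ...
bc_like = a_sft_loss([-1.0, -2.0], [5.0, 1.0], beta=1e9)   # ~1.5
# ... while a smaller beta up-weights the high-reward event.
weighted = a_sft_loss([-1.0, -2.0], [5.0, 1.0], beta=2.0)  # ~1.12
```

<p>One appeal of this form is that it never leaves the supervised objective’s structure, so it requires neither counterfactual feedback nor the logging policy’s propensities.</p><p>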
In addition, the use of techniques like inverse propensity scoring (IPS) introduces a heightened risk of high-variance estimates, prompting us to exclude algorithms such as off-policy REINFORCE.</p><p>Our proposed method A-SFT does not rely on IPS. Because it requires no prior knowledge of the logging policy, it can be applied generally to cases where observations of the environment are limited or biased. This is particularly useful in the recommendation setting, given the user feedback loop and distribution shifts over time. Without knowing the logging policy, A-SFT still provides a means of controlling the deviation between the current policy and the logging policy by tuning its parameter. This design is essential for controlling the bias learned from uncertain reward models. We show that A-SFT outperforms baseline behavior cloning by directly optimizing observed rewards.</p><p>The advantage-weighted SFT algorithm is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eP-_FyLRs6vrGnwyIp_j_A.png"></figure><p>For the results presented in this blog post, we treat the recommendation problem as a contextual bandit, i.e. 
given a history of user interactions as the context, can we recommend a high-reward next title for the user?</p><h3>Benchmarks</h3><p>We compared against the following representative baselines:</p><ol><li><strong>Reward-Weighted Behavior Cloning</strong>: This benchmark algorithm modifies supervised fine-tuning (SFT) by weighting the loss with the raw rewards of the chosen item instead of weighting it with the advantage, as in the proposed algorithm.</li><li><strong>Rejection Sampling Direct Preference Optimization / Identity Preference Optimization (RS DPO/IPO)</strong>: This is a variant of DPO/IPO where, for each user history x, we generate contrasting response pairs by training an ensemble of reward models to estimate confidence intervals for the reward of multiple potential responses y. If the lower bound of the reward confidence interval for one response is higher than the upper bound for another response, the pair is used to train DPO/IPO.</li><li><strong>Conservative Q-Learning (CQL)</strong>: This is a standard offline algorithm that learns a conservative Q function, penalizing overestimation of Q-values, particularly in regions of the state-action space with little or no reward data.</li><li><strong>Proximal Policy Optimization (PPO)</strong>: This is a standard RLHF (Reinforcement Learning from Human Feedback) algorithm that uses reward models as an online environment. PPO learns an advantage function and optimizes the policy to maximize expected reward while maintaining proximity to the initial policy.</li></ol><p>We sampled a separate test set of O(millions) of users, collected at a later date than the training data.</p><h3>Offline Evaluation Results</h3><p>We evaluate our algorithm on a dataset of high-reward user trajectories. For the sake of simplicity, we consider a trajectory to have a high reward if its accumulated reward is higher than the median of the population. 
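</p><p>As a small worked example (toy numbers of our own), this evaluation set can be built by keeping only the trajectories whose accumulated reward exceeds the population median:</p>

```python
# Toy sketch: select "high-reward" trajectories, i.e. those whose
# accumulated reward is above the median across the population.
from statistics import median

# Hypothetical per-event rewards for four users' trajectories.
trajectories = {
    "u1": [1, 2, 5],   # total 8
    "u2": [1, 1],      # total 2
    "u3": [3, 3],      # total 6
    "u4": [0, 1],      # total 1
}

totals = {user: sum(rewards) for user, rewards in trajectories.items()}
cutoff = median(totals.values())      # median of [8, 2, 6, 1] -> 4.0
high_reward = sorted(u for u, t in totals.items() if t > cutoff)
print(high_reward)  # ['u1', 'u3']
```

<p>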
We present the following metrics for the held-out test dataset:</p><ol><li><strong>NDCG@k</strong>: This measures the ranking quality of the recommended items up to position k. It accounts for the position of relevant items in the recommendation list, assigning higher scores when relevant items appear higher in the ranking. The gain is discounted logarithmically at lower ranks, and the result is normalized by the ideal ranking (i.e., the best possible ordering of items).</li><li><strong>HR@k</strong>: This measures the proportion of test cases in which the ground-truth chosen item y appears in the top k recommendations. It is a binary metric per test case (hit or miss) and is averaged over all test cases.</li><li><strong>MRR</strong>: MRR evaluates the ranking quality by measuring the reciprocal of the rank at which the chosen item appears in the recommendation list. The metric is averaged across all test cases.</li><li><strong>Reward Model as a Judge</strong>: We use the reward model to evaluate the policy for future user events. We use an ensemble of reward models for the evaluation to increase confidence. The result is based on the discounted reward generated over a few steps. The standard deviation is less than 4%.</li></ol><p>We measure the percentage improvement in each metric compared to the baseline, Reward-Weighted Behavior Cloning (BC). Advantage-weighted SFT shows the largest improvement across metrics, outperforming BC as well as reward-model-dependent algorithms like CQL, PPO, DPO and IPO.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Z8wcCOETlobx8T_oVuDRiA.png"></figure><p>Our experiments show that advantage-weighted SFT is a simple but promising approach for post-training generative recommenders, as it copes with poor reward model generalization and the inapplicability of IPS. More specifically, we find that PPO, IPO and DPO achieve good reward scores but also overfit to the reward model. 
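</p><p>The three ranking metrics listed above follow their standard definitions; for the single-relevant-item case they reduce to a few lines each (a minimal sketch of our own, not the post’s evaluation code):</p>

```python
import math

def hit_rate_at_k(ranked, target, k):
    """HR@k: 1 if the chosen item appears in the top-k list, else 0."""
    return 1.0 if target in ranked[:k] else 0.0

def reciprocal_rank(ranked, target):
    """MRR term: reciprocal of the 1-based rank of the chosen item."""
    return 1.0 / (ranked.index(target) + 1) if target in ranked else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@k with one relevant item: the ideal DCG is 1/log2(2) = 1,
    so this reduces to the discounted gain at the item's position."""
    if target in ranked[:k]:
        return 1.0 / math.log2(ranked.index(target) + 2)
    return 0.0

ranked = ["t3", "t1", "t7"]                  # model's top recommendations
print(hit_rate_at_k(ranked, "t1", 3))        # 1.0
print(reciprocal_rank(ranked, "t1"))         # 0.5
print(round(ndcg_at_k(ranked, "t1", 3), 3))  # 0.631
```

<p>Each metric is then averaged over all test cases; only the position of the ground-truth item matters.</p><p>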
Conservative Q-Learning achieves more robust improvements but does not fully capture the potential signals in the reward modeling. A-SFT achieves both better recommendation metrics and better reward scores.</p><hr><p><a href="https://netflixtechblog.com/post-training-generative-recommenders-with-advantage-weighted-supervised-finetuning-61a538d717a9">Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/post-training-generative-recommenders-with-advantage-weighted-supervised-finetuning-61a538d717a9</link>
      <guid>https://netflixtechblog.com/post-training-generative-recommenders-with-advantage-weighted-supervised-finetuning-61a538d717a9</guid>
      <pubDate>Sun, 26 Oct 2025 00:01:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Behind the Streams: Real-Time Recommendations for Live Events Part 3]]></title>
      <description><![CDATA[<p>By: <a href="https://www.linkedin.com/in/krisrange/">Kris Range</a>, <a href="https://www.linkedin.com/in/gulatiankush/">Ankush Gulati</a>, <a href="https://www.linkedin.com/in/jimpisaacs/">Jim Isaacs</a>, <a href="https://www.linkedin.com/in/jennifer-s-0019a516/">Jennifer Shin</a>, <a href="https://www.linkedin.com/in/jeremy-kelly-526a30180/">Jeremy Kelly</a>, <a href="https://www.linkedin.com/in/jason-t-26850b26/">Jason Tu</a></p><p><em>This is part 3 in a series called “Behind the Streams”. Check out </em><a href="https://netflixtechblog.com/behind-the-streams-live-at-netflix-part-1-d23f917c2f40"><em>part 1</em></a><em> and </em><a href="https://netflixtechblog.com/building-a-reliable-cloud-live-streaming-pipeline-for-netflix-8627c608c967"><em>part 2</em></a><em> to learn more.</em></p><p>Picture this: It’s seconds before the biggest fight night in Netflix history. Sixty-five million fans are waiting, devices in hand, hearts pounding. The countdown hits zero. What does it take to get everyone to the action on time, every time? At Netflix, we’re used to on-demand viewing where everyone chooses their own moment. But with live events, millions are eager to join in at once. Our job: make sure our members never miss a beat.</p><p>When Live events break streaming records <a href="https://about.netflix.com/en/news/60-million-households-tuned-in-live-for-jake-paul-vs-mike-tyson">¹</a> <a href="https://about.netflix.com/en/news/netflix-nfl-christmas-gameday-reaches-65-million-us-viewers">²</a> <a href="https://about.netflix.com/en/news/over-41-million-global-viewers-on-netflix-watch-terence-crawford-defeat">³</a>, our infrastructure faces the ultimate stress test. Here’s how we engineered a discovery experience for a global audience excited to see a knockout.</p><h3>Why are Live Events Different?</h3><p>Unlike Video on Demand (VOD), members want to catch live events as they happen. 
There’s something uniquely exciting about being part of the moment. That means we only have a brief window to recommend a Live event at just the right time. Too early, excitement fades; too late, the moment is missed. Every second counts.</p><p>To capture that excitement, we enhanced our recommendation delivery systems to serve real-time suggestions, providing members richer and more compelling signals to hit play in the moment when it matters most. The challenge? Sending dynamic, timely updates concurrently to over a hundred million devices worldwide without creating a <a href="https://en.wikipedia.org/wiki/Thundering_herd_problem">thundering herd effect</a> that would overwhelm our cloud services. Simply scaling up linearly is neither efficient nor reliable, and for popular events it could also divert resources from other critical services. We needed a smarter and more scalable solution than just adding more resources.</p><h3>Orchestrating the moment: Real-time Recommendations</h3><p>With millions of devices online and live event schedules that can shift in real time, the challenge was to keep everyone perfectly in sync. We set out to solve this by building a system that doesn’t just react, but adapts by dynamically updating recommendations as the event unfolds. We identified the need to balance three constraints:</p><ul><li><strong>Time</strong>: the duration required to coordinate an update.</li><li><strong>Request throughput</strong>: the capacity of our cloud services to handle requests.</li><li><strong>Compute cardinality</strong>: the variety of requests necessary to serve a unique update.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-z6A8FriBAbJW5BcwMrZTA.png"><figcaption>Visualizing constraints for real-time updates</figcaption></figure><p>We solved this constraint optimization problem by splitting the real-time recommendations into two phases: <strong>prefetching</strong> and <strong>real-time broadcasting</strong>. 
First, we prefetch the necessary data ahead of time, distributing the load over a longer period to avoid traffic spikes. When the Live event starts or ends, we broadcast a low-cardinality message to all connected devices, prompting them to use the prefetched data locally. The timing of the broadcast also adapts when event times shift, staying in sync with the Live event’s production. By combining these two phases, we’re able to keep our members’ devices in sync and solve the thundering herd problem. To maximize device reach, especially for those with unstable networks, we use “at least once” broadcasts to ensure every device gets the latest updates and can catch up on any previously missed broadcasts as soon as they’re back online.</p><p>The first phase optimizes <strong>request throughput</strong> and <strong>compute cardinality</strong> by prefetching materialized recommendations, displayed title metadata, and artwork for a Live event. As members naturally browse their devices before the event, this data is prepopulated and stored locally in device cache, awaiting the notification trigger to serve the recommendations instantaneously. By distributing these requests naturally over time ahead of the event, we can eliminate any related traffic spikes and avoid the need for large-scale, real-time system scaling.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QmB99SosLqs-JEo0gp1wwg.png"><figcaption>A phased approach, smoothing traffic requests over time with a real-time low-cardinality broadcast</figcaption></figure><p>The second phase optimizes <strong>request throughput</strong> and <strong>time</strong> to update devices by broadcasting a low-cardinality, real-time message to all connected devices at critical moments in a Live event’s lifecycle. Each broadcast payload includes a <strong>state key</strong> and a <strong>timestamp</strong>. 
The state key indicates the current stage of the Live event, allowing devices to use their pre-fetched data to update cached responses locally without additional server requests. The timestamp ensures that if a device misses a broadcast due to network issues, it can catch up by replaying missed updates upon reconnecting. This mechanism guarantees devices receive updates at least once, significantly increasing delivery reliability even on unstable networks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h6CIrfYnpR24NS5hgvxEGA.png"><figcaption>A phased approach optimizes each constraint to ensure we can deliver for the big moment!</figcaption></figure><blockquote>Moment in Numbers: During peak load, we have successfully delivered updates at multiple stages of our events to over 100 million devices in under a minute.</blockquote><h3>Under the Hood: How It Works</h3><p>With the big picture in mind, let’s examine how these pieces interact in practice.</p><p>In the diagram below, the Message Producer microservice centralizes all of the business logic. It continuously monitors live events for setup and timing changes. When it detects an update, it schedules broadcasts to be sent at precisely the right moment. The Message Producer also standardizes communication by providing a concise GraphQL schema for both device queries and broadcast payloads.</p><p>Rather than sending broadcasts directly to devices via WebSocket, the Message Producer hands them off to the Message Router. The Message Router is part of a robust two-tier pub/sub architecture built on proven technologies like <a href="https://netflixtechblog.com/pushy-to-the-limit-evolving-netflixs-websocket-proxy-for-the-future-b468bc0ff658">Pushy</a> (our WebSocket proxy), Apache Kafka, and <a href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">Netflix’s KV key-value store</a>. 
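</p><p>As an illustrative sketch (hypothetical names and payload shape, not Netflix’s actual client code), the device-side handling of state keys, timestamps, and replay described above could look like this:</p>

```python
# Hypothetical sketch of device-side broadcast handling: a broadcast
# carries a state key and a timestamp; the device swaps in prefetched
# data for that state, and on reconnect replays any broadcasts newer
# than the last one it applied ("at least once" delivery).

class DeviceCache:
    def __init__(self, prefetched):
        # prefetched: state key -> pre-materialized recommendation row
        self.prefetched = prefetched
        self.last_applied_ts = 0
        self.current_row = None

    def on_broadcast(self, state_key, ts):
        """Apply one broadcast locally -- no extra server request."""
        if ts <= self.last_applied_ts:
            return  # duplicate or stale message; safe to ignore
        self.current_row = self.prefetched.get(state_key)
        self.last_applied_ts = ts

    def on_reconnect(self, missed):
        """Replay broadcasts missed while offline, in timestamp order."""
        for state_key, ts in sorted(missed, key=lambda b: b[1]):
            self.on_broadcast(state_key, ts)

cache = DeviceCache({"PRE_LIVE": "countdown row", "LIVE": "watch-now row"})
cache.on_broadcast("PRE_LIVE", ts=100)
# Device drops offline, misses the "LIVE" broadcast, then reconnects:
cache.on_reconnect([("LIVE", 200), ("PRE_LIVE", 100)])
print(cache.current_row)  # watch-now row
```

<p>Because duplicates are filtered by timestamp, broadcasts can be delivered redundantly without risking an inconsistent device state.</p><p>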
The Message Router tracks subscriptions at the Pushy node granularity, while Pushy nodes map the subscriptions to individual connections, creating a low-latency fanout that minimizes compute and bandwidth requirements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Kc1l_Wnc4i08xFA8JnRLCA.png"></figure><p>Devices interface with our GraphQL <a href="https://netflix.github.io/dgs/">Domain Graph Service (DGS)</a>. These schemas offer multiple query interfaces for prefetching, allowing devices to tailor their requests to the specific experience being presented. Each response adheres to a consistent API that resolves to a map of stage keys, enabling fast lookups and keeping business logic off the device. Our broadcast schema specifies WebSocket connection parameters, the current event stage, and the timestamp of the last broadcast message. When a device receives a broadcast, it injects the payload directly into its cache, triggering an immediate update and re-render of the interface.</p><h3>Balancing the Moment: Throughput Management</h3><p>In addition to building the new technology to support real-time recommendations, we also evaluated our existing systems for potential traffic hotspots. Using high-watermark traffic projections for live events, we generated synthetic traffic to simulate game-day scenarios and observed how our online services handled these bursts. Through this process, several common patterns emerged:</p><p><strong>Breaking the Cache Synchrony</strong></p><p>Our game-day simulations revealed that while our approach mitigated the immediate thundering herd risks driven by member traffic during the events, live events introduced unexpected mini thundering herds in our systems hours before and after the actual events. The surge of members joining just in time for these events led to concentrated cache expirations and recomputations, which created traffic spikes well outside the event window that we did not anticipate. 
This was not a problem for VOD content because the member traffic patterns are a lot smoother. We found that fixed TTLs caused cache expirations and refresh-traffic spikes to happen all at once. To address this, we added jitter to server and client cache expirations to spread out refreshes and smooth out traffic spikes.</p><p><strong>Adaptive Traffic Prioritization</strong></p><p>While our services already leverage traffic prioritization and partitioning based on factors such as request type and device type, live events introduced a distinct challenge. These events generated brief traffic bursts that were intensely spiky and placed significant strain on our systems. Through simulations, we recognized the need for an additional event-driven layer of traffic management.</p><p>To tackle this, we improved our traffic sharding strategies by using event-based signals. This enabled us to route live event traffic to dedicated clusters with more aggressive scaling policies. We also added a dynamic traffic prioritization ruleset that activates whenever we see high requests per second (RPS) to ensure our systems can handle the surge smoothly. During these peaks, we aggressively deprioritize non-critical server-driven updates so that our systems can devote resources to the most time-sensitive computations. This approach ensures smooth performance and reliability when demand is at its highest.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/729/1*PYhFbFxK5PbOEAnYtTfD4Q.jpeg"><figcaption>Snapshot of non-critical traffic volume decline (in %) for a member-facing service during a live event — achieved via aggressive de-prioritization</figcaption></figure><h3>Looking Ahead</h3><p>When we set out to build a seamlessly scalable scheduled viewing experience, our goal was to create a dynamic and richer member experience for live content. Popular live events like the Crawford v. Canelo fight and the NFL Christmas games truly put our systems to the test. 
Along the way, we also uncovered valuable learnings that continue to shape our work. Our attempts to deprioritize traffic to other non-critical services caused unexpected call patterns and spikes in traffic elsewhere. Similarly, in hindsight, we learned that the high traffic volume from popular events generated excessive non-essential logging and put unnecessary pressure on our ingestion pipelines.</p><p>None of this work would have been possible without our stunning colleagues at Netflix who collaborated across multiple functions to architect, build, and test these approaches, ensuring members can easily access events at the right moment: UI Engineering, Cloud Gateway, Data Science &amp; Engineering, Search and Discovery, Evidence Engineering, Member Experience Foundations, Content Promotion and Distribution, Operations and Reliability, Device Playback, Experience and Design and Product Management.</p><p>As Netflix’s content offering expands to include new formats like live titles, free-to-air linear content, and games, we’re excited to build on what we’ve accomplished and look ahead to even more possibilities. Our roadmap includes extending the capabilities we developed for scheduled live viewing to these emerging formats. We’re also focused on enhancing our engineering tooling for greater visibility into operations, message delivery, and error handling to help us continue to deliver the best possible experience for our members.</p><h3>Join Us for What’s Next</h3><p>We’re just scratching the surface of what’s possible as we bring new live experiences to members around the world. 
If you are looking to solve interesting technical challenges in a <a href="https://jobs.netflix.com/culture">unique culture</a>, then <a href="https://jobs.netflix.com/">apply</a> for a role that captures your curiosity.</p><p><em>Look out for future blog posts in our “Behind the Streams” series, where we’ll explore the systems that ensure viewers can watch live streams once they manage to find and play them.</em></p><hr><p><a href="https://netflixtechblog.com/behind-the-streams-real-time-recommendations-for-live-events-e027cb313f8f">Behind the Streams: Real-Time Recommendations for Live Events Part 3</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/behind-the-streams-real-time-recommendations-for-live-events-e027cb313f8f</link>
      <guid>https://netflixtechblog.com/behind-the-streams-real-time-recommendations-for-live-events-e027cb313f8f</guid>
      <pubDate>Tue, 21 Oct 2025 02:53:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data…]]></title>
      <description><![CDATA[<h3>How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale</h3><p>Authors: <a href="https://www.linkedin.com/in/ataruc/">Adrian Taruc</a> and <a href="https://www.linkedin.com/in/jamesdalydalton/">James Dalton</a></p><p><em>This is the first entry of a multi-part blog series describing how we built a Real-Time Distributed Graph (RDG). In Part 1, we will discuss the motivation for creating the RDG and the architecture of the data processing pipeline that populates it.</em></p><h3>Introduction</h3><p>The Netflix product experience historically consisted of a single core offering: streaming video on demand. Our members logged into the app, browsed, and watched titles such as Stranger Things, Squid Game, and Bridgerton. Although this is still the core of our product, our business has changed significantly over the last few years. For example, we introduced ad-supported plans, live programming events (e.g., <a href="https://www.netflix.com/title/81764952">Jake Paul vs. Mike Tyson</a> and <a href="https://www.netflix.com/tudum/articles/nfl-games-on-netflix">NFL Christmas Day Games</a>), and <a href="https://about.netflix.com/en/news/let-the-games-begin-a-new-way-to-experience-entertainment-on-mobile">mobile games</a> as part of a Netflix subscription. This evolution of our business has created a new class of problems where we have to analyze member interactions with the app across different business verticals. 
Let’s walk through a simple example scenario:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TPFlIvYqGC3L2x1A-KqkyQ.png"></figure><ol><li>Imagine a Netflix member logging into the app on their smartphone and beginning to watch an episode of Stranger Things.</li><li>Eventually, they decide to watch on a bigger screen, so they log into the app on a smart TV in their home and continue watching the same episode.</li><li>Finally, after completing the episode, they log into the app on their tablet and play the game “Stranger Things: 1984”.</li></ol><p>We want to know that these three activities belong to the same member, despite occurring at different times and across various devices. In a traditional data warehouse, these events would land in at least two different tables and may be processed at different cadences. But in a graph system, they become connected almost instantly. Ultimately, analyzing member interactions in the app across domains empowers Netflix to create more personalized and engaging experiences.</p><p>In the early days of our business expansion, discovering these relationships and contextual insights was extremely difficult. Netflix is famous for adopting a microservices architecture — hundreds of microservices developed and maintained by hundreds of individual teams. Some notable benefits of microservices are:</p><ol><li><strong>Service Decomposition</strong>: The overall platform is separated into smaller services, each responsible for a specific business capability. This modularity allows for independent service development, deployment, and scaling.</li><li><strong>Data Isolation</strong>: Each service manages its own data, reducing interdependencies. 
This allows teams to choose the most suitable data schemas and storage technologies for their services.</li></ol><p><strong>However, these benefits also led to drawbacks for our data science and engineering partners.</strong> In practice, the separation of business concerns and service development ultimately resulted in a separation of data. Manually stitching data together from our data warehouse and siloed databases was an onerous task for our partners. Our data engineering team recognized we needed a solution to process and store our enormous swath of interconnected data while enabling fast querying to discover insights. Although we could have structured the data in various ways, we ultimately settled on a graph representation. We believe a graph offers key advantages, specifically:</p><ul><li><strong>Relationship-Centric Queries:</strong> Graphs enable fast “hops” across multiple nodes and edges without expensive joins or manual denormalization that would be required in table-based data models.</li><li><strong>Flexibility as Relationships Grow:</strong> As new connections and entities emerge, graphs can quickly adapt without significant schema changes or re-architecture.</li><li><strong>Pattern and Anomaly Detection:</strong> Our stakeholders’ use cases often require identifying hidden relationships, cycles, or groupings in the data — capabilities much more naturally expressed and efficiently executed using graph traversals than siloed point lookups.</li></ul><p>This is why we set out to build a Real-Time Distributed Graph, or “RDG” for short.</p><h3>Ingestion and Processing</h3><p>Three main layers in the system power the RDG:</p><ol><li><strong>Ingestion and Processing</strong> — receive events from disparate upstream data sources and use them to generate graph nodes and edges.</li><li><strong>Storage</strong> — write nodes and edges to persistent data stores.</li><li><strong>Serving</strong> — expose ways for internal clients to query graph nodes and 
edges.</li></ol><p><strong>The rest of this post will focus on the first layer, while subsequent posts in this blog series will cover the other layers.</strong> The diagram below depicts a high-level overview of the ingestion and processing pipeline:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Jy0eVxvB-AzFNijNfRb9ZA.png"></figure><p>Building and updating the RDG in real-time requires continuously processing vast volumes of incoming data. Batch processing systems and traditional data warehouses cannot offer the low latency needed to maintain an up-to-date graph that supports real-time applications. We opted for a stream processing architecture, enabling us to update the graph’s data as events happen, thus minimizing delay and ensuring the system reflects the latest member interactions within the Netflix app.</p><h3>Kafka as the Ingestion Backbone</h3><p>Member actions in the Netflix app are published to our API Gateway, which then writes them as records to <a href="https://kafka.apache.org/">Apache Kafka</a> topics. Kafka is the mechanism through which internal data applications can consume these events. It provides durable, replayable streams that downstream processors, such as <a href="https://flink.apache.org/">Apache Flink</a> jobs, can consume in real-time.</p><p>Our team’s applications consume several different Kafka topics, each generating up to roughly <strong>1 million messages per second</strong>. Topic records are encoded in the Apache Avro format, and Avro schemas are persisted in an internal centralized schema registry. In order to strike a balance between maintaining data availability and managing the financial expenses of storage infrastructure, we tailor retention policies for each topic according to its throughput and record size. 
We also persist topic records to <a href="https://iceberg.apache.org/">Apache Iceberg</a> data warehouse tables, which allows us to backfill data in scenarios where older data is no longer available in the Kafka topics.</p><h3>Processing Data with Apache Flink</h3><p>The event records in the Kafka streams are ingested by Flink jobs. We chose Flink because of its strong capabilities around near-real-time event processing. There is also robust internal platform support for Flink within Netflix, which allows jobs to integrate with Kafka and various storage backends seamlessly. At a high level, the anatomy of an RDG Flink job looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0G-yrGzB_ZbgaRlPbextTQ.png"></figure><p>For the sake of simplicity, the diagram above depicts a basic flow in which a member logs into their Netflix account and begins watching an episode of Stranger Things. Reading the diagram from left to right:</p><ul><li>The actions of logging into the app and watching the Stranger Things episode are ultimately written as events to Kafka topics.</li><li>The Flink job consumes event records from the upstream Kafka topics.</li><li>Next, we have a series of Flink processor functions that:</li></ul><ol><li>Apply filtering and projections to remove noise based on the individual fields that are present — or in some cases, not present — in the events.</li><li>Enrich events with additional metadata, which are stored and accessed by the processor functions via side inputs.</li><li>Transform events into graph primitives — nodes representing entities (e.g., member accounts and show/movie titles), and edges representing relationships or interactions between them. In this example, the diagram only shows a few nodes and an edge to keep things simple. 
However, in reality, we create and update up to a few dozen different nodes and edges, depending on the member actions that occurred within the Netflix app.</li><li>Buffer, detect, and deduplicate overlapping updates that occur to the same nodes and edges within a small, configurable time window. This step reduces the data throughput we publish downstream. It is implemented using stateful process functions and timers.</li><li>Publish node and edge records to <a href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873">Data Mesh</a>, an abstraction layer that connects data applications and storage systems. We write a total (nodes + edges) of <strong>more than 5 million records per second</strong> to Data Mesh, which handles persisting the records to various data stores that other internal services can query.</li></ol><h3>From One Job to Many: Scaling Flink the Hard Way</h3><p>Initially, we tried having just one Flink job that consumed all the Kafka source topics. However, this quickly became a big operational headache since different topics can have different data volumes and throughputs at different times during the day. Consequently, tuning the monolithic Flink job became extremely difficult — we struggled to find CPU, memory, job parallelism, and checkpointing interval configurations that ensured job stability.</p><p>Instead, we pivoted to having a 1:1 mapping from each Kafka source topic to its consuming Flink job. Although this led to additional operational overhead due to more jobs to develop and deploy, each job has been much simpler to maintain, analyze, and tune.</p><p>Similarly, each node and edge type is written to a separate Kafka topic. This means we have significantly more Kafka topics to manage. However, we decided the tradeoff of having bespoke tuning and scaling per topic was worth it.
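<p>The buffer-and-deduplicate step described earlier (stateful process functions plus timers) can be illustrated outside of Flink as a keyed buffer that keeps only the latest update per node/edge key and emits the collapsed set when a simulated window timer fires. This is a stdlib-only Java sketch for illustration; the class and key names are invented and are not from the RDG codebase.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-in for the stateful dedup step: buffer updates to the
 *  same node/edge key within a window, keep only the latest, and emit the
 *  collapsed set when the (simulated) window timer fires. */
final class UpdateBuffer<K, V> {
    private final Map<K, V> pending = new LinkedHashMap<>();

    /** Later updates to the same key overwrite earlier ones. */
    void offer(K key, V value) { pending.put(key, value); }

    /** Called when the window timer fires: emit the buffered set and clear it. */
    Map<K, V> flush() {
        Map<K, V> out = new LinkedHashMap<>(pending);
        pending.clear();
        return out;
    }
}

public class Main {
    public static void main(String[] args) {
        UpdateBuffer<String, String> buffer = new UpdateBuffer<>();
        // Three raw events, two of which touch the same edge.
        buffer.offer("member:42->title:ST", "watch@t1");
        buffer.offer("member:42->title:ST", "watch@t2"); // overwrites t1
        buffer.offer("member:42->game:ST1984", "play@t3");

        Map<String, String> emitted = buffer.flush();
        if (emitted.size() != 2) throw new AssertionError("expected 2 records");
        if (!"watch@t2".equals(emitted.get("member:42->title:ST")))
            throw new AssertionError("latest update should win");
        System.out.println("emitted " + emitted.size() + " deduplicated records");
    }
}
```

<p>In the real pipeline this role is played by Flink's keyed state and timers; the sketch only shows the collapse-to-latest behavior that reduces downstream throughput.</p>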
We also designed the graph data model to be as generic and flexible as possible, so adding new types of nodes and edges would be an infrequent operation.</p><h3>Acknowledgements</h3><p>We would be remiss if we didn’t give a special shout-out to our stunning colleagues who work on the internal Netflix data platform. Building the RDG was a multi-year effort that required us to design novel solutions, and the investments and foundations from our platform teams were critical to its successful creation. You make the lives of Netflix data engineers much easier, and the RDG would not exist without your diligent collaboration!</p><p>—</p><p>Thanks for reading the first season of the RDG blog series; stay tuned for Season 2, where we will go over the storage layer containing the graph’s various nodes and edges.</p><hr><p><a href="https://netflixtechblog.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-1-ingesting-and-processing-data-80113e124acc">How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data…</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-1-ingesting-and-processing-data-80113e124acc</link>
      <guid>https://netflixtechblog.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-1-ingesting-and-processing-data-80113e124acc</guid>
      <pubDate>Fri, 17 Oct 2025 20:42:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine]]></title>
      <description><![CDATA[<p>By <a href="https://www.linkedin.com/in/jheua/">Jun He</a>, <a href="https://www.linkedin.com/in/yingyi-zhang-a0a164111/">Yingyi Zhang</a>, <a href="https://www.linkedin.com/in/spearsem/">Ely Spears</a></p><h3>TL;DR</h3><p>We recently upgraded the Maestro engine to go beyond scalability and improved its performance by <strong>100X</strong>! The overall overhead is reduced from seconds to milliseconds. We have updated the Maestro open source project with this improvement! Please visit the <a href="https://github.com/Netflix/maestro">Maestro GitHub repository</a> to get started. If you find it useful, please <a href="https://github.com/Netflix/maestro">give us a star</a>.</p><h3>Introduction</h3><p>In our previous <a href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78">blog post</a>, we introduced Maestro as a horizontally scalable workflow orchestrator designed to manage large-scale Data/ML workflows at Netflix. Over the past two and a half years, Maestro has achieved its design goal and successfully supported massive workflows with hundreds of thousands of jobs, managing millions of executions daily. As the adoption of Maestro increases at Netflix, new use cases have emerged, driven by Netflix’s evolving business needs, such as Live, Ads, and Games. To meet these needs, some of the workflows are now scheduled on a sub-hourly basis. Additionally, Maestro is increasingly being used for low-latency use cases, such as ad hoc queries, beyond traditional daily or hourly scheduled ETL data pipeline use cases.</p><p>While Maestro excels in orchestrating various heterogeneous workflows and managing user end-to-end development experiences, users have experienced noticeable speedbumps (i.e. ten seconds overhead) from the Maestro engine during workflow executions and development, affecting overall efficiency and productivity. 
Although Maestro is fully scalable to support Netflix-scale use cases, the processing overhead from its internal engine state transitions and lifecycle activities has become a bottleneck, particularly during development cycles. Users have expressed the need for a high-performance workflow engine to support iterative development use cases.</p><p>To visualize our end users’ needs for the workflow orchestrator, we created the 5-layer structure shown below. Before the change, Maestro reached level 4 but struggled to satisfy users’ needs at level 5. With the new engine design, Maestro enables users to work at their full capacity and makes development on Maestro far more enjoyable.</p><figure><img alt="Figure 1. A 5-layer structure showing needs for the workflow orchestrator" src="https://cdn-images-1.medium.com/max/562/1*q871tK1C7Y8VSAXVOmu4Ig.png"><figcaption>Figure 1. A 5-layer structure showing needs for the workflow orchestrator.</figcaption></figure><p>In this blog post, we will share the details of the new engine, explain our design trade-off decisions, and share learnings from this redesign work.</p><h3>Architectural Evolution of Maestro</h3><h4>Before the change</h4><p>To understand the improvements, we will first revisit the original architecture of Maestro and why its overhead was high. The system was divided into three main layers, as illustrated in the diagram below. In the sections that follow, we will explain each layer and the role it played in our performance optimization.</p><figure><img alt="Figure 2. The architecture diagram before the evolution." src="https://cdn-images-1.medium.com/max/1024/1*198QCvklU8o6aUrDdaBGRA.png"><figcaption>Figure 2. The architecture diagram before the evolution.</figcaption></figure><p><strong>Maestro API and Step Runtime Layer</strong></p><p>This layer offers seamless integrations with other Netflix services (e.g., compute engines like Spark and Trino).
Using Maestro, thousands of practitioners build production workflows using a paved path to access platform services. They can focus primarily on their business logic while relying on Maestro to manage the lifecycle of jobs and workflows, the integration with data platform services, and cross-cutting concerns such as authentication, monitoring, and alerting. This layer functioned efficiently without introducing significant overhead.</p><p><strong>Maestro Engine Layer</strong></p><p>The Maestro engine serves several crucial functions:</p><ul><li>Managing the lifecycle of workflows and their steps, and maintaining their state machines</li><li>Supporting all user actions (e.g., start, restart, stop, pause) on workflow and step entities</li><li>Translating complex Maestro workflow graphs into parallel flows, where each flow is an array of sequentially chained flow tasks, translating every step into a flow task, and then executing transformed flows using the internal flow engine</li><li>Acting as a middle layer to maintain isolation between the Maestro step runtime layer and the underlying flow engine layer</li><li>Implementing required data access patterns and writing Maestro data into the database</li></ul><p>In terms of speed, this layer had acceptable overhead but faced edge cases (e.g.
a step might be executed by two workers at the same time, causing race conditions) due to the lack of strong guarantees from the internal flow engine and the external distributed job queue.</p><p><strong>Maestro Internal Flow Engine Layer</strong></p><p>The Maestro internal flow engine performed <strong>two</strong> primary functions:</p><ul><li>Calling each task’s execution function at a given interval.</li><li>Starting the next tasks in an array of sequential task flows (not a graph), if applicable.</li></ul><p>This foundational layer was based on Netflix OSS Conductor 2.x (<a href="https://github.com/Netflix/conductor/releases/tag/v3.0.0">deprecated since Apr 2021</a>), which requires a dedicated set of separate database tables and distributed job queues.</p><p>The existing implementation of this layer introduced significant overhead (from a few seconds to tens of seconds of overall delay). The lack of strong guarantees (e.g. exactly-once publishing) from this layer led to race conditions that caused stuck jobs or lost executions.</p><h4>Options to consider</h4><p>We evaluated three options to address these issues:</p><ul><li>Option 1: Implement an internal flow engine optimized for Maestro-specific use cases</li><li>Option 2: Upgrade the Conductor library to 4.0, which addresses the overhead and offers other improvements and enhancements compared with Conductor 2.x.</li><li>Option 3: Use Temporal as the internal flow engine</li></ul><p>One aspect that influenced our assessment of option two is that Conductor 2 provided a final callback capability in the state machine that was contributed specifically for Maestro’s use case to ensure database synchronization between the Conductor and Maestro engine states. Porting this functionality to Conductor 4 would have been required, though it had been dropped since no Conductor use case besides Maestro relied on it.
Rewriting the flow engine would allow us to remove several complex internal databases and database synchronization requirements, which was attractive for simplifying operations and improving reliability. Given that Maestro did not need the full set of state engine features offered by Conductor, this motivated us to consider a flow engine rewrite as a higher priority.</p><p>The decision for Temporal was more straightforward. Temporal is optimized towards facilitating inter-process orchestration and would involve calling an external service to interact with the Temporal flow engine. Given that Maestro runs more than a million tasks per day, many of which are long-running, we felt it was an unnecessary source of risk to couple the DAG engine execution with an external service call. If our requirements went beyond lightweight state transition management we might reconsider, because Temporal is a very robust control plane orchestration system, but for our needs it introduced complexity and potential reliability weak spots when there was no direct need for the advanced feature set that it offered.</p><p>After considering Option 2 and Option 3, we developed more conviction that Maestro’s architecture could be greatly simplified by not using a full DAG evaluation engine and not having to maintain state machines for two systems (Maestro and Conductor/Temporal). Therefore, we decided to go with Option 1.</p><h4>After the change</h4><p>To address these issues, we completely rewrote the Maestro internal flow engine layer to satisfy Maestro’s specific needs and optimize its performance. This new flow engine is lightweight with minimal dependencies, focusing on excelling in the two primary functions mentioned <a href="https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041#4bd7">above</a>.
We also replaced existing distributed job queues with internal ones to provide stronger guarantees.</p><p>The new engine is <strong>highly performant, efficient, scalable, and fault-tolerant</strong>. It is the foundation for all upper components of Maestro and provides the following guarantees to avoid race conditions:</p><ul><li>A single step should only be executed by a single worker at any given time</li><li>Step state should never be rolled back</li><li>Steps should always eventually run to a terminal state</li><li>The internal flow state should be eventually consistent with the Maestro workflow state</li><li>External API and user actions should not cause race conditions on the workflow execution</li></ul><p>Here is the new architecture diagram after the change, which is much simpler, with fewer dependencies:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZGutO7STwBr81NQ2qAWjYQ.png"><figcaption>Figure 3. The architecture diagram after the evolution.</figcaption></figure><h3>New Flow Engine Optimization</h3><p>The new flow engine significantly boosts speed by maintaining state in memory. It ensures consistency by using the Maestro engine’s database as the source of truth for workflow and step states. During bootstrapping, the flow engine rebuilds its in-memory state from the database, improving performance and simplifying the overall architecture. This is in contrast to the previous design, in which multiple databases had to be reconciled against one another (Conductor’s tables and Maestro’s tables) or else suffer race conditions and the occasional orphaned job status.</p><p>The flow engine operates on in-memory flow states, resembling a <a href="https://docs.aws.amazon.com/whitepapers/latest/database-caching-strategies-using-redis/caching-patterns.html#write-through">write-through caching pattern</a>. Updates to workflow or step state in the database also update the in-memory flow state.
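<p>As a rough illustration of this write-through arrangement, the sketch below uses two plain Java maps to stand in for the database (the source of truth) and the in-memory view; all names are invented and this is not Maestro code.</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal write-through sketch: the "database" map is the source of truth,
 *  and the in-memory map is a cache updated on every write. Both stores are
 *  plain maps here purely for illustration. */
final class WriteThroughState {
    private final Map<String, String> database = new HashMap<>();    // source of truth
    private Map<String, String> memory = new ConcurrentHashMap<>();  // cached view

    void update(String key, String state) {
        database.put(key, state); // persist first
        memory.put(key, state);   // then update the cached view
    }

    String read(String key) { return memory.get(key); }

    /** Simulate node failure / cache loss, then rebuild from the database. */
    void loseAndRebuild() {
        memory = new ConcurrentHashMap<>(database);
    }
}

public class Main {
    public static void main(String[] args) {
        WriteThroughState s = new WriteThroughState();
        s.update("step-1", "RUNNING");
        s.update("step-1", "COMPLETED");
        s.loseAndRebuild(); // cache wiped, rebuilt from the source of truth
        if (!"COMPLETED".equals(s.read("step-1"))) throw new AssertionError();
        System.out.println("state after rebuild: " + s.read("step-1"));
    }
}
```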
If in-memory state is lost, the flow engine rebuilds it from the database, ensuring eventual consistency and resolving race conditions.</p><p>This design delivers lower latency and higher throughput, avoids inconsistencies from dual persistence, simplifies the architecture, and keeps the in‑memory view eventually consistent with the database.</p><h4>Maintaining Scalability While Gaining Speed</h4><p>With the new engine, we significantly boost performance by collocating flows and their tasks on the same node throughout their lifecycle. Therefore, states of a flow and its tasks will stay in a single node’s memory without persisting to the database. This stickiness and locality bring great performance benefits but inevitably impact scalability since tasks are no longer reassigned to a new worker of the whole cluster in each polling cycle.</p><p>To maintain horizontal scalability, we introduced a flow group concept to partition running flows into groups. In this way, each Maestro flow engine instance only needs to maintain ownership of groups rather than individual flows, reducing maintenance costs (e.g., heartbeat) and simplifying reconciliation by allowing each Maestro node to load flows for a group in batches. Each Maestro node claims ownership of a group of flows through a flow group actor and manages their entire lifecycle via child flow actors. If ownership is lost due to node failure or long JVM GC, another node can claim the group to resume flow executions by reconciling internal state from Maestro database. The following diagram illustrates the ownership maintenance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nvUSa5zxZOWi4G8EIh9bQA.png"><figcaption>Figure 4. Ownership maintenance sequence diagram.</figcaption></figure><h4>Flow Partitioning</h4><p>To efficiently distribute traffic, Maestro assigns a consistent group ID to flows/workflows by a simple stable ID assignment method, as shown in the diagram’s Partitioning Function box. 
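<p>As an illustration, a stable assignment can be as simple as hashing the workflow identifier modulo the maximal group number: the same identifier and the same N always yield the same group. The exact function Maestro uses may differ, and the names below are invented.</p>

```java
public class Main {
    /** Illustrative stable partitioner: the same workflow id always maps to
     *  the same group for a given maximal group number N. This only
     *  demonstrates the stability property, not Maestro's actual function. */
    static long groupIdFor(String workflowId, long maxGroupNum) {
        return Math.floorMod((long) workflowId.hashCode(), maxGroupNum);
    }

    public static void main(String[] args) {
        long n = 9; // maximal group number, inherited by child flows
        long parentGroup = groupIdFor("wf.parent", n);
        // A child flow (e.g., a foreach iteration) recomputes its own group id
        // from its identifier plus the inherited N, so the parent can always
        // derive a child's group id from the child's workflow identifier alone.
        long childGroup = groupIdFor("wf.parent.foreach.iter-3", n);

        if (groupIdFor("wf.parent", n) != parentGroup) throw new AssertionError("not stable");
        if (childGroup < 0 || childGroup >= n) throw new AssertionError("out of range");
        System.out.println("parent group=" + parentGroup + ", child group=" + childGroup);
    }
}
```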
We chose this simpler partitioning strategy over advanced ones, e.g. consistent hashing, primarily due to execution and reconciliation costs and consistency challenges in a distributed system.</p><p>Since Maestro decomposes workflows into hierarchical internal flows (e.g., foreach), parent flows need to interact with child flows across different groups. To enable this, the maximal group number from the parent, denoted as N’ in the diagram, is passed down to all child flows. This allows child flows, such as subworkflows or foreach iterations, to recompute their own group IDs and also ensures that a parent flow can always determine the group ID of its child flows using only their workflow identifiers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PdP7OEWMFDpbNTZZNuKe7Q.png"><figcaption>Figure 5. Flow group partitioning mechanism diagram.</figcaption></figure><p>After a flow’s group ID is determined, the flow operator routes the flow request to the appropriate node. Each node owns a specific range of group IDs. For example, in the diagram, Node 1 owns groups 0, 1, and 2, while Node 3 owns groups 6, 7, and 8. The groups then contain the individual flows (e.g., Flow A, Flow B).</p><p>In this design, the group size is configurable and nodes can also have different group size configurations. The following diagram shows a flow group partitioning example while the maximal group number is changed during the engine execution without impacting any existing workflows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cHk1MpAAcjfKC0GvRL57gA.png"><figcaption>Figure 6. 
A flow group partitioning example.</figcaption></figure><p>In short, the Maestro flow engine shares the group info across parent and child workflows to provide a flexible and stable partitioning mechanism to distribute work across the cluster.</p><h4>Queue Optimization</h4><p>We replaced both external distributed job queues in the existing system with internal ones, preserving the same fault‑tolerance and recovery guarantees while reducing latency and boosting throughput.</p><p>For the internal flow engine, the queue is a simple in‑memory Java blocking queue. It requires no persistence and can be rebuilt from Maestro state during reconciliation.</p><p>For the Maestro engine, we implemented a database‑backed in‑memory queue that provides <strong>exactly‑once publishing and at‑least‑once delivery guarantees</strong>, addressing multiple edge cases that previously required manual state correction.</p><p>This design is similar to the <a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/transactional-outbox.html">transactional outbox pattern</a>. In the same transaction that updates Maestro tables, a row is inserted into the `maestro_queue` table. Upon transaction commit, the job is immediately pushed to a queue worker on the same node, eliminating polling latency. After successful processing, the worker deletes the row from the database. A periodic sweeper re-enqueues any rows whose timeout has expired, ensuring another worker picks them up if a worker stalls or a node fails.</p><p>This design handles failures cleanly. If the transaction fails, both the data and the message roll back atomically, with no partial publishing. If a worker or node fails after commit, the timeout mechanism ensures the job is retried elsewhere. On restart, a node rebuilds its in‑memory queue from the queue table, providing an at-least-once delivery guarantee.</p><p>To enhance scalability and avoid contention across event types, each event type is assigned a `queue_id`.
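<p>The outbox-style mechanics described above can be sketched with in-memory stand-ins, where a synchronized block plays the role of the database transaction and plain maps play the role of tables; all method and table names here are invented for illustration.</p>

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Stdlib-only sketch of an outbox-style queue: the state update and the
 *  queue-row insert happen "atomically" (one synchronized block standing in
 *  for a DB transaction); a sweeper re-enqueues rows whose owner never
 *  deleted them, giving at-least-once delivery. */
public class Main {
    static final Map<String, String> workflowTable = new HashMap<>();
    static final Map<Long, String> queueTable = new HashMap<>(); // outbox stand-in
    static final Queue<Long> inMemoryQueue = new ArrayDeque<>();
    static long nextId = 0;

    /** "Transaction": update workflow state and publish the job together. */
    static synchronized long updateAndPublish(String wfId, String state) {
        workflowTable.put(wfId, state);
        long msgId = nextId++;
        queueTable.put(msgId, wfId); // row committed with the state change
        inMemoryQueue.add(msgId);    // pushed straight to a local worker, no polling
        return msgId;
    }

    /** Successful processing deletes the outbox row. */
    static synchronized void ack(long msgId) { queueTable.remove(msgId); }

    /** Sweeper: any row still present past its timeout is re-enqueued. */
    static synchronized void sweep() { inMemoryQueue.addAll(queueTable.keySet()); }

    public static void main(String[] args) {
        long m1 = updateAndPublish("wf-1", "STARTED");
        ack(inMemoryQueue.poll()); // worker handled m1 and deleted its row

        long m2 = updateAndPublish("wf-2", "STARTED");
        inMemoryQueue.poll();      // worker took m2 but crashed before ack
        sweep();                   // timeout expired: m2 is redelivered

        if (inMemoryQueue.peek() != m2) throw new AssertionError("lost message");
        if (queueTable.containsKey(m1)) throw new AssertionError("acked row kept");
        System.out.println("redelivered message " + inMemoryQueue.peek());
    }
}
```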
Job messages are then partitioned by `queue_id`, optimizing performance and maintaining system efficiency under high load.</p><h3>From Stateless Worker Model to Stateful Actor Model</h3><p>Maestro previously used a shared-nothing stateless worker model with a polling mechanism. When a task started, its identifier was enqueued to a distributed task queue. A worker from the flow engine would pick the task identifier from the queue, load the complete states of the whole workflow (including the flow itself and every task), execute the task interface method once, write the updated task data back to the database, and put the task back in the queue with a polling delay. The worker would then forget this task and start polling the next one.</p><p>That architecture was simple and horizontally scalable (excluding database scalability considerations), but it had drawbacks. The process introduced considerable overhead due to polling intervals and state loading. The time spent in one polling cycle on distributed queues, loading complete states, and other DB queries was significant.</p><p>As Maestro engine decomposes complex workflow graphs into multiple flows, actions might involve multiple flows spanning multiple polling cycles, adding up to significant overhead (around ten seconds in the worst cases). Also, this design didn’t offer strong execution guarantees mainly because the distributed job queue could only provide at-least-once guarantees. Tasks might be dequeued and dispatched to multiple workers, workers might reset states in certain race conditions, or load stale states of other tasks and make incorrect decisions. For example, after a long garbage-collection pause or network hiccup, two workers can pick up the same task: one sets the task status as completed and then unblocks the downstream steps to move forward. 
However, the other worker, working off stale state, resets the task status back to running, leaving the whole workflow in a conflicting state.</p><p>In the new design, we developed a stateful actor model, keeping internal states in memory. All tasks of a workflow are collocated in the same Maestro node, providing the best performance as states are in the same JVM.</p><h4>Actor-Based Model</h4><p>The new flow engine fits well into an actor model. We also deliberately designed it to allow sharing certain local states (read-only) between parent, child, and sibling actors. Given how Maestro’s use cases access that shared state, this optimization gains performance benefits without losing thread safety. We used Java 21’s virtual thread support to implement it with minimal dependencies.</p><p>The new actor-based flow engine is fully message/event-driven and can take actions immediately when events are received, eliminating polling interval delays. To maintain compatibility with the existing polling-based logic, we developed a wakeup mechanism. This model requires flow actors and their child task actors to be collocated in the same JVM for communication over the in-memory queue. Since the Maestro engine already decomposes large-scale workflow instances into many small flows, each flow has a limited number of tasks that fit well into memory.</p><p>Below is a high-level overview of the Maestro execution flow based on the actor model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*acx3i0wXvowwOm0t0ufOHA.png"><figcaption>Figure 7. A high-level overview of the Maestro execution.</figcaption></figure><ul><li>When a workflow starts or during reconciliation, the flow engine inserts (if not existing) or loads the Maestro workflow and step instance from the database, transforming it into the internal flow and task state.
This state remains in JVM memory until evicted (e.g., when the workflow instance reaches a terminal state).</li><li>A virtual thread is created for each entity (workflow instance or step attempt) as an actor to handle all updates or actions for this entity, ensuring thread safety and eliminating distributed locks and potential race conditions.</li><li>Each virtual thread actor contains an in-memory state, a thread-safe blocking queue, and a state machine to update states, ensuring thread safety and high efficiency.</li><li>Actors are organized hierarchically, with flow actors managing all their task actors. Flow actors and their task actors are kept in the same JVM for locality benefits, with the ability to relocate flow instances to other nodes if needed.</li><li>An event can wake up a virtual thread by pushing a message to the actor’s job queue, enabling Maestro to move toward an event-driven approach alongside the current polling-based approach.</li><li>A reconciliation process transforms the Maestro data model into the internal flow data.</li></ul><h4>Virtual Thread Based Implementation</h4><p>We chose Java virtual threads to implement various actors (e.g. group actors and flow actors), which simplified the actor model implementation. With a small amount of code, we developed a fully functional and highly performant event-driven distributed flow engine. Virtual threads fit very well in use cases like state machine transitions within actors. They are lightweight enough to be created in large numbers without out-of-memory risk.</p><p>However, virtual threads can potentially deadlock. They’re not suitable for executing user-provided logic or complex step runtime logic that might depend on external libraries or services outside our control.
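<p>The per-entity actor loop described above (a virtual thread draining a thread-safe blocking queue to drive a small state machine) can be sketched as follows. This is illustrative only, not Maestro code: the class and message names are invented, and it requires Java 21 for virtual threads.</p>

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative actor: one virtual thread owns the entity's state and is the
 *  only code that mutates it, so no locks are needed around the state machine. */
final class StepActor {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private String state = "CREATED"; // mutated only by the actor thread
    private final CountDownLatch done = new CountDownLatch(1);

    void start() {
        Thread.startVirtualThread(() -> {
            try {
                String msg;
                while (!(msg = mailbox.take()).equals("STOP")) {
                    // Trivial state machine: each message drives one transition.
                    state = switch (msg) {
                        case "launch" -> "RUNNING";
                        case "finish" -> "COMPLETED";
                        default -> state;
                    };
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown();
        });
    }

    /** Any thread may enqueue a message; only the actor thread consumes. */
    void tell(String msg) { mailbox.add(msg); }

    String awaitFinalState() throws InterruptedException {
        tell("STOP");
        done.await(); // latch establishes happens-before for the state read
        return state;
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        StepActor actor = new StepActor();
        actor.start();
        actor.tell("launch");
        actor.tell("finish");
        String finalState = actor.awaitFinalState();
        if (!"COMPLETED".equals(finalState)) throw new AssertionError(finalState);
        System.out.println("final state: " + finalState);
    }
}
```

<p>On older JVMs a platform thread would behave identically here; the appeal of virtual threads is that one can be created per workflow instance or step attempt at negligible cost.</p>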
To address this, we separate flow engine execution from task execution logic by adding a separate worker thread pool (not virtual threads) to run actual step runtime business logic like launching containers or making external API calls. Flow/task actors can <a href="https://github.com/Netflix/maestro/blob/main/maestro-flow/src/main/java/com/netflix/maestro/flow/engine/ExecutionContext.java#L96-L100">wait indefinitely for the future from the thread pool executor to complete</a> but don’t perform actual execution, allowing us to benefit from virtual threads while avoiding deadlock issues.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ulNZD0dkW4Tj50OaQ0e8aw.png"><figcaption>Figure 8. Virtual thread and worker thread separation.</figcaption></figure><h4>Providing Strong Execution Guarantees</h4><p>To provide strong execution guarantees, we implemented a generation ID-based solution to ensure that a single flow or task is executed by only one actor at any time, with states that never roll back and eventually reach a terminal state.</p><p>When a node claims a new group or a group with an expired heartbeat, it updates the database table row and increments the group generation ID. During node bootstrap, the group actor updates all its owned flows’ generation IDs while rebuilding internal flow states. When creating a new flow, the group actor verifies that the database generation ID matches its in-memory generation ID, otherwise rejecting the creation and reporting a retryable error to the caller. Please check <a href="https://github.com/Netflix/maestro/blob/main/maestro-flow/src/main/java/com/netflix/maestro/flow/dao/MaestroFlowDao.java">the source code</a> for the implementation details.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UexwsCGLLE-lrzV3HVzCEg.png"><figcaption>Figure 9.
An example sequence diagram showing how generation id provides a strong guarantee.</figcaption></figure><p>Additionally, the new flow engine supports both event-driven execution and polling-based periodic reconciliation. Event-driven support allows us to extend polling intervals for state reconciliation at a very low cost, while polling-based reconciliation relaxes event delivery requirements to at-most-once.</p><h3>Testing, Validation and Rollout</h3><p>Migrating hundreds of thousands of Netflix data processing jobs to a new workflow engine required meticulous planning and execution to avoid data corruption, unexpected traffic patterns, and edge cases that could hinder performance gains. We adopted a principled approach to ensure a smooth transition:</p><ol><li><strong>Realistic Testing:</strong> Our testing mirrored real-world use cases as closely as possible.</li><li><strong>Balanced Approach:</strong> We balanced the need for rapid delivery with comprehensive testing.</li><li><strong>Minimal User Disruption:</strong> The goal was for users to be unaware of the underlying changes.</li><li><strong>Clear Communication:</strong> For cases requiring user involvement, clear communication was provided.</li></ol><h4>Maestro Test Framework</h4><p>To achieve our testing goals, we developed an adaptable testing framework for Maestro. This framework addresses the limitations of static unit and integration tests by providing a more dynamic and comprehensive approach, mimicking organic production traffic. It complements existing tests to instill confidence when rolling out major changes, such as new DAG engines.</p><p>The framework is designed to sample real user workflows, disconnecting business logic from external side effects like data reads or writes. This allows us to run workflow graphs of various shapes and sizes, reflecting the diverse use cases across Netflix. 
While system integrations are handled through deployment pipeline integration tests, the ability to exercise a wide variety of workflow topologies (e.g., parallel executions, for-each jobs, conditional branching and parameter passing between jobs) was crucial for ensuring the new flow engine’s correctness and performance.</p><p>The prototype workflow for the test framework focuses on auto-testing parameters, involving two main steps:</p><p><strong>1. Caching Production Workflows:</strong></p><ul><li>Successful production instances are queried from a historical Maestro feed table over a specified period.</li><li>Run parameters, initiator, and instance IDs are extracted and organized into an instance data map.</li><li>YAML definitions and subworkflow IDs are pulled from S3 storage.</li><li>Both workflow definitions and instance data are cached on S3 for subsequent steps.</li></ul><p><strong>2. Pushing, Running, and Monitoring Workflows:</strong></p><ul><li>Cached workflow definitions and instance data are loaded.</li><li>Notebook-based jobs are replaced with custom notebooks, and certain job types (e.g., vanilla container runtime jobs, templated data movement jobs) and signal triggers are converted to a special no-op job type or skipped.</li><li>Abstract job types like Write-Audit-Publish are expressed as a single step template but are translated to multiple reified nodes of the DAG when executed. These are auto-translated into several custom notebook job types to replace the generated nodes.</li><li>Workflows and subworkflows are pushed, with only non-subworkflows being run using original production instance information.</li><li>1. In the parent workflow, each sub-workflow is replaced with a special no-op placeholder so that the overall topology is preserved without executing any side effects of child workflows and to avoid cases using dynamic runtime parameter logic.</li><li>2. 
Each sub-workflow is then separately treated like a top-level parent workflow not initiated from its parent, to exercise the actual workflow steps of the sub-workflow.</li><li>The custom notebook internally compares all passed parameters for each job.</li><li>Workflow instances are monitored until termination (success or failure).</li><li>An email detailing failed workflow instances is generated.</li></ul><p>Future phases of the test framework aim to expand support for native steps, more templates, Titus and Metaflow workflows, and include more robust signal testing. Further integration with the ecosystem, including dedicated Genie clusters for no-op jobs and DGS for our internal workflow UI feature verification, is also being explored.</p><h4>Rollout Plan</h4><p>Our rollout strategy prioritized minimal user disruption. We determined that an entire workflow, from its root instance, must reside in either the old or new flow engine, preventing mixed operations that could lead to complex failure modes and manual data reconciliation.</p><p>To facilitate this, we established a parallel infrastructure for the new workflow engine and leveraged our orchestrator gateway API to hide any routing or redirection logic from users. This approach provided excellent isolation for managing the migration. Initially, specific workflows could explicitly opt in via a system flag, allowing us to observe their execution and gain confidence. By scaling up traffic to the parallel infrastructure in direct proportion to what was scaled down from the original infrastructure, the dual infrastructure cost increase was negligible.</p><p>Once confident, we transitioned to a percentage-based cutover. In the event of a sustained failure in the new engine, our team could roll back a workflow by removing it from the new engine’s database and restarting it in the original stack. 
However, one consequence of rollback was that failed workflows had to restart from the beginning, recomputing previously successful steps, to ensure all artifacts were generated from a consistent flow engine.</p><p>Leveraging Maestro’s 10-day workflow timeout, we migrated users without disruption. Existing executions would either complete or time out. Upon restarting (due to failure/timeout) or triggering a new instance (due to success), the workflow would be picked up by the new engine. This effectively allowed us to gradually “drain” traffic from the old engine to the new one with no user involvement.</p><p>While the plan generally proceeded as expected with limited edge cases, we did encounter a few challenges:</p><ul><li><strong>Stuck Workflows:</strong> Around 50 workflows with defunct or incorrect ownership information entered a stuck state. In some cases, a backlog of queued instances behind a stuck instance created a race condition in which a new instance would be started immediately when an old instance was terminated, perpetually keeping the workflow on the old engine. For these, we proactively contacted users to negotiate manual stop-and-restart times, forcing them onto the new engine.</li><li><strong>Configuration Discrepancies:</strong> A significant lesson learned was the importance of meticulous record-keeping and management of parallel infrastructure components. We discovered alerts, system flags, and feature flags configured for one stack but not the other. This led to a failure in a partner team’s system that dynamically rolled out a Python migration by analyzing workflow configurations. The absence of a required feature flag in the new engine stack caused the process to be silently skipped, resulting in incorrect Python version configurations for about 40 workflows. Although quickly remediated, this caused user inconvenience as affected workflows needed to be restarted and verified for no lingering data corruption issues. 
This issue also highlighted limitations in the testing framework since runtime configuration based on external API calls to the configuration service was not exercised in simulated workflow executions.</li></ul><p>Despite these challenges, the migration was a success. We migrated over 60,000 active workflows generating over a million data processing tasks daily with almost no user involvement. By observing the flow engine’s lifecycle management latency, we validated a reduction in step launch overhead from around 5 seconds to 50 milliseconds. Workflow start overhead (incurred once per workflow execution) also improved from 200 milliseconds to 50 milliseconds. Aggregating this over a million daily step executions translates to saving approximately 57 days of flow engine overhead per day, leading to a snappier user experience, more timely workflow status for data practitioners, and greater overall task throughput for the same infrastructure scale.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/826/1*63TFS2uVWjXxlZGpHhTpoA.jpeg"></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/874/1*yxfbCcs1Y2h4dpxJ5hZkDA.jpeg"></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/948/1*6Sq7ozdrxqeHI56SvejJkw.png"></figure><p>We additionally realized significant benefits internally with reduced maintenance effort due to the new flow engine’s simplified set of database components. We were able to delete nearly 40TB of obsolete tables related to the previous stateless flow engine and saw a 90% reduction in internal database query traffic, which had previously been a significant source of system alerts for the team.</p><h3>Conclusion</h3><p>The architectural evolution of Maestro represents a significant leap in performance, reducing overhead from seconds to milliseconds. 
This redesign with a stateful actor model not only enhances speed by 100X but also maintains scalability and reliability, ensuring Maestro continues to meet the diverse needs of Netflix’s data and ML workflows.</p><p>Key takeaways from this evolution include:</p><ul><li><strong>Performance matters:</strong> Even in a system designed for scale, the speed of individual operations significantly impacts user experience and productivity.</li><li><strong>Simplicity wins:</strong> Reducing dependencies and simplifying architecture not only improved performance but also enhanced reliability and maintainability.</li><li><strong>Strong guarantees are essential:</strong> Providing strong execution guarantees eliminates race conditions and edge cases that previously required manual intervention.</li><li><strong>Locality optimizations pay off:</strong> Collocating related flows and tasks in the same JVM dramatically reduces overhead from the Maestro engine.</li><li><strong>Modern language features help:</strong> Java 21’s virtual threads enabled an elegant actor-based implementation with minimal code complexity and dependencies.</li></ul><p>We’re excited to share these improvements with the open-source community and look forward to seeing how Maestro continues to evolve. The performance gains we’ve achieved open new possibilities for low-latency workflow orchestration use cases while continuing to support the massive scale that Netflix and other organizations require.</p><p>Visit the <a href="https://github.com/Netflix/maestro">Maestro GitHub repository</a> to explore these improvements. If you have any questions, thoughts, or comments about Maestro, please feel free to create a <a href="https://github.com/Netflix/maestro/issues">GitHub issue</a> in the Maestro repository. We are eager to hear from you. 
If you are passionate about solving large-scale orchestration problems, please <a href="https://explore.jobs.netflix.net/careers?query=Data%20Platform&amp;Teams=Engineering&amp;domain=netflix.com&amp;sort_by=relevance">join us</a>.</p><h3>Acknowledgements</h3><p>Special thanks to Big Data Orchestration team members for general contributions to Maestro and diligent review, discussion and incident response required to make this project successful: Davis Shepherd, Natallia Dzenisenka, Praneeth Yenugutala, Brittany Truong, Jonathan Indig, Deepak Ramalingam, Binbing Hou, Zhuoran Dong, Victor Dusa, and Gabriel Ikpaetuk — and internal partners Yun Li and Romain Cledat.</p><p>Thank you to Anoop Panicker and Aravindan Ramkumar from our partner organization that leads Conductor development at Netflix. They helped us understand issues in Conductor 2.X that initially motivated the rearchitecture and helped provide context on later versions of Conductor that defined some of the core trade-offs for the decision to implement a custom DAG engine in Maestro.</p><p>We’d also like to thank our partners on the Data Security &amp; Infrastructure and Engineering Support teams who helped identify and rapidly fix the configuration discrepancy error encountered during production rollout: Amer Hesson, Ye Ji, Sungmin Lee, Brandon Quan, Anmol Khurana, and Manav Garekar.</p><p>A special thanks also goes out to partners from the Data Experience team including Jeff Bothe, Justin Wei, and Andrew Seier. The flow engine speed improvement was actually so dramatic that it broke some integrations with our internal workflow UI that reported state transition durations. 
Our partners helped us catch and fix UI regressions before they shipped to avoid impact to users.</p><p>We also thank Prashanth Ramdas, Anjali Norwood, Eva Tse, Charles Zhao, Sumukh Shivaprakash, Joey Lynch, Harikrishna Menon, Marcelo Mayworm, Charles Smith and other leaders for their constructive feedback and guidance on the Maestro project.</p><hr><p><a href="https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041">100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041</link>
      <guid>https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041</guid>
      <pubDate>Mon, 29 Sep 2025 16:10:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Building a Resilient Data Platform with Write-Ahead Log at Netflix]]></title>
      <description><![CDATA[<p>By <a href="https://www.linkedin.com/in/prudhviraj9">Prudhviraj Karumanchi</a>, <a href="https://www.linkedin.com/in/samuelfu/">Samuel Fu</a>, <a href="https://www.linkedin.com/in/sriram-rangarajan-35169715/">Sriram Rangarajan</a>, <a href="https://www.linkedin.com/in/vidhya-arvind-11908723">Vidhya Arvind</a>, <a href="https://www.linkedin.com/in/yunwang-io/">Yun Wang</a>, <a href="https://www.linkedin.com/in/john-l-693b7915a/">John Lu</a></p><h3>Introduction</h3><p>Netflix operates at a massive scale, serving hundreds of millions of users with diverse content and features. Behind the scenes, ensuring data consistency, reliability, and efficient operations across various services presents a continuous challenge. At the heart of many critical functions lies the concept of a Write-Ahead Log (WAL) abstraction. At Netflix scale, every challenge gets amplified. Some of the key challenges we encountered include:</p><ul><li>Accidental data loss and data corruption in databases</li><li>System entropy across different datastores (e.g., writing to Cassandra and Elasticsearch)</li><li>Handling updates to multiple partitions (e.g., building secondary indices on top of a NoSQL database)</li><li>Data replication (in-region and across regions)</li><li>Reliable retry mechanisms for real-time data pipelines at scale</li><li>Bulk deletes to databases causing OOMs on the Key-Value nodes</li></ul><p>All the above challenges either resulted in production incidents or outages, consumed significant engineering resources, or led to bespoke solutions and technical debt. During one particular incident, a developer issued an ALTER TABLE command that led to data corruption. Fortunately, the data was fronted by a cache, so the ability to quickly extend the cache TTL, together with the app writing the mutations to Kafka, allowed us to recover. Absent the resilience features on the application, there would have been permanent data loss. 
As the data platform team, we needed to provide resilience and guarantees to protect not just this application, but all the critical applications we have at Netflix.</p><p>Regarding the retry mechanisms for real-time data pipelines, Netflix operates at a massive scale where failures (network errors, downstream service outages, etc.) are inevitable. We needed a reliable and scalable way to retry failed messages, without sacrificing throughput.</p><p>With these problems in mind, we decided to build a system that would solve all the aforementioned issues and continue to serve the future needs of Netflix in the online data platform space. Our Write-Ahead Log (WAL) is a distributed system that captures data changes, provides strong durability guarantees, and reliably delivers these changes to downstream consumers. This blog post dives into how Netflix is building a generic WAL solution to address common data challenges, enhance developer efficiency, and power high-leverage capabilities like secondary indices, cross-region replication for non-replicated storage engines, and widely used patterns like delayed queues.</p><h3>API</h3><p>Our API is intentionally simple, exposing just the essential parameters. 
WAL has one main API endpoint, <em>WriteToLog</em>, abstracting away the internal implementation and ensuring that users can onboard easily.</p><pre>rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse) {...}<br><br>/**<br>  * WAL request message<br>  * namespace: Identifier for a particular WAL<br>  * lifecycle: How much delay to set and original write time <br>  * payload: Payload of the message<br>  * target: Details of where to send the payload <br>  */<br>message WriteToLogRequest {<br>  string namespace = 1;<br>  Lifecycle lifecycle = 2;<br>  bytes payload = 3;<br>  Target target = 4;<br>}<br><br>/**<br>  * WAL response message<br>  * durable: Whether the request succeeded, failed, or unknown<br>  * message: Reason for failure<br>  */<br>message WriteToLogResponse {<br>  Trilean durable = 1;<br>  string message = 2;<br>}</pre><p>A <em>namespace</em> defines where and how data is stored, providing logical separation while abstracting the underlying storage systems. Each <em>namespace</em> can be configured to use different queues: Kafka, SQS, or combinations of multiple. <em>Namespace</em> also serves as a central configuration of settings, such as backoff multiplier or maximum number of retry attempts, and more. This flexibility allows our Data Platform to route different use cases to the most suitable storage system based on performance, durability, and consistency needs.</p><p>WAL can assume different <em>personas</em> depending on the namespace configuration.</p><h4><strong>Persona #1 (<em>Delayed Queues</em>)</strong></h4><p>In the example configuration below, the Product Data Systems (PDS) <em>namespace</em> uses SQS as the underlying message queue, enabling delayed messages. PDS uses Kafka extensively, and failures (network errors, downstream service outages, etc.) are inevitable. We needed a reliable and scalable way to retry failed messages, without sacrificing throughput. 
That’s when PDS started leveraging WAL for delayed messages.</p><pre>"persistenceConfigurations": {<br>  "persistenceConfiguration": [<br>  {<br>    "physicalStorage": {<br>      "type": "SQS",<br>    },<br>    "config": {<br>      "wal-queue": [<br>        "dgwwal-dq-pds"<br>      ],<br>      "wal-dlq-queue": [<br>        "dgwwal-dlq-pds"<br>      ],<br>      "queue.poll-interval.secs": 10,<br>      "queue.max-messages-per-poll": 100<br>    }<br>  }<br>  ]<br>}</pre><h4><strong>Persona #2 (<em>Generic Cross-Region Replication</em>)</strong></h4><p>Below is the namespace configuration for cross-region replication of <a href="https://netflixtechblog.com/caching-for-a-global-netflix-7bcc457012f1">EVCache</a> using WAL, which replicates messages from a source region to multiple destinations. It uses Kafka under the hood.</p><pre>"persistence_configurations": {<br>  "persistence_configuration": [<br>  {<br>    "physical_storage": {<br>      "type": "KAFKA"<br>    },<br>    "config": {<br>      "consumer_stack": "consumer",<br>      "context": "This is for cross region replication for evcache_foobar",<br>      "target": {<br>        "euwest1": "dgwwal.foobar.cluster.eu-west-1.netflix.net",<br>        "type": "evc-replication",<br>        "useast1": "dgwwal.foobar.cluster.us-east-1.netflix.net",<br>        "useast2": "dgwwal.foobar.cluster.us-east-2.netflix.net",<br>        "uswest2": "dgwwal.foobar.cluster.us-west-2.netflix.net"<br>      },<br>      "wal-kafka-dlq-topics": [],<br>      "wal-kafka-topics": [<br>        "evcache_foobar"<br>      ],<br>      "wal.kafka.bootstrap.servers.prefix": "kafka-foobar"<br>    }<br>  }<br>  ]<br>}</pre><h4><strong>Persona #3 (Handling <em>multi-partition mutations</em>)</strong></h4><p>Below is the namespace configuration for supporting <em>mutateItems</em> API in <a href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">Key-Value</a>, where multiple write requests can go to different 
partitions and have to be eventually consistent. A key detail in the configuration below is the presence of both Kafka and durable_storage. These data stores are required to facilitate two-phase commit semantics, which we will discuss in detail below.</p><pre>"persistence_configurations": {<br>  "persistence_configuration": [<br>  {<br>    "physical_storage": {<br>      "type": "KAFKA"<br>    },<br>    "config": {<br>      "consumer_stack": "consumer",<br>      "contacts": "unknown",<br>      "context": "WAL to support multi-id/namespace mutations for dgwkv.foobar",<br>      "durable_storage": {<br>        "namespace": "foobar_wal_type",<br>        "shard": "walfoobar",<br>        "type": "kv"<br>      },<br>      "target": {},<br>      "wal-kafka-dlq-topics": [<br>        "foobar_kv_multi_id-dlq"<br>      ],<br>      "wal-kafka-topics": [<br>        "foobar_kv_multi_id"<br>      ],<br>      "wal.kafka.bootstrap.servers.prefix": "kaas_kafka-dgwwal_foobar7102"<br>    }<br>  }<br>  ]<br>}</pre><p>An important note is that requests to WAL support at-least-once semantics due to the underlying implementation.</p><h3>Under the Hood</h3><p>The core architecture consists of several key components working together.</p><p><strong>Message Producer and Message Consumer separation:</strong> The message producer receives incoming messages from client applications and adds them to the queue, while the message consumer processes messages from the queue and sends them to the targets. Because of this separation, other systems can bring their own pluggable producers or consumers, depending on their use cases. WAL’s control plane supports a pluggable model that, depending on the use case, lets us switch between different message queues.</p><p><strong>SQS and Kafka with a dead letter queue by default</strong>: Every WAL <em>namespace</em> has its own message queue and gets a dead letter queue (DLQ) by default, because there can be transient errors and hard errors. 
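The DLQ exists precisely because of that transient/hard split. As a hedged sketch of how a consumer might make that call, combined with the backoff multiplier and maximum retry attempts configured per namespace (the class and method names here are hypothetical, not WAL’s actual implementation):

```java
import java.time.Duration;

// Hedged sketch: decide whether a failed delivery is retried with backoff
// (transient error) or parked on the dead letter queue (hard error).
// Names are hypothetical, not WAL's actual implementation.
public class WalRetryPolicy {
    enum Decision { RETRY, DEAD_LETTER }

    // Exponential backoff: base * multiplier^attempt, capped at max.
    static Duration backoff(int attempt, Duration base, double multiplier, Duration max) {
        double millis = base.toMillis() * Math.pow(multiplier, attempt);
        return Duration.ofMillis(Math.min((long) millis, max.toMillis()));
    }

    // Transient errors (timeouts, throttling) retry until the namespace's
    // configured max attempts; hard errors go straight to the DLQ.
    static Decision decide(boolean transientError, int attempt, int maxAttempts) {
        return (transientError && attempt < maxAttempts) ? Decision.RETRY : Decision.DEAD_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(backoff(3, Duration.ofSeconds(1), 2.0, Duration.ofMinutes(5))); // PT8S
        System.out.println(decide(true, 2, 10));  // RETRY
        System.out.println(decide(false, 0, 10)); // DEAD_LETTER
    }
}
```

Capping the backoff keeps a long-failing message from backing off indefinitely while it waits for its terminal trip to the DLQ.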
Application teams using <a href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">Key-Value</a> abstraction simply need to toggle a flag to enable WAL and get all this functionality without needing to understand the underlying complexity.</p><ul><li><strong>Kafka-backed namespaces</strong>: handle standard message processing</li><li><strong>SQS-backed namespaces</strong>: support delayed queue semantics (we added custom logic to go beyond the standard defaults enforced in terms of delay, size limits, etc)</li><li><strong>Complex multi-partition scenarios:</strong> use queues and durable storage</li></ul><p><strong>Target Flexibility</strong>: The messages added to WAL are pushed to the target datastores. Targets can be Cassandra databases, Memcached caches, Kafka queues, or upstream applications. Users can specify the target via namespace configuration and in the API itself.</p><figure><img alt="Architecture of WAL" src="https://cdn-images-1.medium.com/max/1024/1*tfnrbP7oD_r9iEesLhACpA.png"><figcaption>Architecture of WAL</figcaption></figure><h3>Deployment Model</h3><p>WAL is deployed using the <a href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6">Data Gateway infrastructure</a>. This means that WAL deployments automatically come with mTLS, connection management, authentication, runtime and deployment configurations out of the box.</p><p>Each data gateway abstraction (including WAL) is deployed as a <em>shard</em>. A <em>shard</em> is a physical concept describing a group of hardware instances. Each use case of WAL is usually deployed as a separate <em>shard</em>. 
For example, the Ads Events service will send requests to WAL <em>shard A</em>, while the Gaming Catalog service will send requests to WAL <em>shard </em>B, allowing for separation of concerns and avoiding noisy neighbour problems.</p><p>Each <em>shard</em> of WAL can have multiple <em>namespaces</em>. A <em>namespace</em> is a logical concept describing a configuration. Each request to WAL has to specify its <em>namespace</em> so that WAL can apply the correct configuration to the request. Each <em>namespace</em> has its own configuration of queues to ensure isolation per use case. If the underlying queue of a WAL <em>namespace</em> becomes the bottleneck of throughput, the operators can choose to add more queues on the fly by modifying the <em>namespace</em> configurations. The concept of <em>shards</em> and <em>namespaces</em> is shared across all Data Gateway Abstractions, including <a href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">Key-Value</a>, <a href="https://netflixtechblog.com/netflixs-distributed-counter-abstraction-8d0c45eb66b2">Counter</a>, <a href="https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8">Timeseries</a>, etc. The <em>namespace</em> configurations are stored in a globally replicated Relational SQL database to ensure availability and consistency.</p><figure><img alt="Deployment model of WAL" src="https://cdn-images-1.medium.com/max/1024/1*Gh2O_tTvxZxlbRmKn9Atag.png"><figcaption>Deployment model of WAL</figcaption></figure><p>Based on certain CPU and network thresholds, the Producer group and the Consumer group of each <em>shard</em> will (separately) automatically scale up the number of instances to ensure the service has low latency, high throughput and high availability. 
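The namespace-to-configuration lookup described above could be modeled roughly as follows (a sketch with hypothetical names; the real control plane stores these configurations in a globally replicated SQL database rather than an in-memory map):

```java
import java.util.List;
import java.util.Map;

// Hedged sketch of namespace-scoped configuration lookup: every request names
// its namespace, and the shard resolves that namespace to its queue config.
// Names are hypothetical, not the actual WAL control plane.
public class NamespaceRegistry {
    public record QueueConfig(String type, List<String> queues) {}

    private final Map<String, QueueConfig> configs;

    public NamespaceRegistry(Map<String, QueueConfig> configs) {
        this.configs = configs;
    }

    // Unknown namespaces are rejected rather than silently routed to a default.
    public QueueConfig resolve(String namespace) {
        QueueConfig config = configs.get(namespace);
        if (config == null) {
            throw new IllegalArgumentException("unknown namespace: " + namespace);
        }
        return config;
    }

    public static void main(String[] args) {
        var registry = new NamespaceRegistry(Map.of(
                "pds", new QueueConfig("SQS", List.of("dgwwal-dq-pds"))));
        System.out.println(registry.resolve("pds").type()); // SQS
    }
}
```

Because the configuration is looked up per request, operators can add queues to a namespace on the fly without redeploying the shard.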
WAL, along with other abstractions, also uses the <a href="https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581">Netflix adaptive load shedding libraries</a> and Envoy to automatically shed requests beyond a certain limit. WAL can be deployed to multiple regions, so each region will deploy its own group of instances.</p><h3>Solving different flavors of problems with no change to the core architecture</h3><p>The WAL addresses multiple data reliability challenges with no changes to the core architecture:</p><p><strong>Data Loss Prevention:</strong> In case of database downtime, WAL can continue to hold the incoming mutations. When the database becomes available again, WAL replays the mutations back to the database. The tradeoff is eventual rather than immediate consistency, with no data loss.</p><p><strong>Generic Data Replication:</strong> For systems like EVCache (using Memcached) and RocksDB that do not support replication by default, WAL provides systematic replication (both in-region and across-region). The target can be another application, another WAL, or another queue — it’s completely pluggable through configuration.</p><p><strong>System Entropy and Multi-Partition Solutions: </strong>Whether dealing with writes across two databases (like Cassandra and Elasticsearch) or mutations across multiple partitions in one database, the solution is the same — write to WAL first, then let the WAL consumer handle the mutations. No more asynchronous repairs needed; WAL handles retries and backoff automatically.</p><p><strong>Data Corruption Recovery:</strong> In case of DB corruption, we restore to the last known good backup, then replay mutations from WAL, omitting the offending write/mutation.</p><p>There are some major differences between using WAL and directly using Kafka/SQS. WAL is an abstraction on the underlying queues, so the underlying technology can be swapped out depending on use cases with no code changes. 
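To make that abstraction boundary concrete, the WriteToLog request/response shapes from the API section could be modeled in plain Java as below (these records are hypothetical stand-ins for the generated proto classes, and the handler is a toy, not the actual WAL producer):

```java
import java.util.Set;

// Plain-Java modeling of the WriteToLog shapes shown in the API section.
// Hypothetical stand-ins for generated proto classes; the handler is a toy.
public class WriteToLogExample {
    public enum Trilean { TRUE, FALSE, UNKNOWN }
    public record Lifecycle(long delayMillis, long originalWriteTimeMillis) {}
    public record Target(String type, String address) {}
    public record WriteToLogRequest(String namespace, Lifecycle lifecycle, byte[] payload, Target target) {}
    public record WriteToLogResponse(Trilean durable, String message) {}

    // Toy handler: a request is durable once accepted for a known namespace;
    // the real service would first persist it to the namespace's queue.
    public static WriteToLogResponse writeToLog(WriteToLogRequest req, Set<String> knownNamespaces) {
        if (!knownNamespaces.contains(req.namespace())) {
            return new WriteToLogResponse(Trilean.FALSE, "unknown namespace: " + req.namespace());
        }
        return new WriteToLogResponse(Trilean.TRUE, "");
    }

    public static void main(String[] args) {
        var req = new WriteToLogRequest(
                "pds",
                new Lifecycle(30_000, System.currentTimeMillis()), // deliver ~30s from now
                "retry-me".getBytes(),
                new Target("kafka", "some-topic"));
        System.out.println(writeToLog(req, Set.of("pds")).durable()); // TRUE
    }
}
```

Note the three-valued durability result: a caller must treat UNKNOWN differently from FALSE, since the write may still have been persisted.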
WAL emphasizes an easy yet effective API that saves users from complicated setups and configurations. We leverage the control plane to pivot technologies behind WAL when needed without app or client intervention.</p><h3>WAL usage at Netflix</h3><h4>Delay Queue</h4><p>The most common use case for WAL is as a Delay Queue. If an application is interested in sending a request at a certain time in the future, it can offload its requests to WAL, which guarantees that the requests will land after the specified delay.</p><p>Netflix’s Live Origin processes and delivers Netflix live stream video chunks, storing its video data in a Key-Value <em>abstraction</em> backed by Cassandra and EVCache. When Live Origin decides to delete certain video data after an event is completed, it issues delete requests to the Key-Value abstraction. However, the large number of delete requests in a short burst interferes with the more important real-time read/write requests, causing performance issues in Cassandra and timeouts for the incoming live traffic. To get around this, Key-Value issues the delete requests to WAL first, with a random delay and jitter set for each delete request. WAL, after the delay, sends the delete requests back to Key-Value. Since the deletes are now a flatter curve of requests over time, Key-Value is then able to send the requests to the datastore with no issues.</p><figure><img alt="Requests being spread out over time through delayed requests" src="https://cdn-images-1.medium.com/max/1024/1*7JV6kc5QMyfviAJIjVXqew.png"><figcaption>Requests being spread out over time through delayed requests</figcaption></figure><p>Additionally, WAL is used by many services that utilize Kafka to stream events, including Ads, Gaming, Product Data Systems, etc. Whenever Kafka requests fail for any reason, the client apps will send WAL a request to retry the Kafka request with a delay. 
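The “random delay and jitter” trick that flattens those delete bursts can be sketched in a few lines (illustrative numbers and hypothetical names, not Netflix’s actual values):

```java
import java.util.Random;

// Hedged sketch of the delay-with-jitter idea: each delete in a burst gets a
// base delay plus uniform jitter, so N requests issued at once land spread
// across a window instead of hitting the datastore together.
public class DeleteSpreader {
    // Delay = base + uniform jitter in [0, jitterWindow).
    static long delayMillis(long baseMillis, long jitterWindowMillis, Random rng) {
        return baseMillis + (long) (rng.nextDouble() * jitterWindowMillis);
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed for a reproducible demo
        for (int i = 0; i < 5; i++) {
            // e.g., a 1-minute base delay spread over a 10-minute window
            System.out.println(delayMillis(60_000, 600_000, rng));
        }
    }
}
```

With uniform jitter, a burst of N deletes arrives at roughly N / window requests per unit time, instead of N at once.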
This abstracts away the backoff and retry layer of Kafka for many teams, increasing developer efficiency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xnAYzUCsEQ18qyCOnUmuPg.png"><figcaption>Backoff and delayed retries for clients producing to Kafka</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sbEVgICN9qbEecEPYiSHWQ.png"><figcaption>Backoff and delayed retries for clients consuming from Kafka</figcaption></figure><h4>Cross-Region Replication</h4><p>WAL is also used for global cross-region replication. The architecture of WAL is generic and allows any datastore or application to onboard for cross-region replication. Currently, the largest use case is <a href="https://netflixtechblog.com/caching-for-a-global-netflix-7bcc457012f1">EVCache</a>, and we are working to onboard other storage engines.</p><p>EVCache is deployed as clusters of Memcached instances across multiple regions, where each cluster in each region shares the same data. Each region’s client apps will write to, read from, and delete data in the EVCache cluster of the same region. To ensure global consistency, the EVCache client of one region will replicate write and delete requests to all other regions. To implement this, the EVCache client that originated the request will send the request to a WAL corresponding to the EVCache cluster and region.</p><p>Since the EVCache client acts as the message producer group in this case, WAL only needs to deploy the message consumer groups. From there, multiple message consumers are set up, one for each target region. They will read from the Kafka topic, and send the replicated write or delete requests to a Writer group in their target region. 
The Writer group will then go ahead and replicate the request to the EVCache server in the same region.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*twUIRRILpBEXb085wCdP7A.png"><figcaption>EVCache Global Cross-Region Replication Implemented through WAL</figcaption></figure><p>The biggest benefit of this approach, compared to our legacy architecture, is the ability to migrate from a multi-tenant architecture to a single-tenant architecture for the most latency-sensitive applications. For example, Live Origin will have its own dedicated Message Consumer and Writer groups, while a less latency-sensitive service can be multi-tenant. This helps us reduce the blast radius of issues and also prevents noisy neighbor issues.</p><h4>Multi-Table Mutations</h4><p>WAL is used by the <a href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">Key-Value</a> service to build the MutateItems API. WAL enables the API’s multi-table and multi-id mutations by implementing two-phase commit semantics under the hood. For this discussion, we can assume that the Key-Value service is backed by Cassandra, and each of its <em>namespaces</em> represents a certain table in a Cassandra DB.</p><p>When a Key-Value client issues a MutateItems request to the Key-Value server, the request can contain multiple PutItems or DeleteItems requests. Each of those requests can go to different ids and <em>namespaces</em>, or Cassandra tables.</p><pre>message MutateItemsRequest {<br> repeated MutationRequest mutations = 1;<br> message MutationRequest {<br>  oneof mutation {<br>    PutItemsRequest put = 1;<br>    DeleteItemsRequest delete = 2;<br>  }<br> }<br>}</pre><p>The MutateItems request operates on an eventually consistent model. When the Key-Value server returns a success response, it guarantees that every operation within the MutateItemsRequest will eventually complete successfully. 
Individual put or delete operations may be partitioned into smaller chunks based on request size, meaning a single operation could spawn multiple chunk requests that must be processed in a specific sequence.</p><p>Two approaches exist to ensure that Key-Value client requests eventually succeed. The synchronous approach involves client-side retries until all mutations complete. However, this method introduces significant challenges: datastores might not natively support transactions, providing no guarantee that the entire request succeeds. Additionally, when more than one replica set is involved in a request, latency accumulates in unexpected ways, and the entire request chain must be retried. Also, partial failures in synchronous processing can leave the database in an inconsistent state if some mutations succeed while others fail, requiring complex rollback mechanisms or leaving data integrity compromised. The asynchronous approach was ultimately adopted to address these performance and consistency concerns.</p><p>Given Key-Value’s stateless architecture, the service cannot maintain the mutation success state or guarantee order internally. Instead, it leverages a Write-Ahead Log (WAL) to guarantee mutation completion. For each MutateItems request, Key-Value forwards individual put or delete operations to WAL as they arrive, with each operation tagged with a sequence number to preserve ordering. After transmitting all mutations, Key-Value sends a completion marker indicating the full request has been submitted.</p><p>The WAL producer receives these messages and persists the content, state, and ordering information to durable storage. The message producer then forwards only the completion marker to the message queue. The message consumer retrieves these markers from the queue and reconstructs the complete mutation set by reading the stored state and content data, ordering operations according to their designated sequence. 
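<p>As a rough illustration of this ordering scheme — mutations persisted with sequence numbers, replay gated on the completion marker — consider the following sketch. All names and shapes here are hypothetical assumptions, not the real WAL or Key-Value API.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of consumer-side replay for a MutateItems request:
// individual mutations are stored with sequence numbers, and replay is
// triggered only once the completion marker for the request is seen.
public class MutationReplaySketch {

    // One stored chunk of a MutateItems request, tagged for ordering.
    record StoredMutation(int sequence, String operation) {}

    // Reconstruct the ordered mutation set for one request. If the
    // completion marker has not been seen, the request is not fully
    // submitted yet and nothing is replayed.
    static List<String> replay(List<StoredMutation> stored, boolean completionMarkerSeen) {
        if (!completionMarkerSeen) {
            return List.of();
        }
        List<StoredMutation> ordered = new ArrayList<>(stored);
        ordered.sort(Comparator.comparingInt(StoredMutation::sequence));
        List<String> applied = new ArrayList<>();
        for (StoredMutation m : ordered) {
            // In the real system this would issue the put/delete; a failure
            // would re-queue the completion marker for a later retry.
            applied.add(m.operation());
        }
        return applied;
    }
}
```

<p>Keying replay off the completion marker is what lets the stateless Key-Value server offload ordering: until the marker arrives, stored chunks are inert, and a failed replay simply re-queues the marker.</p>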
Failed mutations trigger re-queuing of the completion marker for subsequent retry attempts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*voSQEpItvosZjPqeTpBseg.png"><figcaption>Architecture of Multi-Table Mutations through WAL</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EVFnGkr57z5cSB5kNI91iA.png"><figcaption>Sequence diagram for Multi-Table Mutations through WAL</figcaption></figure><h3>Closing Thoughts</h3><p>Building Netflix’s generic Write-Ahead Log system has taught us several key lessons that guided our design decisions:</p><p><strong>Pluggable Architecture is Core: </strong>The ability to support different targets, whether databases, caches, queues, or upstream applications, through configuration rather than code changes has been fundamental to WAL’s success across diverse use cases.</p><p><strong>Leverage Existing Building Blocks: </strong>We had control plane infrastructure, Key-Value abstractions, and other components already in place. Building on top of these existing abstractions allowed us to focus on the unique challenges WAL needed to solve.</p><p><strong>Separation of Concerns Enables Scale:</strong> By separating message processing from consumption and allowing independent scaling of each component, we can handle traffic surges and failures more gracefully.</p><p><strong>Systems Fail — Consider Tradeoffs Carefully: </strong>WAL itself has failure modes, including traffic surges, slow consumers, and non-transient errors. We use abstractions and operational strategies like data partitioning and backpressure signals to handle these, but the tradeoffs must be understood.</p><h3>Future work</h3><ul><li>We are planning to add secondary indices in Key-Value service leveraging WAL.</li><li>WAL can also be used by a service to guarantee sending requests to multiple datastores. 
For example, a database and a backup, or a database and a queue at the same time.</li></ul><h3>Acknowledgements</h3><p>Launching WAL was a collaborative effort involving multiple teams at Netflix, and we are grateful to everyone who contributed to making this idea a reality. We would like to thank the following teams for their roles in this launch.</p><ul><li>Caching team — Additional thanks to <a href="https://www.linkedin.com/in/shihhaoyeh/">Shih-Hao Yeh</a>, <a href="https://www.linkedin.com/in/akashdeepgoel/">Akashdeep Goel</a> for contributing to cross-region replication for KV, EVCache, etc., and for owning this service.</li><li>Product Data System team — <a href="https://www.linkedin.com/in/carlos-jmh/">Carlos Matias Herrero</a>, <a href="https://www.linkedin.com/in/bbremen/">Brandon Bremen</a> for contributing to the delay queue design and being early adopters of WAL, giving valuable feedback.</li><li>KeyValue and Composite abstractions team — <a href="https://www.linkedin.com/in/rummadis/">Raj Ummadisetty</a> for feedback on API design and mutateItems design discussions. <a href="https://www.linkedin.com/in/rajiv-shringi/">Rajiv Shringi</a> for feedback on API design.</li><li>Kafka and Real Time Data Infrastructure teams — <a href="https://www.linkedin.com/in/nickmahilani/">Nick Mahilani</a> for feedback and inputs on integrating the WAL client into the Kafka client. 
<a href="https://www.linkedin.com/in/sundaram-ananthanarayanan-97b8b545/">Sundaram Ananthanarayan</a> for design discussions around the possibility of leveraging Flink for some of the WAL use cases.</li><li><a href="https://jolynch.github.io/">Joseph Lynch</a> for providing strategic direction and organizational support for this project.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=127b6712359a" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/building-a-resilient-data-platform-with-write-ahead-log-at-netflix-127b6712359a">Building a Resilient Data Platform with Write-Ahead Log at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/building-a-resilient-data-platform-with-write-ahead-log-at-netflix-127b6712359a</link>
      <guid>https://netflixtechblog.com/building-a-resilient-data-platform-with-write-ahead-log-at-netflix-127b6712359a</guid>
      <pubDate>Fri, 26 Sep 2025 20:57:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale]]></title>
      <description><![CDATA[<p>By <a href="https://www.linkedin.com/in/andrew-pierce-34443a7/">Andrew Pierce</a>, <a href="https://www.linkedin.com/in/chris-thrailkill-a268914/">Chris Thrailkill</a>, <a href="https://www.linkedin.com/in/victor-chiapaikeo-974a501b/">Victor Chiapaikeo</a></p><p>At Netflix, we prioritize getting timely data and insights into the hands of the people who can act on them. One of our key internal applications for this purpose is Muse. Muse’s ultimate goal is to help Netflix members discover content they’ll love by ensuring our promotional media is as effective and authentic as possible. It achieves this by equipping creative strategists and launch managers with data-driven insights showing which artwork or video clips resonate best with global or regional audiences and flagging outliers such as potentially misleading (clickbait-y) assets. These kinds of applications fall under Online Analytical Processing (OLAP), a category of systems designed for complex querying and data exploration. However, enabling Muse to support new, more advanced filtering and grouping capabilities while maintaining high performance and data accuracy has been a challenge. Previous posts have touched on <a href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">artwork personalization</a> and our <a href="https://netflixtechblog.com/introducing-impressions-at-netflix-e2b67c88c9fb">impressions architecture</a>. <strong>In this post, we’ll discuss some steps we’ve taken to evolve the Muse data serving layer to enable new capabilities while maintaining high performance and data accuracy.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*22EDwImi4b5D3tZuY8KXjQ.png"><figcaption>Muse application</figcaption></figure><h3>An Evolving Architecture</h3><p>Like many early analytics applications, Muse began as a simple dashboard powered by batch data pipelines (Spark¹) and a modest Druid² cluster. 
As the application evolved, so did user demands. Users wanted new features like outlier detection and notification delivery, media comparison and playback, and advanced filtering, all while requiring lower latency and supporting ever-growing datasets (in the order of trillions of rows a year). One of the most challenging requirements was enabling dynamic analysis of promotional media performance by “audience” affinities: internally defined, algorithmically inferred labels representing collections of viewers with similar tastes. Answering questions like “Does specific promotional media resonate more with Character Drama fans or Pop Culture enthusiasts?” required augmenting already voluminous impression and playback data. Supporting filtering and grouping by these many-to-many audience relationships led to a combinatorial explosion in data volume, pushing the limits of our original architecture.</p><p>To address these complexities and support the evolving needs of our users, we undertook a significant evolution of Muse’s architecture. Today’s Muse is a React app that queries a GraphQL layer served with a set of Spring Boot GRPC microservices. In the remainder of this post, we’ll focus on steps we took to scale the data microservice, its backing ETL, and our Druid cluster. <strong>Specifically, we’ve changed the data model to rely on HyperLogLog (HLL) sketches, used </strong><a href="https://hollow.how/"><strong>Hollow</strong></a><strong> for access to in-memory, precomputed aggregates, and taken a series of steps to tune Druid. 
To ensure the accuracy of these changes, we relied heavily on internal debugging tools to validate pre- and post-changes.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-VJUGD9aJ3BNr-q0MgawIg.png"><figcaption><em>Muse’s Current Architecture</em></figcaption></figure><h4>Moving to HyperLogLog (HLL) Sketches for Distinct Counts</h4><p>Some of the most important metrics we track are impressions, the number of times an asset is shown to a user within a time window, and qualified plays, which links a playback event with a minimum duration back to a specific impression. Calculating these metrics requires counting distinct users. However, performing distinct counts in distributed systems is resource-intensive and challenging. For instance, to determine how many unique profiles have ever seen a particular asset, we need to compare each new set of profile ids with those from all days before it, potentially spanning months or even years.</p><p>For performance, we can trade accuracy. The <a href="https://datasketches.apache.org/">Apache Datasketches library</a> allows us to get distinct count estimates that are within a 1–2% error. This is tunable with a precision parameter called logK (0.8% in our case with logK of 17). We build sketches in two places:</p><ol><li>During Druid ingest: we use the <a href="https://druid.apache.org/docs/latest/development/extensions-core/datasketches-hll/#aggregators">HLLSketchBuild aggregator</a> with Druid <a href="https://druid.apache.org/docs/latest/ingestion/rollup/">rollup set to true</a> to reduce our data in preparation for fast distinct counting</li><li>During our Spark ETL: we persist precomputed aggregates like all-time impressions per asset in the form of HLL sketches. 
Each day, we merge a new HLL sketch into the existing one using a combination of <a href="https://spark.apache.org/docs/3.5.1/api/java/org/apache/spark/sql/functions.html#hll_union(org.apache.spark.sql.Column,org.apache.spark.sql.Column)">hll_union</a> and <a href="https://spark.apache.org/docs/3.5.1/api/java/org/apache/spark/sql/functions.html#hll_union_agg(org.apache.spark.sql.Column)">hll_union_agg</a> (<a href="https://www.databricks.com/blog/apache-spark-3-apache-datasketches-new-sketch-based-approximate-distinct-counting">functions added by our very own Ryan Berti</a>)</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*29CBkjnyIn_1e9V1R0m8hg.png"><figcaption><em>We use Datasketches in our ETL and serving systems</em></figcaption></figure><p>HLL has been a huge performance boost for us both within the serving and ETL layer. Across our most common OLAP query patterns, we’ve seen latencies reduce by approx 50%. Nevertheless, running APPROX_COUNT_DISTINCT over large date ranges on the Druid cluster for very large titles exhausts limited threads, especially in high-concurrency situations. To further offload Druid query volume and preserve cluster threads, we’ve also relied extensively on the <a href="https://github.com/Netflix/hollow">Hollow library</a>.</p><h4>Hollow as a Read-Only Key Value Store for Precomputed Aggregates</h4><p>Our in-house Hollow³ infrastructure allows us to easily create Hollow feeds — essentially highly compressed and performant in-memory key/value stores — from Iceberg⁴ tables. In this setup, dedicated producer servers listen for changes to Iceberg tables, and when updates occur, they push the latest data to downstream consumers. 
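<p>Stepping back briefly to the HLL merge described above: the property the daily ETL relies on is that sketches combine by union — yesterday’s all-time sketch unioned with today’s daily sketch yields the new all-time sketch, with no need to rescan history. The following stand-in illustrates that shape with an exact HashSet in place of an approximate HLL sketch; the class and method names are ours, not a real library API.</p>

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-in for the daily sketch merge: a HashSet gives exact
// distinct counts, where an HLL sketch would give an estimate (within
// roughly 1-2% error at far smaller size). The union-then-estimate shape
// is the same as with hll_union.
public class DistinctCountMergeSketch {

    // "Union" today's profile ids into the all-time sketch, without
    // mutating either input.
    static Set<Long> union(Set<Long> allTime, Set<Long> today) {
        Set<Long> merged = new HashSet<>(allTime);
        merged.addAll(today);
        return merged;
    }

    // Distinct-count "estimate" (exact for a set; approximate for HLL).
    static long estimate(Set<Long> sketch) {
        return sketch.size();
    }
}
```

<p>In the real pipeline, the union is performed by Spark’s hll_union and hll_union_agg functions during ETL and by Druid’s HLLSketchBuild aggregator at ingest, as described above.</p>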
On the consumer side, our Spring Boot applications listen to announcements from these producers and automatically refresh in-memory caches with the latest dataset.</p><p>This architecture has enabled us to migrate several data access patterns from Druid to Hollow, specifically ones with a limited number of parameter combinations per title. One of these was fetching distinct filter dimensions. For example, while most Netflix-branded titles are released globally, licensed titles often have rights restrictions that limit their availability to specific countries and time windows. As a result, a particular licensed title might only be available to members in Germany and Luxembourg.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1014/1*m4Tdj3YixQfPmnEaMGdspA.png"><figcaption><em>Distinct countries queried from a Hollow feed for the assets for Manta Manta</em></figcaption></figure><p>In the past, retrieving these distinct country values per asset required issuing a SELECT DISTINCT query to our Druid cluster. With Hollow, we maintain a feed of distinct dimension values, allowing us to perform stream operations like the one below directly on a cached dataset.</p><pre>/**<br> * Returns the possible filter values for a dimension such as countries<br> */<br>public List&lt;Dimension&gt; getDimensions(long movieId, String dimensionId) {<br>    // Access in-memory Hollow feed with near instant query time<br>    Map&lt;String, List&lt;Dimension&gt;&gt; dimensions = dimensionsHollowConsumer.lookup(movieId);<br>    return dimensions.getOrDefault(dimensionId, List.of()).stream()<br>        .sorted(Comparator.comparing(Dimension::getName))<br>        .toList();<br>}</pre><p>Although it adds complexity to our service by requiring more intricate request routing and a higher memory footprint, pre-computed aggregates have given us greater stability and performance. 
In the case of fetching distinct dimensions, we’ve observed query times drop from hundreds of milliseconds to just tens of milliseconds. More importantly, this shift has offloaded high concurrency demands from our Druid cluster, resulting in more consistent query performance. In addition to this use case, cached pre-computed aggregates also power features such as retrieving recently launched titles, accessing all-time asset metrics, and serving various pieces of title metadata.</p><h4>Tuning Druid</h4><p>Even with the efficiencies gained from HLL sketches and Hollow feeds, ensuring that our Druid cluster operates performantly has been an ongoing challenge. Fortunately, at Netflix, we are in the company of multiple <a href="https://www.apache.org/foundation/governance/pmcs">Apache Druid PMC members</a> like <a href="https://www.linkedin.com/in/maytasm/">Maytas Monsereenusorn</a> and <a href="https://www.linkedin.com/in/jessetuglu/">Jesse Tuğlu</a> who have helped us wring out every ounce of performance. 
Some of the key optimizations we’ve implemented include:</p><ul><li><strong>Increasing broker count relative to historical nodes:</strong> We aim for a broker-to-historical ratio close to the <a href="https://druid.apache.org/docs/latest/operations/basic-cluster-tuning/#number-of-brokers">recommended 1:15</a>, which helps improve query throughput.</li><li><strong>Tuning segment sizes:</strong> By targeting the <a href="https://druid.apache.org/docs/latest/operations/segment-optimization/">300–700 MB “sweet spot”</a> for segment sizes, primarily using the tuningConfig.targetRowsPerSegment parameter during ingestion, we ensure that each segment a single historical thread scans is not overly large.</li><li><strong>Leveraging Druid lookups for data enrichment:</strong> Since joins can be prohibitively expensive in Druid, we use <a href="https://druid.apache.org/docs/latest/querying/lookups/">lookups</a> at query time for any key column enrichment.</li><li><strong>Optimizing search predicates:</strong> We ensure that all search predicates operate on physical columns rather than virtual ones, creating necessary columns during ingestion with <a href="https://druid.apache.org/docs/latest/ingestion/ingestion-spec/#transforms">transformSpec.transforms</a>.</li><li><strong>Filtering and slimming data sources at ingest:</strong> By applying filters within <a href="https://druid.apache.org/docs/latest/ingestion/ingestion-spec/#filter">transformSpec.filter</a> and removing all unused columns in <a href="https://druid.apache.org/docs/latest/ingestion/ingestion-spec/#dimensionsspec">dimensionsSpec.dimensions</a>, we keep our data sources lean and improve the possibility of <a href="https://druid.apache.org/docs/latest/ingestion/rollup">higher rollup yield</a>.</li><li><strong>Use of multi-value dimensions:</strong> Exploiting the Druid <a href="https://druid.apache.org/docs/latest/querying/multi-value-dimensions/">multi-value dimension</a> feature was key to overcoming the 
“many-to-many” combinatorial quandary when integrating audience filtering and grouping functionality mentioned in the “An Evolving Architecture” section above.</li></ul><p>Together, these optimizations, combined with previous ones, have decreased our p99 Druid latencies by roughly 50%.</p><h4>Validation &amp; Rollout</h4><p>Rolling out these changes to our metrics system required a thorough validation and release strategy. Our approach prioritized both data integrity and user trust, leveraging a blend of automation, targeted tooling, and incremental exposure to production traffic. At the core of our strategy was a parallel stack deployment: both the legacy and new metric stacks operated side-by-side within the Muse Data microservice. This setup allowed us to validate data quality, monitor real-world performance, and mitigate risk by enabling seamless fallback at any stage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/831/1*gmcpkg9ehK1iX4Z6tKsetw.png"></figure><p>We adopted a two-pronged validation process:</p><ul><li><strong>Automated Offline Validation: </strong>Using Jupyter Notebooks, we automated the sampling and comparison of key metrics across both the legacy and new stacks. Our sampling set included a representative mix: recently accessed titles, high-profile launches, and edge-case titles with unique handling requirements. This allowed us to catch subtle discrepancies in metrics early in the process. Iterative testing on this set guided fixes, such as tuning the HLL logK parameter and benchmarking end-to-end latency improvements.</li><li><strong>In-App Data Comparison Tooling: </strong>To facilitate rapid triage, we built a developer-facing comparison feature within our application that displays data from both the legacy and new metric stacks side by side. 
The tool automatically highlights any significant differences, making it easy to quickly spot and investigate discrepancies identified during offline validation or reported by users.</li></ul><p>We implemented several release best practices to mitigate risk and maintain stability:</p><ul><li><strong>Staggered Implementation by Application Segment: </strong>We developed and deployed the new metric stack in stages, focusing on specific application segments. This meant building out support for asset types like artwork and video separately and then further dividing by CEE phase (Explore, Exploit). By implementing changes segment by segment, we were able to isolate issues early, validate each piece independently, and reduce overall risk during the migration.</li><li><strong>Shadow Testing (“Dark Launch”):</strong> Prior to exposing the new stack to end users, we mirrored production traffic asynchronously to the new implementation. This allowed us to validate real-world latency and catch potential faults in a live environment, without impacting the actual user experience.</li><li><strong>Granular Feature Flagging: </strong>We implemented fine-grained feature flags to control exposure within each segment. This allowed us to target specific user groups or titles and instantly roll back or adjust the rollout scope if any issues were detected, ensuring rapid mitigation with minimal disruption.</li></ul><h3>Learnings and Next Steps</h3><p>Our journey with Muse tested the limits of several parts of the stack: the ETL layer, the Druid layer, and the data serving layer. While some choices, like leveraging Netflix’s in-house Hollow infrastructure, were influenced by available resources, simple principles like offloading query volume, pre-filtering of rows and columns before Druid rollup, and optimizing search predicates (along with a bit of HLL magic) went a long way in allowing us to support new capabilities while maintaining performance. 
Additionally, engineering best practices like producing side-by-side implementations and backwards-compatible changes enabled us to roll out revisions steadily while maintaining rigorous validation standards. Looking ahead, we’ll continue to build on this foundation by supporting a wider range of content types like Live and Games, incorporating synopsis data, deepening our understanding of how assets work together to influence member choice, and incorporating new metrics to distinguish between “effective” and “authentic” promotional assets, in service of helping members find content that truly resonates with them.</p><p>¹ Apache Spark is an open-source analytics engine for processing large-scale data, enabling tasks like batch processing, machine learning, and stream processing.</p><p>² Apache Druid is a high-performance, real-time analytics database designed for quickly querying large volumes of data.</p><p>³ Hollow is a Java library for efficient in-memory storage and access to moderately sized, read-only datasets, making it ideal for high-performance data retrieval.</p><p>⁴ Apache Iceberg is an open-source table format designed for large-scale analytical datasets stored in data lakes. It provides a robust and reliable way to manage data in formats like Parquet or ORC within cloud object storage or distributed file systems.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=aa9ad326fd77" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/scaling-muse-how-netflix-powers-data-driven-creative-insights-at-trillion-row-scale-aa9ad326fd77">Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/scaling-muse-how-netflix-powers-data-driven-creative-insights-at-trillion-row-scale-aa9ad326fd77</link>
      <guid>https://netflixtechblog.com/scaling-muse-how-netflix-powers-data-driven-creative-insights-at-trillion-row-scale-aa9ad326fd77</guid>
      <pubDate>Mon, 22 Sep 2025 23:24:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Empowering Netflix Engineers with Incident Management]]></title>
      <description><![CDATA[<p><em>By: </em><a href="https://www.linkedin.com/in/mollystruve/"><em>Molly Struve</em></a></p><p>Netflix’s mission to provide seamless entertainment to hundreds of millions of users globally demands exceptional reliability. At the heart of this reliability is how we handle incidents — those inevitable moments when something doesn’t go as expected.</p><p>Teams can respond quickly and more effectively when incidents are managed consistently across a company. A robust process for following up after incidents creates opportunities for learning and improving systems. This continuous improvement cycle is essential for maintaining the highly reliable systems on which our members depend.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xiHZ0qDpJbANPSw1UwqLJA.png"></figure><p>Having a shared, consistent approach to incident management became critical as Netflix grew and expanded its business. This post delves into our journey to transform incident management from a centralized function into a widespread, accessible practice and the hard-won lessons we’ve learned along the way.</p><h3>The Past: Countless Missed Opportunities</h3><p>For most of Netflix’s past, incident management was the domain of our central Site Reliability Engineering team, called <a href="https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb">CORE</a> (Critical Operations and Reliability Engineering). CORE was focused on streaming and was the sole initiator of incidents. They used Jira and a single Slack channel for incident response. This approach worked in the early days, but we knew it wouldn’t scale as Netflix grew and diversified.</p><p>With thousands of microservices supporting critical functions beyond streaming, we knew plenty of things were breaking that we were not capturing. 
We had an internal post-incident write-up template called “OOPS,” which teams could use to write about operational surprises. The template saw limited adoption as many engineers didn’t know about it or understand its purpose or value. With countless smaller, everyday incidents going unnoticed, we were missing key opportunities to learn and improve.</p><h3>The Aspiration: A Paved Road to Incident Management</h3><p>Recognizing these limits, we embarked on a journey to democratize incident management. Our goal: open more incidents and engage more teams in the process. We envisioned a “paved road” for incident management — a process so intuitive and streamlined that anyone could easily declare and manage an incident, even at 3 AM. Creating a paved road required a shift: our central SRE team would no longer be the only ones declaring incidents. Instead, we’d empower teams across engineering to own their own incidents. Making this significant shift required both technological and cultural changes.</p><h3>Finding the Right Tool</h3><p>Scaling technical processes within an organization as diverse and intricate as Netflix is challenging. To enable every engineering team to manage incidents effectively, we needed a comprehensive incident management tool that was far more sophisticated than Jira and a single Slack channel. 
We knew any solution, whether built or bought, would need to meet four key requirements:</p><ul><li><strong>Intuitive user experience</strong> — Our number one priority was making sure the tool was so intuitive that anyone could use it with little to no training.</li><li><strong>Internal data integration capabilities</strong> — We needed the ability to hook in Netflix-specific data.</li><li><strong>Balanced customization with consistency</strong> — We wanted teams to have flexibility while maintaining shared standards.</li><li><strong>Approachable</strong> — A friendly and appealing tool that could help drive a cultural shift around incidents.</li></ul><p>The “build vs. buy” question was a significant consideration. While Netflix boasts a world-class engineering team, building an in-house solution meeting these requirements was impractical due to our ambitious timeline, the substantial investment needed, and ongoing ownership costs. Following Netflix’s engineering principle of “build only when necessary,” we evaluated external solutions against these criteria.</p><p>This evaluation process led us to adopt <a href="http://incident.io/">Incident.io</a>. While the platform checked all our boxes during selection, the four above requirements proved even more impactful than anticipated during Netflix’s incident management transformation.</p><h3>Tackling the Transformation</h3><p>Selecting the right tool was just the beginning. The real challenge was rolling it out across Netflix’s diverse engineering organization and achieving the cultural shift we envisioned. Here are four elements that helped make our goal a reality.</p><h4>Intuitive Design Drove Adoption and Cultural Transformation</h4><p>Tool usability was crucial to encourage teams to open incidents. It had to be easily understandable, even for engineers who aren’t incident management experts and only use it a few times a year. 
When introducing <a href="http://incident.io/">Incident.io</a>, we saw rapid organic adoption because the tool was easy to pick up without much guidance. Its intuitive design allowed users to discover features as they used it. Thanks to prioritizing usability, within four months, 20% of engineering teams were using the tooling, and six months later, we had over 50% adoption.</p><p>Beyond rapid adoption, the tool helped shift how Netflix engineers think about incidents. Incidents went from “big scary outages” to simply “any blip or issue that degrades or disrupts a service that deserves attention and learning.” The tool’s friendly, welcoming interface made incident management less intimidating and more accessible. Some engineers described the platform as “jolly” and mentioned that it actually made them <em>want</em> to open incidents. The approachable design lowered psychological barriers for engineers to declare incidents and made it feel like a natural, even positive, part of their workflow.</p><h4>Organizational Investment Supported Scalable Growth</h4><p>While having an intuitive tool was important, successfully empowering engineers to open incidents required deliberate organizational investment. We invested heavily in standardization, developing an incident management process lightweight enough to avoid overwhelming users yet structured enough to support complex incidents. Finding the right balance took time and active engagement with users to understand what was working and what wasn’t. To this day, we still make adjustments to refine and improve the process.</p><p>On the education front, we created lightweight docs, quick-reference cheatsheets, and short demo videos to accelerate adoption across Netflix’s diverse engineering organization. We took these resources on roadshows across engineering teams and proved that the barrier to entry for managing incidents was practically nonexistent. While most engineers bought in easily, we had our skeptics. 
Over time, we worked with these folks to understand their needs better and help them fit incident management into their daily routines and processes.</p><h4>Internal Integrations Reduce Cognitive Load</h4><p>Integrating our unique organizational context — like teams, software services, business domains, and even hardware devices — directly into the incident management platform was critical. Netflix-specific contextualization enables powerful automations, such as automatically looping in the right teams or pre-filling incident fields from alerts. These integrations significantly reduce cognitive load during an incident and empower engineers to focus on quick mitigation. Beyond individual incidents, integrations with internal data across multiple incidents enable us to identify and address systemic issues.</p><h4>Balanced Customization with Consistency Improved Response</h4><p>A flexible platform allowed us to create a tailored incident response experience while enforcing a shared language and standard metadata across all engineering teams. This balance proved crucial for response effectiveness: different teams can adapt workflows to their specific needs, but core elements like “impacted areas and domains” stay consistent. Incident responders can quickly understand any incident organization-wide because the structure and language remain familiar, enabling faster, more effective incident response.</p><h3>The Result: A New Era of Incident Management</h3><p>Our journey to democratize incident management has yielded massive wins across Netflix Engineering. We successfully transitioned from a centralized incident response model to empowering engineers to declare and manage incidents. The transformation has fostered a culture of renewed ownership and learning across engineering teams.</p><p>We’ve established new practices and are growing an incident management culture we’re genuinely proud of, but we’re not done yet. 
Our incident management processes continue to evolve and adapt to fit Netflix’s growing needs. Every day, we work to educate engineers and leaders on the tremendous value incidents provide. We’re excited to continue harnessing these incredible learning opportunities to improve our platform for our hundreds of millions of members.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=ebb967871de4" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/empowering-netflix-engineers-with-incident-management-ebb967871de4">Empowering Netflix Engineers with Incident Management</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/empowering-netflix-engineers-with-incident-management-ebb967871de4</link>
      <guid>https://netflixtechblog.com/empowering-netflix-engineers-with-incident-management-ebb967871de4</guid>
      <pubDate>Fri, 19 Sep 2025 18:48:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix]]></title>
      <description><![CDATA[<p>By <a href="https://www.linkedin.com/in/daomi/">Dao Mi</a>, <a href="https://www.linkedin.com/in/pabloadelgado/">Pablo Delgado</a>, <a href="https://www.linkedin.com/in/ryan-berti-4942aa83/">Ryan Berti</a>, <a href="https://www.linkedin.com/in/amanuel-kahsay-81ab29153/">Amanuel Kahsay</a>, <a href="https://www.linkedin.com/in/onwoke/">Obi-Ike Nwoke</a>, <a href="https://www.linkedin.com/in/chris-thrailkill-a268914/">Christopher Thrailkill</a>, and <a href="https://www.linkedin.com/in/patriciogarza/">Patricio Garza</a></p><p>At Netflix, data engineering has always been a critical function to enable the business’s ability to understand content, power recommendations, and drive business decisions. Traditionally, the function centered on building robust tables and pipelines to capture facts, derive metrics, and provide well-modeled data products to partners in analytics &amp; data science functions. But as Netflix’s studio and content production scaled, so too have the challenges — and opportunities — of working with complex media data.</p><p>Today, we’re excited to share how our team is formalizing a new specialization of data engineering at Netflix: <strong>Media ML Data Engineering</strong>. This evolution is embodied in our latest collaboration with our platform teams, the <strong>Media Data Lake</strong>, which is designed to harness the full potential of media assets (video, audio, subtitles, scripts, and more) and enable the latest advances in machine learning, including the latest transformer model architectures. As part of this initiative, we’re intentionally applying data engineering best practices — ensuring that our approach is both innovative and grounded in proven methodologies.</p><h3>The Evolution: From Traditional Tables to Media Tables</h3><p><strong>Traditional data engineering</strong> at Netflix focused on building structured tables for metrics, dashboards, and data science models. 
These tables were primarily structured text or numerical fields, ideal for business intelligence, analytics, and statistical modeling.</p><p>However, the nature of media data is fundamentally different:</p><ul><li>It’s <strong>multi-modal</strong> (video, audio, text, images).</li><li>It contains <strong>derived</strong> fields from media (embeddings, captions, transcriptions, etc.).</li><li>It’s <strong>unstructured</strong> and massive in scale when parsed out.</li><li>It’s deeply <strong>intertwined</strong> with creative workflows and business asset lineage.</li></ul><p>As our studio operations (see below) expanded, we saw the need for a new approach — one that could provide centralized, standardized, and scalable access to all types of media assets and their metadata for both analytical and machine learning workflows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*87cD7-YKwcl_Quvu"></figure><h3>The Rise of Media ML Data Engineering</h3><p>Enter <strong>Media ML Data Engineering</strong> — a new specialization at Netflix that bridges the gap between traditional data engineering and the unique demands of media-centric machine learning. This role sits at the intersection of data engineering, ML infrastructure, and media production. 
Our mission is to provide seamless access to media assets and derived data (including outputs from machine learning models) for researchers, data scientists, and other downstream data consumers.</p><h3>Key Responsibilities</h3><ul><li><strong>Centralized Media Data Access:</strong> Building, cataloging, and maintaining the data and pipelines that populate the Media Data Lake, a data platform for storing and serving media assets and their metadata.</li><li><strong>Asset Standardization:</strong> Standardizing media assets across modalities (video, images, audio, text) to ensure consistency and quality for ML applications, in partnership with domain engineering teams.</li><li><strong>Metadata Management:</strong> Unifying and enriching asset metadata, making it easier to track asset lineage, quality, and coverage.</li><li><strong>ML-Ready Data:</strong> Exposing large corpora of assets for early-stage algorithm exploration, benchmarking, and productionization.</li><li><strong>Collaboration:</strong> Partnering closely with domain experts, algorithm researchers, upstream content engineering teams, and (machine learning &amp; data) platform colleagues to ensure our data meets real-world needs.</li></ul><p>This new role is essential for bridging the gap between creative media workflows and the technical demands of cutting-edge ML.</p><h3>Introducing the Media Data Lake</h3><p>To enable the next generation of media analytics and machine learning, we are building the <strong>Media Data Lake</strong> at Netflix — a data lake designed specifically for media assets, built with <a href="https://lancedb.com/">LanceDB</a>. 
We have partnered with our data platform team on integrating LanceDB into our <a href="https://netflixtechblog.com/all?topic=big-data">Big Data Platform</a>.</p><h3>Architecture and Key Components</h3><ul><li><strong>Media Table:</strong> The core of the Media Data Lake, this structured dataset captures essential metadata and references to all media assets. It’s designed to be extensible, supporting both traditional metadata and outputs from ML models (including transformer-based embeddings, media understanding research, and more).</li><li><strong>Data Model:</strong> We are developing a robust data model to standardize how media assets and their attributes are represented, making it easier to query and join across schemas.</li><li><strong>Data API:</strong> A Pythonic interface that will provide programmatic access to the Media Table, supporting both interactive exploration and automated workflows.</li><li><strong>UI Components:</strong> Off-the-shelf UI components enable teams to visually explore assets in the media data lake, accelerating discovery and iteration for ICs.</li><li><strong>Online and Offline System Architecture:</strong> Real-time access for lightweight queries and exploration of raw media assets; scalable large batch processing for ML training, benchmarking, and research.</li><li><strong>Compute:</strong> A distributed batch inference layer that runs model inference at scale on GPUs and media data processing at scale on CPUs.</li></ul><h3>Starting Small with New Technology</h3><p>Our initial focus this past year has been on delivering a “data pond” — a mini-version of the Media Data Lake targeted at video/audio datasets for early-stage model training, evaluation, and research. 
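</p><p>To make the Media Table and Data API components above concrete, here is a minimal, hypothetical sketch of a media-table record and a lookup helper in Python. All field names and values are illustrative, not Netflix’s actual schema.</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MediaAsset:
    """One hypothetical media-table record (illustrative fields only)."""
    asset_id: str
    title_id: str
    modality: str                     # "video", "audio", "text", or "image"
    uri: str                          # pointer to the underlying asset
    embedding: List[float] = field(default_factory=list)  # model-derived vector

# A toy in-memory "media table" standing in for the real, LanceDB-backed store.
TABLE = [
    MediaAsset("a1", "t1", "video", "s3://bucket/t1/trailer.mp4", [0.1, 0.9]),
    MediaAsset("a2", "t1", "audio", "s3://bucket/t1/dub_fr.wav", [0.4, 0.2]),
    MediaAsset("a3", "t2", "video", "s3://bucket/t2/ep1.mp4", [0.8, 0.3]),
]

def assets_for(title_id: str, modality: Optional[str] = None) -> List[MediaAsset]:
    """Return a title's assets, optionally filtered to one modality."""
    return [a for a in TABLE
            if a.title_id == title_id and (modality is None or a.modality == modality)]

print([a.asset_id for a in assets_for("t1", "video")])  # -> ['a1']
```

<p>In the actual system the table lives in LanceDB rather than in memory, and lookups can also be vector-based (e.g. nearest-neighbor search over embeddings).</p><p>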
All data for this phase comes from AMP, our internal <a href="https://netflixtechblog.com/elasticsearch-indexing-strategy-in-asset-management-platform-amp-99332231e541">asset management system</a> and <a href="https://netflixtechblog.com/scalable-annotation-service-marken-f5ba9266d428">annotation store</a>, and the scope is intentionally small to ensure a solid, extensible foundation can be built while introducing a new technology into the company. Lightweight queries to AMP let us explore the raw media assets and build an intuitive understanding of the media.</p><h3>Media Tables: The New Foundation for ML and Innovation</h3><p>One of the most exciting developments is the rise of <strong>media tables</strong> — structured datasets that not only capture traditional metadata, but also include the outputs of advanced ML models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4AL_ScuaNy4ZguBj"></figure><p>These media tables power a range of innovative applications, such as:</p><ul><li><strong>Translation &amp; Audio Quality Measures:</strong> Managing audio clips and features via text-to-speech models for engineering localization quality metrics.</li><li><strong>Media Fidelity Restoration:</strong> Research on restoration of videos to HDR for remastering and other image technology use-cases.</li><li><strong>Story Understanding and Content Embedding:</strong> Structuring narrative elements extracted from textual evidence and video of a title to increase operational efficiency in title launch preparation and ratings, e.g. 
detection of smoking, gore, and NSFW scenes in our titles.</li><li><strong>Media Search:</strong> Leveraging multi-modal vector search to find similar keyframes, shots, and dialogue to facilitate research and experimentation.</li></ul><p>These tables, built on top of LanceDB, are designed to scale, support complex queries, and serve both research and other data science &amp; analytical needs.</p><h3>The Human Side: New Roles and Collaboration</h3><p>Media ML Data Engineering is a team sport. Our data engineers partner with domain experts, data scientists, ML researchers, upstream business ops and content engineering teams to ensure our data solutions are fit for purpose. We also work closely with our friendly platform teams to ensure technological breakthroughs that are beneficial beyond our small corner of the universe can become horizontal abstractions that benefit the rest of Netflix. This collaborative model enables rapid iteration, high data quality, innovative use cases, and technology re-use.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*adTotLaJXhE7KC2-"></figure><h3>Looking Ahead</h3><p>The evolution from traditional data engineering to Media ML data engineering — anchored by our media data lake — is unlocking new frontiers for Netflix:</p><ul><li><strong>Richer, more accurate ML models</strong> trained on high-quality, standardized media data.</li><li><strong>Supercharged ML model evaluations</strong> via quick iteration cycles on the data.</li><li><strong>Faster experimentation and productization</strong> of new AI-powered features.</li><li><strong>Deeper insights into our content and creative workflows</strong> via metrics constructed from features inferred by media ML algorithms.</li></ul><p>As we continue to grow the media data lake, be on the lookout for subsequent blog posts sharing our learnings and tools with the broader media ML &amp; data engineering community.</p><img 
src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=6dcc91058d8d" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/from-facts-metrics-to-media-machine-learning-evolving-the-data-engineering-function-at-netflix-6dcc91058d8d">From Facts &amp; Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/from-facts-metrics-to-media-machine-learning-evolving-the-data-engineering-function-at-netflix-6dcc91058d8d</link>
      <guid>https://netflixtechblog.com/from-facts-metrics-to-media-machine-learning-evolving-the-data-engineering-function-at-netflix-6dcc91058d8d</guid>
      <pubDate>Thu, 21 Aug 2025 19:39:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[ML Observability: Bringing Transparency to Payments and Beyond]]></title>
      <description><![CDATA[<p>By <a href="http://www.linkedin.com/in/tang">Tanya Tang</a>, <a href="https://www.linkedin.com/in/dkmehrmann">Andrew Mehrmann</a></p><p>At Netflix, the importance of ML observability cannot be overstated. ML observability refers to the ability to monitor, understand, and gain insights into the performance and behavior of machine learning models in production. It involves tracking key metrics, detecting anomalies, diagnosing issues, and ensuring models are operating reliably and as intended. ML observability helps teams identify data drift, model degradation, and operational problems, enabling faster troubleshooting and continuous improvement of ML systems.</p><p>One specific area where ML observability plays a crucial role is in payment processing. At Netflix, we strive to ensure that technical or process-related payment issues never become a barrier for someone wanting to sign up or continue using our service. By leveraging ML to optimize payment processing, and using ML observability to monitor and explain these decisions, we can reduce payment friction. This ensures that new members can subscribe seamlessly and existing members can renew without hassle, allowing everyone to enjoy Netflix without interruption.</p><h3>ML Observability: A Primer</h3><p>ML Observability is a set of practices and tools to help ML practitioners and stakeholders alike gain a deeper, end-to-end understanding of their ML systems across all stages of their lifecycle, from development to deployment to ongoing operations. An effective ML Observability framework not only facilitates automatic detection and surfacing of issues but also provides detailed root cause analysis, acting as a guardrail to ensure ML systems perform reliably over time. 
This enables teams to iterate on and improve their models rapidly and reduce time to detection for incidents, while also increasing the buy-in and trust of their stakeholders by providing rich context about the system’s behaviors and impact.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nIX4fj34CFNPh1TCXWADSw.png"></figure><p>Some examples of these tools include long-term production performance monitoring and analysis; feature, target, and prediction drift monitoring; automated data quality checks; and model explainability. For example, a good observability system would detect aberrations in input data, the feature pipeline, predictions, and outcomes, as well as provide insight into the likely causes of model decisions and/or performance.</p><p>As an ML portfolio grows, ad-hoc monitoring becomes increasingly challenging. Greater complexity also raises the likelihood of interactions between different model components, making it unrealistic to treat each model as an isolated black box. At this stage, investing strategically in observability is essential — not only to support the current portfolio, but also to prepare for future growth.</p><h3>Stakeholder-Facing Observability Modules</h3><p>In order to reliably and responsibly evolve our payments processing system to be increasingly ML-driven, we invested heavily up-front in ML observability solutions. 
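</p><p>As one concrete example of the drift monitoring mentioned above, a population stability index (PSI) check is a common technique for flagging feature or prediction drift. The sketch below is generic, with illustrative bin counts and thresholds; it is not necessarily the statistic used at Netflix.</p>

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample ("expected")
    and a production sample ("actual") of one numeric feature.
    A common rule of thumb: PSI > 0.2 suggests notable drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def proportions(xs):
        counts = [0] * bins
        for x in xs:
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        return [c / len(xs) for c in counts]
    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log((ai + eps) / (ei + eps)) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]        # training-time distribution
shifted  = [0.5 + i / 200 for i in range(100)]  # mass moved to the upper half

assert psi(baseline, baseline) < 0.01   # identical samples: no drift
assert psi(baseline, shifted) > 0.2     # shifted sample: flagged as drift
```

<p>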
To provide confidence to our business stakeholders through this evolution, we looked beyond technical metrics such as precision and recall and placed greater emphasis on real-world outcomes like “how much traffic did we send down this route” and “in which regions is ML underperforming.”</p><p>Using this as a guidepost, we designed a collection of interconnected modules for machine learning observability: <strong>logging, monitoring, and explaining.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/374/0*Q0lbG7mF6SCfH4kM"></figure><h3>Logging</h3><p>In order to support the monitoring and explaining we wanted to do, we first needed to log the appropriate data. This seems obvious and trivial, but as usual the devil is in the details: exactly which fields do we need to log, and when? How does this work for simple models vs. more complex ones? What about models that are actually made of multiple models?</p><p>Consider the following, relatively straightforward model. It takes some input data, creates features, passes these to a model that produces a score between 0 and 1, and then that score is translated into a decision (say, whether to process a card as Debit or Credit).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eqnhrCw9fSO4t01l42GuhQ.png"></figure><p>There are several elements you may wish to log: a unique identifier for each record that is trained and scored, the raw data, the final features that fed the model, a unique identifier for the model, the feature importances for that model, the raw model score, the cutoffs used to map a score to a decision, timestamps for the decision as well as the model, etc.</p><p>To address this, we drafted an initial data schema that would enable our various ML observability initiatives. 
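</p><p>A simplified decision-log record covering those elements might look like the following sketch. The field names and the debit/credit cutoff are illustrative only, not our production schema.</p>

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class DecisionLog:
    """Illustrative log record: identifiers, features, score, cutoff,
    resulting decision, and a timestamp."""
    features: dict
    score: float                      # raw model output in [0, 1]
    cutoff: float                     # threshold mapping score -> decision
    decision: str
    model_id: str = "card_route_v1"   # hypothetical model identifier
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    scored_at: float = field(default_factory=time.time)

def decide(features: dict, score: float, cutoff: float = 0.5) -> DecisionLog:
    """Translate a score into a decision and emit a complete log record."""
    decision = "debit" if score >= cutoff else "credit"
    return DecisionLog(features=features, score=score, cutoff=cutoff, decision=decision)

log = decide({"amount": 12.99, "country": "US"}, score=0.73)
print(log.decision)  # -> debit
```

<p>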
We identified the following logical entities to be necessary for the observability initiatives we were pursuing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QnB8neT1CZFqkceY0eVT4g.png"></figure><h3>Monitoring</h3><p>The goal of monitoring is twofold: 1) enable self-serve analytics and 2) provide opinionated views into key model insights. It can be helpful to think of this as “business analytics on your models,” as a lot of the key concepts (online analytical processing cubes, visualizations, metric definitions, etc.) carry over. Following this analogy, we can craft key metrics that help us understand our models. There are several considerations when defining metrics, including whether you need to understand real-world model behavior or offline model metrics, and whether your audience is ML practitioners or model stakeholders.</p><p>Given our particular needs, our bias is toward online, stakeholder-focused metrics. Online metrics tell us what actually happened in the real world, rather than in an idealized counterfactual universe that might have its own biases. Additionally, our stakeholders’ focus is on business outcomes, so our metrics tend to be outcome-focused rather than abstract and technical model metrics.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/916/0*rXEcgYja4RCzbOHm"></figure><p>We focused on simple, easy-to-explain metrics:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/836/1*gcysTqQSi5Eh4a8UXNGIHQ.png"></figure><p>These metrics begin to suggest reasons for changing trends in the model’s behavior over time, as well as more generally how the model is performing. This gives us an overall view of model health and an intuitive approximation of what we think the model should have done. 
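</p><p>As a toy illustration, an outcome-focused metric such as the share of traffic each route receives reduces to a simple aggregation over logged decisions (route names here are hypothetical):</p>

```python
from collections import Counter

def traffic_share(routed_decisions):
    """Fraction of decisions sent to each processor: an outcome metric,
    not a technical model metric."""
    counts = Counter(routed_decisions)
    total = sum(counts.values())
    return {route: n / total for route, n in counts.items()}

day1 = ["A"] * 60 + ["B"] * 40
day2 = ["A"] * 75 + ["B"] * 25

print(traffic_share(day1))  # -> {'A': 0.6, 'B': 0.4}
# A jump in processor A's share from day1 to day2 is exactly the kind of
# trend change that prompts deeper questions about the model's behavior.
```

<p>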
For example, if payment processor A starts receiving more traffic in a certain market compared to payment processor B, you might ask:</p><ul><li>Has the model seen something to make it prefer processor A?</li><li>Have more transactions become eligible to go to processor A?</li></ul><p>However, to truly explain specific decisions made by the model, especially which features are responsible for current trends, we need to use more advanced explainability tools, which will be discussed in the next section.</p><h3>Explaining</h3><p>Explainability means understanding the “why” behind ML decisions. This can mean the “why” in aggregate (e.g. why are so many of our transactions suddenly going down one particular route) or the “why” for a single instance (e.g. what factors led to this particular transaction being routed a particular way). This gives us the ability to approximate the previous status quo where we could inspect our static rules for insights about route volume.</p><p>One of the most effective tools we can leverage for ML explainability is SHAP (Shapley Additive exPlanations, <em>Lundberg &amp; Lee 2017</em>). At a high level, SHAP values are derived from cooperative game theory, specifically the Shapley values concept. 
The core idea is to fairly distribute the “payout” (in our case, the model’s prediction) among the “players” (the input features) by considering their contribution to the prediction.</p><h4>Key Benefits of SHAP:</h4><ol><li><strong>Model-Agnostic:</strong> SHAP can be applied to any ML model, making it a versatile tool for explainability.</li><li><strong>Consistent and Local Explanations:</strong> SHAP provides consistent explanations for individual predictions, helping us understand the contribution of each feature to a specific decision.</li><li><strong>Global Interpretability:</strong> By aggregating SHAP values across many predictions, we can gain insights into the overall behavior of the model and the importance of different features.</li><li><strong>Mathematical properties: </strong>SHAP satisfies important mathematical axioms such as efficiency, symmetry, dummy, and additivity. These properties allow us to compute explanations at the individual level and aggregate them for any ad-hoc groups that stakeholders are interested in, such as country, issuing bank, processor, or any combinations thereof.</li></ol><p>Because of the above advantages, we leverage SHAP as one core algorithm to unpack a variety of models and open the black box for stakeholders. Its well-documented <a href="https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html">Python interface</a> makes it easy to integrate into our workflows.</p><h4>Explaining Complex ML Systems</h4><p>For ML systems that score single events and use the output scores directly for business decisions, explainability is relatively straightforward, as the production decision is directly tied to the model’s output. However, in the case of a bandit algorithm, explainability can be more complex because the bandit policy may involve multiple layers, meaning the model’s output may not be the final decision used in production. 
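</p><p>To ground the Shapley idea, exact Shapley values can be computed by brute force when the number of features is tiny. The sketch below uses a hypothetical three-feature linear scorer (not one of our production models) and checks the efficiency axiom: the attributions sum to the prediction minus the baseline prediction, which is the property that makes aggregating explanations over cohorts sound.</p>

```python
from itertools import combinations
from math import factorial

# Hypothetical linear scorer over three named features.
WEIGHTS = {"amount": 0.4, "risk": -0.3, "tenure": 0.2}
BASELINE = {"amount": 0.0, "risk": 0.0, "tenure": 0.0}

def model(x):
    return sum(WEIGHTS[f] * x[f] for f in WEIGHTS)

def shapley(x):
    """Exact Shapley values by enumerating coalitions; features outside a
    coalition are filled in from BASELINE. Feasible only for a few features."""
    feats = list(WEIGHTS)
    n = len(feats)
    phi = {}
    for f in feats:
        others = [g for g in feats if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_f = {g: x[g] if (g in S or g == f) else BASELINE[g] for g in feats}
                without = {g: x[g] if g in S else BASELINE[g] for g in feats}
                total += weight * (model(with_f) - model(without))
        phi[f] = total
    return phi

x = {"amount": 1.0, "risk": 2.0, "tenure": 3.0}
phi = shapley(x)
# Efficiency axiom: attributions sum to prediction minus baseline prediction.
assert abs(sum(phi.values()) - (model(x) - model(BASELINE))) < 1e-9
```

<p>In practice, the <code>shap</code> library computes these values with efficient model-specific approximations rather than this exponential enumeration.</p><p>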
For example, we may have a classifier model to predict the likelihood of transaction approval for each route, but we might want to penalize certain routes due to higher processing fees.</p><p>Here is an example of a plot we built to visualize these layers. The traffic that the model would have selected on its own is on the left, and different penalty or guardrail layers impact final volume as you move left to right. For example, the model originally allocated 22% traffic to processor W with Configuration A, however for cost and contractual considerations, the traffic was reduced to 19% with 3% being allocated to Processor W with Configuration B, and Processor Nc with Configuration B.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*F5saiA3Wzn7Gv39r"></figure><p>While individual event analysis is crucial, such as in fraud detection where false positives need to be scrutinized, in payment processing, stakeholders are more interested in explaining model decisions at a group level (e.g., decisions for one issuing bank from a country). This is essential for business conversations with external parties. SHAP’s mathematical properties allow for flexible aggregation at the group level while maintaining consistency and accuracy.</p><p>Additionally, due to the multi-candidate structure, when stakeholders inquire about why a particular candidate was chosen, they are often interested in the differential perspective — specifically, why another similar candidate was not selected. We leverage SHAP to segment populations into cohorts that share the same candidates and identify the features that make subtle but critical differences. For example, while Feature A might be globally important, if we compare two candidates that both have the same value for Feature A, the local differences become crucial. 
This facilitates stakeholder discussions and helps us understand subtle differences among routes or payment partners.</p><p>Earlier, we were alerted that our ML model consistently reduced traffic to a particular route every Tuesday. By leveraging our explanation system, we identified that two features — <em>route</em> and <em>day of the week</em> — were contributing negatively to the predictions on Tuesdays. Further analysis revealed that this route had experienced an outage on a previous Tuesday, which the model had learned and encoded into the <em>route</em> and <em>day of the week</em> features. This raises an important question: should outage data be included in model training? This discovery opens up discussions with stakeholders and provides opportunities to further enhance our ML system.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*D3sMvWaKsWvoFmSt"></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pgodI89ibzBtgySJ"></figure><p>The explanation system not only demystifies our machine learning models but also fosters transparency and trust among our stakeholders, enabling more informed and confident decision-making.</p><h3>Wrapping Up</h3><p>At Netflix, we face the challenge of routing thousands of payment transactions per minute in our mission to entertain the world. To help meet this challenge, we introduced an observability framework and set of tools that allow us to open the ML black box and understand the intricacies of how we route billions of dollars of transactions in hundreds of countries every year. This has led to a massive reduction in operational complexity in addition to an improved transaction approval rate, while also allowing us to focus on innovation rather than operations.</p><p>Looking ahead, we are generalizing our solution with a standardized data schema. This will simplify applying our advanced ML observability tools to other models across various domains. 
By creating a versatile and scalable framework, we aim to empower ML developers to quickly deploy and improve models, bring transparency to stakeholders, and accelerate innovation.</p><p><em>We also thank </em><a href="https://www.linkedin.com/in/ckarthiksriram/"><em>Karthik Chandrashekar</em></a><em>, </em><a href="https://www.linkedin.com/in/zainabmir/"><em>Zainab Mahar Mir</em></a><em>, </em><a href="https://www.linkedin.com/in/joshuakaroly/"><em>Josh Karoly</em></a><em> and </em><a href="https://www.linkedin.com/in/joshkornpublicpolicy/"><em>Josh Korn</em></a><em> for their helpful suggestions.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=33073e260a38" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/ml-observability-bring-transparency-to-payments-and-beyond-33073e260a38">ML Observability: Bringing Transparency to Payments and Beyond</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/ml-observability-bring-transparency-to-payments-and-beyond-33073e260a38</link>
      <guid>https://netflixtechblog.com/ml-observability-bring-transparency-to-payments-and-beyond-33073e260a38</guid>
      <pubDate>Mon, 18 Aug 2025 20:15:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Accelerating Video Quality Control at Netflix with Pixel Error Detection]]></title>
      <description><![CDATA[<p><em>By </em><a href="https://www.isikdogan.com/"><em>Leo Isikdogan</em></a><em>, Jesse Korosi, Zile Liao, Nagendra Kamath, Ananya Poddar</em></p><p>At Netflix, we support the filmmaking process that merges creativity with technology. This includes reducing manual workloads wherever possible. Automating tedious tasks that take a lot of time while requiring very little creativity allows our creative partners to devote their time and energy to what matters most: creative storytelling.</p><p>With that in mind, we developed a new method for quality control (QC) that automatically detects pixel-level artifacts in videos, reducing the need for manual visual reviews in the early stages of QC.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/798/1*VN8kAKM6Hr1vYUT7itlauA.gif"><figcaption><em>Examples of detected pixel errors.</em></figcaption></figure><h3>Why This Matters</h3><p>Netflix is deeply invested in ensuring our content creators’ stories are accurately carried from production to screen. As such, we invest manual time and energy in reviewing for technical errors that could distract from our members’ immersion in and enjoyment of these stories.</p><p>Teams spend a lot of time manually reviewing every shot to identify any issues that could cause problems down the line. One of the problems they look for is tiny bright spots caused by malfunctioning camera sensors (often called hot or lit pixels). Flagging those issues is a painstaking and error-prone process. They can be hard to catch even when every single frame in a shot is manually inspected. 
And if left undetected, they can surface unexpectedly later in production, leading to labor-intensive and costly fixes.</p><p>By automating these QC checks, we help production teams spot and address issues sooner, reduce tedious manual searches, and address issues before they accumulate.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wgfs1OZHkGnf91WO"><figcaption><em>Proof of Concept: hours spent on full-frame manual QC vs. minutes with the automated workflow.</em></figcaption></figure><h3>Precision at the Pixel Level: Pixel Error Detection</h3><p>Pixel errors come in two main types:</p><ol><li>Hot (lit) pixels: single frame bright pixels</li><li>Dead (stuck) pixels: pixels that don’t respond to light</li></ol><p>Earlier work at Netflix addressed detecting dead pixels using techniques based on pixel intensity gradients and statistical comparisons [<a href="https://patents.google.com/patent/US11107206B2/en">1</a>, <a href="https://www.researchgate.net/publication/326104534_Shot_Change_and_Stuck_Pixel_Detection_of_Digital_Video_Assets">2</a>]. In this work, we focus on hot pixels, which are a lot harder to flag manually.</p><p>Hot pixels in a frame can occupy only a few pixels and appear for just a single frame. Imagine reviewing thousands of high-resolution video frames looking for hot pixels. To reduce manual effort, we built a highly efficient neural network to pinpoint pixel-level artifacts in real time. While detection of hot pixels is not entirely new in video production workflows, we do it at scale and with near-perfect recall rates.</p><p>Detecting artifacts at the pixel level requires the ability to identify small-scale, fine features in large images. 
It also requires leveraging temporal information to distinguish between actual pixel artifacts and naturally bright pixels with artifact-like features, such as small lights, catch lights, and other specular reflections.</p><p>Given those requirements, we designed a bespoke model for this task. Many mainstream computer vision models downsample inputs to reduce dimensionality, but pixel errors are sensitive to this. For example, if we downsample a 4K frame by 8x to 480p resolution, pixel-level errors almost entirely disappear. For that reason, our model processes large-scale inputs at full resolution rather than explicitly downsampling them in pre-processing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/865/1*LJSpRqkvB76OcIgP5yE6tw.png"><figcaption><em>Why Downsampling Fails. Top: Original frame crop showing a prominent hot pixel. Bottom: Same area after 8x downscaling, causing the artifact to become nearly invisible before it even reaches the model.</em></figcaption></figure><p>The network analyzes a window of five consecutive frames at a time, giving it the temporal context it needs to tell the difference between a one-off sensor glitch and a naturally bright object that persists across frames.</p><p>For every frame, the model outputs a continuous-valued map of pixel error occurrences at the input resolution. During training, we directly optimize those error maps by minimizing dense, pixel-wise loss functions.</p><p>During inference, our algorithm binarizes the model’s outputs using a confidence threshold, then performs connected component labeling to find clusters of pixel errors. 
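</p><p>The binarization and connected-component step can be sketched in plain Python. The threshold and the tiny error map are illustrative; the production system operates on full-resolution outputs on a GPU.</p>

```python
def find_clusters(error_map, threshold=0.5):
    """Binarize a 2D map of per-pixel error scores, then group adjacent
    above-threshold pixels into clusters via flood fill (4-connectivity)."""
    h, w = len(error_map), len(error_map[0])
    mask = [[v >= threshold for v in row] for row in error_map]
    seen = [[False] * w for _ in range(h)]
    clusters = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, comp = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp.append((cx, cy))  # store as (x, y)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                clusters.append(comp)
    return clusters

error_map = [[0.0, 0.9, 0.8],
             [0.0, 0.0, 0.0],
             [0.7, 0.0, 0.0]]
print(len(find_clusters(error_map)))  # -> 2 clusters
```

<p>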
Finally, it calculates the centroids of those clusters to report (x, y) locations of the found pixel errors.</p><p>All of this processing happens in real time on a single GPU.</p><h3>Building a Synthetic Pixel Error Generator</h3><p>Pixel errors are rare: relative to the total volume of footage captured and the full resolution of a given frame, they occupy a tiny fraction of video, both temporally and spatially. This makes them hard to annotate manually, and initially we had virtually no data to train our model. To overcome this, we developed a synthetic pixel error generator that closely mimicked real-world artifacts. We simulated two main types of pixel errors: symmetrical and curvilinear.</p><p><strong><em>Symmetrical:</em></strong> Most pixel errors are symmetrical along at least one axis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/809/1*Wjwzn4Xa6Wb_xUyEY-2ENw.png"><figcaption><em>Symmetrical Artifacts. Left: Three real examples of hot pixels. Right: Three synthetically generated hot pixel samples.</em></figcaption></figure><p><strong><em>Curvilinear:</em></strong> Some pixel errors follow curvilinear structures.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/814/1*CPgXIysLwFoyj7uwLLjeSw.png"><figcaption><em>Curvilinear Artifacts. Left: Three real examples of hot pixels. Right: Three synthetically generated hot pixel samples.</em></figcaption></figure><p>To create realistic training samples, we superimposed these synthetic errors onto frames from the Netflix catalog. We placed the artificial hot pixels where they would be most visible: dark, still areas in the scenes. Instead of sampling (x, y) coordinates for the synthetic errors uniformly, we sampled them from a heatmap, with selection probabilities determined by the amount of motion and image intensity.</p><p>Synthetic data was essential for training our initial model.
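</p><p>The heatmap-weighted placement described above can be sketched as follows; the exact weighting (inverse motion times darkness) is our assumption, since the post only says selection probabilities depend on motion and image intensity:</p>

```python
import numpy as np

def sample_error_locations(motion, intensity, n, seed=0):
    """Sample (x, y) spots for synthetic hot pixels, favoring dark, still regions."""
    rng = np.random.default_rng(seed)
    weights = (1.0 / (motion + 1e-6)) * (1.0 - intensity)  # still and dark => likely
    p = (weights / weights.sum()).ravel()
    flat = rng.choice(p.size, size=n, replace=False, p=p)   # heatmap-weighted draw
    ys, xs = np.unravel_index(flat, motion.shape)
    return list(zip(xs.tolist(), ys.tolist()))

# Degenerate demo: only one dark pixel, so it must be chosen.
motion = np.ones((8, 8))
intensity = np.ones((8, 8))
intensity[6, 2] = 0.0
print(sample_error_locations(motion, intensity, n=1))  # [(2, 6)]
```

<p>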
However, to close the domain gap and improve precision, we needed to run multiple tuning cycles on fresh, real-world footage.</p><p>After training an initial model solely on this synthetic data, we refined it iteratively with real-world data as follows:</p><ol><li>Inference: Run the model on previously unseen footage without any added synthetic hot pixels.</li><li>False Positive Elimination: Manually review detections and zero out labels for false positives, which is easier than labeling hot pixels from scratch.</li><li>Fine-tuning and Iteration: Fine-tune on the refined dataset and repeat until convergence.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HZ9nl1ReoUPbbZp0qIH__Q.png"><figcaption><em>Synthetic-to-Real Training Pipeline</em></figcaption></figure><p>While false positives represent a small percentage of total input volume, they can still constitute a meaningful number of alerts in absolute terms given the scale of content processing. We continue to refine our model and reduce false positives through ongoing application on real-world datasets. This synthetic-to-real refinement loop steadily reduces false alarms while preserving high sensitivity.</p><h3>Looking Ahead</h3><p>What once required hours of painstaking manual review can now potentially be completed in minutes, freeing creative teams to focus on what matters most: the art of storytelling. As we continue refining these capabilities through ongoing real-world deployment, we’re inspired by the many ways production teams can gain more time to build amazing stories for audiences around the world. 
We are also working with our partners to better understand how pixel errors affect the viewing experience, which will help us further optimize our models.</p><hr><p><a href="https://netflixtechblog.com/accelerating-video-quality-control-at-netflix-with-pixel-error-detection-47ef7af7ca2e">Accelerating Video Quality Control at Netflix with Pixel Error Detection</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/accelerating-video-quality-control-at-netflix-with-pixel-error-detection-47ef7af7ca2e</link>
      <guid>https://netflixtechblog.com/accelerating-video-quality-control-at-netflix-with-pixel-error-detection-47ef7af7ca2e</guid>
      <pubDate>Mon, 11 Aug 2025 23:29:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Behind the Streams: Live at Netflix. Part 1]]></title>
      <description><![CDATA[<div class="ac cb"><div class="ci bh hv hw hx hy"><div><div></div><p id="de94" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">By <a class="ag hb" href="https://www.linkedin.com/in/sfedov/" rel="noopener ugc nofollow" target="_blank">Sergey Fedorov</a>, <a class="ag hb" href="https://www.linkedin.com/in/phamchristopher/" rel="noopener ugc nofollow" target="_blank">Chris Pham</a>, <a class="ag hb" href="https://www.linkedin.com/in/flavioribeiro/" rel="noopener ugc nofollow" target="_blank">Flavio Ribeiro</a>, <a class="ag hb" href="https://www.linkedin.com/in/chrisnewton2/" rel="noopener ugc nofollow" target="_blank">Chris Newton</a>, and <a class="ag hb" href="https://www.linkedin.com/in/wei-wei-1571794/" rel="noopener ugc nofollow" target="_blank">Wei Wei</a></p><p id="13a1" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Many great ideas at Netflix begin with a question, and three years ago, we asked one of our boldest yet: if we were to entertain the world through Live — a format almost as old as television itself — how would <em class="oq">we</em> do it?</p><p id="0e1c" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">What began with an engineering plan to pave the path towards our first Live comedy special, <a class="ag hb" href="https://www.netflix.com/title/80167499" rel="noopener ugc nofollow" target="_blank">Chris Rock: Selective Outrage</a>, has since led to hundreds of Live events ranging from the biggest <a class="ag hb" href="https://www.netflix.com/tudum/articles/greatest-roast-of-all-time-tom-brady-live" rel="noopener ugc nofollow" target="_blank">comedy shows</a> and <a class="ag hb" href="https://about.netflix.com/en/news/nfl-christmas-day-games-on-netflix-average-over-30-million-global-viewers" rel="noopener ugc nofollow" 
target="_blank">NFL Christmas Games</a> to record-breaking <a href="https://about.netflix.com/en/news/jake-paul-vs-mike-tyson-over-108-million-live-global-viewers">boxing fights</a> and becoming the <a href="https://about.netflix.com/en/news/netflix-to-become-new-home-of-wwe-raw-beginning-2025">home of WWE</a>.</p><p id="1d28">In our series <em>Behind the Streams</em> — where we take you through the technical journey of our biggest bets — we will do a multi-part deep dive into the architecture of Live and what we learned while building it. Part one begins with the foundation we set for Live and the critical decisions that influenced our approach.</p><h1 id="9094">| But First: What Makes Live Streaming Different?</h1><p id="5adf">While Live as a television format is not new, the streaming experience we intended to build required capabilities we did not have at the time.
Despite 15 years of on-demand streaming under our belt, Live introduced new considerations influencing architecture and technology choices:</p><figure><img alt="" src="https://miro.medium.com/v2/resize:fit:1400/0*o1u1pYFC7BDuJSrT" width="700" height="408"><figcaption>References: 1. <a href="https://netflixtechblog.com/distributing-content-to-open-connect-3e3e391d4dc9">Content Pre-Positioning on Open Connect</a>, 2. <a href="https://www.infoq.com/presentations/load-balancing-netflix/">Load-Balancing Netflix Traffic at Global Scale</a></figcaption></figure><p id="6b2b">This means we had a lot to build to make Live work well on Netflix. That starts with making the right choices about the fundamentals of our Live architecture.</p><h1 id="c6a9">| Key Pillars of Netflix Live Architecture</h1><p id="f409">Our Live technology needed to extend the same promise to members that we’ve made with on-demand streaming: <strong>great quality</strong> on as <strong>many devices</strong> as possible <strong>without interruptions</strong>.
Live is one of many entertainment formats on Netflix, so we also needed to seamlessly blend Live events into the user experience, all while scaling to over 300 million global subscribers.</p><p id="ca8d">When we started, we had <strong>nine months</strong> until the first launch. While we needed to execute quickly, we also wanted to <strong>architect for future growth</strong> in both the <strong>magnitude</strong> and <strong>multitude</strong> of events. As a key principle, we leveraged our unique position of building support for a single product — Netflix — and having control over the full Live lifecycle, from Production to Screen.</p><figure><img alt="" src="https://miro.medium.com/v2/resize:fit:1400/0*eMwIzmDUqURUESqu" width="700" height="261"></figure><p id="acf2"><strong>Dedicated Broadcast Facilities to Ingest Live Content from Production</strong></p><p id="9608">Live events can happen anywhere in the world, but not every location has Live facilities or great connectivity. To ensure secure and reliable live signal transport, we leverage distributed, highly connected broadcast operations centers with specialized equipment for signal ingest and inspection, closed captioning, graphics, and advertisement management.
We prioritized <strong>repeatability</strong>, engineering our processes to launch live events consistently, reliably, and cost-effectively, and leaned into <strong>automation</strong> wherever possible. As a result, we have been able to reduce event-specific setup to the transmission between production and the Broadcast Operations Center, reusing the rest across events.</p><p id="f6b4"><strong>Cloud-based Redundant Transcoding and Packaging Pipelines</strong></p><p id="92de">The feed received at the Broadcast Center contains a fully produced program, but it still needs to be encoded and packaged for streaming on devices. We chose a cloud-based approach to allow for <strong>dynamic scaling</strong>, <strong>flexibility</strong> in configuration, and <strong>ease of integration</strong> with our Digital Rights Management (DRM), content management, and content delivery services already deployed in the cloud. We leverage AWS MediaConnect and AWS MediaLive to acquire feeds in the cloud and transcode them into various video quality levels, with bitrates tailored per show. We built a <strong>custom packager</strong> to better integrate with our delivery and playback systems.
We also built a <strong>custom Live Origin</strong> to ensure strict read and write SLAs for Live segments.</p><p id="7d57"><strong>Scaling Live Content Delivery to Millions of Viewers with Open Connect CDN</strong></p><p id="2393">For the produced media assets to be streamed, they need to be transferred from a few AWS locations, where Live Origin is deployed, to hundreds of millions of devices worldwide. We leverage Netflix’s CDN, <a href="https://www.theverge.com/22787426/netflix-cdn-open-connect">Open Connect</a>, to scale Live asset delivery. Open Connect servers are placed close to viewers at <strong>over 6K locations</strong> and connected to AWS via a <strong>dedicated Open Connect Backbone network</strong>.</p><figure><img alt="" src="https://miro.medium.com/v2/resize:fit:640/1*drjqrhSVFS7jfQzOBoG7dA.png" width="700" height="433"><figcaption>18K+ <em>servers in 6K+ locations, in Internet Exchanges, or embedded into ISP networks</em></figcaption></figure><figure><img alt="" src="https://miro.medium.com/v2/resize:fit:640/1*KWo6BzotdnxqoveWCvY12Q.png" width="700" height="349"><figcaption><em>Open Connect Backbone connects servers in Internet Exchange locations to 5 AWS regions</em></figcaption></figure><p id="11ff">By enabling Live delivery on Open Connect, we build on $1B+ of Netflix investment over the last 12 years in scaling the network and optimizing the performance of delivery servers. Sharing capacity across on-demand and Live viewership improves utilization, and caching past Live content on the same servers used for on-demand streaming makes it easy to enable catch-up viewing.</p><p id="e6ba"><strong>Optimizing Live Playback for Device Compatibility, Scale, Quality, and Stability</strong></p><p id="e965">To make Live accessible to the majority of our customers without upgrading their streaming devices, we settled on <strong>HTTPS</strong>-based Live streaming. While UDP-based protocols can provide additional features like ultra-low latency, HTTPS has ubiquitous device support and compatibility with our delivery and encoding systems.
Furthermore, we use the <strong>AVC</strong> and <strong>HEVC</strong> video codecs, transcode at multiple quality levels <strong>from SD up to 4K</strong>, and use a <strong>2-second segment</strong> duration to balance compression efficiency, infrastructure load, and latency. While prioritizing streaming quality and playback stability, we have also achieved industry-standard latency from camera to device, and we continue to improve it.</p><p id="104e">To configure playback, the device player receives a playback manifest at play start. The manifest contains items like the encoding bitrates and the CDN servers players should use. We deliver the manifest from the cloud instead of the CDN, as this allows us to personalize the configuration for each device. To reference segments of the stream, the manifest includes a segment template that devices use to map a wall-clock time to URLs on the CDN. Using a segment template instead of periodically polling for manifest updates minimizes network dependencies, CDN server load, and overhead on resource-constrained devices like smart TVs, improving both the scalability and the stability of our system. While streaming, the player monitors network performance and dynamically chooses the bitrate and CDN server, maximizing streaming quality while minimizing rebuffering.</p><p id="e72a"><strong>Run Discovery and Playback Control Services in the Cloud</strong></p><p id="e33b">So far, we have covered the streaming path from Camera to Device.
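</p><p>As an illustration, the segment-template idea from the previous section can be sketched as below; the template string, field name, and URL are hypothetical (the post only says devices map a wall-clock time to CDN segment URLs, with 2-second segments):</p>

```python
import math

def segment_url(template, epoch_start_s, wall_clock_s, segment_duration_s=2.0):
    """Map a wall-clock time to a live segment URL without re-fetching the manifest."""
    index = math.floor((wall_clock_s - epoch_start_s) / segment_duration_s)
    return template.format(index=index)

# Hypothetical manifest values:
tpl = "https://cdn.example.com/live/event42/video_1080p/{index}.m4s"
print(segment_url(tpl, epoch_start_s=1_700_000_000, wall_clock_s=1_700_000_123))
# index = floor(123 / 2) = 61
```

<p>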
To make the stream fully work, we also need to orchestrate across all systems, and ensure viewers can find and start the Live event. This functionality is performed by <strong class="nx ip">dozens of Cloud services</strong>, with functions like playback configuration, personalization, or metrics collection. These services tend to receive disproportionately <strong class="nx ip">higher loads around Live event start time</strong>, and Cloud deployment provides flexibility in dynamically scaling compute resources. Moreover, as Live demand tends to be localized, we are able to balance load across <strong class="nx ip">multiple AWS regions</strong>, better utilizing our global footprint. Deployment in the cloud also allows us to build a user experience where we embed Live content into a broader selection of entertainment options in the UI, like on-demand titles or Games.</p><p id="8c6e" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Centralize Real-time Metrics in the Cloud with Specialized Tools and Facilities</strong></p><p id="bce4" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">With control over ingest, encoding pipelines, the Open Connect CDN, and device players, we have nearly <strong class="nx ip">end-to-end observability</strong> into the Live workflow. During Live, we collect system and user metrics in real-time (e.g., where members see the title on Netflix and their quality of experience), alerting us to poor user experiences or degraded system performance. 
Our real-time monitoring is built using a mix of internally developed tools, such as <a class="ag hb" href="https://netflix.github.io/atlas-docs/" rel="noopener ugc nofollow" target="_blank">Atlas</a>, <a class="ag hb" href="https://netflix.github.io/mantis/" rel="noopener ugc nofollow" target="_blank">Mantis</a>, and <a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c" target="_blank" data-discover="true">Lumen</a>, and open-source technologies, such as Kafka and <a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/how-netflix-uses-druid-for-real-time-insights-to-ensure-a-high-quality-experience-19e1e8568d06" target="_blank" data-discover="true">Druid</a>, processing up to <strong class="nx ip">38 million events per second</strong> during some of our largest live events while providing critical metrics and operational insights in a matter of seconds. Furthermore, we set up dedicated <strong class="nx ip">“Control Center” facilities</strong>, which bring key metrics together to the operational team that monitors the event in real-time.</p><h1 id="6604" class="or os io bf ot ou ov ow gl ox oy oz gn pa pb pc pd pe pf pg ph pi pj pk pl pm bk">| Our key learnings so far</h1><p id="0598" class="pw-post-body-paragraph nv nw io nx b ny pn oa ob oc po oe of go pp oh oi gr pq ok ol gu pr on oo op hp bk">Building new functionality always brings fresh challenges and opportunities to learn, especially with a system as complex as Live. Even after three years, we’re still learning every day how to deliver Live events more effectively. 
Here are a few key highlights:</p><p id="3f57" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Extensive testing: </strong>Prior to Live we heavily relied on the predictable flow of on-demand traffic for pre-release canaries or A/B tests to validate deployments. But Live traffic was not always available, especially not at the scale representative of a big launch. As a result, we spent considerable effort to:</p><ol class=""><li id="3982" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op qo qp qq bk">Generate internal “test streams,” which engineers use to run <strong class="nx ip">integration</strong>, <strong class="nx ip">regression</strong>, or <strong class="nx ip">smoke tests</strong> as part of the development lifecycle.</li><li id="892f" class="nv nw io nx b ny qr oa ob oc qs oe of go qt oh oi gr qu ok ol gu qv on oo op qo qp qq bk">Build synthetic <strong class="nx ip">load testing</strong> capabilities to stress test cloud and CDN systems. 
We use 2 approaches, allowing us to generate up to <strong class="nx ip">100K starts-per-second</strong>:<br /> — Capture, modify, and replay past Live production traffic, representing a diversity of user devices and request patterns.<br /> — Virtualize Netflix devices and generate traffic against CDN or Cloud endpoints to test the impact of the latest changes across all systems.</li><li id="b6ec" class="nv nw io nx b ny qr oa ob oc qs oe of go qt oh oi gr qu ok ol gu qv on oo op qo qp qq bk">Run automated <strong class="nx ip">failure injection</strong>, forcing missing or corrupted segments from the encoding pipeline, loss of a cloud region, network drop, or server timeouts.</li></ol><p id="82d4" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Regular practice: </strong>Despite rigorous pre-release testing, nothing beats a production environment, especially when operating at scale. We learned that having a regular schedule with diverse Live content is essential to making improvements while balancing the risks of member impact. We run<a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/decision-making-at-netflix-33065fa06481" target="_blank" data-discover="true"> A/B tests</a>, perform <a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f" target="_blank" data-discover="true">chaos testing</a>, operational exercises, and train operational teams for upcoming launches.</p><p id="4319" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Viewership predictions:</strong> We use prediction-based techniques to pre-provision Cloud and CDN capacity, and share forecasts with our ISP and Cloud partners ahead of time so they can plan network and compute resources. 
Then we complement them with reactive scaling of cloud systems powering sign-up, log-in, title discovery, and playback services to account for viewership exceeding our predictions. We have found success with forward-looking real-time viewership predictions <em>during</em> a live event, allowing us to take mitigating steps earlier, before more members are impacted.</p><p id="3d52"><strong>Graceful degradation: </strong>Despite our best efforts, we can (and did!) find ourselves in a situation where viewership exceeded our predictions and provisioned capacity. For such cases, we developed a number of levers to continue streaming, even if it means gradually removing some nice-to-have features. For example, we use <a href="https://netflixtechblog.com/enhancing-netflix-reliability-with-service-level-prioritized-load-shedding-e735e6ce8f7d">service-level prioritized load shedding</a> to prioritize live traffic over non-critical traffic (like pre-fetch). Beyond that, we can lighten the experience, like dialing down personalization, disabling bookmarks, or lowering the maximum streaming quality. Our load tests include scenarios where we under-scale systems to validate the desired behavior.</p><p id="255c"><strong>Retry storms: </strong>When systems reach capacity, our key focus is to avoid cascading issues or further overloading systems with retries. Beyond system retries, users may retry manually — we’ve seen a 10x increase in traffic load due to stream restarts after viewing interruptions of as little as 30 seconds.
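</p><p>A generic sketch of a client retry policy combining exponential backoff, full jitter, and a server-supplied hint; the parameter values and names are illustrative assumptions, not Netflix’s actual player behavior:</p>

```python
import random

def retry_delays(attempts, base_s=1.0, cap_s=30.0, server_hint_s=None, seed=42):
    """Yield wait times between retries. A server hint (e.g. an HTTP Retry-After)
    overrides local exponential backoff; full jitter de-synchronizes clients so
    interrupted viewers don't all retry in the same instant."""
    rng = random.Random(seed)
    for attempt in range(attempts):
        if server_hint_s is not None:
            yield server_hint_s                    # server-guided backoff
        else:
            ceiling = min(cap_s, base_s * 2 ** attempt)
            yield rng.uniform(0, ceiling)          # full jitter

print(list(retry_delays(3, server_hint_s=5.0)))  # [5.0, 5.0, 5.0]
```

<p>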
We spent considerable time understanding device retry behavior in the presence of issues like network timeouts or missing segments. As a result, we implemented strategies like server-guided backoff for device retries, absorbing spikes via prioritized traffic shedding at Cloud Edge Gateway, and re-balancing traffic between cloud regions.</p><p id="ace2" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Contingency planning: </strong>“<em class="oq">Everyone has a plan until they get punched in the mouth</em>” is very relevant for Live. When something breaks, there is practically no time for troubleshooting. For large events, we set up <strong class="nx ip">in-person launch rooms</strong> with engineering owners of critical systems. For quick detection and response, we developed a small set of metrics as early indicators of issues, and have extensive runbooks for common operational issues. We don’t learn on launch day; instead, launch teams practice failure response via <strong class="nx ip">Game Day exercises ahead of time</strong>. Finally, our runbooks extend beyond engineering, covering escalation to executive leadership and coordination across functions like Customer Service, Production, Communications, or Social.</p><p id="018a" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Our commitment to enhancing the member experience doesn’t end at the “Thanks for Watching!” screen. Shortly after each live stream, we dive into metrics to identify areas for improvement. Our Data &amp; Insights team conducts comprehensive analyses, A/B tests, and consumer research to ensure the next event is even more delightful for our members. 
We leverage insights on member behavior, preferences, and expectations to refine the Netflix product experience and optimize our Live technology — like reducing latency by ~10 seconds through A/B tests, without affecting quality or stability.</p><h1 id="1acd" class="or os io bf ot ou ov ow gl ox oy oz gn pa pb pc pd pe pf pg ph pi pj pk pl pm bk">| What’s next on our Live journey?</h1><p id="de44" class="pw-post-body-paragraph nv nw io nx b ny pn oa ob oc po oe of go pp oh oi gr pq ok ol gu pr on oo op hp bk">Despite three years of effort, we are far from done! In fact, we are just getting started, actively building on the learnings shared above to deliver more joy to our members with Live events. To support the growing number of Live titles and new formats, like <a class="ag hb" href="https://www.netflix.com/tudum/articles/womens-world-cup-netflix" rel="noopener ugc nofollow" target="_blank">FIFA WWC in 2027</a>, we keep building our broadcast and delivery infrastructure and are actively working to further improve the Live experience.</p><p id="2732" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">In this post, we’ve provided a broad overview and have barely scratched the surface. 
In the upcoming posts, we will dive deeper into key pillars of our Live systems, covering our encoding, delivery, playback, and user experience investments in more detail.</p><p id="95fd" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Getting this far would not have been possible without the hard work of dozens of teams across Netflix, who collaborate closely to design, build, and operate Live systems: Operations and Reliability, Encoding Technologies, Content Delivery, Device Playback, Streaming Algorithms, UI Engineering, Search and Discovery, Messaging, Content Promotion and Distribution, Data Platform, Cloud Infrastructure, Tooling and Productivity, Program Management, Data Science &amp; Engineering, Product Management, Globalization, Consumer Insights, Ads, Security, Payments, Live Production, Experience and Design, Product Marketing and Customer Service, amongst many others.</p></div></div><div class="qa"><div class="ac cb"><div class="my qw mz qx na qy cf qz cg ra ci bh"><div class="pv pw px py pz ac lu"><figure class="mu qa rb rc rd re rf paragraph-image"><div role="button" tabindex="0" class="qb qc fl qd bh qe"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*yokgEf1StGt84gHMxQqtrQ.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*yokgEf1StGt84gHMxQqtrQ.jpeg" /><img alt="" class="bh fw qf c" width="500" height="3024" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></figure><figure class="mu qa rb rc rd re rf paragraph-image"><div role="button" tabindex="0" class="qb qc fl qd bh qe"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*C2TVMPryZ6NwH1-DgDy7Rw.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*C2TVMPryZ6NwH1-DgDy7Rw.jpeg" /><img alt="" class="bh fw qf c" width="500" 
height="3024" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></figure></div><div class="ac lu"><figure class="mu qa rb rc rd re rf paragraph-image"><div role="button" tabindex="0" class="qb qc fl qd bh qe"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*q4FIjjQDicGkzlz-ktzhKA.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*q4FIjjQDicGkzlz-ktzhKA.jpeg" /><img alt="" class="bh fw qf c" width="500" height="3024" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></figure><figure class="mu qa rb rc rd re rf paragraph-image"><div role="button" tabindex="0" class="qb qc fl qd bh qe"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*Zj_RUWUUlkBs3JiEul6AjQ.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*Zj_RUWUUlkBs3JiEul6AjQ.jpeg" /><img alt="" class="bh fw qf c" width="500" height="3024" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></figure></div></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/behind-the-streams-live-at-netflix-part-1-d23f917c2f40</link>
      <guid>https://netflixtechblog.com/behind-the-streams-live-at-netflix-part-1-d23f917c2f40</guid>
      <pubDate>Tue, 15 Jul 2025 18:04:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow]]></title>
      <description><![CDATA[<div class="ac cb"><div class="ci bh hv hw hx hy"><div><div></div><p id="10a0" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">By <a class="ag hb" href="https://www.linkedin.com/in/eugeneemelyanov/" rel="noopener ugc nofollow" target="_blank">Eugene Yemelyanau</a>, <a class="ag hb" href="https://www.linkedin.com/in/jake-grice/" rel="noopener ugc nofollow" target="_blank">Jake Grice</a></p><figure class="ot ou ov ow ox oy oq or paragraph-image"><div role="button" tabindex="0" class="oz pa fl pb bh pc"><div class="oq or os"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*5J0sqUrM77vCIDsM8wq_Dw.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*5J0sqUrM77vCIDsM8wq_Dw.jpeg" /><img alt="" class="bh fw pd c" width="700" height="368" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="ffb3" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk"><em class="qa">Introduction</em></h1><p id="2f88" class="pw-post-body-paragraph nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op hp bk"><a class="ag hb" href="http://tudum.com" rel="noopener ugc nofollow" target="_blank"><em class="qg">Tudum.com</em></a><em class="qg"> is Netflix’s official fan destination, enabling fans to dive deeper into their favorite Netflix shows and movies. Tudum offers exclusive first-looks, behind-the-scenes content, talent interviews, live events, guides, and interactive experiences. “Tudum” is named after the sonic ID you hear when pressing play on a Netflix show or movie. 
Attracting over 20 million members each month, Tudum is designed to enrich the viewing experience by offering additional context and insights into the content available on Netflix.</em></p><h1 id="655e" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk">Initial architecture</h1><p id="3da2" class="pw-post-body-paragraph nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op hp bk">At the end of 2021, when we envisioned Tudum’s implementation, we considered architectural patterns that would be maintainable, extensible, and well-understood by engineers. With the goal of building a flexible, configuration-driven system, we looked to <strong class="nx ip">server-driven UI</strong> (SDUI) as an appealing solution. SDUI is a design approach where the server dictates the structure and content of the UI, allowing for dynamic updates and customization without requiring changes to the client application. Client applications like web, mobile, and TV devices, act as rendering engines for SDUI data. After our teams weighed and vetted all the details, the dust settled and we landed on an approach similar to Command Query Responsibility Segregation (<a class="ag hb" href="https://www.geeksforgeeks.org/cqrs-command-query-responsibility-segregation/" rel="noopener ugc nofollow" target="_blank">CQRS</a>). At Tudum, we have two main use cases that CQRS is perfectly capable of solving:</p><ul class=""><li id="9a85" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op qh qi qj bk"><strong class="nx ip">Tudum’s editorial team</strong> brings exclusive interviews, first-look photos, behind the scenes videos, and many more forms of fan-forward content, and compiles it all into pages on the <a class="ag hb" href="http://tudum.com" rel="noopener ugc nofollow" target="_blank">Tudum.com</a> website. 
This content comes onto Tudum in the form of individually published pages, and content elements within the pages. In support of this, Tudum’s architecture includes a write path to store all of this data, including internal comments, revisions, version history, asset metadata, and scheduling settings.</li><li id="a323" class="nv nw io nx b ny qk oa ob oc ql oe of go qm oh oi gr qn ok ol gu qo on oo op qh qi qj bk"><strong class="nx ip">Tudum visitors</strong> consume published pages. In this case, Tudum needs to serve personalized experiences for our beloved fans, and accesses only the latest version of our content.</li></ul></div></div><div class="oy"><div class="ac cb"><div class="my qp mz qq na qr cf qs cg qt ci bh"><figure class="ot ou ov ow ox oy qv qw paragraph-image"><div role="button" tabindex="0" class="oz pa fl pb bh pc"><div class="oq or qu"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*i_LBGZ4i7QWeiDLES88HoA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*i_LBGZ4i7QWeiDLES88HoA.png" /><img alt="" class="bh fw pd c" width="1000" height="779" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qx ff qy oq or qz ra bf b bg ab du">Initial Tudum data architecture</figcaption></figure></div></div></div><div class="ac cb"><div class="ci bh hv hw hx hy"><p id="6413" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">The high-level diagram above focuses on storage &amp; distribution, illustrating how we leveraged Kafka to separate the write and read databases. The write database would store internal page content and metadata from our CMS. The read database would store read-optimized page content, for example: CDN image URLs rather than internal asset IDs, and movie titles, synopses, and actor names instead of placeholders. 
This content ingestion pipeline allowed us to regenerate all consumer-facing content on demand, applying new structure and data, such as global navigation or branding changes. The Tudum Ingestion Service converted internal CMS data into a read-optimized format by applying page templates, running validations, performing data transformations, and producing the individual content elements into a Kafka topic. The Data Service Consumer received the content elements from Kafka, stored them in a high-availability database (Cassandra), and acted as an API layer for the Page Construction service and other internal Tudum services to retrieve content.</p><p id="704d" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">A key advantage of decoupling read and write paths is the ability to scale them independently. It is a well-known architectural approach to connect both write and read databases using an event-driven architecture. As a result, content edits would <strong class="nx ip"><em class="qg">eventually</em></strong> appear on <a class="ag hb" href="http://tudum.com" rel="noopener ugc nofollow" target="_blank">tudum.com</a>.</p><h1 id="38cc" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk">Challenges with eventual consistency</h1><p id="05f4" class="pw-post-body-paragraph nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op hp bk">Did you notice the emphasis on “<strong class="nx ip"><em class="qg">eventually</em></strong>?” A major downside of this architecture was the delay between making an edit and observing that edit reflected on the website. 
For instance, when the team publishes an update, the following steps must occur:</p><ol class=""><li id="d25e" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op rb qi qj bk">Call the REST endpoint on the 3rd party CMS to save the data.</li><li id="3462" class="nv nw io nx b ny qk oa ob oc ql oe of go qm oh oi gr qn ok ol gu qo on oo op rb qi qj bk">Wait for the CMS to notify the Tudum Ingestion layer via a webhook.</li><li id="3b1e" class="nv nw io nx b ny qk oa ob oc ql oe of go qm oh oi gr qn ok ol gu qo on oo op rb qi qj bk">Wait for the Tudum Ingestion layer to query all necessary sections via API, validate data and assets, process the page, and produce the modified content to Kafka.</li><li id="140d" class="nv nw io nx b ny qk oa ob oc ql oe of go qm oh oi gr qn ok ol gu qo on oo op rb qi qj bk">Wait for the Data Service Consumer to consume this message from Kafka and store it in the database.</li><li id="d298" class="nv nw io nx b ny qk oa ob oc ql oe of go qm oh oi gr qn ok ol gu qo on oo op rb qi qj bk">Finally, after some <strong class="nx ip">cache refresh delay</strong>, this data would <strong class="nx ip"><em class="qg">eventually</em></strong> become available to the Page Construction service. 
Great!</li></ol><p id="4a60" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">By introducing a highly scalable, eventually consistent architecture, we lost the ability to quickly render changes after writing them — an important capability for internal previews.</p><p id="6b8e" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">In our performance profiling, we found the source of delay was our Page Data Service, which acted as a facade for an underlying <a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30" target="_blank" data-discover="true">Key Value Data Abstraction</a> database. Page Data Service utilized a <strong class="nx ip">near cache</strong> to accelerate page building and reduce read latencies from the database.</p><p id="9814" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">This cache was implemented to optimize the N+1 key lookups necessary for page construction by having a complete data set in memory. When engineers hear “<em class="qg">slow reads</em>,” the immediate answer is often “<em class="qg">cache</em>,” which is exactly what our team adopted. The KVDAL near cache can refresh in the background on every app node. Regardless of which system modifies the data, the cache is updated with each refresh cycle. If you have 60 keys and a refresh interval of 60 seconds, the near cache will update one key per second. This was problematic for previewing recent modifications, as these changes were only reflected with each cache refresh. 
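The refresh arithmetic above generalizes into a simple staleness bound: a background refresher that visits keys one at a time, spread evenly over the interval, can leave a just-edited key invisible for up to a full interval. This is a toy model of the behavior described, not the KVDAL near cache implementation itself.

```python
def near_cache_refresh_model(num_keys, refresh_interval_s):
    """Toy model of a background near cache that refreshes keys one at a
    time, spread evenly across the refresh interval.

    Returns (keys_refreshed_per_second, worst_case_staleness_s). In the worst
    case an edit lands just after its key was refreshed, so the edit waits a
    full interval before that key is visited again.
    """
    keys_refreshed_per_second = num_keys / refresh_interval_s
    worst_case_staleness_s = refresh_interval_s
    return keys_refreshed_per_second, worst_case_staleness_s
```

With 60 keys and a 60-second interval this gives one key refreshed per second, and an unlucky edit can wait about a minute to appear, which matches the preview delays described above.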
As Tudum’s content grew, cache refresh times increased, further extending the delay.</p><h1 id="0410" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk">RAW Hollow</h1><p id="76e2" class="pw-post-body-paragraph nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op hp bk">As this pain point grew, a new technology was being developed that would act as our silver bullet. <a class="ag hb" href="https://hollow.how/raw-hollow-sigmod.pdf" rel="noopener ugc nofollow" target="_blank">RAW Hollow</a> is an innovative in-memory, co-located, compressed object database developed by Netflix, designed to handle small to medium datasets with support for strong read-after-write consistency. It addresses the challenges of achieving consistent performance with low latency and high availability in applications that deal with less frequently changing datasets. Unlike traditional SQL databases or fully in-memory solutions, RAW Hollow offers a unique approach where the entire dataset is distributed across the application cluster and resides in the memory of each application process.</p><p id="3cf2" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">This design leverages compression techniques to scale datasets up to 100 million records per entity, ensuring extremely low latencies and high availability. RAW Hollow provides eventual consistency by default, with the option for strong consistency at the individual request level, allowing users to balance between high availability and data consistency. It simplifies the development of highly available and scalable stateful applications by eliminating the complexities of cache synchronization and external dependencies. 
This makes RAW Hollow a robust solution for efficiently managing datasets in environments like Netflix’s streaming services, where high performance and reliability are paramount.</p><h1 id="1a6a" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk">Revised architecture</h1><p id="e493" class="pw-post-body-paragraph nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op hp bk">Tudum was a perfect fit to battle-test RAW Hollow while it was pre-GA internally. Hollow’s high-density near cache significantly reduces I/O. Having our primary dataset in memory enables Tudum’s various microservices (page construction, search, personalization) to access data synchronously in O(1) time, simplifying architecture, reducing code complexity, and increasing fault tolerance.</p></div></div><div class="oy"><div class="ac cb"><div class="my qp mz qq na qr cf qs cg qt ci bh"><figure class="ot ou ov ow ox oy qv qw paragraph-image"><div role="button" tabindex="0" class="oz pa fl pb bh pc"><div class="oq or rc"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*XpvbAvfxMmfUq4oBC_E_BA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*XpvbAvfxMmfUq4oBC_E_BA.png" /><img alt="" class="bh fw pd c" width="1000" height="900" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qx ff qy oq or qz ra bf b bg ab du">Updated Tudum data architecture</figcaption></figure></div></div></div><div class="ac cb"><div class="ci bh hv hw hx hy"><p id="24fa" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">In our simplified architecture, we eliminated the Page Data Service, Key Value store, and Kafka infrastructure, in favor of RAW Hollow. 
By embedding the in-memory client directly into our read-path services, we avoid per-request I/O and reduce roundtrip time.</p><h1 id="6980" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk">Migration results</h1><p id="9b6a" class="pw-post-body-paragraph nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op hp bk">The updated architecture yielded a monumental reduction in data propagation times, and the reduced I/O led to faster request times as an added bonus. Hollow’s compression alleviated our concerns about our data being “too big” to fit in memory. Storing three years of unhydrated data requires only a 130MB memory footprint — 25% of its uncompressed size in an Iceberg table!</p><p id="adc5" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Writers and editors can preview changes in seconds instead of minutes, while still maintaining high-availability and in-memory caching for Tudum visitors — the best of both worlds.</p><p id="b810" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">But what about the faster request times? The diagram below illustrates the before &amp; after timing to fulfil a request for Tudum’s home page. All of Tudum’s read-path services leverage Hollow in-memory state, leading to significantly faster page construction and personalization. 
Controlling for factors like TLS, authentication, request logging, and WAF filtering, homepage construction time decreased from ~1.4 seconds to ~0.4 seconds!</p></div></div><div class="oy"><div class="ac cb"><div class="my qp mz qq na qr cf qs cg qt ci bh"><figure class="ot ou ov ow ox oy qv qw paragraph-image"><div role="button" tabindex="0" class="oz pa fl pb bh pc"><div class="oq or rd"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*o7LgLonS37PvGPY04iVMCg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*o7LgLonS37PvGPY04iVMCg.png" /><img alt="" class="bh fw pd c" width="1000" height="793" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qx ff qy oq or qz ra bf b bg ab du">Home page construction time</figcaption></figure></div></div></div><div class="ac cb"><div class="ci bh hv hw hx hy"><p id="f4b8" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">An attentive reader might notice that we have now tightly-coupled our Page Construction Service with the Hollow In-Memory State. This tight-coupling is used only in Tudum-specific applications. 
However, caution is needed if sharing the Hollow In-Memory Client with other engineering teams, as it could limit your ability to make schema changes or deprecations.</p><h1 id="a4d4" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk">Key Learnings</h1><ol class=""><li id="a421" class="nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op rb qi qj bk">CQRS is a powerful design paradigm for scale, if you can tolerate some eventual consistency.</li><li id="7771" class="nv nw io nx b ny qk oa ob oc ql oe of go qm oh oi gr qn ok ol gu qo on oo op rb qi qj bk">Minimizing the number of sequential operations can significantly reduce response times. I/O is often the main enemy of performance.</li><li id="c7fc" class="nv nw io nx b ny qk oa ob oc ql oe of go qm oh oi gr qn ok ol gu qo on oo op rb qi qj bk">Caching is complicated. Cache invalidation is a hard problem. By holding an entire dataset in memory, you can eliminate an entire class of problems.</li></ol><p id="2f6b" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">In the next episode, we’ll share how <a class="ag hb" href="http://tudum.com" rel="noopener ugc nofollow" target="_blank">Tudum.com</a> leverages Server Driven UI to rapidly build and deploy new experiences for Netflix fans. 
Stay tuned!</p><h1 id="5218" class="pe pf io bf pg ph pi pj gl pk pl pm gn pn po pp pq pr ps pt pu pv pw px py pz bk">Credits</h1><p id="5972" class="pw-post-body-paragraph nv nw io nx b ny qb oa ob oc qc oe of go qd oh oi gr qe ok ol gu qf on oo op hp bk">Thanks to <a class="ag hb" href="https://www.linkedin.com/in/koszewnik" rel="noopener ugc nofollow" target="_blank">Drew Koszewnik</a>, <a class="ag hb" href="https://www.linkedin.com/in/govindvenkatramankrishnan" rel="noopener ugc nofollow" target="_blank">Govind Venkatraman Krishnan</a>, <a class="ag hb" href="https://www.linkedin.com/in/nick-mooney-193849/" rel="noopener ugc nofollow" target="_blank">Nick Mooney</a></p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/netflix-tudum-architecture-from-cqrs-with-kafka-to-cqrs-with-raw-hollow-86d141b72e52</link>
      <guid>https://netflixtechblog.com/netflix-tudum-architecture-from-cqrs-with-kafka-to-cqrs-with-raw-hollow-86d141b72e52</guid>
      <pubDate>Thu, 10 Jul 2025 21:31:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Driving Content Delivery Efficiency Through Classifying Cache Misses]]></title>
      <description><![CDATA[<div><div></div><p id="7f7c" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">By <a class="ag hb" href="https://www.linkedin.com/in/mvipulbharat/" rel="noopener ugc nofollow" target="_blank">Vipul Marlecha</a>, <a class="ag hb" href="https://www.linkedin.com/in/lara-deek-79773966" rel="noopener ugc nofollow" target="_blank">Lara Deek</a>, <a class="ag hb" href="https://www.linkedin.com/in/thiaraortiz" rel="noopener ugc nofollow" target="_blank">Thiara Ortiz</a></p><p id="b29b" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><em class="oq">The mission of </em><a class="ag hb" href="https://openconnect.netflix.com/en/#what-is-open-connect" rel="noopener ugc nofollow" target="_blank"><em class="oq">Open Connect</em></a><em class="oq">, our dedicated content delivery network (CDN), is to deliver the best quality of experience (QoE) to our members. By localizing our Open Connect Appliances (OCAs), we bring Netflix content closer to the end user. This is achieved through close partnerships with internet service providers (ISPs) worldwide. Our ability to efficiently localize traffic, known as Content Delivery Efficiency, is a critical component of Open Connect’s service.</em></p><p id="8d23" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><em class="oq">In this post, we discuss one of the frameworks we use to evaluate our efficiency and identify sources of inefficiencies. 
Specifically, we classify the causes of traffic not being served from local servers, a phenomenon that we refer to as cache misses.</em></p><h2 id="fb9e" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">Why does Netflix have the Open Connect Program?</h2><p id="f869" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">The Open Connect Program is a cornerstone of Netflix’s commitment to delivering unparalleled QoE for our customers. By localizing traffic delivery from Open Connect servers at IX or ISP sites, we significantly enhance the speed and reliability of content delivery. The inherent latencies of data traveling across physical links, compounded by Internet infrastructure components like routers and network stacks, can disrupt a seamless viewing experience. Delays in video start times, reduced initial video quality, and the frustrating occurrence of buffering lead to an overall reduction in customer QoE. Open Connect empowers Netflix to maintain hyper-efficiency, ensuring a flawless client experience for new, latency-sensitive, on-demand content such as live streams and ads.</p><p id="827e" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Our custom-built servers, known as Open Connect Appliances (OCAs), are designed for both efficiency and cost-effectiveness. By logging detailed historical streaming behavior and using it to model and forecast future trends, we hyper-optimize our OCAs for long-term caching efficiency. 
We build methods to efficiently and reliably store, stream, and move our content.</p><p id="dae4" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">The mission of Open Connect hinges on our ability to effectively localize content on our OCAs globally, despite storage space that is limited and, by design, provisioned at specific sizes. This ensures that our cost and power efficiency metrics continue to improve, enhancing client QoE and reducing costs for our ISP partners. A critical question we continuously ask is: How do we evaluate and monitor which bytes should have been served from local OCAs but resulted in a cache miss?</p><p id="361a" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">The Anatomy of a Playback Request</strong></p><p id="03de" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Let us start by introducing the logic that directs or “steers” a specific Netflix client device to its dedicated OCA. 
The lifecycle from when a client device presses play until the video starts being streamed to that device is referred to as “playback.” Figure 1 illustrates the logical components involved in playback.</p><figure class="pi pj pk pl pm pn pf pg paragraph-image"><div role="button" tabindex="0" class="po pp fl pq bh pr"><div class="pf pg ph"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*VvQOBlOQLLAkBFOw" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*VvQOBlOQLLAkBFOw 640w, https://miro.medium.com/v2/resize:fit:720/0*VvQOBlOQLLAkBFOw 720w, https://miro.medium.com/v2/resize:fit:750/0*VvQOBlOQLLAkBFOw 750w, https://miro.medium.com/v2/resize:fit:786/0*VvQOBlOQLLAkBFOw 786w, https://miro.medium.com/v2/resize:fit:828/0*VvQOBlOQLLAkBFOw 828w, https://miro.medium.com/v2/resize:fit:1100/0*VvQOBlOQLLAkBFOw 1100w, https://miro.medium.com/v2/resize:fit:1400/0*VvQOBlOQLLAkBFOw 1400w" /><img alt="" class="bh fw ps c" width="700" height="405" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="ec11" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Figure 1:</strong> Components for Playback</p><p id="eb5d" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">The components involved in playback are important to understand as we elaborate on the concept of how we determine a cache miss versus hit. Independent of client requests, every OCA in our CDN periodically reports its capacity and health, learned BGP routes, and current list of stored files. All of this data is reported to the Cache Control Service (CCS). When a member hits the play button, this request is sent to our AWS services, specifically the Playback Apps service. After Playback Apps determines which files correspond to a specific movie request, it issues a request to “steer” the client’s playback request to OCAs via the Steering Service. The Steering Service, in turn, using the data reported from OCAs to CCS as well as other client information such as geo location, identifies the set of OCAs that can satisfy that client’s request. 
This set of OCAs is then returned in the form of rank-ordered URLs to the client device, the client connects to the top-ranked OCA and requests the files it needs to begin the video stream.</p><h2 id="11f7" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">What is a Cache Miss?</h2><p id="0d83" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">A cache miss occurs when bytes are not served from the best available OCA for a given Netflix client, independent of OCA state. For each playback request, the Steering Service computes a ranked list of local sites for the client, ordered by network proximity alone. This ranked list of sites is known as the “proximity rank.” Network proximity is determined based on the IP ranges (BGP routes) that are advertised by our ISP partners. Any OCA from the first “most proximal” site on this list is the most preferred and closest, having advertised the longest, most specific matching prefix to the client’s IP address. A cache miss is logged when bytes are not streamed from any OCA at this first local site, and we log when and why that happens.</p><p id="9760" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">It is important to note that our concept of cache misses is viewed from the client’s perspective, focusing on the optimal delivery source for the end user and prepositioning content accordingly, rather than relying on traditional CDN proxy caching mechanisms. Our “prepositioning” differentiator allows us to prioritize client QoE by ensuring content is served from the most optimal OCA.</p><p id="4340" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">We attribute cache misses to three logical categories. 
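To make the proximity rank concrete before we walk through the categories, here is a minimal sketch of ranking sites by the longest, most specific advertised prefix covering a client IP. The site names, routes, and the `proximity_rank` helper are our own illustrations, not Netflix internals.

```python
import ipaddress

def proximity_rank(client_ip, site_routes):
    """Rank sites by the most specific (longest) BGP prefix each site
    advertises that covers the client's IP. Rank 0 = most proximal."""
    client = ipaddress.ip_address(client_ip)
    matches = []
    for site, prefixes in site_routes.items():
        covering = [ipaddress.ip_network(p) for p in prefixes
                    if client in ipaddress.ip_network(p)]
        if covering:
            # A site's proximity is judged by its most specific covering route.
            matches.append((max(n.prefixlen for n in covering), site))
    # Longest prefix first -> proximity rank 0, 1, 2, ...
    return [site for _, site in sorted(matches, key=lambda m: -m[0])]

# Hypothetical routes advertised by ISP partner sites.
routes = {
    "site-a": ["198.51.100.0/24"],   # most specific covering prefix
    "site-b": ["198.51.0.0/16"],
    "site-c": ["203.0.113.0/24"],    # does not cover this client
}
print(proximity_rank("198.51.100.7", routes))  # ['site-a', 'site-b']
```

In this toy model, any OCA at the first site in the returned list is the "proximity rank zero" choice for that client.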
We delineate these categories because each one informs a parallel strategy for achieving content delivery efficiency.</p><ul class=""><li id="22ba" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op pt pu pv bk"><strong class="nx ip">Content Miss:</strong> This happens when the files were not found on OCAs in the local site. In previous articles like “<a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/content-popularity-for-open-connect-b86d56f613b" target="_blank" data-discover="true">Content Popularity for Open Connect</a>” and “<a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/distributing-content-to-open-connect-3e3e391d4dc9" target="_blank" data-discover="true">Distributing Content to Open Connect</a>,” we discuss how we decide what content to prioritize populating first onto our OCAs. A sample of the efforts this insight informs includes: (1) how accurately we predict the popularity of content, (2) how rapidly we pre-position that content, (3) how well we design our OCA hardware, and (4) how well we provision storage capacity at our locations of presence.</li><li id="6c6c" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op pt pu pv bk"><strong class="nx ip">Health Miss:</strong> This happens when the local site’s OCA hardware resources are becoming saturated and one or more OCAs cannot handle more traffic. As a result, we direct clients to other OCAs with capacity to serve that content. Each OCA has a control loop that monitors its bottleneck metrics (such as CPU and disk usage) and assesses its ability to serve additional traffic. 
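A control loop of this kind might be sketched as follows; the metric names and thresholds are illustrative assumptions, not actual OCA internals.

```python
# Illustrative saturation thresholds; real OCAs track their own bottlenecks.
LIMITS = {"cpu": 0.85, "disk_util": 0.90, "nic": 0.80}

def can_serve_more(metrics, limits=LIMITS):
    """Return (healthy, saturated_metrics): the OCA reports itself unable
    to absorb additional traffic once any bottleneck metric crosses its
    limit, so steering sends new clients elsewhere (a health miss)."""
    saturated = [name for name, limit in limits.items()
                 if metrics.get(name, 0.0) >= limit]
    return (not saturated, saturated)

healthy, why = can_serve_more({"cpu": 0.91, "disk_util": 0.42, "nic": 0.55})
print(healthy, why)  # False ['cpu']
```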
This is referred to as “OCA health.” Insight into health misses informs efforts such as: (1) how well we load balance traffic across OCAs with heterogeneous hardware resources, (2) how well we provision enough copies of highly popular content to distribute massive traffic, which is also tied to how accurately we predict the popularity of content, and (3) how well we preposition content to specific hardware components with varying traffic serve capabilities and bottlenecks.</li></ul><p id="88df" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Next we will dig into the framework we built to log and compute these metrics in real-time, with some extra attention to technical detail.</p><h2 id="354c" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">Cache Miss Computation Framework</h2><h2 id="7414" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">Logging Components</h2><p id="06fa" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">There are two critical data components that we log, gather, and analyze to compute cache misses:</p><ul class=""><li id="fef8" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op pt pu pv bk"><strong class="nx ip">Steering Playback Manifest Logs:</strong> Within the Steering Service, we compute and log the ranked list of sites for each client request, i.e. the “proximity rank” introduced earlier. We also enrich that list with information that reflects the logical decisions and filters our algorithms applied across all proximity ranks given that point-in-time state of our systems. 
This information allows us to replay/simulate any hypothetical scenario easily, such as to evaluate whether an outage across all sites in the first proximity rank would overwhelm sites in the second proximity rank, and many more such scenarios!</li><li id="d1c3" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op pt pu pv bk"><strong class="nx ip">OCA Server Logs:</strong> Once a Netflix client connects with an OCA to begin video streaming, the OCAs log any data regarding that streaming session, such as the files streamed and total bytes. All OCA logs are consolidated to identify which OCA(s) each client actually watched its video stream from, and the amount of content streamed.</li></ul><p id="0d1e" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">The above logs are joined for every Netflix client’s playback request to compute detailed cache miss metrics (in bytes and hours streamed) at different aggregation levels (such as per OCA, movie, file, encode type, country, and so on).</p><h2 id="508a" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">System Architecture</h2><p id="63f2" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">Figure 2 outlines how the logging components fit into the general engineering architecture that allows us to compute content miss metrics at low-latency and almost real-time.</p><figure class="pi pj pk pl pm pn pf pg paragraph-image"><div role="button" tabindex="0" class="po pp fl pq bh pr"><div class="pf pg qb"><picture><img 
src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*PlQ4xv4Si8iWGnW1" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*PlQ4xv4Si8iWGnW1 640w, https://miro.medium.com/v2/resize:fit:1400/0*PlQ4xv4Si8iWGnW1 1400w" sizes="(min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh fw ps c" width="700" height="354" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="8a58" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Figure 2:</strong> Components of the cache miss computation framework.</p><p id="b723" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">We will now describe the system requirements of each component.</p><ol class=""><li id="7a9e" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op qc pu pv bk"><strong class="nx ip">Log Emission</strong>: The logs for computing cache miss are emitted to Kafka clusters in each of our evaluated AWS regions, enabling us to send logs with the lowest possible latency. After a client device makes a playback request, the Steering Service generates a <em class="oq">steering playback manifest</em>, logs it, and sends the data to a Kafka cluster. Kafka is used for event streaming at Netflix because of its high-throughput event processing, low latency, and reliability. After the client device starts the video stream from an OCA, the OCA stores information about the bytes served for each file requested by each unique client playback stream. This data is what we refer to as <em class="oq">OCA server logs</em>.</li><li id="4c2c" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op qc pu pv bk"><strong class="nx ip">Log Consolidation</strong>: The logs emitted by the Steering Service and the OCAs can result in data for a single playback request being distributed across different AWS regions, because logs are recorded in geographically distributed Kafka clusters. 
<em class="oq">OCA server logs</em> might be stored in one region’s Kafka cluster while <em class="oq">steering playback manifest logs</em> are stored in another. One approach to consolidate data for a single playback is to build complex many-to-many joins. In streaming pipelines, performing these joins requires replicating logs across all regions, which leads to data duplication and increased complexity. This setup complicates downstream data processing and inflates operational costs due to multiple redundant cross-region data transfers. To overcome these challenges, we perform a cross-region transfer only once, consolidating all logs into a single region.</li><li id="89ce" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op qc pu pv bk"><strong class="nx ip">Log Enrichment</strong>: We enrich the logs during streaming joins with metadata using various slowly changing dimension tables and services so that we have the necessary information about the OCA and the played content.</li><li id="4bd0" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op qc pu pv bk"><strong class="nx ip">Streaming Window-Based Join</strong>: We perform a streaming window-based join to merge the <em class="oq">steering playback manifest logs</em> with the <em class="oq">OCA server logs</em>. Performing enrichment and log consolidation upstream allows for more seamless, uninterrupted joining of our log data sources.</li><li id="b8ac" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op qc pu pv bk"><strong class="nx ip">Cache Miss Calculations</strong>: After joining the logs, we compute the cache miss metrics. The computation checks whether the client played content from an OCA in the first site listed in the <em class="oq">steering playback manifest</em>’s proximity rank or from another site. 
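That check might be sketched as follows; the field names are our own shorthand for the joined record, not the production schema.

```python
def classify_playback(manifest, served_site):
    """Given the proximity-ranked sites from the steering playback manifest
    and the site the OCA server logs show actually served the stream,
    decide whether the playback was a cache miss."""
    ranked = manifest["proximity_ranked_sites"]
    rank = ranked.index(served_site)
    if rank == 0:
        return {"miss": False, "rank": 0}
    # Streamed from a less proximal site: a miss, attributed to the reason
    # the first-ranked site was skipped ("C" = content, "H" = health).
    return {"miss": True, "rank": rank,
            "reason": manifest["skip_reason"][ranked[0]]}

# Hypothetical joined record for one playback request.
manifest = {"proximity_ranked_sites": ["site-a", "site-b"],
            "skip_reason": {"site-a": "C"}}
print(classify_playback(manifest, "site-b"))  # miss at rank 1, reason "C"
```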
When a video stream occurs at a higher proximity rank, this indicates that a cache miss occurred.</li></ol><h1 id="f012" class="qd os io bf ot qe qf qg gl qh qi qj gn qk ql qm qn qo qp qq qr qs qt qu qv qw bk">Data Model to Evaluate Cache Misses</h1><p id="c8ef" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">One of the most exciting opportunities these logs enable (in these authors’ opinions) is the ability to replay our logic offline, in simulations with variable parameters, to reproduce production behavior under different conditions. This allows us to test new conditions, features, and hypothetical scenarios without impacting production Netflix traffic.</p><p id="c132" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">To achieve this, our data should satisfy two main conditions. First, the data should comprehensively represent the state of each distinct logical step involved in steering, including the decisions made and their reasons. This requires the underlying logic (here, the Steering Service) to be built in a modularized fashion, where each logical component overlays data from the prior component, resulting in a rich record of the system’s full state, which is finally logged. All of this must be achieved without adding perceivable latency to client playback requests! 
Second, the data should be in a format that supports near-real-time aggregate metrics for monitoring purposes.</p><p id="dae9" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Some components of our final, joined data model, which enables us to collect rich insights in a scalable and timely manner, are listed in Table 1.</p><p id="4cfa" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Table 1: Unified Data Model after joining <em class="oq">steering playback manifest</em> and <em class="oq">OCA server logs</em>.</strong></p><figure class="pi pj pk pl pm pn pf pg paragraph-image"><div role="button" tabindex="0" class="po pp fl pq bh pr"><div class="pf pg qx"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*7kbXh8GcB8P75TPWfsjTig.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*7kbXh8GcB8P75TPWfsjTig.png" /><img alt="" class="bh fw ps c" width="700" height="619" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h2 id="65ca" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">Cache Miss Computation Sample</h2><p id="ba38" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">Let us share an example of how we compute cache miss metrics. For a given unique client play request, we know we had a cache miss when the client streams from an OCA that is not in the client’s first proximity rank. As you can see from Table 1, each file needed for a client’s video streaming session is linked to routable OCAs and their corresponding sites with a proximity rank. Proximity ranks are zero-based indices, with proximity rank zero indicating the most optimal OCA for the client. 
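Using the per-file proximity rank together with the decision labels recorded in the manifest (described below), the streamed bytes can be bucketed into miss categories. A minimal sketch, with illustrative field names rather than the real table columns:

```python
def miss_bytes(records):
    """Aggregate streamed bytes per miss category: rank 0 means no miss;
    otherwise attribute the bytes to the decision label ("H" = health,
    "C" = content) logged for the skipped most proximal site."""
    totals = {"no_miss": 0, "H": 0, "C": 0}
    for r in records:
        if r["proximity_rank"] == 0:
            totals["no_miss"] += r["bytes_streamed"]
        else:
            totals[r["decision_label"]] += r["bytes_streamed"]
    return totals

# Hypothetical per-file rows for one playback session.
rows = [
    {"proximity_rank": 0, "bytes_streamed": 800},                         # optimal OCA
    {"proximity_rank": 1, "bytes_streamed": 150, "decision_label": "H"},  # health miss
    {"proximity_rank": 2, "bytes_streamed": 50,  "decision_label": "C"},  # content miss
]
print(miss_bytes(rows))  # {'no_miss': 800, 'H': 150, 'C': 50}
```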
“Proximity Rank Zero” indicates that the client connected to an OCA in the most preferred site(s), thus no misses occurred. Higher proximity ranks indicate a miss has occurred. The aggregation of all bytes and hours streamed from non-preferred sites constitutes a missed opportunity for Netflix and is reported in our cache miss metrics.</p><p id="ef7e" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Decision Labels and Bytes Sent</strong></p><p id="bd09" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">In the <em class="oq">steering playback manifest logs</em>, we record why we did not select an OCA for playback. These reasons are denoted by:</p><ul class=""><li id="7f1f" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op pt pu pv bk">“H”: Health miss.</li><li id="ae1f" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op pt pu pv bk">“C”: Content miss.</li></ul><p id="5c56" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk"><strong class="nx ip">Metrics Calculation and Categorization</strong></p><p id="891c" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">For each file needed for a client’s video streaming session, we can categorize the bytes streamed by the client into different types of misses:</p><ul class=""><li id="f864" class="nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op pt pu pv bk">No Miss: If proximity rank is zero, bytes were streamed from the optimal OCA.</li><li id="e378" class="nv nw io nx b ny pw oa ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op pt pu pv bk">Health Miss (“H”): Miss due to the OCA reporting high utilization.</li><li id="213c" class="nv nw io nx b ny pw oa 
ob oc px oe of go py oh oi gr pz ok ol gu qa on oo op pt pu pv bk">Content Miss (“C”): Miss due to the OCA not having the content available locally.</li></ul><h2 id="c8cd" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">How are miss metrics used to monitor our efficiency?</h2><p id="5903" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">Open Connect uses cache miss metrics to manage our Open Connect infrastructure. One of the team’s goals is to reduce the frequency of these cache misses, as they indicate that our members are being served by less proximal OCAs. By maintaining a detailed set of metrics that reveal the reasons behind cache misses, we can set up alerts to quickly identify when members are streaming from suboptimal locations. This is crucial because we operate a global CDN with millions of members worldwide and tens of thousands of servers.</p><p id="fca9" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">The figure below illustrates how we track the volume of total streaming traffic alongside the proportion of traffic streamed from less preferred locations due to content shedding. 
By calculating the ratio of content shed traffic to total streamed traffic, we derive a content shed ratio:</p><p id="1ea9" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">content shed ratio = content shed traffic / total streamed traffic</p><figure class="pi pj pk pl pm pn pf pg paragraph-image"><div role="button" tabindex="0" class="po pp fl pq bh pr"><div class="pf pg qy"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*qoDQqy_9y6mffW9C" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*qoDQqy_9y6mffW9C 640w, 
https://miro.medium.com/v2/resize:fit:1400/0*qoDQqy_9y6mffW9C 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh fw ps c" width="700" height="376" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="687d" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">This active monitoring of content shedding allows us to maintain a tight feedback loop to ensure the efficacy of our deployment and prediction algorithms, streaming traffic, and the QoE of our members. Given that content shedding can occur for multiple reasons, it is essential to have clear signals indicating when it happens, along with known and automated remediation strategies, such as mechanisms to quickly deploy mispredicted content onto OCAs. When special intervention is necessary to minimize shedding, we use it as an opportunity to enhance our systems as well as to ensure they are comprehensive in considering all known failure cases.</p><h2 id="20b3" class="or os io bf ot gk ou dy gl gm ov ea gn go ow gp gq gr ox gs gt gu oy gv gw oz bk">Conclusion</h2><p id="fde1" class="pw-post-body-paragraph nv nw io nx b ny pa oa ob oc pb oe of go pc oh oi gr pd ok ol gu pe on oo op hp bk">Open Connect’s unique strategy requires us to be incredibly efficient in delivering content from our OCAs. 
We closely track miss metrics to ensure we are maximizing the traffic our members stream from most proximal locations. This ensures we are delivering the best quality of experience to our members globally.</p><p id="e265" class="pw-post-body-paragraph nv nw io nx b ny nz oa ob oc od oe of go og oh oi gr oj ok ol gu om on oo op hp bk">Our methods for managing cache misses are evolving, especially with the introduction of new streaming types like Live and Ads, which have different streaming behaviors and access patterns compared to traditional video. We remain committed to identifying and seizing opportunities for improvement as we face new challenges.</p></div>]]></description>
      <link>https://netflixtechblog.com/driving-content-delivery-efficiency-through-classifying-cache-misses-ffcf08026b6c</link>
      <guid>https://netflixtechblog.com/driving-content-delivery-efficiency-through-classifying-cache-misses-ffcf08026b6c</guid>
      <pubDate>Wed, 02 Jul 2025 17:20:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[AV1 @ Scale: Film Grain Synthesis, The Awakening]]></title>
      <description><![CDATA[<div class="ac cb"><div class="ci bh hv hw hx hy"><div><div><h2 id="57a9" class="pw-subtitle-paragraph jm in io bf b jn jo jp jq jr js jt ju jv jw jx jy jz ka kb cq du"><em class="jl">Unleashing Film Grain Synthesis on Netflix and Enhancing Visuals for Millions</em></h2><div></div><p id="aa6f" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk"><a class="ag hb" href="https://www.linkedin.com/in/li-heng-chen-a75458a2/" rel="noopener ugc nofollow" target="_blank">Li-Heng Chen</a>, <a class="ag hb" href="https://www.linkedin.com/in/andreynorkin/" rel="noopener ugc nofollow" target="_blank">Andrey Norkin</a>, <a class="ag hb" href="https://www.linkedin.com/in/liwei-guo/" rel="noopener ugc nofollow" target="_blank">Liwei Guo</a>, <a class="ag hb" href="https://www.linkedin.com/in/henryzhili/" rel="noopener ugc nofollow" target="_blank">Zhi Li</a>, <a class="ag hb" href="https://www.linkedin.com/in/agataopalach/" rel="noopener ugc nofollow" target="_blank">Agata Opalach</a> and <a class="ag hb" href="https://www.linkedin.com/in/anush-moorthy-b8451142/" rel="noopener ugc nofollow" target="_blank">Anush Moorthy</a></p><figure class="pe pf pg ph pi pj pb pc paragraph-image"><div role="button" tabindex="0" class="pk pl fl pm bh pn"><div class="pb pc pd"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*7WHYl75ij_-W7YDLj2rz2w.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*7WHYl75ij_-W7YDLj2rz2w.png" /><img alt="" class="bh fw po c" width="700" height="371" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="358d" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk"><strong class="ok ip">Picture this: you’re watching a classic film, and the subtle dance of film grain 
adds a layer of authenticity and nostalgia to every scene.</strong> This grain, formed from tiny particles during the film’s development, is more than just a visual effect. It plays a key role in storytelling by enhancing the film’s depth and contributing to its realism. However, film grain is as elusive as it is beautiful. Its random nature makes it notoriously difficult to compress. Traditional compression algorithms struggle to manage it, often forcing a choice between preserving the grain and reducing file size.</p><p id="b381" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">In the digital age, noise remains a ubiquitous element in video content. Camera sensor noise introduces its own characteristics, while filmmakers often add intentional grain during post-production to evoke mood or a vintage feel. These elements create a visually rich experience that tests conventional compression methods.</p><p id="177f" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">We’re giving members globally a transformed streaming experience with the recent rollout of AV1 Film Grain Synthesis (FGS) streams. While FGS has been part of the AV1 standard since its inception, we only enabled it for a limited number of titles during <a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/bringing-av1-streaming-to-netflix-members-tvs-b7fc88e42320" target="_blank" data-discover="true">our initial launch of the AV1 codec in 2021</a>. Now, we’re enabling this innovative technology at scale, leveraging it to preserve the artistic integrity of film grain while optimizing data efficiency. 
In this blog post, we’ll explore how FGS revolutionizes video streaming and enhances your viewing experience.</p><h1 id="34c5" class="pp pq io bf pr ps pt jp gl pu pv js gn pw px py pz qa qb qc qd qe qf qg qh qi bk">Understanding Film Grain Synthesis in AV1</h1><p id="3ec2" class="pw-post-body-paragraph oi oj io ok b jn qj om on jq qk op oq go ql os ot gr qm ov ow gu qn oy oz pa hp bk">The AV1 Film Grain Synthesis tool models film grain through two key components, with model parameters estimated before the encoding of the denoised video:</p><p id="fefa" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk"><strong class="ok ip">Film Grain Pattern</strong>: an <em class="qo">auto-regressive (AR) model</em> is used to replicate the pattern of film grain. The key parameters are the AR coefficients, which can be estimated from the residual between the source video and the denoised video, essentially capturing the noise. This model captures the spatial correlation between the grain samples, ensuring that the noise characteristics of the original content are accurately preserved. By adjusting the auto-regressive coefficients {ai}, the model can control the grain’s shape, making it appear coarser or finer. With these coefficients, a 64x64 noise template is generated, as illustrated in the animation below. 
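In spirit, the template synthesis of Fig. 1 can be sketched as below. The coefficients, noise variance, and pure-Python rendering are illustrative assumptions; real AV1 FGS follows the bitstream specification exactly.

```python
import random

def synthesize_template(coeffs, size=64, sigma=1.0, seed=7):
    """Sketch of AR grain synthesis with lag L=1: each noise sample is a
    linear combination of four previously synthesized neighbors (left,
    top-left, top, top-right) plus white Gaussian noise (wgn)."""
    rng = random.Random(seed)
    a0, a1, a2, a3 = coeffs
    g = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            left = g[y][x - 1] if x > 0 else 0.0
            tl   = g[y - 1][x - 1] if x > 0 and y > 0 else 0.0
            top  = g[y - 1][x] if y > 0 else 0.0
            tr   = g[y - 1][x + 1] if y > 0 and x + 1 < size else 0.0
            g[y][x] = a0 * left + a1 * tl + a2 * top + a3 * tr + rng.gauss(0.0, sigma)
    return g

def random_patch(template, rng, patch=32):
    """During playback, random 32x32 patches are cut from the 64x64 template."""
    y = rng.randrange(len(template) - patch + 1)
    x = rng.randrange(len(template[0]) - patch + 1)
    return [row[x:x + patch] for row in template[y:y + patch]]

template = synthesize_template((0.3, -0.1, 0.3, 0.1))  # made-up AR coefficients
patch = random_patch(template, random.Random(0))
```

Larger positive coefficients spread correlation across neighbors, making the grain appear coarser; coefficients near zero leave it close to white noise.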
To construct the noise layer during playback, random 32x32 patches are extracted from the 64x64 noise template and added to the decoded video.</p><figure class="pe pf pg ph pi pj pb pc paragraph-image"><div role="button" tabindex="0" class="pk pl fl pm bh pn"><div class="pb pc qp"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*uNAjI0TwiFlfclpC" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*uNAjI0TwiFlfclpC 640w, https://miro.medium.com/v2/resize:fit:1400/0*uNAjI0TwiFlfclpC 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and 
(max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh fw po c" width="700" height="291" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qq ff qr pb pc qs qt bf b bg ab du">Fig. 1 The synthesis process of the 64x64 noise template using the simplest AR kernel with a lag parameter L=1. Each noise value is calculated as a linear combination of previously synthesized noise sample values, with AR coefficients a0, a1, a2, a3 and a white Gaussian noise (wgn) component.</figcaption></figure><p id="0ab0" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk"><strong class="ok ip">Film Grain Intensity</strong>: a <em class="qo">scaling function</em> is employed to control the grain’s appearance under varying lighting conditions. This function, estimated during the encoding process, models the relationship between pixel value and noise intensity using a piecewise linear function. This allows for precise adjustments to the grain strength based on video brightness and color. Consequently, the film grain strength is adapted to the areas of the picture, closely recreating the look of the original video. 
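A piecewise linear scaling function of this kind can be sketched as follows; the control points below are made up for illustration, not taken from any real FGS parameter set.

```python
import bisect

def make_scaling_function(points):
    """Build a piecewise linear function mapping a pixel value to a grain
    strength by interpolating between (pixel_value, scale) control points,
    analogous to the scaling function carried in the FGS parameters."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    def scale(v):
        if v <= xs[0]:
            return ys[0]
        if v >= xs[-1]:
            return ys[-1]
        i = bisect.bisect_right(xs, v)          # segment containing v
        t = (v - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])
    return scale

# Stronger grain in mid-tones, little in deep shadows and highlights.
scale = make_scaling_function([(0, 0.0), (64, 8.0), (128, 12.0), (255, 2.0)])
print(scale(96))  # halfway between 8.0 and 12.0 -> 10.0
```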
The animation below demonstrates how the grain intensity is adjusted by the scaling function:</p><figure class="pe pf pg ph pi pj pb pc paragraph-image"><div role="button" tabindex="0" class="pk pl fl pm bh pn"><div class="pb pc qu"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*juBpnCJo31CO0imqoAyTpA.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*juBpnCJo31CO0imqoAyTpA.gif" /><img alt="" class="bh fw po c" width="700" height="289" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qq ff qr pb pc qs qt bf b bg ab du">Fig. 2 Illustration of the scaling function’s impact on film grain intensity. Left: The scaling function graph showing the relationship between pixel value and scaling intensity. Right: A grayscale SMPTE bars frame with film grain applied according to the scaling function.</figcaption></figure><p id="c03f" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">With these models specified by AV1 standard, the encoding process first removes the film grain from the video. The standard does not mandate a specific method for this step, allowing users to choose their preferred denoiser. Following the denoising, the video is compressed, and the grain’s pattern and intensity are estimated and transmitted alongside the compressed video data. During playback, the film grain is recreated and reintegrated into the video using a block-based method. This approach is optimized for consumer devices, ensuring smooth playback and high-quality visuals. 
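The playback-side recreation described above (template synthesis with the simplest AR kernel from Fig. 1, then random 32x32 patching) might look like the following sketch. The coefficient values are illustrative, not ones an AV1 encoder would estimate, and the padding scheme is a simplification.

```python
import numpy as np

def synthesize_template(coeffs, sigma=1.0, size=64, seed=0):
    """Build a noise template with a lag L=1 AR kernel.

    Each sample is a linear combination of already-synthesized causal
    neighbours (left, top-left, top, top-right) plus white Gaussian
    noise, as in Fig. 1. Coefficients a0..a3 are illustrative.
    """
    rng = np.random.default_rng(seed)
    a0, a1, a2, a3 = coeffs
    # One sample of zero padding on the top/left/right keeps the
    # causal neighbourhood in bounds.
    t = np.zeros((size + 1, size + 2))
    for y in range(1, size + 1):
        for x in range(1, size + 1):
            t[y, x] = (a0 * t[y, x - 1]        # left
                       + a1 * t[y - 1, x - 1]  # top-left
                       + a2 * t[y - 1, x]      # top
                       + a3 * t[y - 1, x + 1]  # top-right
                       + rng.normal(0.0, sigma))
    return t[1:, 1:size + 1]

def random_patch(template, rng, patch=32):
    """Playback step: cut a random 32x32 patch out of the 64x64 template."""
    y, x = rng.integers(0, template.shape[0] - patch + 1, size=2)
    return template[y:y + patch, x:x + patch]

template = synthesize_template(coeffs=(0.25, 0.1, 0.25, 0.1))
patch = random_patch(template, np.random.default_rng(1))
```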
For a more detailed explanation, please refer to the <a class="ag hb" href="https://norkin.org/pdf/DCC_2018_AV1_film_grain.pdf" rel="noopener ugc nofollow" target="_blank">original paper</a>.</p><p id="e365" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">By combining these components, the AV1 Film Grain Synthesis tool preserves the artistic integrity of film grain while making the content “easier to compress” by denoising the source video prior to encoding. This process enables high-quality video streaming, even in content with heavy grain, resulting in significant bitrate savings and improved visual quality.</p><h1 id="3b4b" class="pp pq io bf pr ps pt jp gl pu pv js gn pw px py pz qa qb qc qd qe qf qg qh qi bk">Visual Quality Improvement, Bitrate Reduction, and Member Benefits</h1><p id="d1b0" class="pw-post-body-paragraph oi oj io ok b jn qj om on jq qk op oq go ql os ot gr qm ov ow gu qn oy oz pa hp bk">In our pursuit of premium streaming quality, enabling AV1 Film Grain Synthesis has led to significant bitrate reduction, allowing us to deliver high-quality video with less data while preserving the artistic integrity of film grain. 
Below, we showcase visual examples highlighting the improved quality and reduced bitrate, using a frame from the Netflix title <a class="ag hb" href="https://www.netflix.com/title/80996324" rel="noopener ugc nofollow" target="_blank"><em class="qo">They Cloned Tyrone</em></a>:</p></div></div><div class="pj"><div class="ac cb"><div class="nl qv nm qw nn qx cf qy cg qz ci bh"><figure class="pe pf pg ph pi pj rb rc paragraph-image"><div role="button" tabindex="0" class="pk pl fl pm bh pn"><div class="pb pc ra"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*6mrNIZrV9yJhsw3yn-ZfJQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*6mrNIZrV9yJhsw3yn-ZfJQ.png" /><img alt="" class="bh fw po c" width="1000" height="563" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qq ff qr pb pc qs qt bf b bg ab du">A source video frame from <em class="jl">They Cloned Tyrone</em></figcaption></figure><figure class="nh pj rb rc paragraph-image"><div role="button" tabindex="0" class="pk pl fl pm bh pn"><div class="pb pc ra"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*1GnNYWlnl343CHFdQAHEHw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*1GnNYWlnl343CHFdQAHEHw.png" /><img alt="" class="bh fw po c" width="1000" height="563" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qq ff qr pb pc qs qt bf b bg ab du">Regular AV1 (without FGS) @ 8274 kbps</figcaption></figure><figure class="nh pj rb rc paragraph-image"><div role="button" tabindex="0" class="pk pl fl pm bh pn"><div class="pb pc ra"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*STC86FuqtMqO_hbz-H3CBQ.png" alt="image" /><source data-testid="og" 
srcset="https://miro.medium.com/v2/resize:fit:640/1*STC86FuqtMqO_hbz-H3CBQ.png" /><img alt="" class="bh fw po c" width="1000" height="563" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qq ff qr pb pc qs qt bf b bg ab du">AV1 with FGS @ 2804 kbps</figcaption></figure></div></div></div><div class="ac cb"><div class="ci bh hv hw hx hy"><p id="2e2e" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">The visual comparison highlights a significant bitrate reduction of approximately 66%, with regular AV1 encoding at 8274 kbps compared to AV1 with FGS at 2804 kbps. In this example, which features strong film grain, it may be observed that the regular version exhibits distorted noise with a discrete cosine transform (DCT)-like pattern. In contrast, the FGS version preserves the integrity of the film grain at a lower bitrate.</p><p id="0d0a" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">Additionally, synthesized noise effectively masks compression artifacts, resulting in a more visually appealing experience. In this comparison below, both the regular AV1 stream and the AV1 FGS stream without synthesized noise (equivalent to compressing the denoised video) show compression artifacts. In contrast, the AV1 FGS stream with grain synthesis (rightmost figure) improves visual quality through contrast masking in human visual systems. 
The added film grain, a form of mask, effectively conceals some compression artifacts.</p></div></div><div class="pj"><div class="ac cb"><div class="nl qv nm qw nn qx cf qy cg qz ci bh"><div class="pe pf pg ph pi ac mh"><figure class="nh pj rd re rb rc rf paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*ms9UEY7w_LyQFv14kZJHWQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*ms9UEY7w_LyQFv14kZJHWQ.png" /><img alt="" class="bh fw po c" width="220" height="320" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></figure><figure class="nh pj rd re rb rc rf paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*um6VAX8lSeGHPTT--94rZA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*um6VAX8lSeGHPTT--94rZA.png" /><img alt="" class="bh fw po c" width="220" height="320" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></figure><figure class="nh pj rd re rb rc rf paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*mB9eWSgP9Fete-g-dqYSig.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*mB9eWSgP9Fete-g-dqYSig.png" /><img alt="" class="bh fw po c" width="220" height="320" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture><figcaption class="qq ff qr pb pc qs qt bf b bg ab du rg fl rh ri">Cropped frame comparison: Regular AV1 stream (Left), AV1 FGS stream <strong class="bf pr">without</strong> grain synthesis during decoding (Middle), and AV1 FGS stream with grain synthesis (Right).</figcaption></figure></div></div></div></div><div class="ac cb"><div class="ci bh hv hw hx hy"><p id="5092" class="pw-post-body-paragraph oi oj io ok b jn ol 
om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">Currently, we lack a dedicated quality model for film grain synthesis. The noise appearing at different pixel locations between the source and decoded video poses challenges for pixelwise comparison methods like <a class="ag hb" href="https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio" rel="noopener ugc nofollow" target="_blank">PSNR</a> or <a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652" target="_blank" data-discover="true">VMAF</a>, leading to penalized quality scores. Despite this, our internal assessment highlights the improvements in visual quality and the value of these advancements.</p><p id="2281" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">To evaluate the impact of AV1 Film Grain Synthesis, we selected approximately 300 titles from the Netflix catalog, each with varying levels of graininess. The bar chart below illustrates a 36% reduction in average bitrate for resolutions of 1080p and above when AV1 film grain synthesis is enabled, highlighting its efficacy in optimizing data usage. For resolutions below 1080p, the reduction in bitrate is relatively small, reaching only a 10% decrease, likely because noise is filtered out during the downscaling process. 
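The pixelwise-metric limitation noted above can be demonstrated with a toy experiment: two grain fields with identical statistics but independent realizations score poorly against each other, even though they look alike. Everything below is synthetic illustration, not a measurement on real streams.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames."""
    mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
clean = np.full((256, 256), 128.0)

# Same grain statistics, independent realizations -- analogous to source
# grain vs. grain re-synthesized at the decoder.
source = clean + rng.normal(0.0, 8.0, clean.shape)
resynth = clean + rng.normal(0.0, 8.0, clean.shape)

grainy_score = psnr(source, resynth)  # penalized: the noise never aligns
light_noise_score = psnr(clean, clean + rng.normal(0.0, 1.0, clean.shape))
```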
Furthermore, enabling the film grain synthesis coding tool consistently introduces syntax overhead to the bitstream.</p><figure class="pe pf pg ph pi pj pb pc paragraph-image"><div role="button" tabindex="0" class="pk pl fl pm bh pn"><div class="pb pc qp"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*fVfNc9DIh878py8G%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*fVfNc9DIh878py8G%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*fVfNc9DIh878py8G%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*fVfNc9DIh878py8G%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*fVfNc9DIh878py8G%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*fVfNc9DIh878py8G%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*fVfNc9DIh878py8G%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*fVfNc9DIh878py8G 640w, https://miro.medium.com/v2/resize:fit:720/0*fVfNc9DIh878py8G 720w, https://miro.medium.com/v2/resize:fit:750/0*fVfNc9DIh878py8G 750w, https://miro.medium.com/v2/resize:fit:786/0*fVfNc9DIh878py8G 786w, https://miro.medium.com/v2/resize:fit:828/0*fVfNc9DIh878py8G 828w, https://miro.medium.com/v2/resize:fit:1100/0*fVfNc9DIh878py8G 1100w, https://miro.medium.com/v2/resize:fit:1400/0*fVfNc9DIh878py8G 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh fw po c" width="700" height="296" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qq ff qr pb pc qs qt bf b bg ab du">Fig. 3: Comparison of average values across resolution categories between regular AV1 streams (without film grain synthesis) and AV1 streams with film grain synthesis enabled.</figcaption></figure><p id="1e09" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">Finally, we conducted A/B testing prior to rollout to understand the overall streaming impact of enabling AV1 Film Grain Synthesis. This testing showcased a smoother and more reliable Quality of Experience (QoE) for our members. 
The improvements include:</p><ul class=""><li id="242b" class="oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa rj rk rl bk"><strong class="ok ip">Lower Initial and Average Bitrate</strong>: Bitrate at the start of playback was reduced by 24% and average bitrate by 31.6%, lowering network bandwidth requirements and reducing storage needs for downloaded streams.</li><li id="9d6e" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk"><strong class="ok ip">Decreased Playback Errors</strong>: Playback error rate reduced by approximately 3%.</li><li id="dbc9" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk"><strong class="ok ip">Reduced Rebuffering</strong>: 10% fewer rebuffers and a 5% reduction in rebuffer duration resulting from the lower bitrate.</li><li id="9907" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk"><strong class="ok ip">Faster Start Play</strong>: Start play delay reduced by 10%, potentially due to the lower bitrate, which may help devices reach the target buffer level more quickly.</li><li id="66c9" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk"><strong class="ok ip">Improved Playback Stability</strong>: Observed 10% fewer noticeable bitrate drops and a 10% reduction in the time users spend adjusting their playback position during video playback, likely influenced by reduced bitrate and rebuffering.</li><li id="81d7" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk"><strong class="ok ip">Higher Resolution Streaming</strong>: About 0.7% of viewing hours shifted from lower resolutions (≤ 1080p) to 2160p on 4K-capable devices. 
This shift is attributed to reduced bitrates at switching points, which make it easier to achieve the highest resolution during a session.</li></ul><h1 id="cc76" class="pp pq io bf pr ps pt jp gl pu pv js gn pw px py pz qa qb qc qd qe qf qg qh qi bk">Behind the Scenes: Our Film Grain Adventure Continues</h1><p id="08cc" class="pw-post-body-paragraph oi oj io ok b jn qj om on jq qk op oq go ql os ot gr qm ov ow gu qn oy oz pa hp bk">We’re always excited to share our progress with the community. This blog provides an overview of our journey: from the initial launch of the AV1 codec to the recent addition of Film Grain Synthesis (FGS) streams, highlighting the impact these innovations have on Netflix’s streaming quality. Since March, we’ve been rolling out FGS at scale, and many users can now enjoy the FGS-enabled streams, provided their device supports this feature. We encourage you to watch some of the author’s favorite titles <a class="ag hb" href="https://www.netflix.com/title/81985186" rel="noopener ugc nofollow" target="_blank">The Hot Spot</a>, <a class="ag hb" href="https://www.netflix.com/title/70018511" rel="noopener ugc nofollow" target="_blank">Kung Fu Cult Master</a>, <a class="ag hb" href="https://www.netflix.com/title/70043379" rel="noopener ugc nofollow" target="_blank">Initial D</a>, <a class="ag hb" href="https://www.netflix.com/title/70005331" rel="noopener ugc nofollow" target="_blank">God of Gamblers II</a>, <a class="ag hb" href="https://www.netflix.com/title/80203996" rel="noopener ugc nofollow" target="_blank">Baahubali 2: The Conclusion</a>, or <a class="ag hb" href="https://www.netflix.com/title/81487660" rel="noopener ugc nofollow" target="_blank">Dept. 
Q</a> (you may need to toggle off HDR from the settings menu) on Netflix to experience the new FGS streams firsthand.</p><p id="2872" class="pw-post-body-paragraph oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa hp bk">In the next post, we will share how we did this in <a class="ag hb" rel="noopener ugc nofollow" href="https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359" target="_blank" data-discover="true">our video encoding pipeline</a>, detailing the process and insights we’ve gained. Stay tuned to the <a class="ag hb" href="https://netflixtechblog.com/" rel="noopener ugc nofollow" target="_blank">Netflix Tech Blog</a> for the latest updates.</p><h1 id="4c65" class="pp pq io bf pr ps pt jp gl pu pv js gn pw px py pz qa qb qc qd qe qf qg qh qi bk">Acknowledgments</h1><p id="950b" class="pw-post-body-paragraph oi oj io ok b jn qj om on jq qk op oq go ql os ot gr qm ov ow gu qn oy oz pa hp bk">This achievement is the result of a collaborative effort among several Open Connect teams at Netflix, including Video Algorithms, Media Encoding Pipeline, Media Foundations, Infrastructure Capacity Planning, and Open Connect Control Plane. We also received invaluable support from Client &amp; Partner Technologies, Streaming &amp; Discovery Experiences, Media Compute &amp; Storage Infrastructure, Data Science &amp; Engineering, and the Global Production Technology team. 
We would like to express our sincere gratitude to the following individuals for their contributions to the project’s success:</p><ul class=""><li id="5f2e" class="oi oj io ok b jn ol om on jq oo op oq go or os ot gr ou ov ow gu ox oy oz pa rj rk rl bk">Prudhvi Kumar Chaganti and Ken Thomas for the discussion and assistance on rollout strategy</li><li id="cb28" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk">Poojarani Chennai Natarajan, Lara Deek, Ivan Ivanov, and Ishaan Shastri for their essential support in planning and operations for Open Connect.</li><li id="0a02" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk">Alex Chang for his support in everything related to data analysis, and Jessica Tweneboah and Amelia Taylor for their assistance with A/B testing.</li><li id="2acf" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk">David Zheng, Janet Xue, Scott Bolter, Brian Li, Allan Zhou, Vivian Li, Sarah Kurdoghlian, Artem Danylenko, Greg Freedman, and many other dedicated team members played a crucial role in device certification and collaboration with device partners. Their efforts significantly improved compatibility across platforms. 
(Spoiler alert: this was one of the biggest challenges we faced for productizing AV1 FGS!)</li><li id="9eec" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk">Javier Fernandez-Ivern and Ritesh Makharia expertly managed the playback logic</li><li id="834e" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk">Joseph McCormick and JD Vandenberg for providing valuable insights from a content production point of view, and Alex ‘Ally’ Michaelson for assisting in monitoring customer service.</li><li id="06c6" class="oi oj io ok b jn rm om on jq rn op oq go ro os ot gr rp ov ow gu rq oy oz pa rj rk rl bk">A special thanks to Roger Quero, who played a key role in supporting various aspects of the project and contributed significantly to its overall success while he was at Netflix.</li></ul></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/av1-scale-film-grain-synthesis-the-awakening-ee09cfdff40b</link>
      <guid>https://netflixtechblog.com/av1-scale-film-grain-synthesis-the-awakening-ee09cfdff40b</guid>
      <pubDate>Wed, 02 Jul 2025 16:21:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix]]></title>
      <description><![CDATA[<div class="ac cb"><div class="ci bh ht hu hv hw"><div><div></div><p id="0456" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">By <a class="ag gz" href="https://www.linkedin.com/in/ahutter/" rel="noopener ugc nofollow" target="_blank">Alex Hutter</a>, <a class="ag gz" href="https://www.linkedin.com/in/bertails/" rel="noopener ugc nofollow" target="_blank">Alexandre Bertails</a>, <a class="ag gz" href="https://www.linkedin.com/in/clairezwang0612/" rel="noopener ugc nofollow" target="_blank">Claire Wang</a>, <a class="ag gz" href="https://www.linkedin.com/in/haoyuan-h-98b587134/" rel="noopener ugc nofollow" target="_blank">Haoyuan He</a>, <a class="ag gz" href="https://www.linkedin.com/in/kishore-banala/" rel="noopener ugc nofollow" target="_blank">Kishore Banala</a>, <a class="ag gz" href="https://www.linkedin.com/in/peterroyal/" rel="noopener ugc nofollow" target="_blank">Peter Royal</a>, <a class="ag gz" href="https://www.linkedin.com/in/shervinafshar/" rel="noopener ugc nofollow" target="_blank">Shervin Afshar</a></p><p id="e595" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">As Netflix’s offerings grow — across films, series, games, live events, and ads — so does the complexity of the systems that support it. Core business concepts like ‘actor’ or ‘movie’ are modeled in many places: in our Enterprise GraphQL Gateway powering internal apps, in our asset management platform storing media assets, in our media computing platform that powers encoding pipelines, to name a few. Each system models these concepts differently and in isolation, with little coordination or shared understanding. 
While they often operate on the same concepts, these systems remain largely unaware of that fact, and of each other.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow ox"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*wNYAhebbErEdYROL%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*wNYAhebbErEdYROL%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*wNYAhebbErEdYROL%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*wNYAhebbErEdYROL%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*wNYAhebbErEdYROL%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*wNYAhebbErEdYROL%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*wNYAhebbErEdYROL%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*wNYAhebbErEdYROL 640w, https://miro.medium.com/v2/resize:fit:720/0*wNYAhebbErEdYROL 720w, https://miro.medium.com/v2/resize:fit:750/0*wNYAhebbErEdYROL 750w, https://miro.medium.com/v2/resize:fit:786/0*wNYAhebbErEdYROL 786w, https://miro.medium.com/v2/resize:fit:828/0*wNYAhebbErEdYROL 828w, https://miro.medium.com/v2/resize:fit:1100/0*wNYAhebbErEdYROL 1100w, https://miro.medium.com/v2/resize:fit:1400/0*wNYAhebbErEdYROL 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Spider-Man Pointing meme with each Spider-Man labelled as: “it’s a movie”, “it’s a tv show”, “it’s a game”." class="bh fu pi c" width="700" height="467" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="9f61" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">As a result, several challenges emerge:</p><ul class=""><li id="c101" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc in">Duplicated and Inconsistent Models</strong> — Teams re-model the same business entities in different systems, leading to conflicting definitions that are hard to reconcile.</li><li id="e1ed" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk"><strong class="oc in">Inconsistent Terminology</strong> — Even within a single system, teams may use different terms for the same concept, or the same term for different concepts, making collaboration harder.</li><li id="b67d" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk"><strong class="oc in">Data Quality Issues</strong> — Discrepancies and broken references are hard to detect across our many microservices. 
While identifiers and foreign keys exist, they are inconsistently modeled and poorly documented, requiring manual work from domain experts to find and fix any data issues.</li><li id="86d8" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk"><strong class="oc in">Limited Connectivity</strong> — Within systems, relationships between data are constrained by what each system supports. Across systems, they are effectively non-existent.</li></ul><p id="c853" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">To address these challenges, we need new foundations that allow us to define a model once, at the conceptual level, and reuse those definitions everywhere. But it isn’t enough to just document concepts; we need to connect them to real systems and data. And more than just connect, we have to project those definitions outward, generating schemas and enforcing consistency across systems. The conceptual model must become part of the control plane.</p><p id="e618" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">These were the core ideas that led us to build UDA.</p><h1 id="03da" class="pr ps im bf pt pu pv pw gj px py pz gl qa qb qc qd qe qf qg qh qi qj qk ql qm bk">Introducing UDA</h1><p id="0faf" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk"><strong class="oc in">UDA (Unified Data Architecture)</strong> is the foundation for connected data in <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/netflix-studio-engineering-overview-ed60afcfa0ce" target="_blank" data-discover="true">Content Engineering</a>. 
It enables teams to model domains once and represent them consistently across systems — powering automation, discoverability, and <a class="ag gz" href="https://en.wikipedia.org/wiki/Semantic_interoperability" rel="noopener ugc nofollow" target="_blank">semantic interoperability</a>.</p><p id="5baf" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Using UDA, users and systems can:</strong></p><p id="105b" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Register and connect domain models </strong>— formal conceptualizations of federated business domains expressed as data.</p><ul class=""><li id="27d4" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc in">Why? </strong>So everyone uses the same official definitions for business concepts, which avoids confusion and stops different teams from rebuilding similar models in conflicting ways.</li></ul><p id="6a56" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Catalog and map domain models to data containers</strong>, such as GraphQL type resolvers served by a <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/open-sourcing-the-netflix-domain-graph-service-framework-graphql-for-spring-boot-92b9dcecda18" target="_blank" data-discover="true">Domain Graph Service</a>, <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873" target="_blank" data-discover="true">Data Mesh sources</a>, or Iceberg tables, through their representation as a graph.</p><ul class=""><li id="36c0" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc 
in">Why?</strong> To make it easy to find where the actual data for these business concepts lives (e.g., in which specific database, table, or service) and understand how it’s structured there.</li></ul><p id="b0a9" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Transpile domain models into schema definition languages</strong> like GraphQL, Avro, SQL, RDF, and Java, while preserving semantics.</p><ul class=""><li id="5c58" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc in">Why? </strong>To automatically create consistent technical data structures (schemas) for various systems directly from the domain models, saving developers manual effort and reducing errors caused by out-of-sync definitions.</li></ul><p id="d376" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Move data faithfully between data containers</strong>, such as from federated GraphQL entities to <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873" target="_blank" data-discover="true">Data Mesh</a> (a general purpose data movement and processing platform for moving data between Netflix systems at scale), Change Data Capture (CDC) sources to joinable Iceberg Data Products.</p><ul class=""><li id="f165" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc in">Why? </strong>To save developer time by automatically handling how data is moved and correctly transformed between different systems. 
This means less manual work to configure data movement, ensuring data shows up consistently and accurately wherever it’s needed.</li></ul><p id="1067" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Discover and explore domain concepts </strong>via search and graph traversal.</p><ul class=""><li id="2c8b" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc in">Why? </strong>So anyone can more easily find the specific business information they’re looking for, understand how different concepts and data are related, and be confident they are accessing the correct information.</li></ul><p id="c87b" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Programmatically introspect the knowledge graph</strong> using Java, GraphQL, or SPARQL.</p><ul class=""><li id="a8f7" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc in">Why?</strong> So developers can build smarter applications that leverage this connected business information, automate more complex data-dependent workflows, and help uncover new insights from the relationships in the data.</li></ul><p id="cb20" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">This post introduces the foundations of UDA</strong> as a knowledge graph, connecting domain models to data containers through mappings, and grounded in an in-house <a class="ag gz" href="https://en.wikipedia.org/wiki/Metamodeling#:~:text=A%20metamodel%2F%20surrogate%20model%20is,representing%20input%20and%20output%20relations" rel="noopener ugc nofollow" target="_blank">metamodel</a>, or model of models, called Upper. 
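To make the "model once, project everywhere" idea concrete, here is a deliberately tiny sketch: one conceptual model projected into both GraphQL SDL and an Avro-style record schema. The model format, type mappings, and helper names are invented for illustration and bear no relation to UDA's actual Upper metamodel or transpilers.

```python
# Toy model-to-schema projection: one conceptual definition, two outputs.
MODEL = {"Movie": {"title": "String", "runtimeMinutes": "Int"}}

AVRO_TYPES = {"String": "string", "Int": "int"}

def to_graphql(model):
    """Project the conceptual model into GraphQL SDL."""
    types = []
    for name, fields in model.items():
        body = "\n".join(f"  {f}: {t}!" for f, t in fields.items())
        types.append(f"type {name} {{\n{body}\n}}")
    return "\n\n".join(types)

def to_avro(model):
    """Project the same model into Avro-style record schemas."""
    return [
        {"type": "record", "name": name,
         "fields": [{"name": f, "type": AVRO_TYPES[t]}
                    for f, t in fields.items()]}
        for name, fields in model.items()
    ]

sdl = to_graphql(MODEL)
records = to_avro(MODEL)
```

Because both outputs derive from one definition, a change to the model propagates to every representation instead of being re-implemented per system.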
Upper defines the language for domain modeling in UDA and enables projections that automatically generate schemas and pipelines across systems.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow qs"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*j1I2cLD0vtfE9IQfNiUwVQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*j1I2cLD0vtfE9IQfNiUwVQ.png" /><img alt="Image of the UDA knowledge graph. A central node representing a domain model is connected to other nodes representing Data Mesh, GraphQL, and Iceberg data containers." class="bh fu pi c" width="700" height="723" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed">The same domain model can be connected to semantically equivalent data containers in the UDA knowledge graph.</figcaption></figure><p id="b7d0" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">This post also highlights two systems</strong> that leverage UDA in production:</p><p id="995a" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Primary Data Management (PDM)</strong> is our platform for managing authoritative reference data and taxonomies. PDM turns domain models into flat or hierarchical taxonomies that drive a generated UI for business users. 
These taxonomy models are projected into Avro and GraphQL schemas, automatically provisioning data products in the Warehouse and GraphQL APIs in the <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2" target="_blank" data-discover="true">Enterprise Gateway</a>.</p><p id="8584" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Sphere</strong> is our self-service operational reporting tool for business users. Sphere uses UDA to catalog and relate business concepts across systems, enabling discovery through familiar terms like ‘actor’ or ‘movie.’ Once concepts are selected, Sphere walks the knowledge graph and generates SQL queries to retrieve data from the warehouse, no manual joins or technical mediation required.</p><h2 id="ba47" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">UDA is a Knowledge Graph</h2><p id="b725" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk"><strong class="oc in">UDA needs to solve the </strong><a class="ag gz" href="https://en.wikipedia.org/wiki/Data_integration" rel="noopener ugc nofollow" target="_blank"><strong class="oc in">data integration</strong></a><strong class="oc in"> problem. </strong>We needed a data catalog unified with a schema registry, but with a hard requirement for <a class="ag gz" href="https://en.wikipedia.org/wiki/Semantic_integration#:~:text=Semantic%20integration%20is%20the%20process,from%20diverse%20sources" rel="noopener ugc nofollow" target="_blank">semantic integration</a>. 
Connecting business concepts to schemas and data containers in a graph-like structure, grounded in strong semantic foundations, naturally led us to consider a <a class="ag gz" href="https://en.wikipedia.org/wiki/Knowledge_graph" rel="noopener ugc nofollow" target="_blank">knowledge graph</a> approach.</p><p id="0e88" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">We chose RDF and SHACL as the foundation for UDA’s knowledge graph</strong>. But operationalizing them at enterprise scale surfaced several challenges:</p><ul class=""><li id="2065" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk"><strong class="oc in">RDF lacked a usable information model.</strong> While RDF offers a flexible graph structure, it provides little guidance on how to organize data into <a class="ag gz" href="https://www.w3.org/TR/rdf12-concepts/#dfn-named-graph" rel="noopener ugc nofollow" target="_blank">named graphs</a>, manage ontology ownership, or define governance boundaries. Standard <a class="ag gz" href="https://www.w3.org/2001/sw/wiki/Linking_patterns" rel="noopener ugc nofollow" target="_blank">follow-your-nose mechanisms</a> like owl:imports apply only to ontologies and don’t extend to named graphs; we needed a generalized mechanism to express and resolve dependencies between them.</li><li id="fd63" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk"><strong class="oc in">SHACL is not a modeling language for enterprise data.</strong> Designed to validate native RDF, SHACL assumes globally unique URIs and a single data graph. But enterprise data is structured around local schemas and typed keys, as in GraphQL, Avro, or SQL. 
SHACL could not express these patterns, making it difficult to model and validate real-world data across heterogeneous systems.</li><li id="bcdb" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk"><strong class="oc in">Teams lacked shared authoring practices.</strong> Without strong guidelines, teams modeled their ontologies inconsistently, breaking semantic interoperability. Even subtle differences in style, structure, or naming led to divergent interpretations and made transpilation harder to define consistently across schemas.</li><li id="f055" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk"><strong class="oc in">Ontology tooling lacked support for collaborative modeling.</strong> Unlike GraphQL Federation, ontology frameworks had no built-in support for modular contributions, team ownership, or safe federation. Most engineers found the tools and concepts unfamiliar, and available authoring environments lacked the structure needed for coordinated contributions.</li></ul><p id="4a57" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">To address these challenges, UDA adopts a named-graph-first information model.</strong> Each named graph conforms to a governing model, itself a named graph in the knowledge graph. This systematic approach ensures resolution and modularity, and enables governance across the entire graph.
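To picture the named-graph-first model, consider a deliberately tiny sketch, in plain Python with hypothetical graph names rather than UDA's actual vocabulary: every named graph declares the governing model it conforms to, and resolving a graph pulls in its whole governance chain.

```python
# Sketch of a named-graph-first information model (hypothetical graph names).
# Every named graph declares the governing model it conforms to, which is
# itself a named graph; resolving a graph pulls in its governance chain.

NAMED_GRAPHS = {
    "graph:onepiece-model":   {"conformsTo": "graph:upper"},       # a domain model
    "graph:onepiece-mapping": {"conformsTo": "graph:mappings-model"},
    "graph:mappings-model":   {"conformsTo": "graph:upper"},       # a system domain model
    "graph:upper":            {"conformsTo": "graph:upper"},       # the metamodel governs itself
}

def resolve(name):
    """Return the graph plus every governing graph it transitively depends on."""
    closure, frontier = [], [name]
    while frontier:
        g = frontier.pop()
        if g in closure:
            continue  # already resolved (handles Upper's self-reference)
        closure.append(g)
        frontier.append(NAMED_GRAPHS[g]["conformsTo"])
    return closure

print(resolve("graph:onepiece-mapping"))
```

Resolving the mapping graph yields the mapping itself, its governing system model, and the metamodel, which is one way to read the claim that the approach "ensures resolution."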
While a full description of UDA’s information infrastructure is beyond the scope of this post, the next sections explain how UDA bootstraps the knowledge graph with its metamodel and uses it to model data container representations and mappings.</p><h2 id="5d0e" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">Upper is Domain Modeling</h2><p id="440f" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk"><strong class="oc in">Upper is a language for formally describing domains — business or system — and their concepts</strong>. <a class="ag gz" href="https://en.wikipedia.org/wiki/Conceptualization_(information_science)" rel="noopener ugc nofollow" target="_blank">These concepts are organized into domain models</a>: controlled vocabularies that define classes of keyed entities, their attributes, and their relationships to other entities, which may be keyed or nested, within the same domain or across domains. Keyed concepts within a domain model can be organized in taxonomies of types, which can be as complex as the business or the data system needs them to be. Keyed concepts can also be extended from other domain models — that is, new attributes and relationships can be <a class="ag gz" href="https://tomgruber.org/writing/onto-design.pdf#page=4" rel="noopener ugc nofollow" target="_blank">contributed monotonically</a>. 
Finally, Upper ships with a rich set of datatypes for attribute values, which can also be customized per domain.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow ox"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*A_-GpZLvqbxuVdkH%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*A_-GpZLvqbxuVdkH%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*A_-GpZLvqbxuVdkH%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*A_-GpZLvqbxuVdkH%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*A_-GpZLvqbxuVdkH%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*A_-GpZLvqbxuVdkH%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*A_-GpZLvqbxuVdkH%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*A_-GpZLvqbxuVdkH 640w, https://miro.medium.com/v2/resize:fit:720/0*A_-GpZLvqbxuVdkH 720w, https://miro.medium.com/v2/resize:fit:750/0*A_-GpZLvqbxuVdkH 750w, https://miro.medium.com/v2/resize:fit:786/0*A_-GpZLvqbxuVdkH 786w, https://miro.medium.com/v2/resize:fit:828/0*A_-GpZLvqbxuVdkH 828w, https://miro.medium.com/v2/resize:fit:1100/0*A_-GpZLvqbxuVdkH 1100w, https://miro.medium.com/v2/resize:fit:1400/0*A_-GpZLvqbxuVdkH 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Visualization of the UDA graph representation of a One Piece character. The Character node in the graph is connected to a Devil Fruit node. The Devil Fruit node is connected to a Devil Fruit Type node." class="bh fu pi c" width="700" height="397" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed"><em class="re">The graph representation of the onepiece: domain model from our UI. Depicted here you can see how Characters are related to Devil Fruit, and that each Devil Fruit has a type.</em></figcaption></figure><p id="82dc" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Upper domain models are data</strong>. They are expressed as <a class="ag gz" href="https://www.w3.org/TR/rdf12-concepts/" rel="noopener ugc nofollow" target="_blank">conceptual RDF</a> and organized into named graphs, making them introspectable, queryable, and versionable within the UDA knowledge graph. 
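The "introspectable and queryable" claim is easy to picture with a toy. The sketch below, plain Python with hypothetical onepiece: identifiers rather than UDA's real API, treats a domain model as a set of RDF-style triples and runs a SPARQL-style pattern match over it:

```python
# Toy SPARQL-style introspection over a graph of (subject, predicate, object)
# triples. Identifiers like "onepiece:ate" are hypothetical illustrations.

TRIPLES = {
    ("onepiece:Luffy", "rdf:type", "onepiece:Character"),
    ("onepiece:Luffy", "onepiece:ate", "onepiece:GomuGomuNoMi"),
    ("onepiece:GomuGomuNoMi", "rdf:type", "onepiece:DevilFruit"),
    ("onepiece:GomuGomuNoMi", "onepiece:fruitType", "onepiece:Paramecia"),
}

def match(pattern, triples=TRIPLES):
    """Match one triple pattern; strings starting with '?' are variables.
    Returns a list of variable bindings, like a single-pattern SPARQL query."""
    bindings = []
    for triple in triples:
        env = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                env[term] = value       # bind the variable
            elif term != value:
                break                   # constant mismatch: not a match
        else:
            bindings.append(env)
    return bindings

# Which devil fruit did each character eat, and what is its type?
for row in match(("?who", "onepiece:ate", "?fruit")):
    types = match((row["?fruit"], "onepiece:fruitType", "?t"))
    print(row["?who"], "->", row["?fruit"], types[0]["?t"])
```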
This graph unifies not just the domain models themselves, but also the schemas they transpile to — GraphQL, Avro, Iceberg, Java — and the mappings that connect domain concepts to concrete data containers, such as GraphQL type resolvers served by a <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/open-sourcing-the-netflix-domain-graph-service-framework-graphql-for-spring-boot-92b9dcecda18" target="_blank" data-discover="true">Domain Graph Service</a>, <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873" target="_blank" data-discover="true">Data Mesh sources</a>, or Iceberg tables, through their representations. Upper raises the level of abstraction above traditional ontology languages: it defines a strict subset of <a class="ag gz" href="https://www.w3.org/2001/sw/wiki/Main_Page" rel="noopener ugc nofollow" target="_blank">semantic technologies</a> from the W3C tailored and generalized for domain modeling. It builds on ontology frameworks like RDFS, OWL, and SHACL so domain authors can model effectively without even needing to learn what an ontology is.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow rf"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*SGMUpJucEWhdlZsd4blz3A.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*SGMUpJucEWhdlZsd4blz3A.png" /><img alt="Screenshot of UDA UI showing domain model for One Piece serialized as Turtle." class="bh fu pi c" width="700" height="767" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed">UDA domain model for One Piece. 
<a class="ag gz" href="https://github.com/Netflix-Skunkworks/uda/blob/9627a97fcd972a41ec910be3f928ea7692d38714/uda-intro-blog/onepiece.ttl" rel="noopener ugc nofollow" target="_blank">Link to full definition</a>.</figcaption></figure><p id="eed1" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Upper is the metamodel for Connected Data in UDA — the model for all models</strong>. It is designed as a bootstrapping <a class="ag gz" href="https://en.wikipedia.org/wiki/Upper_ontology" rel="noopener ugc nofollow" target="_blank">upper ontology</a>, which means that Upper is <em class="rg">self-referencing</em>, because it models itself as a domain model; <em class="rg">self-describing</em>, because it defines the very concept of a domain model; and <em class="rg">self-validating</em>, because it conforms to its own model. This approach enables UDA to bootstrap its own infrastructure: Upper itself is projected into a generated Jena-based Java API and GraphQL schema used in a GraphQL service federated into Netflix’s Enterprise GraphQL gateway. These same generated APIs are then used by the projections and the UI.
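The self-referencing, self-describing, self-validating idea can be made concrete with a toy metamodel, sketched here in Python with illustrative names that are not Upper's actual structure: the metamodel defines what a domain model is, and passes its own conformance check.

```python
# Toy sketch of a self-describing metamodel (names are illustrative, not Upper's).
# A "model" here is a dict declaring the concepts it defines; the metamodel
# defines the concept of a DomainModel and is itself an instance of it.

UPPER = {
    "name": "upper",
    "instanceOf": "upper",            # self-referencing: Upper models itself
    "defines": {"DomainModel", "Concept", "Attribute", "Relationship"},
    "requiredKeys": {"name", "instanceOf", "defines"},
}

def conforms(model, metamodel=UPPER):
    """Self-validating check: the model carries the keys its metamodel
    requires and declares itself an instance of that metamodel."""
    return metamodel["requiredKeys"] <= model.keys() and \
        model["instanceOf"] == metamodel["name"]

onepiece = {"name": "onepiece", "instanceOf": "upper",
            "defines": {"Character", "DevilFruit"}}

print(conforms(onepiece))  # a domain model conforms to Upper
print(conforms(UPPER))     # Upper conforms to its own model
```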
Because all domain models are <a class="ag gz" href="https://en.wikipedia.org/wiki/Conservative_extension" rel="noopener ugc nofollow" target="_blank">conservative extensions</a> of Upper, other system domain models — including those for GraphQL, Avro, Data Mesh, and Mappings — integrate seamlessly into the same runtime, enabling consistent data semantics and interoperability across schemas.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow rh"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*5tJcW2A6lLrNi257%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*5tJcW2A6lLrNi257%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*5tJcW2A6lLrNi257%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*5tJcW2A6lLrNi257%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*5tJcW2A6lLrNi257%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*5tJcW2A6lLrNi257%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*5tJcW2A6lLrNi257%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*5tJcW2A6lLrNi257 640w, https://miro.medium.com/v2/resize:fit:720/0*5tJcW2A6lLrNi257 720w, https://miro.medium.com/v2/resize:fit:750/0*5tJcW2A6lLrNi257 750w, https://miro.medium.com/v2/resize:fit:786/0*5tJcW2A6lLrNi257 786w, 
https://miro.medium.com/v2/resize:fit:828/0*5tJcW2A6lLrNi257 828w, https://miro.medium.com/v2/resize:fit:1100/0*5tJcW2A6lLrNi257 1100w, https://miro.medium.com/v2/resize:fit:1400/0*5tJcW2A6lLrNi257 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Screenshot of an IDE. It shows Java code using the generated API from the Upper metamodel to traverse and print terms from a domain model at the top, while the bottom contains the output of an execution." class="bh fu pi c" width="700" height="801" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed">Traversing a domain model programmatically using the Java API generated from the Upper metamodel.</figcaption></figure><h2 id="67b9" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">Data Container Representations</h2><p id="168d" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk"><strong class="oc in">Data containers are repositories of information. </strong>They contain instance data that conform to their own schema languages or type systems: federated entities from GraphQL services, Avro records from Data Mesh sources, rows from Iceberg tables, or objects from Java APIs.
Each container operates within the context of a system that imposes its own structural and operational constraints.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow ri"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*qUzAb6-TC2HL8qAWAW1Xlw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*qUzAb6-TC2HL8qAWAW1Xlw.png" /><img alt="Screenshot of a UI showing details for a Data Mesh Source containing One Piece Characters." class="bh fu pi c" width="700" height="759" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed">A Data Mesh source is a data container.</figcaption></figure><p id="707f" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Data container </strong><a class="ag gz" href="https://en.wikipedia.org/wiki/Knowledge_representation_and_reasoning" rel="noopener ugc nofollow" target="_blank"><strong class="oc in">representations</strong></a><strong class="oc in"> are data.</strong> They are faithful interpretations of the members of data systems as graph data. UDA captures the definition of these systems as their own domain models, the system domains. These models encode both the information architecture of the systems and the schemas of the data containers within. 
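To give a feel for such an interpretation, here is a deliberately simplified sketch, with hypothetical predicates and structures rather than UDA's actual system domain models, of reading an Avro-like Data Mesh schema into graph data:

```python
# Rough illustration: interpreting a data container's schema as graph data.
# Predicates like "datamesh:field" are hypothetical, not UDA's vocabulary.

# An Avro-like schema for a Data Mesh source containing characters.
avro_schema = {
    "type": "record",
    "name": "CharacterEvent",
    "fields": [{"name": "character_id", "type": "string"},
               {"name": "devil_fruit_id", "type": ["null", "string"]}],
}

def represent(container_uri, schema):
    """Translate a container and its schema into graph triples."""
    triples = [(container_uri, "rdf:type", "datamesh:Source"),
               (container_uri, "datamesh:schemaName", schema["name"])]
    for field in schema["fields"]:
        field_uri = f"{container_uri}#{field['name']}"
        triples.append((container_uri, "datamesh:field", field_uri))
        # a union with "null" marks the field as optional in the representation
        optional = isinstance(field["type"], list) and "null" in field["type"]
        triples.append((field_uri, "datamesh:optional", optional))
    return triples

for t in represent("container:onepiece-source", avro_schema):
    print(t)
```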
They provide a blueprint for translating the systems into graph representations.</p></div></div><div class="pd"><div class="ac cb"><div class="nd rj ne rk nf rl cf rm cg rn ci bh"><figure class="oy oz pa pb pc pd ro rp paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow ox"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*6QzelmSRrIj1G881%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*6QzelmSRrIj1G881%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*6QzelmSRrIj1G881%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*6QzelmSRrIj1G881%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*6QzelmSRrIj1G881%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*6QzelmSRrIj1G881%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*6QzelmSRrIj1G881%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*6QzelmSRrIj1G881 640w, https://miro.medium.com/v2/resize:fit:720/0*6QzelmSRrIj1G881 720w, https://miro.medium.com/v2/resize:fit:750/0*6QzelmSRrIj1G881 750w, https://miro.medium.com/v2/resize:fit:786/0*6QzelmSRrIj1G881 786w, https://miro.medium.com/v2/resize:fit:828/0*6QzelmSRrIj1G881 828w, https://miro.medium.com/v2/resize:fit:1100/0*6QzelmSRrIj1G881 1100w, https://miro.medium.com/v2/resize:fit:2000/0*6QzelmSRrIj1G881 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /><img alt="Screenshot of an IDE showing two files open side by side. On the left is a system domain model for Data Mesh. On the right is a representation of a Data Mesh source containing One Piece Character data." class="bh fu pi c" width="1000" height="604" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed"><em class="re">Side-by-side view of a data container schema and its graph representation. </em><a class="ag gz" href="https://github.com/Netflix-Skunkworks/uda/blob/9627a97fcd972a41ec910be3f928ea7692d38714/uda-intro-blog/onepiece_character_data_container.ttl" rel="noopener ugc nofollow" target="_blank"><em class="re">Link to full data container representation</em></a><em class="re">.</em></figcaption></figure></div></div></div><div class="ac cb"><div class="ci bh ht hu hv hw"><p id="c76d" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">UDA catalogs the data container representations into the knowledge graph.</strong> It records the coordinates and metadata of the underlying data assets, but unlike a traditional catalog, it only tracks assets that are semantically connected to domain models. This enables users and systems to connect concepts from domain models to the concrete locations where corresponding instance data can be accessed.
Those connections are called <em class="rg">Mappings</em>.</p><h2 id="31d6" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">Mappings</h2><p id="886e" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk"><strong class="oc in">Mappings are data that connect domain models to data containers.</strong> Every element in a domain model is addressable, from the domain model itself down to specific attributes and relationships. Likewise, data container representations make all components addressable, from an Iceberg table to an individual column, or from a GraphQL type to a specific field. A Mapping connects nodes in a subgraph of the domain model to nodes in a subgraph of a container representation. Visually, the Mapping is the set of arcs that link those two graphs together.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow rq"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*it3X5Vu8plWX5QvN_AJkgw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*it3X5Vu8plWX5QvN_AJkgw.png" /><img alt="Screenshot of UDA UI showing a mapping between a concept in UDA and a Data Mesh Source." class="bh fu pi c" width="700" height="695" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed"><em class="re">A mapping between a domain model and a Data Mesh Source from the UDA UI. 
</em><a class="ag gz" href="https://github.com/Netflix-Skunkworks/uda/blob/9627a97fcd972a41ec910be3f928ea7692d38714/uda-intro-blog/onepiece_character_mappings.ttl" rel="noopener ugc nofollow" target="_blank"><em class="re">Link to full mapping</em></a><em class="re">.</em></figcaption></figure><p id="900c" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Mappings enable discovery.</strong> Starting from a domain concept, users and systems can walk the knowledge graph to find where that concept is materialized — in which data system, in which container, and even how a specific attribute or relationship is physically accessed. The inverse is also supported: given a data container, one can trace back to the domain concepts it participates in.</p><p id="d391" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Mappings shape UDA’s approach to semantic data integration.</strong> Most existing schema languages are not expressive enough to capture the richer semantics of a domain needed for data integration (<a class="ag gz" href="https://doi.org/10.1007/978-3-319-49340-4_8" rel="noopener ugc nofollow" target="_blank">for example</a>, “accessibility of data, providing semantic context to support its interpretation, and establishing meaningful links between data”). A trivial example is Avro’s lack of built-in facilities to represent foreign keys, which makes it very hard to express how entities relate across Data Mesh sources.
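As a toy illustration of both points, the discovery walk and the missing foreign key, here is a Mapping recorded as data, in Python with hypothetical identifiers: it links a domain relationship to plain key columns that Avro alone cannot relate, and it can be walked in either direction.

```python
# Toy sketch of a Mapping as data: arcs linking domain-model elements to
# data-container elements. All identifiers are hypothetical, not UDA's.

MAPPINGS = [
    # domain element                  -> data container element
    ("onepiece:Character",            "container:characters"),
    ("onepiece:DevilFruit",           "container:devil-fruits"),
    # the relationship Avro cannot express as a foreign key:
    ("onepiece:Character/devilFruit", "container:characters#devil_fruit_id"),
]

def where_materialized(domain_element):
    """Walk forward: where is this concept, attribute, or relationship stored?"""
    return [c for d, c in MAPPINGS if d == domain_element]

def concepts_for(container_element):
    """Walk backward: which domain elements does this container element hold?"""
    return [d for d, c in MAPPINGS if c == container_element]

print(where_materialized("onepiece:Character/devilFruit"))
print(concepts_for("container:devil-fruits"))
```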
Mappings, together with the corresponding system domain models, allow for such relationships, and many other constraints, to be defined in the domain models and used programmatically in actual data systems.</p><p id="936f" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Mappings enable intent-based automation.</strong> Data is not always available in the systems where consumers need it. Because Mappings encode both meaning and location, UDA can reason about how data should move while preserving semantics, without requiring the consumer to specify how it should be done. Beyond the cataloging use case of connecting to existing containers, UDA automatically derives <em class="rg">canonical Mappings</em> from registered domain models as part of the projection process.</p><h2 id="4cec" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">Projections</h2><p id="b219" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk"><strong class="oc in">A projection produces a concrete data container.</strong> These containers, such as a GraphQL schema or a Data Mesh source, implement the characteristics derived from a registered domain model.
Each projection is a concrete realization of Upper’s denotational semantics, ensuring <a class="ag gz" href="https://en.wikipedia.org/wiki/Semantic_interoperability" rel="noopener ugc nofollow" target="_blank">semantic interoperability</a> across all containers projected from the same domain model.</p><p id="c0d9" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Projections produce consistent public contracts across systems.</strong> The data containers generated by projections encode data contracts in the form of schemas, derived by transpiling a domain model into the target container’s schema language. UDA currently supports transpilation to GraphQL and Avro schemas.</p><p id="10a4" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">The GraphQL transpilation produces a schema that adheres to the <a class="ag gz" href="https://spec.graphql.org/October2021/#sec-Overview" rel="noopener ugc nofollow" target="_blank">official GraphQL spec</a> and can generate all GraphQL types defined in the spec. Given that the UDA domain model can be federated, it also supports generating federated GraphQL schemas.
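To show the shape of such a transpilation, here is a heavily simplified sketch in Python, with a hypothetical in-memory model format; the real transpiler covers far more of the GraphQL spec, including federation directives:

```python
# Minimal sketch of transpiling a keyed domain concept into GraphQL SDL.
# The concept dict format is hypothetical, invented for this illustration.

DATATYPES = {"string": "String", "int": "Int", "boolean": "Boolean"}

def to_graphql(concept):
    """Emit a GraphQL object type for one keyed domain concept."""
    lines = [f"type {concept['name']} {{"]
    lines.append(f"  {concept['key']}: ID!")           # keyed concepts get an ID
    for attr, dtype in concept["attributes"].items():
        lines.append(f"  {attr}: {DATATYPES[dtype]}")  # attribute -> scalar field
    for rel, target in concept["relationships"].items():
        lines.append(f"  {rel}: {target}")             # relationship -> object field
    lines.append("}")
    return "\n".join(lines)

character = {"name": "Character", "key": "characterId",
             "attributes": {"name": "string"},
             "relationships": {"devilFruit": "DevilFruit"}}

print(to_graphql(character))
```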
Below is an example of a transpiled GraphQL schema.</p></div></div><div class="pd"><div class="ac cb"><div class="nd rj ne rk nf rl cf rm cg rn ci bh"><figure class="oy oz pa pb pc pd ro rp paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow ox"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*NPXB3ujnUGSIklei%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*NPXB3ujnUGSIklei%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*NPXB3ujnUGSIklei%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*NPXB3ujnUGSIklei%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*NPXB3ujnUGSIklei%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*NPXB3ujnUGSIklei%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*NPXB3ujnUGSIklei%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*NPXB3ujnUGSIklei 640w, https://miro.medium.com/v2/resize:fit:720/0*NPXB3ujnUGSIklei 720w, https://miro.medium.com/v2/resize:fit:750/0*NPXB3ujnUGSIklei 750w, https://miro.medium.com/v2/resize:fit:786/0*NPXB3ujnUGSIklei 786w, https://miro.medium.com/v2/resize:fit:828/0*NPXB3ujnUGSIklei 828w, https://miro.medium.com/v2/resize:fit:1100/0*NPXB3ujnUGSIklei 1100w, https://miro.medium.com/v2/resize:fit:2000/0*NPXB3ujnUGSIklei 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) 
and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /><img alt="Screenshot of an IDE showing two files open side by side. On the left is the definition of a Character in UDA. On the right is transpiled GraphQL schema." class="bh fu pi c" width="1000" height="424" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed"><em class="re">Domain model on the left, with transpiled GraphQL schema on the right. </em><a class="ag gz" href="https://github.com/Netflix-Skunkworks/uda/blob/9627a97fcd972a41ec910be3f928ea7692d38714/uda-intro-blog/onepiece.graphqls" rel="noopener ugc nofollow" target="_blank"><em class="re">Link to full transpiled GraphQL schema</em></a><em class="re">.</em></figcaption></figure></div></div></div><div class="ac cb"><div class="ci bh ht hu hv hw"><p id="252b" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">The Avro transpilation produces a schema that is a Data Mesh flavor of Avro, which includes some customization on top of the <a class="ag gz" href="https://avro.apache.org/docs/1.12.0/specification/" rel="noopener ugc nofollow" target="_blank">official Avro spec</a>. This schema is used to automatically create a Data Mesh source container. 
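The Avro side can be sketched the same way, again with a hypothetical model format and without the Data Mesh extensions; note how the relationship degrades to a plain key field, with the Mapping retaining the semantic link:

```python
# Minimal sketch of transpiling the same kind of domain concept into an Avro
# record schema (hypothetical model format; Data Mesh adds its own extensions).
import json

AVRO_TYPES = {"string": "string", "int": "int", "boolean": "boolean"}

def to_avro(concept, namespace="com.example.onepiece"):  # namespace is illustrative
    fields = [{"name": concept["key"], "type": "string"}]  # key as a string field
    for attr, dtype in concept["attributes"].items():
        fields.append({"name": attr, "type": AVRO_TYPES[dtype]})
    for rel in concept["relationships"]:
        # Avro has no foreign keys, so a relationship becomes an optional key
        # field; the Mapping keeps the semantic link to the target concept.
        fields.append({"name": rel + "Id", "type": ["null", "string"]})
    return {"type": "record", "name": concept["name"],
            "namespace": namespace, "fields": fields}

character = {"name": "Character", "key": "characterId",
             "attributes": {"name": "string"},
             "relationships": {"devilFruit": "DevilFruit"}}

print(json.dumps(to_avro(character), indent=2))
```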
Below is an example of a transpiled Avro schema.</p></div></div><div class="pd"><div class="ac cb"><div class="nd rj ne rk nf rl cf rm cg rn ci bh"><figure class="oy oz pa pb pc pd ro rp paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow ox"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*uVInkj5S3PYTqNA-%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*uVInkj5S3PYTqNA-%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*uVInkj5S3PYTqNA-%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*uVInkj5S3PYTqNA-%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*uVInkj5S3PYTqNA-%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*uVInkj5S3PYTqNA-%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*uVInkj5S3PYTqNA-%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*uVInkj5S3PYTqNA- 640w, https://miro.medium.com/v2/resize:fit:720/0*uVInkj5S3PYTqNA- 720w, https://miro.medium.com/v2/resize:fit:750/0*uVInkj5S3PYTqNA- 750w, https://miro.medium.com/v2/resize:fit:786/0*uVInkj5S3PYTqNA- 786w, https://miro.medium.com/v2/resize:fit:828/0*uVInkj5S3PYTqNA- 828w, https://miro.medium.com/v2/resize:fit:1100/0*uVInkj5S3PYTqNA- 1100w, https://miro.medium.com/v2/resize:fit:2000/0*uVInkj5S3PYTqNA- 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and 
(max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /><img alt="Screenshot of an IDE showing two files open side by side. On the left is the definition of a Devil Fruit in UDA. On the right is the transpiled Avro schema." class="bh fu pi c" width="1000" height="476" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed"><em class="re">Domain model on the left, with transpiled Avro schema on the right. </em><a class="ag gz" href="https://github.com/Netflix-Skunkworks/uda/blob/9627a97fcd972a41ec910be3f928ea7692d38714/uda-intro-blog/onepiece.avro" rel="noopener ugc nofollow" target="_blank"><em class="re">Link to full transpiled Avro schema</em></a><em class="re">.</em></figcaption></figure></div></div></div><div class="ac cb"><div class="ci bh ht hu hv hw"><p id="6ead" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Projections can automatically populate data containers. </strong>Some projections, such as those to GraphQL schemas or Data Mesh sources, produce empty containers that require developers to populate the data. This might mean creating GraphQL APIs or pushing events onto Data Mesh sources. Conversely, other containers, like Iceberg Tables, are automatically created and populated by UDA. For Iceberg Tables, UDA leverages the Data Mesh platform to automatically create data streams to move data into tables. 
This process reuses much of the same infrastructure detailed in <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/data-movement-in-netflix-studio-via-data-mesh-3fddcceb1059" target="_blank" data-discover="true">this blog post</a>.</p><p id="bf20" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Projections have mappings. </strong>UDA automatically generates and manages mappings between the newly created data containers and the projected domain model.</p><h1 id="240f" class="pr ps im bf pt pu pv pw gj px py pz gl qa qb qc qd qe qf qg qh qi qj qk ql qm bk">Early Adopters</h1><h2 id="76f7" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">Controlled Vocabularies (PDM)</h2><p id="da3d" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk">The full range of Netflix’s business activities relies on a sprawling data model that captures the details of our many business processes. Teams need to be able to coordinate operational activities to ensure that content production is complete, advertising campaigns are in place, and promotional assets are ready to deploy. We implicitly depend upon a singular definition of shared concepts, such as “content production is complete.” Multiple definitions create coordination challenges: software (and humans) can’t tell that the definitions mean the same thing.</p><p id="800e" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">We started the Primary Data Management (PDM) initiative to create unified and consistent definitions for the core concepts in our data model. 
These definitions form <strong class="oc in">controlled vocabularies</strong>: standardized, governed lists of the values permitted within certain fields in our data model.</p><p id="be83" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Primary Data Management (PDM) is a single place where business users can manage controlled vocabularies. </strong>Our data model governance has been scattered across different tools and teams, creating coordination challenges. This is an information management problem relating to the definition, maintenance, and consistent use of reference data and taxonomies. This problem is not unique to Netflix, so we looked outward for existing solutions.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow rr"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*GJMad4GU29YxPONf%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*GJMad4GU29YxPONf%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*GJMad4GU29YxPONf%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*GJMad4GU29YxPONf%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*GJMad4GU29YxPONf%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*GJMad4GU29YxPONf%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*GJMad4GU29YxPONf%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and 
(max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*GJMad4GU29YxPONf 640w, https://miro.medium.com/v2/resize:fit:720/0*GJMad4GU29YxPONf 720w, https://miro.medium.com/v2/resize:fit:750/0*GJMad4GU29YxPONf 750w, https://miro.medium.com/v2/resize:fit:786/0*GJMad4GU29YxPONf 786w, https://miro.medium.com/v2/resize:fit:828/0*GJMad4GU29YxPONf 828w, https://miro.medium.com/v2/resize:fit:1100/0*GJMad4GU29YxPONf 1100w, https://miro.medium.com/v2/resize:fit:1400/0*GJMad4GU29YxPONf 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Screenshot of PDM UI" class="bh fu pi c" width="700" height="420" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed">Managing the taxonomy of One Piece characters in PDM.</figcaption></figure><p id="3a7b" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">PDM uses the Simple Knowledge Organization System (</strong><a class="ag gz" href="https://www.w3.org/TR/skos-primer" rel="noopener ugc nofollow" target="_blank"><strong class="oc in">SKOS</strong></a><strong class="oc in">)</strong> <strong class="oc in">model</strong>. It is a W3C data standard designed for modeling knowledge. Its terminology is abstract, with Concepts that can be organized into ConceptSchemes and properties to describe various types of relationships. 
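As a rough illustration of the model (concept names and the <code>broader</code> mapping are hypothetical, not PDM's implementation), a SKOS-style vocabulary boils down to concepts linked by broader/narrower relationships that software can traverse generically:

```python
# Illustrative sketch of SKOS-style concepts: each concept may point at a
# broader concept, forming a hierarchy within a concept scheme.
broader = {
    "Shonen": "Anime",     # Shonen skos:broader Anime
    "Anime": "Animation",  # Anime skos:broader Animation
    "Animation": None,     # top concept of the scheme
}

def ancestors(concept: str) -> list[str]:
    """Walk skos:broader links transitively to the top concept."""
    chain = []
    parent = broader.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = broader.get(parent)
    return chain

print(ancestors("Shonen"))  # ['Anime', 'Animation']
```

A domain-specific property (say, a hypothetical <code>parentGenre</code>) could then be declared a sub-property of <code>skos:broader</code>, so generic traversal code like the above works against any vocabulary.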
Every system is hardcoded against <em class="rg">something</em>; that’s how software knows how to manipulate data. We want a system that can work with a data model as its input, so we still need <em class="rg">something</em> concrete to build the software against. This is what SKOS provides: a generic basis for modeling knowledge that our system can understand.</p><p id="cd41" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">PDM uses Domain Models to integrate SKOS into the rest of Content Engineering’s ecosystem. </strong>A core premise of the system is that it takes a domain model as input, and everything that <em class="rg">can</em> be derived <em class="rg">is</em> derived from that model. PDM builds a user interface based upon the model definition and leverages UDA to project this model into type-safe interfaces for other systems to use. The system provisions a Domain Graph Service (DGS) within our federated GraphQL API environment using a GraphQL schema that UDA projects from the domain model. UDA is also used to provision data movement pipelines that feed our <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf" target="_blank" data-discover="true">GraphSearch</a> infrastructure as well as move data into the warehouse. The data movement systems use Avro schemas, and UDA creates a projection from the domain model to Avro.</p><p id="2309" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Consumers of controlled vocabularies never know they’re using SKOS. </strong>Domain models use terms that fit in with the domain. 
SKOS’s generic notions of <em class="rg">broader</em> and <em class="rg">narrower</em>, which define a hierarchy, are hidden from consumers as super-properties within the model. This allows consumers to work with language that is familiar to them while enabling PDM to work with any model. The best of both worlds.</p><h2 id="47a4" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">Operational Reporting (Sphere)</h2><p id="cb2d" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk"><strong class="oc in">Operational reporting serves the detailed day-to-day activities and processes of a business domain.</strong> It is a reporting paradigm specialized in covering high-resolution, low-latency data sets.</p><p id="1d94" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Operational reporting systems should generate reports without relying on technical intermediaries. </strong>These systems need to address the persistent challenge of empowering business users to explore and obtain the data they need, when they need it. Without such self-service systems, requests for new reports or data extracts often result in back-and-forth exchanges, where the initial query may not exactly meet business users’ expectations, requiring further clarification and refinement.</p><p id="905e" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Data discovery and query generation are two relevant aspects of data integration. 
</strong>Supplying end-users with an accurate, contextual, and user-friendly data discovery experience provides the basis for a query generation mechanism that produces syntactically correct and semantically reliable queries.</p><p id="0968" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Operational reports are predominantly run on data hydrated from GraphQL services into the Data Warehouse. </strong>You can read about our journey from conventional data movement to streaming data pipelines based on CDC and GraphQL hydration in <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/data-movement-in-netflix-studio-via-data-mesh-3fddcceb1059" target="_blank" data-discover="true">this blog post</a>. One challenging byproduct of this approach was that a single, distinct data concept is now present in two places (GraphQL and the data warehouse), with some disparity in the semantic context available to guide the interpretation and connectivity of that data. To address this, we formulated a mechanism that uses the syntax and semantics captured in the federated schema from <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2" target="_blank" data-discover="true">Netflix’s Enterprise GraphQL</a> to populate <em class="rg">representational domain models</em> in UDA that preserve those details and add more.</p><p id="28bb" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Domain models enable the data discovery experience. </strong>Metadata aggregated from various data-producing systems is captured in UDA domain models using a unified vocabulary. 
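The "islands" idea can be sketched with a toy concept graph. Everything here is an illustrative assumption (concept names, edges, and the traversal helper), not Sphere's implementation: two concepts can appear in one generated query only if a join path connects them.

```python
# Hypothetical sketch of join-feasibility over a concept graph. Concepts
# in the same connected component ("island") are joinable; concepts in
# different islands yield non-executable queries and are weeded out.
from collections import defaultdict

# Undirected edges: a join path exists between these concepts' containers.
edges = [("movie", "actor"), ("movie", "genre"), ("invoice", "vendor")]

graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def same_island(a: str, b: str) -> bool:
    """Graph traversal: can we reach b from a via join edges?"""
    seen, frontier = {a}, [a]
    while frontier:
        node = frontier.pop()
        if node == b:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            frontier.append(nxt)
    return False

print(same_island("actor", "genre"))   # True: both reachable via movie
print(same_island("actor", "vendor"))  # False: different islands
```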
This metadata is surfaced for the users’ search and discovery needs; instead of specifying exact tables and join keys, users can simply search for familiar business concepts such as ‘actors’ or ‘movies’. We use UDA models to disambiguate and resolve the intended concepts and their related data entities.</p><p id="7eee" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">The UDA knowledge graph is the data landscape for query generation. </strong>Once concepts are discovered and their mappings to corresponding data containers are identified and located in the knowledge graph, we use them to establish join strategies. Through graph traversal, we identify <em class="rg">boundaries</em> and <em class="rg">islands</em> within the data landscape. This ensures only feasible, joinable combinations are selected while weeding out semantically incorrect and non-executable query candidates.</p><figure class="oy oz pa pb pc pd ov ow paragraph-image"><div role="button" tabindex="0" class="pe pf fi pg bh ph"><div class="ov ow ox"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*EFEfzwY-3Tb6521O%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*EFEfzwY-3Tb6521O%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*EFEfzwY-3Tb6521O%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*EFEfzwY-3Tb6521O%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*EFEfzwY-3Tb6521O%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*EFEfzwY-3Tb6521O%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*EFEfzwY-3Tb6521O%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 
700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*EFEfzwY-3Tb6521O 640w, https://miro.medium.com/v2/resize:fit:720/0*EFEfzwY-3Tb6521O 720w, https://miro.medium.com/v2/resize:fit:750/0*EFEfzwY-3Tb6521O 750w, https://miro.medium.com/v2/resize:fit:786/0*EFEfzwY-3Tb6521O 786w, https://miro.medium.com/v2/resize:fit:828/0*EFEfzwY-3Tb6521O 828w, https://miro.medium.com/v2/resize:fit:1100/0*EFEfzwY-3Tb6521O 1100w, https://miro.medium.com/v2/resize:fit:1400/0*EFEfzwY-3Tb6521O 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Screenshot of Sphere’s UI" class="bh fu pi c" width="700" height="353" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qt fc qu ov ow qv qw bf b bg ab ed">Generating a report in Sphere.</figcaption></figure><p id="107a" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk"><strong class="oc in">Sphere is a UDA-powered self-service operational reporting system. </strong>The solution based on knowledge graphs described above is called Sphere. Seeing self-service operational reporting through this lens, we can improve business users’ agency in access to operational data. 
They are empowered to explore, assemble, and refine reports at the conceptual level, while technical complexities are managed by the system.</p><h1 id="76ef" class="pr ps im bf pt pu pv pw gj px py pz gl qa qb qc qd qe qf qg qh qi qj qk ql qm bk">Stay Tuned</h1><p id="844c" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk">UDA marks a fundamental shift in how we approach data modeling within Content Engineering. By providing a unified knowledge graph composed of what we know about our various data systems and the business concepts within them, we’ve made information more consistent, connected, and discoverable across our organization. We’re excited about future applications of these ideas such as:</p><ul class=""><li id="8106" class="oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou pj pk pl bk">Supporting additional projections like Protobuf/gRPC</li><li id="9af0" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk">Materializing the knowledge graph of instance data for querying, profiling, and management</li><li id="ec0a" class="oa ob im oc b od pm of og oh pn oj ok gm po om on gp pp op oq gs pq os ot ou pj pk pl bk">Finally solving some of the initial <a class="ag gz" rel="noopener ugc nofollow" href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf" target="_blank" data-discover="true">challenges</a> posed by Graph Search (that actually inspired some of this work)</li></ul><p id="2012" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">If you’re interested in this space, we’d love to connect — whether you’re exploring new roles down the road or just want to swap ideas.</p><p id="844f" class="pw-post-body-paragraph oa ob im oc b od oe of og oh oi oj ok gm ol om on gp oo op oq gs or os ot ou hn bk">Expect to 
see future blog posts exploring PDM and Sphere in more detail soon!</p><h2 id="e536" class="qx ps im bf pt gi qy du gj gk qz dw gl gm ra gn go gp rb gq gr gs rc gt gu rd bk">Credits</h2><p id="b15e" class="pw-post-body-paragraph oa ob im oc b od qn of og oh qo oj ok gm qp om on gp qq op oq gs qr os ot ou hn bk">Thanks to <a class="ag gz" href="https://www.linkedin.com/in/andreaslegenbauer/" rel="noopener ugc nofollow" target="_blank">Andreas Legenbauer</a>, <a class="ag gz" href="https://www.linkedin.com/in/bernardo-g-4414b41/" rel="noopener ugc nofollow" target="_blank">Bernardo Gomez Palacio Valdes</a>, <a class="ag gz" href="https://www.linkedin.com/in/czhao/" rel="noopener ugc nofollow" target="_blank">Charles Zhao</a>, <a class="ag gz" href="https://www.linkedin.com/in/christopherchonguw/" rel="noopener ugc nofollow" target="_blank">Christopher Chong</a>, <a class="ag gz" href="https://www.linkedin.com/in/deepa-krishnan-593b60/" rel="noopener ugc nofollow" target="_blank">Deepa Krishnan</a>, <a class="ag gz" href="https://www.linkedin.com/in/gpesma/" rel="noopener ugc nofollow" target="_blank">George Pesmazoglou</a>, <a class="ag gz" href="https://www.linkedin.com/in/jsilvax/" rel="noopener ugc nofollow" target="_blank">Jessica Silva</a>, <a class="ag gz" href="https://www.linkedin.com/in/katherine-anderson-77074159/" rel="noopener ugc nofollow" target="_blank">Katherine Anderson</a>, <a class="ag gz" href="https://www.linkedin.com/in/malikday/" rel="noopener ugc nofollow" target="_blank">Malik Day</a>, <a class="ag gz" href="https://www.linkedin.com/in/ritabogdanovashapkina/" rel="noopener ugc nofollow" target="_blank">Rita Bogdanova</a>, <a class="ag gz" href="https://www.linkedin.com/in/ruoyunzheng/" rel="noopener ugc nofollow" target="_blank">Ruoyun Zheng</a>, <a class="ag gz" href="https://www.linkedin.com/in/shawn-s-b80821b0/" rel="noopener ugc nofollow" target="_blank">Shawn Stedman</a>, <a class="ag gz" href="https://www.linkedin.com/in/suchitagoyal/" 
rel="noopener ugc nofollow" target="_blank">Suchita Goyal</a>, <a class="ag gz" href="http://www.linkedin.com/in/utkarshshrivastava/" rel="noopener ugc nofollow" target="_blank">Utkarsh Shrivastava</a>, <a class="ag gz" href="https://www.linkedin.com/in/yoomikoh/" rel="noopener ugc nofollow" target="_blank">Yoomi Koh</a>, <a class="ag gz" href="https://www.linkedin.com/in/yuliashmeleva/" rel="noopener ugc nofollow" target="_blank">Yulia Shmeleva</a></p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/uda-unified-data-architecture-6a6aee261d8d</link>
      <guid>https://netflixtechblog.com/uda-unified-data-architecture-6a6aee261d8d</guid>
      <pubDate>Thu, 12 Jun 2025 16:56:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[FM-Intent: Predicting User Session Intent with Hierarchical Multi-Task Learning]]></title>
      <description><![CDATA[<div><div></div><p id="e0a5" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">Authors: <a class="ag hb" href="https://www.linkedin.com/in/sejoon-oh/" rel="noopener ugc nofollow" target="_blank">Sejoon Oh</a>, <a class="ag hb" href="https://www.linkedin.com/in/moumitab/" rel="noopener ugc nofollow" target="_blank">Moumita Bhattacharya</a>, <a class="ag hb" href="https://www.linkedin.com/in/yesufeng/" rel="noopener ugc nofollow" target="_blank">Yesu Feng</a>, <a class="ag hb" href="https://www.linkedin.com/in/sudarshanlamkhede/" rel="noopener ugc nofollow" target="_blank">Sudarshan Lamkhede</a>, <a class="ag hb" href="https://www.linkedin.com/in/markhsiao/" rel="noopener ugc nofollow" target="_blank">Ko-Jen Hsiao</a>, and <a class="ag hb" href="https://www.linkedin.com/in/jbasilico/" rel="noopener ugc nofollow" target="_blank">Justin Basilico</a></p><h1 id="115a" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">Motivation</h1><p id="cb4a" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">Recommender systems have become essential components of digital services across e-commerce, streaming media, and social networks [1, 2]. At Netflix, these systems drive significant product and business impact by connecting members with relevant content at the right time [3, 4]. While our recommendation <strong class="od ip">foundation model (FM)</strong> has made substantial progress in understanding user preferences through large-scale learning from interaction histories (please refer to this <a class="ag hb" href="https://netflixtechblog.medium.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39" rel="noopener"><strong class="od ip"><em class="px">article</em></strong></a> about FM @ Netflix), there is an opportunity to further enhance its capabilities. 
By extending FM to incorporate the prediction of underlying user intents, we aim to enrich its understanding of user sessions beyond next-item prediction, thereby offering a more comprehensive and nuanced recommendation experience.</p><p id="1371" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">Recent research has highlighted the importance of understanding user intent in online platforms [5, 6, 7, 8]. As Xia et al. [8] demonstrated at Pinterest, predicting a user’s future intent can lead to more accurate and personalized recommendations. However, existing intent prediction approaches typically employ simple multi-task learning that adds intent prediction heads to next-item prediction models without establishing a hierarchical relationship between these tasks.</p><p id="62f1" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">To address these limitations, we introduce <strong class="od ip"><em class="px">FM-Intent</em></strong>, a novel recommendation model that enhances our foundation model through hierarchical multi-task learning. FM-Intent captures a user’s latent session intent using both short-term and long-term implicit signals as proxies, then leverages this intent prediction to improve next-item recommendations. 
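The hierarchical wiring can be sketched with a toy forward pass. All shapes, weights, and names below are illustrative assumptions, not FM-Intent's actual architecture: the point is only that the intent head's output conditions the next-item head, rather than the two heads being independent.

```python
# Minimal numpy sketch of hierarchical multi-task wiring (illustrative
# only, not FM-Intent's architecture): the predicted intent distribution
# is fed into the next-item head.
import numpy as np

rng = np.random.default_rng(0)
d, n_intents, n_items = 16, 4, 100

session_repr = rng.normal(size=d)            # from the sequence encoder
W_intent = rng.normal(size=(n_intents, d))   # intent prediction head

intent_logits = W_intent @ session_repr
intent_probs = np.exp(intent_logits) / np.exp(intent_logits).sum()

# Hierarchy: condition the item head on the predicted intent distribution,
# rather than attaching an independent intent head to the same encoder.
W_item = rng.normal(size=(n_items, d + n_intents))
item_logits = W_item @ np.concatenate([session_repr, intent_probs])

print(item_logits.shape)  # (100,)
```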
Unlike conventional approaches, FM-Intent establishes a clear hierarchy where intent predictions directly inform item recommendations, creating a more coherent and effective recommendation pipeline.</p><p id="16e8" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">FM-Intent makes three key contributions:</p><ol class=""><li id="9b16" class="ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov py pz qa bk">A novel recommendation model that captures user intent on the Netflix platform and enhances next-item prediction using this intent information.</li><li id="538d" class="ob oc io od b oe qb og oh oi qc ok ol go qd on oo gr qe oq or gu qf ot ou ov py pz qa bk">A hierarchical multi-task learning approach that effectively models both short-term and long-term user interests.</li><li id="c154" class="ob oc io od b oe qb og oh oi qc ok ol go qd on oo gr qe oq or gu qf ot ou ov py pz qa bk">Comprehensive experimental validation showing significant performance improvements over state-of-the-art models, including our foundation model.</li></ol><h1 id="04a3" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">Understanding User Intent in Netflix</h1><p id="d624" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">In the Netflix ecosystem, user intent manifests through various interaction metadata, as illustrated in Figure 1. 
FM-Intent leverages these implicit signals to predict both user intent and next-item recommendations.</p><figure class="qj qk ql qm qn qo qg qh paragraph-image"><div role="button" tabindex="0" class="qp qq fl qr bh qs"><div class="qg qh qi"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*3pMgS5u3TepefPLB%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*3pMgS5u3TepefPLB%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*3pMgS5u3TepefPLB%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*3pMgS5u3TepefPLB%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*3pMgS5u3TepefPLB%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*3pMgS5u3TepefPLB%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*3pMgS5u3TepefPLB%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*3pMgS5u3TepefPLB 640w, https://miro.medium.com/v2/resize:fit:720/0*3pMgS5u3TepefPLB 720w, https://miro.medium.com/v2/resize:fit:750/0*3pMgS5u3TepefPLB 750w, https://miro.medium.com/v2/resize:fit:786/0*3pMgS5u3TepefPLB 786w, https://miro.medium.com/v2/resize:fit:828/0*3pMgS5u3TepefPLB 828w, https://miro.medium.com/v2/resize:fit:1100/0*3pMgS5u3TepefPLB 1100w, https://miro.medium.com/v2/resize:fit:1400/0*3pMgS5u3TepefPLB 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 
3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh fw qt c" width="700" height="329" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="cd73" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><em class="px">Figure 1: Overview of user engagement data in Netflix. User intent can be associated with several interaction metadata. We leverage various implicit signals to predict user intent and next-item.</em></p><p id="798c" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">In Netflix, there can be multiple types of user intents. For instance,</p><blockquote class="qu qv qw"><p id="78cd" class="ob oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip"><em class="io">Action Type</em></strong>: Categories reflecting what users intend to do on Netflix, such as discovering new content versus continuing previously started content. For example, when a member plays a follow-up episode of something they were already watching, this can be categorized as “continue watching” intent.</p><p id="5e27" class="ob oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip"><em class="io">Genre Preference</em></strong>: The pre-defined genre labels (e.g., Action, Thriller, Comedy) that indicate a user’s content preferences during a session. 
These preferences can shift significantly between sessions, even for the same user.</p><p id="39f6" class="ob oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip"><em class="io">Movie/Show Type</em></strong>: Whether a user is looking for a movie (typically a single, longer viewing experience) or a TV show (potentially multiple episodes of shorter duration).</p><p id="ebf9" class="ob oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip"><em class="io">Time-since-release</em></strong>: Whether the user prefers newly released content, recent content (e.g., between a week and a month), or evergreen catalog titles.</p></blockquote><p id="cff1" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">These dimensions serve as proxies for the latent user intent, which is often not directly observable but crucial for providing relevant recommendations.</p><h1 id="9ea0" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">FM-Intent Model Architecture</h1><p id="30e5" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">FM-Intent employs a hierarchical multi-task learning approach with three major components, as illustrated in Figure 2.</p><figure class="qj qk ql qm qn qo qg qh paragraph-image"><div class="qg qh qx"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*TO3z9jPiu2QZR-xnL7erMQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*TO3z9jPiu2QZR-xnL7erMQ.png" /><img alt="" class="bh fw qt c" width="624" height="830" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></figure><p id="0f7c" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os 
ot ou ov hp bk"><em class="px">Figure 2: An architectural illustration of our hierarchical multi-task learning model FM-Intent for user intent and item predictions. We use ground-truth intent and item-ID labels to optimize predictions.</em></p><h2 id="2f73" class="qy ox io bf oy gk qz dy gl gm ra ea gn go rb gp gq gr rc gs gt gu rd gv gw re bk">1. Input Feature Sequence Formation</h2><p id="852b" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">The first component constructs rich input features by combining interaction metadata. The input feature for each interaction combines categorical embeddings and numerical features, creating a comprehensive representation of user behavior.</p><h2 id="be50" class="qy ox io bf oy gk qz dy gl gm ra ea gn go rb gp gq gr rc gs gt gu rd gv gw re bk">2. User Intent Prediction</h2><p id="754b" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">The intent prediction component processes the input feature sequence through a Transformer encoder and generates predictions for multiple intent signals.</p><p id="7364" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">The Transformer encoder effectively models the long-term interest of users through multi-head attention mechanisms. For each prediction task, the intent encoding is transformed into prediction scores via fully-connected layers.</p><p id="eefb" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">A key innovation in FM-Intent is the attention-based aggregation of individual intent predictions. 
This approach generates a comprehensive intent embedding that captures the relative importance of different intent signals for each user, providing valuable insights for personalization and explanation.</p><h2 id="19c9" class="qy ox io bf oy gk qz dy gl gm ra ea gn go rb gp gq gr rc gs gt gu rd gv gw re bk">3. Next-Item Prediction with Hierarchical Multi-Task Learning</h2><p id="6faa" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">The final component combines the input features with the user intent embedding to make more accurate next-item recommendations.</p><p id="32a2" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">FM-Intent employs hierarchical multi-task learning where intent predictions are conducted first, and their results are used as input features for the next-item prediction task. This hierarchical relationship ensures that the next-item recommendations are informed by the predicted user intent, creating a more coherent and effective recommendation model.</p><h1 id="6df1" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">Offline Results</h1><p id="f9d0" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">We conducted comprehensive offline experiments on sampled Netflix user engagement data to evaluate FM-Intent’s performance. 
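To make the hierarchy above concrete, here is a minimal NumPy sketch of the forward pass: per-task intent heads, attention-based aggregation into a single intent embedding, and a next-item head that consumes both the sequence encoding and the intent embedding. All sizes, weights, and the stand-in "encoder" output are illustrative assumptions, not the production model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks, n_classes, n_items = 16, 4, 8, 100  # illustrative sizes

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in for the Transformer encoder's encoding of one user's sequence
h = rng.normal(size=d)

# Per-task intent heads: fully-connected layers -> task prediction scores
intent_probs = [softmax(h @ rng.normal(size=(d, n_classes))) for _ in range(n_tasks)]

# Attention-based aggregation: embed each task's prediction, then weight the
# task embeddings by their attention score against the user encoding
task_embs = np.stack([p @ rng.normal(size=(n_classes, d)) for p in intent_probs])
attn = softmax(task_embs @ h)        # relative importance of each intent signal
intent_embedding = attn @ task_embs  # comprehensive intent embedding

# Hierarchical next-item head: sequence encoding + predicted intent embedding
next_item_logits = np.concatenate([h, intent_embedding]) @ rng.normal(size=(2 * d, n_items))
```

The hierarchical step is the concatenation at the end: the next-item head only ever sees the sequence encoding together with the aggregated intent prediction.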
Note that FM-Intent uses a much smaller dataset for training compared to the FM production model due to its complex hierarchical prediction architecture.</p><h2 id="eb63" class="qy ox io bf oy gk qz dy gl gm ra ea gn go rb gp gq gr rc gs gt gu rd gv gw re bk">Next-Item and Next-Intent Prediction Accuracy</h2><p id="47fb" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">Table 1 compares FM-Intent with several state-of-the-art sequential recommendation models, including our production model (FM-Intent-V0).</p><figure class="qj qk ql qm qn qo qg qh paragraph-image"><div role="button" tabindex="0" class="qp qq fl qr bh qs"><div class="qg qh qi"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*7h7aSQhq7U_heAUu" alt="Next-item and next-intent prediction results" class="bh fw qt c" width="700" height="203" /></picture></div></div></figure><p id="e95c" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><em class="px">Table 1: Next-item and next-intent prediction results of baselines and our proposed method FM-Intent on the Netflix user engagement dataset.</em></p><p id="c67b" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">All metrics are represented as relative % improvements over the SOTA baseline, TransAct. N/A indicates that a model is not capable of predicting a certain intent. Note that we added additional fully-connected layers to the LSTM, GRU, and Transformer baselines to predict user intent, while we used the original implementations for the other baselines.
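The post reports relative improvements rather than raw numbers; for intuition, next-item "accuracy" in sequential recommendation is commonly measured with top-k ranking metrics. A minimal Hit@k sketch (the metric choice and all names are my assumption, not stated in the post):

```python
import numpy as np

def hit_rate_at_k(scores, true_items, k=10):
    """Fraction of users whose held-out next item appears in the top-k
    predictions. scores: (n_users, n_items); true_items: (n_users,)."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([t in row for t, row in zip(true_items, topk)]))

# Toy check: 3 users, 4 items; two users have their true item ranked first
scores = np.array([[0.9, 0.1, 0.0, 0.0],
                   [0.2, 0.7, 0.1, 0.0],
                   [0.8, 0.1, 0.1, 0.0]])
true_items = np.array([0, 1, 3])
```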
FM-Intent demonstrates a statistically significant improvement of 7.4% in next-item prediction accuracy compared to the best baseline (TransAct).</p><p id="01c4" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">Most baseline models show limited performance as they either cannot predict user intent or cannot incorporate intent predictions into next-item recommendations. Our production model (FM-Intent-V0) performs well but lacks the ability to predict and leverage user intent. Note that FM-Intent-V0 is trained with a smaller dataset for a fair comparison with other models; the actual production model is trained with a much larger dataset.</p><h1 id="c46f" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">Qualitative Analysis: User Clustering</h1><figure class="qj qk ql qm qn qo qg qh paragraph-image"><div role="button" tabindex="0" class="qp qq fl qr bh qs"><div class="qg qh rf"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*nBqH-ZHRXRfR3-e0" alt="K-means++ clustering of user intent embeddings" class="bh fw qt c" width="700" height="753" /></picture></div></div></figure><p id="c3dc" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><em class="px">Figure 3: K-means++ (K=10) clustering of user intent embeddings found by FM-Intent; FM-Intent finds unique clusters of users that share similar intents.</em></p><p id="b54d" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">FM-Intent generates meaningful user intent embeddings that can be used for clustering users with similar intents.
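A minimal NumPy sketch of the same idea: k-means++ seeding plus Lloyd iterations over synthetic stand-in embeddings (in practice you would run a library k-means, with K=10, on the real FM-Intent intent embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in intent embeddings: three synthetic, well-separated user segments in 8-D
X = np.concatenate([rng.normal(loc=c, scale=0.2, size=(50, 8)) for c in (-2.0, 0.0, 2.0)])

def kmeans_pp_init(X, k, rng):
    # k-means++ seeding: first center uniform, then sample proportionally
    # to squared distance from the nearest already-chosen center
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, iters=20):
    C = kmeans_pp_init(X, k, rng)
    for _ in range(iters):  # Lloyd iterations: assign, then recompute means
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    return labels, C

labels, centers = kmeans(X, k=3)
```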
Figure 3 visualizes 10 distinct clusters identified through K-means++ clustering. These clusters reveal meaningful user segments with distinct viewing patterns:</p><ul class=""><li id="de73" class="ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov rg pz qa bk">Users who primarily discover new content versus those who continue watching recent/favorite content.</li><li id="67ed" class="ob oc io od b oe qb og oh oi qc ok ol go qd on oo gr qe oq or gu qf ot ou ov rg pz qa bk">Genre enthusiasts (e.g., <em class="px">anime/kids content viewers</em>).</li><li id="d511" class="ob oc io od b oe qb og oh oi qc ok ol go qd on oo gr qe oq or gu qf ot ou ov rg pz qa bk">Users with specific viewing patterns (e.g., <em class="px">Rewatchers</em> versus <em class="px">casual viewers</em>).</li></ul><h1 id="2e06" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">Potential Applications of FM-Intent</h1><p id="5515" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">FM-Intent has been successfully integrated into Netflix’s recommendation ecosystem and can be leveraged for several downstream applications:</p><blockquote class="qu qv qw"><p id="39d4" class="ob oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip">Personalized UI Optimization</strong>: The predicted user intent could inform the layout and content selection on the Netflix homepage, emphasizing different rows based on whether users are in discovery mode, continue-watching mode, or exploring specific genres.</p><p id="710a" class="ob oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip">Analytics and User Understanding</strong>: Intent embeddings and clusters provide valuable insights into viewing patterns and preferences, informing content acquisition and production decisions.</p><p id="c3c1" class="ob
oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip">Enhanced Recommendation Signals</strong>: Intent predictions serve as features for other recommendation models, improving their accuracy and relevance.</p><p id="3b9a" class="ob oc px od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk"><strong class="od ip">Search Optimization</strong>: Real-time intent predictions help prioritize search results based on the user’s current session intent.</p></blockquote><h1 id="cc1a" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">Conclusion</h1><p id="6e9a" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">FM-Intent represents an advancement in Netflix’s recommendation capabilities by enhancing them with hierarchical multi-task learning for user intent prediction. Our comprehensive experiments demonstrate that FM-Intent significantly outperforms state-of-the-art models, including our prior foundation model that focused solely on next-item prediction. By understanding not just what users might watch next but what underlying intents users have, we can provide more personalized, relevant, and satisfying recommendations.</p><h1 id="c957" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">Acknowledgements</h1><p id="e32c" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">We thank our stunning colleagues in the Foundation Model team &amp; AIMS org. for their valuable feedback and discussions. 
We also thank our partner teams for getting this up and running in production.</p><h1 id="2e36" class="ow ox io bf oy oz pa pb gl pc pd pe gn pf pg ph pi pj pk pl pm pn po pp pq pr bk">References</h1><p id="8218" class="pw-post-body-paragraph ob oc io od b oe ps og oh oi pt ok ol go pu on oo gr pv oq or gu pw ot ou ov hp bk">[1] Amatriain, X., &amp; Basilico, J. (2015). Recommender systems in industry: A Netflix case study. In Recommender systems handbook (pp. 385–419). Springer.</p><p id="c60d" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">[2] Gomez-Uribe, C. A., &amp; Hunt, N. (2015). The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4), 1–19.</p><p id="84d0" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">[3] Jannach, D., &amp; Jugovac, M. (2019). Measuring the business value of recommender systems. ACM Transactions on Management Information Systems (TMIS), 10(4), 1–23.</p><p id="b1fc" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">[4] Bhattacharya, M., &amp; Lamkhede, S. (2022). Augmenting Netflix Search with In-Session Adapted Recommendations. In Proceedings of the 16th ACM Conference on Recommender Systems (pp. 542–545).</p><p id="3d01" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">[5] Chen, Y., Liu, Z., Li, J., McAuley, J., &amp; Xiong, C. (2022). Intent contrastive learning for sequential recommendation. In Proceedings of the ACM Web Conference 2022 (pp. 2172–2182).</p><p id="cabd" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">[6] Ding, Y., Ma, Y., Wong, W. K., &amp; Chua, T. S. (2021).
Modeling instant user intent and content-level transition for sequential fashion recommendation. IEEE Transactions on Multimedia, 24, 2687–2700.</p><p id="f88f" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">[7] Liu, Z., Chen, H., Sun, F., Xie, X., Gao, J., Ding, B., &amp; Shen, Y. (2021). Intent preference decoupling for user representation on online recommender system. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence (pp. 2575–2582).</p><p id="1c91" class="pw-post-body-paragraph ob oc io od b oe of og oh oi oj ok ol go om on oo gr op oq or gu os ot ou ov hp bk">[8] Xia, X., Eksombatchai, P., Pancha, N., Badani, D. D., Wang, P. W., Gu, N., Joshi, S. V., Farahpour, N., Zhang, Z., &amp; Zhai, A. (2023). TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 5249–5259).</p></div>]]></description>
      <link>https://netflixtechblog.com/fm-intent-predicting-user-session-intent-with-hierarchical-multi-task-learning-94c75e18f4b8</link>
      <guid>https://netflixtechblog.com/fm-intent-predicting-user-session-intent-with-hierarchical-multi-task-learning-94c75e18f4b8</guid>
      <pubDate>Wed, 21 May 2025 18:28:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Behind the Scenes: Building a Robust Ads Event Processing Pipeline]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg hu hv hw hx"><div><div></div><p id="3140" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj"><a class="af ha" href="https://www.linkedin.com/in/kineshsatiya/" rel="noopener ugc nofollow" target="_blank">Kinesh Satiya</a></p><h2 id="c0fb" class="ox oy in be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Introduction</h2><p id="0bc0" class="pw-post-body-paragraph oc od in oe b of pg oh oi oj ph ol om gn pi oo op gq pj or os gt pk ou ov ow ho bj">In a digital advertising platform, a robust feedback system is essential for the lifecycle and success of an ad campaign. This system comprises diverse sub-systems designed to monitor, measure, and optimize ad campaigns. At Netflix, we embarked on a journey to build a robust event processing platform that not only meets the current demands but also scales for future needs. This blog post delves into the architectural evolution and technical decisions that underpin our Ads event processing pipeline.</p><p id="5676" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">Ad serving acts like the “brain” — making decisions, optimizing delivery and ensuring the right ad is shown to the right member at the right time. Meanwhile, ad events, emitted after an ad is rendered, function like “heartbeats”, continuously providing real-time feedback (oxygen/nutrients) that fuels better decision-making, optimizations, reporting, measurement, and billing.
Expanding on this analogy:</p><ul class=""><li id="ea42" class="oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow pl pm pn bj">Just as the brain relies on continuous blood flow, ad serving depends on a steady stream of ad events to adjust the next ad serving decision, frequency capping, pacing, and personalization.</li><li id="6f87" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">If the nervous system stops sending signals (ad events stop flowing), the brain (ad serving) lacks critical insights and starts making poor decisions or even fails.</li><li id="e937" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">The healthier and more accurate the event stream (just like strong heart function), the better the ad serving system can adapt, optimize, and drive business outcomes.</li></ul><p id="2ec6" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">Let’s dive into the journey of building this pipeline.</p><h2 id="35f9" class="ox oy in be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">The Pilot</h2><p id="1001" class="pw-post-body-paragraph oc od in oe b of pg oh oi oj ph ol om gn pi oo op gq pj or os gt pk ou ov ow ho bj">In November 2022, we launched a brand <a class="af ha" href="https://about.netflix.com/en/news/announcing-basic-with-ads-us" rel="noopener ugc nofollow" target="_blank">new basic ads plan</a>, in partnership with Microsoft. The software systems extended the existing Netflix playback systems to play ads. Initially, the system was designed to be simple, secure, and efficient, with an underlying ethos of device-originated and server-proxied operations. The system consisted of three main components: the Microsoft Ad Server, Netflix Ads Manager, and Ad Event Handler.
Each ad served required tracking to ensure the feedback loop functioned effectively, providing the external ad server with insights on impressions, frequency capping (advertiser policy that limits the number of times a user sees a specific ad), and monetization processes.</p><p id="4b23" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">Key features of this system include:</p><ol class=""><li id="568d" class="oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow pt pm pn bj"><strong class="oe io">Client Request: </strong>Client devices request ads during an ad break from Netflix playback systems; the request is then decorated with additional information by the ads manager before ads are requested from the ad server.</li><li id="405c" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj"><strong class="oe io">Server-Side Ad Insertion:</strong> The Ad Server sends ad responses using the VAST (Video Ad Serving Template) format.</li><li id="8552" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj"><strong class="oe io">Netflix Ads Manager:</strong> This service parses VAST documents, extracts tracking event information, and creates a simplified response structure for Netflix playback systems and client devices. <br /> — The tracking information is packed into a structured protobuf data model.<br /> — This structure is encrypted to create an opaque token.<br /> — The final response informs the client devices when to send each event, along with the corresponding token.</li><li id="d205" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj"><strong class="oe io">Client Device:</strong> During ad playback, client devices send events accompanied by a token.
The Netflix telemetry system then enqueues all these events in Kafka for asynchronous processing.</li><li id="e73a" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj"><strong class="oe io">Ads Event Handler:</strong> This component is a Kafka consumer that reads and decrypts the event payload and forwards the encoded tracking information back to the ad server and other vendors.</li></ol></div></div><div class="pu"><div class="ab ca"><div class="nf pv ng pw nh px ce py cf pz ch bg"><figure class="qd qe qf qg qh pu qi qj paragraph-image"><div role="button" tabindex="0" class="qk ql fk qm bg qn"><div class="qa qb qc"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*S_6xv6LxoRyL8KtUXqWueQ.png" alt="Basic Ad Event Handling System" /></picture></div></div><figcaption class="qp fe qq qa qb qr qs be b bf z dt">Fig 1: Basic Ad Event Handling System</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg hu hv hw hx"><p id="ee5a" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">There is an <a class="af ha" rel="noopener ugc nofollow" href="https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba" target="_blank">excellent prior blog</a> post that explains how this system was tested end-to-end at scale.
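The Ads Event Handler described above is, at its core, a consume/decrypt/forward loop. A library-free Python sketch (the helper names and record fields are hypothetical, and a real deployment would use an actual Kafka consumer):

```python
def handle_ad_events(poll, decrypt_token, forward_tracking):
    """Consume raw ad events, recover tracking info from each opaque
    token, and relay it to the ad server and vendor endpoints.

    poll() returns an iterable of raw event records; decrypt_token and
    forward_tracking are stand-ins for the real decryption and HTTP relay.
    """
    for record in poll():
        tracking = decrypt_token(record["token"])     # opaque token -> tracking info
        for url in tracking["tracking_urls"]:         # fan out to each endpoint
            forward_tracking(url, event=tracking["event"])
```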
This system design allowed us to quickly add new integrations with vendors like DV, IAS, and Nielsen for verification and measurement.</p><h2 id="1fc6" class="ox oy in be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">The Expansion</h2><p id="06ee" class="pw-post-body-paragraph oc od in oe b of pg oh oi oj ph ol om gn pi oo op gq pj or os gt pk ou ov ow ho bj">As we continued to expand our third-party (3P) advertising vendors for measurement, tracking and verification, we identified a critical trend: growth in the volume of data encapsulated within opaque tokens. These tokens, which are cached on client devices, present a risk of elevated memory usage, potentially impacting device performance. We also anticipated an increase in third-party tracking URLs, metadata needs, and more event types as our business added new capabilities.</p><p id="b65f" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">To strategically address these challenges, we introduced a new persistence layer, built on the <a class="af ha" rel="noopener ugc nofollow" href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30" target="_blank">Key-Value abstraction</a>, between the ad serving and event handling systems: the Ads Metadata Registry. This transient storage service stores metadata for each ad served; upon callback, the event handler reads the tracking information and relays it to the vendors. The contract between the client device and Ads systems continues to use an opaque token per event, but now, instead of tracking information, it contains only reference identifiers: the Ad ID, the corresponding metadata record ID in the registry, and the event name.
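To illustrate why the slimmer token helps, here is a toy round-trip of the reference-identifier contract. The real system packs a protobuf message and encrypts it; this stdlib sketch uses JSON plus an HMAC seal instead, and every name (fields, secret, helpers) is an assumption for illustration.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # illustrative only; real keys are managed server-side

def seal_token(ad_id, metadata_record_id, event_name):
    # Only reference identifiers go into the token; the heavy tracking
    # metadata lives in the Ads Metadata Registry, keyed by the record ID.
    payload = json.dumps(
        {"ad_id": ad_id, "mid": metadata_record_id, "event": event_name}
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(sig + payload).decode()

def open_token(token):
    raw = base64.urlsafe_b64decode(token)
    sig, payload = raw[:32], raw[32:]
    if not hmac.compare_digest(sig, hmac.new(SECRET, payload, hashlib.sha256).digest()):
        raise ValueError("tampered token")
    return json.loads(payload)
```

Because the token size no longer depends on how many vendors or tracking URLs an ad carries, client-side cache growth stays bounded as the business adds integrations.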
This approach future-proofed our systems to handle any growth in the data that needs to pass from ad serving to event handling systems.</p></div></div><div class="pu"><div class="ab ca"><div class="nf pv ng pw nh px ce py cf pz ch bg"><figure class="qd qe qf qg qh pu qi qj paragraph-image"><div role="button" tabindex="0" class="qk ql fk qm bg qn"><div class="qa qb qc"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*0wP5cyAj84Vju7ryabJE1A.png" alt="Storage service between Ad Serving and Reporting" /></picture></div></div><figcaption class="qp fe qq qa qb qr qs be b bf z dt">Fig 2: Storage service between Ad Serving &amp; Reporting</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg hu hv hw hx"><h2 id="f553" class="ox oy in be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">The Evolution</h2><p id="b9ba" class="pw-post-body-paragraph oc od in oe b of pg oh oi oj ph ol om gn pi oo op gq pj or os gt pk ou ov ow ho bj">In January 2024, we decided to invest in an in-house advertising technology platform. This implied that the event processing pipeline had to evolve significantly: attain parity with existing offerings and continue to support new product launches with rapid iterations using the in-house Netflix Ad Server.
This required a re-evaluation of the entire architecture across all Ads engineering teams.</p><p id="83b3" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">First, we made an inventory of the use-cases that would need to be supported through ad events.</p><ol class=""><li id="5134" class="oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow pt pm pn bj">We’d need to start supporting frequency capping in-house for all ads served through the Netflix Ad Server.</li><li id="8187" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj">Incorporate pricing information for impressions to set the stage for billing events, which are used to charge advertisers.</li><li id="2966" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj">A robust reporting system to share campaign reports with advertisers, combined with metrics data collection, helps assess the delivery and effectiveness of the campaign.</li><li id="da68" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj">Scale the event handler to perform tracking-information look-ups across different vendors.</li></ol><p id="c00a" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">Next, we examined upcoming launches, such as Pause/Display ads, to gain deeper insights into our strategic initiatives. We recognized that Display Ads would utilize a distinct logging framework, suggesting that different upstream pipelines might deliver ad telemetry. However, the downstream use-cases were expected to remain largely consistent.
Additionally, by reviewing the goals of our telemetry teams, we saw large initiatives aimed at upgrading the platform, indicating potential future migrations.</p><p id="ac48" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">Keeping the above insights &amp; challenges in mind,</p><ul class=""><li id="3983" class="oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow pl pm pn bj">We planned a centralized ad event collection system. This centralized service would consolidate common operations (token decryption, enrichment, identifier hashing) into a single execution step and provide a single unified data contract to consumers that is highly extensible (e.g., agnostic to ad server &amp; ad media).</li><li id="4f2e" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">We proposed moving all consumers of ad telemetry downstream of the centralized service. This creates a clean separation between upstream systems and consumers in Ads Engineering.</li><li id="ee4f" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">In the initial development phase of our advertising system, a crucial component was the creation of ad sessions based on individual ad events. This system was constructed using ad playback telemetry, which allowed us to gather essential metrics from these ad sessions.
A significant decision in this plan was to position the ad sessionization process downstream of the raw ad events.</li><li id="f367" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">The proposal also recommended migrating all of our Ads data processing pipelines for reporting, analytics, and metrics to consume the data published by the centralized system.</li></ul><p id="ea56" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">Putting together all the components in our vision -</p></div></div><div class="pu"><div class="ab ca"><div class="nf pv ng pw nh px ce py cf pz ch bg"><figure class="qd qe qf qg qh pu qi qj paragraph-image"><div role="button" tabindex="0" class="qk ql fk qm bg qn"><div class="qa qb qt"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*aPL3RHeFEzlw_psaLddWKw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*aPL3RHeFEzlw_psaLddWKw.png" /></picture></div></div><figcaption class="qp fe qq qa qb qr qs be b bf z dt">Fig 3: Ad Event processing pipeline</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg hu hv hw hx"><p id="da0e" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">Key components of the event processing pipeline -</p><p id="66c8" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj"><strong class="oe io">Ads Event Publisher:</strong> This centralized system is responsible for collecting ads telemetry and providing unified ad events to the ads engineering teams. 
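As a rough illustration of such a unified contract (field names here are hypothetical; the post does not publish Netflix's actual schema), the consolidated record and the centralized identifier hashing might look like:

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UnifiedAdEvent:
    """Hypothetical unified ad event emitted by a centralized publisher,
    agnostic to which ad server or ad media produced the raw telemetry."""
    event_type: str              # e.g. "impression", "click", "qr_scan"
    ad_id: str
    campaign_id: str
    event_time_ms: int
    hashed_profile_id: str       # identifiers are hashed once, centrally
    price_cpm: Optional[float] = None        # pricing info for billing events
    vendor_trackers: list = field(default_factory=list)  # third-party tracking URLs


def hash_identifier(raw_id: str, salt: str = "example-salt") -> str:
    """Consolidated hashing step, so downstream consumers never see raw IDs."""
    return hashlib.sha256((salt + ":" + raw_id).encode("utf-8")).hexdigest()
```

Because every consumer reads this one shape, changes to upstream logging frameworks stay contained in the publisher.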
It supports various functions such as measurement, finance/billing, reporting, frequency capping, and maintaining an essential feedback loop back to the ad server.</p><p id="6e2e" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj"><strong class="oe io">Realtime Consumers</strong></p><ol class=""><li id="8bf9" class="oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow pt pm pn bj"><strong class="oe io">Frequency Capping: </strong>This system tracks impressions for each campaign, profile, and any other frequency capping parameters set up for the campaign. It is utilized by the Ad Server during each ad decision to ensure ads are served within frequency limits.</li><li id="0475" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj"><strong class="oe io">Ads Metrics: </strong>This component is a Flink job that transforms raw data into a set of dimensions and metrics, subsequently writing them to an Apache Druid OLAP database. The streaming data is further backed by an offline process that corrects any inaccuracies introduced during streaming ingestion, providing accurate metrics. It provides real-time metrics to assess the delivery health of campaigns and applies budget capping functionality.</li><li id="efd4" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj"><strong class="oe io">Ads Sessionizer: </strong>An Apache Flink job that consolidates all events related to a single ad into an Ad Session. This session provides real-time information about ad playback, offering essential business insights and reporting. 
It is a crucial job that supports all downstream analytical and reporting processes.</li><li id="f42c" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pt pm pn bj"><strong class="oe io">Ads Event Handler: </strong>This service continuously sends information to ad vendors by reading tracking information from ad events, ensuring accurate and timely data exchange.</li></ol><p id="ff42" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj"><strong class="oe io">Billing/Revenue: </strong>These are offline workflows designed to curate impressions, supporting billing and revenue recognition processes.</p><p id="56fe" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj"><strong class="oe io">Ads Reporting &amp; Metrics: </strong>This service powers the reporting module for our account managers and provides a centralized metrics API that helps assess the delivery of a campaign.</p><p id="1868" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">This was a massive multi-quarter effort across different engineering teams. With extensive planning (kudos to our TPM team!) 
and coordination, we were able to iterate fast, build several services, and execute the vision above to power our in-house ads technology platform.</p><h2 id="67f3" class="ox oy in be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Conclusion</h2><p id="7e2d" class="pw-post-body-paragraph oc od in oe b of pg oh oi oj ph ol om gn pi oo op gq pj or os gt pk ou ov ow ho bj">These systems have significantly accelerated our ability to launch new capabilities for the business.</p><ul class=""><li id="6878" class="oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow pl pm pn bj">Through our partnership with Microsoft, Display Ad events were integrated into the new pipeline for reusability, ensuring that all use-cases were covered when launching through Netflix ads systems.</li><li id="abd9" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">Programmatic buying capabilities now support the exchange of numerous trackers and dynamic bid prices on impression events.</li><li id="8aac" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">Sharing opt-out signals helps ensure privacy and compliance with GDPR regulations for the Ads business in Europe, supporting accurate reporting and measurement.</li><li id="5a25" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj">New event types, such as ad clicks and QR code scans, also flow through the pipeline, ensuring all metrics and reporting are tracked consistently.</li></ul><p id="d5ce" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj"><strong class="oe io">Key Takeaways</strong></p><ul class=""><li id="1e7a" class="oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow pl pm pn bj"><strong class="oe io">Strategic, incremental evolution:</strong> The development of our 
ads event processing systems has been a carefully orchestrated journey. Each iteration was meticulously planned to address existing challenges and anticipate future needs, and it relied on teamwork, planning, and coordination across various teams. These pillars have been fundamental to the success of this journey.</li><li id="8b0f" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj"><strong class="oe io">Data contract:</strong> A clear data contract has been pivotal in ensuring consistency in interpretation and interoperability across our systems. By standardizing the data models and establishing a clear data exchange between ad serving and centralized event collection, our teams have been able to iterate at exceptional speed and continue to deliver many launches on time.</li><li id="bb6b" class="oc od in oe b of po oh oi oj pp ol om gn pq oo op gq pr or os gt ps ou ov ow pl pm pn bj"><strong class="oe io">Separation of concerns: </strong>Consumers are relieved of the need to understand each source of ad telemetry or manage updates and migrations. Instead, a centralized system handles these tasks, allowing consumers to focus on their core business logic.</li></ul><p id="27da" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">We have an exciting list of projects on the horizon. These include managing ad events from ads on Netflix live streams, de-duplication processes, and enriching data signals to deliver enhanced reporting and insights. Additionally, we are advancing our Native Ads strategy and integrating Conversion API for improved conversion tracking, among many others.</p><p id="1bbb" class="pw-post-body-paragraph oc od in oe b of og oh oi oj ok ol om gn on oo op gq oq or os gt ot ou ov ow ho bj">This is definitely not a season finale; it’s just the beginning of our journey to create a best-in-class ads technology platform. 
We warmly invite you to share your thoughts and comments with us. If you’re interested in learning more or becoming a part of this innovative journey, <a class="af ha" href="https://jobs.netflix.com/" rel="noopener ugc nofollow" target="_blank">Ads Engineering is hiring</a>!</p><h2 id="f86f" class="ox oy in be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj"><em class="qu">Acknowledgements</em></h2><p id="bf29" class="pw-post-body-paragraph oc od in oe b of pg oh oi oj ph ol om gn pi oo op gq pj or os gt pk ou ov ow ho bj"><em class="qv">A special thanks to our amazing colleagues and teams who helped build our foundational post-impression system: </em><a class="af ha" href="https://www.linkedin.com/in/simonspencer1/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Simon Spencer</em></a><em class="qv">, </em><a class="af ha" href="https://www.linkedin.com/in/priyankaavj/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Priyankaa Vijayakumar,</em></a><a class="af ha" href="https://www.linkedin.com/in/indrajit-roy-choudhury-5b011754/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Indrajit Roy Choudhury</em></a><em class="qv">; Ads TPM team — </em><a class="af ha" href="https://www.linkedin.com/in/sonyabellamy/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Sonya Bellamy</em></a><em class="qv">; the Ad Serving Team —</em><a class="af ha" href="https://www.linkedin.com/in/andrewjsweeney/" rel="noopener ugc nofollow" target="_blank"><em class="qv"> Andrew Sweeney</em></a><em class="qv">, </em><a class="af ha" href="https://www.linkedin.com/in/tim-z-b9112034/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Tim Zheng</em></a>, <a class="af ha" href="https://www.linkedin.com/in/haidongt/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Haidong Tang</em></a><em class="qv"> and </em><a class="af ha" href="https://www.linkedin.com/in/edhbarker/" rel="noopener ugc nofollow" 
target="_blank"><em class="qv">Ed Barker</em></a><em class="qv">; the Ads Data Engineering Team — </em><a class="af ha" href="https://www.linkedin.com/in/sonalisharma/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Sonali Sharma</em></a><em class="qv">, Harsha Arepalli, and </em><a class="af ha" href="https://www.linkedin.com/in/winifredtran/" rel="noopener ugc nofollow" target="_blank"><em class="qv">Wini Tran</em></a><em class="qv">; Product Data Systems — </em><a class="af ha" href="https://www.linkedin.com/in/d3cay/" rel="noopener ugc nofollow" target="_blank"><em class="qv">David Klosowski;</em></a><em class="qv"> and the entire Ads Reporting and Measurement team!</em></p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/behind-the-scenes-building-a-robust-ads-event-processing-pipeline-e4e86caf9249</link>
      <guid>https://netflixtechblog.com/behind-the-scenes-building-a-robust-ads-event-processing-pipeline-e4e86caf9249</guid>
      <pubDate>Fri, 09 May 2025 21:44:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Measuring Dialogue Intelligibility for Netflix Content]]></title>
      <description><![CDATA[<div><div class="ab ca"><div class="ch bg hu hv hw hx"><div class="ih l"><div class="cp ab ii"><div><div class="bl" aria-hidden="false"><div class="ij q ik hz il ii ao"><p class="be b bf z dt">Featured</p></div></div></div></div></div></div></div></div><div class="ho im in io ip"><div class="ab ca"><div class="ch bg hu hv hw hx"><div><div></div><p id="b9e5" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj"><em class="ow">Enhancing Member Experience Through Strategic Collaboration</em></p><p id="1886" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj"><a class="af ha" href="https://www.linkedin.com/in/ozziesutherland/" rel="noopener ugc nofollow" target="_blank">Ozzie Sutherland</a>, <a class="af ha" href="https://www.linkedin.com/in/iroroorife/" rel="noopener ugc nofollow" target="_blank">Iroro Orife</a>, <a class="af ha" href="https://www.linkedin.com/in/chih-wei-wu-73081689/" rel="noopener ugc nofollow" target="_blank">Chih-Wei Wu</a>, <a class="af ha" href="https://www.linkedin.com/in/bhanusrikanth/" rel="noopener ugc nofollow" target="_blank">Bhanu Srikanth</a></p><p id="7f09" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj">At Netflix, delivering the best possible experience for our members is at the heart of everything we do, and we know we can’t do it alone. That’s why we work closely with a diverse ecosystem of technology partners, combining their deep expertise with our creative and operational insights. Together, we explore new ideas, develop practical tools, and push technical boundaries in service of storytelling. This collaboration not only empowers the talented creatives working on our shows with better tools to bring their vision to life, but also helps us innovate in service of our members. 
By building these partnerships on trust, transparency, and shared purpose, we’re able to move faster and more meaningfully, always with the goal of making our stories more immersive, accessible, and enjoyable for audiences everywhere. One area where this collaboration is making a meaningful impact is in improving dialogue intelligibility, from set to screen. We call this the Dialogue Integrity Pipeline.</p><h2 id="4d08" class="ox oy is be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Dialogue Integrity Pipeline</h2><p id="693a" class="pw-post-body-paragraph ob oc is od b oe pg og oh oi ph ok ol gn pi on oo gq pj oq or gt pk ot ou ov ho bj">We’ve all been there, settling in for a night of entertainment, only to find ourselves straining to catch what was just said on screen. You’re wrapped up in the story, totally invested, when suddenly a key line of dialogue vanishes into thin air. “Wait, what did they say? I can’t understand the dialogue! What just happened?”</p><p id="1d72" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj">You may pick up the remote and rewind, turn up the volume, or try to stay with it and hope this doesn’t happen again. Creating sophisticated, modern series and films requires an incredible artistic &amp; technical effort. At Netflix, we strive to ensure those great stories are easy for the audience to enjoy. Dialogue intelligibility can break down at multiple points in what we call the <strong class="od it">Dialogue Integrity Pipeline</strong>, the journey from on-set capture to final playback at home. 
Many facets of the process can contribute to dialogue that’s difficult to understand:</p><ul class=""><li id="6e0e" class="ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov pl pm pn bj">Naturalistic acting styles, diverse speech patterns, and accents</li><li id="ac5c" class="ob oc is od b oe po og oh oi pp ok ol gn pq on oo gq pr oq or gt ps ot ou ov pl pm pn bj">Noisy locations, microphone placement problems on set</li><li id="7a02" class="ob oc is od b oe po og oh oi pp ok ol gn pq on oo gq pr oq or gt ps ot ou ov pl pm pn bj">Cinematic (high dynamic range) mixing styles, excessive dialogue processing, substandard equipment</li><li id="c369" class="ob oc is od b oe po og oh oi pp ok ol gn pq on oo gq pr oq or gt ps ot ou ov pl pm pn bj">Audio compromises through the distribution pipeline</li><li id="71bd" class="ob oc is od b oe po og oh oi pp ok ol gn pq on oo gq pr oq or gt ps ot ou ov pl pm pn bj">TVs with inadequate speakers, noisy home environments</li></ul><p id="746a" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj">Addressing these issues is critical to maintaining the standard of excellence our content deserves.</p><h2 id="d3ae" class="ox oy is be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Measurement at Scale</h2><p id="0617" class="pw-post-body-paragraph ob oc is od b oe pg og oh oi ph ok ol gn pi on oo gq pj oq or gt pk ot ou ov ho bj">Netflix utilizes industry-standard loudness meters to measure content and its adherence to our core loudness specifications. This tool also provides feedback on audio dynamic range (loud to soft) which impacts dialogue intelligibility. 
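As a toy illustration of the kind of numbers such a meter reports (this is not the ITU-R BS.1770 algorithm, which adds K-weighting and gating that we omit here), frame RMS levels and their spread can be computed directly:

```python
import math


def rms_dbfs(frame):
    """RMS level of one audio frame in dBFS (full scale = 1.0)."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-9))


def dynamic_range_db(samples, frame_len=4800):
    """Spread between the loudest and softest frames, in dB.

    A simplified stand-in for the "loud to soft" feedback a loudness
    meter provides; real meters follow ITU-R BS.1770 weighting/gating.
    """
    levels = [rms_dbfs(samples[i:i + frame_len])
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return max(levels) - min(levels)
```

A mix whose soft dialogue sits 20 dB below its loudest effects frames scores a 20 dB spread, which is the kind of signal that flags potential intelligibility trouble on quiet playback systems.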
The Audio Algorithms team at Netflix wanted to take these measurements further and develop a holistic understanding of dialogue intelligibility throughout the runtime of a given title.</p><p id="d830" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj">The team developed a Speech Intelligibility measurement system based on the Short-time Objective Intelligibility (STOI) metric [<a class="af ha" href="https://www.researchgate.net/profile/Cees-Taal/publication/224219052_An_Algorithm_for_Intelligibility_Prediction_of_Time-Frequency_Weighted_Noisy_Speech/links/0deec51da9fbbc5eea000000/An-Algorithm-for-Intelligibility-Prediction-of-Time-Frequency-Weighted-Noisy-Speech.pdf" rel="noopener ugc nofollow" target="_blank">Taal et al.</a> (IEEE <em class="ow">Transactions on Audio, Speech, and Language Processing</em>)]. First, a speech activity detector analyzes the dialogue stem to isolate speech utterances, which are then compared to non-speech sounds in the mix, typically Music and Effects. 
Then the system calculates the Signal-to-Noise ratio, in each speech frequency band, the results of which are summarized succinctly, per-utterance on the range [0, 1.0], to quantify the degree to which competing Music and Effects can distract the listener.</p><figure class="pw px py pz qa qb pt pu paragraph-image"><div role="button" tabindex="0" class="qc qd fk qe bg qf"><div class="pt pu pv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*WSViFfuvT8pcZshi%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*WSViFfuvT8pcZshi%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*WSViFfuvT8pcZshi%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*WSViFfuvT8pcZshi%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*WSViFfuvT8pcZshi%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*WSViFfuvT8pcZshi%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*WSViFfuvT8pcZshi%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*WSViFfuvT8pcZshi 640w, https://miro.medium.com/v2/resize:fit:720/0*WSViFfuvT8pcZshi 720w, https://miro.medium.com/v2/resize:fit:750/0*WSViFfuvT8pcZshi 750w, https://miro.medium.com/v2/resize:fit:786/0*WSViFfuvT8pcZshi 786w, https://miro.medium.com/v2/resize:fit:828/0*WSViFfuvT8pcZshi 828w, https://miro.medium.com/v2/resize:fit:1100/0*WSViFfuvT8pcZshi 1100w, 
https://miro.medium.com/v2/resize:fit:1400/0*WSViFfuvT8pcZshi 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qh fe qi pt pu qj qk be b bf z dt">This chart shows how eSTOI (extended Short-Time Objective Intelligibility) method measures dialogue (fg [foreground] stem in the graphic) against non-speech (bg [background] stem in the graphic) to judge intelligibility based on competing non-speech sound.</figcaption></figure><h2 id="a25d" class="ox oy is be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Optimizing Dialogue Prior to Delivery</h2><p id="b344" class="pw-post-body-paragraph ob oc is od b oe pg og oh oi ph ok ol gn pi on oo gq pj oq or gt pk ot ou ov ho bj">Understanding dialogue intelligibility across Netflix titles is invaluable, but our mission goes beyond analysis — we strive to empower creators with the tools to craft mixes that resonate seamlessly with audiences at home.</p><p id="1547" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj">Seeing the lack of dedicated Dialogue Intelligibility Meter plugins for Digital Audio Workstations, we teamed up with industry leaders, Fraunhofer Institute for Digital Media Technology IDMT (Fraunhofer IDMT) and Nugen Audio to pioneer a solution that enhances creative control and ensures crystal-clear dialogue from mix to final delivery.</p><p id="19ac" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os 
ot ou ov ho bj">We collaborated with Fraunhofer IDMT to adapt their machine-learning-based speech intelligibility solution for cross-platform plugin standards and brought in Nugen Audio to develop DAW-compatible plugins.</p><h2 id="c373" class="ox oy is be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Fraunhofer IDMT</h2><p id="b95b" class="pw-post-body-paragraph ob oc is od b oe pg og oh oi ph ok ol gn pi on oo gq pj oq or gt pk ot ou ov ho bj">The Fraunhofer Department of Hearing, Speech, and Audio Technology HSA has done significant research and development on media processing tools that measure speech intelligibility. In 2020, the machine learning-based method was integrated into Steinberg’s Nuendo Digital Audio Workstation. We approached the Fraunhofer engineering team with a collaboration proposal to make their technology accessible to other audio workstations through the cross-platform VST (Virtual Studio Technology) and AAX (Avid Audio Extension) plugin standards. 
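To make the per-utterance measurement idea from the previous section concrete, here is a deliberately simplified score: band-wise signal-to-noise ratio mapped linearly onto [0, 1]. The thresholds and the linear mapping are our own illustrative choices; the actual STOI/eSTOI metrics use correlation-based comparisons of time-frequency envelopes, and Fraunhofer's library is machine-learning based.

```python
import math


def per_band_snr_db(speech_power, background_power, eps=1e-12):
    """SNR in dB for each speech frequency band (band powers precomputed)."""
    return [10.0 * math.log10((s + eps) / (b + eps))
            for s, b in zip(speech_power, background_power)]


def utterance_score(speech_power, background_power, lo_db=-15.0, hi_db=15.0):
    """Summarize band SNRs into a single [0, 1] intelligibility proxy.

    Bands where speech drowns out the background map toward 1.0; bands
    where music/effects mask the dialogue map toward 0.0.
    """
    scores = []
    for snr in per_band_snr_db(speech_power, background_power):
        scores.append(min(1.0, max(0.0, (snr - lo_db) / (hi_db - lo_db))))
    return sum(scores) / len(scores)
```

Even this crude proxy captures the core behavior: a score near 0 flags utterances where competing Music and Effects are likely to distract the listener.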
The scientists were keen on the project and provided their dialogue intelligibility library.</p><figure class="pw px py pz qa qb pt pu paragraph-image"><div role="button" tabindex="0" class="qc qd fk qe bg qf"><div class="pt pu ql"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*wuapXe2lajcx3tTj%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*wuapXe2lajcx3tTj%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*wuapXe2lajcx3tTj%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*wuapXe2lajcx3tTj%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*wuapXe2lajcx3tTj%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*wuapXe2lajcx3tTj%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*wuapXe2lajcx3tTj%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*wuapXe2lajcx3tTj 640w, https://miro.medium.com/v2/resize:fit:720/0*wuapXe2lajcx3tTj 720w, https://miro.medium.com/v2/resize:fit:750/0*wuapXe2lajcx3tTj 750w, https://miro.medium.com/v2/resize:fit:786/0*wuapXe2lajcx3tTj 786w, https://miro.medium.com/v2/resize:fit:828/0*wuapXe2lajcx3tTj 828w, https://miro.medium.com/v2/resize:fit:1100/0*wuapXe2lajcx3tTj 1100w, https://miro.medium.com/v2/resize:fit:1400/0*wuapXe2lajcx3tTj 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and 
(max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qh fe qi pt pu qj qk be b bf z dt">The Fraunhofer IDMT Dialogue Intelligibility Meter integrated into the Steinberg Nuendo Digital Audio Workstation.</figcaption></figure><h2 id="698d" class="ox oy is be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Nugen Audio</h2><p id="77fb" class="pw-post-body-paragraph ob oc is od b oe pg og oh oi ph ok ol gn pi on oo gq pj oq or gt pk ot ou ov ho bj">Nugen Audio created the VisLM plugin to provide sound teams with an efficient and accurate way to measure mixes for conformance to traditional broadcast &amp; streaming specifications — Full Mix Loudness, Dialogue Loudness, and True Peak. Since then, VisLM has become a widely used tool throughout the global post-production industry. Nugen Audio partnered with Fraunhofer, integrating the Fraunhofer IDMT Dialogue Intelligibility libraries into a new industry-first tool — Nugen DialogCheck. 
This tool gives <strong class="od it">re-recording mixers</strong> real-time insights, helping them adjust dialogue clarity at the most crucial points in the mixing process, ensuring every word is clear and understood.</p><figure class="pw px py pz qa qb pt pu paragraph-image"><div role="button" tabindex="0" class="qc qd fk qe bg qf"><div class="pt pu qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*gGt-DpKR806J2jqT%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*gGt-DpKR806J2jqT%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*gGt-DpKR806J2jqT%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*gGt-DpKR806J2jqT%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*gGt-DpKR806J2jqT%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*gGt-DpKR806J2jqT%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*gGt-DpKR806J2jqT%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*gGt-DpKR806J2jqT 640w, https://miro.medium.com/v2/resize:fit:720/0*gGt-DpKR806J2jqT 720w, https://miro.medium.com/v2/resize:fit:750/0*gGt-DpKR806J2jqT 750w, https://miro.medium.com/v2/resize:fit:786/0*gGt-DpKR806J2jqT 786w, https://miro.medium.com/v2/resize:fit:828/0*gGt-DpKR806J2jqT 828w, https://miro.medium.com/v2/resize:fit:1100/0*gGt-DpKR806J2jqT 1100w, https://miro.medium.com/v2/resize:fit:1400/0*gGt-DpKR806J2jqT 1400w" sizes="(min-resolution: 
4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><h2 id="5c1f" class="ox oy is be oz gj pa dx gk gl pb dz gm gn pc go gp gq pd gr gs gt pe gu gv pf bj">Clearer Dialogue Through Collaboration</h2><p id="3af9" class="pw-post-body-paragraph ob oc is od b oe pg og oh oi ph ok ol gn pi on oo gq pj oq or gt pk ot ou ov ho bj">Crafting crystal-clear dialogue isn’t just a technical challenge — it’s an art that requires continuous innovation and strong industry collaboration. To empower creators, Netflix and its partners are embedding advanced intelligibility measurement tools directly into DAWs, giving sound teams the ability to:</p><ul class=""><li id="9e1e" class="ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov pl pm pn bj">Detect and resolve dialogue clarity issues early in the mix.</li><li id="f1d8" class="ob oc is od b oe po og oh oi pp ok ol gn pq on oo gq pr oq or gt ps ot ou ov pl pm pn bj">Fine-tune speech intelligibility without compromising artistic intent.</li><li id="69a9" class="ob oc is od b oe po og oh oi pp ok ol gn pq on oo gq pr oq or gt ps ot ou ov pl pm pn bj">Deliver immersive, accessible storytelling to every viewer, in any listening environment.</li></ul><p id="ed32" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj">At Netflix, we’re committed to pushing the boundaries of audio excellence. 
From applying the eSTOI (extended short-time objective intelligibility) method to collaborating with Fraunhofer and Nugen Audio on cutting-edge tools like the DialogCheck Plugin, we’re setting a new standard for dialogue clarity — ensuring every word is heard exactly as creators intended. But innovation doesn’t happen in isolation. By working together with our partners, we can continue to push the limits of what’s possible, fueling creativity and driving the future of storytelling.</p><p id="f0f9" class="pw-post-body-paragraph ob oc is od b oe of og oh oi oj ok ol gn om on oo gq op oq or gt os ot ou ov ho bj">Finally, we’d like to extend a heartfelt thanks to Scott Kramer for his contributions to this initiative.</p></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/measuring-dialogue-intelligibility-for-netflix-content-58c13d2a6f6e</link>
      <guid>https://netflixtechblog.com/measuring-dialogue-intelligibility-for-netflix-content-58c13d2a6f6e</guid>
      <pubDate>Thu, 08 May 2025 02:40:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How Netflix Accurately Attributes eBPF Flow Logs]]></title>
      <description><![CDATA[<div><div></div><p id="84ec" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">By <a class="ag hc" href="https://www.linkedin.com/in/chengxie90/" rel="noopener ugc nofollow" target="_blank">Cheng Xie</a>, <a class="ag hc" href="https://www.linkedin.com/in/bryan-shultz-85983829/" rel="noopener ugc nofollow" target="_blank">Bryan Shultz</a>, and <a class="ag hc" href="https://www.linkedin.com/in/christine-xu-1b77191b/" rel="noopener ugc nofollow" target="_blank">Christine Xu</a></p><p id="be87" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">In a previous <a class="ag hc" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-uses-ebpf-flow-logs-at-scale-for-network-insight-e3ea997dca96">blog post</a>, we described how Netflix uses eBPF to capture TCP flow logs at scale for enhanced network insights. In this post, we delve deeper into how Netflix solved a core problem: accurately attributing flow IP addresses to workload identities.</p><h1 id="3fde" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">A Brief Recap</h1><p id="9147" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk"><strong class="oe ip">FlowExporter</strong> is a sidecar that runs alongside all Netflix workloads. It uses eBPF and <a class="ag hc" href="https://www.brendangregg.com/blog/2018-03-22/tcp-tracepoints.html" rel="noopener ugc nofollow" target="_blank">TCP tracepoints</a> to monitor TCP socket state changes. When a TCP socket closes, FlowExporter generates a flow log record that includes the IP addresses, ports, timestamps, and additional socket statistics. 
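A record of this general shape might be emitted per closed socket (field names are illustrative; the post does not publish FlowExporter's actual schema):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Sketch of a flow log record generated when a TCP socket closes."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    open_ts_ms: int        # when the socket was opened
    close_ts_ms: int       # when the socket closed (record emission time)
    bytes_sent: int
    bytes_received: int
    retransmits: int       # one example of "additional socket statistics"


def duration_ms(flow: FlowRecord) -> int:
    """Connection lifetime derived from the two timestamps."""
    return flow.close_ts_ms - flow.open_ts_ms
```

Note that the record carries only IP addresses, not workload identities; attributing those addresses to workloads is the problem the rest of the post addresses.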
On average, 5 million records are produced per second.</p><p id="4fce" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">In cloud environments, IP addresses are reassigned to different workloads as workload instances are created and terminated, so IP addresses alone cannot provide insights on which workloads are communicating. To make the flow logs useful, each IP address must be attributed to its corresponding workload identity. <strong class="oe ip">FlowCollector</strong>, a backend service, collects flow logs from FlowExporter instances across the fleet, attributes the IP addresses, and sends these attributed flows to Netflix’s <a class="ag hc" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873">Data Mesh</a> for subsequent stream and batch processing.</p><p id="6fbd" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">The eBPF flow logs provide a comprehensive view of service topology and network health across Netflix’s extensive microservices fleet, regardless of the programming language, RPC mechanism, or application-layer protocol used by individual workloads.</p><h1 id="a716" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">The Problem with Misattribution</h1><p id="21b9" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">Accurately attributing flow IP addresses to workload identities has been a significant challenge since our eBPF flow logs were introduced.</p><p id="413b" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">As noted in our previous blog post, our initial attribution approach relied on <a class="ag hc" 
href="https://youtu.be/8C9xNVYbCVk?si=Mqic7typcyB-v3JR&amp;t=1687" rel="noopener ugc nofollow" target="_blank">Sonar</a>, an internal IP address tracking service that emits an event whenever an IP address in Netflix’s AWS VPCs is assigned or unassigned to a workload. FlowCollector consumes a stream of IP address change events from Sonar and uses this information to attribute flow IP addresses in real time.</p><figure class="qb qc qd qe qf qg py pz paragraph-image"><img alt="" class="bh fx ql c" width="700" height="217" src="https://miro.medium.com/v2/resize:fit:1400/0*QIn-JibEFM2CLans" /></figure><p id="fd88" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">The fundamental drawback of this method is that it can lead to misattribution. Delays and failures are inevitable in distributed systems, and they may prevent IP address change events from reaching FlowCollector promptly. For instance, an IP address may initially be assigned to workload X but later reassigned to workload Y. However, if the change event for this reassignment is delayed, FlowCollector will continue to assume that the IP address belongs to workload X, resulting in misattributed flows. Additionally, event timestamps may be inaccurate depending on how they are captured.</p><p id="7a27" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">Misattribution rendered the flow data unreliable for decision-making. Users often depended on flow logs to validate workload dependencies, but misattribution created confusion.
Without expert knowledge of expected dependencies, users would struggle to identify or confirm misattribution. Moreover, misattribution occurred often for critical services with a large footprint due to frequent IP address changes. Overall, misattribution made fleet-wide dependency analysis impractical.</p><p id="29df" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">As a workaround, we made FlowCollector hold received flows for 15 minutes before attribution, allowing time for delayed IP address change events. While this approach reduced misattribution, it did not eliminate it. Moreover, the waiting period made the data less fresh, reducing its utility for real-time analysis.</p><p id="200e" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">Fully eliminating misattribution is crucial because it only takes a single misattributed flow to produce an incorrect workload dependency. Solving this problem required a complete rethinking of our approach. Over the past year, Netflix developed a new attribution method that has finally eliminated misattribution, as detailed in the rest of this post.</p><h1 id="276f" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">Attributing Local IP Addresses</h1><p id="994e" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">Each socket has two IP addresses: a local IP address and a remote IP address. Previously, we used the same method to attribute both. However, attributing the local IP address should be a simpler task since the local IP address belongs to the instance where FlowExporter captures the socket.
Therefore, FlowExporter should determine the local workload identity from its environment and attribute the local IP address before sending the flow to FlowCollector.</p><p id="da55" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">This is straightforward for workloads running directly on EC2 instances, as Netflix’s <a class="ag hc" href="https://www.youtube.com/watch?v=-mmOT9I6JlY" rel="noopener ugc nofollow" target="_blank">Metatron</a> provisions workload identity certificates to each EC2 instance at boot time. FlowExporter can simply read these certificates from the local disk to determine the local workload identity.</p><p id="81e0" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">Attributing local IP addresses for container workloads running on Netflix’s container platform, <a class="ag hc" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436">Titus</a>, is more challenging. FlowExporter runs at the container host level, where each host manages multiple container workloads with different identities. When FlowExporter’s eBPF programs receive a socket event from TCP tracepoints in the kernel, the socket may have been created by one of the container workloads or by the host itself. Therefore, FlowExporter must determine which workload to attribute the socket’s local IP address to. To solve this problem, we leveraged <a class="ag hc" href="https://www.youtube.com/watch?v=fmUM9bMoCNE" rel="noopener ugc nofollow" target="_blank">IPMan</a>, Netflix’s container IP address assignment service. IPManAgent, a daemon running on every container host, is responsible for assigning and unassigning IP addresses. 
As container workloads are launched, IPManAgent writes an IP-address-to-workload-ID mapping to an eBPF map, which FlowExporter’s eBPF programs can then use to look up the workload ID associated with a socket’s local IP address.</p><figure class="qb qc qd qe qf qg py pz paragraph-image"><img alt="" class="bh fx ql c" width="700" height="438" src="https://miro.medium.com/v2/resize:fit:1400/0*fyfZ6m2NrMq1NgRQ" /></figure><p id="9514" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">Another challenge was to accommodate Netflix’s <a class="ag hc" href="https://lpc.events/event/11/contributions/932/attachments/908/1764/LPC%202021_%20Talking%20IPv6%20to%20IPv4%20Without%20NAT_2.pdf" rel="noopener ugc nofollow" target="_blank">IPv6 to IPv4 translation mechanism</a> on Titus. To facilitate IPv6 migration, Netflix developed a mechanism that enables IPv6-only containers to communicate with IPv4 destinations without incurring NAT64 overhead. This mechanism intercepts connect syscalls and replaces the underlying socket with one that uses a shared IPv4 address assigned to the container host. This confuses FlowExporter because the kernel reports the same local IPv4 address for sockets created by different container workloads. To disambiguate, local port information is additionally required. We modified Titus to write a mapping of (local IPv4 address, local port) to the workload ID into an eBPF map whenever a connect syscall is intercepted.
FlowExporter’s eBPF programs then use this map to correctly attribute sockets created by the translation mechanism.</p><p id="9cd6" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">With these problems solved, we can now accurately attribute the local IP address of every flow.</p><h1 id="d60f" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">Attributing Remote IP Addresses</h1><p id="0edb" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">Once the local IP address attribution problem is solved, accurately attributing remote IP addresses becomes feasible. Now, each flow reported by FlowExporter includes the local IP address, the local workload identity, and connection start/end timestamps. As FlowCollector receives these flows, it can learn the time ranges during which each workload owns a given IP address. For instance, if FlowCollector sees a flow with local IP address 10.0.0.1 associated with workload X that starts at t1 and ends at t2, it can deduce that 10.0.0.1 belonged to workload X from t1 to t2. Since Netflix uses <a class="ag hc" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html" rel="noopener ugc nofollow" target="_blank">Amazon Time Sync</a> across its fleet, the timestamps (captured by FlowExporter) are reliable.</p><p id="7cd7" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">The FlowCollector service cluster consists of many nodes. Every node must be capable of attributing arbitrary remote IP addresses and, therefore, requires knowledge of all workload IP addresses and their recent ownership records. 
To represent this knowledge, each node maintains an in-memory hashmap that maps an IP address to a list of time ranges, as illustrated by the following Go structs:</p><pre class="qb qc qd qe qf qm qn qo bp qp bb bk">type IPAddressTracker struct {<br />    ipToTimeRanges map[netip.Addr]timeRanges<br />}<br /><br />type timeRanges []timeRange<br /><br />type timeRange struct {<br />    workloadID   string<br />    start        time.Time<br />    end          time.Time<br />}</pre><p id="5dda" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">To populate the hashmap, FlowCollector extracts the local IP address, local workload identity, start time, and end time from each received flow and creates/extends the corresponding time ranges in the map. The time ranges for each IP address are sorted in ascending order, and they are non-overlapping since an IP address cannot belong to two different workloads simultaneously.</p><p id="c274" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">Since each flow is only sent to one FlowCollector node, each node must share the time ranges it learned from received flows with other nodes. We implemented a broadcasting mechanism using Kafka, where each node publishes learned time ranges to all other nodes. Although more efficient broadcasting implementations exist, the Kafka-based approach is simple and has worked well for us.</p><p id="68ab" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">Now, FlowCollector can attribute remote IP addresses by looking them up in the populated map, which returns a list of time ranges. It then uses the flow’s start timestamp to determine the corresponding time range and associated workload identity. If the start time does not fall within any time range, FlowCollector will retry after a delay, eventually giving up if the retry fails.
Such failures may occur when flows are lost or broadcast messages are delayed. For our use cases, it is acceptable to leave a small percentage of flows unattributed, but any misattribution is unacceptable.</p><p id="1a45" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">This new method achieves accurate attribution because each reported flow acts as a continuous heartbeat, carrying a reliable time range of IP address ownership. It handles transient issues gracefully — a few delayed or lost heartbeats do not lead to misattribution. In contrast, the previous method relied solely on discrete IP address assignment and unassignment events. Lacking heartbeats, it had to presume an IP address remained assigned until notified otherwise (which could be hours or days later), making it vulnerable to misattribution when the notifications were delayed.</p><figure class="qb qc qd qe qf qg py pz paragraph-image"><img alt="" class="bh fx ql c" width="700" height="520" src="https://miro.medium.com/v2/resize:fit:1400/0*o8tJzaxRlWDBIYBS" /></figure><p id="704c" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">One detail is that when FlowCollector receives a flow, it cannot attribute its remote IP address right away, because it needs the latest observed time ranges for the remote IP address. Since FlowExporter reports flows in batches every minute, FlowCollector must wait until it receives the flow batch covering the last minute from the remote workload’s FlowExporter, which may not have arrived yet. To address this, FlowCollector temporarily stores received flows on disk for one minute before attributing their remote IP addresses.
This introduces a 1-minute delay, but it is much shorter than the 15-minute delay with the previous approach.</p><p id="827b" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">In addition to producing accurate attribution, the new method is also cost-effective thanks to its simplicity and in-memory lookups. Because the in-memory state can be quickly rebuilt when a FlowCollector node starts up, no persistent storage is required. With 30 c7i.2xlarge instances, we can process 5 million flows per second across the entire Netflix fleet.</p><h1 id="7687" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">Attributing Cross-Regional IP Addresses</h1><p id="26ad" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">For simplicity, we have so far glossed over one topic: regionalization. Netflix’s cloud microservices operate across multiple AWS regions. To optimize flow reporting and minimize cross-regional traffic, a FlowCollector cluster runs in each major region, and FlowExporter agents send flows to their corresponding regional FlowCollector. When FlowCollector receives a flow, its local IP address is guaranteed to be within the region.</p><p id="cdfc" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">To minimize cross-region traffic, the broadcasting mechanism is limited to FlowCollector nodes within the same region. Consequently, the IP address time ranges map contains only IP addresses from that region. However, cross-regional flows have a remote IP address in a different region. To attribute these flows, the receiving FlowCollector node forwards them to nodes in the corresponding region. FlowCollector determines the region for a remote IP address by looking up a trie built from all Netflix VPC CIDRs. 
This approach is more efficient than broadcasting IP address time range updates across all regions, as only 1% of Netflix flows are cross-regional.</p><h1 id="f8f3" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">Attributing Non-Workload IP Addresses</h1><p id="5a96" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">So far, FlowCollector can accurately attribute IP addresses belonging to Netflix’s cloud workloads. However, not all flow IP addresses fall into this category. For instance, a significant portion of flows goes through AWS ELBs. For these flows, their remote IP addresses are associated with the ELBs, where we cannot run FlowExporter. Consequently, FlowCollector cannot determine their identities by simply observing the received flows. To attribute these remote IP addresses, we continue to use IP address change events from Sonar, which crawls AWS resources to detect changes in IP address assignments. Although this data stream may contain inaccurate timestamps and be delayed, misattribution is not a main concern since ELB IP address reassignment occurs very infrequently.</p><h1 id="4953" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">Verifying Correctness</h1><p id="6cc9" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">Verifying that the new method has eliminated misattribution is challenging due to the lack of a definitive source of truth for workload dependencies to validate flow logs against; the flow logs themselves are intended to serve as this source of truth, after all. To build confidence, we analyzed the flow logs of a large service with well-understood dependencies. 
A large footprint is necessary, as misattribution is more prevalent in services with numerous instances, and there must be a reliable method to determine the dependencies for this service without relying on flow logs.</p><p id="c413" class="pw-post-body-paragraph oc od io oe b of og oh oi oj ok ol om gp on oo op gs oq or os gv ot ou ov ow hp bk">Netflix’s cloud gateway, <a class="ag hc" href="https://github.com/Netflix/zuul" rel="noopener ugc nofollow" target="_blank">Zuul</a>, served this purpose perfectly due to its extensive footprint (handling all cloud ingress traffic), its large number of downstream dependencies, and our ability to derive its dependencies from its routing configurations as the source of truth for comparison with flow logs. We found no misattribution for flows through Zuul over a two-week window. This provided strong confidence that the new attribution method has eliminated misattribution. In the previous approach, approximately 40% of Zuul’s dependencies reported by the flow logs were misattributed.</p><h1 id="57d4" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">Conclusion</h1><p id="bb9c" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">With misattribution solved, eBPF flow logs now deliver dependable, fleet-wide insights into Netflix’s service topology and network health. 
This advancement unlocks numerous exciting opportunities in areas such as service dependency auditing, security analysis, and incident triage, while helping Netflix engineers develop a better understanding of our ever-evolving distributed systems.</p><h1 id="4224" class="ox oy io bf oz pa pb pc gm pd pe pf go pg ph pi pj pk pl pm pn po pp pq pr ps bk">Acknowledgments</h1><p id="28be" class="pw-post-body-paragraph oc od io oe b of pt oh oi oj pu ol om gp pv oo op gs pw or os gv px ou ov ow hp bk">We would like to thank <a class="ag hc" href="https://www.linkedin.com/in/mdubcovsky/" rel="noopener ugc nofollow" target="_blank">Martin Dubcovsky</a>, <a class="ag hc" href="https://www.linkedin.com/in/joannekoong/" rel="noopener ugc nofollow" target="_blank">Joanne Koong</a>, <a class="ag hc" href="https://www.linkedin.com/in/troshko/" rel="noopener ugc nofollow" target="_blank">Taras Roshko</a>, <a class="ag hc" href="https://www.linkedin.com/in/nabilschear/" rel="noopener ugc nofollow" target="_blank">Nabil Schear</a>, <a class="ag hc" href="https://www.linkedin.com/in/jacobmeyers35/" rel="noopener ugc nofollow" target="_blank">Jacob Meyers</a>, <a class="ag hc" href="https://www.linkedin.com/in/parshap/" rel="noopener ugc nofollow" target="_blank">Parsha Pourkhomami</a>, <a class="ag hc" href="https://www.linkedin.com/in/hechaoli/" rel="noopener ugc nofollow" target="_blank">Hechao Li</a>, <a class="ag hc" href="https://www.linkedin.com/in/donavanfritz/" rel="noopener ugc nofollow" target="_blank">Donavan Fritz</a>, <a class="ag hc" href="https://www.linkedin.com/in/rob-gulewich-0335b52/" rel="noopener ugc nofollow" target="_blank">Rob Gulewich</a>, <a class="ag hc" href="https://www.linkedin.com/in/amanda-li-410286166/" rel="noopener ugc nofollow" target="_blank">Amanda Li</a>, <a class="ag hc" href="https://www.linkedin.com/in/jdsalem/" rel="noopener ugc nofollow" target="_blank">John Salem</a>, <a class="ag hc" href="https://www.linkedin.com/in/haananth/" 
rel="noopener ugc nofollow" target="_blank">Hariharan Ananthakrishnan</a>, <a class="ag hc" href="https://www.linkedin.com/in/joshmachine/" rel="noopener ugc nofollow" target="_blank">Keerti Lakshminarayan</a>, and other stunning colleagues for their feedback, inspiration, and contributions to the success of this effort.</p></div>]]></description>
      <link>https://netflixtechblog.com/how-netflix-accurately-attributes-ebpf-flow-logs-afe6d644a3bc</link>
      <guid>https://netflixtechblog.com/how-netflix-accurately-attributes-ebpf-flow-logs-afe6d644a3bc</guid>
      <pubDate>Tue, 08 Apr 2025 19:50:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Globalizing Productions with Netflix’s Media Production Suite]]></title>
<description><![CDATA[<div><div></div><p id="5f94" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><a class="ag ih" href="https://www.linkedin.com/in/jesse-korosi-44790985/" rel="noopener ugc nofollow" target="_blank"><strong class="pd ju">Jesse Korosi</strong></a>, <a class="ag ih" href="https://www.linkedin.com/in/thijsvdkamp/" rel="noopener ugc nofollow" target="_blank"><strong class="pd ju">Thijs van de Kamp</strong></a><strong class="pd ju">, </strong><a class="ag ih" href="https://www.linkedin.com/in/mayralvega/" rel="noopener ugc nofollow" target="_blank"><strong class="pd ju">Mayra Vega</strong></a>, <a class="ag ih" href="https://www.linkedin.com/in/laurafuturo/" rel="noopener ugc nofollow" target="_blank"><strong class="pd ju">Laura Futuro</strong></a>, <a class="ag ih" href="https://www.linkedin.com/in/margoline/" rel="noopener ugc nofollow" target="_blank"><strong class="pd ju">Anton Margoline</strong></a></p><p id="4329" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">The journey from script to screen is full of challenges in the ever-evolving world of film and television. The industry has always innovated, and over the last decade, it started moving towards cloud-based workflows. However, unlocking cloud innovation and all its benefits on a global scale has proven to be difficult. The opportunity is clear: streamline complex media management logistics, eliminate tedious, non-creative task-based work and enable productions to focus on what matters most, creative storytelling. 
With these challenges in mind, Netflix has developed a suite of tools by filmmakers for filmmakers: the Media Production Suite (MPS).</p><figure class="pz qa qb qc qd qe pw px paragraph-image"><div role="button" tabindex="0" class="qf qg gs qh bh qi"><div class="pw px py"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*CUGxiNprXnLcOmhI%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*CUGxiNprXnLcOmhI%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*CUGxiNprXnLcOmhI%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*CUGxiNprXnLcOmhI%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*CUGxiNprXnLcOmhI%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*CUGxiNprXnLcOmhI%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*CUGxiNprXnLcOmhI%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*CUGxiNprXnLcOmhI 640w, https://miro.medium.com/v2/resize:fit:720/0*CUGxiNprXnLcOmhI 720w, https://miro.medium.com/v2/resize:fit:750/0*CUGxiNprXnLcOmhI 750w, https://miro.medium.com/v2/resize:fit:786/0*CUGxiNprXnLcOmhI 786w, https://miro.medium.com/v2/resize:fit:828/0*CUGxiNprXnLcOmhI 828w, https://miro.medium.com/v2/resize:fit:1100/0*CUGxiNprXnLcOmhI 1100w, https://miro.medium.com/v2/resize:fit:1400/0*CUGxiNprXnLcOmhI 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 
700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh hc qj c" width="700" height="210" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="ca35" class="qk ql jt bf qm qn qo qp hp qq qr qs hr qt qu qv qw qx qy qz ra rb rc rd re rf bk"><strong class="am">What are we solving for?</strong></h1><p id="0953" class="pw-post-body-paragraph pb pc jt pd b pe rg pg ph pi rh pk pl hs ri pn po hv rj pq pr hy rk pt pu pv iu bk">Significant time and resources are devoted to managing media logistics throughout the production lifecycle. An average Netflix title produces around 200 terabytes of Original Camera Files (OCF), with outliers up to 700 terabytes, not including any work-in-progress files, VFX assets, 3D assets, etc. The data produced on set is traditionally copied to physical tape stock like LTO. This workflow has long been considered the industry norm and may be cost-effective, but it comes with trade-offs. Aside from needing to physically ship and track all movement of the tape stock, storing media on physical tape makes it harder to search, play, and share media assets, slowing access to production media when it is needed, especially when titles need to collaborate with talent and vendors all over the world.</p><p id="8c3a" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Even when workflows are fully digital, the distribution of media between multiple departments and vendors can still be challenging.
A lack of automation and standardization often results in a labour-intensive process across post-production and VFX, with many dependencies that introduce potential human error and security risks. Many productions utilize a large variety of vendors, making this collaboration a large technical puzzle. As file sizes grow and workflows become more complex, these issues are magnified, leading to inefficiencies that slow down post-production and reduce the available time spent on creative work.</p><p id="beaa" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Moving media into the cloud introduces new challenges, as production and post-production must ramp up to meet the operational and technological hurdles this poses. For some post-production facilities, it’s not uncommon to see a wall of portable hard drives, with media being hand-carried between vendors because no alternatives are available. The need for a centralized, cloud-based solution that transcends these barriers is more pressing than ever. The result is a willingness to embrace new and innovative ideas, even exploratory ones, and to introduce drastic workflow changes to productions in pursuit of creative evolution.</p><p id="a501" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">At Netflix, we believe that great stories can come from anywhere, but we have seen that technical limitations in traditional workflows reduce access to media and restrict filmmakers’ access to talent. Besides the need for robust cloud storage for their media, artists need access to powerful workstations and real-time playback.
Depending on the market or production budget, cutting-edge technology might not be available or affordable.</p><p id="1726" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">What if we started charting a course to break free from many of these technical limitations and found ways to enhance creativity? Industry trade shows like the International Broadcasting Convention (IBC) and the National Association of Broadcasters Show (NAB) highlight a strong global trend: instead of bringing media to the artist and applications (the traditional workflow), the industry is shifting toward bringing people and applications to the media (cloud workflows and remote workstations). The concept of cloud-based workflows is not new, as many technology leaders in our industry have been experimenting in this space for more than a decade. However, executing this vision at Netflix scale, with hundreds of titles a year, has not been done before…</p><h1 id="4f2c" class="qk ql jt bf qm qn qo qp hp qq qr qs hr qt qu qv qw qx qy qz ra rb rc rd re rf bk"><strong class="am">The challenge of building a global technology to solve this</strong></h1><p id="8b30" class="pw-post-body-paragraph pb pc jt pd b pe rg pg ph pi rh pk pl hs ri pn po hv rj pq pr hy rk pt pu pv iu bk">Building solutions at a global scale poses significant challenges. The art of making movies and series lacks equal access to technology, best practices, and global standardization. Different countries are at different phases of innovation based on local needs and nuances. While some regions boast over a century of cinematic history and have a strong industry, others are just beginning to carve their niche.
This vast gap presents a unique challenge: developing global technology that caters to both established and emerging markets, each with distinct languages and workflows.</p><p id="a919" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">The wide diversity of needs among talent and vendors globally creates a standardization challenge, one that becomes apparent when productions draw on a global talent pool. Many mature post-production and VFX facilities have built scripts and automation that flow between various artists and personnel within their facility, allowing a more streamlined workflow, even though the customization is time-consuming. For example, transcodes or transcriptions may run automatically when files are dropped in a hot folder, with the expectation that certain sidecar metadata files will accompany them in a specific organizational structure. Embracing and integrating new workflows introduces the fear of disrupting a well-established process, adding pressure on vendors’ profit margins. Small workflow changes that may seem arbitrary can actually have a large impact on vendors. Therefore, innovation must provide meaningful benefits to a title in order to be adopted at scale. Reliability, a proven track record, strong support, and an incredibly low tolerance for bugs or issues are top of mind in well-established markets.</p><p id="c8c2" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">In developing this suite, we recognized the necessity of addressing the vast array of titles that flow through Netflix without the luxury of expanding into a massive operational entity. Consequently, automation became imperative. The intricacies of color and framing management, along with deliverables, must be seamlessly controlled and effortlessly managed by the user, without the need for manual intervention.
Therefore, we cannot rely on humans configuring JSON files behind the scenes to map camera formats into deliverables. By embracing open standards, we not only streamline these processes but also facilitate smoother collaboration across diverse markets and countries, ensuring that our global productions can operate with unparalleled efficiency and cohesion. To that end, we’ve decided to lean heavily into standards like <a class="ag ih" href="https://www.oscars.org/science-technology/sci-tech-projects/aces" rel="noopener ugc nofollow" target="_blank">ACES</a>, <a class="ag ih" href="https://acescentral.com/knowledge-base-2/when-is-amf-used/" rel="noopener ugc nofollow" target="_blank">AMF</a>, <a class="ag ih" href="https://theasc.com/society/ascmitc/asc-media-hash-list" rel="noopener ugc nofollow" target="_blank">ASC MHL</a>, <a class="ag ih" href="https://theasc.com/society/ascmitc/asc-framing-decision-list" rel="noopener ugc nofollow" target="_blank">ASC FDL</a>, and <a class="ag ih" href="https://github.com/OpenTimelineIO" rel="noopener ugc nofollow" target="_blank">OTIO</a>: ACES and AMF for color pipeline management, ASC MHL for file management and verification, ASC FDL for framing interoperability, and OTIO for timeline interchange. Leaning into standards like these means that many things can be automated at scale and, more importantly, that high-complexity workflows can be offered to markets or shows that don’t normally have access to them. As an example, consider a show shot on various camera formats, each framed and recorded at a different resolution, with different lenses and different safeties on each frame. The task of normalizing all of these for a VFX vendor into one common container with a normalized center-extracted frame is often only offered to very high-end titles, because it takes a human behind the curtain to create all of these mappings.
But by leaning into a standard like the FDL, this can now easily be automated, with control over these mappings put directly in the hands of users.</p><h1 id="56f4" class="qk ql jt bf qm qn qo qp hp qq qr qs hr qt qu qv qw qx qy qz ra rb rc rd re rf bk"><strong class="am">Our Answer — Content Hub’s Media Production Suite (MPS)</strong></h1><figure class="pz qa qb qc qd qe"><div class="rl cr l gs"><figcaption class="ro go rp pw px rq rr bf b bg z cm">Introducing Content Hub Media Production Suite video</figcaption></div></figure><p id="e1a8" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Building a globally scalable solution that could be utilized across a diversity of markets has been an exciting challenge. We set out to provide customizable and feature-rich tooling for advanced users while remaining intuitive and streamlined enough for less experienced filmmakers. With collaboration from Netflix teams, vendors, and talent across the globe, we’ve taken a bold step forward in enabling a suite of tools inside Netflix Content Hub that democratizes technology: the Media Production Suite. By leveraging our economies of scale and access to resources, we can now unlock global talent pools for our productions, drastically reduce non-creative task-based work, streamline workflows, and level the playing field between our markets, ultimately maximizing the time available for what matters most: creative work!</p><h1 id="657d" class="qk ql jt bf qm qn qo qp hp qq qr qs hr qt qu qv qw qx qy qz ra rb rc rd re rf bk">So what is it?</h1><p id="2737" class="pw-post-body-paragraph pb pc jt pd b pe rg pg ph pi rh pk pl hs ri pn po hv rj pq pr hy rk pt pu pv iu bk">1.
<strong class="pd ju">Netflix Hybrid Infrastructure</strong>: Netflix has invested in a hybrid infrastructure, a mix of cloud-based and physically distributed capabilities operating in multiple locations across the world and close to our productions to optimize user performance. This infrastructure is available for Netflix shows and is foundational to Content Hub’s Media Production Suite tooling. Local storage and compute services are connected through the Netflix Open Connect network (Netflix’s content delivery network) to the infrastructure of Amazon Web Services (AWS). The system handles large volumes of camera and sound media and is built for speed. To ensure that productions have sufficient upload speeds to get their media into the cloud, Netflix has started to roll out Content Hub Ingest Centers globally to provide high-speed internet connectivity where required. With all media centralized, MPS eliminates the need for physical media transport and reduces the risk of human error. This approach not only streamlines operations but also enhances security and accessibility.</p><p id="507e" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">2.
<strong class="pd ju">Automation and Tooling</strong>: In addition to the Netflix Hybrid infrastructure layer, MPS consists of a suite of tools that tap into the media in the Netflix ecosystem.</p><p id="8ae6" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Footage Ingest</strong> — An application that allows users to upload media/files into Content Hub.</p><p id="e3f8" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Media Library</strong> — A central library that allows users to search, preview, share, and download media.</p><p id="7864" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Dailies</strong> — A workflow, backed by an operational team, offering automated quality control of your footage, sound sync, color application, rendering, and delivery of dailies directly to editorial.</p><p id="4061" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Remote Workstations</strong> — Offering access to remote editorial workstations and storage for post-production needs.</p><p id="0465" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">VFX Pulls</strong> — An automated method for converting and delivering visual effects plates, associated color, and framing files to VFX vendors.</p><p id="dc0b" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Conform Pulls</strong> — An automated method for consolidating, trimming, and delivering all OCF to picture-finishing vendors.</p><p id="0506" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm
pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Media Downloader</strong> — An automated download tool that initiates a download once media has been made available in the Netflix cloud.</p><p id="8a17" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">While the individual tools within MPS are at different stages of maturity, over 350 titles have made use of at least one of the tools noted above. Input has been gathered from all over the world throughout development, with users across UCAN (United States/Canada), EMEA (Europe, Middle East, and Africa), SEA (South East Asia), LATAM (Latin America), and APAC (Asia Pacific).</p><h1 id="c95b" class="qk ql jt bf qm qn qo qp hp qq qr qs hr qt qu qv qw qx qy qz ra rb rc rd re rf bk"><strong class="am">Senna: Early Adoption and Insightful Feedback Driving MPS Evolution</strong></h1><figure class="pz qa qb qc qd qe pw px paragraph-image"><div role="button" tabindex="0" class="qf qg gs qh bh qi"><div class="pw px rs"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*NLeFw4FDGx2jsZg7%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*NLeFw4FDGx2jsZg7%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*NLeFw4FDGx2jsZg7%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*NLeFw4FDGx2jsZg7%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*NLeFw4FDGx2jsZg7%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*NLeFw4FDGx2jsZg7%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*NLeFw4FDGx2jsZg7%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and
(max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*NLeFw4FDGx2jsZg7 640w, https://miro.medium.com/v2/resize:fit:720/0*NLeFw4FDGx2jsZg7 720w, https://miro.medium.com/v2/resize:fit:750/0*NLeFw4FDGx2jsZg7 750w, https://miro.medium.com/v2/resize:fit:786/0*NLeFw4FDGx2jsZg7 786w, https://miro.medium.com/v2/resize:fit:828/0*NLeFw4FDGx2jsZg7 828w, https://miro.medium.com/v2/resize:fit:1100/0*NLeFw4FDGx2jsZg7 1100w, https://miro.medium.com/v2/resize:fit:1400/0*NLeFw4FDGx2jsZg7 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh hc qj c" width="700" height="402" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="ro go rp pw px rq rr bf b bg z cm"><em class="ku">Media from the Brazilian-produced series ‘Senna’ being reviewed in MPS</em></figcaption></figure><p id="c6d3" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">The Brazilian-produced series <em class="rt">Senna</em>, which follows the life of legendary Formula 1 driver Ayrton Senna, utilized MPS to reshape their content creation workflow, overcome geographical barriers, and unlock innovation to support world-class storytelling for a global audience. 
<em class="rt">Senna</em> is a groundbreaking series, not just for its storytelling but for its production journey across Argentina, Uruguay, Brazil, and the United Kingdom. Editorial teams were spread across Porto Alegre and Spain, and VFX studios collaborated across locations in Brazil, Canada, the United States, and India, all orchestrated by our subsidiary Scanline VFX. The series exemplifies the global nature of modern filmmaking and was the perfect fit for Netflix’s new Content Hub Media Production Suite (MPS).</p><p id="2fcb" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">At the heart of <em class="rt">Senna’s</em> workflow orchestration is MPS. While each of the tools within MPS is based on an opt-in model, in order to use many of the downstream services, the first step is ensuring that the original camera files (OCF) and original sound files (OSF) are uploaded. “<em class="rt">We knew we were going to shoot in different places,</em>” said Post Supervisor Gabriel Queiroz, <em class="rt">“to have all this material cloud-based, it’s definitely one of the most important things for us. It would be hard to bring all this media physically from Argentina or wherever to Brazil. It will take us a lot of time.”</em> With <em class="rt">Senna</em> shooting across locations, the ability to upload their OCF and OSF meant production no longer had to shuttle hard drives on airplanes, create LTO tapes, or manage physical shipments of their negative. And yes, you read that correctly: when utilizing MPS, we don’t require LTO tapes to be written unless there are title-specific needs.</p><p id="838c" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">With <em class="rt">Senna</em> beginning production back in June of 2023, our investment in MPS was still in its very early stages, and the tooling was considered beta.
However, with the help, feedback, and partnership of this production, we quickly realized that the investment was worth doubling down on. Since the early version used on <em class="rt">Senna</em>, Netflix has been spinning up ingest centers around the world, where drives can be dropped off and, within a matter of hours, all original camera files are uploaded into the Netflix ecosystem. While creating the ability to upload is not a novel concept, behind the scenes, it’s far from simple. Once a drive has been plugged in and our Netflix Footage Ingest application is opened, a validation is run to ensure all expected media from set is on the drive. After media has been uploaded and checksums have been run to validate media integrity, all media is inspected, metadata is extracted, and assets are created for viewing, sharing, and downloading with playable proxies. All media is then automatically backed up to a second tier of cloud-based storage for the final archive.</p><p id="e51e" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Traditionally, if you wanted to check with your post vendor on the status of any of these media management steps, or confirm whether on-set camera cards could be cleared before receiving a completion notification, you would have to pick up the phone and call them.
For <em class="rt">Senna</em>, anyone who wanted visibility on progress simply logged in to Content Hub and could see any activity in the Footage Ingest dashboard, as well as look up any information needed on past uploads.</p><figure class="pz qa qb qc qd qe pw px paragraph-image"><div role="button" tabindex="0" class="qf qg gs qh bh qi"><div class="pw px rs"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*7gaSH8-YpnOTKnpu%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*7gaSH8-YpnOTKnpu%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*7gaSH8-YpnOTKnpu%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*7gaSH8-YpnOTKnpu%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*7gaSH8-YpnOTKnpu%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*7gaSH8-YpnOTKnpu%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*7gaSH8-YpnOTKnpu%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*7gaSH8-YpnOTKnpu 640w, https://miro.medium.com/v2/resize:fit:720/0*7gaSH8-YpnOTKnpu 720w, https://miro.medium.com/v2/resize:fit:750/0*7gaSH8-YpnOTKnpu 750w, https://miro.medium.com/v2/resize:fit:786/0*7gaSH8-YpnOTKnpu 786w, https://miro.medium.com/v2/resize:fit:828/0*7gaSH8-YpnOTKnpu 828w, https://miro.medium.com/v2/resize:fit:1100/0*7gaSH8-YpnOTKnpu 1100w, https://miro.medium.com/v2/resize:fit:1400/0*7gaSH8-YpnOTKnpu 1400w"
sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh hc qj c" width="700" height="402" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="ro go rp pw px rq rr bf b bg z cm"><em class="ku">Remote monitoring media being uploaded and archived using the MPS Footage Ingest workflow</em></figcaption></figure><p id="f4ad" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">While many services in MPS are available once media has been uploaded, <em class="rt">Senna’s</em> use of MPS focused on VFX. With <em class="rt">Senna</em> shooting a high volume of footage and the show having a high volume of VFX shots, according to Post Supervisor Gabriel Queiroz, <em class="rt">“Using MPS was basically a no-brainer, </em>[having]<em class="rt"> used the tool before, I knew what it could bring to the project. And to be honest, with the amount of footage that we have, it was just so much material and with the amount of vendors we have, knowing that we would have to deliver all this footage to all these kinds of vendors, including outside of Brazil and to different parts of the world.”</em></p><p id="770f" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">With a traditional workflow, utilizing available resources in Latin America, VFX Pulls would have been done manually.
This process is prone to human error and, more importantly for a show like <em class="rt">Senna</em>, too slow; it would also have meant a different I/O method for every vendor.</p><figure class="pz qa qb qc qd qe pw px paragraph-image"><div role="button" tabindex="0" class="qf qg gs qh bh qi"><div class="pw px ru"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*_CzN6mkamROqqxjo%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*_CzN6mkamROqqxjo%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*_CzN6mkamROqqxjo%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*_CzN6mkamROqqxjo%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*_CzN6mkamROqqxjo%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*_CzN6mkamROqqxjo%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*_CzN6mkamROqqxjo%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*_CzN6mkamROqqxjo 640w, https://miro.medium.com/v2/resize:fit:720/0*_CzN6mkamROqqxjo 720w, https://miro.medium.com/v2/resize:fit:750/0*_CzN6mkamROqqxjo 750w, https://miro.medium.com/v2/resize:fit:786/0*_CzN6mkamROqqxjo 786w, https://miro.medium.com/v2/resize:fit:828/0*_CzN6mkamROqqxjo 828w, https://miro.medium.com/v2/resize:fit:1100/0*_CzN6mkamROqqxjo 1100w, https://miro.medium.com/v2/resize:fit:1400/0*_CzN6mkamROqqxjo 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw,
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh hc qj c" width="700" height="399" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="ro go rp pw px rq rr bf b bg z cm"><em class="ku">Illustrating a traditional VFX Editor having to manage various I/O methods</em></figcaption></figure><p id="3595" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">By utilizing MPS, the Assistant Editor was able to log into Content Hub, upload an EDL, and have their VFX Pulls automatically transcoded, color files consolidated and all media placed into a Google Drive style folder built directly in Content Hub (called Workspaces). The VFX Editor was able to make any additional tweaks they wanted to the directory before farming out each of the shots to whichever vendor they were meant for. When it came time for the VFX vendors to then send shots back to editorial or DI, this was also done through MPS. 
Having one standard method for I/O for all VFX file sharing meant that Editorial and DI did not have to manage a different file transfer/workflow for every single vendor that was onboarded.</p><figure class="pz qa qb qc qd qe pw px paragraph-image"><div role="button" tabindex="0" class="qf qg gs qh bh qi"><div class="pw px ru"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*V01AyZdu0si2N0z5%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*V01AyZdu0si2N0z5%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*V01AyZdu0si2N0z5%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*V01AyZdu0si2N0z5%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*V01AyZdu0si2N0z5%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*V01AyZdu0si2N0z5%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*V01AyZdu0si2N0z5%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*V01AyZdu0si2N0z5 640w, https://miro.medium.com/v2/resize:fit:720/0*V01AyZdu0si2N0z5 720w, https://miro.medium.com/v2/resize:fit:750/0*V01AyZdu0si2N0z5 750w, https://miro.medium.com/v2/resize:fit:786/0*V01AyZdu0si2N0z5 786w, https://miro.medium.com/v2/resize:fit:828/0*V01AyZdu0si2N0z5 828w, https://miro.medium.com/v2/resize:fit:1100/0*V01AyZdu0si2N0z5 1100w, https://miro.medium.com/v2/resize:fit:1400/0*V01AyZdu0si2N0z5 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 
50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh hc qj c" width="700" height="399" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="ro go rp pw px rq rr bf b bg z cm"><em class="ku">Illustrating a more streamlined workflow for VFX vendors when using MPS</em></figcaption></figure><p id="c2be" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">After picture was locked and it was time for <em class="rt">Senna</em> to do their online, the DI facility Quanta was able to utilize the Conform Pull service within MPS. Their team uploaded an EDL, which ran a QC on all of the media in their edit to ensure a smooth conform, then consolidated, trimmed, and packaged all of the media needed for the online. Since this early beta, and thanks to learnings from many shows like Senna, advancements have been made in the system’s ability to match back to source media for both Conform and VFX Pulls.
Rather than requiring an exact match between the EDL and source OCF, several variations of fuzzy matching can take place, and we are currently investigating the use of one of our perceptual matching algorithms to enable a perceptual conform driven by computer vision rather than metadata alone.</p><figure class="pz qa qb qc qd qe"><figcaption class="ro go rp pw px rq rr bf b bg z cm">Inside Senna with Content Hub Media Production Suite video</figcaption></figure><h1 id="60c3" class="qk ql jt bf qm qn qo qp hp qq qr qs hr qt qu qv qw qx qy qz ra rb rc rd re rf bk">Conclusion</h1><p id="a3cb" class="pw-post-body-paragraph pb pc jt pd b pe rg pg ph pi rh pk pl hs ri pn po hv rj pq pr hy rk pt pu pv iu bk">The Media Production Suite (MPS) represents a transformative leap in how we approach media production at Netflix. By embracing open standards, we have crafted a scalable solution that not only makes economic sense but also democratizes access to advanced production tools across the globe. This approach allows us to eliminate tedious tasks, enabling our teams to focus on what truly matters: creative storytelling. By fostering global collaboration and leveraging the power of cloud-based workflows, we’re not just enhancing efficiency but also elevating the quality of our productions. As we continue to innovate and refine our processes, we remain committed to breaking down barriers and unlocking the full potential of creative talent worldwide. The future of filmmaking is here, and with MPS, we are leading the charge toward a more connected and creatively empowered industry.</p></div>]]></description>
      <link>https://netflixtechblog.com/globalizing-productions-with-netflixs-media-production-suite-fc3c108c0a22</link>
      <guid>https://netflixtechblog.com/globalizing-productions-with-netflixs-media-production-suite-fc3c108c0a22</guid>
      <pubDate>Mon, 31 Mar 2025 18:21:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Foundation Model for Personalized Recommendation]]></title>
      <description><![CDATA[<div><div></div><p id="af55" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">By <a class="ag ih" href="https://www.linkedin.com/in/markhsiao/" rel="noopener ugc nofollow" target="_blank">Ko-Jen Hsiao</a>, <a class="ag ih" href="https://www.linkedin.com/in/yesufeng/" rel="noopener ugc nofollow" target="_blank">Yesu Feng</a> and <a class="ag ih" href="https://www.linkedin.com/in/sudarshanlamkhede/" rel="noopener ugc nofollow" target="_blank">Sudarshan Lamkhede</a></p><h1 id="400b" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Motivation</h1><p id="4d49" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine learned models each catering to distinct needs including “Continue Watching” and “Today’s Top Picks for You.” (Refer to our recent <a class="ag ih" href="https://videorecsys.com/slides/mark_talk3.pdf" rel="noopener ugc nofollow" target="_blank">overview</a> for more details). However, as we expanded our set of personalization algorithms to meet increasing business needs, maintenance of the recommender system became quite costly. Furthermore, it was difficult to transfer innovations from one model to another, given that most are independently trained despite using common data sources. This scenario underscored the need for a new recommender system architecture where member preference learning is centralized, enhancing accessibility and utility across different models.</p><p id="b0d8" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Particularly, these models predominantly extract features from members’ recent interaction histories on the platform. 
Yet, many are confined to a brief temporal window due to constraints in serving latency or training costs. This limitation has inspired us to develop a foundation model for recommendation. This model aims to assimilate information both from members’ comprehensive interaction histories and our content at a very large scale. It facilitates the distribution of these learnings to other models, either through shared model weights for fine tuning or directly through embeddings.</p><p id="34e6" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">The impetus for constructing a foundational recommendation model is based on the paradigm shift in natural language processing (NLP) to large language models (LLMs). In NLP, the trend is moving away from numerous small, specialized models towards a single, large language model that can perform a variety of tasks either directly or with minimal fine-tuning. Key insights from this shift include:</p><ol class=""><li id="c797" class="pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv qx qy qz bk"><strong class="pd ju">A Data-Centric Approach</strong>: Shifting focus from model-centric strategies, which heavily rely on feature engineering, to a data-centric one. This approach prioritizes the accumulation of large-scale, high-quality data and, where feasible, aims for end-to-end learning.</li><li id="00ed" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk"><strong class="pd ju">Leveraging Semi-Supervised Learning</strong>: The next-token prediction objective in LLMs has proven remarkably effective. 
It enables large-scale semi-supervised learning using unlabeled data while also equipping the model with a surprisingly deep understanding of world knowledge.</li></ol><p id="215e" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">These insights have shaped the design of our foundation model, enabling a transition from maintaining numerous small, specialized models to building a scalable, efficient system. By scaling up semi-supervised training data and model parameters, we aim to develop a model that not only meets current needs but also adapts dynamically to evolving demands, ensuring sustainable innovation and resource efficiency.</p><h1 id="cbeb" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Data</h1><p id="9678" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">At Netflix, user engagement spans a wide spectrum, from casual browsing to committed movie watching. With over 300 million users at the end of 2024, this translates into hundreds of billions of interactions — an immense dataset comparable in scale to the token volume of large language models (LLMs). However, as in LLMs, the quality of data often outweighs its sheer volume. To harness this data effectively, we employ a process of interaction tokenization, ensuring meaningful events are identified and redundancies are minimized.</p><p id="6600" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Tokenizing User Interactions</strong>: Not all raw user actions contribute equally to understanding preferences. Tokenization helps define what constitutes a meaningful “token” in a sequence. Drawing an analogy to Byte Pair Encoding (BPE) in NLP, we can think of tokenization as merging adjacent actions to form new, higher-level tokens. 
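As a minimal illustration of this BPE-like merging (all event fields and names here are hypothetical, not Netflix's actual pipeline):

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One merged interaction token (hypothetical schema)."""
    title_id: int
    watch_seconds: int   # summed across merged actions
    actions: set         # aggregated engagement types

def tokenize(events):
    """Merge adjacent raw events on the same title into one higher-level token,
    summing watch duration and aggregating engagement types."""
    tokens = []
    for title_id, secs, action in events:
        if tokens and tokens[-1].title_id == title_id:
            tokens[-1].watch_seconds += secs
            tokens[-1].actions.add(action)
        else:
            tokens.append(Token(title_id, secs, {action}))
    return tokens

# Three adjacent events on title 7 collapse into a single token; title 9 stays separate.
events = [(7, 120, "trailer"), (7, 5400, "play"), (7, 30, "browse"), (9, 60, "trailer")]
toks = tokenize(events)
```
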
However, unlike language tokenization, creating these new tokens requires careful consideration of what information to retain. For instance, the total watch duration might need to be summed, or engagement types aggregated, to preserve critical details.</p><figure class="ri rj rk rl rm rn rf rg paragraph-image"><figcaption class="rt go ru rf rg rv rw bf b bg z cm"><strong class="bf py">Figure 1. </strong>Tokenization of user interaction history by merging actions on the same title, preserving important information.</figcaption></figure><p id="eb1f" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">This tradeoff between granular data and sequence compression is akin to the balance in LLMs between vocabulary size and context window. In our case, the goal is to balance the length of interaction history against the level of detail retained in individual tokens. Overly lossy tokenization risks losing valuable signals, while too granular a sequence can exceed practical limits on processing time and memory.</p><p id="55a5" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Even with such strategies, interaction histories from active users can span thousands of events, exceeding the capacity of transformer models with standard self-attention layers. In recommendation systems, context windows during inference are often limited to hundreds of events — not due to model capability but because these services typically require millisecond-level latency. 
This constraint is more stringent than what is typical in LLM applications, where longer inference times (seconds) are more tolerable.</p><p id="c640" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">To address this during training, we implement two key solutions:</p><ol class=""><li id="0868" class="pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv qx qy qz bk"><strong class="pd ju">Sparse Attention Mechanisms</strong>: By leveraging sparse attention techniques such as low-rank compression, the model can extend its context window to several hundred events while maintaining computational efficiency. This enables it to process more extensive interaction histories and derive richer insights into long-term preferences.</li><li id="75fa" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk"><a class="ag ih" href="https://arxiv.org/abs/2409.14517" rel="noopener ugc nofollow" target="_blank"><strong class="pd ju">Sliding Window Sampling</strong></a>: During training, we sample overlapping windows of interactions from the full sequence. 
This ensures the model is exposed to different segments of the user’s history over multiple epochs, allowing it to learn from the entire sequence without requiring an impractically large context window.</li></ol><p id="00ba" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">At inference time, when multi-step decoding is needed, we can deploy KV caching to efficiently reuse past computations and maintain low latency.</p><p id="e2ee" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">These approaches collectively allow us to balance the need for detailed, long-term interaction modeling with the practical constraints of model training and inference, enhancing both the precision and scalability of our recommendation system.</p><p id="c410" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk"><strong class="pd ju">Information in Each ‘Token’</strong>: While the first part of our tokenization process focuses on structuring sequences of interactions, the next critical step is defining the rich information contained within each token. Unlike LLMs, which typically rely on a single embedding space to represent input tokens, our interaction events are packed with heterogeneous details. These include attributes of the action itself (such as locale, time, duration, and device type) as well as information about the content (such as item ID and metadata like genre and release country). Most of these features, especially categorical ones, are directly embedded within the model, embracing an end-to-end learning approach. However, certain features require special attention. 
For example, timestamps need additional processing to capture both absolute and relative notions of time, with absolute time being particularly important for understanding time-sensitive behaviors.</p><p id="5ece" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">To enhance prediction accuracy in sequential recommendation systems, we organize token features into two categories:</p><ol class=""><li id="a88b" class="pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv qx qy qz bk"><strong class="pd ju">Request-Time Features</strong>: These are features available at the moment of prediction, such as log-in time, device, or location.</li><li id="5441" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk"><strong class="pd ju">Post-Action Features</strong>: These are details available after an interaction has occurred, such as the specific show interacted with or the duration of the interaction.</li></ol><p id="5152" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">To predict the next interaction, we combine request-time features from the current step with post-action features from the <a class="ag ih" href="https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/18140" rel="noopener ugc nofollow" target="_blank">previous step</a>. 
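One way to picture this pairing (a sketch with hypothetical feature names; the first step receives a placeholder because no prior action exists):

```python
def build_inputs(request_time, post_action):
    """Pair each step's request-time features with the PREVIOUS step's
    post-action features, so the model never sees the outcome it must predict."""
    PAD = {"item_id": None, "duration": 0}  # placeholder for the first step
    inputs = []
    prev = PAD
    for req, post in zip(request_time, post_action):
        # merge current context with the previous step's outcome
        inputs.append({**req, **{f"prev_{k}": v for k, v in prev.items()}})
        prev = post
    return inputs

request_time = [{"device": "tv", "hour": 20}, {"device": "phone", "hour": 8}]
post_action  = [{"item_id": 42, "duration": 7200}, {"item_id": 7, "duration": 300}]
seq = build_inputs(request_time, post_action)
# seq[1] carries the phone/8am context plus item 42's post-action details
```
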
This blending of contextual and historical information ensures each token in the sequence carries a comprehensive representation, capturing both the immediate context and user behavior patterns over time.</p><h1 id="0db0" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Considerations for Model Objective and Architecture</h1><p id="6f3d" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">As previously mentioned, our default approach employs the autoregressive next-token prediction objective, similar to GPT. This strategy effectively leverages the vast scale of unlabeled user interaction data. The adoption of this objective in recommendation systems has shown multiple successes [1–3]. However, given the distinct differences between language tasks and recommendation tasks, we have made several critical modifications to the objective.</p><p id="91bb" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Firstly, during the pretraining phase of typical LLMs, such as GPT, every target token is generally treated with equal weight. In contrast, in our model, not all user interactions are of equal importance. For instance, a 5-minute trailer play should not carry the same weight as a 2-hour full movie watch. A greater challenge arises when trying to align long-term user satisfaction with specific interactions and recommendations. To address this, we can adopt a multi-token prediction objective during training, where the model predicts the next <em class="rx">n</em> tokens at each step instead of a single token[4]. 
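The target construction for multi-token prediction can be sketched as follows (a simplified illustration, not the full training objective):

```python
def multi_token_targets(sequence, n):
    """For each position t, the training targets are the next n tokens
    (sequence[t+1 : t+1+n]) rather than just the single next token."""
    targets = []
    for t in range(len(sequence) - 1):
        targets.append(sequence[t + 1 : t + 1 + n])
    return targets

# With n=2, position 0 must predict both "B" and "C",
# discouraging a myopic focus on the immediate next event only.
seq = ["A", "B", "C", "D"]
tg = multi_token_targets(seq, 2)
```
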
This approach encourages the model to capture longer-term dependencies and avoid myopic predictions focused solely on immediate next events.</p><p id="a9d5" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">Secondly, we can use multiple fields in our input data as auxiliary prediction objectives in addition to predicting the next item ID, which remains the primary target. For example, we can derive genres from the items in the original sequence and use this genre sequence as an auxiliary target. This approach serves several purposes: it acts as a regularizer to reduce overfitting on noisy item ID predictions, provides additional insights into user intentions or long-term genre preferences, and, when structured hierarchically, can improve the accuracy of predicting the target item ID. By first predicting auxiliary targets, such as genre or original language, the model effectively narrows down the candidate list, simplifying subsequent item ID prediction.</p><h1 id="ab80" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Unique Challenges for Recommendation FM</h1><p id="0a8b" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">In addition to the infrastructure challenges common to building foundation models, such as training bigger models on substantial amounts of user interaction data, there are several hurdles unique to recommendations. One such challenge is entity cold-starting.</p><p id="a38b" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">At Netflix, our mission is to entertain the world. New titles are added to the catalog frequently. 
Therefore, the recommendation foundation model requires a cold-start capability: it must estimate members’ preferences for newly launched titles before anyone has engaged with them. To enable this, our foundation model training framework is built with two capabilities: incremental training and inference on unseen entities.</p><ol class=""><li id="d462" class="pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv qx qy qz bk"><strong class="pd ju">Incremental training</strong>: Foundation models are trained on extensive datasets, including every member’s history of plays and actions, making frequent retraining impractical. However, our catalog and member preferences continually evolve. Unlike large language models, which can be incrementally trained with stable token vocabularies, our recommendation models require new embeddings for new titles, necessitating expanded embedding layers and output components. To address this, we warm-start new models by reusing parameters from previous models and initializing new parameters for new titles. For example, new title embeddings can be initialized by adding slight random noise to existing average embeddings or by using a weighted combination of similar titles’ embeddings based on metadata. This approach allows new titles to start with relevant embeddings, facilitating faster fine-tuning. In practice, the initialization method becomes less critical when more member interaction data is used for fine-tuning.</li><li id="09a9" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk"><strong class="pd ju">Dealing with unseen entities</strong>: Even with incremental training, the model is not guaranteed to learn efficiently about new entities (e.g., newly launched titles). 
It’s also possible that some new entities will not appear in the training data even if we fine-tune the foundation model frequently. It is therefore important for the foundation model to use metadata about entities and inputs, not just member interaction data. Thus, our foundation model combines learnable item-ID embeddings with learnable embeddings derived from metadata. The following diagram illustrates this idea.</li></ol><figure class="ri rj rk rl rm rn rf rg paragraph-image"><figcaption class="rt go ru rf rg rv rw bf b bg z cm"><strong class="bf py">Figure 2. </strong>Titles are associated with various metadata, such as genres, storylines, and tones. Each type of metadata could be represented by averaging its respective embeddings, which are then concatenated to form the overall metadata-based embedding for the title.</figcaption></figure><p id="c1b6" class="pw-post-body-paragraph pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv iu bk">To create the final title embedding, we combine this metadata-based embedding with a fully learnable ID-based embedding using a mixing layer. Instead of simply summing these embeddings, we use an attention mechanism based on the “age” of the entity. This approach allows new titles with limited interaction data to rely more on metadata, while established titles can depend more on ID-based embeddings. Since titles with similar metadata can have different user engagement, their embeddings should reflect these differences. 
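The age-driven blending can be sketched with a simple sigmoid gate standing in for the attention mechanism (the time constant `tau` and the gate itself are assumptions for illustration, not the production design):

```python
import math

def mix_embeddings(id_emb, meta_emb, age_days, tau=30.0):
    """Blend a learnable ID embedding with a metadata-based embedding, with
    the mixing weight driven by the title's 'age'. New titles (age near 0)
    lean on metadata; established titles lean on the ID-based embedding.
    tau (hypothetical) controls how fast the weight shifts toward the ID side."""
    w_id = 1.0 / (1.0 + math.exp(-(age_days - tau) / tau))  # 0 -> metadata, 1 -> ID
    return [w_id * a + (1.0 - w_id) * b for a, b in zip(id_emb, meta_emb)]

id_emb, meta_emb = [1.0, 0.0], [0.0, 1.0]
new_title = mix_embeddings(id_emb, meta_emb, age_days=0)    # mostly metadata-based
old_title = mix_embeddings(id_emb, meta_emb, age_days=365)  # mostly ID-based
```
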
Introducing some randomness during training encourages the model to learn from metadata rather than relying solely on ID embeddings. This method ensures that newly launched or pre-launch titles have reasonable embeddings even with no user interaction data.</p><h1 id="5fdf" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Downstream Applications and Challenges</h1><p id="06ff" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">Our recommendation foundation model is designed to understand long-term member preferences and can be utilized in various ways by downstream applications:</p><ol class=""><li id="eb13" class="pb pc jt pd b pe pf pg ph pi pj pk pl hs pm pn po hv pp pq pr hy ps pt pu pv qx qy qz bk"><strong class="pd ju">Direct Use as a Predictive Model</strong>: The model is primarily trained to predict the next entity a user will interact with. It includes multiple predictor heads for different tasks, such as forecasting member preferences for various genres. These can be directly applied to meet diverse business needs.</li><li id="b2fa" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk"><strong class="pd ju">Utilizing Embeddings</strong>: The model generates valuable embeddings for members and entities like videos, games, and genres. These embeddings are calculated in batch jobs and stored for use in both offline and online applications. They can serve as features in other models or be used for candidate generation, such as retrieving appealing titles for a user. High-quality title embeddings also support title-to-title recommendations. However, one important consideration is that the embedding space has arbitrary, uninterpretable dimensions and is incompatible across different model training runs. 
This poses challenges for downstream consumers, who must adapt to each retraining and redeployment, risking bugs due to invalidated assumptions about the embedding structure. To address this, we apply an orthogonal low-rank transformation to stabilize the user/item embedding space, ensuring consistent meaning of embedding dimensions, even as the base foundation model is retrained and redeployed.</li><li id="fb28" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk"><strong class="pd ju">Fine-Tuning with Specific Data </strong>The model’s adaptability allows for fine-tuning with application-specific data. Users can integrate the full model or subgraphs into their own models, fine-tuning them with less data and computational power. This approach achieves performance comparable to previous models, despite the initial foundation model requiring significant resources.</li></ol><h1 id="0565" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Scaling Foundation Models for Netflix Recommendations</h1><p id="e598" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">In scaling up our foundation model for Netflix recommendations, we draw inspiration from the success of large language models (LLMs). Just as LLMs have demonstrated the power of scaling in improving performance, we find that scaling is crucial for enhancing generative recommendation tasks. Successful scaling demands robust evaluation, efficient training algorithms, and substantial computing resources. Evaluation must effectively differentiate model performance and identify areas for improvement. Scaling involves data, model, and context scaling, incorporating user engagement, external reviews, multimedia assets, and high-quality embeddings. 
Our experiments confirm that the scaling law also applies to our foundation model, with consistent improvements observed as we increase data and model size.</p><figure class="ri rj rk rl rm rn rf rg paragraph-image"><figcaption class="rt go ru rf rg rv rw bf b bg z cm"><strong class="bf py">Figure 3. </strong>The relationship between model parameter size and relative performance improvement. The plot demonstrates the scaling law in recommendation modeling, showing a trend of increased performance with larger model sizes. The x-axis is logarithmically scaled to highlight growth across different magnitudes.</figcaption></figure><h1 id="0f22" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Conclusion</h1><p id="baf8" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">In conclusion, our Foundation Model for Personalized Recommendation represents a significant step towards creating a unified, data-centric system that leverages large-scale data to increase the quality of recommendations for our members. This approach borrows insights from Large Language Models (LLMs), particularly the principles of semi-supervised learning and end-to-end training, aiming to harness the vast scale of unlabeled user interaction data. The model also addresses challenges unique to recommendation, such as cold start and presentation bias, while accounting for the distinct differences between language tasks and recommendation tasks. 
The Foundation Model allows various downstream applications, from direct use as a predictive model to generate user and entity embeddings for other applications, and can be fine-tuned for specific canvases. We see promising results from downstream integrations. This move from multiple specialized models to a more comprehensive system marks an exciting development in the field of personalized recommendation systems.</p><h1 id="51d7" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">Acknowledgements</h1><p id="8a46" class="pw-post-body-paragraph pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv iu bk">Contributors to this work (name in alphabetical order): <a class="ag ih" href="https://www.linkedin.com/in/aileisun/" rel="noopener ugc nofollow" target="_blank">Ai-Lei Sun</a> <a class="ag ih" href="https://www.linkedin.com/in/aishafenton/" rel="noopener ugc nofollow" target="_blank">Aish Fenton</a> <a class="ag ih" href="https://www.linkedin.com/in/annecocos/" rel="noopener ugc nofollow" target="_blank">Anne Cocos</a> <a class="ag ih" href="https://www.linkedin.com/in/foranuj/" rel="noopener ugc nofollow" target="_blank">Anuj Shah</a> <a class="ag ih" href="https://www.linkedin.com/in/arashaghevli/" rel="noopener ugc nofollow" target="_blank">Arash Aghevli</a> <a class="ag ih" href="https://www.linkedin.com/in/baolin-li-659426115/" rel="noopener ugc nofollow" target="_blank">Baolin Li</a> <a class="ag ih" href="https://www.linkedin.com/in/bowei-yan-0080a326/" rel="noopener ugc nofollow" target="_blank">Bowei Yan</a> <a class="ag ih" href="https://www.linkedin.com/in/danielzheng256/" rel="noopener ugc nofollow" target="_blank">Dan Zheng</a> <a class="ag ih" href="https://www.linkedin.com/in/dwliang/" rel="noopener ugc nofollow" target="_blank">Dawen Liang</a> <a class="ag ih" href="https://www.linkedin.com/in/ding-tong-2812785a/" rel="noopener ugc nofollow" target="_blank">Ding Tong</a> <a class="ag ih" 
href="https://www.linkedin.com/in/divya-gadde-3ba01551/" rel="noopener ugc nofollow" target="_blank">Divya Gadde</a> <a class="ag ih" href="https://www.linkedin.com/in/emma-yanyang-kong-6904b457/" rel="noopener ugc nofollow" target="_blank">Emma Kong</a> <a class="ag ih" href="https://www.linkedin.com/in/gary-y-62175170/" rel="noopener ugc nofollow" target="_blank">Gary Yeh</a> <a class="ag ih" href="https://www.linkedin.com/in/inbar-naor-6b973a50/" rel="noopener ugc nofollow" target="_blank">Inbar Naor</a> <a class="ag ih" href="https://www.linkedin.com/in/jinwangw/" rel="noopener ugc nofollow" target="_blank">Jin Wang</a> <a class="ag ih" href="https://www.linkedin.com/in/jbasilico/" rel="noopener ugc nofollow" target="_blank">Justin Basilico</a> <a class="ag ih" href="https://www.linkedin.com/in/kabir-nagrecha/overlay/about-this-profile/" rel="noopener ugc nofollow" target="_blank">Kabir Nagrecha</a> <a class="ag ih" href="https://www.linkedin.com/in/kzielnicki/" rel="noopener ugc nofollow" target="_blank">Kevin Zielnicki</a> <a class="ag ih" href="https://www.linkedin.com/in/linasbaltrunas/" rel="noopener ugc nofollow" target="_blank">Linas Baltrunas</a> <a class="ag ih" href="https://www.linkedin.com/in/lingyi-liu-4b866016/" rel="noopener ugc nofollow" target="_blank">Lingyi Liu</a> <a class="ag ih" href="https://www.linkedin.com/in/lequn-luke-wang-9226b2129/" rel="noopener ugc nofollow" target="_blank">Luke Wang</a> <a class="ag ih" href="https://www.linkedin.com/in/matan-appelbaum-39472b96/" rel="noopener ugc nofollow" target="_blank">Matan Appelbaum</a> <a class="ag ih" href="https://www.linkedin.com/in/tuzhucheng/" rel="noopener ugc nofollow" target="_blank">Michael Tu</a> <a class="ag ih" href="https://www.linkedin.com/in/moumitab/" rel="noopener ugc nofollow" target="_blank">Moumita Bhattacharya</a> <a class="ag ih" href="https://www.linkedin.com/in/pabloadelgado/" rel="noopener ugc nofollow" target="_blank">Pablo Delgado</a> <a class="ag ih" 
href="https://www.linkedin.com/in/qiuling-xu-a445b815a/" rel="noopener ugc nofollow" target="_blank">Qiuling Xu</a> <a class="ag ih" href="https://www.linkedin.com/in/rakeshkomuravelli/" rel="noopener ugc nofollow" target="_blank">Rakesh Komuravelli</a> <a class="ag ih" href="https://www.linkedin.com/in/raveeshbhalla/" rel="noopener ugc nofollow" target="_blank">Raveesh Bhalla</a> <a class="ag ih" href="https://www.linkedin.com/in/rob-story-b21a4912/" rel="noopener ugc nofollow" target="_blank">Rob Story</a> <a class="ag ih" href="https://www.linkedin.com/in/rogermenezes/" rel="noopener ugc nofollow" target="_blank">Roger Menezes</a> <a class="ag ih" href="https://www.linkedin.com/in/sejoon-oh/" rel="noopener ugc nofollow" target="_blank">Sejoon Oh</a> <a class="ag ih" href="https://www.linkedin.com/in/shahrzad-naseri-1b988760/" rel="noopener ugc nofollow" target="_blank">Shahrzad Naseri</a> <a class="ag ih" href="https://www.linkedin.com/in/swanandjoshi7/" rel="noopener ugc nofollow" target="_blank">Swanand Joshi</a> <a class="ag ih" href="https://www.linkedin.com/in/trungnguyen324/" rel="noopener ugc nofollow" target="_blank">Trung Nguyen</a> <a class="ag ih" href="https://www.linkedin.com/in/vito-ostuni-0b576027/" rel="noopener ugc nofollow" target="_blank">Vito Ostuni </a><a class="ag ih" href="https://www.linkedin.com/in/thomasweiwang/" rel="noopener ugc nofollow" target="_blank">Wei Wang</a> <a class="ag ih" href="https://www.linkedin.com/in/zhezhangncsu/" rel="noopener ugc nofollow" target="_blank">Zhe Zhang</a></p><h1 id="9c30" class="pw px jt bf py pz qa qb hp qc qd qe hr qf qg qh qi qj qk ql qm qn qo qp qq qr bk">References</h1><ol class=""><li id="d518" class="pb pc jt pd b pe qs pg ph pi qt pk pl hs qu pn po hv qv pq pr hy qw pt pu pv qx qy qz bk">W.-C. Kang and J. McAuley, “Self-Attentive Sequential Recommendation,” <em class="rx">2018 IEEE International Conference on Data Mining (ICDM)</em>, Singapore, 2018, pp.
197–206, doi: 10.1109/ICDM.2018.00035.</li><li id="3a76" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk">F. Sun et al., “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer,” <em class="rx">Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ‘19)</em>, Beijing, China, 2019, pp. 1441–1450, doi: 10.1145/3357384.3357895.</li><li id="6b8c" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk">J. Zhai et al., “Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations,” <em class="rx">arXiv preprint arXiv:2402.17152</em>, 2024.</li><li id="9071" class="pb pc jt pd b pe ra pg ph pi rb pk pl hs rc pn po hv rd pq pr hy re pt pu pv qx qy qz bk">F. Gloeckle, B. Youbi Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better &amp; Faster Large Language Models via Multi-token Prediction,” <em class="rx">arXiv preprint arXiv:2404.19737</em>, 2024.</li></ol></div>]]></description>
      <link>https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39</link>
      <guid>https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39</guid>
      <pubDate>Sat, 29 Mar 2025 01:51:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[HDR10+ Now Streaming on Netflix]]></title>
      <description><![CDATA[<div><div></div><p id="bf52" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk"><a class="ag hy" href="https://www.linkedin.com/in/rquero" rel="noopener ugc nofollow" target="_blank">Roger Quero</a>, <a class="ag hy" href="https://www.linkedin.com/in/liwei-guo" rel="noopener ugc nofollow" target="_blank">Liwei Guo</a>, <a class="ag hy" href="https://www.linkedin.com/in/jeffrwatts/" rel="noopener ugc nofollow" target="_blank">Jeff Watts</a>, <a class="ag hy" href="https://www.linkedin.com/in/joseph-mccormick-7b386026" rel="noopener ugc nofollow" target="_blank">Joseph McCormick</a>, <a class="ag hy" href="https://www.linkedin.com/in/agataopalach/" rel="noopener ugc nofollow" target="_blank">Agata Opalach</a>, <a class="ag hy" href="https://www.linkedin.com/in/anush-moorthy-b8451142/" rel="noopener ugc nofollow" target="_blank">Anush Moorthy</a></p><p id="f88c" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">We are excited to announce that we are now streaming HDR10+ content on our service for AV1-enabled devices, enhancing the viewing experience for certified HDR10+ devices, which previously only received HDR10 content. The dynamic metadata included in our HDR10+ content improves the quality and accuracy of the picture when viewed on these devices.</p><h1 id="4a1e" class="pm pn jk bf po pp pq pr hg ps pt pu hi pv pw px py pz qa qb qc qd qe qf qg qh bk">Delighting Members with Even Better Picture Quality</h1><p id="ba30" class="pw-post-body-paragraph or os jk ot b ou qi ow ox oy qj pa pb hj qk pd pe hm ql pg ph hp qm pj pk pl il bk">Nearly a decade ago, we made a bold move to be a pioneering adopter of High Dynamic Range (HDR) technology. HDR enables images to have more details, vivid colors, and improved realism. 
We began producing our shows and movies in HDR, encoding them in HDR, and streaming them in HDR for our members. We were confident that it would greatly enhance our members’ viewing experience, and unlock new creative visions — and we were right! In the last five years, HDR streaming has increased by more than 300%, while the number of HDR-configured devices watching Netflix has more than doubled. Since launching HDR with season one of <em class="qn">Marco Polo</em>, Netflix now has over 11,000 hours of HDR titles for members to immerse themselves in.</p><p id="c033" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">We continue to enhance member joy while maintaining creative vision by adding support for HDR10+. This will further augment Netflix’s growing HDR ecosystem, preserve creative intent on even more devices, and provide a more immersive viewing experience.</p><p id="1789" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">We enabled HDR10+ on Netflix using the <a class="ag hy" href="https://aomedia.org/specifications/av1/" rel="noopener ugc nofollow" target="_blank">AV1 video codec</a> that was standardized by the Alliance for Open Media (AOM) in 2018. AV1 is one of the most efficient codecs available today. We <a class="ag hy" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/bringing-av1-streaming-to-netflix-members-tvs-b7fc88e42320">previously enabled</a> AV1 encoding for SDR content, and saw tremendous value for our members, including higher and more consistent visual quality, lower play delay and increased streaming at the highest resolution. AV1-SDR is already the second most streamed codec at Netflix, behind H.264/AVC, which has been around for over 20 years! 
With the addition of HDR10+ streams to AV1, we expect the day is not far when AV1 will be the most streamed codec at Netflix.</p><p id="849f" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">To enhance our offering, we have been adding HDR10+ streams to both new releases and existing popular HDR titles. AV1-HDR10+ now accounts for 50% of all eligible viewing hours. We will continue expanding our HDR10+ offerings with the goal of providing an HDR10+ experience for all HDR titles by the end of this year¹.</p><h1 id="d8ad" class="pm pn jk bf po pp pq pr hg ps pt pu hi pv pw px py pz qa qb qc qd qe qf qg qh bk"><strong class="am">Industry Adopted Formats</strong></h1><p id="947b" class="pw-post-body-paragraph or os jk ot b ou qi ow ox oy qj pa pb hj qk pd pe hm ql pg ph hp qm pj pk pl il bk">Today, the industry recognizes three prevalent HDR formats: Dolby Vision, HDR10, and HDR10+. For all three HDR Formats, metadata is embedded in the content, serving as instructions to guide the playback device — whether it’s a TV, mobile device, or computer — on how to display the image.</p><p id="c1f5" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">HDR10 is the most widely adopted HDR format, supported by all HDR devices. HDR10 uses static metadata that is defined once for the entire content detailing aspects such as the maximum content light level (MaxCLL), maximum frame average light level (MaxFALL), as well as characteristics of the mastering display used for color grading. This metadata only allows for a one-size-fits-all tone mapping of the content for display devices. 
It cannot account for dynamic contrast across scenes, which most content contains.</p><p id="2353" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">HDR10+ and Dolby Vision improve on this with dynamic metadata that provides content image statistics on a per-frame basis, enabling optimized tone mapping adjustments for each scene. This achieves greater perceptual fidelity to the original, preserving creative intent.</p><h1 id="5572" class="pm pn jk bf po pp pq pr hg ps pt pu hi pv pw px py pz qa qb qc qd qe qf qg qh bk"><strong class="am">HDR10 vs. HDR10+</strong></h1><p id="4df0" class="pw-post-body-paragraph or os jk ot b ou qi ow ox oy qj pa pb hj qk pd pe hm ql pg ph hp qm pj pk pl il bk">The figure below shows screen grabs of two AV1-encoded frames of the same content displayed using HDR10 (top) and HDR10+ (bottom).</p><figure class="qr qs qt qu qv qw qo qp paragraph-image"><div role="button" tabindex="0" class="qx qy gj qz bh ra"><div class="qo qp qq"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*AjnQRaY7VFZoonX5SI36IA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*AjnQRaY7VFZoonX5SI36IA.png" /><img alt="" class="bh gt rb c" width="700" height="418" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><figure class="qr qs qt qu qv qw qo qp paragraph-image"><div role="button" tabindex="0" class="qx qy gj qz bh ra"><div class="qo qp rc"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*gsW42hweG6RMbWwQjy1etg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*gsW42hweG6RMbWwQjy1etg.png" /><img alt="" class="bh gt rb c" width="700" height="420" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" 
/></picture></div></div></figure><p id="0ce5" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk"><em class="qn">Photographs of devices displaying the same frame with HDR10 metadata (top) and HDR10+ metadata (bottom). Notice the preservation of the flashlight detail in the HDR10+ capture, and the over-exposure of the region under the flashlight in the HDR10 one².</em></p><p id="5357" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">As seen in the flashlight on the table, the highlight details are clipped in the HDR10 content, but are recovered in HDR10+. Further, the region under the flashlight is overexposed in the HDR10 content, while HDR10+ renders that region with greater fidelity to the source. The reason HDR10+, with its dynamic metadata, shines in this example is that the scenes preceding and following the scene with this frame have markedly different luminance statistics. The static HDR10 metadata is unable to account for the change in the content. While this is a simple example, the dynamic metadata in HDR10+ provides similar value across any set of scenes. This consistency allows our members to stay immersed in the content, and better preserves creative intent.</p><h1 id="925d" class="pm pn jk bf po pp pq pr hg ps pt pu hi pv pw px py pz qa qb qc qd qe qf qg qh bk"><strong class="am">Receiving HDR10+</strong></h1><p id="0211" class="pw-post-body-paragraph or os jk ot b ou qi ow ox oy qj pa pb hj qk pd pe hm ql pg ph hp qm pj pk pl il bk">At the time of launch, these requirements must be satisfied to receive HDR10+:</p><p id="8822" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">1. Member must have a Netflix Premium plan subscription</p><p id="e1e4" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">2.
Title must be available in HDR10+ format</p><p id="4103" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">3. Member device must support AV1 &amp; HDR10+. Here are some examples of compatible devices:</p><ul class=""><li id="f4b3" class="or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl rd re rf bk">SmartTVs, mobile phones, and tablets that meet Netflix certification for HDR10+</li><li id="dff8" class="or os jk ot b ou rg ow ox oy rh pa pb hj ri pd pe hm rj pg ph hp rk pj pk pl rd re rf bk">Source device (such as set-top boxes, streaming devices, MVPDs, etc.) that meets Netflix certification for HDR10+, connected to an HDR10+ compliant display via HDMI</li></ul><p id="3a60" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">4. For TV or streaming devices, ensure that the HDR toggle is enabled in our Netflix application settings: <a class="ag hy" href="https://help.netflix.com/en/node/100220" rel="noopener ugc nofollow" target="_blank">https://help.netflix.com/en/node/100220</a></p><p id="2e59" class="pw-post-body-paragraph or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl il bk">Additional guidance: <a class="ag hy" href="https://help.netflix.com/en/node/13444" rel="noopener ugc nofollow" target="_blank">https://help.netflix.com/en/node/13444</a></p><h1 id="62e5" class="pm pn jk bf po pp pq pr hg ps pt pu hi pv pw px py pz qa qb qc qd qe qf qg qh bk">Summary</h1><p id="ffa0" class="pw-post-body-paragraph or os jk ot b ou qi ow ox oy qj pa pb hj qk pd pe hm ql pg ph hp qm pj pk pl il bk">More HDR content is watched every day on Netflix. Expanding the Netflix HDR ecosystem to include HDR10+ increases the accessibility of HDR content with dynamic metadata to more members, improves the viewing experience, and preserves the creative intent of our content creators. 
The commitment to innovation and quality underscores our dedication to delivering an immersive and authentic viewing experience for all our members.</p><h1 id="0734" class="pm pn jk bf po pp pq pr hg ps pt pu hi pv pw px py pz qa qb qc qd qe qf qg qh bk">Acknowledgements</h1><p id="5054" class="pw-post-body-paragraph or os jk ot b ou qi ow ox oy qj pa pb hj qk pd pe hm ql pg ph hp qm pj pk pl il bk">Launching HDR10+ was a collaborative effort involving multiple teams at Netflix, and we are grateful to everyone who contributed to making this idea a reality. We would like to extend our thanks to the following teams for their crucial roles in this launch:</p><ul class=""><li id="7449" class="or os jk ot b ou ov ow ox oy oz pa pb hj pc pd pe hm pf pg ph hp pi pj pk pl rd re rf bk">The various Client and Partner Engineering teams at Netflix that manage the Netflix experience across different device platforms.<br />Special acknowledgments: <a class="ag hy" href="https://www.linkedin.com/in/akshaygarg05/" rel="noopener ugc nofollow" target="_blank">Akshay Garg</a>, <a class="ag hy" href="https://www.linkedin.com/in/dashap/" rel="noopener ugc nofollow" target="_blank">Dasha Polyakova</a>, <a class="ag hy" href="https://www.linkedin.com/in/wei-vivian-li/" rel="noopener ugc nofollow" target="_blank">Vivian Li</a>, <a class="ag hy" href="https://www.linkedin.com/in/benjamintoofer/" rel="noopener ugc nofollow" target="_blank">Ben Toofer</a>, <a class="ag hy" href="https://www.linkedin.com/in/allanzp/" rel="noopener ugc nofollow" target="_blank">Allan Zhou</a>, <a class="ag hy" href="https://www.linkedin.com/in/artemdanylenko/" rel="noopener ugc nofollow" target="_blank">Artem Danylenko</a></li><li id="23d0" class="or os jk ot b ou rg ow ox oy rh pa pb hj ri pd pe hm rj pg ph hp rk pj pk pl rd re rf bk">The Encoding Technologies team that is responsible for producing optimized encodings to enable high-quality experiences for our members. 
Special acknowledgments: <a class="ag hy" href="https://www.linkedin.com/in/adithyaprakash/" rel="noopener ugc nofollow" target="_blank">Adithya Prakash</a>, <a class="ag hy" href="https://www.linkedin.com/in/carvalhovinicius/" rel="noopener ugc nofollow" target="_blank">Vinicius Carvalho</a></li><li id="fd90" class="or os jk ot b ou rg ow ox oy rh pa pb hj ri pd pe hm rj pg ph hp rk pj pk pl rd re rf bk">The Content Operations &amp; Innovation teams responsible for producing and delivering HDR content to Netflix, maintaining the intent of creative vision from production to streaming. Special acknowledgements: <a class="ag hy" href="https://www.linkedin.com/in/michael-keegan-072a4950/" rel="noopener ugc nofollow" target="_blank">Michael Keegan</a></li></ul><h2 id="5696" class="rl pn jk bf po hf rm ez hg hh rn fb hi hj ro hk hl hm rp hn ho hp rq hq hr rr bk">Footnotes</h2><ol class=""><li id="ddfa" class="or os jk ot b ou qi ow ox oy qj pa pb hj qk pd pe hm ql pg ph hp qm pj pk pl rs re rf bk">While we have enabled HDR10+ for distribution i.e., for what our members consume on their devices, we continue to accept only Dolby Vision masters on the ingest side, i.e., for all content delivery to Netflix as per our <a class="ag hy" href="https://partnerhelp.netflixstudios.com/hc/en-us/sections/360012197873-Branded-Delivery-Specifications" rel="noopener ugc nofollow" target="_blank">delivery specification</a>. In addition to HDR10+, we continue to serve HDR10 and DolbyVision. Our encoding pipeline is designed with flexibility and extensibility where all these HDR formats could be derived from a single DolbyVision deliverable efficiently at scale.</li><li id="d2d9" class="or os jk ot b ou rg ow ox oy rh pa pb hj ri pd pe hm rj pg ph hp rk pj pk pl rs re rf bk">We recognize that it is hard to convey visual improvements in HDR video using still photographs converted to SDR. 
We encourage readers to stream Netflix content in HDR10+ and see for themselves!</li></ol></div>]]></description>
      <link>https://netflixtechblog.com/hdr10-now-streaming-on-netflix-c9ab1f4bd72b</link>
      <guid>https://netflixtechblog.com/hdr10-now-streaming-on-netflix-c9ab1f4bd72b</guid>
      <pubDate>Mon, 24 Mar 2025 19:39:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Title Launch Observability at Netflix Scale]]></title>
      <description><![CDATA[<div><div><h2 id="645e" class="pw-subtitle-paragraph hr gt gu bf b hs ht hu hv hw hx hy hz ia ib ic id ie if ig cq du">Part 3: System Strategies and Architecture</h2><div></div><p id="a132" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">By:</strong> <a class="af oc" href="https://www.linkedin.com/in/varun-khaitan/" rel="noopener ugc nofollow" target="_blank">Varun Khaitan</a></p><p id="6090" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">With special thanks to my stunning colleagues: <a class="af oc" href="https://www.linkedin.com/in/mallikarao/" rel="noopener ugc nofollow" target="_blank">Mallika Rao</a>, <a class="af oc" href="https://www.linkedin.com/in/esmir-mesic/" rel="noopener ugc nofollow" target="_blank">Esmir Mesic</a>, <a class="af oc" href="https://www.linkedin.com/in/hugodesmarques/" rel="noopener ugc nofollow" target="_blank">Hugo Marques</a></p><p id="6db9" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">This blog post is a continuation of <a class="af oc" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/title-launch-observability-at-netflix-scale-19ea916be1ed">Part 2</a>, where we cleared the ambiguity around title launch observability at Netflix. In this installment, we will explore the strategies, tools, and methodologies that were employed to achieve comprehensive title observability at scale.</p><h1 id="69a2" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Defining the observability endpoint</h1><p id="bc9b" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">To create a comprehensive solution, we decided to introduce observability endpoints first. 
Each microservice involved in our <strong class="ni gv">Personalization stack</strong> that integrated with our observability solution had to introduce a new “Title Health” endpoint. Our goal was for each new endpoint to adhere to a few principles:</p><ol class=""><li id="1298" class="ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob pe pf pg bk">Accurate reflection of production behavior</li><li id="cade" class="ng nh gu ni b hs ph nk nl hv pi nn no np pj nr ns nt pk nv nw nx pl nz oa ob pe pf pg bk">Standardization across all endpoints</li><li id="7204" class="ng nh gu ni b hs ph nk nl hv pi nn no np pj nr ns nt pk nv nw nx pl nz oa ob pe pf pg bk">Answering the Insight Triad: “Healthy” or not, why not, and how to fix it.</li></ol><p id="0cac" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">Accurately Reflecting Production Behavior</strong></p><p id="4dbf" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">A key part of our solution is insight into production behavior, which necessitates that requests to the endpoint generate traffic to the real service functions, mimicking the same pathways the traffic would take if it came from the usual callers.</p><p id="377d" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">In order to allow for this mimicking, many systems implement “event” handling, where they convert our request into a call to the real service with properties enabled to log when titles are filtered out of their response and why. Building services that adhere to software best practices, such as Object-Oriented Programming (OOP), the SOLID principles, and modularization, is crucial to success at this stage.
Without these practices, service endpoints may become tightly coupled to business logic, making it challenging and costly to add a new endpoint that seamlessly integrates with the observability solution while following the same production logic.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fj px bh py"><div class="pm pn po"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*8s2gCb2Pqw2Q0Frq%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*8s2gCb2Pqw2Q0Frq%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*8s2gCb2Pqw2Q0Frq%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*8s2gCb2Pqw2Q0Frq%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*8s2gCb2Pqw2Q0Frq%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*8s2gCb2Pqw2Q0Frq%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*8s2gCb2Pqw2Q0Frq%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*8s2gCb2Pqw2Q0Frq 640w, https://miro.medium.com/v2/resize:fit:720/0*8s2gCb2Pqw2Q0Frq 720w, https://miro.medium.com/v2/resize:fit:750/0*8s2gCb2Pqw2Q0Frq 750w, https://miro.medium.com/v2/resize:fit:786/0*8s2gCb2Pqw2Q0Frq 786w, https://miro.medium.com/v2/resize:fit:828/0*8s2gCb2Pqw2Q0Frq 828w, https://miro.medium.com/v2/resize:fit:1100/0*8s2gCb2Pqw2Q0Frq 1100w, https://miro.medium.com/v2/resize:fit:1400/0*8s2gCb2Pqw2Q0Frq 1400w" 
sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn pz c" width="700" height="406" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qa ff qb pm pn qc qd bf b bg z du"><em class="qe">A service with modular business logic facilitates the seamless addition of an observability endpoint.</em></figcaption></figure><p id="2da9" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">Standardization</strong></p><p id="e9b6" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">To standardize communication between our observability service and the personalization stack’s observability endpoints, we’ve developed a stable proto request/response format. This centralized format, defined and maintained by our team, ensures all endpoints adhere to a consistent protocol. As a result, requests are uniformly handled, and responses are processed cohesively. 
This standardization enhances adoption within the personalization stack, simplifies the system, and improves understanding and debuggability for engineers.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fj px bh py"><div class="pm pn qf"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*P-0nxUAHve77yBtv%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*P-0nxUAHve77yBtv%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*P-0nxUAHve77yBtv%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*P-0nxUAHve77yBtv%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*P-0nxUAHve77yBtv%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*P-0nxUAHve77yBtv%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*P-0nxUAHve77yBtv%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*P-0nxUAHve77yBtv 640w, https://miro.medium.com/v2/resize:fit:720/0*P-0nxUAHve77yBtv 720w, https://miro.medium.com/v2/resize:fit:750/0*P-0nxUAHve77yBtv 750w, https://miro.medium.com/v2/resize:fit:786/0*P-0nxUAHve77yBtv 786w, https://miro.medium.com/v2/resize:fit:828/0*P-0nxUAHve77yBtv 828w, https://miro.medium.com/v2/resize:fit:1100/0*P-0nxUAHve77yBtv 1100w, https://miro.medium.com/v2/resize:fit:1400/0*P-0nxUAHve77yBtv 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn pz c" width="700" height="152" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qa ff qb pm pn qc qd bf b bg z du"><em class="qe">The request schema for the observability endpoint.</em></figcaption></figure><p id="97c5" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">The Insight Triad API</strong></p><p id="95d2" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">To efficiently understand the health of a title and triage issues quickly, all implementations of the observability endpoint must answer: is the title eligible for this phase of promotion, if not — why is it not eligible, and what can be done to fix any problems.</p><p id="a1a7" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">The end-users of this observability system are Launch Managers, whose job it is to ensure smooth title launches. As such, they must be able to quickly see whether there is a problem, what the problem is, and how to solve it. Teams implementing the endpoint must provide as much information as possible so that a non-engineer (Launch Manager) can understand the root cause of the issue and fix any title setup issues as they arise. 
They must also provide enough information for partner engineers to identify the problem with the underlying service in cases of system-level issues.</p><p id="29f8" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">These requirements are captured in the following protobuf object that defines the endpoint response.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fj px bh py"><div class="pm pn qg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*aeo7vs3h2Z5JKH5t%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*aeo7vs3h2Z5JKH5t%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*aeo7vs3h2Z5JKH5t%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*aeo7vs3h2Z5JKH5t%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*aeo7vs3h2Z5JKH5t%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*aeo7vs3h2Z5JKH5t%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*aeo7vs3h2Z5JKH5t%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*aeo7vs3h2Z5JKH5t 640w, https://miro.medium.com/v2/resize:fit:720/0*aeo7vs3h2Z5JKH5t 720w, https://miro.medium.com/v2/resize:fit:750/0*aeo7vs3h2Z5JKH5t 750w, https://miro.medium.com/v2/resize:fit:786/0*aeo7vs3h2Z5JKH5t 786w, https://miro.medium.com/v2/resize:fit:828/0*aeo7vs3h2Z5JKH5t 828w, 
https://miro.medium.com/v2/resize:fit:1100/0*aeo7vs3h2Z5JKH5t 1100w, https://miro.medium.com/v2/resize:fit:1400/0*aeo7vs3h2Z5JKH5t 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn pz c" width="700" height="150" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qa ff qb pm pn qc qd bf b bg z du"><em class="qe">The response schema for the observability endpoint.</em></figcaption></figure><h1 id="b501" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">High level architecture</h1><p id="34a0" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">We’ve distilled our comprehensive solution into the following key steps, capturing the essence of our approach:</p><ol class=""><li id="7946" class="ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob pe pf pg bk">Establish observability endpoints across all services within our Personalization and Discovery Stack.</li><li id="d917" class="ng nh gu ni b hs ph nk nl hv pi nn no np pj nr ns nt pk nv nw nx pl nz oa ob pe pf pg bk">Implement proactive monitoring for each of these endpoints.</li><li id="feab" class="ng nh gu ni b hs ph nk nl hv pi nn no np pj nr ns nt pk nv nw nx pl nz oa ob pe pf pg bk">Track real-time title impressions from the Netflix UI.</li><li id="08ec" class="ng nh gu ni b hs ph nk nl hv pi nn no np pj nr ns nt pk nv nw nx pl nz 
oa ob pe pf pg bk">Store the data in an optimized, highly distributed datastore.</li><li id="109e" class="ng nh gu ni b hs ph nk nl hv pi nn no np pj nr ns nt pk nv nw nx pl nz oa ob pe pf pg bk">Offer easy-to-integrate APIs for our dashboard, enabling stakeholders to track specific titles effectively.</li><li id="51d0" class="ng nh gu ni b hs ph nk nl hv pi nn no np pj nr ns nt pk nv nw nx pl nz oa ob pe pf pg bk">“Time Travel” to validate ahead of time.</li></ol><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fj px bh py"><div class="pm pn qh"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*1h2cwZDfmz8nis_h%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*1h2cwZDfmz8nis_h%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*1h2cwZDfmz8nis_h%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*1h2cwZDfmz8nis_h%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*1h2cwZDfmz8nis_h%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*1h2cwZDfmz8nis_h%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*1h2cwZDfmz8nis_h%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*1h2cwZDfmz8nis_h 640w, https://miro.medium.com/v2/resize:fit:720/0*1h2cwZDfmz8nis_h 720w, https://miro.medium.com/v2/resize:fit:750/0*1h2cwZDfmz8nis_h 750w, 
https://miro.medium.com/v2/resize:fit:786/0*1h2cwZDfmz8nis_h 786w, https://miro.medium.com/v2/resize:fit:828/0*1h2cwZDfmz8nis_h 828w, https://miro.medium.com/v2/resize:fit:1100/0*1h2cwZDfmz8nis_h 1100w, https://miro.medium.com/v2/resize:fit:1400/0*1h2cwZDfmz8nis_h 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn pz c" width="700" height="445" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qa ff qb pm pn qc qd bf b bg z du"><em class="qe">Observability stack high level architecture diagram</em></figcaption></figure><p id="9f94" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">In the following sections, we will explore each of these concepts and components as illustrated in the diagram above.</p><h1 id="ea0e" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Key Features</h1><h2 id="5bfa" class="qi oe gu bf of qj qk dy oi ql qm ea ol np qn qo qp nt qq qr qs nx qt qu qv qw bk">Proactive monitoring through scheduled collectors jobs</h2><p id="2070" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">Our Title Health microservice runs a scheduled collector job every 30 minutes for most of our personalization stack.</p><p id="9419" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn 
bk">For each Netflix row we support (such as Trending Now, Coming Soon, etc.), there is a dedicated collector. These collectors retrieve the relevant list of titles from our catalog that qualify for a specific row by interfacing with our catalog services. These services know the expected subset of titles for each row, and it is that subset whose title health we assess.</p><p id="0bb9" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Once a collector retrieves its list of candidate titles, it orchestrates batched calls to the assigned row services using the standardized schema above to retrieve all the relevant health information for those titles. Additionally, some collectors instead poll our Kafka queue for impressions data.</p><h2 id="07cf" class="qi oe gu bf of qj qk dy oi ql qm ea ol np qn qo qp nt qq qr qs nx qt qu qv qw bk">Real-time Title Impressions and Kafka Queue</h2><p id="c0e6" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">In addition to evaluating title health via our personalization stack services, we also keep an eye on how our recommendation algorithms treat titles by reviewing impressions data. It’s essential that our algorithms treat all titles equitably, for each one has limitless potential.</p><p id="1754" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">This data is processed from a real-time impressions stream into a Kafka queue, which our title health system regularly polls. Specialized collectors access the Kafka queue every two minutes to retrieve impressions data.
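</p><p>As a sketch, a collector’s batched fan-out described above might look like the following Python. The batch size and every function here are illustrative stand-ins for our real catalog and row services.</p>

```python
BATCH_SIZE = 50  # illustrative; the real batch size is a tuning choice

def collect_row_health(row, fetch_candidates, fetch_health):
    """Gather health data for one row.

    fetch_candidates(row) -> list of candidate title ids (catalog service stand-in)
    fetch_health(row, batch) -> {title_id: health record} (row service stand-in)
    """
    titles = fetch_candidates(row)
    health = {}
    for i in range(0, len(titles), BATCH_SIZE):
        health.update(fetch_health(row, titles[i:i + BATCH_SIZE]))
    return health

# Toy stand-ins for the real services:
def fake_candidates(row):
    return [f"title-{n}" for n in range(120)]

def fake_health(row, batch):
    return {t: {"eligible": True} for t in batch}

statuses = collect_row_health("Trending Now", fake_candidates, fake_health)
```

<p>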
This data is then aggregated in one-minute intervals, calculating the number of impressions titles receive in near-real-time, and presented as an additional health status indicator for stakeholders.</p><h2 id="958b" class="qi oe gu bf of qj qk dy oi ql qm ea ol np qn qo qp nt qq qr qs nx qt qu qv qw bk">Data storage and distribution through Hollow Feeds</h2><p id="0fe5" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk"><a class="af oc" href="https://hollow.how/" rel="noopener ugc nofollow" target="_blank">Netflix Hollow</a> is an open-source Java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high-performance read-only access. Given the shape of our data, Hollow feeds are an excellent way to distribute the data across our service boxes.</p><p id="7314" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Once collectors gather health data from partner services in the personalization stack or from our impressions stream, this data is stored in a dedicated Hollow feed for each collector. Hollow offers numerous features that help us monitor the overall health of a Netflix row, including ensuring there are no large-scale issues across a feed publish. It also allows us to track the history of each title by maintaining a per-title data history, calculate differences between previous and current data versions, and roll back to earlier versions if a problematic data change is detected.</p><h2 id="321f" class="qi oe gu bf of qj qk dy oi ql qm ea ol np qn qo qp nt qq qr qs nx qt qu qv qw bk">Observability Dashboard using Health Check Engine</h2><p id="d7ac" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">We maintain several dashboards that utilize our title health service to present the status of titles to stakeholders.
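</p><p>A toy sketch of such a dashboard-facing read, with an in-memory dict standing in for the per-row Hollow feeds (all names here are hypothetical):</p>

```python
# One entry per supported row; in reality each row has a dedicated Hollow feed.
FEEDS = {
    "Trending Now": {"t1": "HEALTHY", "t2": "NOT_ELIGIBLE"},
    "Coming Soon": {"t1": "HEALTHY"},
}

def title_status(title_id):
    """Return {row_name: status} for every row that tracks this title."""
    return {
        row: titles[title_id]
        for row, titles in FEEDS.items()
        if title_id in titles
    }

status = title_status("t1")
```

<p>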
These user interfaces access an endpoint in our service, enabling them to request the current status of a title across all supported rows. This endpoint efficiently reads from all available Hollow Feeds to obtain the current status, thanks to Hollow’s in-memory capabilities. The results are returned in a standardized format, ensuring easy support for future UIs.</p><p id="5c46" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Additionally, we have other endpoints that can summarize the health of a title across subsets of sections to highlight specific member experiences.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fj px bh py"><div class="pm pn qg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*dBFS1pBlqNoCUHwV%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*dBFS1pBlqNoCUHwV%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*dBFS1pBlqNoCUHwV%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*dBFS1pBlqNoCUHwV%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*dBFS1pBlqNoCUHwV%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*dBFS1pBlqNoCUHwV%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*dBFS1pBlqNoCUHwV%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*dBFS1pBlqNoCUHwV 640w, 
https://miro.medium.com/v2/resize:fit:720/0*dBFS1pBlqNoCUHwV 720w, https://miro.medium.com/v2/resize:fit:750/0*dBFS1pBlqNoCUHwV 750w, https://miro.medium.com/v2/resize:fit:786/0*dBFS1pBlqNoCUHwV 786w, https://miro.medium.com/v2/resize:fit:828/0*dBFS1pBlqNoCUHwV 828w, https://miro.medium.com/v2/resize:fit:1100/0*dBFS1pBlqNoCUHwV 1100w, https://miro.medium.com/v2/resize:fit:1400/0*dBFS1pBlqNoCUHwV 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn pz c" width="700" height="116" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qa ff qb pm pn qc qd bf b bg z du">Message depicting a dashboard request.</figcaption></figure><h2 id="e3b2" class="qi oe gu bf of qj qk dy oi ql qm ea ol np qn qo qp nt qq qr qs nx qt qu qv qw bk">Time Traveling: Catching before launch</h2><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fj px bh py"><div class="pm pn qg"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*Zz2Y8yjPAsbG5WVR%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*Zz2Y8yjPAsbG5WVR%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*Zz2Y8yjPAsbG5WVR%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*Zz2Y8yjPAsbG5WVR%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*Zz2Y8yjPAsbG5WVR%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Zz2Y8yjPAsbG5WVR%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*Zz2Y8yjPAsbG5WVR%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*Zz2Y8yjPAsbG5WVR 640w, https://miro.medium.com/v2/resize:fit:720/0*Zz2Y8yjPAsbG5WVR 720w, https://miro.medium.com/v2/resize:fit:750/0*Zz2Y8yjPAsbG5WVR 750w, https://miro.medium.com/v2/resize:fit:786/0*Zz2Y8yjPAsbG5WVR 786w, https://miro.medium.com/v2/resize:fit:828/0*Zz2Y8yjPAsbG5WVR 828w, https://miro.medium.com/v2/resize:fit:1100/0*Zz2Y8yjPAsbG5WVR 1100w, https://miro.medium.com/v2/resize:fit:1400/0*Zz2Y8yjPAsbG5WVR 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn pz c" width="700" height="541" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="3f42" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Titles launching at Netflix go through several phases of pre-promotion before ultimately launching on our platform. For each of these phases, the first several hours of promotion are critical for the reach and effective personalization of a title, especially once the title has launched. Thus, to prevent issues as titles go through the launch lifecycle, our observability system needs to be capable of simulating traffic ahead of time so that relevant teams can catch and fix issues before they impact members. We call this capability <strong class="ni gv">“Time Travel”</strong>.</p><p id="6a4a" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Many of the metadata and assets involved in title setup have specific timelines for when they become available to members. To determine if a title will be viewable at the start of an experience, we must simulate a request to a partner service as if it were from a future time when those specific metadata or assets are available. This is achieved by including a future timestamp in our request to the observability endpoint, corresponding to when the title is expected to appear for a given experience. 
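</p><p>A hedged sketch of what such a request might look like; the field names are assumptions for illustration, not our actual request schema:</p>

```python
import time

def build_healthcheck_request(title_id, evaluation_epoch=None):
    """Build an observability request, optionally 'time traveling'.

    When evaluation_epoch is set to a future timestamp, downstream
    services should resolve metadata and asset availability as of that
    moment rather than as of now.
    """
    request = {"title_id": title_id}
    if evaluation_epoch is not None:
        request["evaluation_time"] = evaluation_epoch
    return request

# Validate a title 48 hours ahead of its promotion window:
future = int(time.time()) + 48 * 3600
req = build_healthcheck_request("some-title-id", evaluation_epoch=future)
```

<p>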
The endpoint then communicates with any further downstream services using the context of that future timestamp.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fj px bh py"><div class="pm pn qx"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*jrdqpJmp0lzna6Zc%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*jrdqpJmp0lzna6Zc%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*jrdqpJmp0lzna6Zc%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*jrdqpJmp0lzna6Zc%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*jrdqpJmp0lzna6Zc%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*jrdqpJmp0lzna6Zc%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*jrdqpJmp0lzna6Zc%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*jrdqpJmp0lzna6Zc 640w, https://miro.medium.com/v2/resize:fit:720/0*jrdqpJmp0lzna6Zc 720w, https://miro.medium.com/v2/resize:fit:750/0*jrdqpJmp0lzna6Zc 750w, https://miro.medium.com/v2/resize:fit:786/0*jrdqpJmp0lzna6Zc 786w, https://miro.medium.com/v2/resize:fit:828/0*jrdqpJmp0lzna6Zc 828w, https://miro.medium.com/v2/resize:fit:1100/0*jrdqpJmp0lzna6Zc 1100w, https://miro.medium.com/v2/resize:fit:1400/0*jrdqpJmp0lzna6Zc 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn pz c" width="700" height="118" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qa ff qb pm pn qc qd bf b bg z du">An example request with a future timestamp.</figcaption></figure><h1 id="a91c" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Conclusion</h1><p id="0f5b" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">Throughout this series, we’ve explored the journey of enhancing title launch observability at Netflix. In <a class="af oc" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/title-launch-observability-at-netflix-scale-c88c586629eb">Part 1</a>, we identified the challenges of managing vast content launches and the need for scalable solutions to ensure each title’s success. <a class="af oc" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/title-launch-observability-at-netflix-scale-19ea916be1ed">Part 2</a> highlighted the strategic approach to navigating ambiguity, introducing “Title Health” as a framework to align teams and prioritize core issues. 
In this final part, we detailed the sophisticated system strategies and architecture, including observability endpoints, proactive monitoring, and “Time Travel” capabilities, all designed to ensure a thrilling viewing experience.</p><p id="ad08" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">By investing in these innovative solutions, we enhance the discoverability and success of each title, fostering trust with content creators and partners. This journey not only bolsters our operational capabilities but also lays the groundwork for future innovations, ensuring that every story reaches its intended audience and that every member enjoys their favorite titles on Netflix.</p><p id="4e68" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Thank you for joining us on this exploration, and stay tuned for more insights and innovations as we continue to entertain the world.</p></div></div>]]></description>
      <link>https://netflixtechblog.com/title-launch-observability-at-netflix-scale-8efe69ebd653</link>
      <guid>https://netflixtechblog.com/title-launch-observability-at-netflix-scale-8efe69ebd653</guid>
      <pubDate>Wed, 05 Mar 2025 02:24:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Impressions at Netflix]]></title>
      <description><![CDATA[<div><div><h2 id="097f" class="pw-subtitle-paragraph hr gt gu bf b hs ht hu hv hw hx hy hz ia ib ic id ie if ig cq du">Part 1: Creating the Source of Truth for Impressions</h2><div></div><p id="ce57" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">By:</strong> <a class="af oc" href="https://www.linkedin.com/in/tulikabhatt/" rel="noopener ugc nofollow" target="_blank">Tulika Bhatt</a></p><p id="e560" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Imagine scrolling through Netflix, where each movie poster or promotional banner competes for your attention. Every image you hover over isn’t just a visual placeholder; it’s a critical data point that fuels our sophisticated personalization engine. At Netflix, we call these images ‘impressions,’ and they play a pivotal role in transforming your interaction from simple browsing into an immersive binge-watching experience, all tailored to your unique tastes.</p><p id="3c19" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Capturing these moments and turning them into a personalized journey is no simple feat. It requires a state-of-the-art system that can track and process these impressions while maintaining a detailed history of each profile’s exposure. This nuanced integration of data and technology empowers us to offer bespoke content recommendations.</p><p id="d15d" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">In this multi-part blog series, we take you behind the scenes of our system that processes billions of impressions daily. 
We will explore the challenges we encounter and unveil how we are building a resilient solution that transforms these client-side impressions into a personalized content discovery experience for every Netflix viewer.</p><figure class="og oh oi oj ok ol od oe paragraph-image"><div role="button" tabindex="0" class="om on fj oo bh op"><div class="od oe of"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*T6tQiUj-VDtyEhd1%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*T6tQiUj-VDtyEhd1%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*T6tQiUj-VDtyEhd1%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*T6tQiUj-VDtyEhd1%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*T6tQiUj-VDtyEhd1%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*T6tQiUj-VDtyEhd1%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*T6tQiUj-VDtyEhd1%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*T6tQiUj-VDtyEhd1 640w, https://miro.medium.com/v2/resize:fit:720/0*T6tQiUj-VDtyEhd1 720w, https://miro.medium.com/v2/resize:fit:750/0*T6tQiUj-VDtyEhd1 750w, https://miro.medium.com/v2/resize:fit:786/0*T6tQiUj-VDtyEhd1 786w, https://miro.medium.com/v2/resize:fit:828/0*T6tQiUj-VDtyEhd1 828w, https://miro.medium.com/v2/resize:fit:1100/0*T6tQiUj-VDtyEhd1 1100w, https://miro.medium.com/v2/resize:fit:1400/0*T6tQiUj-VDtyEhd1 1400w" sizes="(min-resolution: 
4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn oq c" width="700" height="330" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="or ff os od oe ot ou bf b bg z du">Impressions on homepage</figcaption></figure><h1 id="9ac6" class="ov ow gu bf ox oy oz hu pa pb pc hx pd pe pf pg ph pi pj pk pl pm pn po pp pq bk">Why do we need impression history?</h1><h2 id="16ca" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Enhanced Personalization</h2><p id="9340" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">To tailor recommendations more effectively, it’s crucial to track what content a user has already encountered. Impression history makes this possible: it lets us identify content that has been displayed on the homepage but not engaged with, so we can deliver fresh, engaging recommendations.</p><h2 id="b9ca" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Frequency Capping</h2><p id="549e" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">By maintaining a history of impressions, we can implement frequency capping to prevent over-exposure to the same content.
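</p><p>As a toy illustration of frequency capping over impression history (the threshold and all names are made up for the example):</p>

```python
from collections import defaultdict

class FrequencyCap:
    """Suppress a title once a profile has seen it `cap` times."""

    def __init__(self, cap=3):
        self.cap = cap
        self.seen = defaultdict(int)  # (profile_id, title_id) -> impression count

    def record_impression(self, profile_id, title_id):
        self.seen[(profile_id, title_id)] += 1

    def allow(self, profile_id, title_id):
        """Should this title still be shown to this profile?"""
        return self.seen[(profile_id, title_id)] < self.cap

cap = FrequencyCap()
for _ in range(3):
    cap.record_impression("profile-1", "title-42")
```

<p>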
This ensures users aren’t repeatedly shown identical options, keeping the viewing experience vibrant and reducing the risk of frustration or disengagement.</p><h2 id="dc85" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Highlighting New Releases</h2><p id="ba61" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">For new content, impression history helps us monitor initial user interactions and adjust our merchandising efforts accordingly. We can experiment with different content placements or promotional strategies to boost visibility and engagement.</p><h2 id="277c" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Analytical Insights</h2><p id="1d77" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">Additionally, impression history provides valuable insights for a range of platform-related analytics questions. For example, analyzing impression history can help determine how well a specific row on the home page is performing, or assess the effectiveness of a merchandising strategy.</p><h1 id="6595" class="ov ow gu bf ox oy oz hu pa pb pc hx pd pe pf pg ph pi pj pk pl pm pn po pp pq bk">Architecture Overview</h1><p id="3dda" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">The first pivotal step in managing impressions begins with the creation of a Source-of-Truth (SOT) dataset.
This foundational dataset is essential, as it supports various downstream workflows and enables a multitude of use cases.</p><h2 id="50f2" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Collecting Raw Impression Events</h2><p id="79ce" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">As Netflix members explore our platform, their interactions with the user interface spark a vast array of raw events. These events are promptly relayed from the client side to our servers, entering a centralized event processing queue. This queue ensures we are consistently capturing raw events from our global user base.</p><p id="2110" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">After raw events are collected into a centralized queue, a custom event extractor processes this data to identify and extract all impression events. These extracted events are then routed to an Apache Kafka topic for immediate processing needs and simultaneously stored in an Apache Iceberg table for long-term retention and historical analysis. 
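</p><p>Conceptually, the dual-path fan-out is just "write each extracted event to both sinks." A minimal sketch, with plain lists standing in for the Kafka topic and the Iceberg table:</p>

```python
def route_impression(event, stream_sink, batch_sink):
    """Send one extracted impression event down both paths."""
    stream_sink.append(event)  # low-latency consumers (Kafka-topic stand-in)
    batch_sink.append(event)   # long-term retention (Iceberg-table stand-in)

kafka_topic, iceberg_table = [], []
for event in [{"title": "t1"}, {"title": "t2"}]:
    route_impression(event, kafka_topic, iceberg_table)
```

<p>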
This dual-path approach leverages Kafka’s capability for low-latency streaming and Iceberg’s efficient management of large-scale, immutable datasets, ensuring both real-time responsiveness and comprehensive historical data availability.</p><figure class="og oh oi oj ok ol od oe paragraph-image"><div role="button" tabindex="0" class="om on fj oo bh op"><div class="od oe ql"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*4NRQp10pg9KK_GKU%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*4NRQp10pg9KK_GKU%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*4NRQp10pg9KK_GKU%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*4NRQp10pg9KK_GKU%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*4NRQp10pg9KK_GKU%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*4NRQp10pg9KK_GKU%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*4NRQp10pg9KK_GKU%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*4NRQp10pg9KK_GKU 640w, https://miro.medium.com/v2/resize:fit:720/0*4NRQp10pg9KK_GKU 720w, https://miro.medium.com/v2/resize:fit:750/0*4NRQp10pg9KK_GKU 750w, https://miro.medium.com/v2/resize:fit:786/0*4NRQp10pg9KK_GKU 786w, https://miro.medium.com/v2/resize:fit:828/0*4NRQp10pg9KK_GKU 828w, https://miro.medium.com/v2/resize:fit:1100/0*4NRQp10pg9KK_GKU 1100w, https://miro.medium.com/v2/resize:fit:1400/0*4NRQp10pg9KK_GKU 1400w" 
sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn oq c" width="700" height="303" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="or ff os od oe ot ou bf b bg z du">Collecting raw impression events</figcaption></figure><h2 id="48cf" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Filtering &amp; Enriching Raw Impressions</h2><p id="829b" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">Once the raw impression events are queued, a stateless Apache Flink job takes charge, meticulously processing this data. It filters out any invalid entries and enriches the valid ones with additional metadata, such as show or movie title details, and the specific page and row location where each impression was presented to users. This refined output is then structured using an Avro schema, establishing a definitive source of truth for Netflix’s impression data. The enriched data is seamlessly accessible for both real-time applications via Kafka and historical analysis through storage in an Apache Iceberg table. 
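As an illustration, the filtering and enrichment logic can be sketched in a few lines of Python. The field names, the metadata lookup, and the validation rules below are hypothetical stand-ins; the production job is a stateless Apache Flink pipeline that emits Avro records.

```python
# Minimal sketch of the filter-and-enrich step (hypothetical fields).
# The real pipeline is a stateless Apache Flink job writing Avro.

# Hypothetical title catalog used to enrich impressions with metadata.
TITLE_CATALOG = {
    81234567: {"title": "Example Show", "type": "SERIES"},
}

def is_valid(event: dict) -> bool:
    """Filter: drop events missing required identifiers."""
    return bool(event.get("profile_id")) and event.get("video_id") in TITLE_CATALOG

def enrich(event: dict) -> dict:
    """Enrich a valid raw event with title details and UI location."""
    meta = TITLE_CATALOG[event["video_id"]]
    return {
        **event,
        "title": meta["title"],
        "title_type": meta["type"],
        # Page and row where the impression was presented to the member.
        "location": {"page": event.get("page", "home"), "row": event.get("row", 0)},
    }

def process(raw_events: list[dict]) -> list[dict]:
    """Stateless pass: keep valid impressions, enrich each one."""
    return [enrich(e) for e in raw_events if is_valid(e)]

raw = [
    {"profile_id": "p1", "video_id": 81234567, "page": "home", "row": 3},
    {"profile_id": "", "video_id": 81234567},    # invalid: no profile id
    {"profile_id": "p2", "video_id": 99999999},  # invalid: unknown title
]
enriched = process(raw)
```

Because every event is handled independently, a job like this parallelizes cleanly, which is what makes the stateless design attractive at this scale.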
This dual availability ensures immediate processing capabilities alongside comprehensive long-term data retention.</p><figure class="og oh oi oj ok ol od oe paragraph-image"><div role="button" tabindex="0" class="om on fj oo bh op"><div class="od oe qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*Lhs-gvhMuIyKylHt%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*Lhs-gvhMuIyKylHt%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*Lhs-gvhMuIyKylHt%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*Lhs-gvhMuIyKylHt%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*Lhs-gvhMuIyKylHt%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Lhs-gvhMuIyKylHt%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*Lhs-gvhMuIyKylHt%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*Lhs-gvhMuIyKylHt 640w, https://miro.medium.com/v2/resize:fit:720/0*Lhs-gvhMuIyKylHt 720w, https://miro.medium.com/v2/resize:fit:750/0*Lhs-gvhMuIyKylHt 750w, https://miro.medium.com/v2/resize:fit:786/0*Lhs-gvhMuIyKylHt 786w, https://miro.medium.com/v2/resize:fit:828/0*Lhs-gvhMuIyKylHt 828w, https://miro.medium.com/v2/resize:fit:1100/0*Lhs-gvhMuIyKylHt 1100w, https://miro.medium.com/v2/resize:fit:1400/0*Lhs-gvhMuIyKylHt 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn oq c" width="700" height="447" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="or ff os od oe ot ou bf b bg z du">Impression Source-of-Truth architecture</figcaption></figure><h2 id="d39b" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Ensuring High Quality Impressions</h2><p id="d682" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">Maintaining the highest quality of impressions is a top priority. We accomplish this by gathering detailed column-level metrics that offer insights into the state and quality of each impression. These metrics include everything from validating identifiers to checking that essential columns are properly filled. The data collected feeds into a comprehensive quality dashboard and supports a tiered threshold-based alerting system. These alerts promptly notify us of any potential issues, enabling us to swiftly address regressions. 
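A minimal sketch of what such column-level metrics with tiered, threshold-based alerts might look like, in Python; the column names and thresholds here are illustrative, not the actual rules.

```python
# Illustrative column-level quality metric (null rate) with tiered
# threshold alerts. Column names and thresholds are hypothetical.

def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where an essential column is missing or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if not r.get(column))
    return missing / len(rows)

# Tiers checked worst-first: (level, max acceptable null rate).
TIERS = [("critical", 0.10), ("warn", 0.01)]

def alerts_for(rows: list[dict], columns: list[str]) -> list[tuple[str, str, float]]:
    """Emit the most severe breached tier for each monitored column."""
    out = []
    for col in columns:
        rate = null_rate(rows, col)
        for level, threshold in TIERS:
            if rate > threshold:
                out.append((level, col, rate))
                break  # report only the most severe tier per column
    return out

rows = [{"entity_id": 1, "video_id": 1}] * 95 + [{"entity_id": None, "video_id": 1}] * 5
print(alerts_for(rows, ["entity_id", "video_id"]))  # → [('warn', 'entity_id', 0.05)]
```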
Additionally, while enriching the data, we ensure that all columns are in agreement with each other, offering in-place corrections wherever possible to deliver accurate data.</p><figure class="og oh oi oj ok ol od oe paragraph-image"><div role="button" tabindex="0" class="om on fj oo bh op"><div class="od oe qn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*VWssCnOIabEqo02H%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*VWssCnOIabEqo02H%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*VWssCnOIabEqo02H%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*VWssCnOIabEqo02H%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*VWssCnOIabEqo02H%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*VWssCnOIabEqo02H%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*VWssCnOIabEqo02H%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*VWssCnOIabEqo02H 640w, https://miro.medium.com/v2/resize:fit:720/0*VWssCnOIabEqo02H 720w, https://miro.medium.com/v2/resize:fit:750/0*VWssCnOIabEqo02H 750w, https://miro.medium.com/v2/resize:fit:786/0*VWssCnOIabEqo02H 786w, https://miro.medium.com/v2/resize:fit:828/0*VWssCnOIabEqo02H 828w, https://miro.medium.com/v2/resize:fit:1100/0*VWssCnOIabEqo02H 1100w, https://miro.medium.com/v2/resize:fit:1400/0*VWssCnOIabEqo02H 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn oq c" width="700" height="428" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="or ff os od oe ot ou bf b bg z du">Dashboard showing mismatch count between two columns- entityId and videoId</figcaption></figure><h1 id="fed2" class="ov ow gu bf ox oy oz hu pa pb pc hx pd pe pf pg ph pi pj pk pl pm pn po pp pq bk">Configuration</h1><p id="9417" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">We handle a staggering volume of 1 to 1.5 million impression events globally every second, with each event approximately 1.2KB in size. To efficiently process this massive influx in real-time, we employ Apache Flink for its low-latency stream processing capabilities, which seamlessly integrates both batch and stream processing to facilitate efficient backfilling of historical data and ensure consistency across real-time and historical analyses. Our Flink configuration includes 8 task managers per region, each equipped with 8 CPU cores and 32GB of memory, operating at a parallelism of 48, allowing us to handle the necessary scale and speed for seamless performance delivery. 
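A quick back-of-envelope check of these figures, using the peak numbers quoted above, shows why low-latency stream processing is essential here:

```python
# Back-of-envelope throughput check using the figures quoted above:
# up to ~1.5 million impression events per second, ~1.2 KB each.

EVENTS_PER_SEC = 1_500_000   # peak global event rate
EVENT_SIZE_KB = 1.2          # approximate event size

# Raw ingress bandwidth the pipeline must sustain at peak.
ingress_mb_per_sec = EVENTS_PER_SEC * EVENT_SIZE_KB / 1024
print(f"peak ingress: ~{ingress_mb_per_sec:,.0f} MB/s")  # ~1,758 MB/s

# At the stated Flink parallelism of 48, each parallel subtask
# handles roughly this many events per second.
events_per_subtask = EVENTS_PER_SEC / 48
print(f"per-subtask rate: ~{events_per_subtask:,.0f} events/s")  # ~31,250 events/s
```

That is roughly 1.7 GB/s of raw events at peak, spread across the 48 parallel subtasks.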
The Flink job’s sink is equipped with a data mesh connector, as detailed in our <a class="af oc" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873">Data Mesh platform</a> which has two outputs: Kafka and Iceberg. This setup allows for efficient streaming of real-time data through Kafka and the preservation of historical data in Iceberg, providing a comprehensive and flexible data processing and storage solution.</p><figure class="og oh oi oj ok ol od oe paragraph-image"><div role="button" tabindex="0" class="om on fj oo bh op"><div class="od oe of"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*B-hm-UJMBV7-WOb6%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*B-hm-UJMBV7-WOb6%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*B-hm-UJMBV7-WOb6%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*B-hm-UJMBV7-WOb6%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*B-hm-UJMBV7-WOb6%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*B-hm-UJMBV7-WOb6%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*B-hm-UJMBV7-WOb6%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*B-hm-UJMBV7-WOb6 640w, https://miro.medium.com/v2/resize:fit:720/0*B-hm-UJMBV7-WOb6 720w, https://miro.medium.com/v2/resize:fit:750/0*B-hm-UJMBV7-WOb6 750w, 
https://miro.medium.com/v2/resize:fit:786/0*B-hm-UJMBV7-WOb6 786w, https://miro.medium.com/v2/resize:fit:828/0*B-hm-UJMBV7-WOb6 828w, https://miro.medium.com/v2/resize:fit:1100/0*B-hm-UJMBV7-WOb6 1100w, https://miro.medium.com/v2/resize:fit:1400/0*B-hm-UJMBV7-WOb6 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh mn oq c" width="700" height="235" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="or ff os od oe ot ou bf b bg z du">Raw impressions records per second</figcaption></figure><p id="8d4a" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">We utilize the ‘island model’ for deploying our Flink jobs, where all dependencies for a given application reside within a single region. This approach ensures high availability by isolating regions, so if one becomes degraded, others remain unaffected, allowing traffic to be shifted between regions to maintain service continuity. 
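The failover behavior of the island model can be illustrated with a toy routing function; the region names and health states below are hypothetical, and in practice traffic shifting happens upstream of the Flink jobs.

```python
# Toy illustration of island-model failover: each region's Flink job
# processes only local data, so a degraded region is handled by shifting
# traffic to a healthy region. Region names and states are hypothetical.

REGION_HEALTH = {"us-east-1": "healthy", "us-west-2": "degraded", "eu-west-1": "healthy"}

def route(event_region: str) -> str:
    """Keep traffic in its home region unless that region is degraded."""
    if REGION_HEALTH.get(event_region) == "healthy":
        return event_region
    # Shift traffic to the first healthy region; each island carries all
    # the dependencies needed to process events independently.
    return next(r for r, h in REGION_HEALTH.items() if h == "healthy")

print(route("us-west-2"))  # traffic shifts away from the degraded region
```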
Thus, all data in one region is processed by the Flink job deployed within that region.</p><h1 id="9ca3" class="ov ow gu bf ox oy oz hu pa pb pc hx pd pe pf pg ph pi pj pk pl pm pn po pp pq bk">Future Work</h1><h2 id="0a4f" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Addressing the Challenge of Unschematized Events</h2><p id="ef75" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">Allowing raw events to land on our centralized processing queue unschematized offers significant flexibility, but it also introduces challenges. Without a defined schema, it can be difficult to determine whether missing data was intentional or due to a logging error. We are investigating solutions to introduce schema management that maintains flexibility while providing clarity.</p><h2 id="8f38" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Automating Performance Tuning with Autoscalers</h2><p id="c962" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">Tuning the performance of our Apache Flink jobs is currently a manual process. The next step is to integrate with autoscalers, which can dynamically adjust resources based on workload demands. This integration will not only optimize performance but also ensure more efficient resource utilization.</p><h2 id="fa29" class="pr ow gu bf ox ps pt dy pa pu pv ea pd np pw px py nt pz qa qb nx qc qd qe qf bk">Improving Data Quality Alerts</h2><p id="64ef" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">Right now, many business rules dictate when a data quality alert should fire. This produces many false positives that require manual judgement. 
It is also often difficult to trace the changes that led to a regression because of inadequate data lineage information. We are investing in building a comprehensive data quality platform that more intelligently identifies anomalies in our impression stream, tracks data lineage and governance, and generates alerts notifying producers of any regressions. This approach will enhance efficiency, reduce manual oversight, and ensure a higher standard of data integrity.</p><h1 id="8555" class="ov ow gu bf ox oy oz hu pa pb pc hx pd pe pf pg ph pi pj pk pl pm pn po pp pq bk">Conclusion</h1><p id="1482" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">Creating a reliable source of truth for impressions is a complex but essential task that enhances the personalization and discovery experience. Stay tuned for the next part of this series, where we’ll delve into how we use this SOT dataset to create a microservice that provides impression histories. We invite you to share your thoughts in the comments and continue with us on this journey of discovering impressions.</p><h1 id="22e6" class="ov ow gu bf ox oy oz hu pa pb pc hx pd pe pf pg ph pi pj pk pl pm pn po pp pq bk">Acknowledgments</h1><p id="0174" class="pw-post-body-paragraph ng nh gu ni b hs qg nk nl hv qh nn no np qi nr ns nt qj nv nw nx qk nz oa ob gn bk">We are genuinely grateful to our amazing colleagues whose contributions were essential to the success of Impressions: Julian Jaffe, Bryan Keller, Yun Wang, Brandon Bremen, Kyle Alford, Ron Brown and Shriya Arora.</p></div></div>]]></description>
      <link>https://netflixtechblog.com/introducing-impressions-at-netflix-e2b67c88c9fb</link>
      <guid>https://netflixtechblog.com/introducing-impressions-at-netflix-e2b67c88c9fb</guid>
      <pubDate>Sat, 15 Feb 2025 02:13:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Title Launch Observability at Netflix Scale]]></title>
      <description><![CDATA[<div><div><h2 id="af30" class="pw-subtitle-paragraph hr gt gu bf b hs ht hu hv hw hx hy hz ia ib ic id ie if ig cq du">Part 2: Navigating Ambiguity</h2><div></div><p id="77d1" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">By:</strong> <a class="af oc" href="https://www.linkedin.com/in/varun-khaitan/" rel="noopener ugc nofollow" target="_blank">Varun Khaitan</a></p><p id="72bf" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">With special thanks to my stunning colleagues: <a class="af oc" href="https://www.linkedin.com/in/mallikarao/" rel="noopener ugc nofollow" target="_blank">Mallika Rao</a>, <a class="af oc" href="https://www.linkedin.com/in/esmir-mesic/" rel="noopener ugc nofollow" target="_blank">Esmir Mesic</a>, <a class="af oc" href="https://www.linkedin.com/in/hugodesmarques/" rel="noopener ugc nofollow" target="_blank">Hugo Marques</a></p><p id="90f9" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Building on the foundation laid in <a class="af oc" href="https://medium.com/netflix-techblog/title-launch-observability-at-netflix-scale-c88c586629eb" rel="noopener">Part 1</a>, where we explored the “what” behind the challenges of title launch observability at Netflix, this post shifts focus to the “how.” How do we ensure every title launches seamlessly and remains discoverable by the right audience?</p><p id="8c5c" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">In the dynamic world of technology, it’s tempting to leap into problem-solving mode. But the key to lasting success lies in taking a step back — understanding the broader context before diving into solutions. 
This thoughtful approach doesn’t just address immediate hurdles; it builds the resilience and scalability needed for the future. Let’s explore how this mindset drives results.</p><h1 id="afff" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Understanding the Bigger Picture</h1><p id="1ecd" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">Let’s take a comprehensive look at all the elements involved and how they interconnect. We should aim to address questions such as: What is vital to the business? Which aspects of the problem are essential to resolve? And how did we arrive at this point?</p><p id="b0f8" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">This process involves:</p><ol class=""><li id="bdf3" class="ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob pe pf pg bk"><strong class="ni gv">Identifying Stakeholders: </strong>Determine who is impacted by the issue and whose input is crucial for a successful resolution. 
In this case, the main stakeholders are:<p>-<strong class="ni gv"><em class="ph"> Title Launch Operators<br />Role:</em></strong><em class="ph"> Responsible for setting up the title and its metadata into our systems.<br /></em><strong class="ni gv"><em class="ph">Challenge:</em></strong><em class="ph"> Don’t understand the cascading effects of their setup on these perceived black box personalization systems</em></p><p>-<strong class="ni gv"><em class="ph"> Personalization System Engineers</em></strong><em class="ph"><br /></em><strong class="ni gv"><em class="ph">Role: </em></strong><em class="ph">Develop and operate the personalization systems.<br /></em><strong class="ni gv"><em class="ph">Challenge:</em></strong><em class="ph"> End up spending unplanned cycles on title launch and personalization investigations.</em></p><p>- <strong class="ni gv"><em class="ph">Product Managers </em></strong><em class="ph"><br /></em><strong class="ni gv"><em class="ph">Role: </em></strong><em class="ph">Ensure we put forward the best experience for our members.<br /></em><strong class="ni gv"><em class="ph">Challenge: </em></strong><em class="ph">Members may not connect with the most relevant title.</em></p><p>- <strong class="ni gv"><em class="ph">Creative Representatives</em></strong><em class="ph"> <br /></em><strong class="ni gv"><em class="ph">Role:</em></strong><em class="ph"> Mediator between the content creators and Netflix.<br /></em><strong class="ni gv"><em class="ph">Challenge: </em></strong><em class="ph">Build trust in the Netflix brand with content creators.</em></p></li><li id="fb92" class="ng nh gu ni b hs pi nk nl hv pj nn no np pk nr ns nt pl nv nw nx pm nz oa ob pe pf pg bk"><strong class="ni gv">Mapping the Current Landscape:</strong> By charting the existing landscape, we can pinpoint areas ripe for improvement and steer clear of redundant efforts. 
Beyond the scattered solutions and makeshift scripts, it became evident that there was no established solution for title launch observability. This suggests that this area has been neglected for quite some time and likely requires significant investment. This situation presents both challenges and opportunities; while it may be more difficult to make initial progress, there are plenty of easy wins to capitalize on.</li><li id="d0ea" class="ng nh gu ni b hs pi nk nl hv pj nn no np pk nr ns nt pl nv nw nx pm nz oa ob pe pf pg bk"><strong class="ni gv">Clarifying the Core Problem:</strong> By clearly defining the problem, we can ensure that our solutions address the root cause rather than just the symptoms. While there were many issues and problems we could address, the core problem here was to make sure every title was treated fairly by our personalization stack. If we can ensure fair treatment with confidence and bring that visibility to all our stakeholders, we can address all their challenges.</li><li id="3881" class="ng nh gu ni b hs pi nk nl hv pj nn no np pk nr ns nt pl nv nw nx pm nz oa ob pe pf pg bk"><strong class="ni gv">Assessing Business Priorities: </strong>Understanding what is most important to the organization helps prioritize actions and resources effectively. In this context, we’re focused on developing systems that ensure successful title launches, build trust between content creators and our brand, and reduce engineering operational overhead. 
While this is a critical business need and we definitely should solve it, it’s essential to evaluate how it stacks up against other priorities across different areas of the organization.</li></ol><h1 id="4208" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Defining Title Health</h1><p id="2dfd" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">Navigating such an ambiguous space required a shared understanding to foster clarity and collaboration. To address this, we introduced the term “Title Health,” a concept designed to help us communicate effectively and capture the nuances of maintaining each title’s visibility and performance. This shared language became a foundation for discussing the complexities of this domain.</p><p id="70a0" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">“Title Health”</strong> encompasses various metrics and indicators that reflect how well a title is performing, in terms of discoverability and member engagement. 
The three main questions we try to answer are:</p><ol class=""><li id="7e75" class="ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob pe pf pg bk">Is this title visible at all to <strong class="ni gv">any</strong> <strong class="ni gv">member</strong>?</li><li id="777c" class="ng nh gu ni b hs pi nk nl hv pj nn no np pk nr ns nt pl nv nw nx pm nz oa ob pe pf pg bk">Is this title visible to an appropriate <strong class="ni gv">audience size</strong>?</li><li id="0cf3" class="ng nh gu ni b hs pi nk nl hv pj nn no np pk nr ns nt pl nv nw nx pm nz oa ob pe pf pg bk">Is this title reaching <strong class="ni gv">all the appropriate audiences</strong>?</li></ol><p id="b2dd" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Defining Title Health provided a framework to monitor and optimize each title’s lifecycle. It allowed us to align with partners on principles and requirements before building solutions, ensuring every title reaches its intended audience seamlessly. This common language not only introduced the problem space effectively but also accelerated collaboration and decision-making across teams.</p><h1 id="b793" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Categories of issues</h1><p id="0431" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">To build a robust plan for title launch observability, we first needed to categorize the types of issues we encounter. 
This structured approach allows us to address all aspects of title health comprehensively.</p><p id="86be" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Currently, these issues are grouped into three primary categories:</p><p id="62b5" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">1. Title Setup</strong></p><p id="1422" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">A title’s setup includes essential attributes like metadata (e.g., launch dates, audio and subtitle languages, editorial tags) and assets (e.g., artwork, trailers, supplemental messages). These elements are critical for a title’s eligibility in a row, accurate personalization, and an engaging presentation. Since these attributes feed directly into algorithms, any delays or inaccuracies can ripple through the system.</p><p id="c1fa" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">The observability system must ensure that title setup is complete and validated in a timely manner, identify potential bottlenecks and ensure a smooth launch process.</p><p id="551b" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">2. Personalization Systems</strong></p><p id="4456" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Titles are eligible to be recommended across multiple canvases on product — HomePage, Coming Soon, Messaging, Search and more. 
Personalization systems handle the recommendation and serving of titles on these canvases, leveraging a vast ecosystem of microservices, caches, databases, code, and configurations to build these product canvases.</p><p id="d32b" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">We aim to validate that titles are eligible in all appropriate product canvases across the end to end personalization stack during all of the title’s launch phases.</p><p id="928f" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk"><strong class="ni gv">3. Algorithms</strong></p><p id="3751" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Complex algorithms drive each personalized product experience, recommending titles tailored to individual members. Observability here means validating the accuracy of algorithmic recommendations for all titles.<br />Algorithmic performance can be affected by various factors, such as model shortcomings, incomplete or inaccurate input signals, feature anomalies, or interactions between titles. 
Identifying and addressing these issues ensures that recommendations remain precise and effective.</p><p id="a466" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">By categorizing issues into these areas, we can systematically address challenges and deliver a reliable, personalized experience for every title on our platform.</p><h1 id="7b81" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Issue Analysis</h1><p id="a29f" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">Let’s also look at how often we see each of these types of issues and how much effort it takes to fix them once they come up.</p><figure class="pq pr ps pt pu pv pn po paragraph-image"><div class="pn po pp"><img src="https://miro.medium.com/v2/resize:fit:1400/0*YyCLwVKiGE_L6fWb" alt="Chart of issue types by frequency of occurrence and effort to fix" width="700" height="619" /></div></figure><p id="030a" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">From the above chart, we see that setup issues are the most common, but they are also easy to fix since it’s relatively straightforward to go back and rectify a title’s metadata. System issues, which mostly manifest as bugs in our personalization microservices, are not uncommon, and they take moderate effort to address. 
Algorithm issues, while rare, are really difficult to address since these often involve interpreting and retraining complex machine learning models.</p><h1 id="c0e1" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Evaluating Our Options</h1><p id="473f" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">Now that we understand more deeply the problems we want to address and how to prioritize our resources, let’s go back to the two options we discussed in Part 1 and make an informed decision.</p><figure class="pq pr ps pt pu pv pn po paragraph-image"><div class="pn po qb"><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*5YRxJT3YI53wgtLs9zO6gg.png" alt="image" width="700" height="263" /></div></figure><p id="64d2" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Ultimately, we realized this space demands the full spectrum of features we’ve discussed. But the question remained: <em class="ph">Where do we start?</em> <br />After careful consideration, we chose to focus on proactive issue detection first. 
Catching problems before launch offered the greatest potential for business impact, ensuring smoother launches, better member experiences, and stronger system reliability.</p><p id="84b9" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">This decision wasn’t just about solving today’s challenges — it was about laying the foundation for a scalable, robust system that can grow with the complexities of our ever-evolving platform.</p><h1 id="ab95" class="od oe gu bf of og oh hu oi oj ok hx ol om on oo op oq or os ot ou ov ow ox oy bk">Up next</h1><p id="f46c" class="pw-post-body-paragraph ng nh gu ni b hs oz nk nl hv pa nn no np pb nr ns nt pc nv nw nx pd nz oa ob gn bk">In the next iteration we will talk about how to design an observability endpoint that works for all personalization systems. What are the main things to keep in mind while creating a microservice API endpoint? How do we ensure standardization? What is the architecture of the systems involved?</p><p id="a208" class="pw-post-body-paragraph ng nh gu ni b hs nj nk nl hv nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob gn bk">Keep an eye out for our next binge-worthy episode!</p></div></div>]]></description>
      <link>https://netflixtechblog.com/title-launch-observability-at-netflix-scale-19ea916be1ed</link>
      <guid>https://netflixtechblog.com/title-launch-observability-at-netflix-scale-19ea916be1ed</guid>
      <pubDate>Tue, 07 Jan 2025 02:25:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Part 3: A Survey of Analytics Engineering Work at Netflix]]></title>
      <description><![CDATA[<div class="gn go gp gq gr"><div class="ab cb"><div class="ci bh fz ga gb gc"><div><div></div><p id="ca1c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">This article is the last in a multi-part series sharing a breadth of Analytics Engineering work at Netflix, recently presented as part of our annual internal Analytics Engineering conference. Need to catch up? Check out </em><a class="af nv" href="https://research.netflix.com/publication/part-1-a-survey-of-analytics-engineering-work-at-netflix" rel="noopener ugc nofollow" target="_blank"><em class="nu">Part 1</em></a><em class="nu">, which detailed how we’re empowering Netflix to efficiently produce and effectively deliver high quality, actionable analytic insights across the company and </em><a class="af nv" href="https://research.netflix.com/publication/part-2-a-survey-of-analytics-engineering-work-at-netflix" rel="noopener ugc nofollow" target="_blank"><em class="nu">Part 2</em></a><em class="nu">, which stepped through a few exciting business applications for Analytics Engineering. This post will go into aspects of technical craft.</em></p><h1 id="a095" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Dashboard Design Tips</h1><p id="1817" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk"><a class="af nv" href="https://www.linkedin.com/in/rinachang" rel="noopener ugc nofollow" target="_blank">Rina Chang</a>, <a class="af nv" href="https://www.linkedin.com/in/shansusielu/" rel="noopener ugc nofollow" target="_blank">Susie Lu</a></p><p id="4102" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">What is design, and why does it matter? Often people think design is about how things look, but design is actually about how things work. 
Everything is designed, because we’re all making choices about how things work, but not everything is designed well. Good design doesn’t waste time or mental energy; instead, it helps the user achieve their goals.</p><p id="59f0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">When applying this to a dashboard application, the easiest way to use design effectively is to leverage existing patterns. (For example, people have learned that blue underlined text on a website means it’s a clickable link.) So knowing the arsenal of available patterns and what they imply is useful when making the choice of when to use which pattern.</p><p id="692c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">First, to design a dashboard well, you need to understand your user.</p><ul class=""><li id="e325" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">Talk to your users throughout the entire product lifecycle. Talk to them early and often, through whatever means you can.</li><li id="12e2" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Understand their needs, ask why, then ask why again. Separate symptoms from problems from solutions.</li><li id="4374" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Prioritize and clarify — less is more! Distill what you can build that’s differentiated and provides the most value to your user.</li></ul><p id="dd4d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Here is a framework for thinking about what your users are trying to achieve. Where do your users fall on these axes? 
Don’t solve for multiple positions across these axes in a given view; if that need exists, create different views or potentially different dashboards.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div class="ph pi pj"><img src="https://miro.medium.com/v2/resize:fit:1400/0*ar0t2-zF5YVuXnUe" alt="Framework of axes describing what users are trying to achieve" width="700" height="370" /></div></figure><p id="106c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Second, understanding your users’ mental models will allow you to choose how to structure your app to match. A few questions to ask yourself when considering the information architecture of your app include:</p><ul class=""><li id="30c3" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">Do you have different user groups trying to accomplish different things? Split them into different apps or different views.</li><li id="feea" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">What should go together on a single page? All the information needed for a single user type to accomplish their “job.” If there are multiple <a class="af nv" href="https://www.christenseninstitute.org/theory/jobs-to-be-done/" rel="noopener ugc nofollow" target="_blank">jobs to be done</a>, split each out onto its own page.</li><li id="c406" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">What should go together within a single section on a page? 
All the information needed to answer a single question.</li><li id="3bd1" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Does your dashboard feel too difficult to use? You probably have too much information! When in doubt, keep it simple. If needed, hide complexity under an “Advanced” section.</li></ul><p id="0289" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Here are some general guidelines for page layouts:</p><ul class=""><li id="935c" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">Choose infinite scrolling vs. clicking through multiple pages depending on which option suits your users’ expectations better</li><li id="d91d" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Lead with the most-used information first, above the fold</li><li id="a5c0" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Create signposts that cue the user to where they are by labeling pages, sections, and links</li><li id="7543" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Use cards or borders to visually group related items together</li><li id="94e9" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Leverage nesting to create well-understood “scopes of control.” Specifically, users expect a controller object to affect children either below it (if horizontal) or to the right of it (if vertical).</li></ul><figure class="pk pl pm pn po pp ph pi paragraph-image"><div class="ph pi pv"><img src="https://miro.medium.com/v2/resize:fit:1400/0*KIqd6dZXD_NZyTKR" alt="image" width="700" height="977" /></div></figure><figure class="pk pl pm pn po pp ph pi paragraph-image"><div class="ph pi pv"><img src="https://miro.medium.com/v2/resize:fit:1400/0*O52xqUnDsJ8kPCVZ" alt="image" width="700" height="977" /></div></figure><figure class="pk pl pm pn po pp ph pi paragraph-image"><div class="ph pi pv"><img src="https://miro.medium.com/v2/resize:fit:1400/0*6qCQlTNyoabhrkVa" alt="image" width="700" height="977" /></div></figure><p id="8d9f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Third, some tips and tricks can help you more easily tackle the unique design challenges that come with making interactive charts.</p><ul class=""><li id="9697" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">Titles: Make sure filters are represented in the title or subtitle of the chart for easy scannability and screenshot-ability.</li><li id="c959" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Tooltips: Core details should be on the page, while the context in the tooltip is for deeper information. 
Annotate multiple points when there are only a handful of lines.</li><li id="c00c" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Annotations: Provide annotations on charts to explain shifts in values so all users can access that context.</li><li id="4445" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Color: Limit the number of colors you use. Be consistent in how you use colors. Otherwise, colors lose meaning.</li><li id="fe99" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk">Onboarding: Separate out onboarding to your dashboard from routine usage.</li></ul><p id="a37f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Finally, it is important to note that these are general guidelines, but there is always room for interpretation and/or the use of good judgment to adapt them to suit your own product and use cases. At the end of the day, the most important thing is that a user can leverage the data insights provided by your dashboard to perform their work, and good design is a means to that end.</p><h1 id="f4f7" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk"><strong class="al">Learnings from Deploying an Analytics API at Netflix</strong></h1><p id="31ad" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk"><a class="af nv" href="https://www.linkedin.com/in/devincarullo/" rel="noopener ugc nofollow" target="_blank">Devin Carullo</a></p><p id="430b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix Studio, we operate at the intersection of art and science. 
Data is a tool that enhances decision-making, complementing the deep expertise and industry knowledge of our creative professionals.</p><p id="970d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">One example is in production budgeting — namely, determining how much we should spend to produce a given show or movie. Although there was already a process for creating and comparing budgets for new productions against similar past projects, it was highly manual. We developed a tool that automatically selects and compares similar Netflix productions, flagging any anomalies for Production Finance to review.</p><p id="b3ea" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To ensure success, it was essential that results be delivered in real-time and integrated seamlessly into existing tools. This required close collaboration among product teams, DSE, and front-end and back-end developers. We developed a GraphQL endpoint using Metaflow, integrating it into the existing budgeting product. This solution enabled data to be used more effectively for real-time decision-making.</p><p id="d3db" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We recently launched our MVP and continue to iterate on the product. Reflecting on our journey, the path to launch was complex and filled with unexpected challenges. 
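</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As a rough sketch of the comparable-productions check described above (the similarity rule, the 50% tolerance, and all names here are illustrative assumptions, not the production logic), the core idea looks something like:</p>

```python
from statistics import median

def similar_productions(candidate, history, k=3):
    """Pick the k past productions closest in total budget.
    (A stand-in for the real, richer similarity logic.)"""
    return sorted(history, key=lambda p: abs(p["total"] - candidate["total"]))[:k]

def flag_anomalies(candidate, comps, tolerance=0.5):
    """Flag line items deviating more than `tolerance` from the comps' median."""
    flags = []
    for item, amount in candidate["lines"].items():
        values = [c["lines"][item] for c in comps if item in c["lines"]]
        if not values:
            continue  # no comparable data for this line item
        m = median(values)
        if m and abs(amount - m) / m > tolerance:
            flags.append(item)  # surface for Production Finance review
    return flags
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">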
As an analytics engineer accustomed to crafting quick solutions, I underestimated the effort required to deploy a production-grade analytics API.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div class="ph pi pj"><img src="https://miro.medium.com/v2/resize:fit:1400/0*KOgCUre0HvjZ82ZH" alt="image" width="700" height="86" /></div><figcaption class="pw ff px ph pi py pz bf b bg z du">Fig 1: My vague idea of how my API would work</figcaption></figure><figure class="pk pl pm pn po pp ph pi paragraph-image"><div class="ph pi pj"><img src="https://miro.medium.com/v2/resize:fit:1400/0*BBEHaQdU_e57_sjD" alt="image" width="700" height="392" /></div><figcaption class="pw ff px ph pi py pz bf b bg z du">Fig 2: Our actual solution</figcaption></figure><p id="3a82" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">With hindsight, below are my key learnings.</p><p id="3ed3" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Measure Impact and Necessity of Real-Time Results</strong></p><p id="8dac" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Before implementing real-time analytics, assess whether real-time results are truly necessary for your use case. This can significantly impact the complexity and cost of your solution. 
Batch processing data may provide a similar impact and take significantly less time. It’s easier to develop and maintain, and tends to be more familiar for analytics engineers, data scientists, and data engineers.</p><p id="70ae" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Additionally, if you are developing a proof of concept, the upfront investment may not be worth it. Scrappy solutions can often be the best choice for analytics work.</p><p id="1905" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Explore All Available Solutions</strong></p><p id="fc32" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix, there were multiple established methods for creating an API, but none perfectly suited our specific use case. Metaflow, a tool developed at Netflix for data science projects, already supported REST APIs. However, this approach did not align with the preferred workflow of our engineering partners. Although they could integrate with REST endpoints, this solution presented inherent limitations. Large response sizes rendered the API/front-end integration unreliable, necessitating the addition of filter parameters to reduce the response size.</p><p id="ccc2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Additionally, the product we were integrating into was using GraphQL, and deviating from this established engineering approach was not ideal. 
Lastly, given our goal to overlay results throughout the product, GraphQL features, such as federation, proved to be particularly advantageous.</p><p id="7ffd" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">After realizing there wasn’t an existing solution at Netflix for deploying Python endpoints with GraphQL, we worked with the Metaflow team to build this feature. This allowed us to continue developing via Metaflow while our engineering partners stayed on their paved path.</p><p id="c621" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Align on Performance Expectations</strong></p><p id="d3b5" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A major challenge during development was managing API latency. Much of this could have been mitigated by aligning on performance expectations from the outset. Initially, we operated under our own assumptions of what constituted an acceptable response time, which differed greatly from the actual needs of our users and our engineering partners.</p><p id="2ae7" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Understanding user expectations is key to designing an effective solution. Our methodology resulted in a full budget analysis taking, on average, 7 seconds. Users were willing to wait for an analysis when they modified a budget, but not every time they accessed one. To address this, we implemented caching using Metaflow, reducing the API response time to approximately 1 second for cached results.
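</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As a sketch of the caching pattern in plain Python (this is not Metaflow’s actual caching API; the function name and TTL are hypothetical):</p>

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=3600):
    """Cache results keyed by arguments, expiring after ttl_seconds."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]           # cached: returns in milliseconds
            result = fn(*args)          # cold: the full multi-second analysis runs
            store[args] = (now, result)
            return result
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=3600)
def analyze_budget(budget_id):
    # stand-in for the real budget analysis
    return {"budget_id": budget_id, "variance": 0.12}
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The first call pays the full analysis cost; repeat calls within the TTL return the stored result.</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">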
Additionally, we set up a nightly batch job to pre-cache results.</p><p id="7495" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">While users were generally okay with waiting for analysis during changes, we had to be mindful of GraphQL’s 30-second limit. This highlighted the importance of continuously monitoring the impact of changes on response times, leading us to our next key learning: rigorous testing.</p><p id="1924" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Real-Time Analysis Requires Rigorous Testing</strong></p><p id="ca7a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Load Testing: We leveraged Locust to measure the response time of our endpoint and assess how the endpoint responded to reasonable and elevated loads. We were able to use FullStory, which was already being used in the product, to estimate expected calls per minute.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*xVjhqU2DZV7RYBD0%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*xVjhqU2DZV7RYBD0%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*xVjhqU2DZV7RYBD0%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*xVjhqU2DZV7RYBD0%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*xVjhqU2DZV7RYBD0%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*xVjhqU2DZV7RYBD0%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*xVjhqU2DZV7RYBD0%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, 
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*xVjhqU2DZV7RYBD0 640w, https://miro.medium.com/v2/resize:fit:720/0*xVjhqU2DZV7RYBD0 720w, https://miro.medium.com/v2/resize:fit:750/0*xVjhqU2DZV7RYBD0 750w, https://miro.medium.com/v2/resize:fit:786/0*xVjhqU2DZV7RYBD0 786w, https://miro.medium.com/v2/resize:fit:828/0*xVjhqU2DZV7RYBD0 828w, https://miro.medium.com/v2/resize:fit:1100/0*xVjhqU2DZV7RYBD0 1100w, https://miro.medium.com/v2/resize:fit:1400/0*xVjhqU2DZV7RYBD0 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="260" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pw ff px ph pi py pz bf b bg z du">Fig 3. Locust allows us to simulate concurrent calls and measure response time</figcaption></figure><p id="4782" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Unit Tests &amp; Integration Tests: Code testing is always a good idea, but it can often be overlooked in analytics. 
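</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A minimal sketch of such tests (the analysis function, data, and latency budget here are all hypothetical):</p>

```python
import time
import unittest

def analyze_budget(line_items):
    """Hypothetical stand-in for the real analysis: flag overspent items."""
    return [item["name"] for item in line_items
            if item["spent"] > item["allocated"]]

class AnalysisTests(unittest.TestCase):
    def test_flags_overages_only(self):
        items = [
            {"name": "vfx", "allocated": 100, "spent": 120},
            {"name": "sound", "allocated": 50, "spent": 40},
        ]
        self.assertEqual(analyze_budget(items), ["vfx"])

    def test_latency_budget(self):
        # guard against regressions creeping toward GraphQL's 30-second limit
        start = time.monotonic()
        analyze_budget([{"name": "x", "allocated": 1, "spent": 2}] * 10_000)
        self.assertLess(time.monotonic() - start, 1.0)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(AnalysisTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">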
It is especially important when you are delivering live analysis, to prevent end users from being the first to see an error or incorrect information. We implemented unit testing and full integration tests, ensuring that our analysis would return correct results.</p><p id="91e6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">The Importance of Aligning Workflows and Collaboration</strong></p><p id="3443" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This project marked the first time our team collaborated directly with our engineering partners to integrate a DSE API into their product. Throughout the process, we discovered significant gaps in our understanding of each other’s workflows. Assumptions about each other’s knowledge and processes led to misunderstandings and delays.</p><p id="3d94" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Deployment Paths: Our engineering partners followed a strict deployment path, whereas our approach on the DSE side was more flexible. We typically tested our work on feature branches using Metaflow projects and then pushed results to production. However, this lack of control led to issues, such as inadvertently deploying changes to production before the corresponding product updates were ready and difficulties in managing a test endpoint.
Ultimately, we deferred to our engineering partners to establish a deployment path and collaborated with the Metaflow team and data engineers to implement it effectively.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi pj"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*BaqggE2wQ2C9Svo8%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*BaqggE2wQ2C9Svo8%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*BaqggE2wQ2C9Svo8%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*BaqggE2wQ2C9Svo8%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*BaqggE2wQ2C9Svo8%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*BaqggE2wQ2C9Svo8%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*BaqggE2wQ2C9Svo8%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*BaqggE2wQ2C9Svo8 640w, https://miro.medium.com/v2/resize:fit:720/0*BaqggE2wQ2C9Svo8 720w, https://miro.medium.com/v2/resize:fit:750/0*BaqggE2wQ2C9Svo8 750w, https://miro.medium.com/v2/resize:fit:786/0*BaqggE2wQ2C9Svo8 786w, https://miro.medium.com/v2/resize:fit:828/0*BaqggE2wQ2C9Svo8 828w, https://miro.medium.com/v2/resize:fit:1100/0*BaqggE2wQ2C9Svo8 1100w, https://miro.medium.com/v2/resize:fit:1400/0*BaqggE2wQ2C9Svo8 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="328" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pw ff px ph pi py pz bf b bg z du">Fig 4. Our current deployment path</figcaption></figure><p id="6fa7" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Work Planning: While the engineering team operated on sprints, our DSE team planned by quarters. This misalignment in planning cycles is an ongoing challenge that we are actively working to resolve.</p><p id="e86e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Looking ahead, our team is committed to continuing this partnership with our engineering colleagues. Both teams have invested significant time in building this relationship, and we are optimistic that it will yield substantial benefits in future projects.</p><h1 id="f327" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">External Speaker: Benn Stancil</h1><p id="fa9b" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">In addition to the above presentations, we kicked off our Analytics Summit with a keynote talk from <a class="af nv" href="https://www.linkedin.com/in/benn-stancil/" rel="noopener ugc nofollow" target="_blank">Benn Stancil</a>, Founder of Mode Analytics. 
Benn stepped through a history of the modern data stack, and the group discussed ideas on the future of analytics.</p></div></div></div><div class="ab cb qb qc qd qe" role="separator"><div class="gn go gp gq gr"><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="5e52" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Analytics Engineering is a key contributor to building our deep data culture at Netflix, and we are proud to have a large group of stunning colleagues that are not only applying but advancing our analytical capabilities at Netflix. The 2024 Analytics Summit continued to be a wonderful way to give visibility to one another on work across business verticals, celebrate our collective impact, and highlight what’s to come in analytics practice at Netflix.</p><p id="1920" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To learn more, follow the <a class="af nv" href="https://research.netflix.com/research-area/analytics" rel="noopener ugc nofollow" target="_blank">Netflix Research Site</a>, and if you are also interested in entertaining the world, have a look at <a class="af nv" href="https://explore.jobs.netflix.net/careers" rel="noopener ugc nofollow" target="_blank">our open roles</a>!</p></div></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/part-3-a-survey-of-analytics-engineering-work-at-netflix-e67f0aa82183</link>
      <guid>https://netflixtechblog.com/part-3-a-survey-of-analytics-engineering-work-at-netflix-e67f0aa82183</guid>
      <pubDate>Mon, 06 Jan 2025 20:27:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Part 2: A Survey of Analytics Engineering Work at Netflix]]></title>
      <description><![CDATA[<div class="gn go gp gq gr"><div class="ab cb"><div class="ci bh fz ga gb gc"><div><div></div><p id="c218" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">This article is the second in a multi-part series sharing a breadth of Analytics Engineering work at Netflix, recently presented as part of our annual internal Analytics Engineering conference. Need to catch up? Check out </em><a class="af nv" href="https://research.netflix.com/publication/part-1-a-survey-of-analytics-engineering-work-at-netflix" rel="noopener ugc nofollow" target="_blank"><em class="nu">Part 1</em></a><em class="nu">. In this article, we highlight a few exciting analytic business applications, and in our final article we’ll go into aspects of the technical craft.</em></p><h1 id="6e5b" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Game Analytics</h1><p id="cf63" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk"><a class="af nv" href="https://www.linkedin.com/in/yimeng-tang-49566b207/" rel="noopener ugc nofollow" target="_blank">Yimeng Tang</a>, <a class="af nv" href="https://www.linkedin.com/in/clairewilleck/" rel="noopener ugc nofollow" target="_blank">Claire Willeck</a>, <a class="af nv" href="https://www.linkedin.com/in/sagarpalao/" rel="noopener ugc nofollow" target="_blank">Sagar Palao</a></p><h1 id="3399" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">User Acquisition Incrementality for Netflix Games</h1><p id="af46" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Netflix has been launching games for the past three years, during which it has initiated various marketing efforts, including User Acquisition (UA) campaigns, to promote these games across different countries. 
These UA campaigns typically feature static creatives, launch trailers, and game review videos on platforms like Google, Meta, and TikTok. The primary goals of these campaigns are to encourage more people to install and play the games, making incremental installs and engagement crucial metrics for evaluating their effectiveness.</p><p id="8619" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Most UA campaigns are conducted at the country level, meaning that everyone in the targeted countries can see the ads. However, due to the absence of a control group in these countries, we adopt a synthetic control framework (<a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/round-2-a-survey-of-causal-inference-applications-at-netflix-fd78328ee0bb">blog post</a>) to estimate the counterfactual scenario. This involves creating a weighted combination of countries not exposed to the UA campaign to serve as a counterfactual for the treated countries. To facilitate easier access to incrementality results, we have developed an interactive tool powered by this framework. This tool allows users to directly obtain the lift in game installs and engagement, view plots for both the treated country and the synthetic control unit, and assess the p-value from placebo tests.</p><p id="b9e6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To better guide the design and budgeting of future campaigns, we are developing an Incremental Return on Investment model. This model incorporates factors such as the incremental impact, the value of the incremental engagement and incremental signups, and the cost of running the campaign. 
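</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The core of the synthetic-control step can be sketched as follows (illustrative data and a simple projected-gradient fit, not the production framework):</p>

```python
import numpy as np

def fit_synthetic_control(treated_pre, controls_pre, steps=5000, lr=0.001):
    """Find nonnegative weights summing to one so a weighted mix of
    control countries tracks the treated country before the campaign."""
    n_controls = controls_pre.shape[1]
    w = np.full(n_controls, 1.0 / n_controls)
    for _ in range(steps):
        grad = controls_pre.T @ (controls_pre @ w - treated_pre) / len(treated_pre)
        w = np.clip(w - lr * grad, 0.0, None)  # keep weights nonnegative...
        w /= w.sum()                           # ...and summing to one
    return w

rng = np.random.default_rng(0)
controls = rng.normal(100, 5, size=(30, 4))  # 30 pre-period days, 4 control countries
treated = controls @ np.array([0.5, 0.3, 0.2, 0.0]) + rng.normal(0, 0.5, 30)
w = fit_synthetic_control(treated, controls)
# post-period lift per day = actual installs - (post-period controls @ w)
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">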
In addition to using the causal inference framework mentioned earlier to estimate incrementality, we also leverage other frameworks, such as Incremental Account Lifetime Valuation (<a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f">blog post</a>), to assign value to the incremental engagement and signups resulting from the campaigns.</p><h1 id="bd39" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Measuring and Validating Incremental Signups for Netflix Games</h1><p id="9ea5" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Netflix is a subscription service, meaning members buy subscriptions that include games rather than the individual games themselves. This makes it difficult to measure the impact of different game launches on acquisition. We only observe signups, not why members signed up.</p><p id="c843" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This means we need to estimate incremental signups. We adopt an approach developed at Netflix to estimate incremental acquisition (<a class="af nv" href="https://arxiv.org/pdf/2106.15346" rel="noopener ugc nofollow" target="_blank">technical paper</a>). This approach uses simple assumptions to estimate a counterfactual for the rate at which new members start playing the game.</p><p id="5128" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Because games differ from series/films, it’s crucial to validate this estimation method for games. Ideally, we would have causal estimates from an A/B test to use for validation, but since that is not available, we use another causal inference design as one of our ensemble of validation approaches.
This causal inference design involves a systematic framework we designed to measure game events that relies on synthetic control (<a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/round-2-a-survey-of-causal-inference-applications-at-netflix-fd78328ee0bb">blog post</a>).</p><p id="19cb" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As we mentioned above, we have been launching User Acquisition (UA) campaigns in select countries to boost game engagement and new memberships. We can use this cross-country variation to form a synthetic control and measure the incremental signups due to the UA campaign. The incremental signups from UA campaigns differ from those attributed to a game, but they should be similar. When our estimated incremental acquisition numbers over a campaign period are similar to the incremental acquisition numbers calculated using synthetic control, we feel more confident in our approach to measuring incremental signups for games.</p><h1 id="1015" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Netflix Games Players’ Adventure: Modeled using State Machine</h1><p id="828b" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">At Netflix Games, we aim to have a high number of members engaging with games each month, referred to as Monthly Active Accounts (MAA). 
To evaluate our progress toward this objective and to find areas to boost our MAA, we modeled the Netflix players’ journey as a state machine.</p><p id="a279" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We track a daily state machine showing the probability of account transitions between states.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa pb"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*j2wKL4S3ywEs9mpf%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*j2wKL4S3ywEs9mpf%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*j2wKL4S3ywEs9mpf%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*j2wKL4S3ywEs9mpf%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*j2wKL4S3ywEs9mpf%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*j2wKL4S3ywEs9mpf%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*j2wKL4S3ywEs9mpf%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*j2wKL4S3ywEs9mpf 640w, https://miro.medium.com/v2/resize:fit:720/0*j2wKL4S3ywEs9mpf 720w, https://miro.medium.com/v2/resize:fit:750/0*j2wKL4S3ywEs9mpf 750w, https://miro.medium.com/v2/resize:fit:786/0*j2wKL4S3ywEs9mpf 786w, https://miro.medium.com/v2/resize:fit:828/0*j2wKL4S3ywEs9mpf 828w, 
https://miro.medium.com/v2/resize:fit:1100/0*j2wKL4S3ywEs9mpf 1100w, https://miro.medium.com/v2/resize:fit:1400/0*j2wKL4S3ywEs9mpf 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pm c" width="700" height="580" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="5504" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Fig: Netflix Players’ Journey as State machine</p><p id="82b9" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Modeling the players’ journey as a state machine allows us to simulate future states and assess progress toward engagement goals. 
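</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A sketch of this machinery with NumPy (the states, transition rates, and account counts are hypothetical):</p>

```python
import numpy as np

# Rows/columns: hypothetical states -- new, active, dormant, lapsed.
# T[i, j] = P(account in state j today is in state i tomorrow); columns sum to 1.
T = np.array([
    [0.00, 0.00, 0.00, 0.00],   # "new" is only entered from outside
    [0.30, 0.85, 0.05, 0.02],   # become / stay active
    [0.50, 0.10, 0.80, 0.08],   # go dormant
    [0.20, 0.05, 0.15, 0.90],   # stop playing
])
state = np.array([1000.0, 50000.0, 30000.0, 20000.0])  # accounts per state today

next_day = T @ state              # the basic one-day update

# Constant trends: repeat the update to project a quarter ahead.
projection = state.copy()
for _ in range(90):
    projection = T @ projection

# Steady state (excluding new players): leading eigenvector of the
# sub-matrix over existing-player states, normalized to shares.
sub = T[1:, 1:]
vals, vecs = np.linalg.eig(sub)
v = np.real(vecs[:, np.argmax(np.real(vals))])
steady_share = v / v.sum()
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">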
The most basic operation involves multiplying the daily state-transition matrix with the current state values to determine the next day’s state values.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa pn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*ud5xnQi9QM6ELiVP%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*ud5xnQi9QM6ELiVP%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*ud5xnQi9QM6ELiVP%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*ud5xnQi9QM6ELiVP%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*ud5xnQi9QM6ELiVP%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*ud5xnQi9QM6ELiVP%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*ud5xnQi9QM6ELiVP%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*ud5xnQi9QM6ELiVP 640w, https://miro.medium.com/v2/resize:fit:720/0*ud5xnQi9QM6ELiVP 720w, https://miro.medium.com/v2/resize:fit:750/0*ud5xnQi9QM6ELiVP 750w, https://miro.medium.com/v2/resize:fit:786/0*ud5xnQi9QM6ELiVP 786w, https://miro.medium.com/v2/resize:fit:828/0*ud5xnQi9QM6ELiVP 828w, https://miro.medium.com/v2/resize:fit:1100/0*ud5xnQi9QM6ELiVP 1100w, https://miro.medium.com/v2/resize:fit:1400/0*ud5xnQi9QM6ELiVP 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 
4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pm c" width="700" height="73" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="0d9c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This basic operation allows us to explore various scenarios:</p><ul class=""><li id="03b5" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt po pp pq bk">Constant Trends: If transition rates stay constant, we can predict future states by repeatedly multiplying the daily state-transition matrix to new state values, helping us assess progress towards annual goals under unchanged conditions.</li><li id="fd4f" class="mw mx gu my b mz pr nb nc nd ps nf ng nh pt nj nk nl pu nn no np pv nr ns nt po pp pq bk">Dynamic Scenarios: By modifying transition rates, we can simulate complex scenarios. 
For instance, mimicking past changes in transition rates from a game launch allows us to predict the impact of similar future launches by altering the transition rate for a specific period.</li><li id="80bf" class="mw mx gu my b mz pr nb nc nd ps nf ng nh pt nj nk nl pu nn no np pv nr ns nt po pp pq bk">Steady State: We can calculate the steady state of the state-transition matrix (excluding new players) to estimate the MAA once all accounts have tried Netflix games and understand long-term retention and reactivation effects.</li></ul><p id="bf61" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Beyond predicting future states, we use the state machine for sensitivity analysis to find which transition rates most impact MAA. By making small changes to each transition rate we calculate the resulting MAA and measure its impact. This guides us in prioritizing efforts on top-of-funnel improvements, member retention, or reactivation.</p><h1 id="166d" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Content Cash Modeling</h1><p id="3391" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk"><a class="af nv" href="https://www.linkedin.com/in/alexandra-diamond-b04902219/" rel="noopener ugc nofollow" target="_blank">Alex Diamond</a></p><p id="e03d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix we produce a variety of entertainment: movies, series, documentaries, stand-up specials, and more. Each format has a different production process and different patterns of cash spend, called our “Content Forecast”. Looking into the future, Netflix keeps a plan of how many titles we intend to produce, what kinds, and when. 
Because we don’t yet know what specific titles that content will eventually become, these generic placeholders are called “TBD Slots.” A sizable portion of our Content Forecast is represented by TBD Slots.</p><p id="4188" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Almost all businesses have a cash forecasting process informing how much cash they need in a given time period to continue executing on their plans. As plans change, the cash forecast will change. Netflix has a cash forecast that projects our cash needs to produce the titles we plan to make. This presents the question: how can we optimally forecast cash needs for TBD Slots, given we don’t have details on what real titles they will become?</p><p id="54e5" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The large majority of our titles are funded throughout the production process — starting from when we begin developing the title to shooting the actual shows and movies to launch on our Netflix service.</p><p id="b01b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Since cash spend is driven by what is happening on a production, we model it by breaking down into these three steps:</p><ol class=""><li id="8b6a" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pw pp pq bk">Determine estimated production phase durations using historical actuals</li><li id="a35d" class="mw mx gu my b mz pr nb nc nd ps nf ng nh pt nj nk nl pu nn no np pv nr ns nt pw pp pq bk">Determine estimated percent of cash spent in each production phase</li><li id="1606" class="mw mx gu my b mz pr nb nc nd ps nf ng nh pt nj nk nl pu nn no np pv nr ns nt pw pp pq bk">Model the shape of cash spend within each phase</li></ol><p id="565e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni 
nj nk nl nm nn no np nq nr ns nt gn bk">Putting these three pieces together allows us to generate a generic estimate of cash spend per day leading up to and beyond a title’s launch date (a proxy for “completion”). We could distribute this spend linearly across each phase, but modeling the shape instead lets us capture spending patterns that ramp up slowly, or that are concentrated at the start and taper off.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa px"><picture><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*B6Abl5okW1BRfvrc" width="700" height="292" /></picture></div></div></figure><p id="5046" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Before starting any math, we need to ensure a high-quality historical dataset; data quality plays a huge role in this work. For example, if we see 80% of a title’s cash spent before production even started, it is likely that either the production dates (which are manually captured) are incorrect, or that the title had a unique spending pattern we don’t want to assume future titles will follow.</p><p id="6932" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For the first two steps, estimating phase durations and the percent of cash spent in each phase, we’ve found that simple math works best for interpretability and consistency.
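</p><p class="pw-post-body-paragraph">As an illustration, the “simple math” for the first two steps might be a recency-weighted average over clean historical titles. A toy sketch in plain Python; the numbers and the recency weighting are invented, not Netflix’s actual assumptions:</p>

```python
# Toy sketch (not Netflix's actual pipeline): estimate one phase's duration and
# its share of total cash with a weighted average over "clean" historical
# titles. The numbers and the recency weighting are invented.
phase_durations_days = [120, 95, 150, 110]   # one value per historical title
phase_cash_share = [0.42, 0.38, 0.47, 0.40]  # fraction of total cash in this phase
weights = [1, 2, 3, 4]                       # assumption: newer titles count more

def weighted_avg(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

est_duration = weighted_avg(phase_durations_days, weights)  # estimated days
est_cash_share = weighted_avg(phase_cash_share, weights)    # estimated fraction
```

<p class="pw-post-body-paragraph">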
We use a weighted average across our “clean” historical actuals to produce these estimated assumptions.</p><p id="05a0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To model the shape of spend throughout each phase, we perform constrained optimization to fit a third-degree polynomial. The constraints include:</p><ol class=""><li id="01b8" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pw pp pq bk">Must pass through the points (0,0) and (1,1). This ensures that at 0% through the phase, 0% of that phase’s cash has been spent, and at 100% through the phase, 100% of that phase’s cash has been spent.</li><li id="3032" class="mw mx gu my b mz pr nb nc nd ps nf ng nh pt nj nk nl pu nn no np pv nr ns nt pw pp pq bk">The derivative must be non-negative. This ensures that the function is monotonically non-decreasing, so we never counterintuitively forecast negative spend.</li></ol><p id="5632" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The optimization’s objective function minimizes the sum of squared residuals and returns the coefficients of the polynomial that will guide the shape of cash spend through each phase.</p><p id="d6c2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Once we have these coefficients, we can evaluate the polynomial at each day of the expected phase duration, and then multiply the result by the expected cash per phase.
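</p><p class="pw-post-body-paragraph">The constrained fit can be sketched in plain Python (a brute-force stand-in for a real optimizer, with invented sample points). Writing the cubic as f(t) = a*t^3 + b*t^2 + c*t bakes in f(0) = 0, and choosing c = 1 - a - b enforces f(1) = 1, so only the monotonicity constraint needs explicit checking:</p>

```python
# Sketch (not Netflix's actual code): fit a monotone cubic spend curve by
# brute-force search. f(t) = a*t**3 + b*t**2 + c*t passes through (0, 0) by
# construction; c = 1 - a - b forces it through (1, 1). Sample points invented.
xs = [0.1, 0.25, 0.5, 0.75, 0.9]     # fraction of the phase elapsed
ys = [0.03, 0.12, 0.45, 0.80, 0.95]  # cumulative fraction of cash spent

grid = [i / 100 for i in range(101)]  # t = 0.00, 0.01, ..., 1.00

def coeffs(a, b):
    return a, b, 1.0 - a - b  # c chosen so that f(1) = 1

def monotone(a, b, c):
    # derivative 3a*t^2 + 2b*t + c must stay non-negative on [0, 1]
    return all(3 * a * t * t + 2 * b * t + c >= -1e-9 for t in grid)

def sse(a, b, c):
    # sum of squared residuals against the historical points
    return sum((a * x**3 + b * x**2 + c * x - y) ** 2 for x, y in zip(xs, ys))

candidates = [i / 10 - 3.0 for i in range(61)]  # a, b swept over -3.0 .. 3.0
best = min(
    (coeffs(a, b) for a in candidates for b in candidates
     if monotone(*coeffs(a, b))),
    key=lambda abc: sse(*abc),
)
a, b, c = best
curve = [a * t**3 + b * t**2 + c * t for t in grid]  # cumulative spend shape
```

<p class="pw-post-body-paragraph">Evaluating this curve at each day of the estimated phase duration and scaling by the phase’s expected cash gives the per-day spend profile.</p><p class="pw-post-body-paragraph">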
With some additional data processing, this yields an expected percent of cash spend for each day leading up to and beyond the launch date, on which we can base our forecasts.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa px"><picture><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*ki-M_57G284X4IKo" width="700" height="280" /></picture></div></div></figure><h1 id="5ad2" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Assistive Speech Recognition in Dubbing Workflows at Netflix</h1><p id="3a21" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk"><a class="af nv" href="https://www.linkedin.com/in/tanguycornuau/" rel="noopener ugc nofollow" target="_blank">Tanguy Cornuau</a></p><p id="40e1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Great stories can come from anywhere and be loved everywhere. At Netflix, we strive to make our titles accessible to a global audience, transcending language barriers to connect with viewers worldwide.
One of the key ways we achieve this is through creating dubs in many languages.</p><p id="1dce" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">From the transcription of the original titles all the way to the delivery of the dub audio, we blend innovation with human expertise to preserve the original creative intent.</p><p id="65c6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Leveraging technologies like Assistive Speech Recognition (ASR), we seek to make the <em class="nu">transcription</em> part of the process more efficient for our linguists. Transcription, in our context, involves creating a verbatim script of the spoken dialogue, along with precise timing information to align the text with the original video. With ASR, instead of starting the transcription from scratch, linguists get a pre-generated starting point which they can use and edit for complete accuracy.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa py"><picture><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*tdYvT28jMf3Z7QI_" width="700" height="265" /></picture></div></div></figure><p id="c129" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This efficiency enables linguists to focus more on other creative tasks, such as adding cultural annotations and references, which are crucial for downstream dubbing.</p><p id="a59a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As with any new or enhanced technology we introduce, rigorous analytics and measurement are essential to ASR’s success. To effectively evaluate our ASR system, we’ve established a multi-layered measurement framework that provides comprehensive insight into its performance across many dimensions (for example, the accuracy of text and timing predictions), both offline and online.</p><p id="5e57" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">ASR is expected to perform differently across languages; therefore, at a high level, we track metrics by the show’s original language, allowing us to assess overall ASR effectiveness and identify trends across linguistic contexts. We further break down performance by dimensions such as content type and genre to help us pinpoint specific areas where the ASR system may encounter difficulties. Our framework also allows us to conduct in-depth analyses of individual titles’ transcriptions, focusing on critical quality dimensions around the text and timing accuracy of ASR suggestions. By zooming in on where the system falls short, we gain valuable insight into specific challenges, further refining our understanding of ASR performance.</p><p id="b2bc" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Together, these measurement layers empower us to continuously monitor performance, identify improvement areas, and implement targeted enhancements, ensuring that our ASR technology becomes steadily more accurate, effective, and helpful to linguists across diverse content types and languages.
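</p><p class="pw-post-body-paragraph">One standard text-accuracy measure for ASR output (an assumed illustration here, not necessarily the metric Netflix uses) is word error rate (WER): the word-level edit distance between the ASR draft and the linguist’s corrected transcript, normalized by transcript length. A minimal sketch:</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / max(len(ref), 1)
```

<p class="pw-post-body-paragraph">Averaging such a score over titles, grouped by original language or content type, would yield the kind of high-level tracking described here.</p><p class="pw-post-body-paragraph">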
By refining our dubbing workflows through these innovations, we aim to keep improving the quality of our dubs to help great stories travel across the globe and bring joy to our members.</p></div></div></div><div class="ab cb pz qa qb qc" role="separator"><div class="gn go gp gq gr"><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="f9f3" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Analytics Engineering is a key contributor to building our deep data culture at Netflix, and we are proud to have a large group of stunning colleagues who are not only applying but advancing our analytical capabilities. The 2024 Analytics Summit continued to be a wonderful way to give one another visibility into work across business verticals, celebrate our collective impact, and highlight what’s to come in analytics practice at Netflix.</p><p id="d18b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To learn more, follow the <a class="af nv" href="https://research.netflix.com/research-area/analytics" rel="noopener ugc nofollow" target="_blank">Netflix Research Site</a>, and if you are also interested in entertaining the world, have a look at <a class="af nv" href="https://explore.jobs.netflix.net/careers" rel="noopener ugc nofollow" target="_blank">our open roles</a>!</p></div></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/part-2-a-survey-of-analytics-engineering-work-at-netflix-4f1f53b4ab0f</link>
      <guid>https://netflixtechblog.com/part-2-a-survey-of-analytics-engineering-work-at-netflix-4f1f53b4ab0f</guid>
      <pubDate>Thu, 02 Jan 2025 22:07:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Configurable Metaflow]]></title>
      <description><![CDATA[<div><div></div><p id="156c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><a class="af nu" href="https://www.linkedin.com/in/david-j-berg/" rel="noopener ugc nofollow" target="_blank"><em class="nv">David J. Berg</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/david-casler-05a5278/" rel="noopener ugc nofollow" target="_blank"><em class="nv">David Casler</em></a>^, <a class="af nu" href="https://www.linkedin.com/in/romain-cledat-4a211a5/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Romain Cledat</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/qian-huang-emma/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Qian Huang</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/rui-lin-483a83111/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Rui Lin</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/nissanpow/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Nissan Pow</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/nurcansonmez/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Nurcan Sonmez</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/shashanksrikanth/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Shashank Srikanth</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/chaoying-wang/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Chaoying Wang</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/reginalw/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Regina Wang</em></a>*<em class="nv">, </em><a class="af nu" href="https://www.linkedin.com/in/zitingyu/" rel="noopener ugc nofollow" target="_blank"><em class="nv">Darin 
Yu</em></a>*<br />*: Model Development Team, Machine Learning Platform<br />^: Content Demand Modeling Team</p><p id="8a5d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A month ago at QConSF, we showcased how <a class="af nu" href="https://qconsf.com/presentation/nov2024/supporting-diverse-ml-systems-netflix" rel="noopener ugc nofollow" target="_blank">Netflix utilizes Metaflow to power a diverse set of ML and AI use cases</a>, managing thousands of unique Metaflow flows. This followed a previous <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d">blog</a> on the same topic. Many of these projects are under constant development by dedicated teams with their own business goals and development best practices, such as the system that <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f">supports our content decision makers</a>, or the system that ranks which language subtitles are most valuable for a specific piece of content.</p><p id="8a83" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As a central ML and AI platform team, our role is to empower our partner teams with tools that maximize their productivity and effectiveness, while adapting to their specific needs (not the other way around). 
This has been a guiding design principle with <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9">Metaflow since its inception</a>.</p><figure class="nz oa ob oc od oe nw nx paragraph-image"><div role="button" tabindex="0" class="of og fj oh bh oi"><div class="nw nx ny"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*XrOVl25ZLx8_4nHLRxNgDg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*XrOVl25ZLx8_4nHLRxNgDg.png" /><img alt="" class="bh md oj c" width="700" height="413" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="ok ff ol nw nx om on bf b bg z du">Metaflow infrastructure stack</figcaption></figure><p id="46ac" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Standing on the shoulders of our extensive cloud infrastructure, Metaflow facilitates easy access to data, compute, and <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78">production-grade workflow orchestration</a>, as well as built-in best practices for common concerns such as <a class="af nu" href="https://docs.metaflow.org/scaling/tagging" rel="noopener ugc nofollow" target="_blank">collaboration</a>, <a class="af nu" href="https://docs.metaflow.org/metaflow/basics#artifacts" rel="noopener ugc nofollow" target="_blank">versioning</a>, <a class="af nu" href="https://docs.metaflow.org/scaling/dependencies" rel="noopener ugc nofollow" target="_blank">dependency management</a>, and <a class="af nu" href="https://outerbounds.com/blog/metaflow-dynamic-cards" rel="noopener ugc nofollow" target="_blank">observability</a>, which teams use to set up ML/AI experiments
and systems that work for them. As a result, Metaflow users at Netflix have been able to run millions of experiments over the past few years without wasting time on low-level concerns.</p><h1 id="15f1" class="oo op gu bf oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk pl bk">A long-standing FAQ: configurable flows</h1><p id="8abd" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">While Metaflow aims to be un-opinionated about some of the upper levels of the stack, some teams within Netflix have developed their own opinionated tooling. As part of Metaflow’s adaptation to their specific needs, we constantly try to understand what has been developed and, more importantly, what gaps these solutions are filling.</p><p id="bd7e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In some cases, we determine that the gap being addressed is very team-specific, or too opinionated at too high a level in the stack, and we therefore decide not to develop it within Metaflow. In other cases, however, we realize that we can develop an underlying construct that helps fill that gap. Even then, we do not always aim to fill the gap completely; instead we focus on extracting a more general, lower-level concept that can be leveraged not only by that particular user but also by others. One such recurring pattern we noticed at Netflix is the need to deploy sets of closely related flows, often as part of a larger pipeline involving table creations, ETLs, and deployment jobs.
Frequently, practitioners want to <a class="af nu" href="https://docs.metaflow.org/production/coordinating-larger-metaflow-projects" rel="noopener ugc nofollow" target="_blank">experiment with variants</a> of these flows, testing new data, new parameterizations, or new algorithms, while keeping the overall structure of the flow or flows intact.</p><p id="64c6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A natural solution is to make flows configurable using configuration files, so variants can be defined without changing the code. Thus far, there hasn’t been a built-in solution for configuring flows, so teams have built bespoke solutions leveraging Metaflow’s <a class="af nu" href="https://docs.metaflow.org/metaflow/basics#advanced-parameters" rel="noopener ugc nofollow" target="_blank">JSON-typed Parameters</a>, <a class="af nu" href="https://docs.metaflow.org/scaling/data#data-in-local-files" rel="noopener ugc nofollow" target="_blank">IncludeFile</a>, and <a class="af nu" href="https://docs.metaflow.org/production/scheduling-metaflow-flows/scheduling-with-aws-step-functions#deploy-time-parameters" rel="noopener ugc nofollow" target="_blank">deploy-time Parameters</a>, or have deployed their own home-grown solutions (often with great pain).
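</p><p class="pw-post-body-paragraph">Part of the pain is timing: decorator arguments are evaluated when the flow module is imported, before any run, and thus before any Parameter value, exists. A plain-Python sketch of the pitfall (the decorator and names are invented stand-ins, not Metaflow’s implementation):</p>

```python
# Invented stand-in for a Metaflow-style decorator, illustrating why run-time
# Parameters cannot drive decorator arguments: the decorator line is evaluated
# at import time, before any run exists.
def resources(cpu):
    def wrap(fn):
        fn.cpu = cpu  # captured once, when the module is imported
        return fn
    return wrap

RUNTIME_PARAMS = {}  # hypothetically populated when a run starts

@resources(cpu=RUNTIME_PARAMS.get("cpu", 1))  # evaluated now; falls back to 1
def start():
    pass

RUNTIME_PARAMS["cpu"] = 8  # a run-time override arrives...
print(start.cpu)           # ...but the decorator already captured 1
```

<p class="pw-post-body-paragraph">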
However, none of these solutions make it easy to configure all aspects of the flow’s behavior, decorators in particular.</p><figure class="nz oa ob oc od oe nw nx paragraph-image"><div role="button" tabindex="0" class="of og fj oh bh oi"><div class="nw nx pr"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*3f9q7PZgxYX8rRygIOWXyA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*3f9q7PZgxYX8rRygIOWXyA.png" /><img alt="" class="bh md oj c" width="700" height="434" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="ok ff ol nw nx om on bf b bg z du">Requests for a feature like Metaflow Config</figcaption></figure><p id="0e00" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Outside Netflix, we have seen similar frequently asked questions on the <a class="af nu" href="http://chat.metaflow.org" rel="noopener ugc nofollow" target="_blank">Metaflow community Slack</a> as shown in the user quotes above:</p><ul class=""><li id="8010" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt ps pt pu bk">how can I adjust <a class="af nu" href="https://docs.metaflow.org/scaling/remote-tasks/requesting-resources" rel="noopener ugc nofollow" target="_blank">the @resource requirements</a>, such as CPU or memory, without having to hardcode the values in my flows?</li><li id="129c" class="mw mx gu my b mz pv nb nc nd pw nf ng nh px nj nk nl py nn no np pz nr ns nt ps pt pu bk">how to adjust <a class="af nu" href="https://docs.metaflow.org/production/scheduling-metaflow-flows/scheduling-with-argo-workflows#time-based-triggering" rel="noopener ugc nofollow" target="_blank">the triggering @schedule</a> programmatically, so our production and staging deployments can run at different cadences?</li></ul><h1 id="69ac" class="oo op gu 
bf oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk pl bk">New in Metaflow: Configs!</h1><p id="36e8" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">Today, to answer the FAQ, we introduce a new — small but mighty — feature in Metaflow: <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/introduction" rel="noopener ugc nofollow" target="_blank">a Config object</a>. Configs complement the existing Metaflow constructs of artifacts and Parameters, by allowing you to configure all aspects of the flow, decorators in particular, prior to any run starting. At the end of the day, artifacts, Parameters and Configs are all stored as artifacts by Metaflow but they differ in when they are persisted as shown in the diagram below:</p><figure class="nz oa ob oc od oe nw nx paragraph-image"><div role="button" tabindex="0" class="of og fj oh bh oi"><div class="nw nx qa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*L-klklqt1n9LKXG0jh-fTw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*L-klklqt1n9LKXG0jh-fTw.png" /><img alt="" class="bh md oj c" width="700" height="302" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="ok ff ol nw nx om on bf b bg z du">Different data artifacts in Metaflow</figcaption></figure><p id="448b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Said another way:</p><ul class=""><li id="b920" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt ps pt pu bk">An<strong class="my gv"> artifact</strong> is resolved and persisted to the datastore at the end of each task.</li><li id="73c1" class="mw mx gu my b mz pv nb nc nd pw nf ng nh px nj nk nl py nn no np pz nr ns nt ps pt pu bk">A<strong 
class="my gv"> parameter</strong> is resolved and persisted at the start of a run; it can therefore be modified up to that point. One common use case is to use <a class="af nu" href="https://docs.metaflow.org/production/event-triggering" rel="noopener ugc nofollow" target="_blank">triggers</a> to pass values to a run right before executing. Parameters can only be used within your step code.</li><li id="61b0" class="mw mx gu my b mz pv nb nc nd pw nf ng nh px nj nk nl py nn no np pz nr ns nt ps pt pu bk">A<strong class="my gv"> config</strong> is resolved and persisted when the flow is deployed. When using a scheduler such as <a class="af nu" href="https://docs.metaflow.org/production/scheduling-metaflow-flows/scheduling-with-argo-workflows" rel="noopener ugc nofollow" target="_blank">Argo Workflows</a>, deployment happens when create’ing the flow. In the case of a local run, “deployment” happens just prior to the execution of the run — think of “deployment” as gathering all that is needed to run the flow. Unlike parameters, configs can be used more widely in your flow code; in particular, they can be used in step- or flow-level decorators as well as to set defaults for parameters. Configs can of course also be used within your flow.</li></ul><p id="226d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As an example, you can specify a Config that reads a pleasantly human-readable configuration file, formatted as <a class="af nu" href="https://toml.io/en/" rel="noopener ugc nofollow" target="_blank">TOML</a>.
The Config specifies a triggering ‘@schedule’ and ‘@resource’ requirements, as well as application-specific parameters for this specific deployment:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">[schedule]<br />cron = "0 * * * *"<br /><br />[model]<br />optimizer = "adam"<br />learning_rate = 0.5<br /><br />[resources]<br />cpu = 1</pre><p id="be39" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Using the newly released Metaflow 2.13, you can configure a flow with a Config like the above, as demonstrated by this flow:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">import pprint<br />from metaflow import FlowSpec, step, Config, resources, config_expr, schedule<br /><br />@schedule(cron=config_expr("config.schedule.cron"))<br />class ConfigurableFlow(FlowSpec):<br />    config = Config("config", default="myconfig.toml", parser="tomllib.loads")<br /><br />    @resources(cpu=config.resources.cpu)<br />    @step<br />    def start(self):<br />        print("Config loaded:")<br />        pprint.pp(self.config)<br />        self.next(self.end)<br /><br />    @step<br />    def end(self):<br />        pass<br /><br />if __name__ == "__main__":<br />    ConfigurableFlow()</pre><p id="251b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">There is a lot going on in the code above; a few highlights:</p><ul class=""><li id="cc56" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt ps pt pu bk">you can refer to configs <em class="nv">before</em> they have been defined using ‘config_expr’.</li><li id="2cf7" class="mw mx gu my b mz pv nb nc nd pw nf ng nh px nj nk nl py nn no np pz nr ns nt ps pt pu bk">you can define arbitrary <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/parsing-configs" rel="noopener ugc nofollow" target="_blank">parsers</a> — using a string means the parser doesn’t even have to be present remotely!</li></ul><p id="1590"
class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">From the developer’s point of view, Configs behave like dictionary artifacts. For convenience, they support dot-syntax (when possible) for accessing keys, making it easy to reach values in a nested configuration. You can also unpack the whole Config (or a subtree of it) with Python’s standard dictionary unpacking syntax, ‘**config’. The standard dictionary subscript notation is also available.</p><p id="5e1e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Since Configs turn into dictionary artifacts, they get versioned and persisted automatically. You can <a class="af nu" href="https://docs.metaflow.org/metaflow/client" rel="noopener ugc nofollow" target="_blank">access Configs of any past runs easily through the Client API</a>. As a result, your data, models, code, Parameters, Configs, and <a class="af nu" href="https://docs.metaflow.org/scaling/dependencies" rel="noopener ugc nofollow" target="_blank">execution environments</a> are all stored as a consistent bundle — neatly organized in <a class="af nu" href="https://docs.metaflow.org/scaling/tagging" rel="noopener ugc nofollow" target="_blank">Metaflow namespaces</a> — paving the way for reproducible, consistent, low-boilerplate, and now easily configurable experiments and robust production deployments.</p><h1 id="9d64" class="oo op gu bf oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk pl bk">More than a humble config file</h1><p id="ee02" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">While you can get far by accompanying your flow with a simple config file (stored in your favorite format, thanks to <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/parsing-configs" rel="noopener ugc nofollow"
target="_blank">user-definable parsers</a>), Configs unlock a number of advanced use cases. Consider these examples from the updated documentation:</p><ul class=""><li id="626e" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt ps pt pu bk">You can <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/basic-configuration#mixing-configs-and-parameters" rel="noopener ugc nofollow" target="_blank"><strong class="my gv">choose the right level of runtime configurability</strong></a> versus fixed deployments by mixing Parameters and Configs. For instance, you can use a Config to define a default value for a parameter which can be <a class="af nu" href="https://docs.metaflow.org/production/event-triggering/external-events#passing-parameters-in-events" rel="noopener ugc nofollow" target="_blank">overridden by a real-time event</a> as a run is triggered.</li><li id="83e8" class="mw mx gu my b mz pv nb nc nd pw nf ng nh px nj nk nl py nn no np pz nr ns nt ps pt pu bk">You can define a custom parser to <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/parsing-configs#validating-configs-with-pydantic" rel="noopener ugc nofollow" target="_blank"><strong class="my gv">validate the configuration</strong></a>, e.g. 
using the popular <a class="af nu" href="https://docs.pydantic.dev/latest/" rel="noopener ugc nofollow" target="_blank">Pydantic</a> library.</li><li id="19d3" class="mw mx gu my b mz pv nb nc nd pw nf ng nh px nj nk nl py nn no np pz nr ns nt ps pt pu bk">You are not limited to using a single file: you can leverage a configuration manager like <a class="af nu" href="https://omegaconf.readthedocs.io/en/2.3_branch/" rel="noopener ugc nofollow" target="_blank">OmegaConf</a> or <a class="af nu" href="https://hydra.cc/" rel="noopener ugc nofollow" target="_blank">Hydra</a> to <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/parsing-configs#advanced-configurations-with-omegaconf" rel="noopener ugc nofollow" target="_blank"><strong class="my gv">manage a hierarchy of cascading configuration files</strong></a>. You can also use a domain-specific tool for generating Configs, such as Netflix’s <em class="nv">Metaboost</em> which we cover below.</li><li id="ca32" class="mw mx gu my b mz pv nb nc nd pw nf ng nh px nj nk nl py nn no np pz nr ns nt ps pt pu bk">You can also <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/custom-parsers#generating-configs-programmatically" rel="noopener ugc nofollow" target="_blank"><strong class="my gv">generate configurations on the fly</strong></a>, e.g. 
fetch Configs from an external service, or inspect the execution environment, such as the current GIT branch, and include it as an extra piece of context in runs.</li></ul><p id="ac3f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A major benefit of Configs over previous, more ad hoc solutions for configuring flows is that they work seamlessly with other features of Metaflow: you can run steps remotely and deploy flows to production, even when relying on custom parsers, without having to worry about packaging Configs or parsers manually or keeping Configs consistent across tasks. Configs also work with the <a class="af nu" href="https://docs.metaflow.org/metaflow/managing-flows/runner" rel="noopener ugc nofollow" target="_blank">Runner</a> and <a class="af nu" href="https://docs.metaflow.org/metaflow/managing-flows/deployer" rel="noopener ugc nofollow" target="_blank">Deployer</a>.</p><h1 id="8ae8" class="oo op gu bf oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk pl bk">The Hollywood principle: don’t call us, we’ll call you</h1><p id="84f1" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">When used in conjunction with a configuration manager like <a class="af nu" href="https://hydra.cc" rel="noopener ugc nofollow" target="_blank">Hydra</a>, Configs enable a pattern that is highly relevant for ML and AI use cases: orchestrating experiments over multiple configurations or sweeping over parameter spaces. While Metaflow has always supported <a class="af nu" href="https://docs.outerbounds.com/grid-search-with-metaflow/" rel="noopener ugc nofollow" target="_blank">sweeping over parameter grids</a> easily using foreaches, it hasn’t been easy to alter the flow itself, e.g. 
to change <a class="af nu" href="https://docs.metaflow.org/api/step-decorators/resources" rel="noopener ugc nofollow" target="_blank">@resources</a> or <a class="af nu" href="https://docs.metaflow.org/api/step-decorators/conda" rel="noopener ugc nofollow" target="_blank">@pypi/@conda</a> dependencies for every experiment.</p><p id="f7ef" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In a typical case, you trigger a Metaflow flow that consumes a configuration file, changing <em class="nv">how</em> a run behaves. With Hydra, you can <a class="af nu" href="https://en.wikipedia.org/wiki/Inversion_of_control" rel="noopener ugc nofollow" target="_blank">invert the control</a>: it is Hydra that decides <em class="nv">what</em> gets run based on a configuration file. Thanks to Metaflow’s new <a class="af nu" href="https://docs.metaflow.org/metaflow/managing-flows/runner" rel="noopener ugc nofollow" target="_blank">Runner</a> and <a class="af nu" href="https://docs.metaflow.org/metaflow/managing-flows/deployer" rel="noopener ugc nofollow" target="_blank">Deployer</a> APIs, you can create a Hydra app that operates Metaflow programmatically — for instance, to deploy and execute hundreds of variants of a flow in a large-scale experiment.</p><p id="1fda" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/config-driven-experimentation" rel="noopener ugc nofollow" target="_blank">Take a look at two interesting examples of this pattern</a> in the documentation. 
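</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To make the shape of such a sweep concrete, here is a minimal, self-contained sketch that expands a parameter space into one configuration per run; the sweep values and the flow file name are hypothetical, and the Runner invocation is shown only as a comment since it requires a flow file and a Metaflow installation:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">
```python
import itertools

# Hypothetical sweep space, echoing the benchmark described in the post:
# each combination becomes one flow run driven by its own configuration.
sweep = {"cpu": [2, 4, 8], "tensor_size": [1024, 4096]}

def config_grid(space):
    """Expand a dict of lists into one flat config dict per combination."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(config_grid(sweep))
print(len(configs))  # 3 cpu settings x 2 tensor sizes = 6 variants

# Each config could then drive one programmatic execution, e.g. via
# Metaflow's Runner API (sketch only; "benchmark_flow.py" is hypothetical):
#   from metaflow import Runner
#   for cfg in configs:
#       Runner("benchmark_flow.py").run(...)
```
</pre><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">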
As a teaser, this video shows Hydra orchestrating deployment of tens of Metaflow flows, each of which benchmarks PyTorch using a varying number of CPU cores and tensor sizes, updating a visualization of the results in real-time as the experiment progresses:</p><figure class="nz oa ob oc od oe"><div class="qk jf l fj"><figcaption class="ok ff ol nw nx om on bf b bg z du">Example using Hydra with Metaflow</figcaption></div></figure><h1 id="88a6" class="oo op gu bf oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk pl bk">Metaboosting Metaflow — based on a true story</h1><p id="4282" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">To give a motivating example of what configurations look like at Netflix in practice, let’s consider <em class="nv">Metaboost</em>, an internal Netflix CLI tool that helps ML practitioners manage, develop and execute their cross-platform projects, somewhat similar to the open-source Hydra discussed above but with specific integrations to the Netflix ecosystem. Metaboost is an example of an opinionated framework developed by a team already using Metaflow. In fact, a part of the inspiration for introducing Configs in Metaflow came from this very use case.</p><p id="9322" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Metaboost serves as a single interface to three different internal platforms at Netflix that manage ETL/Workflows (<a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78"><em class="nv">Maestro</em></a>), Machine Learning Pipelines (<a class="af nu" href="https://docs.metaflow.org" rel="noopener ugc nofollow" target="_blank"><em class="nv">Metaflow</em></a>) and Data Warehouse Tables (<em class="nv">Kragle</em>). 
In this context, having a single configuration system to manage an ML project holistically gives users increased project coherence and decreased project risk.</p><h2 id="fbf6" class="qn op gu bf oq qo qp dy ou qq qr ea oy nh qs qt qu nl qv qw qx np qy qz ra rb bk">Configuration in Metaboost</h2><p id="a6d0" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">Ease of configuration and templatizing are core values of Metaboost. Templatizing in Metaboost is achieved through the concept of <em class="nv">bindings</em>, wherein we can <em class="nv">bind</em> a Metaflow pipeline to an arbitrary label, and then create a corresponding bespoke configuration for that label. The binding-connected configuration is then merged into a global set of configurations containing such information as GIT repository, branch, etc. Binding a Metaflow also signals to Metaboost that it should instantiate the Metaflow flow once per binding into our orchestration cluster.</p><p id="747b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Imagine an ML practitioner on the Netflix Content ML team, sourcing features from hundreds of columns in our data warehouse, and creating a multitude of models against a <em class="nv">growing</em> suite of metrics. When a brand-new content metric comes along, with Metaboost the first version of the metric’s predictive model can be created by simply swapping the target column against which the model is trained.</p><p id="dd0d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Subsequent versions of the model will result from experimenting with hyperparameters, tweaking feature engineering, or conducting feature diets. 
Metaboost’s bindings, and their integration with Metaflow Configs, can be leveraged to scale the number of experiments as fast as a scientist can create experiment based configurations.</p><h2 id="edb4" class="qn op gu bf oq qo qp dy ou qq qr ea oy nh qs qt qu nl qv qw qx np qy qz ra rb bk">Scaling experiments with Metaboost bindings — backed by Metaflow Config</h2><p id="9ad4" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">Consider a Metaboost ML project named `demo` that creates and loads data to custom tables (ETL managed by Maestro), and then trains a simple model on this data (ML Pipeline managed by Metaflow). The project structure of this repository might look like the following:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">├── metaflows<br />│   ├── custom                               -&gt; custom python code, used by<br />|   |   |                                       Metaflow<br />│   │   ├── data.py<br />│   │   └── model.py<br />│   └── training.py                          -&gt; defines our Metaflow pipeline<br />├── schemas<br />│   ├── demo_features_f.tbl.yaml             -&gt; table DDL, stores our ETL<br />|   |                                           output, Metaflow input<br />│   └── demo_predictions_f.tbl.yaml          -&gt; table DDL,<br />|                                               stores our Metaflow output<br />├── settings<br />│   ├── settings.configuration.EXP_01.yaml   -&gt; defines the additive<br />|   |                                           config for Experiment 1<br />│   ├── settings.configuration.EXP_02.yaml   -&gt; defines the additive<br />|   |                                           config for Experiment 2<br />│   ├── settings.configuration.yaml          -&gt; defines our global<br />|   |                                           configuration<br />│   └── settings.environment.yaml            -&gt; defines parameters based on<br />|         
                                      git branch (e.g. READ_DB)<br />├── tests<br />├── workflows<br />│   ├── sql<br />│   ├── demo.demo_features_f.sch.yaml        -&gt; Maestro workflow, defines ETL<br />│   └── demo.main.sch.yaml                   -&gt; Maestro workflow, orchestrates<br />|                                               ETLs and Metaflow<br />└── metaboost.yaml                           -&gt; defines our project for<br />                                                Metaboost</pre><p id="f679" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The configuration files in the settings directory above contain the following YAML files:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk"># settings.configuration.yaml (global configuration)<br />model:<br />  fit_intercept: True<br />conda:<br />  numpy: '1.22.4'<br />  "scikit-learn": '1.4.0'</pre><pre class="rc qb qc qd bp qe bb bk"># settings.configuration.EXP_01.yaml<br />target_column: metricA<br />features:<br />  - runtime<br />  - content_type<br />  - top_billed_talent</pre><pre class="rc qb qc qd bp qe bb bk"># settings.configuration.EXP_02.yaml<br />target_column: metricA<br />features:<br />  - runtime<br />  - director<br />  - box_office</pre><p id="069e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Metaboost will merge each experiment configuration (<em class="nv">*.EXP*.yaml</em>) into the global configuration (settings.configuration.yaml) <em class="nv">individually</em> at Metaboost command initialization. 
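</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The additive merge behavior can be sketched in a few lines of Python (a toy re-implementation for illustration only, not Metaboost’s actual code), layering the experiment keys on top of the global configuration and recursing into nested mappings:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">
```python
# Toy sketch of an "additive" config merge (NOT Metaboost's actual code):
# experiment keys are layered on top of the global configuration,
# recursing into nested mappings instead of replacing them wholesale.
def merge(base, overlay):
    out = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# Mirrors settings.configuration.yaml and settings.configuration.EXP_01.yaml
global_cfg = {
    "model": {"fit_intercept": True},
    "conda": {"numpy": "1.22.4", "scikit-learn": "1.4.0"},
}
exp_01 = {
    "target_column": "metricA",
    "features": ["runtime", "content_type", "top_billed_talent"],
}

merged = merge(global_cfg, exp_01)
print(merged["target_column"])  # metricA
print(merged["model"]["fit_intercept"])  # True
```
</pre><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">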
Let’s take a look at how Metaboost combines these configurations with a Metaboost command:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">(venv-demo) ~/projects/metaboost-demo [branch=demoX] <br />$ metaboost metaflow settings show --yaml-path=configuration<br /><br />binding=EXP_01:<br />model:                     -&gt; defined in settings.configuration.yaml (global)<br />  fit_intercept: true<br />conda:                     -&gt; defined in settings.configuration.yaml (global)<br />  numpy: 1.22.4<br />  "scikit-learn": 1.4.0<br />target_column: metricA     -&gt; defined in settings.configuration.EXP_01.yaml<br />features:                  -&gt; defined in settings.configuration.EXP_01.yaml<br />- runtime<br />- content_type<br />- top_billed_talent<br /><br />binding=EXP_02:<br />model:                     -&gt; defined in settings.configuration.yaml (global)<br />  fit_intercept: true<br />conda:                     -&gt; defined in settings.configuration.yaml (global)<br />  numpy: 1.22.4<br />  "scikit-learn": 1.4.0<br />target_column: metricA     -&gt; defined in settings.configuration.EXP_02.yaml<br />features:                  -&gt; defined in settings.configuration.EXP_02.yaml<br />- runtime<br />- director<br />- box_office</pre><p id="a78e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Metaboost understands it should deploy/run two independent instances of training.py — one for the EXP_01 binding and one for the EXP_02 binding. You can also see that Metaboost is aware that the tables and ETL workflows are <em class="nv">not bound</em>, and should only be deployed once. 
These details of which artifacts to bind and which to leave unbound are encoded in the project’s top-level metaboost.yaml file.</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">(venv-demo) ~/projects/metaboost-demo [branch=demoX] <br />$ metaboost project list<br /><br />Tables (metaboost table list):<br />schemas/demo_predictions_f.tbl.yaml (binding=default):<br />    table_path=prodhive/demo_db/demo_predictions_f<br />schemas/demo_features_f.tbl.yaml (binding=default):<br />    table_path=prodhive/demo_db/demo_features_f<br /><br />Workflows (metaboost workflow list):<br />workflows/demo.demo_features_f.sch.yaml (binding=default):<br />    cluster=sandbox, workflow.id=demo.branch_demox.demo_features_f<br />workflows/demo.main.sch.yaml (binding=default):<br />    cluster=sandbox, workflow.id=demo.branch_demox.main<br /><br />Metaflows (metaboost metaflow list):<br />metaflows/training.py (binding=EXP_01): -&gt; EXP_01 instance of training.py<br />    cluster=sandbox, workflow.id=demo.branch_demox.EXP_01.training   <br />metaflows/training.py (binding=EXP_02): -&gt; EXP_02 instance of training.py<br />    cluster=sandbox, workflow.id=demo.branch_demox.EXP_02.training</pre><p id="c673" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Below is a simple Metaflow pipeline that fetches data, executes feature engineering, and trains a LinearRegression model. 
The work to integrate Metaboost Settings into a user’s Metaflow pipeline (implemented using Metaflow Configs) is as easy as adding a single mix-in to the FlowSpec definition:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">from metaflow import FlowSpec, Parameter, conda_base, step<br />from custom.data import feature_engineer, get_data<br />from metaflow.metaboost import MetaboostSettings<br /><br />@conda_base(<br />    libraries=MetaboostSettings.get_deploy_time_settings("configuration.conda")<br />)<br />class DemoTraining(FlowSpec, MetaboostSettings):<br />    prediction_date = Parameter("prediction_date", type=int, default=-1)<br /><br />    @step<br />    def start(self):<br />        # get show_settings() for free with the mixin<br />        # and get convenient debugging info<br />        self.show_settings(exclude_patterns=["artifact*", "system*"])<br /><br />        self.next(self.get_features)<br /><br />    @step<br />    def get_features(self):<br />        # feature engineering on our extracted data<br />        self.fe_df = feature_engineer(<br />            # loads data from our ETL pipeline<br />            data=get_data(prediction_date=self.prediction_date),<br />            features=self.settings.configuration.features +<br />                [self.settings.configuration.target_column]<br />        )<br /><br />        self.next(self.train)<br /><br />    @step<br />    def train(self):<br />        from sklearn.linear_model import LinearRegression<br /><br />        # trains our model<br />        self.model = LinearRegression(<br />            fit_intercept=self.settings.configuration.model.fit_intercept<br />        ).fit(<br />            X=self.fe_df[self.settings.configuration.features],<br />            y=self.fe_df[self.settings.configuration.target_column]<br />        )<br />        print(f"Fit slope: {self.model.coef_[0]}")<br />        print(f"Fit intercept: {self.model.intercept_}")<br /><br />        self.next(self.end)<br /><br />    @step<br />    def end(self):<br />        pass<br /><br />if __name__ == "__main__":<br />    DemoTraining()</pre><p id="1fa8" class="pw-post-body-paragraph mw mx 
gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The Metaflow Config is added to the FlowSpec by mixing in the MetaboostSettings class. Referencing a configuration value is as easy as using the dot syntax to drill into whichever parameter you’d like.</p><p id="5872" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Finally, let’s take a look at the output from our sample Metaflow above. We execute experiment EXP_01 with</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">metaboost metaflow run --binding=EXP_01</pre><p id="ea6c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">which upon execution will merge the configurations into a single <em class="nv">settings</em> file (shown previously) and serialize it as a yaml file to the <em class="nv">.metaboost/settings/compiled/</em> directory.</p><p id="0e34" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">You can see the actual command and args that were sub-processed in the <em class="nv">Metaboost Execution</em> section below. Please note the <strong class="my gv">--config</strong> argument pointing to the serialized yaml file, whose contents are subsequently accessible via <strong class="my gv">self.settings</strong>. 
Also note the convenient printing of configuration values to stdout during the start step using a mixed-in function named <strong class="my gv">show_settings()</strong>.</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">(venv-demo) ~/projects/metaboost-demo [branch=demoX] <br />$ metaboost metaflow run --binding=EXP_01<br /><br />Metaboost Execution: <br /> - python3.10 /root/repos/cdm-metaboost-irl/metaflows/training.py<br />   --no-pylint --package-suffixes=.py --environment=conda<br />   --config settings<br />   .metaboost/settings/compiled/settings.branch_demox.EXP_01.training.mP4eIStG.yaml<br />   run --prediction_date 20241006<br /><br />Metaflow 2.12.39+nflxfastdata(2.13.5);nflx(2.13.5);metaboost(0.0.27)<br />  executing DemoTraining for user:dcasler<br />Validating your flow...<br />    The graph looks good!<br />Bootstrapping Conda environment... (this could take a few minutes)<br />All packages already cached in s3.<br />All environments already cached in s3.<br /><br />Workflow starting (run-id 50), see it in the UI at<br />https://metaflowui.prod.netflix.net/DemoTraining/50<br /><br />[50/start/251640833] Task is starting.<br />[50/start/251640833] Configuration Values:<br />[50/start/251640833]   settings.configuration.conda.numpy            = 1.22.4<br />[50/start/251640833]   settings.configuration.features.0             = runtime<br />[50/start/251640833]   settings.configuration.features.1             = content_type<br />[50/start/251640833]   settings.configuration.features.2             = top_billed_talent<br />[50/start/251640833]   settings.configuration.model.fit_intercept    = True<br />[50/start/251640833]   settings.configuration.target_column          = metricA<br />[50/start/251640833]   settings.environment.READ_DATABASE            = data_warehouse_prod<br />[50/start/251640833]   settings.environment.TARGET_DATABASE          = demo_dev<br />[50/start/251640833] Task finished successfully.<br /><br />[50/get_features/251640840] Task is starting.<br />[50/get_features/251640840] Task finished 
successfully.<br /><br />[50/train/251640854] Task is starting.<br />[50/train/251640854] Fit slope: 0.4702672504331096<br />[50/train/251640854] Fit intercept: -6.247919678070083<br />[50/train/251640854] Task finished successfully.<br /><br />[50/end/251640868] Task is starting.<br />[50/end/251640868] Task finished successfully.<br /><br />Done! See the run in the UI at<br />https://metaflowui.prod.netflix.net/DemoTraining/50</pre><h2 id="d6d2" class="qn op gu bf oq qo qp dy ou qq qr ea oy nh qs qt qu nl qv qw qx np qy qz ra rb bk">Takeaways</h2><p id="feb6" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">Metaboost is an integration tool that aims to ease the project development, management, and execution burden of ML projects at Netflix. It employs a configuration system that combines git-based parameters, global configurations, and arbitrarily <em class="nv">bound</em> configuration files for use during execution against internal Netflix platforms.</p><p id="7ec5" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Integrating this configuration system with the new Config in Metaflow is incredibly simple (by design), only requiring users to add a mix-in class to their FlowSpec — <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/custom-parsers#including-default-configs-in-flows" rel="noopener ugc nofollow" target="_blank">similar to this example in Metaflow documentation</a> — and then reference the configuration values in steps or decorators. 
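</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As a rough, pure-Python illustration of why a mix-in keeps flow code clean (a schematic stand-in, not the real MetaboostSettings), a mix-in can load a compiled settings document and expose it through the same dot syntax the flow uses:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">
```python
import json
from types import SimpleNamespace

# Schematic stand-in for a settings mix-in (NOT the real MetaboostSettings):
# it parses a compiled settings document and exposes nested keys with dot
# syntax, mirroring self.settings.configuration.model.fit_intercept above.
COMPILED = '{"configuration": {"model": {"fit_intercept": true}, "target_column": "metricA"}}'

class SettingsMixin:
    @property
    def settings(self):
        # object_hook wraps every JSON object in an attribute-accessible namespace
        return json.loads(COMPILED, object_hook=lambda d: SimpleNamespace(**d))

class DemoFlow(SettingsMixin):  # a real flow would also inherit FlowSpec
    def train(self):
        return self.settings.configuration.model.fit_intercept

print(DemoFlow().train())  # True
```
</pre><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">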
The example above templatizes a training Metaflow for the sake of experimentation, but users could just as easily use bindings/configs to templatize their flows across target metrics, business initiatives or any other arbitrary lines of work.</p><h1 id="d730" class="oo op gu bf oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk pl bk">Try it at home</h1><p id="2f85" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">It couldn’t be easier to get started with Configs! Just</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">pip install -U metaflow</pre><p id="e0ec" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">to get the latest version and <a class="af nu" href="https://docs.metaflow.org/metaflow/configuring-flows/introduction" rel="noopener ugc nofollow" target="_blank">head to the updated documentation</a> for examples. If you are impatient, you can find and execute <a class="af nu" href="https://github.com/outerbounds/config-examples" rel="noopener ugc nofollow" target="_blank">all config-related examples in this repository</a> as well.</p><p id="53e2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">If you have any questions or feedback about Config (or other Metaflow features), you can reach out to us at the <a class="af nu" href="http://chat.metaflow.org" rel="noopener ugc nofollow" target="_blank">Metaflow community Slack</a>.</p><h1 id="24a8" class="oo op gu bf oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk pl bk">Acknowledgments</h1><p id="97f9" class="pw-post-body-paragraph mw mx gu my b mz pm nb nc nd pn nf ng nh po nj nk nl pp nn no np pq nr ns nt gn bk">We would like to thank <a class="af nu" href="https://outerbounds.co" rel="noopener ugc nofollow" target="_blank">Outerbounds</a> for their collaboration on this feature; for 
rigorously testing it and developing a repository of examples to showcase some of the possibilities offered by this feature.</p></div>]]></description>
      <link>https://netflixtechblog.com/introducing-configurable-metaflow-d2fb8e9ba1c6</link>
      <guid>https://netflixtechblog.com/introducing-configurable-metaflow-d2fb8e9ba1c6</guid>
      <pubDate>Fri, 20 Dec 2024 08:11:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Part 1: A Survey of Analytics Engineering Work at Netflix]]></title>
      <description><![CDATA[<div class="gn go gp gq gr"><div class="ab cb"><div class="ci bh fz ga gb gc"><div><div></div><p id="9766" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">This article is the first in a multi-part series sharing a breadth of Analytics Engineering work at Netflix, recently presented as part of our annual internal Analytics Engineering conference. We kick off with a few topics focused on how we’re empowering Netflix to efficiently produce and effectively deliver high quality, actionable analytic insights across the company. Subsequent posts will detail examples of exciting analytic engineering domain applications and aspects of the technical craft.</em></p><p id="1e6c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix, we seek to entertain the world by ensuring our members find the shows and movies that will thrill them. Analytics at Netflix powers everything from understanding what content will excite and bring members back for more to how we should produce and distribute a content slate that maximizes member joy. Analytics Engineers deliver these insights by establishing deep business and product partnerships; translating business challenges into solutions that unblock critical decisions; and designing, building, and maintaining end-to-end analytical systems.</p><p id="9e33" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Each year, we bring the Analytics Engineering community together for an Analytics Summit — a 3-day internal conference to share analytical deliverables across Netflix, discuss analytic practice, and build relationships within the community. 
We covered a broad array of exciting topics and wanted to spotlight a few to give you a taste of what we’re working on across Analytics Engineering at Netflix!</p><h1 id="a458" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">DataJunction: Unifying Experimentation and Analytics</h1><p id="6f96" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk"><a class="af oy" href="https://www.linkedin.com/in/shyiann/" rel="noopener ugc nofollow" target="_blank">Yian Shang</a>, <a class="af oy" href="https://www.linkedin.com/in/anhqle/" rel="noopener ugc nofollow" target="_blank">Anh Le</a></p><p id="2200" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix, like in many organizations, creating and using metrics is often more complex than it should be. Metric definitions are often scattered across various databases, documentation sites, and code repositories, making it difficult for analysts and data scientists to find reliable information quickly. This fragmentation leads to inconsistencies and wastes valuable time as teams end up reinventing metrics or seeking clarification on definitions that should be standardized and readily accessible.</p><p id="53c3" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Enter <a class="af oy" href="https://datajunction.io/" rel="noopener ugc nofollow" target="_blank">DataJunction</a> (DJ). DJ acts as a central store where metric definitions can live and evolve. 
Once a metric owner has registered a metric into DJ, metric consumers throughout the organization can apply that same metric definition to a set of filtered records and aggregate to any dimensional grain.</p><p id="10ac" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As an example, imagine an analyst wanting to create a “Total Streaming Hours” metric. To add this metric to DJ, they need to provide two pieces of information:</p><ul class=""><li id="01ce" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">The fact table that the metric comes from:</li></ul><p id="c29e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">SELECT<br /> account_id, country_iso_code, streaming_hours<br />FROM streaming_fact_table</p><ul class=""><li id="3e6e" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">The metric expression:</li></ul><p id="aef3" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">`SUM(streaming_hours)`</p><p id="8781" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Then metric consumers throughout the organization can call DJ to request either the SQL or the resulting data. 
For example,</p><ul class=""><li id="b9e6" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">total_streaming_hours of each account:</li></ul><p id="ca2b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">dj.sql(metrics=["total_streaming_hours"], dimensions=["account_id"])</p><ul class=""><li id="6782" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">total_streaming_hours of each country:</li></ul><p id="d234" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">dj.sql(metrics=["total_streaming_hours"], dimensions=["country_iso_code"])</p><ul class=""><li id="bda2" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">total_streaming_hours of each account in the US:</li></ul><p id="2dfd" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">dj.sql(metrics=["total_streaming_hours"], dimensions=["account_id"], filters=["country_iso_code = 'US'"])</p><p id="425c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The key here is that DJ can perform the dimensional join on users’ behalf. If country_iso_code doesn’t already exist in the fact table, the metric owner only needs to tell DJ that account_id is the foreign key to a `users_dimension_table` (we call this process “<a class="af oy" href="https://datajunction.io/docs/0.1.0/data-modeling/dimension-links/" rel="noopener ugc nofollow" target="_blank">dimension linking</a>”). 
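</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To make this concrete, here is a toy sketch of the kind of SQL a semantic layer can generate once a dimension link is registered; this is illustrative only, not DataJunction’s actual API or query generation, and the table and column names follow the example above:</p><pre class="nz oa ob oc od qb qc qd bp qe bb bk">
```python
# Toy model of "dimension linking" (NOT DataJunction's real implementation):
# a metric is a fact table plus an aggregation expression; a link records
# which dimension table a foreign key points to.
FACT = "streaming_fact_table"
METRIC = "SUM(streaming_hours)"
LINKS = {"country_iso_code": ("users_dimension_table", "account_id")}

def metric_sql(metric, dimension):
    """Compose grouped SQL, joining a linked dimension table when needed."""
    if dimension in LINKS:
        dim_table, fk = LINKS[dimension]
        return (f"SELECT d.{dimension}, {metric} AS metric_value "
                f"FROM {FACT} f JOIN {dim_table} d ON f.{fk} = d.{fk} "
                f"GROUP BY d.{dimension}")
    return f"SELECT {dimension}, {metric} AS metric_value FROM {FACT} GROUP BY {dimension}"

print(metric_sql(METRIC, "country_iso_code"))
```
</pre><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">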
DJ can then perform the joins to bring in any requested dimensions from `users_dimension_table`.</p><p id="123b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The Netflix Experimentation Platform heavily leverages this feature today by treating cell assignment as just another dimension that it asks DJ to bring in. For example, to compare the average streaming hours in cell A vs cell B, the Experimentation Platform relies on DJ to bring in “cell_assignment” as a user’s dimension (no different from country_iso_code). A metric can therefore be defined once in DJ and be made available across analytics dashboards and experimentation analysis.</p><p id="eb33" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">DJ has a strong pedigree: there are several prior <a class="af oy" href="https://benn.substack.com/p/bi-by-another-name" rel="noopener ugc nofollow" target="_blank">semantic layers</a> in the industry (e.g. <a class="af oy" href="https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70" rel="noopener">Minerva</a> at Airbnb; dbt Transform, Looker, and AtScale as paid solutions). DJ stands out as an <a class="af oy" href="https://github.com/DataJunction/dj" rel="noopener ugc nofollow" target="_blank">open source</a> solution that is actively developed and stress-tested at Netflix. 
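To make the mechanics above concrete, here is a minimal, hypothetical Python sketch of how a semantic layer can expand a metric request into SQL, including a dimension-linked join performed on the caller’s behalf. This is an illustration of the idea only, not DJ’s actual implementation; the function and table names simply mirror the example.

```python
# Illustrative sketch of semantic-layer metric expansion (NOT DJ's real code).
# Metric definitions and dimension links mirror the example above.

METRICS = {
    "total_streaming_hours": {
        "fact_table": "streaming_fact_table",
        "expression": "SUM(streaming_hours)",
    }
}

# Dimension link: account_id in the fact table is a foreign key into
# users_dimension_table, whose columns become available as dimensions.
DIMENSION_LINKS = {
    "streaming_fact_table": {
        "dim_table": "users_dimension_table",
        "join_key": "account_id",
        "dim_columns": {"country_iso_code", "cell_assignment"},
    }
}

def generate_sql(metric, dimensions, filters=()):
    """Expand a metric request into a SQL string, joining the linked
    dimension table only when a requested column lives there."""
    m = METRICS[metric]
    link = DIMENSION_LINKS[m["fact_table"]]

    def qualify(col):
        return ("d." if col in link["dim_columns"] else "f.") + col

    referenced = set(dimensions) | {f.split()[0] for f in filters}
    group_cols = [qualify(d) for d in dimensions]
    sql = (f"SELECT {', '.join(group_cols + [m['expression'] + ' AS ' + metric])}"
           f" FROM {m['fact_table']} f")
    if referenced & link["dim_columns"]:  # dimensional join on users' behalf
        sql += (f" JOIN {link['dim_table']} d"
                f" ON f.{link['join_key']} = d.{link['join_key']}")
    if filters:
        sql += " WHERE " + " AND ".join(
            qualify(f.split()[0]) + f[len(f.split()[0]):] for f in filters)
    if group_cols:
        sql += " GROUP BY " + ", ".join(group_cols)
    return sql
```

With this sketch, requesting the metric by country_iso_code produces a query that joins the dimension table, while grouping by account_id alone reads the fact table directly.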
We’d love to see DJ easing <em class="nu">your</em> metric creation and consumption pain points!</p><h1 id="443f" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">LORE: How we’re democratizing analytics at Netflix</h1><p id="4cc6" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk"><a class="af oy" href="https://www.linkedin.com/in/apurvakansara/" rel="noopener ugc nofollow" target="_blank">Apurva Kansara</a></p><p id="19be" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix, we rely on data and analytics to inform critical business decisions. Over time, this has resulted in large numbers of dashboard products. While such analytics products are tremendously useful, we noticed a few trends:</p><ol class=""><li id="6777" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pc pa pb bk">A large portion of such products has fewer than 5 MAU (monthly active users)</li><li id="abdb" class="mw mx gu my b mz pd nb nc nd pe nf ng nh pf nj nk nl pg nn no np ph nr ns nt pc pa pb bk">We spend a tremendous amount of time building and maintaining business metrics and dimensions</li><li id="6936" class="mw mx gu my b mz pd nb nc nd pe nf ng nh pf nj nk nl pg nn no np ph nr ns nt pc pa pb bk">We see inconsistencies in how a particular metric is calculated, presented, and maintained across the Data &amp; Insights organization.</li><li id="2ab5" class="mw mx gu my b mz pd nb nc nd pe nf ng nh pf nj nk nl pg nn no np ph nr ns nt pc pa pb bk">It is challenging to scale such bespoke solutions to ever-changing and increasingly complex business needs.</li></ol><p id="6960" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Analytics Enablement is a collection of initiatives across Data &amp; Insights, all focused on 
empowering Netflix analytic practitioners to efficiently produce and effectively deliver high-quality, actionable insights.</p><p id="cae8" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Specifically, these initiatives are focused on enabling analytics rather than on the activities that produce analytics (e.g., dashboarding, analysis, research, etc.).</p><figure class="pl pm pn po pp pq pi pj paragraph-image"><div class="pi pj pk"><picture><img src="https://miro.medium.com/v2/resize:fit:1250/format:webp/0*gUgNHuu6yqKdfbgg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*gUgNHuu6yqKdfbgg 640w, https://miro.medium.com/v2/resize:fit:720/0*gUgNHuu6yqKdfbgg 720w, https://miro.medium.com/v2/resize:fit:750/0*gUgNHuu6yqKdfbgg 750w, https://miro.medium.com/v2/resize:fit:786/0*gUgNHuu6yqKdfbgg 786w, https://miro.medium.com/v2/resize:fit:828/0*gUgNHuu6yqKdfbgg 828w, 
https://miro.medium.com/v2/resize:fit:1100/0*gUgNHuu6yqKdfbgg 1100w, https://miro.medium.com/v2/resize:fit:1250/0*gUgNHuu6yqKdfbgg 1250w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 625px" /><img alt="" class="bh md pr c" width="625" height="423" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></figure><p id="87d0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As part of broad analytics enablement across all business domains, we invested in a chatbot to provide real insights to our end users using the power of LLMs. One reason LLMs are well suited for such problems is that they pair the versatility of natural language with the power of data queries, enabling our business users to query data that would otherwise require sophisticated knowledge of underlying data models.</p><p id="9c03" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Besides providing the end user with an instant answer in their preferred data visualization, LORE instantly learns from the user’s feedback. 
This allows us to teach the LLM a context-rich understanding of internal business metrics that were previously locked in custom code for each of the dashboard products.</p><figure class="pl pm pn po pp pq pi pj paragraph-image"><div role="button" tabindex="0" class="pt pu fj pv bh pw"><div class="pi pj ps"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*onXkeBFPL44KYBQB" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*onXkeBFPL44KYBQB 640w, https://miro.medium.com/v2/resize:fit:720/0*onXkeBFPL44KYBQB 720w, https://miro.medium.com/v2/resize:fit:750/0*onXkeBFPL44KYBQB 750w, https://miro.medium.com/v2/resize:fit:786/0*onXkeBFPL44KYBQB 786w, https://miro.medium.com/v2/resize:fit:828/0*onXkeBFPL44KYBQB 828w, https://miro.medium.com/v2/resize:fit:1100/0*onXkeBFPL44KYBQB 1100w, https://miro.medium.com/v2/resize:fit:1400/0*onXkeBFPL44KYBQB 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pr c" width="700" height="191" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="5bc7" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Some of the challenges we run into:</p><ul class=""><li id="7232" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk">Gaining user trust: To gain our end users’ trust, we focused on our model’s explainability. For example, LORE provides human-readable reasoning about how it arrived at an answer, which users can cross-verify. LORE also provides a confidence score to our end users based on its grounding in the domain space.</li><li id="1cdb" class="mw mx gu my b mz pd nb nc nd pe nf ng nh pf nj nk nl pg nn no np ph nr ns nt oz pa pb bk">Training: We made feedback easy to provide with 👍 and 👎, backed by a fully integrated fine-tuning loop that lets end users effectively teach LORE new domains and the questions within them. This allowed us to bootstrap LORE across several domains within Netflix.</li></ul><p id="4d54" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Democratizing analytics can unlock the tremendous potential of data for everyone within the company. 
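The 👍/👎 fine-tuning loop described above can be sketched in a few lines. This is a hypothetical illustration of the pattern, not LORE’s actual design; all field names (question, answer, correction) are invented for the example.

```python
# Hypothetical sketch of a thumbs-up/down feedback loop feeding a
# fine-tuning dataset. Structure is illustrative, not LORE's actual design.

feedback_log = []

def record_feedback(question, generated_answer, thumbs_up, correction=None):
    """Store one piece of end-user feedback on a chatbot answer."""
    feedback_log.append({
        "question": question,
        "answer": generated_answer,
        "label": "good" if thumbs_up else "bad",
        "correction": correction,  # analyst-provided fix for thumbs-down answers
    })

def export_finetuning_examples():
    """Positive answers pass through as-is; negative ones contribute only
    when a human correction exists to learn from."""
    examples = []
    for fb in feedback_log:
        target = fb["answer"] if fb["label"] == "good" else fb["correction"]
        if target:
            examples.append({"prompt": fb["question"], "completion": target})
    return examples
```

The design choice worth noting: a bare 👎 is still recorded (it can drive metrics and triage), but only corrected examples enter the fine-tuning set, so the model never trains on answers known to be wrong.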
With Analytics enablement and LORE, we’ve enabled our business users to truly have a conversation with the data.</p><h1 id="cfe1" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Leveraging Foundational Platform Data to enable Cloud Efficiency Analytics</h1><p id="50f8" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk"><a class="af oy" href="https://www.linkedin.com/in/jhan-104105/?utm_source=share&amp;utm_campaign=share_via&amp;utm_content=profile" rel="noopener ugc nofollow" target="_blank">J Han</a>, <a class="af oy" href="https://www.linkedin.com/in/pallavi-phadnis-75280b20/" rel="noopener ugc nofollow" target="_blank">Pallavi Phadnis</a></p><p id="888c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix, we use Amazon Web Services (AWS) for our cloud infrastructure needs, such as compute, storage, and networking to build and run the streaming platform that we love. Our ecosystem enables engineering teams to run applications and services at scale, utilizing a mix of open-source and proprietary solutions. In order to understand how efficiently we operate in this diverse technological landscape, the Data &amp; Insights organization partners closely with our engineering teams to share key efficiency metrics, empowering internal stakeholders to make informed business decisions.</p><p id="c749" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This is where our team, Platform DSE (Data Science Engineering), comes in to enable our engineering partners to understand what resources they’re using, how effectively they utilize those resources, and the cost associated with their resource usage. 
By creating curated datasets and democratizing access via a custom insights app and various integration points, we enable downstream users to gain granular insights essential for making data-driven, cost-effective decisions for the business.</p><p id="6327" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To address the numerous analytic needs in a scalable way, we’ve developed a two-component solution:</p><ol class=""><li id="3423" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pc pa pb bk">Foundational Platform Data (FPD): This component provides a centralized data layer for all platform data, featuring a consistent data model and standardized data processing methodology. We work with different platform data providers to get <em class="nu">inventory</em>, <em class="nu">ownership</em>, and <em class="nu">usage</em> data for the respective platforms they own.</li><li id="e86e" class="mw mx gu my b mz pd nb nc nd pe nf ng nh pf nj nk nl pg nn no np ph nr ns nt pc pa pb bk">Cloud Efficiency Analytics (CEA): Built on top of FPD, this component offers an analytics data layer that provides time series efficiency metrics across various business use cases. 
Once the foundational data is ready, CEA consumes inventory, ownership, and usage data and applies the appropriate <em class="nu">business logic</em> to produce <em class="nu">cost</em> and <em class="nu">ownership attribution</em> at various granularities.</li></ol><p id="f9d1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As the source of truth for efficiency metrics, our team’s tenets are to provide accurate, reliable, and accessible data, comprehensive documentation to navigate the complexity of the efficiency space, and well-defined Service Level Agreements (SLAs) to set expectations with downstream consumers during delays, outages, or changes.</p><p id="fba8" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Looking ahead, we aim to continue onboarding platforms, striving for nearly complete cost insight coverage. We’re also exploring new use cases, such as tailored reports for platforms, predictive analytics for optimizing usage and detecting anomalies in cost, and a root cause analysis tool using LLMs.</p><p id="b0a0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Ultimately, our goal is to enable our engineering organization to make efficiency-conscious decisions when building and maintaining the myriad of services that allow us to enjoy Netflix as a streaming service. 
For more detail on our modeling approach and principles, check out <a class="af oy" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/cloud-efficiency-at-netflix-f2a142955f83">this post</a>!</p></div></div></div><div class="ab cb px py pz qa" role="separator"><div class="gn go gp gq gr"><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="effc" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Analytics Engineering is a key contributor to building our deep data culture at Netflix, and we are proud to have a large group of stunning colleagues that are not only applying but advancing our analytical capabilities at Netflix. The 2024 Analytics Summit continued to be a wonderful way to give visibility to one another on work across business verticals, celebrate our collective impact, and highlight what’s to come in analytics practice at Netflix.</p><p id="faa3" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To learn more, follow the <a class="af oy" href="https://research.netflix.com/research-area/analytics" rel="noopener ugc nofollow" target="_blank">Netflix Research Site</a>, and if you are also interested in entertaining the world, have a look at <a class="af oy" href="https://explore.jobs.netflix.net/careers" rel="noopener ugc nofollow" target="_blank">our open roles</a>!</p></div></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/part-1-a-survey-of-analytics-engineering-work-at-netflix-d761cfd551ee</link>
      <guid>https://netflixtechblog.com/part-1-a-survey-of-analytics-engineering-work-at-netflix-d761cfd551ee</guid>
      <pubDate>Wed, 18 Dec 2024 00:26:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Cloud Efficiency at Netflix]]></title>
      <description><![CDATA[<div><div></div><p id="c997" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">By</strong> <a class="af nu" href="https://www.linkedin.com/in/jhan-104105?utm_source=share&amp;utm_campaign=share_via&amp;utm_content=profile" rel="noopener ugc nofollow" target="_blank">J Han</a>, <a class="af nu" href="https://www.linkedin.com/in/pallavi-phadnis-75280b20/" rel="noopener ugc nofollow" target="_blank">Pallavi Phadnis</a></p><h1 id="16b4" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk"><strong class="al">Context</strong></h1><p id="f453" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At Netflix, we use Amazon Web Services (AWS) for our cloud infrastructure needs, such as compute, storage, and networking to build and run the streaming platform that we love. Our ecosystem enables engineering teams to run applications and services at scale, utilizing a mix of open-source and proprietary solutions. In turn, our self-serve platforms allow teams to create and deploy, sometimes custom, workloads more efficiently. This diverse technological landscape generates extensive and rich data from various infrastructure entities, from which, data engineers and analysts collaborate to provide actionable insights to the engineering organization in a continuous feedback loop that ultimately enhances the business.</p><p id="1ac0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">One crucial way in which we do this is through the democratization of highly curated data sources that sunshine usage and cost patterns across Netflix’s services and teams. 
The Data &amp; Insights organization partners closely with our engineering teams to share key efficiency metrics, empowering internal stakeholders to make informed business decisions.</p><h1 id="a744" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk"><strong class="al">Data is Key</strong></h1><p id="a1ec" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">This is where our team, Platform DSE (Data Science Engineering), comes in to enable our engineering partners to understand what resources they’re using, how effectively and efficiently they use those resources, and the cost associated with their resource usage. We want our downstream consumers to make cost conscious decisions using our datasets.</p><p id="76e1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To address these numerous analytic needs in a scalable way, we’ve developed a two-component solution:</p><ol class=""><li id="c691" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oy oz pa bk">Foundational Platform Data (FPD): This component provides a centralized data layer for all platform data, featuring a consistent data model and standardized data processing methodology.</li><li id="5b27" class="mw mx gu my b mz pb nb nc nd pc nf ng nh pd nj nk nl pe nn no np pf nr ns nt oy oz pa bk">Cloud Efficiency Analytics (CEA): Built on top of FPD, this component offers an analytics data layer that provides time series efficiency metrics across various business use cases.</li></ol><figure class="pj pk pl pm pn po pg ph paragraph-image"><div role="button" tabindex="0" class="pp pq fj pr bh ps"><div class="pg ph pi"><picture><img 
src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*vDQJiJUttlRSpVBo" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*vDQJiJUttlRSpVBo 640w, https://miro.medium.com/v2/resize:fit:720/0*vDQJiJUttlRSpVBo 720w, https://miro.medium.com/v2/resize:fit:750/0*vDQJiJUttlRSpVBo 750w, https://miro.medium.com/v2/resize:fit:786/0*vDQJiJUttlRSpVBo 786w, https://miro.medium.com/v2/resize:fit:828/0*vDQJiJUttlRSpVBo 828w, https://miro.medium.com/v2/resize:fit:1100/0*vDQJiJUttlRSpVBo 1100w, https://miro.medium.com/v2/resize:fit:1400/0*vDQJiJUttlRSpVBo 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pt c" width="700" height="593" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="7a9d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Foundational Platform Data (FPD)</strong></p><p id="8685" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We work with different platform data providers to get <em class="pu">inventory</em>, <em class="pu">ownership</em>, and <em class="pu">usage</em> data for the respective platforms they own. Below is an example of how this framework applies to the <a class="af nu" href="https://spark.apache.org/" rel="noopener ugc nofollow" target="_blank">Spark</a> platform. FPD establishes<em class="pu"> data contracts</em> with producers to ensure data quality and reliability; these contracts allow the team to leverage a common data model for ownership. 
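The data-contract idea above can be made concrete with a small validation sketch. This is purely illustrative; the required fields per data kind are hypothetical examples, not FPD’s real schema.

```python
# Illustrative "data contract" check for platform data landing in FPD.
# The required fields per data kind are hypothetical examples.

REQUIRED_FIELDS = {
    "inventory": {"asset_id", "platform", "region"},
    "ownership": {"asset_id", "owner", "org"},
    "usage":     {"asset_id", "date", "usage_amount"},
}

def contract_violations(kind, records):
    """Return the records that are missing a required field (or carry a
    null value for one), so producers can be alerted before ingestion."""
    required = REQUIRED_FIELDS[kind]
    return [r for r in records
            if not required.issubset(r.keys())
            or any(r.get(field) is None for field in required)]
```

Running checks like this at the contract boundary keeps quality problems on the producer side, before bad records can fan out into downstream cost attribution.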
The standardized data model and processing promotes scalability and consistency.</p><figure class="pj pk pl pm pn po pg ph paragraph-image"><div role="button" tabindex="0" class="pp pq fj pr bh ps"><div class="pg ph pv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*cln5xplS7lpdE0KOh0LE1Q.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*cln5xplS7lpdE0KOh0LE1Q.jpeg" /><img alt="" class="bh md pt c" width="700" height="135" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="a35c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Cloud Efficiency Analytics (CEA Data)</strong></p><p id="0ba0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Once the foundational data is ready, CEA consumes inventory, ownership, and usage data and applies the appropriate <em class="pu">business logic</em> to produce <em class="pu">cost</em> and <em class="pu">ownership attribution</em> at various granularities. The data model approach in CEA is to compartmentalize and be <em class="pu">transparent</em>; we want downstream consumers to understand why they’re seeing resources show up under their name/org and how those costs are calculated. 
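As a toy illustration of the attribution step (not our actual business logic, which is platform-specific), the cost of a multi-tenant asset can be distributed across its owners in proportion to each owner’s recorded usage:

```python
# Toy illustration (not the actual CEA business logic) of attributing the
# cost of a multi-tenant asset to owners in proportion to recorded usage.

def attribute_cost(total_cost, usage_by_owner):
    """Split total_cost across owners proportionally to usage; fall back
    to an even split when no usage is recorded, so cost is never dropped."""
    total_usage = sum(usage_by_owner.values())
    if total_usage == 0:
        share = total_cost / len(usage_by_owner)
        return {owner: share for owner in usage_by_owner}
    return {owner: total_cost * usage / total_usage
            for owner, usage in usage_by_owner.items()}
```

Keeping the attribution rule this explicit is what makes the result explainable: an owner can recompute their share from their own usage numbers.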
Another benefit of this approach is the ability to pivot quickly when new business logic is introduced or existing logic changes.</p><figure class="pj pk pl pm pn po pg ph paragraph-image"><div role="button" tabindex="0" class="pp pq fj pr bh ps"><div class="pg ph pw"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*bvD7xqAO9T9m4s4G" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*bvD7xqAO9T9m4s4G 640w, https://miro.medium.com/v2/resize:fit:720/0*bvD7xqAO9T9m4s4G 720w, https://miro.medium.com/v2/resize:fit:750/0*bvD7xqAO9T9m4s4G 750w, https://miro.medium.com/v2/resize:fit:786/0*bvD7xqAO9T9m4s4G 786w, https://miro.medium.com/v2/resize:fit:828/0*bvD7xqAO9T9m4s4G 828w, https://miro.medium.com/v2/resize:fit:1100/0*bvD7xqAO9T9m4s4G 1100w, https://miro.medium.com/v2/resize:fit:1400/0*bvD7xqAO9T9m4s4G 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pt c" width="700" height="351" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="fcca" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">* For cost accounting purposes, we resolve assets to a single owner, or distribute costs when assets are multi-tenant. However, we do also provide usage and cost at different aggregations for different consumers.</p><h1 id="3fc1" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk"><strong class="al">Data Principles</strong></h1><p id="6a99" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">As the source of truth for efficiency metrics, our team’s tenets are to provide accurate, reliable, and accessible data, comprehensive documentation to navigate the complexity of the efficiency space, and well-defined Service Level Agreements (SLAs) to set expectations with downstream consumers during delays, outages, or changes.</p><p id="d22d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">While ownership and cost may seem straightforward, the complexity of the datasets is considerably high due to the breadth and scope of the business infrastructure and platform-specific features. Services can have multiple owners, cost heuristics are unique to each platform, and the scale of infra data is large. 
As we work on expanding infrastructure coverage to all verticals of the business, we face a unique set of challenges:</p><p id="50e2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">A Few Sizes to Fit the Majority</strong></p><p id="e6ac" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Despite data contracts and a standardized data model for transforming upstream platform data into FPD and CEA, there is usually some degree of customization that is unique to that particular platform. As the centralized source of truth, we feel the constant tension of where to place the processing burden. Decision-making involves ongoing transparent conversations with both our data producers and consumers, frequent prioritization checks, and alignment with business needs as <a class="af nu" href="https://jobs.netflix.com/culture" rel="noopener ugc nofollow" target="_blank">informed captains</a> in this space.</p><p id="fa84" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Data Guarantees</strong></p><p id="5d38" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For data correctness and trust, it’s crucial that we have audits and visibility into health metrics at each layer in the pipeline in order to investigate issues and root-cause anomalies quickly. Maintaining data completeness while ensuring correctness becomes challenging due to upstream latency and required transformations to have the data ready for consumption. 
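One way to picture such an audit is a completeness check between pipeline layers. The sketch below is hypothetical (the 2% tolerance and layer names are illustrative, not our production thresholds): it flags days where the analytics layer’s total cost drifts from the foundational layer it is derived from.

```python
# Hypothetical completeness audit between pipeline layers: flag days where
# analytics-layer (CEA) cost deviates from the foundational (FPD) cost by
# more than a tolerance. The 2% threshold is purely illustrative.

def completeness_audit(fpd_cost_by_day, cea_cost_by_day, tolerance=0.02):
    """Return {day: (fpd_cost, cea_cost)} for days breaching tolerance."""
    breaches = {}
    for day, fpd_cost in fpd_cost_by_day.items():
        # Days missing entirely from the analytics layer count as zero cost,
        # so silent data loss surfaces as a breach rather than a gap.
        cea_cost = cea_cost_by_day.get(day, 0.0)
        if fpd_cost and abs(fpd_cost - cea_cost) / fpd_cost > tolerance:
            breaches[day] = (fpd_cost, cea_cost)
    return breaches
```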
We continuously iterate our audits and incorporate feedback to refine and meet our SLAs.</p><p id="c365" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Abstraction Layers</strong></p><p id="e5ca" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We value <a class="af nu" href="https://jobs.netflix.com/culture" rel="noopener ugc nofollow" target="_blank">people over process</a>, and it is not uncommon for engineering teams to build custom SaaS solutions for other parts of the organization. Although this fosters innovation and improves development velocity, it can create a bit of a conundrum when it comes to understanding and interpreting usage patterns and attributing cost in a way that makes sense to the business and end consumer. With clear inventory, ownership, and usage data from FPD, and precise attribution in the analytical layer, we aim to provide metrics to downstream users regardless of whether they utilize and build on top of internal platforms or on AWS resources directly.</p><h1 id="acbb" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk"><strong class="al">Future Forward</strong></h1><p id="1f2b" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Looking ahead, we aim to continue onboarding platforms to FPD and CEA, striving for nearly complete cost insight coverage in the upcoming year. Longer term, we plan to extend FPD to other areas of the business such as security and availability. 
We aim to move towards proactive approaches via predictive analytics and ML for optimizing usage and detecting anomalies in cost.</p><p id="e9c5" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Ultimately, our goal is to enable our engineering organization to make efficiency-conscious decisions when building and maintaining the myriad services that allow us to enjoy Netflix as a streaming service.</p><h1 id="78a3" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Acknowledgments</h1><p id="2782" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The FPD and CEA work would not have been possible without the cross-functional input of many outstanding colleagues and our dedicated team building these important data assets.</p><p id="1be1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">—</p><p id="fdd6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A bit about the authors:</p><p id="4e87" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="pu">JHan enjoys nature, reading fantasy, and finding the best chocolate chip cookies and cinnamon rolls. She is adamant about writing the SQL select statement with leading commas.</em></p><p id="0922" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="pu">Pallavi enjoys music, travel and watching astrophysics documentaries. With 15+ years working with data, she knows everything’s better with a dash of analytics and a cup of coffee!</em></p></div>]]></description>
      <link>https://netflixtechblog.com/cloud-efficiency-at-netflix-f2a142955f83</link>
      <guid>https://netflixtechblog.com/cloud-efficiency-at-netflix-f2a142955f83</guid>
      <pubDate>Tue, 17 Dec 2024 23:17:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Title Launch Observability at Netflix Scale]]></title>
      <description><![CDATA[<div><div><h2 id="088d" class="pw-subtitle-paragraph hr gt gu bf b hs ht hu hv hw hx hy hz ia ib ic id ie if ig cq du">Part 1: Understanding The Challenges</h2><div></div><p id="51e3" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk"><strong class="nk gv">By:</strong> <a class="af oe" href="https://www.linkedin.com/in/varun-khaitan/" rel="noopener ugc nofollow" target="_blank">Varun Khaitan</a></p><p id="72bf" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">With special thanks to my stunning colleagues: <a class="af oe" href="https://www.linkedin.com/in/mallikarao/" rel="noopener ugc nofollow" target="_blank">Mallika Rao</a>, <a class="af oe" href="https://www.linkedin.com/in/esmir-mesic/" rel="noopener ugc nofollow" target="_blank">Esmir Mesic</a>, <a class="af oe" href="https://www.linkedin.com/in/hugodesmarques/" rel="noopener ugc nofollow" target="_blank">Hugo Marques</a></p><h1 id="c3f4" class="of og gu bf oh oi oj hu ok ol om hx on oo op oq or os ot ou ov ow ox oy oz pa bk">Introduction</h1><p id="0a98" class="pw-post-body-paragraph ni nj gu nk b hs pb nm nn hv pc np nq nr pd nt nu nv pe nx ny nz pf ob oc od gn bk">At Netflix, we manage over a thousand global content launches each month, backed by billions of dollars in annual investment. Ensuring the success and discoverability of each title across our platform is a top priority, as we aim to connect every story with the right audience to delight our members. 
To achieve this, we are committed to building robust systems that deliver comprehensive observability, enabling us to take full accountability for every title on our service.</p><h1 id="b58e" class="of og gu bf oh oi oj hu ok ol om hx on oo op oq or os ot ou ov ow ox oy oz pa bk">The Challenge of Title Launch Observability</h1><p id="e5ac" class="pw-post-body-paragraph ni nj gu nk b hs pb nm nn hv pc np nq nr pd nt nu nv pe nx ny nz pf ob oc od gn bk">As engineers, we’re wired to track system metrics like error rates, latencies, and CPU utilization — but what about metrics that matter to a title’s success?</p><p id="e27e" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">Consider the following example of two different Netflix Homepages:</p><figure class="pj pk pl pm pn po pg ph paragraph-image"><img class="bh mp pt c" src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*B4iyOBZJZEo7eW-p" alt="Sample Homepage A" width="700" height="382" /><figcaption class="pu ff pv pg ph pw px bf b bg z du">Sample Homepage A</figcaption></figure><figure class="pj pk pl pm pn po pg ph paragraph-image"><img class="bh mp pt c" src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*5F9ATQbyOp99jMwJ" alt="Sample Homepage B" width="700" height="386" /><figcaption class="pu ff pv pg ph pw px bf b bg z du">Sample Homepage B</figcaption></figure><p id="f8a6" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">To a basic recommendation system, the two sample pages might appear equivalent as long as the viewer watches the top title. Yet, these pages couldn’t be more different. Each title represents countless hours of effort and creativity, and our systems need to honor that uniqueness.</p><p id="ea38" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">How do we bridge this gap? How can we design systems that recognize these nuances and empower every title to shine and bring joy to our members?</p><h1 id="8bf0" class="of og gu bf oh oi oj hu ok ol om hx on oo op oq or os ot ou ov ow ox oy oz pa bk">The operational needs of a personalization system</h1><p id="931f" class="pw-post-body-paragraph ni nj gu nk b hs pb nm nn hv pc np nq nr pd nt nu nv pe nx ny nz pf ob oc od gn bk">In the early days of Netflix Originals, our launch team would huddle together at midnight, manually verifying that titles appeared in all the right places. While this hands-on approach worked for a handful of titles, it quickly became clear that it couldn’t scale. 
As Netflix expanded globally and the volume of title launches skyrocketed, the operational challenges of maintaining this manual process became undeniable.</p><p id="d7bd" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">Operating a personalization system for a global streaming service involves addressing numerous inquiries about why certain titles appear or fail to appear at specific times and places. <br />Some examples:</p><ul class=""><li id="8d25" class="ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od py pz qa bk">Why is title X not showing on the Coming Soon row for a particular member?</li><li id="a0cc" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od py pz qa bk">Why is title Y missing from the search page in Brazil?</li><li id="f5d7" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od py pz qa bk">Is title Z being displayed correctly in all product experiences as intended?</li></ul><p id="83a3" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">As Netflix scaled, we faced the mounting challenge of providing accurate, timely answers to increasingly complex queries about title performance and discoverability. This led to a suite of fragmented scripts, runbooks, and ad hoc solutions scattered across teams — an approach that was neither sustainable nor efficient.</p><p id="b860" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">The stakes are even higher when ensuring every title launches flawlessly. Metadata and assets must be correctly configured, data must flow seamlessly, microservices must process titles without error, and algorithms must function as intended. 
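</p><p class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk"><em>Checks of this kind can be sketched as a simple pre-launch gate. This is a hypothetical illustration only; the check names and rules are ours, not Netflix’s actual system.</em></p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pre-launch readiness checks in the spirit of the
// requirements above (metadata configured, assets ready, algorithms on).
// All names and rules are our own illustration, not Netflix's system.
public class LaunchReadiness {
    /**
     * Collects human-readable failures for a title; an empty list means
     * the title is launch-ready.
     */
    public static List<String> check(String titleId, boolean metadataConfigured,
                                     boolean assetsReady, boolean algorithmsEnabled) {
        List<String> failures = new ArrayList<>();
        if (!metadataConfigured) failures.add(titleId + ": metadata missing");
        if (!assetsReady)        failures.add(titleId + ": assets not ready");
        if (!algorithmsEnabled)  failures.add(titleId + ": algorithms disabled");
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(check("title-001", true, true, true));  // []
        System.out.println(check("title-002", true, false, true)); // [title-002: assets not ready]
    }
}
```

<p class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">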
The complexity of these operational demands underscored the urgent need for a scalable solution.</p><h1 id="fc77" class="of og gu bf oh oi oj hu ok ol om hx on oo op oq or os ot ou ov ow ox oy oz pa bk">Automating the Operations</h1><p id="2817" class="pw-post-body-paragraph ni nj gu nk b hs pb nm nn hv pc np nq nr pd nt nu nv pe nx ny nz pf ob oc od gn bk">It became evident over time that we needed to automate our operations to scale with the business. As we thought more about this problem and possible solutions, two clear options emerged.</p><h1 id="26ce" class="of og gu bf oh oi oj hu ok ol om hx on oo op oq or os ot ou ov ow ox oy oz pa bk">Option 1: Log Processing</h1><p id="be78" class="pw-post-body-paragraph ni nj gu nk b hs pb nm nn hv pc np nq nr pd nt nu nv pe nx ny nz pf ob oc od gn bk">Log processing offers a straightforward solution for monitoring and analyzing title launches. By logging all titles as they are displayed, we can process these logs to identify anomalies and gain insights into system performance. This approach provides a few advantages:</p><ol class=""><li id="c8ff" class="ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od qg pz qa bk"><strong class="nk gv">Low burden on existing systems:</strong> Log processing imposes minimal changes to existing infrastructure. By leveraging logs, which are already generated during regular operations, we can scale observability without significant system modifications. This allows us to focus on data analysis and problem-solving rather than managing complex system changes.</li><li id="9430" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od qg pz qa bk"><strong class="nk gv">Using the source of truth:</strong> Logs serve as a reliable “source of truth” by providing a comprehensive record of system events. They allow us to verify whether titles are presented as intended and investigate any discrepancies. 
This capability is crucial for ensuring our recommendation systems and user interfaces function correctly, supporting successful title launches.</li></ol><p id="6c94" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">However, taking this approach also presents several challenges:</p><ol class=""><li id="ad3d" class="ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od qg pz qa bk"><strong class="nk gv">Catching Issues Ahead of Time:</strong> Logging primarily addresses post-launch scenarios, as logs are generated only after titles are shown to members. To detect issues proactively, we need to simulate traffic and predict system behavior in advance. Once artificial traffic is generated, discarding the response object and relying solely on logs becomes inefficient.</li><li id="3d21" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od qg pz qa bk"><strong class="nk gv">Appropriate Accuracy:</strong> Comprehensive logging requires services to log both included and excluded titles, along with reasons for exclusion. This could lead to an exponential increase in logged data. Utilizing probabilistic logging methods could compromise accuracy, making it difficult to ascertain whether a title’s absence in logs is due to exclusion or random chance.</li><li id="289c" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od qg pz qa bk"><strong class="nk gv">SLA and Cost Considerations:</strong> Our existing online logging systems do not natively support logging at the title granularity level. While reengineering these systems to accommodate this additional axis is possible, it would entail increased costs. 
Additionally, the time-sensitive nature of these investigations precludes the use of cold storage, which cannot meet the stringent SLAs required.</li></ol><h1 id="aac6" class="of og gu bf oh oi oj hu ok ol om hx on oo op oq or os ot ou ov ow ox oy oz pa bk">Option 2: Observability Endpoints in Our Personalization Systems</h1><p id="7199" class="pw-post-body-paragraph ni nj gu nk b hs pb nm nn hv pc np nq nr pd nt nu nv pe nx ny nz pf ob oc od gn bk">To prioritize title launch observability, we could adopt a centralized approach. By introducing observability endpoints across all systems, we can enable real-time data flow into a dedicated microservice for title launch observability. This approach embeds observability directly into the very fabric of services managing title launches and personalization, ensuring seamless monitoring and insights. Key benefits and strategies include:</p><ol class=""><li id="6ecb" class="ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od qg pz qa bk"><strong class="nk gv">Real-Time Monitoring: </strong>Observability endpoints enable real-time monitoring of system performance and title placements, allowing us to detect and address issues as they arise.</li><li id="f705" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od qg pz qa bk"><strong class="nk gv">Proactive Issue Detection: </strong>By simulating future traffic (an aspect we call “time travel”) and capturing system responses ahead of time, we can preemptively identify potential issues before they impact our members or the business.</li><li id="f16a" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od qg pz qa bk"><strong class="nk gv">Enhanced Accuracy:</strong> Observability endpoints provide precise data on title inclusions and exclusions, allowing us to make accurate assertions about system behavior and title visibility. 
They also provide us with the advanced debuggability information needed to fix identified issues.</li><li id="e717" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od qg pz qa bk"><strong class="nk gv">Scalability and Cost Efficiency:</strong> While the initial implementation required some investment, this approach ultimately offers a scalable and cost-effective solution to managing title launches at Netflix scale.</li></ol><p id="cfd1" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">Choosing this option also comes with some tradeoffs:</p><ol class=""><li id="b2fe" class="ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od qg pz qa bk"><strong class="nk gv">Significant Initial Investment: </strong>Several systems would need to create new endpoints and refactor their codebases to adopt this new method of prioritizing launches.</li><li id="1a2c" class="ni nj gu nk b hs qb nm nn hv qc np nq nr qd nt nu nv qe nx ny nz qf ob oc od qg pz qa bk"><strong class="nk gv">Synchronization Risk: </strong>There is a risk that these new endpoints might not accurately represent production behavior, necessitating conscious effort to keep all endpoints synchronized.</li></ol><h1 id="9e33" class="of og gu bf oh oi oj hu ok ol om hx on oo op oq or os ot ou ov ow ox oy oz pa bk">Up Next</h1><p id="1466" class="pw-post-body-paragraph ni nj gu nk b hs pb nm nn hv pc np nq nr pd nt nu nv pe nx ny nz pf ob oc od gn bk">By adopting a comprehensive observability strategy that includes real-time monitoring, proactive issue detection, and source-of-truth reconciliation, we’ve significantly enhanced our ability to ensure the successful launch and discovery of titles across Netflix, enriching the global viewing experience for our members. 
In the next part of this series, we’ll dive into how we achieved this, sharing key technical insights and details.</p><p id="3bce" class="pw-post-body-paragraph ni nj gu nk b hs nl nm nn hv no np nq nr ns nt nu nv nw nx ny nz oa ob oc od gn bk">Stay tuned for a closer look at the innovation behind the scenes!</p></div></div>]]></description>
      <link>https://netflixtechblog.com/title-launch-observability-at-netflix-scale-c88c586629eb</link>
      <guid>https://netflixtechblog.com/title-launch-observability-at-netflix-scale-c88c586629eb</guid>
      <pubDate>Tue, 17 Dec 2024 22:54:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Netflix’s Distributed Counter Abstraction]]></title>
      <description><![CDATA[<div><div></div><p id="f7ea" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">By: <a class="af nu" href="https://www.linkedin.com/in/rajiv-shringi/" rel="noopener ugc nofollow" target="_blank">Rajiv Shringi</a>, <a class="af nu" href="https://www.linkedin.com/in/oleksii-tkachuk-98b47375/" rel="noopener ugc nofollow" target="_blank">Oleksii Tkachuk</a>, <a class="af nu" href="https://www.linkedin.com/in/kartik894/" rel="noopener ugc nofollow" target="_blank">Kartik Sathyanarayanan</a></p><h1 id="0da9" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Introduction</h1><p id="41fb" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">In our previous blog post, we introduced <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8">Netflix’s TimeSeries Abstraction</a>, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the <strong class="my gv">Distributed Counter Abstraction</strong>. This counting service, built on top of the TimeSeries Abstraction, enables distributed counting at scale while maintaining similar low latency performance. As with all our abstractions, we use our <a class="af nu" href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6" rel="noopener">Data Gateway Control Plane</a> to shard, configure, and deploy this service globally.</p><p id="aebc" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Distributed counting is a challenging problem in computer science. 
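</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em>Even on a single machine the core difficulty shows up in miniature: a naive read-modify-write increment loses updates under contention, while an atomic counter does not. The sketch below is our own local analogue of the problem, for illustration only.</em></p>

```java
import java.util.concurrent.atomic.AtomicInteger;

// Why concurrent counting is hard, in miniature: a plain read-modify-write
// increment can lose updates under contention, while an atomic counter
// cannot. A local analogue of the distributed problem; illustration only.
public class LostUpdates {
    /** Runs the race; returns {racyCount, atomicCount}. */
    public static int[] run(int threadCount, int incrementsPerThread) {
        final int[] plain = {0};                   // unsynchronized counter
        final AtomicInteger atomic = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    plain[0]++;                    // racy: increments may be lost
                    atomic.incrementAndGet();      // atomic: never loses one
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return new int[] {plain[0], atomic.get()};
    }

    public static void main(String[] args) {
        int[] result = run(8, 100_000);
        System.out.println("racy   = " + result[0]); // often less than 800000
        System.out.println("atomic = " + result[1]); // always 800000
    }
}
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">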
In this blog post, we’ll explore the diverse counting requirements at Netflix, the challenges of achieving accurate counts in near real-time, and the rationale behind our chosen approach, including the necessary trade-offs.</p><p id="fb3c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Note</strong>: <em class="oy">When it comes to distributed counters, terms such as ‘accurate’ or ‘precise’ should be taken with a grain of salt. In this context, they refer to a count very close to accurate, presented with minimal delays.</em></p><h1 id="21f6" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Use Cases and Requirements</h1><p id="1e4f" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At Netflix, our counting use cases include tracking millions of user interactions, monitoring how often specific features or experiences are shown to users, and counting multiple facets of data during <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15">A/B test experiments</a>, among others.</p><p id="e35a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">These use cases can be classified into two broad categories:</p><ol class=""><li id="fc33" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">Best-Effort</strong>: For this category, the count doesn’t have to be very accurate or durable. 
However, this category requires near-immediate access to the current count at low latencies, all while keeping infrastructure costs to a minimum.</li><li id="d9a3" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Eventually Consistent</strong>: This category needs accurate and durable counts, and is willing to tolerate a slight delay in accuracy and a slightly higher infrastructure cost as a trade-off.</li></ol><p id="7d8e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Both categories share common requirements, such as high throughput and high availability. The table below provides a detailed overview of the diverse requirements across these two categories.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi pj"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*_Mx2WRBWOfASpK_e2xgoVw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*_Mx2WRBWOfASpK_e2xgoVw.png" /><img alt="" class="bh md pu c" width="700" height="494" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="db7c" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Distributed Counter Abstraction</h1><p id="16d7" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">To meet the outlined requirements, the Counter Abstraction was designed to be highly configurable. It allows users to choose between different counting modes, such as <strong class="my gv">Best-Effort</strong> or <strong class="my gv">Eventually Consistent</strong>, while considering the documented trade-offs of each option. 
After selecting a mode, users can interact with APIs without needing to worry about the underlying storage mechanisms and counting methods.</p><p id="0799" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s take a closer look at the structure and functionality of the API.</p><h1 id="4626" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">API</h1><p id="0433" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Counters are organized into separate namespaces that users set up for each of their specific use cases. Each namespace can be configured with different parameters, such as Type of Counter, Time-To-Live (TTL), and Counter Cardinality, using the service’s Control Plane.</p><p id="cc02" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The Counter Abstraction API resembles Java’s <a class="af nu" href="https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/util/concurrent/atomic/AtomicInteger.html" rel="noopener ugc nofollow" target="_blank">AtomicInteger</a> interface:</p><p id="d3f4" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">AddCount/AddAndGetCount</strong>: Adjusts the count for the specified counter by the given delta value within a dataset. The delta value can be positive or negative. 
The <em class="oy">AddAndGetCount</em> counterpart also returns the count after performing the add operation.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">{<br />  "namespace": "my_dataset",<br />  "counter_name": "counter123",<br />  "delta": 2,<br />  "idempotency_token": { <br />    "token": "some_event_id",<br />    "generation_time": "2024-10-05T14:48:00Z"<br />  }<br />}</pre><p id="2e22" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The idempotency token can be used for counter types that support them. Clients can use this token to safely retry or <a class="af nu" href="https://research.google/pubs/the-tail-at-scale/" rel="noopener ugc nofollow" target="_blank">hedge</a> their requests. Failures in a distributed system are a given, and having the ability to safely retry requests enhances the reliability of the service.</p><p id="b098" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">GetCount</strong>: Retrieves the count value of the specified counter within a dataset.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">{<br />  "namespace": "my_dataset",<br />  "counter_name": "counter123"<br />}</pre><p id="a50d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">ClearCount</strong>: Effectively resets the count to 0 for the specified counter within a dataset.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">{<br />  "namespace": "my_dataset",<br />  "counter_name": "counter456",<br />  "idempotency_token": {...}<br />}</pre><p id="560d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Now, let’s look at the different types of counters supported within the Abstraction.</p><h1 id="3afc" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok 
ol om on oo op oq or os bk">Types of Counters</h1><p id="0ea6" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The service primarily supports two types of counters: <strong class="my gv">Best-Effort</strong> and <strong class="my gv">Eventually Consistent</strong>, along with a third experimental type: <strong class="my gv">Accurate</strong>. In the following sections, we’ll describe the different approaches for these types of counters and the trade-offs associated with each.</p><h1 id="1042" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Best Effort Regional Counter</h1><p id="1497" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">This type of counter is powered by <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/announcing-evcache-distributed-in-memory-datastore-for-cloud-c26a698c27f7">EVCache</a>, Netflix’s distributed caching solution built on the widely popular <a class="af nu" href="https://memcached.org/" rel="noopener ugc nofollow" target="_blank">Memcached</a>. It is suitable for use cases like A/B experiments, where many concurrent experiments are run for relatively short durations and an approximate count is sufficient. Setting aside the complexities of provisioning, resource allocation, and control plane management, the core of this solution is remarkably straightforward:</p><pre class="pk pl pm pn po pv pw px bp py bb bk">// counter cache key<br />counterCacheKey = &lt;namespace&gt;:&lt;counter_name&gt;<br /><br />// add operation<br />return delta &gt; 0<br />    ? cache.incr(counterCacheKey, delta, TTL)<br />    : cache.decr(counterCacheKey, Math.abs(delta), TTL);<br /><br />// get operation<br />cache.get(counterCacheKey);<br /><br />// clear counts from all replicas<br />cache.delete(counterCacheKey, ReplicaPolicy.ALL);</pre><p id="70af" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">EVCache delivers extremely high throughput at low millisecond latency or better within a single region, enabling a multi-tenant setup within a shared cluster, saving infrastructure costs. However, there are some trade-offs: it lacks cross-region replication for the <em class="oy">increment</em> operation and does not provide <a class="af nu" href="https://netflix.github.io/EVCache/features/#consistency" rel="noopener ugc nofollow" target="_blank">consistency guarantees</a>, which may be necessary for an accurate count. Additionally, idempotency is not natively supported, making it unsafe to retry or hedge requests.</p><h1 id="1746" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Eventually Consistent Global Counter</h1><p id="3c43" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">While some users may accept the limitations of a Best-Effort counter, others opt for precise counts, durability and global availability. In the following sections, we’ll explore various strategies for achieving durable and accurate counts. 
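</p><p class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk"><em>One ingredient the durable strategies lean on is the idempotency token introduced earlier. The following toy model (our own sketch, not the production storage design) shows why such a token makes retries safe: a delta is applied at most once per token.</em></p>

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (ours, not the production design) of idempotent counting:
// each delta carries a token, and a delta is applied at most once per
// token, so client retries and hedged requests never double-count.
public class IdempotentCounter {
    private final Map<String, Long> counts = new HashMap<>();
    private final Set<String> seenTokens = new HashSet<>();

    /** Applies delta unless this token was already applied; returns the count. */
    public synchronized long addCount(String counter, long delta, String token) {
        if (seenTokens.add(token)) {       // true only the first time we see it
            counts.merge(counter, delta, Long::sum);
        }
        return counts.getOrDefault(counter, 0L);
    }

    public static void main(String[] args) {
        IdempotentCounter c = new IdempotentCounter();
        c.addCount("counter123", 2, "some_event_id");
        // A retry with the same token must not double-count:
        long v = c.addCount("counter123", 2, "some_event_id");
        System.out.println(v); // 2
    }
}
```

<p class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">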
Our objective is to highlight the challenges inherent in global distributed counting and explain the reasoning behind our chosen approach.</p><p id="e5ff" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Approach 1: Storing a Single Row per Counter</strong></p><p id="f787" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s start simple by using a single row per counter key within a table in a globally replicated datastore.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" width="700" height="578" src="https://miro.medium.com/v2/resize:fit:1400/0*X6k4-4N36IQ5yEPe" /></figure><p id="ca59" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s examine some of the drawbacks of this approach:</p><ul class=""><li id="61e8" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb bk"><strong class="my gv">Lack of Idempotency</strong>: There is no idempotency key baked into the storage data model, preventing users from safely retrying requests. Implementing idempotency would likely require an external system for such keys, which can further degrade performance or cause race conditions.</li><li id="2b44" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Heavy Contention</strong>: To update counts reliably, every writer must perform a Compare-And-Swap operation for a given counter using locks or transactions.
Depending on the throughput and concurrency of operations, this can lead to significant contention, heavily impacting performance.</li></ul><p id="a373" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Secondary Keys</strong>: One way to reduce contention in this approach would be to use a secondary key, such as a <em class="oy">bucket_id</em>, which allows for distributing writes by splitting a given counter into <em class="oy">buckets</em>, while enabling reads to aggregate across buckets. The challenge lies in determining the appropriate number of buckets. A static number may still lead to contention with <em class="oy">hot keys</em>, while dynamically assigning the number of buckets per counter across millions of counters presents a more complex problem.</p><p id="121f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s see if we can iterate on our solution to overcome these drawbacks.</p><p id="875d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Approach 2: Per Instance Aggregation</strong></p><p id="ca5b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To address issues of hot keys and contention from writing to the same row in real-time, we could implement a strategy where each instance aggregates the counts in memory and then flushes them to disk at regular intervals. 
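</p><p class="pw-post-body-paragraph">The flush-on-interval idea just described can be sketched as follows. This is a minimal illustration rather than Netflix’s implementation; the <em class="oy">store.increment</em> client and all names are hypothetical:</p>

```python
import threading
from collections import defaultdict

class InstanceAggregator:
    """Sketch: each instance buffers deltas in memory and periodically
    flushes one aggregated write per counter to the shared datastore."""

    def __init__(self, store):
        self._store = store              # hypothetical datastore client
        self._buffer = defaultdict(int)  # counter -> pending delta
        self._lock = threading.Lock()

    def add(self, counter, delta):
        # Hot path touches only local memory, avoiding datastore contention.
        with self._lock:
            self._buffer[counter] += delta

    def flush(self):
        # Swap out the buffer atomically, then issue one write per counter.
        with self._lock:
            pending, self._buffer = self._buffer, defaultdict(int)
        for counter, delta in pending.items():
            if delta != 0:
                self._store.increment(counter, delta)
```

<p class="pw-post-body-paragraph">In practice, <em class="oy">flush()</em> would run on a per-instance timer rather than being called by hand.</p><p class="pw-post-body-paragraph">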
Introducing sufficient jitter to the flush process can further reduce contention.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" width="700" height="336" src="https://miro.medium.com/v2/resize:fit:1400/0*6iUKbxJ093jJTiYL" /></figure><p id="24b1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">However, this solution presents a new set of issues:</p><ul class=""><li id="dba3" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb bk"><strong class="my gv">Vulnerability to Data Loss</strong>: All in-memory data is vulnerable to loss during instance failures, restarts, or deployments.</li><li id="c41b" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Inability to Reliably Reset Counts</strong>: Because counting requests are distributed across multiple machines, it is challenging to establish consensus on the exact point in time when a counter reset occurred.</li><li id="2535" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Lack of Idempotency</strong>: Just like the previous approach, this approach does not natively guarantee idempotency. One way to ensure idempotency is to consistently route the same set of counters to the same instance.
However, such an approach may introduce additional complexity and potential challenges with availability and latency in the write path.</li></ul><p id="038c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">That said, this approach may still be suitable in scenarios where these trade-offs are acceptable. However, let’s see if we can address some of these issues with a different event-based approach.</p><p id="f599" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Approach 3: Using Durable Queues</strong></p><p id="7eb6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In this approach, we log counter events into a durable queuing system like <a class="af nu" href="https://kafka.apache.org/" rel="noopener ugc nofollow" target="_blank">Apache Kafka</a> to prevent any potential data loss. By creating multiple topic partitions and hashing the counter key to a specific partition, we ensure that the same set of counters is processed by the same set of consumers. This setup simplifies idempotency checks and count resets.
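</p><p class="pw-post-body-paragraph">The key-to-partition routing just described can be sketched as follows. This is a simplified stand-in: MD5 replaces Kafka’s murmur2 default partitioner, and the partition count is illustrative:</p>

```python
import hashlib

def partition_for(counter_key: str, num_partitions: int = 8) -> int:
    """Map a counter key to a stable partition so every event for a given
    counter is handled by the same consumer. Kafka's default partitioner
    does the equivalent with murmur2; MD5 here is just a process-stable
    stand-in for illustration."""
    digest = hashlib.md5(counter_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

<p class="pw-post-body-paragraph">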
Furthermore, by leveraging additional stream processing frameworks such as <a class="af nu" href="https://kafka.apache.org/documentation/streams/" rel="noopener ugc nofollow" target="_blank">Kafka Streams</a> or <a class="af nu" href="https://flink.apache.org/" rel="noopener ugc nofollow" target="_blank">Apache Flink</a>, we can implement windowed aggregations.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" width="700" height="437" src="https://miro.medium.com/v2/resize:fit:1400/0*mQikuGyuzZ_lT7Y4" /></figure><p id="25bf" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">However, this approach comes with some challenges:</p><ul class=""><li id="708e" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb bk"><strong class="my gv">Potential Delays</strong>: Having the same consumer process all the counts from a given partition can lead to backups and delays, resulting in stale counts.</li><li id="f448" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Rebalancing Partitions</strong>: This approach requires auto-scaling and rebalancing of topic partitions as the cardinality of counters and throughput increases.</li></ul><p id="5f3d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Furthermore, all approaches that pre-aggregate counts make it challenging to support two of our requirements for accurate counters:</p><ul class=""><li id="5818" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb
bk"><strong class="my gv">Auditing of Counts</strong>: Auditing involves extracting data to an offline system for analysis to ensure that increments were applied correctly to reach the final value. This process can also be used to track the provenance of increments. However, auditing becomes infeasible when counts are aggregated without storing the individual increments.</li><li id="51be" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Potential Recounting</strong>: Similar to auditing, if adjustments to increments are necessary and recounting of events within a time window is required, pre-aggregating counts makes this infeasible.</li></ul><p id="ccda" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Barring those few requirements, this approach can still be effective if we determine the right way to scale our queue partitions and consumers while maintaining idempotency. However, let’s explore how we can adjust this approach to meet the auditing and recounting requirements.</p><p id="83bd" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Approach 4: Event Log of Individual Increments</strong></p><p id="57ab" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In this approach, we log each individual counter increment along with its <strong class="my gv">event_time</strong> and <strong class="my gv">event_id</strong>. The event_id can include the source information of where the increment originated. 
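</p><p class="pw-post-body-paragraph">A sketch of such an event record, together with an aggregation that skips redelivered events; the field names are illustrative rather than the service’s actual schema:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CounterEvent:
    counter_name: str
    delta: int        # signed increment, e.g. +1 or -1
    event_time: int   # epoch millis at the source
    event_id: str     # unique id, can carry provenance of the increment

def aggregate(events):
    """Sum deltas while skipping duplicates: (event_time, event_id) acts
    as the idempotency key, so client retries cannot double-count."""
    seen, total = set(), 0
    for e in events:
        key = (e.counter_name, e.event_time, e.event_id)
        if key in seen:
            continue  # duplicate delivery or client retry
        seen.add(key)
        total += e.delta
    return total
```

<p class="pw-post-body-paragraph">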
The combination of event_time and event_id can also serve as the idempotency key for the write.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" width="700" height="532" src="https://miro.medium.com/v2/resize:fit:1400/0*0wKFK7xyTHnEKIhO" /></figure><p id="e421" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">However, <em class="oy">in its simplest form</em>, this approach has several drawbacks:</p><ul class=""><li id="1932" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb bk"><strong class="my gv">Read Latency</strong>: Each read request requires scanning all increments for a given counter, potentially degrading performance.</li><li id="5891" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Duplicate Work</strong>: Multiple threads might duplicate the effort of aggregating the same set of counters during read operations, leading to wasted effort and subpar resource utilization.</li><li id="973d" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Wide Partitions</strong>: If using a datastore like <a class="af nu" href="https://cassandra.apache.org/_/index.html" rel="noopener ugc nofollow" target="_blank">Apache Cassandra</a>, storing many increments for the same counter could lead to a <a class="af nu" href="https://thelastpickle.com/blog/2019/01/11/wide-partitions-cassandra-3-11.html" rel="noopener ugc nofollow" target="_blank">wide partition</a>, affecting read performance.</li><li id="21ef" class="mw mx gu my b mz pc nb nc nd pd nf ng nh
pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Large Data Footprint</strong>: Storing each increment individually could also result in a substantial data footprint over time. Without an efficient data retention strategy, this approach may struggle to scale effectively.</li></ul><p id="e879" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The combined impact of these issues can lead to increased infrastructure costs that may be difficult to justify. However, adopting an event-driven approach seems to be a significant step forward in addressing some of the challenges we’ve encountered and meeting our requirements.</p><p id="04e4" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">How can we improve this solution further?</p><h1 id="08e8" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Netflix’s Approach</h1><p id="0918" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We use a combination of the previous approaches, where we log each counting activity as an event, and continuously aggregate these events in the background using queues and a sliding time window. Additionally, we employ a bucketing strategy to prevent wide partitions. In the following sections, we’ll explore how this approach addresses the previously mentioned drawbacks and meets all our requirements.</p><p id="ff08" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Note</strong>: <em class="oy">From here on, we will use the words “</em><strong class="my gv"><em class="oy">rollup</em></strong><em class="oy">” and “</em><strong class="my gv"><em class="oy">aggregate</em></strong><em class="oy">” interchangeably. 
They essentially mean the same thing, i.e., collecting individual counter increments/decrements and arriving at the final value.</em></p><p id="68cd" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">TimeSeries Event Store:</strong></p><p id="aa41" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We chose the <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8">TimeSeries Data Abstraction</a> as our event store, where counter mutations are ingested as event records. Some of the benefits of storing events in TimeSeries include:</p><p id="11da" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">High-Performance</strong>: The TimeSeries abstraction already addresses many of our requirements, including high availability and throughput, reliable and fast performance, and more.</p><p id="b03a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Reducing Code Complexity</strong>: We reduce a lot of code complexity in Counter Abstraction by delegating a major portion of the functionality to an existing service.</p><p id="6c3a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">TimeSeries Abstraction uses Cassandra as the underlying event store, but it can be configured to work with any persistent store. 
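</p><p class="pw-post-body-paragraph">As a rough sketch of how such a store can avoid hot partitions, a partition key can fold in time and event buckets; the bucket sizes and key shape below are illustrative, not the abstraction’s actual schema:</p>

```python
import zlib

TIME_BUCKET_MS = 60_000   # illustrative bucket width
EVENT_BUCKETS = 4         # illustrative fan-out per time bucket

def partition_key(namespace: str, counter_name: str,
                  event_time_ms: int, event_id: str):
    """Sketch of bucketing: the partition key includes a time bucket and a
    hashed event bucket, so a single hot counter spreads across many
    partitions instead of building one wide partition."""
    time_bucket = event_time_ms // TIME_BUCKET_MS
    event_bucket = zlib.crc32(event_id.encode("utf-8")) % EVENT_BUCKETS
    return (namespace, counter_name, time_bucket, event_bucket)
```

<p class="pw-post-body-paragraph">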
Here is what it looks like:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" width="700" height="334" src="https://miro.medium.com/v2/resize:fit:1400/0*ge4X7ywSmtizcNE5" /></figure><p id="2b96" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Handling Wide Partitions</strong>: The <em class="oy">time_bucket</em> and <em class="oy">event_bucket</em> columns play a crucial role in breaking up a wide partition, preventing high-throughput counter events from overwhelming a given partition. <em class="oy">For more information regarding this, refer to our previous</em> <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8"><em class="oy">blog</em></a>.</p><p id="3dc8" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">No Over-Counting</strong>: The <em class="oy">event_time</em>, <em class="oy">event_id</em>, and <em class="oy">event_item_key</em> columns form the idempotency key for the events for a given counter, enabling clients to retry safely without the risk of over-counting.</p><p id="43a9" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Event Ordering</strong>: TimeSeries orders all events in descending order of time, allowing us to leverage this property for events like count resets.</p><p id="278b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong
class="my gv">Event Retention</strong>: The TimeSeries Abstraction includes retention policies to ensure that events are not stored indefinitely, saving disk space and reducing infrastructure costs. Once events have been aggregated and moved to a more cost-effective store for audits, there’s no need to retain them in the primary storage.</p><p id="f647" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Now, let’s see how these events are aggregated for a given counter.</p><p id="5a6c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Aggregating Count Events:</strong></p><p id="80b9" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As mentioned earlier, collecting all individual increments for every read request would be cost-prohibitive in terms of read performance. Therefore, a background aggregation process is necessary to continually converge counts and ensure optimal read performance.</p><p id="2ed6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="oy">But how can we safely aggregate count events amidst ongoing write operations?</em></p><p id="0a22" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This is where the concept of <em class="oy">Eventually Consistent </em>counts becomes crucial. 
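</p><p class="pw-post-body-paragraph">A minimal sketch of the idea, with illustrative limits (the actual “acceptLimit” is a TimeSeries Abstraction parameter, and the skew margin here is assumed):</p>

```python
from datetime import datetime, timedelta, timezone

ACCEPT_LIMIT = timedelta(seconds=5)  # illustrative: ingest rejects events older than this
SKEW_MARGIN = timedelta(seconds=2)   # illustrative extra allowance for clock skew

def aggregation_window(last_rollup_ts: datetime, now: datetime = None):
    """Sketch: everything older than now - ACCEPT_LIMIT - SKEW_MARGIN is
    immutable (new writes there are rejected at ingest), so a rollup can
    safely aggregate events in [last_rollup_ts, window_end)."""
    now = now or datetime.now(timezone.utc)
    window_end = now - ACCEPT_LIMIT - SKEW_MARGIN
    return last_rollup_ts, window_end
```

<p class="pw-post-body-paragraph">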
<em class="oy">By intentionally lagging behind the current time by a safe margin</em>, we ensure that aggregation always occurs within an immutable window.</p><p id="5460" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s see what that looks like:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" width="700" height="470" src="https://miro.medium.com/v2/resize:fit:1400/0*EOpW-VnA_YZF7KOP" /></figure><p id="a980" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s break this down:</p><ul class=""><li id="c8ce" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb bk"><strong class="my gv">lastRollupTs</strong>: This represents the time when the counter value was last aggregated. For a counter being operated on for the first time, this timestamp defaults to a reasonable time in the past.</li><li id="881b" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Immutable Window and Lag</strong>: Aggregation can only occur safely within an immutable window that is no longer receiving counter events. The “acceptLimit” parameter of the TimeSeries Abstraction plays a crucial role here, as it rejects incoming events with timestamps beyond this limit.
During aggregations, this window is pushed slightly further back to account for clock skew.</li></ul><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*DbtPCHPWoaauUkDr" width="700" height="153" /></figure><p id="1fee" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This does mean that the counter value will lag behind its most recent update by some margin (typically on the order of seconds). <em class="oy">This approach does leave the door open for missed events due to cross-region replication issues; see the “Future Work” section at the end.</em></p><ul class=""><li id="f593" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb bk"><strong class="my gv">Aggregation Process</strong>: The rollup process aggregates all events in the aggregation window <em class="oy">since the last rollup</em> to arrive at the new value.</li></ul><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*oSHneX5BOi5VNGYM" width="700" height="129" /></figure><p id="19c8" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Rollup Store:</strong></p><p id="48a1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We save the results of this aggregation in a persistent store. The next aggregation simply continues from this checkpoint.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*93S_a1YJ6zacuBnn" width="700" height="318" /></figure><p id="586a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We create one such Rollup table <em class="oy">per dataset</em> and use Cassandra as our persistent store. However, as you will soon see in the Control Plane section, the Counter service can be configured to work with any persistent store.</p><p id="18db" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">LastWriteTs</strong>: Every time a given counter receives a write, we also log a <strong class="my gv">last-write-timestamp</strong> as a columnar update in this table.
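 Before looking at how that timestamp is applied, the checkpointed rollup itself can be sketched as a small model (hypothetical Python, not the actual implementation; the <em class="oy">acceptLimit</em> and clock-skew margins are illustrative):</p>

```python
# Hypothetical sketch of a checkpointed rollup over an immutable window.
# ACCEPT_LIMIT_SECONDS mirrors the TimeSeries "acceptLimit": events with
# timestamps beyond (now - ACCEPT_LIMIT_SECONDS) are rejected on write,
# so everything at or before that point is immutable.
ACCEPT_LIMIT_SECONDS = 5
CLOCK_SKEW_SECONDS = 2  # extra margin pushed back during aggregation

def rollup(events, last_rollup_ts, last_rollup_count, now):
    """Aggregate deltas in the immutable window (last_rollup_ts, window_end]."""
    window_end = now - ACCEPT_LIMIT_SECONDS - CLOCK_SKEW_SECONDS
    delta = sum(v for ts, v in events if last_rollup_ts < ts <= window_end)
    return last_rollup_count + delta, window_end

# Counter events as (event_time, delta) pairs.
events = [(10, 1), (11, 1), (12, -1), (19, 1)]
count, checkpoint = rollup(events, last_rollup_ts=0, last_rollup_count=0, now=20)
# The event at t=19 is past the immutable window (20 - 5 - 2 = 13) and
# will be picked up by a later rollup that starts from this checkpoint.
print(count, checkpoint)  # -> 1 13
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As for recording the last-write-timestamp itself: 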
This is done using Cassandra’s <a class="af nu" href="https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlInsert.html#cqlInsert__timestamp-value" rel="noopener ugc nofollow" target="_blank">USING TIMESTAMP</a> feature to predictably apply Last-Write-Wins (LWW) semantics. This timestamp is the same as the <em class="oy">event_time</em> for the event. In the subsequent sections, we’ll see how this timestamp is used to keep some counters in active rollup circulation until they have caught up to their latest value.</p><p id="336a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Rollup Cache</strong></p><p id="25be" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To optimize read performance, these values are cached in EVCache for each counter. We combine the <strong class="my gv">lastRollupCount</strong> and <strong class="my gv">lastRollupTs</strong> <em class="oy">into a single cached value per counter</em> to prevent potential mismatches between the count and its corresponding checkpoint timestamp.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*giCU1AtWUYMXHZcI" width="700" height="496" /></figure><p id="1bbf" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">But how do we know which counters to trigger rollups for?
Let’s explore our Write and Read paths to understand this better.</p><p id="77ab" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Add/Clear Count:</strong></p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*wsxgnWH1yR0gHAEL" width="700" height="359" /></figure><p id="6b09" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">An <em class="oy">add</em> or <em class="oy">clear</em> count request writes durably to the TimeSeries Abstraction and updates the last-write-timestamp in the Rollup store. If the durability acknowledgement fails, clients can retry their requests with the same idempotency token without the risk of overcounting. Once durability is achieved, we send a <em class="oy">fire-and-forget</em> request to trigger the rollup for the requested counter.</p><p id="6a87" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">GetCount:</strong></p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*76pQR6OISx9yuRmi" width="700" height="359" /></figure><p id="23ce" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We return the last rolled-up count as <em class="oy">a quick point-read operation</em>, accepting the trade-off of potentially delivering a slightly stale count. We also trigger a rollup during the read operation to advance the last-rollup-timestamp, enhancing the performance of <em class="oy">subsequent</em> aggregations. This process also <em class="oy">self-remediates</em> a stale count if any previous rollups had failed.</p><p id="2bdc" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">With this approach, the counts <em class="oy">continually converge</em> to their latest value. Now, let’s see how we scale this approach to millions of counters and thousands of concurrent operations using our Rollup Pipeline.</p><p id="9ab5" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Rollup Pipeline:</strong></p><p id="c974" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Each <strong class="my gv">Counter-Rollup</strong> server operates a rollup pipeline to efficiently aggregate counts across millions of counters. This is where most of the complexity in Counter Abstraction comes in.
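 Before diving in, the Add/Clear and GetCount paths described above can be sketched end to end (hypothetical Python; plain dicts and an in-process queue stand in for the TimeSeries Abstraction, EVCache, and the rollup trigger, and all names are illustrative):</p>

```python
from queue import Queue

# Hypothetical sketch of the write and read paths.
events = []         # durable event log (TimeSeries Abstraction stand-in)
rollup_cache = {}   # lastRollupCount/lastRollupTs checkpoint (EVCache stand-in)
rollup_queue = Queue()  # light-weight rollup triggers

def add_count(counter, delta, event_time, idempotency_token):
    # Durable write first; a retry with the same token cannot overcount.
    if not any(e["token"] == idempotency_token for e in events):
        events.append({"counter": counter, "delta": delta,
                       "ts": event_time, "token": idempotency_token})
    # Fire-and-forget rollup trigger; it carries no increment.
    rollup_queue.put({"namespace": "my_dataset", "counter": counter})

def get_count(counter):
    cached = rollup_cache.get(counter, {"lastRollupCount": 0, "lastRollupTs": 0})
    rollup_queue.put({"namespace": "my_dataset", "counter": counter})
    return cached["lastRollupCount"]  # fast point-read, possibly slightly stale

add_count("counter123", 1, event_time=100, idempotency_token="t1")
add_count("counter123", 1, event_time=100, idempotency_token="t1")  # retry: no-op
print(len(events), get_count("counter123"))  # -> 1 0
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In this model, retried writes are deduplicated by the idempotency token, and every operation merely hints that its counter needs aggregation. 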
In the following sections, we will share key details on how efficient aggregations are achieved.</p><p id="d23a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Light-Weight Roll-Up Event: </strong>As seen in our Write and Read paths above, every operation on a counter sends a light-weight event to the Rollup server:</p><pre class="pk pl pm pn po pv pw px bp py bb bk">rollupEvent: {<br />  "namespace": "my_dataset",<br />  "counter": "counter123"<br />}</pre><p id="8d93" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Note that this event does not include the increment. It is only an indication to the Rollup server that this counter has been accessed and now needs to be aggregated. Knowing exactly which counters need to be aggregated prevents scanning the entire event dataset for aggregation purposes.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*Yusg6kC9Jj9ayjbi" width="700" height="284" /></figure><p id="0e5d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">In-Memory Rollup Queues:</strong> A given Rollup server instance runs a set of <em class="oy">in-memory</em> queues to receive rollup events and parallelize aggregations. In the first version of this service, we settled on in-memory queues to reduce provisioning complexity, save on infrastructure costs, and make rebalancing the number of queues fairly straightforward.
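 A minimal sketch of this queue assignment (hypothetical Python; CRC32 stands in for the actual hash, and all names are illustrative):</p>

```python
import zlib
from queue import Queue

# Hypothetical sketch of in-memory rollup queues. A counter is mapped to a
# fixed queue with a fast non-cryptographic hash (CRC32 here, as a stdlib
# stand-in for something like XXHash), so the same counter always lands on
# the same queue.
NUM_QUEUES = 8
queues = [Queue() for _ in range(NUM_QUEUES)]

def enqueue_rollup_event(namespace, counter):
    idx = zlib.crc32(f"{namespace}/{counter}".encode()) % NUM_QUEUES
    queues[idx].put({"namespace": namespace, "counter": counter})
    return idx

# Repeat events for the same counter always route to the same queue.
a = enqueue_rollup_event("my_dataset", "counter123")
b = enqueue_rollup_event("my_dataset", "counter123")
print(a == b)  # -> True
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Because the mapping is deterministic, duplicate work for a given counter stays localized to one queue. 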
However, this comes with the trade-off of potentially missing rollup events if an instance crashes. For more details, see the “Stale Counts” section in “Future Work.”</p><p id="0d6d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Minimize Duplicate Effort</strong>: We use a fast non-cryptographic hash like <a class="af nu" href="https://xxhash.com/" rel="noopener ugc nofollow" target="_blank">XXHash</a> to ensure that the same set of counters always ends up on the same queue. Further, we minimize duplicate aggregation work by running a separate rollup stack with <em class="oy">fewer</em>, <em class="oy">beefier</em> instances.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*u3p0kGfuwvK5mP_j" width="700" height="440" /></figure><p id="223c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Availability and Race Conditions: </strong>Having a single Rollup server instance would minimize duplicate aggregation work but would create availability challenges for triggering rollups. <em class="oy">If</em> we instead horizontally scale the Rollup servers, we allow threads to overwrite rollup values while avoiding any form of distributed locking, maintaining high availability and performance. This approach remains safe because aggregation occurs within an immutable window.
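 The safety argument can be demonstrated with a small model (hypothetical Python; the lag margin and event times are illustrative):</p>

```python
# Hypothetical demonstration that lock-free, overlapping rollups converge.
# Two threads may observe different values of now(), so the thread whose
# write lands last may briefly publish a staler count; a later rollup from
# the published checkpoint still reaches the correct total, because each
# aggregation window is immutable.
LAG = 7  # seconds held back from now() to stay inside the immutable window

def aggregate(events, checkpoint_ts, base_count, now):
    window_end = now - LAG
    delta = sum(v for ts, v in events if checkpoint_ts < ts <= window_end)
    return base_count + delta, window_end

events = [(10, 1), (11, 1), (12, 1)]
thread_a = aggregate(events, 0, 0, now=18)  # sees count 2, checkpoint 11
thread_b = aggregate(events, 0, 0, now=20)  # sees count 3, checkpoint 13
# Suppose thread A writes last: the staler (2, 11) state wins for a moment.
count, checkpoint = thread_a
# The next rollup resumes from that checkpoint and converges.
count, checkpoint = aggregate(events, checkpoint, count, now=25)
print(count)  # -> 3
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Whichever thread’s write lands last, the next rollup from the published checkpoint restores the correct total. 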
Although the concept of <em class="oy">now()</em> may differ between threads, causing rollup values to sometimes fluctuate, the counts eventually converge to an accurate value within each immutable aggregation window.</p><p id="acf2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Rebalancing Queues</strong>: If we need to scale the number of queues, a simple Control Plane configuration update followed by a re-deploy is enough to rebalance them.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">      "eventual_counter_config": {             <br />          "queue_config": {                    <br />            "num_queues" : 8,  // change to 16 and re-deploy<br />...</pre><p id="3a8b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Handling Deployments</strong>: During deployments, these queues shut down gracefully, draining all existing events first, while the new Rollup server instance starts up with a potentially new queue configuration. There may be a brief period when both the old and new Rollup servers are active, but as mentioned before, this race condition is managed because aggregations occur within immutable windows.</p><p id="a67a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Minimize Rollup Effort</strong>: Receiving multiple events for the same counter doesn’t mean rolling it up multiple times.
We drain these rollup events into a Set, ensuring <em class="oy">a given counter is rolled up only once</em> <em class="oy">during a rollup window</em>.</p><p id="c500" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Efficient Aggregation: </strong>Each rollup consumer processes a batch of counters simultaneously. Within each batch, it queries the underlying TimeSeries Abstraction in parallel to aggregate events within the specified time boundaries. The TimeSeries Abstraction optimizes these range scans to achieve low millisecond latencies.</p><p id="fea5" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Dynamic Batching</strong>: The Rollup server dynamically adjusts the number of time partitions that need to be scanned based on the cardinality of the counters, in order to prevent overwhelming the underlying store with too many parallel read requests.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><img alt="image" src="https://miro.medium.com/v2/resize:fit:1400/0*hoPpSmQeScn87q0U" width="700" height="557" /></figure><p id="9446" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Adaptive Back-Pressure</strong>: Each consumer waits for one batch to complete before issuing the rollups for the next batch. It adjusts the wait time between batches based on the performance of the previous batch.
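 A minimal sketch of this pacing loop (hypothetical Python; the target latency, bounds, and adjustment factors are illustrative):</p>

```python
# Hypothetical sketch of adaptive back-pressure between rollup batches:
# the pause before the next batch grows when the store responds slowly
# and shrinks when it responds quickly.
TARGET_BATCH_MS = 50.0

def next_pause_ms(prev_pause_ms, observed_batch_ms):
    if observed_batch_ms > TARGET_BATCH_MS:
        return min(prev_pause_ms * 2 or 1.0, 1000.0)  # back off, capped
    return max(prev_pause_ms / 2, 0.0)                # speed back up

pause = 0.0
for batch_ms in [20, 80, 120, 40, 30]:  # observed batch latencies
    pause = next_pause_ms(pause, batch_ms)
print(pause)  # -> 0.5
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Slow batches double the pause and fast batches halve it, a simple multiplicative adjustment. 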
This approach provides back-pressure during rollups to prevent overwhelming the underlying TimeSeries store.</p><p id="2693" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Handling Convergence</strong>:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*-hlw324cMUaC6pQJ%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*-hlw324cMUaC6pQJ%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*-hlw324cMUaC6pQJ%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*-hlw324cMUaC6pQJ%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*-hlw324cMUaC6pQJ%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*-hlw324cMUaC6pQJ%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*-hlw324cMUaC6pQJ%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*-hlw324cMUaC6pQJ 640w, https://miro.medium.com/v2/resize:fit:720/0*-hlw324cMUaC6pQJ 720w, https://miro.medium.com/v2/resize:fit:750/0*-hlw324cMUaC6pQJ 750w, https://miro.medium.com/v2/resize:fit:786/0*-hlw324cMUaC6pQJ 786w, https://miro.medium.com/v2/resize:fit:828/0*-hlw324cMUaC6pQJ 828w, https://miro.medium.com/v2/resize:fit:1100/0*-hlw324cMUaC6pQJ 1100w, 
https://miro.medium.com/v2/resize:fit:1400/0*-hlw324cMUaC6pQJ 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="447" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="da2c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In order to prevent <strong class="my gv">low-cardinality</strong> counters from lagging behind too much and subsequently scanning too many time partitions, they are kept in constant rollup circulation. For <strong class="my gv">high-cardinality</strong> counters, continuously circulating them would consume excessive memory in our Rollup queues. This is where the <strong class="my gv">last-write-timestamp</strong> mentioned previously plays a crucial role. 
The Rollup server inspects this timestamp to determine if a given counter needs to be re-queued, ensuring that we continue aggregating until it has fully caught up with the writes.</p><p id="af9d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Now, let’s see how we leverage this counter type to provide an up-to-date current count in near-realtime.</p><h1 id="7747" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Experimental: Accurate Global Counter</h1><p id="8e14" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We are experimenting with a slightly modified version of the Eventually Consistent counter. Again, take the term ‘Accurate’ with a grain of salt. The key difference between this type of counter and its counterpart is that the <em class="oy">delta</em>, representing the counts since the last-rolled-up timestamp, is computed in real-time.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*FVOlMO0VgrQoVBBi%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*FVOlMO0VgrQoVBBi%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*FVOlMO0VgrQoVBBi%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*FVOlMO0VgrQoVBBi%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*FVOlMO0VgrQoVBBi%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*FVOlMO0VgrQoVBBi%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*FVOlMO0VgrQoVBBi%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and 
(max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*FVOlMO0VgrQoVBBi 640w, https://miro.medium.com/v2/resize:fit:720/0*FVOlMO0VgrQoVBBi 720w, https://miro.medium.com/v2/resize:fit:750/0*FVOlMO0VgrQoVBBi 750w, https://miro.medium.com/v2/resize:fit:786/0*FVOlMO0VgrQoVBBi 786w, https://miro.medium.com/v2/resize:fit:828/0*FVOlMO0VgrQoVBBi 828w, https://miro.medium.com/v2/resize:fit:1100/0*FVOlMO0VgrQoVBBi 1100w, https://miro.medium.com/v2/resize:fit:1400/0*FVOlMO0VgrQoVBBi 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="290" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="a925" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Aggregating this delta in real-time can impact the performance of this operation, depending on the number of events and partitions that need to be scanned to retrieve this delta. 
The same principle of rolling up in batches applies here to prevent scanning too many partitions in parallel.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qo"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*M3dbSof98dTfeuNe%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*M3dbSof98dTfeuNe%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*M3dbSof98dTfeuNe%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*M3dbSof98dTfeuNe%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*M3dbSof98dTfeuNe%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*M3dbSof98dTfeuNe%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*M3dbSof98dTfeuNe%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*M3dbSof98dTfeuNe 640w, https://miro.medium.com/v2/resize:fit:720/0*M3dbSof98dTfeuNe 720w, https://miro.medium.com/v2/resize:fit:750/0*M3dbSof98dTfeuNe 750w, https://miro.medium.com/v2/resize:fit:786/0*M3dbSof98dTfeuNe 786w, https://miro.medium.com/v2/resize:fit:828/0*M3dbSof98dTfeuNe 828w, https://miro.medium.com/v2/resize:fit:1100/0*M3dbSof98dTfeuNe 1100w, https://miro.medium.com/v2/resize:fit:1400/0*M3dbSof98dTfeuNe 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="239" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="3b68" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Conversely, if the counters in this dataset are accessed frequently, the time gap for the delta remains narrow, making this approach of fetching current counts quite effective.</p><p id="49a2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Now, let’s see how all this complexity is managed through a unified Control Plane configuration.</p><h1 id="98ba" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Control Plane</h1><p id="47f1" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The <a class="af nu" href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6" rel="noopener">Data Gateway Platform Control Plane</a> manages control settings for all abstractions and namespaces, including the Counter Abstraction. 
Below is an example of a control plane configuration for a namespace that supports eventually consistent counters with low cardinality:</p><pre class="pk pl pm pn po pv pw px bp py bb bk">"persistence_configuration": [<br />  {<br />    "id": "CACHE",                             // Counter cache config<br />    "scope": "dal=counter",                                                   <br />    "physical_storage": {<br />      "type": "EVCACHE",                       // type of cache storage<br />      "cluster": "evcache_dgw_counter_tier1"   // Shared EVCache cluster<br />    }<br />  },<br />  {<br />    "id": "COUNTER_ROLLUP",<br />    "scope": "dal=counter",                    // Counter abstraction config<br />    "physical_storage": {                     <br />      "type": "CASSANDRA",                     // type of Rollup store<br />      "cluster": "cass_dgw_counter_uc1",       // physical cluster name<br />      "dataset": "my_dataset_1"                // namespace/dataset   <br />    },<br />    "counter_cardinality": "LOW",              // supported counter cardinality<br />    "config": {<br />      "counter_type": "EVENTUAL",              // Type of counter<br />      "eventual_counter_config": {             // eventual counter type<br />        "internal_config": {                  <br />          "queue_config": {                    // adjust w.r.t cardinality<br />            "num_queues": 8,                   // Rollup queues per instance<br />            "coalesce_ms": 10000,              // coalesce duration for rollups<br />            "capacity_bytes": 16777216         // allocated memory per queue<br />          },<br />          "rollup_batch_count": 32             // parallelization factor<br />        }<br />      }<br />    }<br />  },<br />  {<br />    "id": "EVENT_STORAGE",<br />    "scope": "dal=ts",                         // TimeSeries Event store<br />    "physical_storage": {<br />      "type": "CASSANDRA",                     // 
persistent store type<br />      "cluster": "cass_dgw_counter_uc1",       // physical cluster name<br />      "dataset": "my_dataset_1",               // keyspace name<br />    },<br />    "config": {                              <br />      "time_partition": {                      // time-partitioning for events<br />        "buckets_per_id": 4,                   // event buckets within<br />        "seconds_per_bucket": "600",           // smaller width for LOW card<br />        "seconds_per_slice": "86400",          // width of a time slice table<br />      },<br />      "accept_limit": "5s",                    // boundary for immutability<br />    },<br />    "lifecycleConfigs": {<br />      "lifecycleConfig": [<br />        {<br />          "type": "retention",                 // Event retention<br />          "config": {<br />            "close_after": "518400s",<br />            "delete_after": "604800s"          // 7 day count event retention<br />          }<br />        }<br />      ]<br />    }<br />  }<br />]</pre><p id="9fd9" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Using such a control plane configuration, we compose multiple abstraction layers using containers deployed on the same host, with each container fetching configuration specific to its scope.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qp"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*4MdrlEjWg2MXU9S3%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*4MdrlEjWg2MXU9S3%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*4MdrlEjWg2MXU9S3%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*4MdrlEjWg2MXU9S3%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*4MdrlEjWg2MXU9S3%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*4MdrlEjWg2MXU9S3%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*4MdrlEjWg2MXU9S3%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*4MdrlEjWg2MXU9S3 640w, https://miro.medium.com/v2/resize:fit:720/0*4MdrlEjWg2MXU9S3 720w, https://miro.medium.com/v2/resize:fit:750/0*4MdrlEjWg2MXU9S3 750w, https://miro.medium.com/v2/resize:fit:786/0*4MdrlEjWg2MXU9S3 786w, https://miro.medium.com/v2/resize:fit:828/0*4MdrlEjWg2MXU9S3 828w, https://miro.medium.com/v2/resize:fit:1100/0*4MdrlEjWg2MXU9S3 1100w, https://miro.medium.com/v2/resize:fit:1400/0*4MdrlEjWg2MXU9S3 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="737" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="6176" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Provisioning</h1><p id="ec90" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">As with the TimeSeries abstraction, our automation uses a set of user inputs regarding their workload and cardinalities to arrive at the right set of infrastructure and related control plane configuration. You can learn more about this process in a talk given by one of our stunning colleagues, <a class="af nu" href="https://www.linkedin.com/in/joseph-lynch-9976a431/" rel="noopener ugc nofollow" target="_blank">Joey Lynch</a>: <a class="af nu" href="https://www.youtube.com/watch?v=Lf6B1PxIvAs" rel="noopener ugc nofollow" target="_blank">How Netflix optimally provisions infrastructure in the cloud</a>.</p><h1 id="7e07" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Performance</h1><p id="f469" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At the time of writing, this service was processing close to <strong class="my gv">75K count requests/second</strong><em class="oy"> globally</em> across the different API endpoints and datasets:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qg"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*1h_af4Kk3YrZrqlc%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*1h_af4Kk3YrZrqlc%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*1h_af4Kk3YrZrqlc%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*1h_af4Kk3YrZrqlc%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*1h_af4Kk3YrZrqlc%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*1h_af4Kk3YrZrqlc%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*1h_af4Kk3YrZrqlc%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*1h_af4Kk3YrZrqlc 640w, https://miro.medium.com/v2/resize:fit:720/0*1h_af4Kk3YrZrqlc 720w, https://miro.medium.com/v2/resize:fit:750/0*1h_af4Kk3YrZrqlc 750w, https://miro.medium.com/v2/resize:fit:786/0*1h_af4Kk3YrZrqlc 786w, https://miro.medium.com/v2/resize:fit:828/0*1h_af4Kk3YrZrqlc 828w, https://miro.medium.com/v2/resize:fit:1100/0*1h_af4Kk3YrZrqlc 1100w, https://miro.medium.com/v2/resize:fit:1400/0*1h_af4Kk3YrZrqlc 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="357" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="92ef" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">while providing<strong class="my gv"> single-digit millisecond</strong> latencies for all its endpoints:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*UnI7eore6gvuqrrF%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*UnI7eore6gvuqrrF%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*UnI7eore6gvuqrrF%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*UnI7eore6gvuqrrF%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*UnI7eore6gvuqrrF%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*UnI7eore6gvuqrrF%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*UnI7eore6gvuqrrF%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*UnI7eore6gvuqrrF 640w, https://miro.medium.com/v2/resize:fit:720/0*UnI7eore6gvuqrrF 720w, https://miro.medium.com/v2/resize:fit:750/0*UnI7eore6gvuqrrF 
750w, https://miro.medium.com/v2/resize:fit:786/0*UnI7eore6gvuqrrF 786w, https://miro.medium.com/v2/resize:fit:828/0*UnI7eore6gvuqrrF 828w, https://miro.medium.com/v2/resize:fit:1100/0*UnI7eore6gvuqrrF 1100w, https://miro.medium.com/v2/resize:fit:1400/0*UnI7eore6gvuqrrF 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="366" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="3772" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Future Work</h1><p id="19ec" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">While our system is robust, we still have work to do in making it more reliable and enhancing its features. Some of that work includes:</p><ul class=""><li id="bafb" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qf pa pb bk"><strong class="my gv">Regional Rollups: </strong>Cross-region replication issues can result in missed events from other regions. An alternate strategy involves establishing a rollup table for each region, and then tallying them in a global rollup table. 
A key challenge in this design would be effectively communicating the clearing of the counter across regions.</li><li id="7818" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt qf pa pb bk"><strong class="my gv">Error Detection and Stale Counts</strong>: Excessively stale counts can occur if rollup events are lost or if a rollup fails and isn’t retried. This isn’t an issue for frequently accessed counters, as they remain in rollup circulation. The issue is more pronounced for counters that aren’t accessed frequently. Typically, the initial read for such a counter will trigger a rollup,<em class="oy"> self-remediating </em>the issue. However, for use cases that cannot accept potentially stale initial reads, we plan to implement improved error detection and utilize durable queues for resilient retries.</li></ul><h1 id="18c4" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Conclusion</h1><p id="64c0" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Distributed counting remains a challenging problem in computer science. In this blog, we explored multiple approaches to implementing and deploying a Counting service at scale. While there may be other methods for distributed counting, our goal has been to deliver blazing-fast performance at low infrastructure costs while maintaining high availability and providing idempotency guarantees. Along the way, we made various trade-offs to meet the diverse counting requirements at Netflix. 
We hope you found this blog post insightful.</p><p id="a883" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Stay tuned for <strong class="my gv">Part 3 </strong>of Composite Abstractions at Netflix, where we’ll introduce our <strong class="my gv">Graph Abstraction</strong>, a new service being built on top of the <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30">Key-Value Abstraction</a> <em class="oy">and</em> the <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8">TimeSeries Abstraction</a> to handle high-throughput, low-latency graphs.</p><h1 id="71bd" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Acknowledgments</h1><p id="78b5" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Special thanks to our stunning colleagues who contributed to the Counter Abstraction’s success: <a class="af nu" href="https://www.linkedin.com/in/joseph-lynch-9976a431/" rel="noopener ugc nofollow" target="_blank">Joey Lynch</a>, <a class="af nu" href="https://www.linkedin.com/in/vinaychella/" rel="noopener ugc nofollow" target="_blank">Vinay Chella</a>, <a class="af nu" href="https://www.linkedin.com/in/kaidanfullerton/" rel="noopener ugc nofollow" target="_blank">Kaidan Fullerton</a>, <a class="af nu" href="https://www.linkedin.com/in/tomdevoe/" rel="noopener ugc nofollow" target="_blank">Tom DeVoe</a>, <a class="af nu" href="https://www.linkedin.com/in/mengqingwang/" rel="noopener ugc nofollow" target="_blank">Mengqing Wang</a></p></div>]]></description>
      <link>https://netflixtechblog.com/netflixs-distributed-counter-abstraction-8d0c45eb66b2</link>
      <guid>https://netflixtechblog.com/netflixs-distributed-counter-abstraction-8d0c45eb66b2</guid>
      <pubDate>Tue, 12 Nov 2024 21:45:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Investigation of a Workbench UI Latency Issue]]></title>
      <description><![CDATA[<div><div></div><p id="c66a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">By: <a class="af nu" href="https://www.linkedin.com/in/hechaoli/" rel="noopener ugc nofollow" target="_blank">Hechao Li</a> and <a class="af nu" href="https://www.linkedin.com/in/mayworm/" rel="noopener ugc nofollow" target="_blank">Marcelo Mayworm</a></p><p id="2b76" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">With special thanks to our stunning colleagues <a class="af nu" href="https://www.linkedin.com/in/amer-ather-9071181/" rel="noopener ugc nofollow" target="_blank">Amer Ather</a>, <a class="af nu" href="https://www.linkedin.com/in/itaydafna" rel="noopener ugc nofollow" target="_blank">Itay Dafna</a>, <a class="af nu" href="https://www.linkedin.com/in/lucaepozzi/" rel="noopener ugc nofollow" target="_blank">Luca Pozzi</a>, <a class="af nu" href="https://www.linkedin.com/in/matheusdeoleao/" rel="noopener ugc nofollow" target="_blank">Matheus Leão</a>, and <a class="af nu" href="https://www.linkedin.com/in/yeji682/" rel="noopener ugc nofollow" target="_blank">Ye Ji</a>.</p><h1 id="c1c3" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Overview</h1><p id="072a" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At Netflix, the Analytics and Developer Experience organization, part of the Data Platform, offers a product called Workbench. Workbench is a remote development workspace based on<a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436"> Titus</a> that allows data practitioners to work with big data and machine learning use cases at scale. 
A common use case for Workbench is running<a class="af nu" href="https://jupyterlab.readthedocs.io/en/latest/" rel="noopener ugc nofollow" target="_blank"> JupyterLab</a> Notebooks.</p><p id="d5f7" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Recently, several users reported that their JupyterLab UI becomes slow and unresponsive when running certain notebooks. This document details the intriguing process of debugging this issue, all the way from the UI down to the Linux kernel.</p><h1 id="1fae" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Symptom</h1><p id="6f03" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Machine Learning engineer <a class="af nu" href="https://www.linkedin.com/in/lucaepozzi/" rel="noopener ugc nofollow" target="_blank">Luca Pozzi</a> reported to our Data Platform team that their <strong class="my gv">JupyterLab UI on their workbench becomes slow and unresponsive when running some of their Notebooks.</strong> Restarting the <em class="oy">ipykernel</em> process, which runs the Notebook, might temporarily alleviate the problem, but the frustration persists as more notebooks are run.</p><h1 id="c9b9" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Quantify the Slowness</h1><p id="35ea" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">While we observed the issue firsthand, the term “UI being slow” is subjective and difficult to measure. 
To investigate this issue, <strong class="my gv">we needed a quantitative analysis of the slowness</strong>.</p><p id="2efa" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><a class="af nu" href="https://www.linkedin.com/in/itaydafna" rel="noopener ugc nofollow" target="_blank">Itay Dafna</a> devised an effective and simple method to quantify the UI slowness. Specifically, we opened a terminal via JupyterLab and held down a key (e.g., “j”) for 15 seconds while running the user’s notebook. The input to stdin is sent to the backend (i.e., JupyterLab) via a WebSocket, and the output to stdout is sent back from the backend and displayed on the UI. We then exported the <em class="oy">.har</em> file recording all communications from the browser and loaded it into a Notebook for analysis.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa pb"><picture><img alt="image" class="bh md pm c" width="700" height="252" src="https://miro.medium.com/v2/resize:fit:1400/0*ltV3CYtNjLCzolXD" /></picture></div></div></figure><p id="e91b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Using this approach, we observed latencies ranging from 1 to 10 seconds, averaging 7.4 seconds.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa pn"><picture><img alt="image" class="bh md pm c" width="700" height="176" src="https://miro.medium.com/v2/resize:fit:1400/0*H7KW62J0jZKPTjQH" /></picture></div></div></figure><h1 id="6cd5" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Blame The Notebook</h1><p id="ef5b" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Now that we have an objective metric for the slowness, let’s officially start our investigation. If you have read the symptom carefully, you may have noticed that the slowness only occurs when the user runs <strong class="my gv">certain</strong> notebooks but not others.</p><p id="c042" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Therefore, the first step is scrutinizing the specific Notebook experiencing the issue. Why does the UI always slow down after running this particular Notebook? Naturally, you would think that there must be something wrong with the code running in it.</p><p id="cf81" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Upon closely examining the user’s Notebook, we noticed that a library called <em class="oy">pystan</em>, which provides Python bindings to a native C++ library called Stan, looked suspicious. Specifically, <em class="oy">pystan</em> uses <em class="oy">asyncio</em>. 
However, <strong class="my gv">because the Notebook process already runs an <em class="oy">asyncio</em> event loop and <em class="oy">asyncio</em> cannot be nested by design, the authors of <em class="oy">pystan</em> </strong><a class="af nu" href="https://pystan.readthedocs.io/en/latest/faq.html#how-can-i-use-pystan-with-jupyter-notebook-or-jupyterlab" rel="noopener ugc nofollow" target="_blank"><strong class="my gv">recommend</strong></a><strong class="my gv"> injecting <em class="oy">pystan</em> into the existing event loop using a package called </strong><a class="af nu" href="https://pypi.org/project/nest-asyncio/" rel="noopener ugc nofollow" target="_blank"><strong class="my gv"><em class="oy">nest_asyncio</em></strong></a>, a library that became unmaintained because <a class="af nu" href="https://github.com/erdewit/ib_insync/commit/ef5ea29e44e0c40bbadbc16c2281b3ac58aa4a40" rel="noopener ugc nofollow" target="_blank">its author unfortunately passed away</a>.</p><p id="de21" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Given this seemingly hacky usage, we naturally suspected that the events injected by <em class="oy">pystan</em> into the event loop were blocking the handling of the WebSocket messages used to communicate with the JupyterLab UI. This reasoning sounds very plausible. 
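</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As an aside, the nesting restriction that motivates <em class="oy">nest_asyncio</em> can be demonstrated in a few lines; this is a minimal sketch independent of <em class="oy">pystan</em> itself:</p>

```python
import asyncio

async def nested():
    # By design, asyncio refuses to start a second event loop
    # from inside an already-running one.
    try:
        asyncio.run(asyncio.sleep(0))
        return None
    except RuntimeError as e:
        return str(e)

msg = asyncio.run(nested())
print(msg)  # "asyncio.run() cannot be called from a running event loop"
```

This is exactly the situation inside a Notebook cell, where the kernel already runs an event loop.
<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">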
However, <strong class="my gv">the user reported cases where the UI also became slow when running a Notebook that did not use <em class="oy">pystan</em></strong>.</p><p id="ca77" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Moreover, after several rounds of discussion with ChatGPT, we learned more about the architecture and realized that, in theory, <strong class="my gv">the usage of <em class="oy">pystan</em> and <em class="oy">nest_asyncio</em> should not cause the slowness in handling the UI WebSocket</strong> for the following reasons:</p><p id="17ba" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Even though <em class="oy">pystan</em> uses <em class="oy">nest_asyncio</em> to inject itself into the main event loop, <strong class="my gv">the Notebook runs in a child process (i.e., the <em class="oy">ipykernel</em> process) of the <em class="oy">jupyter-lab</em> server process</strong>, which means the main event loop injected by <em class="oy">pystan</em> is that of the <em class="oy">ipykernel</em> process, not the <em class="oy">jupyter-server</em> process. Therefore, even if <em class="oy">pystan</em> blocks its event loop, it shouldn’t impact the <em class="oy">jupyter-lab</em> main event loop that is used for UI WebSocket communication. 
See the diagram below:</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa po"><picture><img alt="image" class="bh md pm c" width="700" height="591" src="https://miro.medium.com/v2/resize:fit:1400/0*DsQuZV5qnRXp5mVw" /></picture></div></div></figure><p id="c601" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In other words, <strong class="my gv"><em class="oy">pystan</em> events are injected into event loop B in this diagram, not event loop A</strong>. So they shouldn’t block the UI WebSocket events.</p><p id="1d8c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">You might also think that because event loop A handles both the WebSocket events from the UI and the ZeroMQ socket events from the <em class="oy">ipykernel</em> process, a high volume of ZeroMQ events generated by the notebook could block the WebSocket. 
However, <strong class="my gv">when we captured packets on the ZeroMQ socket while reproducing the issue, we didn’t observe heavy traffic on this socket that could cause such blocking</strong>.</p><p id="f5d9" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A stronger piece of evidence to rule out <em class="oy">pystan</em> was that we were ultimately able to reproduce the issue even without it, which I’ll dive into later.</p><h1 id="ccc2" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Blame Noisy Neighbors</h1><p id="87ad" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The Workbench instance runs as a <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436">Titus container</a>. To efficiently utilize our compute resources, <strong class="my gv">Titus employs a CPU oversubscription feature</strong>, meaning the combined virtual CPUs allocated to containers exceed the number of available physical CPUs on a Titus agent. <strong class="my gv">If a container is unfortunate enough to be scheduled alongside other “noisy” containers — those that consume a lot of CPU resources — it could suffer from CPU deficiency.</strong></p><p id="d99f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">However, after examining the CPU utilization of neighboring containers on the same Titus agent as the Workbench instance, as well as the overall CPU utilization of the Titus agent, we quickly ruled out this hypothesis. Using the top command on the Workbench, we observed that when running the Notebook, <strong class="my gv">the Workbench instance uses only 4 out of the 64 CPUs allocated to it</strong>. 
Simply put, <strong class="my gv">this workload is not CPU-bound.</strong></p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa pp"><picture><img alt="image" class="bh md pm c" width="700" height="252" src="https://miro.medium.com/v2/resize:fit:1400/0*YXsntKLiontnkNhf" /></picture></div></div></figure><h1 id="ac15" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Blame The Network</h1><p id="8e12" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The next theory was that the network between the web browser UI (on the laptop) and the JupyterLab server was slow. To investigate, we <strong class="my gv">captured all the packets between the laptop and the server</strong> while running the Notebook and continuously pressing ‘j’ in the terminal.</p><p id="0018" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">When the UI experienced delays, we observed a 5-second pause in packet transmission from server port 8888 to the laptop. Meanwhile, <strong class="my gv">traffic from other ports, such as port 22 for SSH, remained unaffected</strong>. 
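</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Pauses like this can also be spotted mechanically once packet timestamps are extracted from a capture. A small sketch (using synthetic timestamps, not our actual capture) that flags gaps between consecutive packets from a given source port:</p>

```python
def find_stalls(packets, port, threshold=2.0):
    """Return (start_time, gap) pairs where consecutive packets from
    `port` are separated by more than `threshold` seconds."""
    times = sorted(t for t, p in packets if p == port)
    return [(a, b - a) for a, b in zip(times, times[1:]) if b - a > threshold]

# Synthetic example: port 8888 pauses for 5 seconds, port 22 does not.
pkts = [(0.0, 8888), (0.5, 8888), (5.5, 8888), (0.2, 22), (1.2, 22), (2.2, 22)]
print(find_stalls(pkts, 8888))  # → [(0.5, 5.0)]
print(find_stalls(pkts, 22))    # → []
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">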
This led us to conclude that the pause was caused by the application running on port 8888 (i.e., the JupyterLab process) rather than the network.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa pq"><picture><img alt="image" class="bh md pm c" width="700" height="115" src="https://miro.medium.com/v2/resize:fit:1400/0*c660xBwF4XuCA8KN" /></picture></div></div></figure><h1 id="ff04" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">The Minimal Reproduction</h1><p id="b5d7" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">As previously mentioned, another strong piece of evidence proving the innocence of <em class="oy">pystan</em> was that we could reproduce the issue without it. By gradually stripping down the “bad” Notebook, we eventually arrived at a minimal snippet of code that reproduces the issue without any third-party dependencies or complex logic:</p><pre class="pc pd pe pf pg pr ps pt bp pu bb bk">import time<br />import os<br />from multiprocessing import Process<br /><br />N = os.cpu_count()<br /><br />def launch_worker(worker_id):<br />  time.sleep(60)<br /><br />if __name__ == '__main__':<br />  with open('/root/2GB_file', 'r') as file:<br />    data = file.read()<br />  processes = []<br />  for i in range(N):<br />    p = Process(target=launch_worker, args=(i,))<br />    processes.append(p)<br />    p.start()<br />  for p in processes:<br />    p.join()</pre><p id="04dc" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The code does only two things:</p><ol class=""><li id="2d46" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qa qb qc bk">Read a 2GB file into memory (the Workbench instance has 
480GB of memory in total, so this memory usage is almost negligible).</li><li id="9c35" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qa qb qc bk">Start N processes, where N is the number of CPUs. The N processes do nothing but sleep.</li></ol><p id="9ea8" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">There is no doubt that this is the silliest piece of code I’ve ever written. It is neither CPU-bound nor memory-bound. Yet <strong class="my gv">it can cause the JupyterLab UI to stall for as long as 10 seconds!</strong></p><h1 id="2dca" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Questions</h1><p id="24d9" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">There are a couple of interesting observations that raise several questions:</p><ul class=""><li id="bba4" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qi qb qc bk">We noticed that <strong class="my gv">both steps are required in order to reproduce the issue</strong>. If you don’t read the 2GB file (which is not even used!), the issue is not reproducible. <strong class="my gv">Why would using 2GB out of 480GB of memory impact performance?</strong></li><li id="3e91" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qi qb qc bk"><strong class="my gv">When the UI delay occurs, the <em class="oy">jupyter-lab</em> process CPU utilization spikes to 100%</strong>, hinting at contention on the single-threaded event loop in this process (event loop A in the diagram above). 
<strong class="my gv">What does the <em class="oy">jupyter-lab</em> process need the CPU for, given that it is not the process that runs the Notebook?</strong></li><li id="72c2" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qi qb qc bk">The code runs in a Notebook, which means it runs in the <em class="oy">ipykernel</em> process, which is a child process of the <em class="oy">jupyter-lab</em> process. <strong class="my gv">How can anything that happens in a child process cause the parent process to have CPU contention?</strong></li><li id="101d" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qi qb qc bk">The workbench has 64 CPUs. But when we printed <em class="oy">os.cpu_count()</em>, the output was 96. That means <strong class="my gv">the code starts more processes than the number of CPUs</strong>. <strong class="my gv">Why is that?</strong></li></ul><p id="9d59" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s answer the last question first. In fact, if you run the <em class="oy">lscpu</em> and <em class="oy">nproc</em> commands inside a Titus container, you will also see different results — the former gives you 96, which is the number of physical CPUs on the Titus agent, whereas the latter gives you 64, which is the number of virtual CPUs allocated to the container. This discrepancy is due to the lack of a “CPU namespace” in the Linux kernel, which causes the number of physical CPUs to be leaked to the container when certain functions are called to get the CPU count. The assumption here is that Python’s <strong class="my gv"><em class="oy">os.cpu_count()</em> uses the same underlying call as the <em class="oy">lscpu</em> command, causing it to return the CPU count of the host instead of the container</strong>. 
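</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The discrepancy is easy to see from Python on Linux: <em class="oy">os.cpu_count()</em> reports the host’s CPUs, while the scheduler affinity mask (which is what <em class="oy">nproc</em> consults) reflects the CPUs the process may actually use. A minimal sketch:</p>

```python
import os

# Host-wide CPU count: inside a cpuset-limited container this can
# report the machine's physical CPUs (e.g., 96) rather than the
# CPUs allocated to the container (e.g., 64).
host_cpus = os.cpu_count()

# The scheduler affinity mask is what `nproc` respects (Linux-only).
usable_cpus = len(os.sched_getaffinity(0))

print(host_cpus, usable_cpus)
```

Outside a container the two numbers typically agree; inside a CPU-limited container they diverge, which is the 96-vs-64 gap described above.
<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">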
Python 3.13 has <a class="af nu" href="https://docs.python.org/3.13/library/os.html#os.process_cpu_count" rel="noopener ugc nofollow" target="_blank">a new call that can be used to get the accurate CPU count</a>, but it has not been generally released yet.</p><p id="ea87" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As we will show later, this inaccurate CPU count turns out to be a contributing factor to the slowness.</p><h1 id="cdd1" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">More Clues</h1><p id="b480" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Next, we used <em class="oy">py-spy</em> to profile the <em class="oy">jupyter-lab</em> process. Note that we profiled the parent <em class="oy">jupyter-lab</em> process, <strong class="my gv">not</strong> the <em class="oy">ipykernel</em> child process that runs the reproduction code. 
The profiling result is as follows:</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa pq"><picture><img alt="image" class="bh md pm c" width="700" height="433" src="https://miro.medium.com/v2/resize:fit:1400/0*ho2C4015Disa8aFv" /></picture></div></div></figure><p id="55b0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As one can see, <strong class="my gv">a lot of CPU time (89%!!) is spent on a function called <em class="oy">__parse_smaps_rollup</em></strong>. In comparison, the terminal handler used only 0.47% of the CPU time. From the stack trace, we see that <strong class="my gv">this function runs inside event loop A</strong>, <strong class="my gv">so it can definitely cause the UI WebSocket events to be delayed</strong>.</p><p id="fd28" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The stack trace also shows that this function is ultimately called by a JupyterLab extension called <em class="oy">jupyter_resource_usage</em>. <strong class="my gv">We then disabled this extension and restarted the <em class="oy">jupyter-lab</em> process. As you may have guessed, we could no longer reproduce the slowness!</strong></p><p id="5e8f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">But our puzzle is not solved yet. Why does this extension cause the UI to slow down? 
Let’s keep digging.</p><h1 id="ff5d" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Root Cause Analysis</h1><p id="694f" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">From the name of the extension and the names of the other functions it calls, we can infer that this extension collects resource usage information such as CPU and memory. Examining the code, we see that this call stack is triggered when the API endpoint <em class="oy">/metrics/v1</em> is called from the UI. <strong class="my gv">The UI apparently calls this endpoint periodically</strong>, according to the network traffic tab in Chrome’s Developer Tools.</p><p id="5465" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Now let’s look at the implementation, starting from the call <em class="oy">get (jupyter_resource_usage/api.py:42)</em>. The full code is <a class="af nu" href="https://github.com/jupyter-server/jupyter-resource-usage/blob/6f15ef91d5c7e50853516b90b5e53b3913d2ed34/jupyter_resource_usage/api.py#L28" rel="noopener ugc nofollow" target="_blank">here</a>, and the key lines are shown below:</p><pre class="pc pd pe pf pg pr ps pt bp pu bb bk">cur_process = psutil.Process()<br />all_processes = [cur_process] + cur_process.children(recursive=True)<br /><br />for p in all_processes:<br />  info = p.memory_full_info()</pre><p id="1f1a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Basically, it recursively gets all child processes of the <em class="oy">jupyter-lab</em> process, including both the <em class="oy">ipykernel</em> Notebook process and all processes created by the Notebook. Obviously, <strong class="my gv">the cost of this function is linear in the number of child processes</strong>. In the reproduction code, we create 96 processes. 
So here we will have at least 96 (sleep processes) + 1 (<em class="oy">ipykernel</em> process) + 1 (<em class="oy">jupyter-lab</em> process) = 98 processes, when it should actually be 64 (allocated CPUs) + 1 (<em class="oy">ipykernel</em> process) + 1 (<em class="oy">jupyter-lab</em> process) = 66 processes, because the number of CPUs allocated to the container is, in fact, 64.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qj"><picture><img alt="image" class="bh md pm c" width="700" height="326" src="https://miro.medium.com/v2/resize:fit:1400/0*sHTjycVMUk1yVAsk" /></picture></div></div></figure><p id="6210" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This is truly ironic. <strong class="my gv">The more CPUs we have, the slower we are!</strong></p><p id="98c2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At this point, we have answered one question: <strong class="my gv">Why does starting many grandchild processes in the child process cause the parent process to be slow?</strong> Because the parent process periodically runs a function whose cost is linear in the number of all its descendant processes.</p><p id="1f1f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">However, this solves only half of the puzzle. If you remember the previous analysis, <strong class="my gv">starting many child processes ALONE doesn’t reproduce the issue</strong>. 
If we don’t read the 2GB file, even if we create 2x more processes, we can’t reproduce the slowness.</p><p id="147b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">So now we must answer the next question: <strong class="my gv">Why does reading a 2GB file in the child process affect the parent process’s performance, </strong>especially when the workbench has as much as 480GB of memory in total?</p><p id="49ac" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To answer this question, let’s look closely at the function <em class="oy">_parse_smaps_rollup</em>. As the name implies, <a class="af nu" href="https://github.com/giampaolo/psutil/blob/c034e6692cf736b5e87d14418a8153bb03f6cf42/psutil/_pslinux.py#L1978" rel="noopener ugc nofollow" target="_blank">this function</a> parses the file <em class="oy">/proc/&lt;pid&gt;/smaps_rollup</em>.</p><pre class="pc pd pe pf pg pr ps pt bp pu bb bk">def _parse_smaps_rollup(self):<br />  uss = pss = swap = 0<br />  with open_binary("{}/{}/smaps_rollup".format(self._procfs_path, self.pid)) as f:<br />    for line in f:<br />      if line.startswith(b"Private_"):<br />        # Private_Clean, Private_Dirty, Private_Hugetlb<br />        uss += int(line.split()[1]) * 1024<br />      elif line.startswith(b"Pss:"):<br />        pss = int(line.split()[1]) * 1024<br />      elif line.startswith(b"Swap:"):<br />        swap = int(line.split()[1]) * 1024<br />  return (uss, pss, swap)</pre><p id="6952" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Naturally, you might think that when memory usage increases, this file becomes larger in size, causing the function to take longer to parse. 
Unfortunately, this is not the answer because:</p><ul class=""><li id="2f67" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qi qb qc bk">First, <a class="af nu" href="https://www.kernel.org/doc/Documentation/ABI/testing/procfs-smaps_rollup" rel="noopener ugc nofollow" target="_blank"><strong class="my gv">the number of lines in this file is constant</strong></a><strong class="my gv"> for all processes</strong>.</li><li id="173a" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qi qb qc bk">Second, <strong class="my gv">this is a special file in the /proc filesystem, which should be seen as a kernel interface</strong> instead of a regular file on disk. In other words, <strong class="my gv">I/O operations of this file are handled by the kernel rather than disk</strong>.</li></ul><p id="5700" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This file was introduced in <a class="af nu" href="https://github.com/torvalds/linux/commit/493b0e9d945fa9dfe96be93ae41b4ca4b6fdb317#diff-cb79e2d6ea6f9627ff68d1342a219f800e04ff6c6fa7b90c7e66bb391b2dd3ee" rel="noopener ugc nofollow" target="_blank">this commit</a> in 2017, with the purpose of improving the performance of user programs that determine aggregate memory statistics. 
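The first point is easy to verify empirically. The short sketch below is our own illustrative check, assuming a Linux kernel new enough (4.14+) to expose <em class="oy">smaps_rollup</em>; it shows the file’s line count staying constant even while the process’s memory footprint grows:

```python
import os

def rollup_line_count(pid):
    # /proc/<pid>/smaps_rollup is generated by the kernel on each read;
    # it holds one aggregate counter per line (Rss, Pss, Private_*, Swap, ...).
    with open("/proc/{}/smaps_rollup".format(pid), "rb") as f:
        return sum(1 for _ in f)

before = rollup_line_count(os.getpid())
blob = bytearray(200 * 1024 * 1024)  # grow this process by ~200 MB
after = rollup_line_count(os.getpid())

# The file's *size* is constant; only the kernel-side cost of
# producing its contents grows with the memory the process maps.
assert before == after and before > 0
```

So the slowdown cannot come from parsing a bigger file; it has to come from the kernel work behind each read.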
Let’s first focus on <a class="af nu" href="https://elixir.bootlin.com/linux/v6.5.13/source/fs/proc/task_mmu.c#L1025" rel="noopener ugc nofollow" target="_blank">the handler of the <em class="oy">open</em> syscall</a> for <em class="oy">/proc/&lt;pid&gt;/smaps_rollup</em>.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><img alt="image" width="700" height="579" src="https://miro.medium.com/v2/resize:fit:1400/0*vGOD79Tleii7X22B" /></figure><p id="52cf" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Following through the <em class="oy">single_open</em> <a class="af nu" href="https://elixir.bootlin.com/linux/v6.5.13/source/fs/seq_file.c#L582" rel="noopener ugc nofollow" target="_blank">function</a>, we find that it uses the function <em class="oy">show_smaps_rollup</em> for the show operation, which corresponds to the <em class="oy">read</em> system call on the file. Next, we look at the <em class="oy">show_smaps_rollup</em> <a class="af nu" href="https://elixir.bootlin.com/linux/v6.5.13/source/fs/proc/task_mmu.c#L916" rel="noopener ugc nofollow" target="_blank">implementation</a>. 
You will notice <strong class="my gv">a do-while loop that iterates over every virtual memory area (VMA) of the process</strong>.</p><pre class="pc pd pe pf pg pr ps pt bp pu bb bk">static int show_smaps_rollup(struct seq_file *m, void *v) {<br />  …<br />  vma_start = vma-&gt;vm_start;<br />  do {<br />    smap_gather_stats(vma, &amp;mss, 0);<br />    last_vma_end = vma-&gt;vm_end;<br />    …<br />  } for_each_vma(vmi, vma);<br />  …<br />}</pre><p id="976c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This perfectly <strong class="my gv">explains why the function gets slower when a 2GB file is read into memory</strong>: <strong class="my gv">the handler for reading the <em class="oy">smaps_rollup</em> file now has to gather statistics over 2GB of additional mapped memory</strong>. Basically, even though <strong class="my gv"><em class="oy">smaps_rollup</em></strong> already improved the performance of getting memory information compared to the old method of parsing the <em class="oy">/proc/&lt;pid&gt;/smaps</em> file, <strong class="my gv">its cost is still linear in the amount of virtual memory used</strong>.</p><h1 id="d903" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">More Quantitative Analysis</h1><p id="3a6e" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Even though the puzzle is solved at this point, let’s conduct a more quantitative analysis. How much is the time difference when reading the <em class="oy">smaps_rollup</em> file with small versus large virtual memory utilization? 
Let’s write some simple benchmark code like below:</p><pre class="pc pd pe pf pg pr ps pt bp pu bb bk">import os<br /><br />def read_smaps_rollup(pid):<br />  with open("/proc/{}/smaps_rollup".format(pid), "rb") as f:<br />    for line in f:<br />      pass<br /><br />if __name__ == "__main__":<br />  pid = os.getpid()<br />  read_smaps_rollup(pid)<br />  with open("/root/2G_file", "rb") as f:<br />    data = f.read()<br />  read_smaps_rollup(pid)</pre><p id="56c3" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This program performs the following steps:</p><ol class=""><li id="d3b3" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qa qb qc bk">Reads the <em class="oy">smaps_rollup</em> file of the current process.</li><li id="2032" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qa qb qc bk">Reads a 2GB file into memory.</li><li id="7966" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qa qb qc bk">Repeats step 1.</li></ol><p id="12da" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We then use <em class="oy">strace</em> to measure the precise time spent reading the <em class="oy">smaps_rollup</em> file.</p><pre class="pc pd pe pf pg pr ps pt bp pu bb bk">$ sudo strace -T -e trace=openat,read python3 benchmark.py 2&gt;&amp;1 | grep "smaps_rollup" -A 1<br />openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 &lt;0.000023&gt;<br />read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 &lt;0.000259&gt;<br />...<br />openat(AT_FDCWD, "/proc/3107492/smaps_rollup", O_RDONLY|O_CLOEXEC) = 3 &lt;0.000029&gt;<br />read(3, "560b42ed4000-7ffdadcef000 ---p 0"..., 1024) = 670 &lt;0.027698&gt;</pre><p id="2e29" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As you can see, both times, the read <em 
class="oy">syscall</em> returned 670, meaning the file size remained the same at 670 bytes. However, <strong class="my gv">the second read took 0.027698 seconds, about 100x the 0.000259 seconds of the first read</strong>! This means that if there are 98 processes, the time spent on reading this file alone will be 98 * 0.027698 ≈ 2.7 seconds! Such a delay can significantly affect the UI experience.</p><h1 id="8c5e" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Solution</h1><p id="9ac7" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">This extension is used to display the CPU and memory usage of the notebook process on the bar at the bottom of the Notebook:</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><img alt="image" width="524" height="33" src="https://miro.medium.com/v2/resize:fit:1048/0*bNYMYTc5QQAxLyya" /></figure><p id="0389" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We confirmed with the user that disabling the <em class="oy">jupyter-resource-usage</em> extension meets their requirements for UI responsiveness, and that this extension is not critical to their use case. 
Therefore, we provided a way for them to disable the extension.</p><h1 id="2e46" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Summary</h1><p id="5cb4" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">This was a challenging issue that required debugging from the UI all the way down to the Linux kernel. It is fascinating that the problem scales with both the number of CPUs and the amount of virtual memory, two dimensions that are generally viewed separately.</p><p id="dde1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Overall, we hope you enjoyed the irony of:</p><ol class=""><li id="b7b8" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qa qb qc bk">An extension meant to monitor CPU usage causing CPU contention.</li><li id="93e2" class="mw mx gu my b mz qd nb nc nd qe nf ng nh qf nj nk nl qg nn no np qh nr ns nt qa qb qc bk">A case where the more CPUs you have, the slower you get!</li></ol><p id="dfe6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">If you’re excited by tackling such technical challenges and driving innovation, consider joining our <a class="af nu" href="https://explore.jobs.netflix.net/careers?query=Data+Platform&amp;pid=790298020581&amp;domain=netflix.com&amp;sort_by=relevance" rel="noopener ugc nofollow" target="_blank">Data Platform teams</a>. Be part of shaping the future of Data Security and Infrastructure, Data Developer Experience, Analytics Infrastructure and Enablement, and more. Explore the impact you can make with us!</p></div>]]></description>
      <link>https://netflixtechblog.com/investigation-of-a-workbench-ui-latency-issue-faa017b4653d</link>
      <guid>https://netflixtechblog.com/investigation-of-a-workbench-ui-latency-issue-faa017b4653d</guid>
      <pubDate>Mon, 14 Oct 2024 22:02:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Netflix TimeSeries Data Abstraction Layer]]></title>
      <description><![CDATA[<div><div></div><p id="b30e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><a class="af nu" href="https://www.linkedin.com/in/rajiv-shringi" rel="noopener ugc nofollow" target="_blank">Rajiv Shringi</a> <a class="af nu" href="https://www.linkedin.com/in/vinaychella/" rel="noopener ugc nofollow" target="_blank">Vinay Chella</a> <a class="af nu" href="https://www.linkedin.com/in/kaidanfullerton/" rel="noopener ugc nofollow" target="_blank">Kaidan Fullerton</a> <a class="af nu" href="https://www.linkedin.com/in/oleksii-tkachuk-98b47375/" rel="noopener ugc nofollow" target="_blank">Oleksii Tkachuk</a> <a class="af nu" href="https://www.linkedin.com/in/joseph-lynch-9976a431/" rel="noopener ugc nofollow" target="_blank">Joey Lynch</a></p><h1 id="b44d" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk"><strong class="al">Introduction</strong></h1><p id="2bf4" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">As Netflix continues to expand and diversify into various sectors like <strong class="my gv">Video on Demand</strong> and <strong class="my gv">Gaming</strong>, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital. In previous blog posts, we introduced the <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30"><strong class="my gv">Key-Value Data Abstraction Layer</strong></a> and the <a class="af nu" href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6" rel="noopener"><strong class="my gv">Data Gateway Platform</strong></a>, both of which are integral to Netflix’s data architecture. 
The Key-Value Abstraction offers a flexible, scalable solution for storing and accessing structured key-value data, while the Data Gateway Platform provides essential infrastructure for protecting, configuring, and deploying the data tier.</p><p id="f295" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Building on these foundational abstractions, we developed the <strong class="my gv">TimeSeries Abstraction</strong> — a versatile and scalable solution designed to efficiently store and query large volumes of temporal event data with low millisecond latencies, all in a cost-effective manner across various use cases.</p><p id="f9ce" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In this post, we will delve into the architecture, design principles, and real-world applications of the <strong class="my gv">TimeSeries Abstraction</strong>, demonstrating how it enhances our platform’s ability to manage temporal data at scale.</p><p id="f726" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Note: </strong><em class="oy">Contrary to what the name may suggest, this system is not built as a general-purpose time series database. We do not use it for metrics, histograms, timers, or any such near-real time analytics use case. Those use cases are well served by the Netflix </em><a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a"><em class="oy">Atlas</em></a><em class="oy"> telemetry system. 
Instead, we focus on addressing the challenge of storing and accessing extremely high-throughput, immutable temporal event data in a low-latency and cost-efficient manner.</em></p><h1 id="a578" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Challenges</h1><p id="ce8f" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At Netflix, temporal data is continuously generated and utilized, whether from user interactions like video-play events, asset impressions, or complex micro-service network activities. Effectively managing this data at scale to extract valuable insights is crucial for ensuring optimal user experiences and system reliability.</p><p id="11e4" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">However, storing and querying such data presents a unique set of challenges:</p><ul class=""><li id="ef3e" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">High Throughput</strong>: Managing up to 10 million writes per second while maintaining high availability.</li><li id="35ee" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Efficient Querying in Large Datasets</strong>: Storing petabytes of data while ensuring primary key reads return results within low double-digit milliseconds, and supporting searches and aggregations across multiple secondary attributes.</li><li id="6d28" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Global Reads and Writes</strong>: Facilitating read and write operations from anywhere in the world with adjustable consistency models.</li><li id="89d9" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my 
gv">Tunable Configuration</strong>: Offering the ability to partition datasets in either a single-tenant or multi-tenant datastore, with options to adjust various dataset aspects such as retention and consistency.</li><li id="78f2" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Handling Bursty Traffic</strong>: Managing significant traffic spikes during high-demand events, such as new content launches or regional failovers.</li><li id="7ea6" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Cost Efficiency</strong>: Reducing the cost per byte and per operation to optimize long-term retention while minimizing infrastructure expenses, which can amount to millions of dollars for Netflix.</li></ul><h1 id="264f" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">TimeSeries Abstraction</h1><p id="dc2a" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The TimeSeries Abstraction was developed to meet these requirements, built around the following core design principles:</p><ul class=""><li id="1e7d" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">Partitioned Data</strong>: Data is partitioned using a unique temporal partitioning strategy combined with an event bucketing approach to efficiently manage bursty workloads and streamline queries.</li><li id="caa3" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Flexible Storage</strong>: The service is designed to integrate with various storage backends, including <a class="af nu" href="https://cassandra.apache.org/_/index.html" rel="noopener ugc nofollow" target="_blank">Apache Cassandra</a> and <a class="af nu" 
href="https://www.elastic.co/elasticsearch" rel="noopener ugc nofollow" target="_blank">Elasticsearch</a>, allowing Netflix to customize storage solutions based on specific use case requirements.</li><li id="3db7" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Configurability</strong>: TimeSeries offers a range of tunable options for each dataset, providing the flexibility needed to accommodate a wide array of use cases.</li><li id="b931" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Scalability</strong>: The architecture supports both horizontal and vertical scaling, enabling the system to handle increasing throughput and data volumes as Netflix expands its user base and services.</li><li id="b382" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Sharded Infrastructure</strong>: Leveraging the <strong class="my gv">Data Gateway Platform</strong>, we can deploy single-tenant and/or multi-tenant infrastructure with the necessary access and traffic isolation.</li></ul><p id="2a47" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s dive into the various aspects of this abstraction.</p><h1 id="ea66" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Data Model</h1><p id="fbf3" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We follow a unique event data model that encapsulates all the data we want to capture for events, while allowing us to query them efficiently.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi pj"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*jl30Jl559Fnd29in%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*jl30Jl559Fnd29in%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*jl30Jl559Fnd29in%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*jl30Jl559Fnd29in%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*jl30Jl559Fnd29in%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*jl30Jl559Fnd29in%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*jl30Jl559Fnd29in%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*jl30Jl559Fnd29in 640w, https://miro.medium.com/v2/resize:fit:720/0*jl30Jl559Fnd29in 720w, https://miro.medium.com/v2/resize:fit:750/0*jl30Jl559Fnd29in 750w, https://miro.medium.com/v2/resize:fit:786/0*jl30Jl559Fnd29in 786w, https://miro.medium.com/v2/resize:fit:828/0*jl30Jl559Fnd29in 828w, https://miro.medium.com/v2/resize:fit:1100/0*jl30Jl559Fnd29in 1100w, https://miro.medium.com/v2/resize:fit:1400/0*jl30Jl559Fnd29in 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="342" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="7228" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Let’s start with the smallest unit of data in the abstraction and work our way up.</p><ul class=""><li id="cf78" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">Event Item</strong>: An event item is a key-value pair that users use to store data for a given event. For example: <em class="oy">{“device_type”: “ios”}</em>.</li><li id="55c6" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Event</strong>: An event is a structured collection of one or more such event items. An event occurs at a specific point in time and is identified by a client-generated timestamp and an event identifier (such as a UUID). This combination of <strong class="my gv">event_time</strong> and <strong class="my gv">event_id</strong> also forms part of the unique idempotency key for the event, enabling users to safely retry requests.</li><li id="9145" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Time Series ID</strong>: A <strong class="my gv">time_series_id</strong> is a collection of one or more such events over the dataset’s retention period. For instance, a <strong class="my gv">device_id</strong> would store all events occurring for a given device over the retention period. 
All events are immutable, and the TimeSeries service only ever appends events to a given time series ID.</li><li id="ef6e" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Namespace</strong>: A namespace is a collection of time series IDs and event data, representing the complete TimeSeries dataset. Users can create one or more namespaces for each of their use cases. The abstraction applies various tunable options at the namespace level, which we will discuss further when we explore the service’s control plane.</li></ul><h1 id="eda3" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">API</h1><p id="a85a" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The abstraction provides the following APIs to interact with the event data.</p><ul class=""><li id="8d2a" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">WriteEventRecordsSync</strong>: This endpoint writes a batch of events and sends back a durability acknowledgement to the client. This is used in cases where users require a guarantee of durability.</li><li id="9513" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">WriteEventRecords</strong>: This is the fire-and-forget version of the above endpoint. It enqueues a batch of events without the durability acknowledgement. 
This is used in cases like logging or tracing, where users care more about throughput and can tolerate a small amount of data loss.</li></ul><pre class="pk pl pm pn po pv pw px bp py bb bk">{<br />  "namespace": "my_dataset",<br />  "events": [<br />    {<br />      "timeSeriesId": "profile100",<br />      "eventTime": "2024-10-03T21:24:23.988Z",<br />      "eventId": "550e8400-e29b-41d4-a716-446655440000",<br />      "eventItems": [<br />        {<br />          "eventItemKey": "ZGV2aWNlVHlwZQ==",  <br />          "eventItemValue": "aW9z"<br />        },<br />        {<br />          "eventItemKey": "ZGV2aWNlTWV0YWRhdGE=",<br />          "eventItemValue": "c29tZSBtZXRhZGF0YQ=="<br />        }<br />      ]<br />    },<br />    {<br />      "timeSeriesId": "profile100",<br />      "eventTime": "2024-10-03T21:23:30.000Z",<br />      "eventId": "123e4567-e89b-12d3-a456-426614174000",<br />      "eventItems": [<br />        {<br />          "eventItemKey": "ZGV2aWNlVHlwZQ==",  <br />          "eventItemValue": "YW5kcm9pZA=="<br />        }<br />      ]<br />    }<br />  ]<br />}</pre><p id="4e72" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">ReadEventRecords</strong>: Given a combination of a namespace, a timeSeriesId, a timeInterval, and optional eventFilters, this endpoint returns all the matching events, sorted descending by event_time, with low millisecond latency.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">{<br />  "namespace": "my_dataset",<br />  "timeSeriesId": "profile100",<br />  "timeInterval": {<br />    "start": "2024-10-02T21:00:00.000Z",<br />    "end":   "2024-10-03T21:00:00.000Z"<br />  },<br />  "eventFilters": [<br />    {<br />      "matchEventItemKey": "ZGV2aWNlVHlwZQ==",<br />      "matchEventItemValue": "aW9z"<br />    }<br />  ],<br />  "pageSize": 100,<br />  "totalRecordLimit": 1000<br />}</pre><p id="d4d9" class="pw-post-body-paragraph mw mx gu 
my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">SearchEventRecords</strong>: Given search criteria and a time interval, this endpoint returns all the matching events. The use cases for this endpoint are fine with eventually consistent reads.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">{<br />  "namespace": "my_dataset",<br />  "timeInterval": {<br />    "start": "2024-10-02T21:00:00.000Z",<br />    "end": "2024-10-03T21:00:00.000Z"<br />  },<br />  "searchQuery": {<br />    "booleanQuery": {<br />      "searchQuery": [<br />        {<br />          "equals": {<br />            "eventItemKey": "deviceType",<br />            "eventItemValue": "aW9z"<br />          }<br />        },<br />        {<br />          "equals": {<br />            "eventItemKey": "deviceType",<br />            "eventItemValue": "YW5kcm9pZA=="<br />          }<br />        }<br />      ],<br />      "operator": "OR"<br />    }<br />  },<br />  "pageSize": 100,<br />  "totalRecordLimit": 1000<br />}</pre><p id="9863" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">AggregateEventRecords</strong>: Given search criteria and an aggregation mode (e.g., DistinctAggregation), this endpoint performs the given aggregation within a given time interval. 
Similar to the Search endpoint, users can tolerate eventual consistency and a potentially higher latency (in seconds).</p><pre class="pk pl pm pn po pv pw px bp py bb bk">{<br />  "namespace": "my_dataset",<br />  "timeInterval": {<br />    "start": "2024-10-02T21:00:00.000Z",<br />    "end": "2024-10-03T21:00:00.000Z"<br />  },<br />  "searchQuery": {...some search criteria...},<br />  "aggregationQuery": {<br />    "distinct": {<br />      "eventItemKey": "deviceType",<br />      "pageSize": 100<br />    }<br />  }<br />}</pre><p id="626b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In the subsequent sections, we will talk about how we interact with this data at the storage layer.</p><h1 id="006b" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Storage Layer</h1><p id="39dd" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The storage layer for TimeSeries comprises a primary data store and an optional index data store. The primary data store ensures data durability during writes and is used for primary read operations, while the index data store is utilized for search and aggregate operations. At Netflix, <strong class="my gv">Apache Cassandra</strong> is the preferred choice for storing durable data in high-throughput scenarios, while <strong class="my gv">Elasticsearch</strong> is the preferred data store for indexing. However, similar to our approach with the API, the storage layer is not tightly coupled to these specific data stores. 
Instead, we define storage API contracts that must be fulfilled, allowing us the flexibility to replace the underlying data stores as needed.</p><h1 id="6efb" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Primary Datastore</h1><p id="3689" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">In this section, we will talk about how we leverage <strong class="my gv">Apache Cassandra</strong> for TimeSeries use cases.</p><h2 id="9b30" class="qe nw gu bf nx qf qg dy ob qh qi ea of nh qj qk ql nl qm qn qo np qp qq qr qs bk">Partitioning Scheme</h2><p id="6241" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At Netflix’s scale, the continuous influx of event data can quickly overwhelm traditional databases. Temporal partitioning addresses this challenge by dividing the data into manageable chunks based on time intervals, such as hourly, daily, or monthly windows. This approach enables efficient querying of specific time ranges without the need to scan the entire dataset. It also allows Netflix to archive, compress, or delete older data efficiently, optimizing both storage and query performance. Additionally, this partitioning mitigates the performance issues typically associated with <a class="af nu" href="https://thelastpickle.com/blog/2019/01/11/wide-partitions-cassandra-3-11.html" rel="noopener ugc nofollow" target="_blank">wide partitions</a> in Cassandra. 
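</p><p class="pw-post-body-paragraph">As a rough illustration of how temporal partitioning narrows a query, here is a minimal Python sketch. The table names and the epoch-aligned, fixed-width slice layout are assumptions for illustration, not the service&#8217;s actual implementation; intervals follow the post&#8217;s start-inclusive, end-exclusive convention:</p>

```python
def slice_tables_for_range(start_ts: int, end_ts: int,
                           seconds_per_slice: int = 129_600) -> list[str]:
    """Return hypothetical Cassandra table names whose fixed-width,
    epoch-aligned time slices overlap the interval [start_ts, end_ts)."""
    first = start_ts // seconds_per_slice
    last = (end_ts - 1) // seconds_per_slice  # end-exclusive upper bound
    return [f"events_slice_{i * seconds_per_slice}" for i in range(first, last + 1)]
```

<p class="pw-post-body-paragraph">A query over a narrow window thus scans one or two tables instead of the whole dataset, and dropping a whole slice table retires its data without per-row deletes.</p><p class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">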
By employing this strategy, we can operate at much higher disk utilization, as it reduces the need to reserve large amounts of disk space for compactions, thereby saving costs.</p><p id="04ec" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Here is what it looks like:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*MxuEH6_pOVDcAMie" alt="" width="700" height="335" /></picture></figure><p id="ae71" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Time Slice: </strong>A time slice is the unit of data retention and maps directly to a Cassandra table. We create multiple such time slices, each covering a specific interval of time. An event lands in one of these slices based on the <strong class="my gv">event_time</strong>. 
These slices are joined with <em class="oy">no time gaps</em> in between, with operations being <em class="oy">start-inclusive</em> and <em class="oy">end-exclusive</em>, ensuring that all data lands in one of the slices.</p><p id="05ba" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Why not use row-based Time-To-Live (TTL)?</strong></p><p id="bd01" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Using TTL on individual events would generate a significant number of <a class="af nu" href="https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html" rel="noopener ugc nofollow" target="_blank">tombstones</a> in Cassandra, degrading performance, especially during range scans. By employing discrete time slices and dropping them, we avoid the tombstone issue entirely. The tradeoff is that data may be retained slightly longer than necessary, as an entire table’s time range must fall outside the retention window before it can be dropped. Additionally, TTLs are difficult to adjust later, whereas TimeSeries can extend the dataset retention instantly with a single control plane operation.</p><p id="fc7e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Time Buckets</strong>: Within a time slice, data is further partitioned into time buckets. This facilitates effective range scans by allowing us to target specific time buckets for a given query range. The tradeoff is that if a user wants to read the entire range of data over a large time period, we must scan many partitions. We mitigate potential latency by scanning these partitions in parallel and aggregating the data at the end. In most cases, the advantage of targeting smaller data subsets outweighs the read amplification from these scatter-gather operations. 
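</p><p class="pw-post-body-paragraph">The parallel scan-and-merge just described can be sketched in a few lines of Python. This is an illustrative outline only; the per-bucket query function and the event shape are assumptions, not the service&#8217;s actual code. Since each partition already returns events sorted descending by event_time, the final step is a merge of sorted runs rather than a full re-sort:</p>

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def query_bucket(bucket_rows):
    """Stand-in for one Cassandra partition read; each partition
    yields its events sorted descending by event_time."""
    return sorted(bucket_rows, key=lambda e: e["event_time"], reverse=True)

def scatter_gather(buckets):
    """Fan out one read per time/event bucket in parallel, then merge
    the per-partition results into one descending-by-event_time list."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(query_bucket, buckets))
    # heapq.merge combines already-sorted runs without re-sorting everything
    return list(heapq.merge(*partials, key=lambda e: e["event_time"], reverse=True))
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">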
Typically, users read a smaller subset of data rather than the entire retention range.</p><p id="17b2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Event Buckets</strong>: To manage extremely high-throughput write operations, which may result in a burst of writes for a given time series within a short period, we further divide the time bucket into event buckets. This prevents overloading the same partition for a given time range and also reduces partition sizes further, albeit with a slight increase in read amplification.</p><p id="ab2a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Note</strong>: <em class="oy">From Cassandra 4.x onwards, we have noticed a substantial improvement in the performance of scanning a range of data in a wide partition. See </em><strong class="my gv"><em class="oy">Future Enhancements</em></strong><em class="oy"> at the end for the </em><strong class="my gv"><em class="oy">Dynamic Event bucketing</em></strong><em class="oy"> work that aims to take advantage of this.</em></p><h2 id="e7cd" class="qe nw gu bf nx qf qg dy ob qh qi ea of nh qj qk ql nl qm qn qo np qp qq qr qs bk">Storage Tables</h2><p id="baaa" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We use two kinds of tables:</p><ul class=""><li id="1406" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">Data tables</strong>: These are the time slices that store the actual event data.</li><li id="a159" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Metadata table</strong>: This table stores information about how each time slice is configured <em class="oy">per namespace</em>.</li></ul><h2 id="51df" class="qe nw gu bf nx qf qg dy ob qh qi ea of nh qj qk ql nl qm qn qo np qp qq qr qs bk">Data tables</h2><figure class="pk pl pm pn po pp ph pi paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*ktuEBzveeK4f1mWH" alt="" width="700" height="430" /></picture></figure><p id="4c23" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The partition key enables splitting events for a <strong class="my gv">time_series_id</strong> over a range of <strong class="my gv">time_bucket(s)</strong> and <strong class="my gv">event_bucket(s)</strong>, thus mitigating hot partitions, while the clustering key allows us to keep data sorted on disk in the order we almost always want to read it. The <strong class="my gv">value_metadata</strong> column stores metadata for the <strong class="my gv">event_item_value</strong>, such as compression.</p><p id="654d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Writing to the data table:</strong></p><p id="524c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">User writes will land in a given time slice, time bucket, and event bucket as a factor of the <strong class="my gv">event_time</strong> attached to the event. 
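</p><p class="pw-post-body-paragraph">As a concrete sketch of that mapping, the following Python snippet derives an event&#8217;s placement from its <em>event_time</em> and <em>event_id</em>. The bucket widths mirror the sample namespace configuration shown later in the post, but the function name and the hashing scheme for event buckets are illustrative assumptions, not the service&#8217;s actual implementation:</p>

```python
import hashlib

def placement(event_time_s: int, event_id: str,
              seconds_per_slice: int = 129_600,   # width of a time slice
              seconds_per_bucket: int = 3_600,    # width of a time bucket
              event_buckets: int = 4) -> tuple[int, int, int]:
    """Return (time_slice, time_bucket, event_bucket) for an event.
    Hashing event_id (an assumed scheme) spreads a burst of writes for
    one time series across several event buckets / partitions."""
    time_slice = event_time_s // seconds_per_slice
    time_bucket = event_time_s // seconds_per_bucket
    digest = hashlib.sha256(event_id.encode()).digest()
    event_bucket = int.from_bytes(digest[:4], "big") % event_buckets
    return time_slice, time_bucket, event_bucket
```

<p class="pw-post-body-paragraph">Because the hash is deterministic, retries of the same event land in the same partition, which keeps the idempotency guarantees intact.</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">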
This factor is dictated by the control plane configuration of a given namespace.</p><p id="2892" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For example:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*P4IThIE_PE9F8KYi" alt="" width="700" height="48" /></picture></figure><p id="79ad" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Reading from the data table:</strong></p><p id="a48c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The illustration below depicts, at a high level, how we scatter-gather reads from multiple partitions and join the result sets at the end to return the final result.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*a805txbeIDqYP73d" alt="" width="700" height="494" /></picture></figure><h2 id="0c7b" class="qe nw gu bf nx qf qg dy ob qh qi ea of nh qj qk ql nl qm qn qo np qp qq qr qs bk">Metadata table</h2><p id="20b0" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">This table stores the configuration data about the time slices for a given namespace.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*asJFOjl1iwlSajJc" alt="" width="700" height="317" /></picture></figure><p id="6fce" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Note the following:</p><ul class=""><li id="217e" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">No Time Gaps</strong>: The end_time of a given time slice matches the start_time of the next time slice, ensuring all events find a home.</li><li id="bac3" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Retention</strong>: The status indicates which tables fall inside and outside of the retention window.</li><li id="e410" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Flexible</strong>: This metadata can be adjusted per time slice, allowing us to tune the partition settings of future time slices based on observed data patterns in the current time slice.</li></ul><p id="f612" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">There is a lot more information that can be stored in the <strong class="my gv">metadata</strong> column (e.g., compaction settings for the table), but we only show the partition settings here for brevity.</p><h1 id="a523" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Index Datastore</h1><p id="5bab" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox 
nr ns nt gn bk">To support secondary access patterns via non-primary key attributes, we index data into Elasticsearch. Users can configure a list of attributes per namespace that they wish to search and/or aggregate data on. The service extracts these fields from events as they stream in, indexing the resultant documents into Elasticsearch. Depending on the throughput, we may use Elasticsearch as a reverse index, retrieving the full data from Cassandra, or we may store the entire source data directly in Elasticsearch.</p><p id="4df3" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv">Note</strong>:<em class="oy"> Again, users are never directly exposed to Elasticsearch, just like they are not directly exposed to Cassandra. Instead, they interact with the Search and Aggregate API endpoints that translate a given query to that needed for the underlying datastore.</em></p><p id="3b37" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In the next section, we will talk about how we configure these data stores for different datasets.</p><h1 id="1edb" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Control Plane</h1><p id="1f56" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The data plane is responsible for executing the read and write operations, while the control plane configures every aspect of a namespace’s behavior. The data plane communicates with the TimeSeries control stack, which manages this configuration information. 
In turn, the TimeSeries control stack interacts with a sharded <strong class="my gv">Data Gateway Platform Control Plane</strong> that oversees control configurations for all abstractions and namespaces.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*aB6OKXoG-mT65Vh1" alt="" width="700" height="499" /></picture></figure><p id="7f9c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Separating the responsibilities of the data plane and control plane helps maintain the high availability of our data plane, as the control plane takes on tasks that may require some form of schema consensus from the underlying data stores.</p><h1 id="cd98" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Namespace Configuration</h1><p id="7d27" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The configuration snippet below demonstrates the immense flexibility of the service and how we can tune several things per namespace using our control plane.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">"persistence_configuration": [<br />  {<br />    "id": "PRIMARY_STORAGE",<br />    "physical_storage": {<br />      "type": "CASSANDRA",                  // type of primary storage<br />      "cluster": "cass_dgw_ts_tracing",     // physical cluster name<br />      "dataset": "tracing_default"          // maps to the keyspace<br />    },<br />    "config": {<br />      "timePartition": {<br />        "secondsPerTimeSlice": "129600",    // width of a time slice<br />        "secondsPerTimeBucket": "3600",    
 // width of a time bucket<br />        "eventBuckets": 4                   // event buckets per time bucket<br />      },<br />      "queueBuffering": {<br />        "coalesce": "1s",                   // how long to coalesce writes<br />        "bufferCapacity": 4194304           // queue capacity in bytes<br />      },<br />      "consistencyScope": "LOCAL",          // single-region/multi-region<br />      "consistencyTarget": "EVENTUAL",      // read/write consistency<br />      "acceptLimit": "129600s"              // how far back writes are allowed<br />    },<br />    "lifecycleConfigs": {<br />      "lifecycleConfig": [                  // Primary store data retention<br />        {<br />          "type": "retention",<br />          "config": {<br />            "close_after": "1296000s",      // close for reads/writes<br />            "delete_after": "1382400s"      // drop time slice<br />          }<br />        }<br />      ]<br />    }<br />  },<br />  {<br />    "id": "INDEX_STORAGE",<br />    "physical_storage": {<br />      "type": "ELASTICSEARCH",              // type of index storage<br />      "cluster": "es_dgw_ts_tracing",       // ES cluster name<br />      "dataset": "tracing_default_useast1"  // base index name<br />    },<br />    "config": {<br />      "timePartition": {<br />        "secondsPerSlice": "129600"         // width of the index slice<br />      },<br />      "consistencyScope": "LOCAL",<br />      "consistencyTarget": "EVENTUAL",      // how should we read/write data<br />      "acceptLimit": "129600s",             // how far back writes are allowed<br />      "indexConfig": {<br />        "fieldMapping": {                   // fields to extract to index<br />          "tags.nf.app": "KEYWORD",<br />          "tags.duration": "INTEGER",<br />          "tags.enabled": "BOOLEAN"<br />        },<br />        "refreshInterval": "60s"            // Index related settings<br />      }<br />    },<br />    "lifecycleConfigs": {<br />  
    "lifecycleConfig": [<br />        {<br />          "type": "retention",              // Index retention settings<br />          "config": {<br />            "close_after": "1296000s",<br />            "delete_after": "1382400s"<br />          }<br />        }<br />      ]<br />    }<br />  }<br />]</pre><h1 id="fcef" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Provisioning Infrastructure</h1><p id="85c4" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">With so many different parameters, we need automated provisioning workflows to deduce the best settings for a given workload. When users want to create their namespaces, they specify a list of <em class="oy">workload</em> <em class="oy">desires</em>, which the automation translates into concrete infrastructure and related control plane configuration. We highly encourage you to watch this <a class="af nu" href="https://www.youtube.com/watch?v=2aBVKXi8LKk" rel="noopener ugc nofollow" target="_blank">ApacheCon talk</a>, by one of our stunning colleagues <strong class="my gv">Joey Lynch,</strong> on how we achieve this. We may go into detail on this subject in one of our future blog posts.</p><p id="311d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Once the system provisions the initial infrastructure, it then scales in response to the user workload. The next section describes how this is achieved.</p><h1 id="5890" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Scalability</h1><p id="561f" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Our users may operate with limited information at the time of provisioning their namespaces, resulting in best-effort provisioning estimates. 
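</p><p class="pw-post-body-paragraph">To make the namespace configuration above more concrete, the timePartition values determine how an event's timestamp maps to a time slice, time bucket, and event bucket. The following is a rough, hypothetical sketch (not the actual TimeSeries implementation); the constants mirror the sample config:</p>

```python
# Illustrative sketch only: derive an event's partition coordinates from
# the sample "timePartition" values in the configuration snippet above.
SECONDS_PER_TIME_SLICE = 129_600   # width of a time slice
SECONDS_PER_TIME_BUCKET = 3_600    # width of a time bucket
EVENT_BUCKETS = 4                  # event buckets per time bucket

def partition_for(event_time_s: int, event_id: int):
    """Return (time_slice, time_bucket, event_bucket) for an event."""
    time_slice = event_time_s // SECONDS_PER_TIME_SLICE    # coarse range, immutable once closed
    time_bucket = event_time_s // SECONDS_PER_TIME_BUCKET  # finer range scans within a slice
    event_bucket = hash(event_id) % EVENT_BUCKETS          # spreads bursty writes for a key
    return time_slice, time_bucket, event_bucket
```

<p class="pw-post-body-paragraph">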
Further, evolving use-cases may introduce new throughput requirements over time. Here’s how we manage this:</p><ul class=""><li id="37e7" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">Horizontal scaling</strong>: TimeSeries server instances can auto-scale up and down as per attached scaling policies to meet the traffic demand. The storage server capacity can be recomputed to accommodate changing requirements using our <a class="af nu" href="https://github.com/Netflix-Skunkworks/service-capacity-modeling/tree/main/service_capacity_modeling" rel="noopener ugc nofollow" target="_blank">capacity planner</a>.</li><li id="6f26" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Vertical scaling</strong>: We may also choose to vertically scale our TimeSeries server instances or our storage instances to get greater CPU, RAM and/or attached storage capacity.</li><li id="f575" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Scaling disk</strong>: We may attach <a class="af nu" href="https://aws.amazon.com/ebs/" rel="noopener ugc nofollow" target="_blank">EBS</a> to store data if the capacity planner prefers infrastructure that offers larger storage at a lower cost rather than SSDs optimized for latency. In such cases, we deploy jobs to scale the EBS volume when the disk storage reaches a certain percentage threshold.</li><li id="e2bd" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Re-partitioning data</strong>: Inaccurate workload estimates can lead to over or under-partitioning of our datasets. TimeSeries control-plane can adjust the partitioning configuration for upcoming time slices, once we realize the nature of data in the wild (via partition histograms). 
In the future we plan to support re-partitioning of older data and dynamic partitioning of current data.</li></ul><h1 id="bdec" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Design Principles</h1><p id="b328" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">So far, we have seen how TimeSeries stores, configures and interacts with event datasets. Let’s see how we apply different techniques to improve the performance of our operations and provide better guarantees.</p><h1 id="737f" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Event Idempotency</h1><p id="83da" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We prefer to bake idempotency into all mutation endpoints, so that users can retry or hedge their requests safely. <a class="af nu" href="https://research.google/pubs/the-tail-at-scale/" rel="noopener ugc nofollow" target="_blank">Hedging</a> is when the client sends an identical competing request to the server if the original request does not return a response within an expected amount of time. The client then uses whichever response arrives first. This keeps the tail latencies for an application relatively low, and it can only be done safely if the mutations are idempotent. 
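</p><p class="pw-post-body-paragraph">The hedging pattern described above can be sketched as follows. This is an illustrative sketch, not the actual TimeSeries client; request_fn and hedge_after_s are hypothetical names:</p>

```python
# Illustrative client-side hedging sketch; request_fn and hedge_after_s
# are hypothetical names, not the actual TimeSeries client API.
import concurrent.futures as cf

def hedged_call(request_fn, hedge_after_s: float):
    """Send a request; if no response arrives within hedge_after_s,
    send an identical competing request and use whichever response
    comes back first. Only safe because the mutation is idempotent."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(request_fn)]
        done, _ = cf.wait(futures, timeout=hedge_after_s)
        if not done:  # original is late: fire the hedge request
            futures.append(pool.submit(request_fn))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()
```

<p class="pw-post-body-paragraph">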
For TimeSeries, the combination of <strong class="my gv">event_time</strong>, <strong class="my gv">event_id</strong> and <strong class="my gv">event_item_key</strong> form the idempotency key for a given <strong class="my gv">time_series_id</strong> event.</p><h1 id="9ffa" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">SLO-based Hedging</h1><p id="751c" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We assign Service Level Objectives (SLO) targets for different endpoints within TimeSeries, as an indication of what we think the performance of those endpoints should be <em class="oy">for a given namespace</em>. We can then hedge a request if the response does not come back in that configured amount of time.</p><pre class="pk pl pm pn po pv pw px bp py bb bk">"slos": {<br />  "read": {               // SLOs per endpoint<br />    "latency": {<br />      "target": "0.5s",   // hedge around this number<br />      "max": "1s"         // time-out around this number<br />    }<br />  },<br />  "write": {<br />    "latency": {<br />      "target": "0.01s",<br />      "max": "0.05s"<br />    }<br />  }<br />}</pre><h1 id="4152" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Partial Return</h1><p id="87eb" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Sometimes, a client may be sensitive to latency and willing to accept a partial result set. A real-world example of this is real-time frequency capping. Precision is not critical in this case, but if the response is delayed, it becomes practically useless to the upstream client. Therefore, the client prefers to work with whatever data has been collected so far rather than timing out while waiting for all the data. The TimeSeries client supports partial returns around SLOs for this purpose. 
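</p><p class="pw-post-body-paragraph">A minimal sketch of SLO-bounded partial returns (illustrative only; bucket_readers and deadline_s are hypothetical names): fan out the bucket reads, stop waiting at the deadline, and return whatever has arrived, ordered latest-first:</p>

```python
# Illustrative sketch of SLO-bounded partial returns; bucket_readers
# and deadline_s are hypothetical names, not the actual client API.
import concurrent.futures as cf

def partial_fetch(bucket_readers, deadline_s: float):
    """Fan out reads across partition buckets; at the deadline, return
    whatever events have arrived so far rather than timing out."""
    with cf.ThreadPoolExecutor(max_workers=max(1, len(bucket_readers))) as pool:
        futures = [pool.submit(reader) for reader in bucket_readers]
        done, not_done = cf.wait(futures, timeout=deadline_s)
        for f in not_done:
            f.cancel()  # best-effort: drop reads that missed the deadline
        events = [e for f in done for e in f.result()]
    # Maintain the latest-first ordering even for a partial result set.
    return sorted(events, key=lambda e: e["event_time"], reverse=True)
```

<p class="pw-post-body-paragraph">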
Importantly, we still maintain the latest order of events in this partial fetch.</p><h1 id="1d8b" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Adaptive Pagination</h1><p id="fabf" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">All reads start with a default fanout factor, scanning 8 partition buckets in parallel. However, if the service layer determines that the time_series dataset is dense — i.e., most reads are satisfied by reading the first few partition buckets — then it dynamically adjusts the fanout factor of future reads in order to reduce the read amplification on the underlying datastore. Conversely, if the dataset is sparse, we may want to increase this limit with a reasonable upper bound.</p><h1 id="7fcd" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Limited Write Window</h1><p id="d732" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">In most cases, the active range for writing data is smaller than the range for reading data — i.e., we want a range of time to become immutable as soon as possible so that we can apply optimizations on top of it. We control this by having a configurable “<strong class="my gv">acceptLimit</strong>” parameter that prevents users from writing events older than this time limit. For example, an accept limit of 4 hours means that users cannot write events older than <em class="oy">now() — 4 hours</em>. We sometimes raise this limit for backfilling historical data, but it is tuned back down for regular write operations. 
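</p><p class="pw-post-body-paragraph">The acceptLimit check described above can be sketched as follows (an illustrative sketch, not the actual service code):</p>

```python
# Illustrative sketch of the "acceptLimit" write-window check; in real
# code now_s would come from the server clock (e.g. time.time()).
ACCEPT_LIMIT_S = 4 * 3600  # e.g. an accept limit of 4 hours

def validate_write(event_time_s: float, now_s: float) -> None:
    """Reject events older than now() - acceptLimit."""
    if event_time_s < now_s - ACCEPT_LIMIT_S:
        raise ValueError("event older than acceptLimit; write rejected")
```

<p class="pw-post-body-paragraph">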
Once a range of data becomes immutable, we can safely do things like caching, compressing, and compacting it for reads.</p><h1 id="7223" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Buffering Writes</h1><p id="21e8" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We frequently leverage this service for handling bursty workloads. Rather than overwhelming the underlying datastore with this load all at once, we aim to distribute it more evenly by allowing events to coalesce over short durations (typically seconds). These events accumulate in in-memory queues running on each instance. Dedicated consumers then steadily drain these queues, grouping the events by their partition key, and batching the writes to the underlying datastore.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi pj"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*pMVe_h3daBDLWdis%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*pMVe_h3daBDLWdis%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*pMVe_h3daBDLWdis%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*pMVe_h3daBDLWdis%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*pMVe_h3daBDLWdis%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*pMVe_h3daBDLWdis%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*pMVe_h3daBDLWdis%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 
700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*pMVe_h3daBDLWdis 640w, https://miro.medium.com/v2/resize:fit:720/0*pMVe_h3daBDLWdis 720w, https://miro.medium.com/v2/resize:fit:750/0*pMVe_h3daBDLWdis 750w, https://miro.medium.com/v2/resize:fit:786/0*pMVe_h3daBDLWdis 786w, https://miro.medium.com/v2/resize:fit:828/0*pMVe_h3daBDLWdis 828w, https://miro.medium.com/v2/resize:fit:1100/0*pMVe_h3daBDLWdis 1100w, https://miro.medium.com/v2/resize:fit:1400/0*pMVe_h3daBDLWdis 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="322" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="fa07" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The queues are tailored to each datastore since their operational characteristics depend on the specific datastore being written to. 
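</p><p class="pw-post-body-paragraph">The buffering pattern above (coalesce events in an in-memory queue, then drain them grouped by partition key into datastore-sized batches) can be sketched as follows; drain_batches is a hypothetical helper, not the actual consumer code:</p>

```python
# Illustrative sketch of the buffering pattern; drain_batches is a
# hypothetical helper, not the actual consumer code.
from collections import defaultdict
from queue import Empty, Queue

def drain_batches(q: Queue, max_batch: int):
    """Drain coalesced events from an in-memory queue, grouping them by
    partition key into batches sized for the target datastore."""
    by_key = defaultdict(list)
    while True:
        try:
            key, event = q.get_nowait()
        except Empty:
            break
        by_key[key].append(event)  # FIFO order preserved per key
    batches = []
    for key, events in by_key.items():
        for i in range(0, len(events), max_batch):
            batches.append((key, events[i:i + max_batch]))
    return batches
```

<p class="pw-post-body-paragraph">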
For instance, the batch size for writing to Cassandra is significantly smaller than that for indexing into Elasticsearch, leading to different drain rates and batch sizes for the associated consumers.</p><p id="e4da" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">While using in-memory queues does increase JVM garbage collection, we have experienced substantial improvements by transitioning to JDK 21 with ZGC. To illustrate the impact, ZGC has reduced our tail latencies by an impressive 86%:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi pj"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*hj98LMk1UddaaDs-%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*hj98LMk1UddaaDs-%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*hj98LMk1UddaaDs-%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*hj98LMk1UddaaDs-%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*hj98LMk1UddaaDs-%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*hj98LMk1UddaaDs-%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*hj98LMk1UddaaDs-%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*hj98LMk1UddaaDs- 640w, https://miro.medium.com/v2/resize:fit:720/0*hj98LMk1UddaaDs- 720w, 
https://miro.medium.com/v2/resize:fit:750/0*hj98LMk1UddaaDs- 750w, https://miro.medium.com/v2/resize:fit:786/0*hj98LMk1UddaaDs- 786w, https://miro.medium.com/v2/resize:fit:828/0*hj98LMk1UddaaDs- 828w, https://miro.medium.com/v2/resize:fit:1100/0*hj98LMk1UddaaDs- 1100w, https://miro.medium.com/v2/resize:fit:1400/0*hj98LMk1UddaaDs- 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="345" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="e67f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Because we use in-memory queues, we are prone to losing events in case of an instance crash. As such, these queues are only used for use cases that can tolerate some amount of data loss, e.g., tracing/logging. For use cases that need guaranteed durability and/or read-after-write consistency, these queues are effectively disabled and writes are flushed to the data store almost immediately.</p><h1 id="3b16" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Dynamic Compaction</h1><p id="b018" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Once a time slice exits the active write window, we can leverage the immutability of the data to optimize it for read performance. 
This process may involve re-compacting immutable data using optimal compaction strategies, dynamically shrinking and/or splitting shards to optimize system resources, and other similar techniques to ensure fast and reliable performance.</p><p id="dbe4" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The following section provides a glimpse into the real-world performance of some of our TimeSeries datasets.</p><h1 id="dd6b" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Real-world Performance</h1><p id="1625" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The service can write data in the order of low single digit milliseconds</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi pj"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*VWrQj2ya5PQWusBq%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*VWrQj2ya5PQWusBq%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*VWrQj2ya5PQWusBq%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*VWrQj2ya5PQWusBq%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*VWrQj2ya5PQWusBq%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*VWrQj2ya5PQWusBq%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*VWrQj2ya5PQWusBq%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, 
(-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*VWrQj2ya5PQWusBq 640w, https://miro.medium.com/v2/resize:fit:720/0*VWrQj2ya5PQWusBq 720w, https://miro.medium.com/v2/resize:fit:750/0*VWrQj2ya5PQWusBq 750w, https://miro.medium.com/v2/resize:fit:786/0*VWrQj2ya5PQWusBq 786w, https://miro.medium.com/v2/resize:fit:828/0*VWrQj2ya5PQWusBq 828w, https://miro.medium.com/v2/resize:fit:1100/0*VWrQj2ya5PQWusBq 1100w, https://miro.medium.com/v2/resize:fit:1400/0*VWrQj2ya5PQWusBq 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="344" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="92ad" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">while consistently maintaining stable point-read latencies:</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi pj"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*23F_CzqsjMoI8GHB%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*23F_CzqsjMoI8GHB%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*23F_CzqsjMoI8GHB%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*23F_CzqsjMoI8GHB%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*23F_CzqsjMoI8GHB%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*23F_CzqsjMoI8GHB%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*23F_CzqsjMoI8GHB%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*23F_CzqsjMoI8GHB 640w, https://miro.medium.com/v2/resize:fit:720/0*23F_CzqsjMoI8GHB 720w, https://miro.medium.com/v2/resize:fit:750/0*23F_CzqsjMoI8GHB 750w, https://miro.medium.com/v2/resize:fit:786/0*23F_CzqsjMoI8GHB 786w, https://miro.medium.com/v2/resize:fit:828/0*23F_CzqsjMoI8GHB 828w, https://miro.medium.com/v2/resize:fit:1100/0*23F_CzqsjMoI8GHB 1100w, https://miro.medium.com/v2/resize:fit:1400/0*23F_CzqsjMoI8GHB 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="343" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="4890" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At the time of writing this blog, the service was processing close to <em class="oy">15 million events/second</em> across all the different datasets at peak globally.</p><figure class="pk pl pm pn po pp ph pi paragraph-image"><div role="button" tabindex="0" class="pq pr fj ps bh pt"><div class="ph pi qv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*dZFDUVX35Cj1MPOj%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*dZFDUVX35Cj1MPOj%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*dZFDUVX35Cj1MPOj%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*dZFDUVX35Cj1MPOj%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*dZFDUVX35Cj1MPOj%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*dZFDUVX35Cj1MPOj%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*dZFDUVX35Cj1MPOj%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*dZFDUVX35Cj1MPOj 640w, https://miro.medium.com/v2/resize:fit:720/0*dZFDUVX35Cj1MPOj 720w, 
https://miro.medium.com/v2/resize:fit:750/0*dZFDUVX35Cj1MPOj 750w, https://miro.medium.com/v2/resize:fit:786/0*dZFDUVX35Cj1MPOj 786w, https://miro.medium.com/v2/resize:fit:828/0*dZFDUVX35Cj1MPOj 828w, https://miro.medium.com/v2/resize:fit:1100/0*dZFDUVX35Cj1MPOj 1100w, https://miro.medium.com/v2/resize:fit:1400/0*dZFDUVX35Cj1MPOj 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pu c" width="700" height="366" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="8793" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Time Series Usage @ Netflix</h1><p id="3d90" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The TimeSeries Abstraction plays a vital role across key services at Netflix. 
Here are some impactful use cases:</p><ul class=""><li id="3163" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">Tracing and Insights: </strong>Logs traces across all apps and micro-services within Netflix, to understand service-to-service communication, aid in debugging of issues, and answer support requests.</li><li id="a327" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">User Interaction Tracking</strong>: Tracks millions of user interactions — such as video playbacks, searches, and content engagement — providing insights that enhance Netflix’s recommendation algorithms in real-time and improve the overall user experience.</li><li id="7ea2" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Feature Rollout and Performance Analysis</strong>: Tracks the rollout and performance of new product features, enabling Netflix engineers to measure how users engage with features, which powers data-driven decisions about future improvements.</li><li id="660c" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Asset Impression Tracking and Optimization</strong>: Tracks asset impressions ensuring content and assets are delivered efficiently while providing real-time feedback for optimizations.</li><li id="03cc" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Billing and Subscription Management:</strong> Stores historical data related to billing and subscription management, ensuring accuracy in transaction records and supporting customer service inquiries.</li></ul><p id="a1ea" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">and more…</p><h1 id="3624" class="nv nw gu bf 
nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Future Enhancements</h1><p id="451d" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">As the use cases evolve, and the need to make the abstraction even more cost effective grows, we aim to make many improvements to the service in the upcoming months. Some of them are:</p><ul class=""><li id="6046" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt oz pa pb bk"><strong class="my gv">Tiered Storage for Cost Efficiency: </strong>Support moving older, lesser-accessed data into cheaper object storage that has higher time to first byte, potentially saving Netflix millions of dollars.</li><li id="4073" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Dynamic Event Bucketing: </strong>Support real-time partitioning of keys into optimally-sized partitions as events stream in, rather than having a <em class="oy">somewhat</em> static configuration at the time of provisioning a namespace. This strategy has a huge advantage of <em class="oy">not</em> partitioning time_series_ids that don’t need it, thus saving the overall cost of read amplification. 
Also, with Cassandra 4.x, we have noted major improvements in reading a subset of data in a wide partition that could lead us to be less aggressive with partitioning the entire dataset ahead of time.</li><li id="07ae" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Caching: </strong>Take advantage of immutability of data and cache it intelligently for discrete time ranges.</li><li id="78ff" class="mw mx gu my b mz pc nb nc nd pd nf ng nh pe nj nk nl pf nn no np pg nr ns nt oz pa pb bk"><strong class="my gv">Count and other Aggregations: </strong>Some users are only interested in counting events in a given time interval rather than fetching all the event data for it.</li></ul><h1 id="d63b" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Conclusion</h1><p id="5ab6" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The TimeSeries Abstraction is a vital component of Netflix’s online data infrastructure, playing a crucial role in supporting both real-time and long-term decision-making. 
Whether it’s monitoring system performance during high-traffic events or optimizing user engagement through behavior analytics, TimeSeries Abstraction ensures that Netflix operates seamlessly and efficiently on a global scale.</p><p id="5764" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As Netflix continues to innovate and expand into new verticals, the TimeSeries Abstraction will remain a cornerstone of our platform, helping us push the boundaries of what’s possible in streaming and beyond.</p><p id="c076" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Stay tuned for Part 2, where we’ll introduce our <strong class="my gv">Distributed Counter Abstraction</strong>, a key element of <strong class="my gv">Netflix’s Composite Abstractions</strong>, built on top of the TimeSeries Abstraction.</p><h1 id="d6d2" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Acknowledgments</h1><p id="06af" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Special thanks to our stunning colleagues who contributed to TimeSeries Abstraction’s success: <a class="af nu" href="https://www.linkedin.com/in/tomdevoe/" rel="noopener ugc nofollow" target="_blank">Tom DeVoe</a> <a class="af nu" href="https://www.linkedin.com/in/mengqingwang/" rel="noopener ugc nofollow" target="_blank">Mengqing Wang</a>, <a class="af nu" href="https://www.linkedin.com/in/kartik894/" rel="noopener ugc nofollow" target="_blank">Kartik Sathyanarayanan</a>, <a class="af nu" href="https://www.linkedin.com/in/jordan-west-8aa1731a3/" rel="noopener ugc nofollow" target="_blank">Jordan West</a>, <a class="af nu" href="https://www.linkedin.com/in/matt-lehman-39549719b/" rel="noopener ugc nofollow" target="_blank">Matt Lehman</a>, <a class="af nu" 
href="https://www.linkedin.com/in/cheng-wang-10323417/" rel="noopener ugc nofollow" target="_blank">Cheng Wang</a>, <a class="af nu" href="https://www.linkedin.com/in/clohfink/" rel="noopener ugc nofollow" target="_blank">Chris Lohfink</a>.</p></div>]]></description>
      <link>https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8</link>
      <guid>https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8</guid>
      <pubDate>Tue, 08 Oct 2024 19:05:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Netflix’s Key-Value Data Abstraction Layer]]></title>
      <description><![CDATA[<div><div></div><p id="3a78" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><a class="af nu" href="https://www.linkedin.com/in/vidhya-arvind-11908723" rel="noopener ugc nofollow" target="_blank">Vidhya Arvind</a>, <a class="af nu" href="https://www.linkedin.com/in/rummadis/" rel="noopener ugc nofollow" target="_blank">Rajasekhar Ummadisetty</a>, <a class="af nu" href="https://jolynch.github.io/" rel="noopener ugc nofollow" target="_blank">Joey Lynch</a>, <a class="af nu" href="https://www.linkedin.com/in/vinaychella" rel="noopener ugc nofollow" target="_blank">Vinay Chella</a></p><h1 id="8c5d" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Introduction</h1><p id="8fe5" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At Netflix, our ability to deliver seamless, high-quality streaming experiences to millions of users hinges on robust, <em class="oy">global</em> backend infrastructure. Central to this infrastructure is our use of multiple online distributed databases such as <a class="af nu" href="https://cassandra.apache.org/" rel="noopener ugc nofollow" target="_blank">Apache Cassandra</a>, a NoSQL database known for its high availability and scalability. Cassandra serves as the backbone for a diverse array of use cases within Netflix, ranging from user sign-ups and storing viewing histories to supporting real-time analytics and live streaming.</p><p id="255e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Over time, as new key-value databases were introduced and service owners launched new use cases, we encountered numerous challenges with datastore misuse. First, developers struggled to reason about consistency, durability, and performance in this complex global deployment across multiple stores. 
Second, developers had to constantly re-learn new data modeling practices and common yet critical data access patterns. These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. Additionally, the tight coupling with multiple native database APIs — APIs that continually evolve and sometimes introduce backward-incompatible changes — resulted in org-wide engineering efforts to maintain and optimize our microservices’ data access.</p><p id="e872" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To overcome these challenges, we developed a holistic approach that builds upon our <a class="af nu" href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6" rel="noopener">Data Gateway Platform</a>. This approach led to the creation of several foundational abstraction services, the most mature of which is our Key-Value (KV) Data Abstraction Layer (DAL). 
This abstraction simplifies data access, enhances the reliability of our infrastructure, and enables us to support the broad spectrum of use cases that Netflix demands with minimal developer effort.</p><p id="92d2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In this post, we dive deep into how Netflix’s KV abstraction works, the architectural principles guiding its design, the challenges we faced in scaling diverse use cases, and the technical innovations that have allowed us to achieve the performance and reliability required by Netflix’s global operations.</p><h1 id="9a98" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk"><strong class="al">The Key-Value Service</strong></h1><p id="4bde" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The KV data abstraction service was introduced to solve the persistent challenges we faced with data access patterns in our distributed databases. Our goal was to build a versatile and efficient data storage solution that could handle a wide variety of use cases, ranging from the simplest hashmaps to more complex data structures, all while ensuring high availability, tunable consistency, and low latency.</p><h2 id="653a" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Data Model</h2><p id="adb9" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">At its core, the KV abstraction is built around a <strong class="my gv"><em class="oy">two-level map</em> </strong>architecture. The first level is a hashed string <strong class="my gv">ID</strong> (the primary key), and the second level is a <strong class="my gv"><em class="oy">sorted map of a key-value pair of bytes</em></strong>. 
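A minimal Python sketch of this two-level map (illustrative only; class and function names here are hypothetical, not the service's actual implementation):

```python
# Illustrative sketch of the KV two-level map: a string ID maps to a
# sorted map of byte keys to byte values.
from typing import Dict, List, Tuple

class Record:
    """Second level: a sorted map of key (bytes) -> value (bytes)."""
    def __init__(self) -> None:
        self._items: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._items[key] = value

    def get(self, key: bytes) -> bytes:
        return self._items[key]

    def scan(self, start: bytes, end: bytes) -> List[Tuple[bytes, bytes]]:
        # Keys are kept in sorted order, so range reads are a natural primitive.
        return [(k, self._items[k]) for k in sorted(self._items) if start <= k <= end]

# First level: a hash map of string IDs to Records.
store: Dict[str, Record] = {}

def put_items(record_id: str, items: List[Tuple[bytes, bytes]]) -> None:
    record = store.setdefault(record_id, Record())
    for key, value in items:
        record.put(key, value)
```

Range scans over the sorted second level are what make related items, such as time-ordered events under one ID, cheap to retrieve together.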
This model supports both simple and complex data models, balancing flexibility and efficiency.</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">HashMap&lt;String, SortedMap&lt;Bytes, Bytes&gt;&gt;</pre><p id="9c5f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For complex data models such as structured <code class="cx qc qd qe pu b">Records</code> or time-ordered <code class="cx qc qd qe pu b">Events</code>, this two-level approach handles hierarchical structures effectively, allowing related data to be retrieved together. For simpler use cases, it also represents flat key-value <code class="cx qc qd qe pu b">Maps</code> (e.g. <code class="cx qc qd qe pu b">id → {"" → value}</code>) or named <code class="cx qc qd qe pu b">Sets</code> (e.g.<code class="cx qc qd qe pu b">id → {key → ""}</code>). This adaptability allows the KV abstraction to be used in hundreds of diverse use cases, making it a versatile solution for managing both simple and complex data models in large-scale infrastructures like Netflix.</p><p id="e359" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The KV data can be visualized at a high level, as shown in the diagram below, where three records are shown.</p><figure class="po pp pq pr ps qi qf qg paragraph-image"><div role="button" tabindex="0" class="qj qk fj ql bh qm"><div class="qf qg qh"><picture><img 
src="https://miro.medium.com/v2/resize:fit:1400/0*9Ny8Uc-diSDnVGnk" alt="Three example records in the KV two-level map" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*9Ny8Uc-diSDnVGnk 640w, https://miro.medium.com/v2/resize:fit:1100/0*9Ny8Uc-diSDnVGnk 1100w, https://miro.medium.com/v2/resize:fit:1400/0*9Ny8Uc-diSDnVGnk 1400w" sizes="(min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md qn c" width="700" height="442" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><pre class="po pp pq pr ps pt pu pv bp pw bb bk">message Item (   <br />  Bytes    key,<br />  Bytes    value,<br />  Metadata metadata,<br />  Integer  chunk<br />)</pre><h2 id="9585" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Database Agnostic Abstraction</h2><p id="7654" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The KV abstraction is designed to hide the implementation details of the underlying database, offering a consistent interface to application developers regardless of the optimal storage system for that use case. While Cassandra is one example, the abstraction works with multiple data stores like <a class="af nu" href="https://github.com/Netflix/EVCache" rel="noopener ugc nofollow" target="_blank">EVCache</a>, <a class="af nu" href="https://aws.amazon.com/dynamodb/" rel="noopener ugc nofollow" target="_blank">DynamoDB</a>, <a class="af nu" href="https://rocksdb.org/" rel="noopener ugc nofollow" target="_blank">RocksDB</a>, etc…</p><p id="6b60" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For example, when implemented with Cassandra, the abstraction leverages Cassandra’s partitioning and clustering capabilities. 
The record <strong class="my gv"><em class="oy">ID</em></strong> acts as the partition key, and the item <strong class="my gv"><em class="oy">key</em></strong> as the clustering column:</p><figure class="po pp pq pr ps qi qf qg paragraph-image"><div role="button" tabindex="0" class="qj qk fj ql bh qm"><div class="qf qg qo"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*tMhXVTWqtHt24l1oflpAJQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*tMhXVTWqtHt24l1oflpAJQ.png" /><img alt="" class="bh md qn c" width="700" height="194" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="17d6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The corresponding Data Definition Language (DDL) for this structure in Cassandra is:</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">CREATE TABLE IF NOT EXISTS &lt;ns&gt;.&lt;table&gt; (<br />  id             text,<br />  key            blob,<br />  value          blob,<br />  value_metadata blob,PRIMARY KEY (id, key))<br />WITH CLUSTERING ORDER BY (key &lt;ASC|DESC&gt;)</pre><h2 id="dd6f" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Namespace: Logical and Physical Configuration</h2><p id="6e76" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">A <strong class="my gv">namespace</strong> defines where and how data is stored, providing logical and physical separation while abstracting the underlying storage systems. It also serves as central configuration of access patterns such as consistency or latency targets. Each namespace may use different backends: Cassandra, EVCache, or combinations of multiple. 
This flexibility allows our Data Platform to route different use cases to the most suitable storage system based on performance, durability, and consistency needs. Developers just provide their data problem rather than a database solution!</p><p id="898b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In this example configuration, the <code class="cx qc qd qe pu b">ngsegment</code> namespace is backed by both a Cassandra cluster and an EVCache caching layer, allowing for highly durable persistent storage <em class="oy">and</em> lower-latency point reads.</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">"persistence_configuration":[                                                   <br />  {                                                                           <br />    "id":"PRIMARY_STORAGE",                                                 <br />    "physical_storage": {                                                    <br />      "type":"CASSANDRA",                                                 <br />      "cluster":"cassandra_kv_ngsegment",                                <br />      "dataset":"ngsegment",                                             <br />      "table":"ngsegment",                                               <br />      "regions": ["us-east-1"],<br />      "config": {<br />        "consistency_scope": "LOCAL",<br />        "consistency_target": "READ_YOUR_WRITES"<br />      }                                            <br />    }                                                                       <br />  },                                                                          <br />  {                                                                           <br />    "id":"CACHE",                                                           <br />    "physical_storage": {                                                    <br />      "type":"CACHE",                 
                                    <br />      "cluster":"evcache_kv_ngsegment"                                   <br />     },                                                                      <br />     "config": {                                                              <br />       "default_cache_ttl": 180s                                             <br />     }                                                                       <br />  }                                                                           <br />] <br /></pre><h1 id="d313" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk"><strong class="al">Key APIs of the KV Abstraction</strong></h1><p id="0ffe" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">To support diverse use-cases, the KV abstraction provides four basic CRUD APIs:</p><h2 id="b9f2" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">PutItems <strong class="al">— Write one or more Items to a Record</strong></h2><p id="4d14" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The <code class="cx qc qd qe pu b">PutItems</code> API is an upsert operation, it can insert new data or update existing data in the two-level map structure.</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">message PutItemRequest (<br />  IdempotencyToken idempotency_token,<br />  string           namespace, <br />  string           id, <br />  List&lt;Item&gt;       items<br />)</pre><p id="d3db" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As you can see, the request includes the namespace, Record ID, one or more items, and an <strong class="my gv">idempotency token</strong> to ensure retries of the same write are safe. 
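A minimal sketch of how such a token makes retries safe, using a hypothetical in-memory server that remembers applied tokens (the real KV service de-duplicates in its backing storage):

```python
# Illustrative only: an idempotency token lets a hedged or retried PutItems
# apply exactly once, because the server recognizes a token it already saw.
import uuid

applied_tokens: set = set()   # tokens the (hypothetical) server has applied
table: dict = {}              # record_id -> {key: value}

def put_items(token: str, record_id: str, items: dict) -> bool:
    """Return True if the write was applied, False if it was a duplicate retry."""
    if token in applied_tokens:
        return False          # retry of a write we already applied: safe no-op
    applied_tokens.add(token)
    table.setdefault(record_id, {}).update(items)
    return True
```

A client retry simply reuses the same token, so the write lands once no matter how many attempts reach the server.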
Chunked data can be written by staging chunks and then committing them with appropriate metadata (e.g. number of chunks).</p><h2 id="d04a" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">GetItems <strong class="al">— Read one or more Items from a Record</strong></h2><p id="487c" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The <code class="cx qc qd qe pu b">GetItems</code>API provides a structured and adaptive way to fetch data using ID, predicates, and selection mechanisms. This approach balances the need to retrieve large volumes of data while meeting stringent Service Level Objectives (SLOs) for performance and reliability.</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">message GetItemsRequest (<br />  String              namespace,<br />  String              id,<br />  Predicate           predicate,<br />  Selection           selection,<br />  Map&lt;String, Struct&gt; signals<br />)</pre><p id="ed18" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The <code class="cx qc qd qe pu b">GetItemsRequest</code> includes several key parameters:</p><ul class=""><li id="f54d" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qp qq qr bk"><strong class="my gv">Namespace</strong>: Specifies the logical dataset or table</li><li id="a823" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Id</strong>: Identifies the entry in the top-level HashMap</li><li id="b7f8" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Predicate</strong>: Filters the matching items and can retrieve all items (<code class="cx qc qd qe pu b">match_all</code>), specific items (<code class="cx qc qd qe pu b">match_keys</code>), or a range (<code class="cx 
qc qd qe pu b">match_range</code>)</li><li id="a2e5" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Selection</strong>: Narrows returned responses for example <code class="cx qc qd qe pu b">page_size_bytes</code> for pagination, <code class="cx qc qd qe pu b">item_limit</code> for limiting the total number of items across pages and <code class="cx qc qd qe pu b">include</code>/<code class="cx qc qd qe pu b">exclude</code> to include or exclude large values from responses</li><li id="9b65" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Signals:</strong> Provides in-band signaling to indicate client capabilities, such as supporting client compression or chunking.</li></ul><p id="697f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The <code class="cx qc qd qe pu b">GetItemResponse</code> message contains the matching data:</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">message GetItemResponse (<br />  List&lt;Item&gt;       items,<br />  Optional&lt;String&gt; next_page_token<br />)</pre><ul class=""><li id="9d6e" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qp qq qr bk"><strong class="my gv">Items</strong>: A list of retrieved items based on the <code class="cx qc qd qe pu b">Predicate</code> and <code class="cx qc qd qe pu b">Selection</code> defined in the request.</li><li id="5617" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Next Page Token</strong>: An optional token indicating the position for subsequent reads if needed, essential for handling large data sets across multiple requests. 
Pagination is a critical component for efficiently managing data retrieval, especially when dealing with large datasets that could exceed typical response size limits.</li></ul><h2 id="a3d0" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk"><strong class="al">DeleteItems — Delete one or more Items from a Record</strong></h2><p id="b258" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The <code class="cx qc qd qe pu b">DeleteItems</code> API provides flexible options for removing data, including record-level, item-level, and range deletes — all while supporting idempotency.</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">message DeleteItemsRequest (<br />  IdempotencyToken idempotency_token,<br />  String           namespace,<br />  String           id,<br />  Predicate        predicate<br />)<br /></pre><p id="b5bd" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Just like in the <code class="cx qc qd qe pu b">GetItems</code> API, the <code class="cx qc qd qe pu b">Predicate</code> allows one or more Items to be addressed at once:</p><ul class=""><li id="4e42" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qp qq qr bk"><strong class="my gv">Record-Level Deletes (match_all)</strong>: Removes the entire record in constant latency regardless of the number of items in the record.</li><li id="0b32" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Item-Range Deletes (match_range)</strong>: This deletes a range of items within a Record. 
Useful for keeping the “n-newest” items or deleting a prefix path.</li><li id="7525" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Item-Level Deletes (match_keys)</strong>: Deletes one or more individual items.</li></ul><p id="4f76" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Some storage engines, such as Cassandra or any store that defers true deletion, struggle with high volumes of deletes due to tombstone and compaction overhead. Key-Value optimizes both record and range deletes to generate a single tombstone for the operation — you can learn more about tombstones in <a class="af nu" href="https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html" rel="noopener ugc nofollow" target="_blank">About Deletes and Tombstones</a>.</p><p id="3569" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Item-level deletes create many tombstones, but KV hides that storage engine complexity via <strong class="my gv">TTL-based deletes with jitter</strong>. Instead of deleting immediately, item metadata is marked as expired, with a randomly jittered TTL applied to stagger the actual deletions. This technique maintains read pagination protections. While this doesn’t completely solve the problem, it reduces load spikes and helps maintain consistent performance while compaction catches up. 
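The staggered expiry can be sketched as follows (field names and TTL values are hypothetical; the real service stores this as item metadata in the backing database):

```python
# Illustrative sketch of TTL-based deletes with jitter: items are hidden from
# reads immediately, but physical removal is staggered over a jitter window
# so compaction never sees a spike of tombstones at one instant.
import random
from typing import Dict, List

def mark_expired(keys: List[str], now: float,
                 base_ttl_s: float = 60.0, jitter_s: float = 30.0) -> Dict[str, float]:
    """Record a purge deadline per key instead of deleting right away."""
    return {k: now + base_ttl_s + random.uniform(0.0, jitter_s) for k in keys}

def purge_due(expiry: Dict[str, float], now: float) -> List[str]:
    """Keys whose jittered TTL has passed and can now be physically removed."""
    return [k for k, deadline in expiry.items() if deadline <= now]
```

Reads simply treat any key present in the expiry map as deleted, while the storage engine works through the purges at a smoothed rate.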
These strategies help maintain system performance, reduce read overhead, and meet SLOs by minimizing the impact of deletes.</p><h2 id="9f4f" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Complex Mutate and Scan APIs</h2><p id="d6e0" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Beyond simple CRUD on single Records, KV also supports complex multi-item and multi-record mutations and scans via <code class="cx qc qd qe pu b">MutateItems</code> and <code class="cx qc qd qe pu b">ScanItems</code> APIs. <code class="cx qc qd qe pu b">PutItems</code> also supports atomic writes of large blob data within a single <code class="cx qc qd qe pu b">Item</code> via a chunked protocol. These complex APIs require careful consideration to ensure predictable linear low-latency and we will share details on their implementation in a future post.</p><h1 id="4e6f" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Design Philosophies for reliable and predictable performance</h1><h2 id="c223" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Idempotency to fight tail latencies</h2><p id="d7cd" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">To ensure data integrity the <code class="cx qc qd qe pu b">PutItems</code> and <code class="cx qc qd qe pu b">DeleteItems</code> APIs use <strong class="my gv">idempotency tokens</strong>, which uniquely identify each mutative operation and guarantee that operations are logically executed in order, even when hedged or retried for latency reasons. 
This is especially crucial in last-write-wins databases like Cassandra, where ensuring the correct order and de-duplication of requests is vital.</p><p id="b8d5" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In the Key-Value abstraction, idempotency tokens contain a generation timestamp and a random nonce token. Either or both may be required by backing storage engines to de-duplicate mutations.</p><pre class="po pp pq pr ps pt pu pv bp pw bb bk">message IdempotencyToken (<br />  Timestamp generation_time,<br />  String    token<br />)</pre><p id="3089" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At Netflix, <strong class="my gv">client-generated monotonic tokens</strong> are preferred due to their reliability, especially in environments where network delays could impact server-side token generation. This combines a client-provided monotonic <code class="cx qc qd qe pu b">generation_time</code> timestamp with a 128-bit random UUID <code class="cx qc qd qe pu b">token</code>. Although clock-based token generation can suffer from clock skew, our tests on EC2 Nitro instances show drift is minimal (under 1 millisecond). In some cases that require stronger ordering, regionally unique tokens can be generated using tools like Zookeeper, or globally unique tokens such as transaction IDs can be used.</p><p id="abaa" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The following graphs illustrate the observed <a class="af nu" href="https://docs.google.com/document/d/1XLBjQ9scZCy-xIo51Rs--CSdFV781fnp5hXdXTBAk1k/edit" rel="noopener ugc nofollow" target="_blank">clock skew</a> on our Cassandra fleet, suggesting the safety of this technique on modern cloud VMs with direct access to high-quality clocks. 
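A client-side token generator along these lines might look like the following sketch (the monotonicity guard is our illustration, not Netflix's exact implementation):

```python
# Illustrative sketch: a client-generated idempotency token combining a
# monotonic generation timestamp with a 128-bit random UUID nonce.
import time
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class IdempotencyToken:
    generation_time_ms: int   # client clock, kept monotonic below
    token: str                # 128-bit random nonce (hex)

_last_ms = 0

def next_token() -> IdempotencyToken:
    """Never reuse a timestamp, even if the wall clock stalls or steps back."""
    global _last_ms
    now_ms = int(time.time() * 1000)
    _last_ms = max(_last_ms + 1, now_ms)
    return IdempotencyToken(_last_ms, uuid.uuid4().hex)
```

The timestamp gives last-write-wins stores an ordering, while the nonce distinguishes two mutations generated in the same millisecond.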
To further maintain safety, KV servers reject writes bearing tokens with large drift, preventing both silent write discard (a write timestamped far in the past) and immutable doomstones (a write timestamped far in the future) in storage engines vulnerable to those.</p><figure class="po pp pq pr ps qi qf qg paragraph-image"><div role="button" tabindex="0" class="qj qk fj ql bh qm"><div class="qf qg qx"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*gTmQpPIyZcKDb4Fb" alt="Observed clock skew across the Cassandra fleet" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*gTmQpPIyZcKDb4Fb 640w, https://miro.medium.com/v2/resize:fit:1100/0*gTmQpPIyZcKDb4Fb 1100w, 
https://miro.medium.com/v2/resize:fit:1400/0*gTmQpPIyZcKDb4Fb 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md qn c" width="700" height="685" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h2 id="7b42" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Handling Large Data through Chunking</h2><p id="4ff0" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Key-Value is also designed to efficiently handle large blobs, a common challenge for traditional key-value stores. Databases often face limitations on the amount of data that can be stored per key or partition. To address these constraints, KV uses transparent <strong class="my gv">chunking</strong> to manage large data efficiently.</p><p id="2392" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For items smaller than 1 MiB, data is stored directly in the main backing storage (e.g. Cassandra), ensuring fast and efficient access. However, for larger items, only the <strong class="my gv">id</strong>, <strong class="my gv">key</strong>, and <strong class="my gv">metadata</strong> are stored in the primary storage, while the actual data is split into smaller chunks and stored separately in chunk storage. 
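The split-versus-inline decision can be sketched as follows (the 1 MiB threshold comes from the post; the storage layout shown here is a hypothetical simplification):

```python
# Illustrative sketch of transparent chunking: small values stay inline in
# primary storage; large values are split into fixed-size chunks written to a
# separate chunk store, with only metadata (e.g. chunk count) kept inline.
CHUNK_SIZE = 1 * 1024 * 1024   # 1 MiB, per the post

def write_value(value: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
    if len(value) <= chunk_size:
        return {"inline": value}                       # fast path, no chunking
    chunks = [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
    return {"metadata": {"num_chunks": len(chunks)}, "chunks": chunks}

def read_value(stored: dict) -> bytes:
    if "inline" in stored:
        return stored["inline"]
    # Metadata tells the reader how many chunks to expect before reassembly.
    assert stored["metadata"]["num_chunks"] == len(stored["chunks"])
    return b"".join(stored["chunks"])
```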
This chunk storage can also be Cassandra but with a different partitioning scheme optimized for handling large values. The idempotency token ties all these writes together into one atomic operation.</p><p id="605e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">By splitting large items into chunks, we ensure that latency scales linearly with the size of the data, making the system both predictable and efficient. A future blog post will describe the <strong class="my gv">chunking architecture</strong> in more detail, including its intricacies and optimization strategies.</p><h2 id="2b4f" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Client-Side Compression</h2><p id="5ad2" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The KV abstraction leverages client-side payload compression to optimize performance, especially for large data transfers. While many databases offer server-side compression, handling compression on the client side reduces expensive server CPU usage, network bandwidth, and disk I/O. In one of our deployments, which helps power Netflix’s search, enabling client-side compression reduced payload sizes by 75%, significantly improving cost efficiency.</p><h2 id="5805" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Smarter Pagination</h2><p id="739c" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We chose payload size in bytes as the limit per response page rather than the number of items because it allows us to provide predictable operation SLOs. For instance, we can provide a single-digit millisecond SLO on a 2 MiB page read. Conversely, using the number of items per page as the limit would result in unpredictable latencies due to significant variations in item size. 
A request for 10 items per page could result in vastly different latencies if each item was 1 KiB versus 1 MiB.</p><p id="21ce" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Using bytes as a limit poses challenges, as few backing stores support byte-based pagination; most data stores paginate by result count (e.g., DynamoDB and Cassandra limit by number of items or rows). To address this, we use a static item limit for the initial queries to the backing store and process the results. If more data is needed to meet the byte limit, additional queries are executed until the byte limit is met; any excess results are discarded and a page token is generated.</p><p id="51d1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This static limit can lead to inefficiencies: one large item in the result may cause us to discard many results, while small items may require multiple iterations to fill a page, resulting in read amplification. To mitigate these issues, we implemented <em class="oy">adaptive</em> pagination, which dynamically tunes the limits based on observed data.</p><h2 id="6f49" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Adaptive Pagination</h2><p id="d9f6" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">When an initial request is made, a query is executed in the storage engine, and the results are retrieved. As the consumer processes these results, the system tracks the number of items consumed and the total size used. This data helps calculate an approximate item size, which is stored in the page token.
For subsequent page requests, this stored information allows the server to apply the appropriate limits to the underlying storage, reducing unnecessary work and minimizing read amplification.</p><p id="c57f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">While this method is effective for follow-up page requests, what happens with the initial request? In addition to storing item size information in the page token, the server also estimates the average item size for a given namespace and caches it locally. This cached estimate helps the server set a more optimal limit on the backing store for the initial request, improving efficiency. The server continuously adjusts this limit based on recent query patterns or other factors to keep it accurate. For subsequent pages, the server uses both the cached data and the information in the page token to fine-tune the limits.</p><figure class="po pp pq pr ps qi qf qg paragraph-image"><div class="qf qg qy"><img src="https://miro.medium.com/v2/resize:fit:1400/0*yg8xyQEoEmvKYoOV" alt="Adaptive pagination" width="700" height="559" /></div></figure><p id="e11f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In addition to adaptive pagination, a mechanism is in place to send a response early if the server detects that processing the request is at risk of exceeding the request’s latency SLO.</p><p id="8c43" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For example, let us assume a client submits a <code class="cx qc qd qe pu b">GetItems</code> request with a per-page limit of 2 MiB and a maximum end-to-end
latency limit of 500ms. While processing this request, the server retrieves data from the backing store. This particular record has thousands of small items, so it would normally take longer than the 500ms SLO to gather the full page of data. If this happens, the client would receive an SLO violation error, causing the request to fail even though nothing exceptional occurred. To prevent this, the server tracks the elapsed time while fetching data. If it determines that continuing to retrieve more data might breach the SLO, the server will stop processing further results and return a response with a pagination token.</p><figure class="po pp pq pr ps qi qf qg paragraph-image"><div class="qf qg qz"><img src="https://miro.medium.com/v2/resize:fit:1400/0*hEkIfkUJ4KDnbbGx" alt="Early response when nearing the latency SLO" width="700" height="296" /></div></figure><p id="858e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This approach ensures that requests are processed within the SLO, even if the full page size isn’t met, giving clients predictable progress.
Furthermore, if the client is a gRPC server with proper deadlines, the client is smart enough not to issue further requests, reducing useless work.</p><p id="c909" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">If you want to know more, the <a class="af nu" href="https://www.infoq.com/articles/netflix-highly-reliable-stateful-systems/" rel="noopener ugc nofollow" target="_blank">How Netflix Ensures Highly-Reliable Online Stateful Systems</a> article discusses these and many other techniques in further detail.</p><h2 id="c0c5" class="oz nw gu bf nx pa pb dy ob pc pd ea of nh pe pf pg nl ph pi pj np pk pl pm pn bk">Signaling</h2><p id="2e5c" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">KV uses in-band messaging that we call <em class="oy">signaling</em>, which allows dynamic configuration of the client and enables it to communicate its capabilities to the server. This ensures that configuration settings and tuning parameters can be exchanged seamlessly between the client and server. Without signaling, the client would need static configuration — requiring a redeployment for each change — or, with dynamic configuration, would require coordination with the client team.</p><p id="f164" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For server-side signals, when the client is initialized, it sends a handshake to the server. The server responds with signals, such as target or max latency SLOs, allowing the client to dynamically adjust timeouts and hedging policies. Handshakes are then made periodically in the background to keep the configuration current.
For client-communicated signals, the client includes its capabilities with each request, such as whether it can handle compression, chunking, and other features.</p><figure class="po pp pq pr ps qi qf qg paragraph-image"><div class="qf qg ra"><img src="https://miro.medium.com/v2/resize:fit:1400/0*sVOLoSeIKpzDMQ5N" alt="Signaling between client and server" width="700" height="264" /></div></figure><h1 id="ea21" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">KV Usage @ Netflix</h1><p id="18e7" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">The KV abstraction powers several key Netflix use cases, including:</p><ul class=""><li id="f38f" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qp qq qr bk"><strong class="my gv">Streaming Metadata</strong>: High-throughput, low-latency access to streaming metadata, ensuring personalized content delivery in real time.</li><li id="555b" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">User Profiles</strong>: Efficient storage and retrieval of user preferences and history, enabling seamless, personalized experiences across devices.</li><li id="61d8" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Messaging</strong>: Storage and retrieval of the <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/pushy-to-the-limit-evolving-netflixs-websocket-proxy-for-the-future-b468bc0ff658">push registry</a> for messaging needs, enabling millions of requests to flow through.</li><li id="bcc9" class="mw
mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Real-Time Analytics</strong>: Persists large-scale impression data and provides insights into user behavior and system performance, <a class="af nu" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/bulldozer-batch-data-moving-from-data-warehouse-to-online-key-value-stores-41bac13863f8">moving data from offline to online</a> and vice versa.</li></ul><h1 id="332b" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Future Enhancements</h1><p id="6a18" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Looking forward, we plan to enhance the KV abstraction with:</p><ul class=""><li id="27c2" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qp qq qr bk"><strong class="my gv">Lifecycle Management</strong>: Fine-grained control over data retention and deletion.</li><li id="3282" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Summarization</strong>: Techniques to improve retrieval efficiency by summarizing records with many items into fewer backing rows.</li><li id="249b" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">New Storage Engines</strong>: Integration with more storage systems to support new use cases.</li><li id="1ebe" class="mw mx gu my b mz qs nb nc nd qt nf ng nh qu nj nk nl qv nn no np qw nr ns nt qp qq qr bk"><strong class="my gv">Dictionary Compression</strong>: Further reducing data size while maintaining performance.</li></ul><h1 id="496e" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Conclusion</h1><p id="b19f" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no
np ox nr ns nt gn bk">The Key-Value service at Netflix is a flexible, cost-effective solution that supports a wide range of data patterns and use cases, from low- to high-traffic scenarios, including critical Netflix streaming workloads. The simple yet robust design allows it to handle diverse data models like HashMaps, Sets, Event storage, Lists, and Graphs. It abstracts the complexity of the underlying databases from our developers, which enables our application engineers to focus on solving business problems instead of becoming experts in every storage engine and its distributed <a class="af nu" href="https://jepsen.io/consistency" rel="noopener ugc nofollow" target="_blank">consistency models</a>. As Netflix continues to innovate in online datastores, the KV abstraction remains a central component in managing data efficiently and reliably at scale, ensuring a solid foundation for future growth.</p><p id="683f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><strong class="my gv"><em class="oy">Acknowledgments:</em></strong><em class="oy"> Special thanks to our stunning colleagues who contributed to Key Value’s success: </em><a class="af nu" href="https://www.linkedin.com/in/william-schor/" rel="noopener ugc nofollow" target="_blank"><em class="oy">William Schor</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/mengqingwang/" rel="noopener ugc nofollow" target="_blank"><em class="oy">Mengqing Wang</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/cthumuluru/" rel="noopener ugc nofollow" target="_blank"><em class="oy">Chandrasekhar Thumuluru</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/john-l-693b7915a/" rel="noopener ugc nofollow" target="_blank"><em class="oy">John Lu</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/georgecampbell/" rel="noopener ugc nofollow"
target="_blank"><em class="oy">George Campbell</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/akhaku/" rel="noopener ugc nofollow" target="_blank"><em class="oy">Ammar Khaku</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/jordan-west-8aa1731a3/" rel="noopener ugc nofollow" target="_blank"><em class="oy">Jordan West</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/clohfink/" rel="noopener ugc nofollow" target="_blank"><em class="oy">Chris Lohfink</em></a><em class="oy">, </em><a class="af nu" href="https://www.linkedin.com/in/matt-lehman-39549719b/" rel="noopener ugc nofollow" target="_blank"><em class="oy">Matt Lehman</em></a><em class="oy">, and the whole online datastores team (ODS, f.k.a. CDE).</em></p></div>]]></description>
      <link>https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30</link>
      <guid>https://netflixtechblog.com/introducing-netflixs-key-value-data-abstraction-layer-1ea8a0a11b30</guid>
      <pubDate>Thu, 19 Sep 2024 00:49:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Pushy to the Limit: Evolving Netflix’s WebSocket proxy for the future]]></title>
<description><![CDATA[<div><div></div><p id="7b9e" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">By </em><a class="af nv" href="https://www.linkedin.com/in/kyagna/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Karthik Yagna</em></a><em class="nu">, </em><a class="af nv" href="https://www.linkedin.com/in/baskar-o-n-46477b3/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Baskar Odayarkoil</em></a><em class="nu">, and </em><a class="af nv" href="https://www.linkedin.com/in/alexander-ellis/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Alex Ellis</em></a></p><p id="e6ee" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Pushy is Netflix’s WebSocket server that maintains persistent WebSocket connections with devices running the Netflix application. This allows data to be sent to the device from backend services on demand, without the need for continual polling from the device. Over the last few years, Pushy has seen tremendous growth, evolving from its role as a best-effort message delivery service to become an integral part of the Netflix ecosystem. This post describes how we’ve grown and scaled Pushy to meet its new and future needs, as it handles hundreds of millions of concurrent WebSocket connections, delivers hundreds of thousands of messages per second, and maintains a steady 99.999% message delivery reliability rate.</p><h1 id="c697" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">History &amp; motivation</h1><p id="2131" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">There were two main motivating use cases that drove Pushy’s initial development and usage.
The first was voice control, where you can play a title or search using your virtual assistant with a voice command like “Show me Stranger Things on Netflix.” (See <a class="af nv" href="https://help.netflix.com/en/node/111997" rel="noopener ugc nofollow" target="_blank"><em class="nu">How to use voice controls with Netflix</em></a> if you want to do this yourself!).</p><p id="3c8c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">If we consider the Alexa use case, we can see how this partnership with Amazon enabled this to work. Once they receive the voice command, we allow them to make an authenticated call through <a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/open-sourcing-zuul-2-82ea476cb2b3">apiproxy</a>, our streaming edge proxy, to our internal voice service. This call includes metadata, such as the user’s information and details about the command, such as the specific show to play. The voice service then constructs a message for the device and places it on the message queue, which is then processed and sent to Pushy to deliver to the device. Finally, the device receives the message, and the action, such as “Show me Stranger Things on Netflix”, is performed. 
This initial functionality was built out for FireTVs and was expanded from there.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div class="oz pa pb"><img src="https://miro.medium.com/v2/resize:fit:1400/0*WQ1W30ChfWrEmmR5" alt="Sample system diagram for an Alexa voice command, with the voice command entering Netflix’s cloud infrastructure via apiproxy and exiting via a server-side message through Pushy to the device." width="700" height="347" /></div><figcaption class="pn ff po oz pa pp pq bf b bg z du"><em class="pr">Sample system diagram for an Alexa voice command. Where AWS ends and the internet begins is an exercise left to the reader.</em></figcaption></figure><p id="022f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The other main use case was RENO, the Rapid Event Notification System mentioned above. Before the integration with Pushy, the TV UI would continuously poll a backend service to see if there were any row updates to get the latest information. These requests would happen every few seconds, which ended up creating extraneous requests to the backend and were costly for devices, which are frequently resource constrained. The integration with WebSockets and Pushy alleviated both of these points, allowing the origin service to send row updates as they were ready, resulting in lower request rates and cost savings.</p><p id="63a0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For more background on Pushy, you can see <a class="af nv" href="https://www.youtube.com/watch?v=6w6E_B55p0E" rel="noopener ugc nofollow" target="_blank">this InfoQ talk by Susheel Aroskar</a>.
Since that presentation, Pushy has grown in both size and scope, and this article discusses the investments we’ve made to evolve Pushy for the next generation of features.</p><h1 id="2f72" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Client Reach</h1><p id="5d47" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">This integration was initially rolled out for Fire TVs, PS4s, Samsung TVs, and LG TVs, leading to a reach of about 30 million candidate devices. With these clear benefits, we continued to build out this functionality for more devices, enabling the same efficiency wins. As of today, we’ve expanded our list of candidate devices even further to nearly a billion devices, including mobile devices running the Netflix app and the website experience. We’ve even extended support to older devices that lack modern capabilities, like support for TLS and HTTPS requests. For those, we’ve enabled secure communication from client to Pushy via an encryption/decryption layer on each device, allowing for confidential messages to flow between the device and server.</p><h1 id="d786" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Scaling to handle that growth (and more)</h1><h2 id="99cc" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Growth</h2><p id="7458" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">With that extended reach, Pushy has gotten busier. Over the last five years, Pushy has gone from tens of millions of concurrent connections to hundreds of millions of concurrent connections, and it regularly reaches 300,000 messages sent per second. To support this growth, we’ve revisited Pushy’s past assumptions and design decisions with an eye towards both Pushy’s future role and future stability.
Pushy had been relatively hands-free operationally over the last few years, and as we updated Pushy to fit its evolving role, our goal was also to get it into a stable state for the next few years. This is particularly important as we build out new functionality that relies on Pushy; a strong, stable infrastructure foundation allows our partners to continue to build on top of Pushy with confidence.</p><p id="f063" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Throughout this evolution, we’ve been able to maintain high availability and a consistent message delivery rate, with Pushy successfully maintaining 99.999% reliability for message delivery over the last few months. When our partners want to deliver a message to a device, it’s our job to make sure they can do so.</p><p id="4520" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Here are a few of the ways we’ve evolved Pushy to handle its growing scale.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qh"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*6yETYqbh6V9LhZcI%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*6yETYqbh6V9LhZcI%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*6yETYqbh6V9LhZcI%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*6yETYqbh6V9LhZcI%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*6yETYqbh6V9LhZcI%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*6yETYqbh6V9LhZcI%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*6yETYqbh6V9LhZcI%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, 
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*6yETYqbh6V9LhZcI 640w, https://miro.medium.com/v2/resize:fit:720/0*6yETYqbh6V9LhZcI 720w, https://miro.medium.com/v2/resize:fit:750/0*6yETYqbh6V9LhZcI 750w, https://miro.medium.com/v2/resize:fit:786/0*6yETYqbh6V9LhZcI 786w, https://miro.medium.com/v2/resize:fit:828/0*6yETYqbh6V9LhZcI 828w, https://miro.medium.com/v2/resize:fit:1100/0*6yETYqbh6V9LhZcI 1100w, https://miro.medium.com/v2/resize:fit:1400/0*6yETYqbh6V9LhZcI 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="A few of the related services in Pushy’s immediate ecosystem and the changes we’ve made for them." 
class="bh md pm c" width="700" height="364" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">A few of the related services in Pushy’s immediate ecosystem and the changes we’ve made for them.</figcaption></figure><h2 id="a76c" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Message processor</h2><p id="7fed" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">One aspect that we invested in was the evolution of the asynchronous message processor. The previous version of the message processor was a Mantis stream-processing job that processed messages from the message queue. It was very efficient, but it had a fixed job size, requiring manual intervention both to scale it horizontally and to roll out a new version.</p><p id="c757" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">It served Pushy’s needs well for many years. As the scale of the messages being processed increased and we made more code changes in the message processor, we found ourselves looking for something more flexible. In particular, we were looking for some of the features we enjoy with our other services: automatic horizontal scaling, canaries, automated red/black rollouts, and more observability. With this in mind, we rewrote the message processor as a standalone Spring Boot service using Netflix paved-path components. 
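</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At its core, the processor’s job is simple, and a toy sketch captures its shape (the class and method names here are ours for illustration, not Pushy’s actual code): drain messages from the queue, attempt delivery, and hold on to failures for retry.</p>

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Predicate;

// Toy model of the message processor's job: drain the asynchronous queue and
// hand each message to Pushy for delivery, retaining failed messages for retry.
final class MessageProcessor {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Queue<String> retries = new ArrayDeque<>();

    void enqueue(String message) { queue.add(message); }

    // deliver returns true on success; failed messages go to the retry queue.
    int drain(Predicate<String> deliver) {
        int delivered = 0;
        String msg;
        while ((msg = queue.poll()) != null) {
            if (deliver.test(msg)) delivered++;
            else retries.add(msg);
        }
        return delivered;
    }

    int pendingRetries() { return retries.size(); }
}
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">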
Its job is the same, but it does so with easy rollouts, canary configuration that lets us roll changes safely, and autoscaling policies we’ve defined to let it handle varying volumes.</p><p id="8436" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Rewriting always comes with a risk, and it’s never the first solution we reach for, particularly when working with a system that’s in place and working well. In this case, we found that the burden from maintaining and improving the custom stream processing job was increasing, and we made the judgment call to do the rewrite. Part of the reason we did so was the clear role that the message processor played — we weren’t rewriting a huge monolithic service, but instead a well-scoped component that had explicit goals, well-defined success criteria, and a clear path towards improvement. Since the rewrite was completed in mid-2023, the message processor component has been completely zero touch, happily automated and running reliably on its own.</p><h2 id="93ca" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Push Registry</h2><p id="4448" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">For most of its life, Pushy has used <a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-dynomite-making-non-distributed-databases-distributed-c7bce3d89404">Dynomite</a> for keeping track of device connection metadata in its Push Registry. 
Dynomite is a Netflix open source wrapper around Redis that provides a few additional features like auto-sharding and cross-region replication, and it provided Pushy with low latency and easy record expiry, both of which are critical for Pushy’s workload.</p><p id="ddbc" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As Pushy’s portfolio grew, we experienced some pain points with Dynomite. Dynomite had great performance, but it required manual scaling as the system grew. The folks on the Cloud Data Engineering (CDE) team, the ones building the paved path for internal data at Netflix, graciously helped us scale it up and make adjustments, but it ended up being an involved process as we kept growing.</p><p id="04b6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">These pain points coincided with the introduction of KeyValue, which was a new offering from the CDE team that is roughly “HashMap as a service” for Netflix developers. KeyValue is an abstraction over the storage engine itself, which allows us to choose the best storage engine that meets our SLO needs. In our case, we value low latency — the faster we can read from KeyValue, the faster these messages can get delivered. With CDE’s help, we migrated our Push Registry to use KV instead, and we have been extremely satisfied with the result. 
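</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As a toy illustration of the registry’s shape (names are hypothetical, and the real KeyValue is a distributed service rather than an in-process map), you can think of it as a TTL’d map from device to the Pushy instance holding its connection:</p>

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Toy stand-in for the Push Registry: deviceId -> connection metadata,
// with the record expiry that ages out stale registrations.
final class PushRegistry {
    record Entry(String pushyInstance, long expiresAtMillis) {}

    private final Map<String, Entry> store = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injectable for testing

    PushRegistry(LongSupplier clock) { this.clock = clock; }

    void register(String deviceId, String pushyInstance, long ttlMillis) {
        store.put(deviceId, new Entry(pushyInstance, clock.getAsLong() + ttlMillis));
    }

    // Which Pushy instance holds the device's connection, if the record is live.
    Optional<String> lookup(String deviceId) {
        Entry e = store.get(deviceId);
        if (e == null || e.expiresAtMillis <= clock.getAsLong()) {
            store.remove(deviceId); // lazy expiry of stale records
            return Optional.empty();
        }
        return Optional.of(e.pushyInstance);
    }
}
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">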
After tuning our store for Pushy’s needs, it has been on autopilot since, appropriately scaling and serving our requests with very low latency.</p><h2 id="76b8" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Scaling Pushy horizontally and vertically</h2><p id="0445" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Most of the other services our team runs, like apiproxy, the streaming edge proxy, are CPU bound, and we have autoscaling policies that scale them horizontally when we see an increase in CPU usage. This maps well to their workload — more HTTP requests means more CPU used, and we can scale up and down accordingly.</p><p id="e4a2" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Pushy has slightly different performance characteristics, with each node maintaining many connections and delivering messages on demand. In Pushy’s case, CPU usage is consistently low, since most of the connections are parked and waiting for an occasional message. Instead of relying on CPU, we scale Pushy on the number of connections, with exponential scaling to scale faster after higher thresholds are reached. We load balance the initial HTTP requests to establish the connections and rely on a reconnect protocol where devices will reconnect every 30 minutes or so, with some staggering, that gives us a steady stream of reconnecting devices to balance connections across all available instances.</p><p id="cd71" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For a few years, our scaling policy had been that we would add new instances when the average number of connections reached 60,000 connections per instance. For a couple hundred million devices, this meant that we were regularly running thousands of Pushy instances. 
We can horizontally scale Pushy to our heart’s content, but we would be less content with our bill and would have to shard Pushy further to get around NLB connection limits. This evolution effort aligned well with an internal focus on cost efficiency, and we used this as an opportunity to revisit these earlier assumptions with an eye towards efficiency.</p><p id="7d83" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Both of these would be helped by increasing the number of connections that each Pushy node could handle, reducing the total number of Pushy instances and running more efficiently with the right balance between instance type, instance cost, and maximum concurrent connections. It would also give us more breathing room with the NLB limits, reducing the toil of additional sharding as we continue to grow. That being said, increasing the number of connections per node is not without its own drawbacks. When a Pushy instance goes down, the devices that were connected to it will immediately try to reconnect. Increasing the number of connections per instance therefore also increases the number of devices that would immediately be trying to reconnect. We could have a million connections per instance, but a down node would lead to a thundering herd of a million devices reconnecting at the same time.</p><p id="806f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This delicate balance led us to a deep evaluation of many instance types and performance tuning options. Striking that balance, we ended up with instances that handle an average of 200,000 connections per node, with breathing room to go up to 400,000 connections if we had to. This makes for a nice balance between CPU usage, memory usage, and the size of the thundering herd when an instance goes down. 
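</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To make the connection-based policy concrete, here is a toy scaling decision function. The 200,000 target mirrors the number above, but the quadratic response and everything else here is illustrative, not our actual autoscaling configuration:</p>

```java
// Toy autoscaling step: scale on average connections per node rather than CPU,
// adding more instances the farther the fleet is past its per-node target.
final class ConnectionScaler {
    static final int TARGET_AVG = 200_000; // hypothetical per-instance target

    // How many instances to add for the current fleet state.
    static int instancesToAdd(long totalConnections, int instanceCount) {
        double avg = (double) totalConnections / instanceCount;
        if (avg <= TARGET_AVG) return 0;
        double overshoot = avg / TARGET_AVG; // 1.0 = exactly at target
        // Superlinear response: a small overshoot adds a few nodes,
        // a large overshoot adds proportionally many more.
        return (int) Math.ceil(instanceCount * (Math.pow(overshoot, 2) - 1));
    }
}
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">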
We’ve also enhanced our autoscaling policies to scale exponentially; the farther we are past our target average connection count, the more instances we’ll add. These improvements have enabled Pushy to be almost entirely hands off operationally, giving us plenty of flexibility as more devices come online in different patterns.</p><h2 id="b5d7" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Reliability &amp; building a stable foundation</h2><p id="fcec" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Alongside these efforts to scale Pushy for the future, we also took a close look at our reliability after finding some connectivity edge cases during recent feature development. We found a few areas for improvement around the connection between Pushy and the device, with failures due to Pushy attempting to send messages on a connection that had failed without notifying Pushy. Ideally something like a silent failure wouldn’t happen, but we frequently see odd client behavior, particularly on older devices.</p><p id="215b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In collaboration with the client teams, we were able to make some improvements. On the client side, better connection handling and improvements around the reconnect flow meant that they were more likely to reconnect appropriately. In Pushy, we added additional heartbeats, idle connection cleanup, and better connection tracking, which meant that we were keeping around fewer and fewer stale connections.</p><p id="ab19" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">While these improvements were mostly around those edge cases for the feature development, they had the side benefit of bumping our message delivery rates up even further. 
We already had a good message delivery rate, but this additional bump has enabled Pushy to regularly average 5 9s of message delivery reliability.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qi"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*SFyzjaMH524tYkkQ%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*SFyzjaMH524tYkkQ%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*SFyzjaMH524tYkkQ%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*SFyzjaMH524tYkkQ%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*SFyzjaMH524tYkkQ%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*SFyzjaMH524tYkkQ%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*SFyzjaMH524tYkkQ%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*SFyzjaMH524tYkkQ 640w, https://miro.medium.com/v2/resize:fit:720/0*SFyzjaMH524tYkkQ 720w, https://miro.medium.com/v2/resize:fit:750/0*SFyzjaMH524tYkkQ 750w, https://miro.medium.com/v2/resize:fit:786/0*SFyzjaMH524tYkkQ 786w, https://miro.medium.com/v2/resize:fit:828/0*SFyzjaMH524tYkkQ 828w, https://miro.medium.com/v2/resize:fit:1100/0*SFyzjaMH524tYkkQ 1100w, https://miro.medium.com/v2/resize:fit:1400/0*SFyzjaMH524tYkkQ 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and 
(max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Push message delivery success rate over a recent 2-week period, staying consistently over 5 9s of reliability." class="bh md pm c" width="700" height="220" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pn ff po oz pa pp pq bf b bg z du"><em class="pr">Push message delivery success rate over a recent 2-week period.</em></figcaption></figure><h1 id="4fa9" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Recent developments</h1><p id="22ab" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">With this stable foundation and all of these connections, what can we now do with them? This question has been the driving force behind nearly all of the recent features built on top of Pushy, and it’s an exciting question to ask, particularly as an infrastructure team.</p><h2 id="2591" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Shift towards direct push</h2><p id="832c" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">The first change from Pushy’s traditional role is what we call direct push; instead of a backend service dropping the message on the asynchronous message queue, it can instead leverage the Push library to skip the asynchronous queue entirely. 
When called to deliver a message in the direct path, the Push library will look up the Pushy connected to the target device in the Push Registry, then send the message directly to that Pushy. Pushy will respond with a status code reflecting whether it was able to successfully deliver the message or it encountered an error, and the Push library will bubble that up to the calling code in the service.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qj"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*PJHwCgmRYIYMVPcl%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*PJHwCgmRYIYMVPcl%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*PJHwCgmRYIYMVPcl%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*PJHwCgmRYIYMVPcl%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*PJHwCgmRYIYMVPcl%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*PJHwCgmRYIYMVPcl%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*PJHwCgmRYIYMVPcl%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*PJHwCgmRYIYMVPcl 640w, https://miro.medium.com/v2/resize:fit:720/0*PJHwCgmRYIYMVPcl 720w, https://miro.medium.com/v2/resize:fit:750/0*PJHwCgmRYIYMVPcl 750w, https://miro.medium.com/v2/resize:fit:786/0*PJHwCgmRYIYMVPcl 786w, 
https://miro.medium.com/v2/resize:fit:828/0*PJHwCgmRYIYMVPcl 828w, https://miro.medium.com/v2/resize:fit:1100/0*PJHwCgmRYIYMVPcl 1100w, https://miro.medium.com/v2/resize:fit:1400/0*PJHwCgmRYIYMVPcl 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="The system diagram for the direct and indirect push paths. The direct push path goes directly from a backend service to Pushy, while the indirect path goes to a decoupled message queue, which is then handled by a message processor and sent on to Pushy." class="bh md pm c" width="700" height="273" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">The system diagram for the direct and indirect push paths.</figcaption></figure><p id="8c27" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Susheel, the original author of Pushy, added this functionality as an optional path, but for years, nearly all backend services relied on the indirect path with its “best-effort” being good enough for their use cases. In recent years, we’ve seen usage of this direct path really take off as the needs of backend services have grown. 
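</p><p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">A minimal sketch of that direct path (names are hypothetical): resolve the Pushy instance holding the target device’s connection via the Push Registry, deliver to it, and surface the resulting status to the caller:</p>

```java
import java.util.Optional;
import java.util.function.Function;

// Illustrative direct-push flow: look up the Pushy holding the target device's
// connection, deliver to it, and bubble the status back to the calling service
// so it can decide whether to retry or fall back.
final class DirectPush {
    enum Status { DELIVERED, DEVICE_OFFLINE, ERROR }

    static Status send(String deviceId,
                       Function<String, Optional<String>> registryLookup,
                       Function<String, Status> deliverToPushy) {
        // 1. Find which Pushy instance the device is connected to.
        Optional<String> pushy = registryLookup.apply(deviceId);
        if (pushy.isEmpty()) return Status.DEVICE_OFFLINE;
        // 2. Send directly to that Pushy; its response code bubbles back up.
        return deliverToPushy.apply(pushy.get());
    }
}
```

<p class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">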
In particular, rather than being just best effort, these direct messages allow the calling service to have immediate feedback about the delivery, letting it retry if a device it’s targeting has gone offline.</p><p id="80f7" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">These days, messages sent via direct push make up the majority of messages sent through Pushy. For example, over a recent 24-hour period, direct messages averaged around 160,000 messages per second and indirect messages averaged around 50,000 messages per second.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div class="oz pa qk"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*oCI-seLx9OMSYZQk%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*oCI-seLx9OMSYZQk%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*oCI-seLx9OMSYZQk%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*oCI-seLx9OMSYZQk%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*oCI-seLx9OMSYZQk%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*oCI-seLx9OMSYZQk%201100w,%20https://miro.medium.com/v2/resize:fit:1306/format:webp/0*oCI-seLx9OMSYZQk%201306w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 653px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*oCI-seLx9OMSYZQk 640w, https://miro.medium.com/v2/resize:fit:720/0*oCI-seLx9OMSYZQk 720w, 
https://miro.medium.com/v2/resize:fit:750/0*oCI-seLx9OMSYZQk 750w, https://miro.medium.com/v2/resize:fit:786/0*oCI-seLx9OMSYZQk 786w, https://miro.medium.com/v2/resize:fit:828/0*oCI-seLx9OMSYZQk 828w, https://miro.medium.com/v2/resize:fit:1100/0*oCI-seLx9OMSYZQk 1100w, https://miro.medium.com/v2/resize:fit:1306/0*oCI-seLx9OMSYZQk 1306w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 653px" /><img alt="Graph of direct vs indirect messages per second, showing around 150,000 direct messages per second and around 50,000 indirect messages per second." class="bh md pm c" width="653" height="506" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">Graph of direct vs indirect messages per second.</figcaption></figure><h2 id="7caa" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Device to device messaging</h2><p id="f571" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">As we’ve thought through this evolving use case, our concept of a message sender has also evolved. What if we wanted to move past Pushy’s pattern of delivering server-side messages? What if we wanted to have a device send a message to a backend service, or maybe even to another device? 
Our messages had traditionally been unidirectional, sent from the server to the device, but we now leverage these bidirectional connections and direct device messaging to enable what we call device to device messaging. Device to device messaging supported early phone-to-TV communication for games like Triviaverse, and it’s the messaging foundation for our <a class="af nv" href="https://help.netflix.com/en/node/132821" rel="noopener ugc nofollow" target="_blank">Companion Mode</a> as TVs and phones communicate back and forth.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div class="oz pa ql"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*rA3HZj7YEo5Sp4Xjp5c6EA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*rA3HZj7YEo5Sp4Xjp5c6EA.png" /><img alt="A screenshot of one of the authors playing Triviaquest with a mobile device as the controller." class="bh md pm c" width="596" height="1049" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">A screenshot of one of the authors playing Triviaquest with a mobile device as the controller.</figcaption></figure><p id="15e9" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This requires higher-level knowledge of the system: we need to know not just information about a single device, but broader information, like which devices are connected for an account that the phone can pair with. This also enables things like subscribing to device events to know when another device comes online and is available to pair or send a message to. This has been built out with an additional service that receives device connection information from Pushy. 
These events, sent over a Kafka topic, let the service keep track of the device list for a given account. Devices can subscribe to these events, allowing them to receive a message from the service when another device for the same account comes online.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*PhEf0jXvhXbx6kwN%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*PhEf0jXvhXbx6kwN%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*PhEf0jXvhXbx6kwN%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*PhEf0jXvhXbx6kwN%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*PhEf0jXvhXbx6kwN%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*PhEf0jXvhXbx6kwN%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*PhEf0jXvhXbx6kwN%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*PhEf0jXvhXbx6kwN 640w, https://miro.medium.com/v2/resize:fit:720/0*PhEf0jXvhXbx6kwN 720w, https://miro.medium.com/v2/resize:fit:750/0*PhEf0jXvhXbx6kwN 750w, https://miro.medium.com/v2/resize:fit:786/0*PhEf0jXvhXbx6kwN 786w, https://miro.medium.com/v2/resize:fit:828/0*PhEf0jXvhXbx6kwN 828w, https://miro.medium.com/v2/resize:fit:1100/0*PhEf0jXvhXbx6kwN 1100w, https://miro.medium.com/v2/resize:fit:1400/0*PhEf0jXvhXbx6kwN 
1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Pushy and its relationship with the Device List Service for discovering other devices. Pushy reaches out to the Device List Service, and when it receives the device list in response, propagates that back to the requesting device." class="bh md pm c" width="700" height="314" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">Pushy and its relationship with the Device List Service for discovering other devices.</figcaption></figure><p id="7742" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This device list enables the discoverability aspect of these device to device messages. Once the devices have this knowledge of the other devices connected for the same account, they’re able to choose a target device from this list that they can then send messages to.</p><p id="e182" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Once a device has that list, it can send a message to Pushy over its WebSocket connection with that device as the target in what we call a <em class="nu">device to device message</em> (1 in the diagram below). 
Pushy looks up the target device’s metadata in the Push registry (2) and sends the message to the second Pushy that the target device is connected to (3), as if it was the backend service in the direct push pattern above. That Pushy delivers the message to the target device (4), and the original Pushy will receive a status code in response, which it can pass back to the source device (5).</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*dEQ1TpVfTQNs3eg4%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*dEQ1TpVfTQNs3eg4%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*dEQ1TpVfTQNs3eg4%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*dEQ1TpVfTQNs3eg4%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*dEQ1TpVfTQNs3eg4%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*dEQ1TpVfTQNs3eg4%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*dEQ1TpVfTQNs3eg4%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*dEQ1TpVfTQNs3eg4 640w, https://miro.medium.com/v2/resize:fit:720/0*dEQ1TpVfTQNs3eg4 720w, https://miro.medium.com/v2/resize:fit:750/0*dEQ1TpVfTQNs3eg4 750w, https://miro.medium.com/v2/resize:fit:786/0*dEQ1TpVfTQNs3eg4 786w, 
https://miro.medium.com/v2/resize:fit:828/0*dEQ1TpVfTQNs3eg4 828w, https://miro.medium.com/v2/resize:fit:1100/0*dEQ1TpVfTQNs3eg4 1100w, https://miro.medium.com/v2/resize:fit:1400/0*dEQ1TpVfTQNs3eg4 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="A basic order of events for a device to device message." class="bh md pm c" width="700" height="310" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">A basic order of events for a device to device message.</figcaption></figure><h2 id="8a43" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">The messaging protocol</h2><p id="1ace" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">We’ve defined a basic JSON-based message protocol for device to device messaging that lets these messages be passed from the source device to the target device. As a networking team, we naturally lean towards abstracting the communication layer with encapsulation wherever possible. 
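</p><p class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">As an illustration, such an envelope might look something like the following — the field names here are hypothetical, not our actual wire format:</p>

```json
{
  "targetDeviceId": "tv-living-room",
  "sourceDeviceId": "phone-123",
  "payload": {
    "appProtocolVersion": 1,
    "body": "opaque, app-defined content"
  }
}
```

<p class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">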
This generalized message means that device teams are able to define their own protocols on top of these messages — Pushy would just be the transport layer, happily forwarding messages back and forth.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div class="oz pa qo"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*4-ijw8c0BTX9r20jVIgKNA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*4-ijw8c0BTX9r20jVIgKNA.png" /><img alt="A simple block diagram showing the client app protocol on top of the device to device protocol, which itself is on top of the WebSocket &amp; Pushy protocol." class="bh md pm c" width="354" height="194" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">The client app protocol, built on top of the device to device protocol, built on top of Pushy.</figcaption></figure><p id="c46f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This generalization paid off in terms of investment and operational support. We built the majority of this functionality in October 2022, and we’ve only needed small tweaks since then. We needed nearly no modifications as client teams built out the functionality on top of this layer, defining the higher level application-specific protocols that powered the features they were building. 
We really do enjoy working with our partner teams, but if we’re able to give them the freedom to build on top of our infrastructure layer without us getting involved, then we’re able to increase their velocity, make their lives easier, and play our infrastructure roles as message platform providers.</p><p id="2933" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">With early features in experimentation, Pushy sees an average of 1000 device to device messages per second, a number that will only continue to grow.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qp"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*6gn9UvREat4OqRoU%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*6gn9UvREat4OqRoU%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*6gn9UvREat4OqRoU%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*6gn9UvREat4OqRoU%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*6gn9UvREat4OqRoU%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*6gn9UvREat4OqRoU%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*6gn9UvREat4OqRoU%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*6gn9UvREat4OqRoU 640w, https://miro.medium.com/v2/resize:fit:720/0*6gn9UvREat4OqRoU 720w, 
https://miro.medium.com/v2/resize:fit:750/0*6gn9UvREat4OqRoU 750w, https://miro.medium.com/v2/resize:fit:786/0*6gn9UvREat4OqRoU 786w, https://miro.medium.com/v2/resize:fit:828/0*6gn9UvREat4OqRoU 828w, https://miro.medium.com/v2/resize:fit:1100/0*6gn9UvREat4OqRoU 1100w, https://miro.medium.com/v2/resize:fit:1400/0*6gn9UvREat4OqRoU 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="Graph of device to device messages per second, showing an average of 1000 messages per second." class="bh md pm c" width="700" height="425" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">Graph of device to device messages per second.</figcaption></figure><h2 id="03fc" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">The Netty-gritty details</h2><p id="aaec" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">In Pushy, we handle incoming WebSocket messages in our PushClientProtocolHandler (<a class="af nv" href="https://github.com/Netflix/zuul/blob/99ef8841c8b7b82536d5fb193fd751c675c9ad0d/zuul-core/src/main/java/com/netflix/zuul/netty/server/push/PushClientProtocolHandler.java" rel="noopener ugc nofollow" target="_blank">code pointer to class in Zuul that we extend</a>), which extends Netty’s ChannelInboundHandlerAdapter and is added to the Netty pipeline for each client connection. 
We listen for incoming WebSocket messages from the connected device in its channelRead method and parse the incoming message. If it’s a device to device message, we pass the message, the ChannelHandlerContext, and the PushUserAuth information about the connection’s identity to our DeviceToDeviceManager.</p><figure class="pc pd pe pf pg ph oz pa paragraph-image"><div role="button" tabindex="0" class="pi pj fj pk bh pl"><div class="oz pa qq"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*cp-lfclw0ayykX2H%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*cp-lfclw0ayykX2H%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*cp-lfclw0ayykX2H%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*cp-lfclw0ayykX2H%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*cp-lfclw0ayykX2H%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*cp-lfclw0ayykX2H%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*cp-lfclw0ayykX2H%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*cp-lfclw0ayykX2H 640w, https://miro.medium.com/v2/resize:fit:720/0*cp-lfclw0ayykX2H 720w, https://miro.medium.com/v2/resize:fit:750/0*cp-lfclw0ayykX2H 750w, https://miro.medium.com/v2/resize:fit:786/0*cp-lfclw0ayykX2H 786w, https://miro.medium.com/v2/resize:fit:828/0*cp-lfclw0ayykX2H 828w, https://miro.medium.com/v2/resize:fit:1100/0*cp-lfclw0ayykX2H 1100w, 
https://miro.medium.com/v2/resize:fit:1400/0*cp-lfclw0ayykX2H 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="A rough overview of the internal organization for these components, with the code classes described above. Inside Pushy, a Push Client Protocol handler inside a Netty Channel calls out to the Device to Device manager, which itself calls out to the Push Message Sender class that forwards the message on to the other Pushy." class="bh md pm c" width="700" height="500" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pn ff po oz pa pp pq bf b bg z du">A rough overview of the internal organization for these components.</figcaption></figure><p id="973c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The DeviceToDeviceManager is responsible for validating the message, doing some bookkeeping, and kicking off an async call that validates that the device is an authorized target, looks up the Pushy for the target device in the local cache (or makes a call to the data store if it’s not found), and forwards on the message. We run this asynchronously to avoid any event loop blocking due to these calls. The DeviceToDeviceManager is also responsible for observability, with metrics around cache hits, calls to the data store, message delivery rates, and latency percentile measurements. 
We’ve relied heavily on these metrics for alerts and optimizations — Pushy really is a metrics service that occasionally will deliver a message or two!</p><h2 id="3a46" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Security</h2><p id="d15f" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Since Pushy sits at the edge of the Netflix cloud, security considerations are always top of mind. With every connection over HTTPS, we’ve limited these messages to just authenticated WebSocket connections, added rate limiting, and added authorization checks to ensure that a device is able to target another device — you may have the best intentions in mind, but I’d strongly prefer it if you weren’t able to send arbitrary data to my personal TV from yours (and vice versa, I’m sure!).</p><h2 id="f500" class="ps nx gu bf ny pt pu dy oc pv pw ea og nh px py pz nl qa qb qc np qd qe qf qg bk">Latency and other considerations</h2><p id="c746" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">One main consideration with the products built on top of this is latency, particularly when this feature is used for anything interactive within the Netflix app.</p><p id="9d25" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We’ve added caching to Pushy to reduce the number of lookups in the hot path for things that are unlikely to change frequently, like a device’s allowed list of targets and the Pushy instance the target device is connected to. We have to do some lookups on the initial messages to know where to send them, but it enables us to send subsequent messages faster without any KeyValue lookups. For these requests where caching removed KeyValue from the hot path, we were able to greatly speed things up.
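A read-through cache of this shape can be sketched as follows. The names (registryCache, instanceFor) and structure are hypothetical stand-ins for illustration, not Netflix's actual code:

```go
package main

import "fmt"

// Illustrative read-through cache for the "which Pushy instance is this
// device connected to?" lookup. The first message to a target pays a
// KeyValue lookup; subsequent messages skip the datastore entirely.
type registryCache struct {
	cache      map[string]string // deviceID -> Pushy instance
	storeCalls int
	store      func(deviceID string) (string, bool) // slow-path KeyValue lookup
}

func (c *registryCache) instanceFor(deviceID string) (string, bool) {
	if inst, ok := c.cache[deviceID]; ok {
		return inst, true // hot path: no KeyValue call
	}
	c.storeCalls++
	inst, ok := c.store(deviceID)
	if ok {
		c.cache[deviceID] = inst // cache for subsequent messages
	}
	return inst, ok
}

func main() {
	kv := map[string]string{"tv-456": "pushy-instance-17"}
	c := &registryCache{
		cache: map[string]string{},
		store: func(id string) (string, bool) { v, ok := kv[id]; return v, ok },
	}
	c.instanceFor("tv-456") // first message: falls through to the datastore
	c.instanceFor("tv-456") // later messages: served from the local cache
	fmt.Println(c.storeCalls) // 1
}
```

The trade-off is the usual one: a stale entry can briefly route a message to the wrong instance, which is acceptable for data that changes rarely, like the allowed-targets list.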
From the incoming message arriving at Pushy to the response being sent back to the device, we reduced median latency to less than a millisecond, with the 99th percentile of latency at less than 4ms.</p><p id="b708" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Our KeyValue latency is usually very low, but we have seen brief periods of elevated read latencies due to underlying issues in our KeyValue datastore. Overall latencies increased for other parts of Pushy, like client registration, but we saw very little increase in device to device latency with this caching in place.</p><h1 id="da41" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Cultural aspects that enable this work</h1><p id="17eb" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Pushy’s scale and system design considerations make the work technically interesting, but we also deliberately focus on non-technical aspects that have helped to drive Pushy’s growth. We focus on iterative development that solves the hardest problem first, with projects frequently starting with quick hacks or prototypes to prove out a feature. As we do this initial version, we do our best to keep an eye towards the future, allowing us to move quickly from supporting a single, focused use case to a broad, generalized solution. For example, for our cross-device messaging, we were able to solve hard problems in the early work for <em class="nu">Triviaverse</em> that we later leveraged for the generic device to device solution.</p><p id="160a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As one can immediately see in the system diagrams above, Pushy does not exist in a vacuum, with projects frequently involving at least half a dozen teams. 
Trust, experience, communication, and strong relationships all enable this to work. Our team wouldn’t exist without our platform users, and we certainly wouldn’t be here writing this post without all of the work our product and client teams do. This has also emphasized the importance of building and sharing — if we’re able to get a prototype together with a device team, we’re able to then show it off to seed ideas from other teams. It’s one thing to mention that you can send these messages, but it’s another to show off the TV responding to the first click of the phone controller button!</p><h1 id="35a3" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">The future of Pushy</h1><p id="1f62" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">If there’s anything certain in this world, it’s that Pushy will continue to grow and evolve. We have many new features in the works, like WebSocket message proxying, WebSocket message tracing, a global broadcast mechanism, and subscription functionality in support of Games and Live. 
With all of this investment, Pushy is a stable, reinforced foundation, ready for this next generation of features.</p><p id="4db1" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We’ll be writing about those new features as well — stay tuned for future posts.</p><p id="df27" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">Special thanks to our stunning colleagues </em><a class="af nv" href="https://www.linkedin.com/in/jeremy-kelly-526a30180/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Jeremy Kelly</em></a><em class="nu"> and </em><a class="af nv" href="https://www.linkedin.com/in/justin-guerra-3282262b/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Justin Guerra</em></a><em class="nu"> who have both been invaluable to Pushy’s growth and the WebSocket ecosystem at large. We would also like to thank our larger teams and our numerous partners for their great work; it truly takes a village!</em></p></div>]]></description>
      <link>https://netflixtechblog.com/pushy-to-the-limit-evolving-netflixs-websocket-proxy-for-the-future-b468bc0ff658</link>
      <guid>https://netflixtechblog.com/pushy-to-the-limit-evolving-netflixs-websocket-proxy-for-the-future-b468bc0ff658</guid>
      <pubDate>Tue, 10 Sep 2024 21:15:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Noisy Neighbor Detection with eBPF]]></title>
      <description><![CDATA[<div class="ab cb"><div class="ci bh fz ga gb gc"><div><div></div><p id="3108" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">By </em><a class="af nv" href="https://www.linkedin.com/in/josefernandezmn/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Jose Fernandez</em></a><em class="nu">, </em><a class="af nv" href="https://www.linkedin.com/in/sebastien-dabdoub-2a5a0958/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Sebastien Dabdoub</em></a><em class="nu">, </em><a class="af nv" href="https://www.linkedin.com/in/jason-koch-5692172/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Jason Koch</em></a><em class="nu">, </em><a class="af nv" href="https://www.linkedin.com/in/artemtkachuk/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Artem Tkachuk</em></a></p><p id="8709" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The Compute and Performance Engineering teams at Netflix regularly investigate performance issues in our multi-tenant environment. The first step is determining whether the problem originates from the application or the underlying infrastructure. One issue that often complicates this process is the "noisy neighbor" problem. On <a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436">Titus</a>, our multi-tenant compute platform, a "noisy neighbor" refers to a container or system service that heavily utilizes the server's resources, causing performance degradation in adjacent containers.
We usually focus on CPU utilization because it is our workload's most frequent source of noisy neighbor issues.</p><p id="f191" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Detecting the effects of noisy neighbors is complex. Traditional performance analysis tools such as <a class="af nv" href="https://www.brendangregg.com/perf.html" rel="noopener ugc nofollow" target="_blank">perf</a> can introduce significant overhead, risking further performance degradation. Additionally, these tools are typically deployed after the fact, which is too late for effective investigation. Another challenge is that debugging noisy neighbor issues requires significant low-level expertise and specialized tooling. In this blog post, we'll reveal how we leveraged eBPF to achieve continuous, low-overhead instrumentation of the Linux scheduler, enabling effective self-serve monitoring of noisy neighbor issues. Learn how Linux kernel instrumentation can improve your infrastructure observability with deeper insights and enhanced monitoring.</p><h1 id="1e46" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Continuous Instrumentation of the Linux Scheduler</h1><p id="792f" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">To ensure the reliability of our workloads that depend on low latency responses, we instrumented the <a class="af nv" href="https://en.wikipedia.org/wiki/Run_queue" rel="noopener ugc nofollow" target="_blank">run queue</a> latency for each container, which measures the time processes spend in the scheduling queue before being dispatched to the CPU. Extended waiting in this queue can be a telltale of performance issues, especially when containers are not utilizing their total CPU allocation.
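As a mental model of the metric itself: run queue latency is simply the delta between the moment a task becomes runnable and the moment it is actually dispatched onto a CPU. A plain-Go sketch of that bookkeeping (illustrative only, not the production instrumentation):

```go
package main

import "fmt"

// Mental model of the instrumentation: run queue latency is the delta
// between a task becoming runnable (wakeup) and it being dispatched
// onto a CPU (switch-in).
type runQueueModel struct {
	enqueuedAt map[int]uint64 // pid -> timestamp (ns) when it became runnable
}

func (m *runQueueModel) wakeup(pid int, nowNs uint64) {
	m.enqueuedAt[pid] = nowNs
}

// switchIn returns the run queue latency for pid, or false when we
// never observed its wakeup (a "missed enqueue").
func (m *runQueueModel) switchIn(pid int, nowNs uint64) (uint64, bool) {
	ts, ok := m.enqueuedAt[pid]
	if !ok {
		return 0, false
	}
	delete(m.enqueuedAt, pid)
	return nowNs - ts, true
}

func main() {
	m := &runQueueModel{enqueuedAt: map[int]uint64{}}
	m.wakeup(42, 1_000_000)              // pid 42 becomes runnable at t=1ms
	lat, ok := m.switchIn(42, 1_250_000) // dispatched at t=1.25ms
	fmt.Println(lat, ok)                 // 250000 true
}
```

The eBPF programs shown below implement exactly this pattern in the kernel, with a BPF hash map playing the role of the Go map.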
Continuous instrumentation is critical to catching such issues as they emerge, and eBPF, with its low-overhead hooks into the Linux scheduler, enabled us to monitor run queue latency efficiently.</p><p id="048b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To emit a run queue latency metric, we leveraged three eBPF hooks: <strong class="my gv">sched_wakeup, sched_wakeup_new,</strong> and <strong class="my gv">sched_switch</strong>.</p></div></div><div class="oz"><div class="ab cb"><div class="ly pa lz pb ma pc cf pd cg pe ci bh"><figure class="pi pj pk pl pm oz pn po paragraph-image"><div role="button" tabindex="0" class="pp pq fj pr bh ps"><div class="pf pg ph"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*6bapyclfXZPsUIaXFM-xaQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*6bapyclfXZPsUIaXFM-xaQ.png" /><img alt="" class="bh md pt c" width="1000" height="563" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure></div></div></div><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="184b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The <strong class="my gv">sched_wakeup </strong>and <strong class="my gv">sched_wakeup_new</strong> hooks are invoked when a process changes state from 'sleeping' to 'runnable.' They let us identify when a process is ready to run and is waiting for CPU time.
During this event, we generate a timestamp and store it in an eBPF hash map using the process ID as the key.</p><pre class="pi pj pk pl pm pu pv pw bp px bb bk">struct {<br />    __uint(type, BPF_MAP_TYPE_HASH);<br />    __uint(max_entries, MAX_TASK_ENTRIES);<br />    __uint(key_size, sizeof(u32));<br />    __uint(value_size, sizeof(u64));<br />} runq_lat SEC(".maps");<br /><br />SEC("tp_btf/sched_wakeup")<br />int tp_sched_wakeup(u64 *ctx)<br />{<br />    struct task_struct *task = (void *)ctx[0];<br />    u32 pid = task-&gt;pid;<br />    u64 ts = bpf_ktime_get_ns();<br /><br />    bpf_map_update_elem(&amp;runq_lat, &amp;pid, &amp;ts, BPF_NOEXIST);<br />    return 0;<br />}</pre><p id="5852" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Conversely, the <strong class="my gv">sched_switch</strong> hook is triggered when the CPU switches between processes. This hook provides pointers to the process currently utilizing the CPU and the process about to take over. We use the upcoming task's process ID (PID) to fetch the timestamp from the eBPF map. This timestamp represents when the process entered the queue, which we had previously stored.
We then calculate the run queue latency by simply subtracting the timestamps.</p><pre class="pi pj pk pl pm pu pv pw bp px bb bk">SEC("tp_btf/sched_switch")<br />int tp_sched_switch(u64 *ctx)<br />{<br />    struct task_struct *prev = (struct task_struct *)ctx[1];<br />    struct task_struct *next = (struct task_struct *)ctx[2];<br />    u32 prev_pid = prev-&gt;pid;<br />    u32 next_pid = next-&gt;pid;<br /><br />    // fetch timestamp of when the next task was enqueued<br />    u64 *tsp = bpf_map_lookup_elem(&amp;runq_lat, &amp;next_pid);<br />    if (tsp == NULL) {<br />        return 0; // missed enqueue<br />    }<br /><br />    // calculate runq latency before deleting the stored timestamp<br />    u64 now = bpf_ktime_get_ns();<br />    u64 runq_lat = now - *tsp;<br /><br />    // delete pid from enqueued map<br />    bpf_map_delete_elem(&amp;runq_lat, &amp;next_pid);<br />    ....</pre><p id="58f0" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads, also known as tasks in kernel terminology. This feature enables access to a wealth of information stored about a process. We required the process's cgroup ID to associate it with a container for our specific use case. However, the cgroup information in the struct is safeguarded by an <a class="af nv" href="https://elixir.bootlin.com/linux/v6.6.16/source/include/linux/sched.h#L1225" rel="noopener ugc nofollow" target="_blank">RCU (Read Copy Update) lock</a>.</p><p id="0a15" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To safely access this RCU-protected information, we can leverage <a class="af nv" href="https://docs.kernel.org/bpf/kfuncs.html" rel="noopener ugc nofollow" target="_blank">kfuncs</a> in eBPF. kfuncs are kernel functions that can be called from eBPF programs.
There are kfuncs available to lock and unlock RCU read-side critical sections. These functions ensure that our eBPF program remains safe and efficient while retrieving the cgroup ID from the task struct.</p><pre class="pi pj pk pl pm pu pv pw bp px bb bk">void bpf_rcu_read_lock(void) __ksym;<br />void bpf_rcu_read_unlock(void) __ksym;<br /><br />u64 get_task_cgroup_id(struct task_struct *task)<br />{<br />    struct css_set *cgroups;<br />    u64 cgroup_id;<br />    bpf_rcu_read_lock();<br />    cgroups = task-&gt;cgroups;<br />    cgroup_id = cgroups-&gt;dfl_cgrp-&gt;kn-&gt;id;<br />    bpf_rcu_read_unlock();<br />    return cgroup_id;<br />}</pre><p id="ba4a" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Having the data ready, we must package it and send it to userspace. For this purpose, we chose the eBPF <a class="af nv" href="https://nakryiko.com/posts/bpf-ringbuf/" rel="noopener ugc nofollow" target="_blank">ring buffer</a>. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without necessitating extra memory copying or syscalls.
However, the sheer amount of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data effectively.</p><pre class="pi pj pk pl pm pu pv pw bp px bb bk">struct {<br />    __uint(type, BPF_MAP_TYPE_RINGBUF);<br />    __uint(max_entries, 256 * 1024);<br />} events SEC(".maps");<br /><br />struct {<br />    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);<br />    __uint(max_entries, MAX_TASK_ENTRIES);<br />    __uint(key_size, sizeof(u64));<br />    __uint(value_size, sizeof(u64));<br />} cgroup_id_to_last_event_ts SEC(".maps");<br /><br />struct runq_event {<br />    u64 prev_cgroup_id;<br />    u64 cgroup_id;<br />    u64 runq_lat;<br />    u64 ts;<br />};<br /><br />SEC("tp_btf/sched_switch")<br />int tp_sched_switch(u64 *ctx)<br />{<br />    // ....<br />    // The previous code<br />    // ....<br /><br />    u64 prev_cgroup_id = get_task_cgroup_id(prev);<br />    u64 cgroup_id = get_task_cgroup_id(next);<br /><br />    // per-cgroup-id-per-CPU rate-limiting <br />    // to balance observability with performance overhead<br />    u64 *last_ts = <br />        bpf_map_lookup_elem(&amp;cgroup_id_to_last_event_ts, &amp;cgroup_id);<br />    u64 last_ts_val = last_ts == NULL ?
0 : *last_ts;<br /><br />    // check the rate limit for the cgroup_id in consideration<br />    // before doing more work<br />    if (now - last_ts_val &lt; RATE_LIMIT_NS) {<br />        // Rate limit exceeded, drop the event<br />        return 0;<br />    }<br /><br />    struct runq_event *event;<br />    event = bpf_ringbuf_reserve(&amp;events, sizeof(*event), 0);<br /><br />    if (event) {<br />        event-&gt;prev_cgroup_id = prev_cgroup_id;<br />        event-&gt;cgroup_id = cgroup_id;<br />        event-&gt;runq_lat = runq_lat;<br />        event-&gt;ts = now;<br />        bpf_ringbuf_submit(event, 0);<br />        // Update the last event timestamp for the current cgroup_id<br />        bpf_map_update_elem(&amp;cgroup_id_to_last_event_ts, &amp;cgroup_id,<br />            &amp;now, BPF_ANY);<br />    }<br /><br />    return 0;<br />}</pre><p id="cd03" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Our userspace application, developed in Go, processes events from the ring buffer to emit metrics to our metrics backend, <a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-atlas-netflixs-primary-telemetry-platform-bd31f4d8ed9a">Atlas</a>. Each event includes a run queue latency sample with a cgroup ID, which we associate with running containers on the host. We categorize it as a system service if no such association is found. When a cgroup ID correlates with a container, we emit a percentile timer Atlas metric (<code class="cx qd qe qf pv b">runq.latency</code>) for that container. We also increment a counter metric (<code class="cx qd qe qf pv b">sched.switch.out</code>) to monitor preemptions occurring for the container's processes.
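In rough strokes, that association step on the Go side might look like this. The names and shapes here are hypothetical sketches; the real service emits Atlas metrics rather than returning tag strings:

```go
package main

import "fmt"

// runqEvent mirrors the struct the eBPF program submits to the ring buffer.
type runqEvent struct {
	prevCgroupID uint64 // cgroup of the task switched out (preempted)
	cgroupID     uint64 // cgroup of the task switched in
	runqLatNs    uint64
}

// Hypothetical sketch of the association logic: resolve a cgroup ID to a
// container name (falling back to "system.service"), and derive the
// preemption-cause tag for the sched.switch.out counter.
func switchOutTags(e runqEvent, containers map[uint64]string) (preempted, cause string) {
	name := func(id uint64) string {
		if c, ok := containers[id]; ok {
			return c
		}
		return "system.service"
	}
	preempted = name(e.prevCgroupID)
	switch {
	case e.prevCgroupID == e.cgroupID:
		cause = "same_container"
	case name(e.cgroupID) == "system.service":
		cause = "system_service"
	default:
		cause = "other_container"
	}
	return preempted, cause
}

func main() {
	containers := map[uint64]string{10: "container1"}
	// container1's process preempted by a task in an unknown cgroup
	preempted, cause := switchOutTags(runqEvent{prevCgroupID: 10, cgroupID: 42}, containers)
	fmt.Println(preempted, cause) // container1 system_service
}
```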
Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it's due to a process within the same container (or cgroup), a process in another container, or a system service.</p><p id="1bb4" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">It's important to highlight that both the <code class="cx qd qe qf pv b">runq.latency</code> metric and the <code class="cx qd qe qf pv b">sched.switch.out</code> metrics are needed to determine if a container is affected by noisy neighbors, which is the goal we aim to achieve — relying solely on the runq.latency metric can lead to misconceptions. For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU request limits. However, simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue.</p><h1 id="b5da" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">A Noisy Neighbor Story</h1><p id="8be9" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Below is the <code class="cx qd qe qf pv b">runq.latency</code> metric for a server running a single container with ample CPU overhead. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. 
Although there are some spikes reaching 400µs, the latency remains within acceptable parameters.</p></div></div><div class="oz"><div class="ab cb"><div class="ly pa lz pb ma pc cf pd cg pe ci bh"><figure class="pi pj pk pl pm oz pn po paragraph-image"><div role="button" tabindex="0" class="pp pq fj pr bh ps"><div class="pf pg qg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*_DcYxRgeDwX5i07IrdTZyA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*_DcYxRgeDwX5i07IrdTZyA.png" /><img alt="" class="bh md pt c" width="1000" height="420" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qh ff qi pf pg qj qk bf b bg z du">container1’s 99th percentile runq.latency averages 83µs (microseconds), with spikes up to 400µs, without adjacent containers. This serves as our baseline for a container not contending for CPU on a host.</figcaption></figure></div></div></div><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="d6c6" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">At 10:35, launching <code class="cx qd qe qf pv b">container2</code>, which fully utilized all CPUs on the host, caused a significant 131-millisecond spike (131,000 microseconds) in <code class="cx qd qe qf pv b">container1</code>'s P99 run queue latency. This spike would be noticeable in the userspace application if it were serving HTTP traffic. 
If userspace app owners reported an unexplained latency spike, we could quickly identify the noisy neighbor issue through run queue latency metrics.</p></div></div><div class="oz"><div class="ab cb"><div class="ly pa lz pb ma pc cf pd cg pe ci bh"><figure class="pi pj pk pl pm oz pn po paragraph-image"><div role="button" tabindex="0" class="pp pq fj pr bh ps"><div class="pf pg ql"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*DJrwEbrWPOxVMS0JP7uE9A.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*DJrwEbrWPOxVMS0JP7uE9A.png" /><img alt="" class="bh md pt c" width="1000" height="459" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="qh ff qi pf pg qj qk bf b bg z du">Launching container2 at 10:35, which maxes out all CPUs on the host, <strong class="bf ny">caused a 131-millisecond spike in container1’s P99 run queue latency</strong> due to increased preemptions by system processes. This indicates a noisy neighbor issue, where system services compete for CPU time with containers.</figcaption></figure></div></div></div><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="1ce4" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The <code class="cx qd qe qf pv b">sched.switch.out</code> metric indicates that the spike was due to increased preemptions by system processes, highlighting a noisy neighbor issue where system services compete with containers for CPU time. 
Our metrics show that the noisy neighbors were actually system processes, likely triggered by <code class="cx qd qe qf pv b">container2</code> consuming all available CPU capacity.</p><h1 id="37af" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Optimizing eBPF Code</h1><p id="5964" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">We developed an open-source eBPF process monitor called <a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/announcing-bpftop-streamlining-ebpf-performance-optimization-6a727c1ae2e5">bpftop</a> to measure the overhead of eBPF code in this hot kernel path. Our estimates suggest that the instrumentation adds less than 600 nanoseconds to each sched_* hook. We conducted a performance analysis on a Java service running in a container, and the instrumentation did not introduce significant overhead. The performance variance with the run queue profiling code active versus inactive was not measurable in milliseconds.</p><p id="2b6f" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">During our research on how eBPF statistics are measured in the kernel, we identified an opportunity to improve its calculation. 
We submitted this <a class="af nv" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce09cbdd988887662546a1175bcfdfc6c8fdd150" rel="noopener ugc nofollow" target="_blank">patch</a>, which was included in the Linux kernel 6.10 release.</p></div></div><div class="oz"><div class="ab cb"><div class="ly pa lz pb ma pc cf pd cg pe ci bh"><figure class="pi pj pk pl pm oz pn po paragraph-image"><div role="button" tabindex="0" class="pp pq fj pr bh ps"><div class="pf pg qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*YD6hkXce9a70AgvSHstgWA.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*YD6hkXce9a70AgvSHstgWA.gif" /><img alt="" class="bh md pt c" width="1000" height="208" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure></div></div></div><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="3e54" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Through trial and error and using bpftop, we identified several optimizations that helped maintain low overhead for this code:</p><ul class=""><li id="bdd6" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qn qo qp bk">We found that BPF_MAP_TYPE_HASH was the most performant for storing enqueued timestamps. Using BPF_MAP_TYPE_TASK_STORAGE resulted in nearly a twofold performance decline. BPF_MAP_TYPE_PERCPU_HASH was slightly less performant than BPF_MAP_TYPE_HASH, which was unexpected and requires further investigation.</li><li id="0f38" class="mw mx gu my b mz qq nb nc nd qr nf ng nh qs nj nk nl qt nn no np qu nr ns nt qn qo qp bk">The BPF_CORE_READ helper adds 20–30 nanoseconds per invocation. 
In the case of raw tracepoints, specifically those that are "BTF-enabled" (tp_btf/*), it is safe and more efficient to access the task struct members directly. Andrii Nakryiko recommends this approach in this <a class="af nv" href="https://nakryiko.com/posts/bpf-core-reference-guide/#btf-enabled-bpf-program-types-with-direct-memory-reads" rel="noopener ugc nofollow" target="_blank">blog post</a>.</li><li id="0f4d" class="mw mx gu my b mz qq nb nc nd qr nf ng nh qs nj nk nl qt nn no np qu nr ns nt qn qo qp bk">BPF_MAP_TYPE_LRU_HASH maps are 40–50 nanoseconds slower per operation than regular hash maps. Due to space concerns from PID churn, we initially used them for storing enqueued timestamps. We have since increased the map size, mitigating this risk.</li><li id="c6da" class="mw mx gu my b mz qq nb nc nd qr nf ng nh qs nj nk nl qt nn no np qu nr ns nt qn qo qp bk">The sched_switch, sched_wakeup, and sched_wakeup_new hooks are all triggered for kernel tasks, which are identifiable by their PID of 0. We found monitoring these tasks unnecessary, so we implemented several early exit conditions and conditional logic to prevent executing costly operations, such as accessing BPF maps, when dealing with a kernel task. Notably, kernel tasks operate through the scheduler queue like any regular process.</li></ul><h1 id="d4d7" class="nw nx gu bf ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot bk">Conclusion</h1><p id="6194" class="pw-post-body-paragraph mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt gn bk">Our findings highlight the value of low-overhead continuous instrumentation of the Linux kernel with eBPF. We have integrated these metrics into customer dashboards, enabling actionable insights and guiding multitenancy performance discussions. We can also now use these metrics to refine CPU isolation strategies to minimize the impact of noisy neighbors. 
Additionally, thanks to these metrics, we've gained deeper insights into the Linux scheduler.</p><p id="4565" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This project has also deepened our understanding of eBPF technology and underscored the importance of tools like bpftop for optimizing eBPF code. As eBPF adoption increases, we foresee more infrastructure observability and business logic shifting to it. One promising project in this space is <a class="af nv" href="https://github.com/sched-ext/scx" rel="noopener ugc nofollow" target="_blank">sched_ext</a>, potentially revolutionizing how scheduling decisions are made and tailored to specific workload needs.</p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/noisy-neighbor-detection-with-ebpf-64b1f4b3bbdd</link>
      <guid>https://netflixtechblog.com/noisy-neighbor-detection-with-ebpf-64b1f4b3bbdd</guid>
      <pubDate>Tue, 10 Sep 2024 20:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Recommending for Long-Term Member Satisfaction at Netflix]]></title>
      <description><![CDATA[<div><div></div><p id="989b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">By <a class="af nu" href="https://www.linkedin.com/in/jiangwei-pan-66a62a13/" rel="noopener ugc nofollow" target="_blank">Jiangwei Pan</a>, <a class="af nu" href="https://www.linkedin.com/in/thegarytang/" rel="noopener ugc nofollow" target="_blank">Gary Tang</a>, <a class="af nu" href="https://www.linkedin.com/in/henry-kang-wang-06701716/" rel="noopener ugc nofollow" target="_blank">Henry Wang</a>, and <a class="af nu" href="https://www.linkedin.com/in/jbasilico/" rel="noopener ugc nofollow" target="_blank">Justin Basilico</a></p><h1 id="1b73" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Introduction</h1><p id="9df8" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Our mission at Netflix is to entertain the world. Our personalization algorithms play a crucial role in delivering on this mission for all members by recommending the right shows, movies, and games at the right time. This goal extends beyond immediate engagement; we aim to create an experience that brings lasting enjoyment to our members. Traditional recommender systems often optimize for short-term metrics like clicks or engagement, which may not fully capture long-term satisfaction. 
We strive to recommend content that not only engages members in the moment but also enhances their long-term satisfaction, which increases the value they get from Netflix and makes them more likely to remain members.</p><h1 id="36a3" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Recommendations as Contextual Bandit</h1><p id="46d8" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">One simple way we can view recommendations is as a contextual bandit problem. When a member visits, that visit becomes the context; our system selects an action, namely which recommendations to show, and the member then provides various types of feedback. These feedback signals can be immediate (skips, plays, thumbs up/down, or adding items to their playlist) or delayed (completing a show or renewing their subscription). We can define reward functions to reflect the quality of the recommendations from these feedback signals and then train a contextual bandit policy on historical data to maximize the expected reward.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fj pj bh pk"><div class="oy oz pa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*Y8QDcyallv_mh7ylPzXqkA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*Y8QDcyallv_mh7ylPzXqkA.png" /><img alt="" class="bh md pl c" width="700" height="389" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="df36" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Improving Recommendations: Models and Objectives</h1><p id="a852" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">There are many ways that a 
recommendation model can be improved. Improvements may come from more informative input features, more data, different architectures, more parameters, and so forth. In this post, we focus on a less-discussed aspect: improving the recommender objective by defining a reward function that better reflects long-term member satisfaction.</p><h1 id="b2b2" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Retention as Reward?</h1><p id="e108" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Member retention might seem like an obvious reward for optimizing long-term satisfaction because members should stay if they’re satisfied. However, it has several drawbacks:</p><ul class=""><li id="9d43" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pm pn po bk"><strong class="my gv">Noisy</strong>: Retention can be influenced by numerous external factors, such as seasonal trends, marketing campaigns, or personal circumstances unrelated to the service.</li><li id="901d" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk"><strong class="my gv">Low Sensitivity</strong>: Retention is only sensitive for members on the verge of canceling their subscription, not capturing the full spectrum of member satisfaction.</li><li id="64c1" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk"><strong class="my gv">Hard to Attribute</strong>: Members might cancel only after a series of bad recommendations.</li><li id="e86d" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk"><strong class="my gv">Slow to Measure</strong>: We only get one signal per account per month.</li></ul><p id="26e7" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Due to these challenges, 
optimizing for retention alone is impractical.</p><h1 id="c903" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Proxy Rewards</h1><p id="216f" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Instead, we can train our bandit policy to optimize a proxy reward function that is highly aligned with long-term member satisfaction while being sensitive to individual recommendations. The proxy reward <em class="pu">r(user, item)</em> is a function of user interaction with the recommended item. For example, if we recommend “One Piece” and a member plays then subsequently completes and gives it a thumbs-up, a simple proxy reward might be defined as <em class="pu">r(user, item) = f(play, complete, thumb)</em>.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fj pj bh pk"><div class="oy oz pv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*xfSMqEoF0I2_qOPu%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*xfSMqEoF0I2_qOPu%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*xfSMqEoF0I2_qOPu%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*xfSMqEoF0I2_qOPu%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*xfSMqEoF0I2_qOPu%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*xfSMqEoF0I2_qOPu%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*xfSMqEoF0I2_qOPu%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, 
(-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*xfSMqEoF0I2_qOPu 640w, https://miro.medium.com/v2/resize:fit:720/0*xfSMqEoF0I2_qOPu 720w, https://miro.medium.com/v2/resize:fit:750/0*xfSMqEoF0I2_qOPu 750w, https://miro.medium.com/v2/resize:fit:786/0*xfSMqEoF0I2_qOPu 786w, https://miro.medium.com/v2/resize:fit:828/0*xfSMqEoF0I2_qOPu 828w, https://miro.medium.com/v2/resize:fit:1100/0*xfSMqEoF0I2_qOPu 1100w, https://miro.medium.com/v2/resize:fit:1400/0*xfSMqEoF0I2_qOPu 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pl c" width="700" height="186" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h2 id="d77d" class="pw nw gu bf nx px py dy ob pz qa ea of nh qb qc qd nl qe qf qg np qh qi qj qk bk">Click-through rate (CTR)</h2><p id="1f77" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Click-through rate (CTR), or in our case play-through rate, can be viewed as a simple proxy reward where <em class="pu">r(user, item) </em>= 1 if the user clicks a recommendation and 0 otherwise. CTR is a common feedback signal that generally reflects user preference expectations. It is a simple yet strong baseline for many recommendation applications. 
In some cases, such as ads personalization where the click is the target action, CTR may even be a reasonable reward for production models. However, in most cases, over-optimizing CTR can lead to promoting clickbaity items, which may harm long-term satisfaction.</p><h2 id="29ad" class="pw nw gu bf nx px py dy ob pz qa ea of nh qb qc qd nl qe qf qg np qh qi qj qk bk">Beyond CTR</h2><p id="f461" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">To align the proxy reward function more closely with long-term satisfaction, we need to look beyond simple interactions, consider all types of user actions, and understand their true implications for user satisfaction.</p><p id="65da" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We give a few examples in the Netflix context:</p><ul class=""><li id="911d" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pm pn po bk"><strong class="my gv">Fast season completion </strong>✅: Completing a season of a recommended TV show in one day is a strong sign of enjoyment and long-term satisfaction.</li><li id="a04d" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk"><strong class="my gv">Thumbs-down after completion </strong>❌: Completing a TV show in several weeks followed by a thumbs-down indicates low satisfaction despite significant time spent.</li><li id="680e" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk"><strong class="my gv">Playing a movie for just 10 minutes </strong>❓: In this case, the user’s satisfaction is ambiguous. 
The brief engagement might indicate that the user decided to abandon the movie, or it could simply mean the user was interrupted and plans to finish the movie later, perhaps the next day.</li><li id="c0e6" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk"><strong class="my gv">Discovering new genres </strong>✅ ✅: Watching more Korean or game shows after “Squid Game” suggests the user is discovering something new. This discovery is likely even more valuable because it leads to a variety of engagement in a new area for the member.</li></ul><h1 id="179d" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Reward Engineering</h1><p id="305f" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Reward engineering is the iterative process of refining the proxy reward function to align with long-term member satisfaction. It is similar to feature engineering, except that the reward can be derived from data that isn’t available at serving time. Reward engineering involves four stages: hypothesis formation, defining a new proxy reward, training a new bandit policy, and A/B testing. 
Below is a simple example.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fj pj bh pk"><div class="oy oz pv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*YRi8chIaj_OlV-Fd%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*YRi8chIaj_OlV-Fd%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*YRi8chIaj_OlV-Fd%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*YRi8chIaj_OlV-Fd%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*YRi8chIaj_OlV-Fd%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*YRi8chIaj_OlV-Fd%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*YRi8chIaj_OlV-Fd%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*YRi8chIaj_OlV-Fd 640w, https://miro.medium.com/v2/resize:fit:720/0*YRi8chIaj_OlV-Fd 720w, https://miro.medium.com/v2/resize:fit:750/0*YRi8chIaj_OlV-Fd 750w, https://miro.medium.com/v2/resize:fit:786/0*YRi8chIaj_OlV-Fd 786w, https://miro.medium.com/v2/resize:fit:828/0*YRi8chIaj_OlV-Fd 828w, https://miro.medium.com/v2/resize:fit:1100/0*YRi8chIaj_OlV-Fd 1100w, https://miro.medium.com/v2/resize:fit:1400/0*YRi8chIaj_OlV-Fd 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and 
(max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pl c" width="700" height="378" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h1 id="018c" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Challenge: Delayed Feedback</h1><p id="093d" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">User feedback used in the proxy reward function is often delayed or missing. For example, a member may decide to play a recommended show for just a few minutes on the first day and take several weeks to fully complete the show. This completion feedback is therefore delayed. 
Additionally, some user feedback may never occur; while we may wish otherwise, not all members provide a thumbs-up or thumbs-down after completing a show, leaving us uncertain about their level of enjoyment.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fj pj bh pk"><div class="oy oz pv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*cfammHCaAxkEjJhL%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*cfammHCaAxkEjJhL%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*cfammHCaAxkEjJhL%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*cfammHCaAxkEjJhL%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*cfammHCaAxkEjJhL%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*cfammHCaAxkEjJhL%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*cfammHCaAxkEjJhL%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*cfammHCaAxkEjJhL 640w, https://miro.medium.com/v2/resize:fit:720/0*cfammHCaAxkEjJhL 720w, https://miro.medium.com/v2/resize:fit:750/0*cfammHCaAxkEjJhL 750w, https://miro.medium.com/v2/resize:fit:786/0*cfammHCaAxkEjJhL 786w, https://miro.medium.com/v2/resize:fit:828/0*cfammHCaAxkEjJhL 828w, https://miro.medium.com/v2/resize:fit:1100/0*cfammHCaAxkEjJhL 1100w, https://miro.medium.com/v2/resize:fit:1400/0*cfammHCaAxkEjJhL 1400w" sizes="(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pl c" width="700" height="264" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="d3ce" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We could allow a longer window to observe feedback, but how long should we wait for delayed feedback before computing the proxy rewards? If we wait too long (e.g., weeks), we miss the opportunity to update the bandit policy with the latest data. In a highly dynamic environment like Netflix, a stale bandit policy can degrade the user experience and be particularly bad at recommending newer items.</p><h2 id="5011" class="pw nw gu bf nx px py dy ob pz qa ea of nh qb qc qd nl qe qf qg np qh qi qj qk bk">Solution: predict missing feedback</h2><p id="6574" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">We aim to update the bandit policy shortly after making a recommendation while also defining the proxy reward function based on all user feedback, including delayed feedback. Since delayed feedback has not been observed at the time of policy training, we can predict it. This prediction occurs for each training example with delayed feedback, using already observed feedback and other relevant information up to the training time as input features. 
Thus, the prediction also gets better as time progresses.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fj pj bh pk"><div class="oy oz pv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*-dmyaQqosWyMq-UU%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*-dmyaQqosWyMq-UU%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*-dmyaQqosWyMq-UU%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*-dmyaQqosWyMq-UU%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*-dmyaQqosWyMq-UU%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*-dmyaQqosWyMq-UU%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*-dmyaQqosWyMq-UU%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*-dmyaQqosWyMq-UU 640w, https://miro.medium.com/v2/resize:fit:720/0*-dmyaQqosWyMq-UU 720w, https://miro.medium.com/v2/resize:fit:750/0*-dmyaQqosWyMq-UU 750w, https://miro.medium.com/v2/resize:fit:786/0*-dmyaQqosWyMq-UU 786w, https://miro.medium.com/v2/resize:fit:828/0*-dmyaQqosWyMq-UU 828w, https://miro.medium.com/v2/resize:fit:1100/0*-dmyaQqosWyMq-UU 1100w, https://miro.medium.com/v2/resize:fit:1400/0*-dmyaQqosWyMq-UU 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, 
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md pl c" width="700" height="309" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><p id="c5ce" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The proxy reward is then calculated for each training example using both observed and predicted feedback. These training examples are used to update the bandit policy.</p><p id="d041" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">But aren’t we still only relying on observed feedback in the proxy reward function? Yes, because delayed feedback is predicted based on observed feedback. However, it is simpler to reason about rewards using all feedback directly. For instance, the delayed thumbs-up prediction model may be a complex neural network that takes into account all observed feedback (e.g., short-term play patterns). It’s more straightforward to define the proxy reward as a simple function of the thumbs-up feedback rather than a complex function of short-term interaction patterns. 
The prediction model can also be used to adjust for potential biases in how feedback is provided.</p><p id="c920" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The reward engineering diagram is updated with an optional delayed feedback prediction step.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fj pj bh pk"><div class="oy oz ql"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*Rnu7B69daM-JY13CdtM6IQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*Rnu7B69daM-JY13CdtM6IQ.png" /><img alt="" class="bh md pl c" width="700" height="371" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure><h2 id="dc5a" class="pw nw gu bf nx px py dy ob pz qa ea of nh qb qc qd nl qe qf qg np qh qi qj qk bk">Two types of ML models</h2><p id="8005" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">It’s worth noting that this approach employs two types of ML models:</p><ul class=""><li id="8fd2" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pm pn po bk"><strong class="my gv">Delayed Feedback Prediction Models</strong>: These models predict <em class="pu">p(final feedback | observed feedbacks)</em>. The predictions are used to define and compute proxy rewards for bandit policy training examples. 
As a result, these models are used offline during the bandit policy training.</li><li id="4114" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk"><strong class="my gv">Bandit Policy Models</strong>: These models are used in the bandit policy <em class="pu">π(item | user; r)</em> to generate recommendations online and in real time.</li></ul><h1 id="b98d" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Challenge: Online-Offline Metric Disparity</h1><p id="7da3" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">Improved input features or neural network architectures often lead to better offline model metrics (e.g., AUC for classification models). However, when these improved models are subjected to A/B testing, we often observe flat or even negative movement in the online metrics that quantify long-term member satisfaction.</p><p id="3112" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">This online-offline metric disparity usually occurs when the proxy reward used in the recommendation policy is not fully aligned with long-term member satisfaction. In such cases, a model may achieve higher proxy rewards (offline metrics) but result in worse long-term member satisfaction (online metrics).</p><p id="d753" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Nevertheless, the model improvement is genuine. One approach to resolve this is to further refine the proxy reward definition to align better with the improved model. When this tuning results in positive online metrics, the model improvement can be effectively productized. 
See [1] for more discussions on this challenge.</p><h1 id="576c" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">Summary and Open Questions</h1><p id="24ec" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gn bk">In this post, we provided an overview of our reward engineering efforts to align Netflix recommendations with long-term member satisfaction. While retention remains our north star, it is not easy to optimize directly. Therefore, our efforts focus on defining a proxy reward that is aligned with long-term satisfaction and sensitive to individual recommendations. Finally, we discussed the unique challenge of delayed user feedback at Netflix and proposed an approach that has proven effective for us. Refer to [2] for an earlier overview of the reward innovation efforts at Netflix.</p><p id="f30d" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">As we continue to improve our recommendations, several open questions remain:</p><ul class=""><li id="b35e" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pm pn po bk">Can we learn a good proxy reward function automatically by correlating behavior with retention?</li><li id="5372" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk">How long should we wait for delayed feedback before using its predicted value in policy training?</li><li id="9605" class="mw mx gu my b mz pp nb nc nd pq nf ng nh pr nj nk nl ps nn no np pt nr ns nt pm pn po bk">How can we leverage Reinforcement Learning to further align the policy with long-term satisfaction?</li></ul><h1 id="74f9" class="nv nw gu bf nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bk">References</h1><p id="d6c0" class="pw-post-body-paragraph mw mx gu my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns 
nt gn bk">[1] <a class="af nu" href="https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/18140" rel="noopener ugc nofollow" target="_blank">Deep learning for recommender systems: A Netflix case study</a>. AI Magazine 2021. Harald Steck, Linas Baltrunas, Ehtsham Elahi, Dawen Liang, Yves Raimond, Justin Basilico.</p><p id="76f7" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">[2] <a class="af nu" href="https://web.archive.org/web/20231011142826id_/https://dl.acm.org/doi/pdf/10.1145/3604915.3608873" rel="noopener ugc nofollow" target="_blank">Reward innovation for long-term member satisfaction</a>. RecSys 2023. Gary Tang, Jiangwei Pan, Henry Wang, Justin Basilico.</p></div>]]></description>
      <link>https://netflixtechblog.com/recommending-for-long-term-member-satisfaction-at-netflix-ac15cada49ef</link>
      <guid>https://netflixtechblog.com/recommending-for-long-term-member-satisfaction-at-netflix-ac15cada49ef</guid>
      <pubDate>Thu, 29 Aug 2024 03:01:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Improve Your Next Experiment by Learning Better Proxy Metrics From Past Experiments]]></title>
      <description><![CDATA[<div class="ab cb"><div class="ci bh fz ga gb gc"><div><div></div><p id="5ed4" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">By </em><a class="af nv" href="https://www.linkedin.com/in/aurelien-bibaut/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Aurélien Bibaut</em></a><em class="nu">, </em><a class="af nv" href="https://www.linkedin.com/in/winston-chou-6491b0168/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Winston Chou</em></a><em class="nu">, </em><a class="af nv" href="https://www.linkedin.com/in/simon-ejdemyr-22b920123/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Simon Ejdemyr</em></a><em class="nu">, and </em><a class="af nv" href="https://www.linkedin.com/in/kallus/" rel="noopener ugc nofollow" target="_blank"><em class="nu">Nathan Kallus</em></a></p></div></div><div class="nw"><div class="ab cb"><div class="ly nx lz ny ma nz cf oa cg ob ci bh"><figure class="of og oh oi oj nw ok ol paragraph-image"><div role="button" tabindex="0" class="om on fj oo bh op"><div class="oc od oe"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*m_lKjIe460GlWr5JseoQzw.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*m_lKjIe460GlWr5JseoQzw.jpeg" /><img alt="" class="bh md oq c" width="1000" height="385" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div></figure></div></div></div><div class="ab cb"><div class="ci bh fz ga gb gc"><p id="a71b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We are excited to share <a class="af nv" href="https://arxiv.org/pdf/2402.17637" rel="noopener ugc nofollow" target="_blank">our work</a> on how to learn good proxy metrics from historical experiments at <a class="af nv" 
href="https://kdd2024.kdd.org/" rel="noopener ugc nofollow" target="_blank">KDD 2024</a>. This work addresses a fundamental question for technology companies and academic researchers alike: how do we establish that a treatment that improves short-term (statistically sensitive) outcomes also improves long-term (statistically insensitive) outcomes? Or, faced with multiple short-term outcomes, how do we optimally trade them off for long-term benefit?</p><p id="2b17" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">For example, in an A/B test, you may observe that a product change improves the click-through rate. However, the test does not provide enough signal to measure a change in long-term retention, leaving you in the dark as to whether this treatment makes users more satisfied with your service. The click-through rate is a <em class="nu">proxy metric</em> (<em class="nu">S</em>, for surrogate, in our paper) while retention is a downstream <em class="nu">business outcome </em>or <em class="nu">north star metric </em>(<em class="nu">Y</em>). We may even have several proxy metrics, such as other types of clicks or the length of engagement after click. 
Taken together, these form a <em class="nu">vector</em> of proxy metrics.</p><p id="ca1c" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">The goal of our work is to understand the true relationship between the proxy metric(s) and the north star metric — so that we can assess a proxy’s ability to stand in for the north star metric, learn how to combine multiple metrics into a single best one, and better explore and compare different proxies.</p><p id="8738" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Several intuitive approaches to understanding this relationship have surprising pitfalls:</p><ul class=""><li id="9fd6" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt or os ot bk"><strong class="my gv">Looking only at user-level correlations between the proxy <em class="nu">S </em>and north star <em class="nu">Y</em>.</strong> Continuing the example from above, you may find that users with a higher click-through rate also tend to have a higher retention. But this does not mean that a <em class="nu">product change </em>that improves the click-through rate will also improve retention (in fact, promoting clickbait may have the opposite effect). This is because, as any introductory causal inference class will tell you, there are many confounders between <em class="nu">S </em>and <em class="nu">Y</em> — many of which you can never reliably observe and control for.</li><li id="1221" class="mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt or os ot bk"><strong class="my gv">Looking naively at treatment effect correlations between <em class="nu">S </em>and <em class="nu">Y.</em></strong> Suppose you are lucky enough to have many historical A/B tests. 
Further imagine the ordinary least squares (OLS) regression line through a scatter plot of <em class="nu">Y </em>on <em class="nu">S</em> in which each point represents the (<em class="nu">S</em>,<em class="nu">Y</em>)-treatment effect from a previous test. Even if you find that this line has a positive slope, you unfortunately <em class="nu">cannot</em> conclude that product changes that improve <em class="nu">S </em>will also improve <em class="nu">Y</em>. The reason for this is correlated measurement error — if <em class="nu">S</em> and <em class="nu">Y</em> are positively correlated in the population, then treatment arms that happen to have more users with high <em class="nu">S</em> will also have more users with high <em class="nu">Y</em>.</li></ul><p id="31fc" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Between these naive approaches, we find that the second one is the easier trap to fall into. This is because the dangers of the first approach are well-known, whereas covariances between <em class="nu">estimated</em> treatment effects can appear misleadingly causal. In reality, these covariances can be severely biased compared to what we actually care about: covariances between <em class="nu">true</em> treatment effects. In the extreme — such as when the negative effects of clickbait are substantial but clickiness and retention are highly correlated at the user level — the true relationship between <em class="nu">S </em>and <em class="nu">Y </em>can be negative even if the OLS slope is positive. Only more data per experiment could diminish this bias — using more experiments as data points will only yield more precise estimates of the badly biased slope. 
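This trap is easy to reproduce in simulation. The sketch below (all numbers are synthetic, not Netflix data) draws true treatment effects with a negative S-to-Y slope, adds positively correlated per-experiment sampling error, and shows the naive OLS slope flipping sign unless each experiment has many users:

```python
import numpy as np

rng = np.random.default_rng(7)

def ols_slope(x, y):
    """Slope of the OLS regression line of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def simulate(n_exps=500, n_users=50):
    """Per-experiment *estimated* treatment effects on a proxy S and a
    north star Y. True effects are negatively related (slope -0.5, the
    'clickbait' case), but S and Y are positively correlated across
    users, so the two estimation errors are positively correlated too."""
    tau_s = rng.normal(0.0, 0.05, n_exps)   # true effect on S (small!)
    tau_y = -0.5 * tau_s                    # true effect on Y
    user_cov = np.array([[4.0, 3.0],        # user-level (S, Y) covariance,
                         [3.0, 4.0]])       # correlation 0.75
    # Sampling error of a two-arm difference-in-means scales as 2/n_users.
    err = rng.multivariate_normal([0, 0], 2 * user_cov / n_users, n_exps)
    return tau_s + err[:, 0], tau_y + err[:, 1]

s_small, y_small = simulate(n_users=50)       # small experiments
s_big, y_big = simulate(n_users=200_000)      # very large experiments

slope_small = ols_slope(s_small, y_small)  # positive: sign is flipped
slope_big = ols_slope(s_big, y_big)        # close to the true -0.5
```

Note that raising `n_exps` leaves `slope_small` biased; only raising `n_users` shrinks the measurement-error covariance relative to the true-effect covariance, which is exactly the point made above.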
At first glance, this would appear to imperil any hope of using existing experiments to detect the relationship.</p><figure class="of og oh oi oj nw oc od paragraph-image"><div role="button" tabindex="0" class="om on fj oo bh op"><div class="oc od oz"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*o0Br8UYxvXPga-Sh%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*o0Br8UYxvXPga-Sh%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*o0Br8UYxvXPga-Sh%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*o0Br8UYxvXPga-Sh%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*o0Br8UYxvXPga-Sh%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*o0Br8UYxvXPga-Sh%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*o0Br8UYxvXPga-Sh%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*o0Br8UYxvXPga-Sh 640w, https://miro.medium.com/v2/resize:fit:720/0*o0Br8UYxvXPga-Sh 720w, https://miro.medium.com/v2/resize:fit:750/0*o0Br8UYxvXPga-Sh 750w, https://miro.medium.com/v2/resize:fit:786/0*o0Br8UYxvXPga-Sh 786w, https://miro.medium.com/v2/resize:fit:828/0*o0Br8UYxvXPga-Sh 828w, https://miro.medium.com/v2/resize:fit:1100/0*o0Br8UYxvXPga-Sh 1100w, https://miro.medium.com/v2/resize:fit:1400/0*o0Br8UYxvXPga-Sh 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /><img alt="" class="bh md oq c" width="700" height="241" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div></div><figcaption class="pa ff pb oc od pc pd bf b bg z du"><em class="pe">This figure shows a hypothetical treatment effect covariance matrix between S and Y (white line; negative correlation), a unit-level sampling covariance matrix creating correlated measurement errors between these metrics (black line; positive correlation), and the covariance matrix of estimated treatment effects which is a weighted combination of the first two (orange line; no correlation).</em></figcaption></figure><p id="8f12" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">To overcome this bias, we propose better ways to leverage historical experiments, inspired by techniques from the literature on weak instrumental variables. 
More specifically, we show that three estimators are consistent for the true proxy/north-star relationship under different constraints (the <a class="af nv" href="https://arxiv.org/pdf/2402.17637" rel="noopener ugc nofollow" target="_blank">paper</a> provides more details and should be helpful for practitioners interested in choosing the best estimator for their setting):</p><ul class=""><li id="7bee" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt or os ot bk">A <strong class="my gv">Total Covariance (TC) </strong>estimator allows us to estimate the OLS slope from a scatter plot of <em class="nu">true </em>treatment effects by subtracting the scaled measurement error covariance from the covariance of estimated treatment effects. Under the assumption that the correlated measurement error is the same across experiments (homogeneous covariances), the bias of this estimator is inversely proportional to the total number of units across all experiments, as opposed to the number of members per experiment.</li><li id="15f5" class="mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt or os ot bk"><strong class="my gv">Jackknife Instrumental Variables Estimation (JIVE)</strong> converges to the same OLS slope as the TC estimator but does not require the assumption of homogeneous covariances. JIVE eliminates correlated measurement error by removing each observation’s data from the computation of its instrumented surrogate values.</li><li id="a4aa" class="mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt or os ot bk">A <strong class="my gv">Limited Information Maximum Likelihood (LIML) </strong>estimator is statistically efficient as long as there are no direct effects between the treatment and <em class="nu">Y</em> (that is, <em class="nu">S</em> fully mediates all treatment effects on <em class="nu">Y</em>). 
We find that LIML is highly sensitive to this assumption and recommend TC or JIVE for most applications.</li></ul><p id="e658" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">Our methods yield linear structural models of treatment effects that are easy to interpret. As such, they are well-suited to the decentralized and rapidly-evolving practice of experimentation at Netflix, which runs <a class="af nv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/experimentation-is-a-major-focus-of-data-science-across-netflix-f67923f8e985">thousands of experiments per year</a> on many diverse parts of the business. Each area of experimentation is staffed by independent Data Science and Engineering teams. While every team ultimately cares about the same north star metrics (e.g., long-term revenue), it is highly impractical for most teams to measure these in short-term A/B tests. Therefore, each has also developed proxies that are more sensitive and directly relevant to their work (e.g., user engagement or latency). To complicate matters more, teams are constantly innovating on these secondary metrics to find the right balance of sensitivity and long-term impact.</p><p id="7ad9" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">In this decentralized environment, linear models of treatment effects are a highly useful tool for coordinating efforts around proxy metrics and aligning them towards the north star:</p><ol class=""><li id="6178" class="mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt pf os ot bk"><strong class="my gv">Managing metric tradeoffs.</strong> Because experiments in one area can affect metrics in another area, there is a need to measure all secondary metrics in all tests, but also to understand the relative impact of these metrics on the north star. 
This is so we can inform decision-making when one metric trades off against another metric.</li><li id="96f2" class="mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt pf os ot bk"><strong class="my gv">Informing metrics innovation.</strong> To minimize wasted effort on metric development, it is also important to understand how metrics correlate with the north star “net of” existing metrics.</li><li id="5c65" class="mw mx gu my b mz ou nb nc nd ov nf ng nh ow nj nk nl ox nn no np oy nr ns nt pf os ot bk"><strong class="my gv">Enabling teams to work independently.</strong> Lastly, teams need simple tools in order to iterate on their own metrics. Teams may come up with dozens of variations of secondary metrics, and slow, complicated tools for evaluating these variations are unlikely to be adopted. Conversely, our models are easy and fast to fit, and are actively used to develop proxy metrics at Netflix.</li></ol><p id="0d35" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk">We are thrilled about the research and implementation of these methods at Netflix — while also continuing to strive for <strong class="my gv"><em class="nu">great and always better</em></strong>, per our <a class="af nv" href="https://jobs.netflix.com/culture" rel="noopener ugc nofollow" target="_blank">culture</a>. For example, we still have some way to go to develop a more flexible data architecture to streamline the application of these methods within Netflix. Interested in helping us? 
See our <a class="af nv" href="https://jobs.netflix.com/" rel="noopener ugc nofollow" target="_blank">open job postings</a>!</p><p id="f30b" class="pw-post-body-paragraph mw mx gu my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gn bk"><em class="nu">For feedback on this blog post and for supporting and making this work better, we thank Apoorva Lal, Martin Tingley, Patric Glynn, Richard McDowell, Travis Brooks, and Ayal Chen-Zion.</em></p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/improve-your-next-experiment-by-learning-better-proxy-metrics-from-past-experiments-64c786c2a3ac</link>
      <guid>https://netflixtechblog.com/improve-your-next-experiment-by-learning-better-proxy-metrics-from-past-experiments-64c786c2a3ac</guid>
      <pubDate>Mon, 26 Aug 2024 17:46:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Investigation of a Cross-regional Network Performance Issue]]></title>
      <description><![CDATA[<div><div></div><p id="680a" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj"><a class="af nu" href="https://www.linkedin.com/in/hechaoli/" rel="noopener ugc nofollow" target="_blank">Hechao Li</a>, <a class="af nu" href="https://www.linkedin.com/in/rogercruz/" rel="noopener ugc nofollow" target="_blank">Roger Cruz</a></p><h1 id="a44a" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">Cloud Networking Topology</h1><p id="8b0b" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">Netflix operates a highly efficient cloud computing infrastructure that supports a wide array of applications essential for our SVOD (Subscription Video on Demand), live streaming and gaming services. Utilizing Amazon AWS, our infrastructure is hosted across multiple geographic regions worldwide. This global distribution allows our applications to deliver content more effectively by serving traffic closer to our customers. 
Like any distributed system, our applications occasionally require data synchronization between regions to maintain seamless service delivery.</p><p id="1fe6" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">The following diagram shows a simplified cloud network topology for cross-region traffic.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fi pj bg pk"><div class="oy oz pa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*RpHklRseVBeBJG6u%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*RpHklRseVBeBJG6u%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*RpHklRseVBeBJG6u%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*RpHklRseVBeBJG6u%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*RpHklRseVBeBJG6u%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*RpHklRseVBeBJG6u%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*RpHklRseVBeBJG6u%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*RpHklRseVBeBJG6u 640w, https://miro.medium.com/v2/resize:fit:720/0*RpHklRseVBeBJG6u 720w, https://miro.medium.com/v2/resize:fit:750/0*RpHklRseVBeBJG6u 750w, https://miro.medium.com/v2/resize:fit:786/0*RpHklRseVBeBJG6u 786w, https://miro.medium.com/v2/resize:fit:828/0*RpHklRseVBeBJG6u 828w, 
https://miro.medium.com/v2/resize:fit:1100/0*RpHklRseVBeBJG6u 1100w, https://miro.medium.com/v2/resize:fit:1400/0*RpHklRseVBeBJG6u 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><h1 id="f7b1" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">The Problem At First Glance</h1><p id="f312" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">Our Cloud Network Engineering on-call team received a request to address a network issue affecting an application with cross-region traffic. Initially, it appeared that the application was experiencing timeouts, likely due to suboptimal network performance. As we all know, the longer the network path, the more devices the packets traverse, increasing the likelihood of issues. For this incident, <strong class="my gu">the client application is located in an internal subnet in the US region while the server application is located in an external subnet in a European region</strong>. Therefore, it is natural to blame the network since packets need to travel long distances through the internet.</p><p id="d4f7" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">As network engineers, our initial reaction when the network is blamed is typically, “No, it can’t be the network,” and our task is to prove it. 
Given that there were no recent changes to the network infrastructure and no reported AWS issues impacting other applications, the on-call engineer suspected a noisy neighbor issue and sought assistance from the Host Network Engineering team.</p><h1 id="aa0a" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">Blame the Neighbors</h1><p id="de5a" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">In this context, a noisy neighbor issue occurs when a container shares a host with other network-intensive containers. <strong class="my gu">These noisy neighbors consume excessive network resources, causing other containers on the same host to suffer from degraded network performance. </strong>Despite each container having bandwidth limitations, oversubscription can still lead to such issues.</p><p id="f159" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Upon investigating other containers on the same host — most of which were part of the same application — we quickly eliminated the possibility of noisy neighbors. <strong class="my gu">The network throughput for both the problematic container and all others was significantly below the set bandwidth limits.</strong> We attempted to resolve the issue by removing these bandwidth limits, allowing the application to utilize as much bandwidth as necessary. However, the problem persisted.</p><h1 id="8f96" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">Blame the Network</h1><p id="cdc9" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">We observed some <strong class="my gu">TCP packets in the network marked with the RST flag</strong>, a flag indicating that a connection should be immediately terminated. 
Although the frequency of these packets was not alarmingly high, the presence of any RST packets still raised suspicion on the network. To determine whether this was indeed a network-induced issue, we conducted a tcpdump on the client. In the packet capture file, we spotted one TCP stream that was closed after exactly 30 seconds.</p><p id="e4ad" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">SYN at 18:47:06</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fi pj bg pk"><div class="oy oz pm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*ZLnTrJNuCBe4tUry%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*ZLnTrJNuCBe4tUry%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*ZLnTrJNuCBe4tUry%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*ZLnTrJNuCBe4tUry%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*ZLnTrJNuCBe4tUry%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*ZLnTrJNuCBe4tUry%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*ZLnTrJNuCBe4tUry%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*ZLnTrJNuCBe4tUry 640w, https://miro.medium.com/v2/resize:fit:720/0*ZLnTrJNuCBe4tUry 720w, https://miro.medium.com/v2/resize:fit:750/0*ZLnTrJNuCBe4tUry 750w, 
https://miro.medium.com/v2/resize:fit:786/0*ZLnTrJNuCBe4tUry 786w, https://miro.medium.com/v2/resize:fit:828/0*ZLnTrJNuCBe4tUry 828w, https://miro.medium.com/v2/resize:fit:1100/0*ZLnTrJNuCBe4tUry 1100w, https://miro.medium.com/v2/resize:fit:1400/0*ZLnTrJNuCBe4tUry 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="6af6" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">After the 3-way handshake (SYN,SYN-ACK,ACK), the traffic started flowing normally. 
Nothing strange until FIN at 18:47:36 (30 seconds later)</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fi pj bg pk"><div class="oy oz pm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*0-aCcRviD0JHcngn%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*0-aCcRviD0JHcngn%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*0-aCcRviD0JHcngn%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*0-aCcRviD0JHcngn%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*0-aCcRviD0JHcngn%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*0-aCcRviD0JHcngn%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*0-aCcRviD0JHcngn%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*0-aCcRviD0JHcngn 640w, https://miro.medium.com/v2/resize:fit:720/0*0-aCcRviD0JHcngn 720w, https://miro.medium.com/v2/resize:fit:750/0*0-aCcRviD0JHcngn 750w, https://miro.medium.com/v2/resize:fit:786/0*0-aCcRviD0JHcngn 786w, https://miro.medium.com/v2/resize:fit:828/0*0-aCcRviD0JHcngn 828w, https://miro.medium.com/v2/resize:fit:1100/0*0-aCcRviD0JHcngn 1100w, https://miro.medium.com/v2/resize:fit:1400/0*0-aCcRviD0JHcngn 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, 
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="f0da" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">The packet capture results clearly indicated that <strong class="my gu">it was the client application that initiated the connection termination by sending a FIN packet</strong>. Following this, the server continued to send data; however, since the client had already decided to close the connection, it responded with RST packets to all subsequent data from the server.</p><p id="c1f7" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">To ensure that the client wasn’t closing the connection due to packet loss, we also conducted a packet capture on the server side to verify that all packets sent by the server were received. This task was complicated by the fact that the packets passed through a NAT gateway (NGW), which meant that on the server side, the client’s IP and port appeared as those of the NGW, differing from those seen on the client side. 
Consequently, to accurately match TCP streams, <strong class="my gu">we needed to identify the TCP stream on the client side, locate the raw TCP sequence number, and then use this number as a filter on the server side to find the corresponding TCP stream.</strong></p><p id="92bd" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">With packet capture results from both the client and server sides, we confirmed that <strong class="my gu">all packets sent by the server were correctly received before the client sent a FIN</strong>.</p><p id="1f6d" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Now, from the network point of view, the story is clear. The client initiated the connection requesting data from the server. The server kept sending data to the client with no problem. However, at a certain point, <strong class="my gu">despite the server still having data to send, the client chose to terminate the reception of data</strong>. This led us to suspect that the issue might be related to the client application itself.</p><h1 id="6807" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">Blame the Application</h1><p id="e481" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">In order to fully understand the problem, we now need to understand how the application works. As shown in the diagram below, the application runs in the us-east-1 region. 
<strong class="my gu">It reads data from cross-region servers and writes the data to consumers within the same region.</strong> The client runs in containers, whereas the servers are EC2 instances.</p><p id="e82c" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj"><strong class="my gu">Notably, the cross-region read was problematic </strong>while the write path was smooth. Most importantly, there is a 30-second application-level timeout for reading the data. The application (client) errors out if it fails to read an initial batch of data from the servers within 30 seconds. When we increased this timeout to 60 seconds, everything worked as expected. <strong class="my gu">This explains why the client initiated a FIN — because it lost patience waiting for the server to transfer data</strong>.</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fi pj bg pk"><div class="oy oz pn"><picture><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*zNmSGl1_5vtOHETn 640w, https://miro.medium.com/v2/resize:fit:720/0*zNmSGl1_5vtOHETn 720w, https://miro.medium.com/v2/resize:fit:750/0*zNmSGl1_5vtOHETn 750w, https://miro.medium.com/v2/resize:fit:786/0*zNmSGl1_5vtOHETn 786w, https://miro.medium.com/v2/resize:fit:828/0*zNmSGl1_5vtOHETn 828w, https://miro.medium.com/v2/resize:fit:1100/0*zNmSGl1_5vtOHETn 1100w, https://miro.medium.com/v2/resize:fit:1400/0*zNmSGl1_5vtOHETn 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, 700px" /><img src="https://miro.medium.com/v2/resize:fit:1400/0*zNmSGl1_5vtOHETn" alt="image" /></picture></div></div></figure><p id="da9f" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Could it be that the server was updated to send data more slowly? Could it be that the client application was updated to receive data more slowly? Could it be that the data volume became too large to be completely sent out within 30 seconds? 
Sadly, <strong class="my gu">we received negative answers for all 3 questions from the application owner.</strong> The server had been operating without changes for over a year, there were no significant updates in the latest rollout of the client, and the data volume had remained consistent.</p><h1 id="76c8" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">Blame the Kernel</h1><p id="efa1" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">If both the network and the application weren’t changed recently, then what changed? In fact, we discovered that the issue coincided with a recent <strong class="my gu">Linux kernel upgrade from version 6.5.13 to 6.6.10</strong>. To test this hypothesis, we rolled back the kernel upgrade and it did restore normal operation to the application.</p><p id="9ee2" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Honestly speaking, at that time I didn’t believe it was a kernel bug because I assumed the TCP implementation in the kernel should be solid and stable (Spoiler alert: How wrong was I!). But we were also out of ideas from other angles.</p><p id="f536" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">There were about 14k commits between the good and bad kernel versions. Engineers on the team methodically and diligently bisected between the two versions. When the bisecting was narrowed to a couple of commits, <strong class="my gu">a change with “tcp” in its commit message caught our attention. 
The final bisecting confirmed that </strong><a class="af nu" href="https://lore.kernel.org/netdev/20230717152917.751987-1-edumazet@google.com/T/" rel="noopener ugc nofollow" target="_blank"><strong class="my gu">this commit</strong></a><strong class="my gu"> was our culprit</strong>.</p><p id="cc01" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Interestingly, while reviewing the email history related to this commit, we found that <a class="af nu" href="https://github.com/eventlet/eventlet/issues/821" rel="noopener ugc nofollow" target="_blank">another user had reported a Python test failure following the same kernel upgrade</a>. Although their solution was not directly applicable to our situation, it suggested that <strong class="my gu">a simpler test might also reproduce our problem</strong>. Using <em class="po">strace</em>, we observed that the application configured the following socket options when communicating with the server:</p><pre class="pb pc pd pe pf pp pq pr bo ps ba bj">[pid 1699] setsockopt(917, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0<br />[pid 1699] setsockopt(917, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0<br />[pid 1699] setsockopt(917, SOL_SOCKET, SO_SNDBUF, [131072], 4) = 0<br />[pid 1699] setsockopt(917, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0<br />[pid 1699] setsockopt(917, SOL_TCP, TCP_NODELAY, [1], 4) = 0</pre><p id="3e18" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">We then developed a minimal client-server C application that transfers a file from the server to the client, with the client configuring the same set of socket options. During testing, we used a 10M file, which represents the volume of data typically transferred within 30 seconds before the client issues a FIN. 
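</p><p class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">As an aside, the same socket setup can be mirrored in a few lines of Python for experimentation. This is only an illustrative sketch (our actual repro client was written in C), and the doubling behavior noted in the comment is Linux-specific:</p>

```python
import socket

# Mirror the socket options observed in the strace output above
s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 131072)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)  # the option at the heart of this story
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# On Linux, the kernel doubles SO_RCVBUF on set (see man 7 socket),
# so getsockopt reports 131072 here
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```

<p class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">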
<strong class="my gu">On the old kernel, this cross-region transfer completed in 22 seconds, whereas on the new kernel, it took 39 seconds to finish.</strong></p><h1 id="ed1b" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">The Root Cause</h1><p id="1490" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">With the help of the minimal reproduction setup, we were ultimately able to pinpoint the root cause of the problem. In order to understand the root cause, it’s essential to have a grasp of the TCP receive window.</p><h2 id="6a6e" class="py nw gt be nx pz qa dx ob qb qc dz of nh qd qe qf nl qg qh qi np qj qk ql qm bj">TCP Receive Window</h2><p id="a997" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">Simply put, <strong class="my gu">the TCP receive window is how the receiver tells the sender “This is how many bytes you can send me without me ACKing any of them”</strong>. 
Assuming the sender is the server and the receiver is the client, then we have:</p><figure class="pb pc pd pe pf pg oy oz paragraph-image"><div role="button" tabindex="0" class="ph pi fi pj bg pk"><div class="oy oz qn"><picture><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*98gJP81W46nhdonq 640w, https://miro.medium.com/v2/resize:fit:720/0*98gJP81W46nhdonq 720w, https://miro.medium.com/v2/resize:fit:750/0*98gJP81W46nhdonq 750w, https://miro.medium.com/v2/resize:fit:786/0*98gJP81W46nhdonq 786w, https://miro.medium.com/v2/resize:fit:828/0*98gJP81W46nhdonq 828w, https://miro.medium.com/v2/resize:fit:1100/0*98gJP81W46nhdonq 1100w, https://miro.medium.com/v2/resize:fit:1400/0*98gJP81W46nhdonq 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, 700px" /><img src="https://miro.medium.com/v2/resize:fit:1400/0*98gJP81W46nhdonq" alt="image" /></picture></div></div></figure><h2 id="f567" class="py nw gt be nx pz qa dx ob qb qc dz of nh qd qe qf nl qg qh qi np qj qk ql qm bj">The Window Size</h2><p id="0d83" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">Now that we know the TCP receive window size could affect the throughput, the question is: how is the window size calculated? As an application writer, you can’t decide the window size; however, you can decide how much memory you want to use for buffering received data. This is configured using the <strong class="my gu"><em class="po">SO_RCVBUF</em> socket option</strong> we saw in the <em class="po">strace</em> result above. However, note that the value of this option means how much <strong class="my gu">application data</strong> can be queued in the receive buffer. In <a class="af nu" href="https://man7.org/linux/man-pages/man7/socket.7.html" rel="noopener ugc nofollow" target="_blank">man 7 socket</a>, there is:</p><blockquote class="qo qp qq"><p id="a73f" class="mw mx po my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">SO_RCVBUF</p><p id="ed59" class="mw mx po my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Sets or gets the maximum socket receive buffer in bytes.<br /> The kernel doubles this value (to allow space for<br /> bookkeeping overhead) when it is set using setsockopt(2),<br /> and this doubled value is returned by getsockopt(2). The<br /> default value is set by the<br /> /proc/sys/net/core/rmem_default file, and the maximum<br /> allowed value is set by the /proc/sys/net/core/rmem_max<br /> file. 
The minimum (doubled) value for this option is 256.</p></blockquote><p id="71af" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">This means, when the user gives a value X, then <a class="af nu" href="https://elixir.bootlin.com/linux/v6.9-rc1/source/net/core/sock.c#L976" rel="noopener ugc nofollow" target="_blank">the kernel stores 2X in the variable sk-&gt;sk_rcvbuf</a>. In other words, <strong class="my gu">the kernel assumes that the bookkeeping overhead is as much as the actual data (i.e. 50% of the sk_rcvbuf)</strong>.</p><h2 id="7337" class="py nw gt be nx pz qa dx ob qb qc dz of nh qd qe qf nl qg qh qi np qj qk ql qm bj">sysctl_tcp_adv_win_scale</h2><p id="9a3d" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">However, the assumption above may not be true because the actual overhead really depends on a lot of factors such as Maximum Transmission Unit (MTU). Therefore, <strong class="my gu">the kernel provided this <em class="po">sysctl_tcp_adv_win_scale</em> which you can use to tell the kernel what the actual overhead is</strong>. (I believe 99% of people also don’t know how to set this parameter correctly and I’m definitely one of them. 
You’re the kernel, if you don’t know the overhead, how can you expect me to know?).</p><p id="4e58" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">According to <a class="af nu" href="https://docs.kernel.org/networking/ip-sysctl.html" rel="noopener ugc nofollow" target="_blank">the <em class="po">sysctl</em> doc</a>,</p><blockquote class="qo qp qq"><p id="23e2" class="mw mx po my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj"><em class="gt">tcp_adv_win_scale — INTEGER</em></p><p id="a18b" class="mw mx po my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj"><em class="gt">Obsolete since linux-6.6 Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale &gt; 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is &lt;= 0.</em></p><p id="0786" class="mw mx po my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj"><em class="gt">Possible values are [-31, 31], inclusive.</em></p><p id="2f88" class="mw mx po my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj"><em class="gt">Default: 1</em></p></blockquote><p id="ebc0" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">For 99% of people, we’re just using the default value 1, which in turn means the overhead is calculated by <em class="po">rcvbuf/2^tcp_adv_win_scale = 1/2 * rcvbuf</em>. This matches the assumption when setting the <em class="po">SO_RCVBUF</em> value.</p><p id="289d" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Let’s recap. Assume you set <em class="po">SO_RCVBUF</em> to 65536, which is the value set by the application as shown in the <em class="po">setsockopt</em> syscall. 
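</p><p class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">This pre-6.6 arithmetic (buffer doubling plus the <em class="po">tcp_adv_win_scale</em> overhead rule) is small enough to sketch in a few lines of Python. Note this is a simplification for illustration, not the kernel’s actual code:</p>

```python
def rcv_window(so_rcvbuf: int, tcp_adv_win_scale: int = 1) -> int:
    """Approximate receive window on pre-6.6 kernels (simplified)."""
    rcvbuf = 2 * so_rcvbuf  # the kernel doubles SO_RCVBUF when it is set
    if tcp_adv_win_scale > 0:
        overhead = rcvbuf >> tcp_adv_win_scale  # bytes / 2^scale
    else:
        overhead = rcvbuf - (rcvbuf >> -tcp_adv_win_scale)
    return rcvbuf - overhead

print(rcv_window(65536))  # -> 65536 with the default tcp_adv_win_scale = 1
```

<p class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">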
Then we have:</p><ul class=""><li id="e79f" class="mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qr qs qt bj">SO_RCVBUF = 65536</li><li id="72c0" class="mw mx gt my b mz qu nb nc nd qv nf ng nh qw nj nk nl qx nn no np qy nr ns nt qr qs qt bj">rcvbuf = 2 * 65536 = 131072</li><li id="e031" class="mw mx gt my b mz qu nb nc nd qv nf ng nh qw nj nk nl qx nn no np qy nr ns nt qr qs qt bj">overhead = rcvbuf / 2 = 131072 / 2 = 65536</li><li id="7e11" class="mw mx gt my b mz qu nb nc nd qv nf ng nh qw nj nk nl qx nn no np qy nr ns nt qr qs qt bj">receive window size = rcvbuf - overhead = 131072 - 65536 = 65536</li></ul><p id="6c6d" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">(Note, this calculation is simplified. The real calculation is more complex.)</p><p id="4e0b" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">In short, the receive window size before the kernel upgrade was 65536. With this window size, the application was able to transfer 10M data within 30 seconds.</p><h2 id="0ac0" class="py nw gt be nx pz qa dx ob qb qc dz of nh qd qe qf nl qg qh qi np qj qk ql qm bj">The Change</h2><p id="8abe" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj"><a class="af nu" href="https://lore.kernel.org/netdev/20230717152917.751987-1-edumazet@google.com/T/" rel="noopener ugc nofollow" target="_blank">This commit</a> obsoleted <em class="po">sysctl_tcp_adv_win_scale</em> and introduced a <em class="po">scaling_ratio</em> that can more accurately calculate the overhead or window size, which is the right thing to do. 
With the change, the window size is now <em class="po">rcvbuf * scaling_ratio</em>.</p><p id="9bdf" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">So how is <em class="po">scaling_ratio</em> calculated? It is calculated using <strong class="my gu"><em class="po">skb-&gt;len/skb-&gt;truesize</em></strong>, where <em class="po">skb-&gt;len</em> is the length of the TCP data in an <em class="po">skb</em> and <em class="po">truesize</em> is the total size of the <em class="po">skb</em>. <strong class="my gu">This is surely a more accurate ratio based on real data rather than a hardcoded 50%.</strong> Now, here is the next question: during the TCP handshake <strong class="my gu">before any data is transferred, how do we decide the initial <em class="po">scaling_ratio</em>? </strong>The answer is, a magic and conservative ratio was chosen, with the value being roughly 0.25.</p><p id="f7cb" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Now we have:</p><ul class=""><li id="5f22" class="mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt qr qs qt bj">SO_RCVBUF = 65536</li><li id="3e62" class="mw mx gt my b mz qu nb nc nd qv nf ng nh qw nj nk nl qx nn no np qy nr ns nt qr qs qt bj">rcvbuf = 2 * 65536 = 131072</li><li id="88c1" class="mw mx gt my b mz qu nb nc nd qv nf ng nh qw nj nk nl qx nn no np qy nr ns nt qr qs qt bj">receive window size = rcvbuf * 0.25 = 131072 * 0.25 = 32768</li></ul><p id="79ea" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">In short, <strong class="my gu">the receive window size halved after the kernel upgrade. 
Hence the throughput was cut in half</strong>,<strong class="my gu"> causing the data transfer time to double.</strong></p><p id="de39" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Naturally, you may ask, I understand that the initial window size is small, but <strong class="my gu">why doesn’t the window grow when we have a more accurate ratio of the payload later</strong> (i.e. <em class="po">skb-&gt;len/skb-&gt;truesize</em>)? With some debugging, we eventually found out that the <em class="po">scaling_ratio</em> does <a class="af nu" href="https://elixir.bootlin.com/linux/v6.7.9/source/net/ipv4/tcp_input.c#L248" rel="noopener ugc nofollow" target="_blank">get updated to a more accurate <em class="po">skb-&gt;len/skb-&gt;truesize</em></a>, which in our case is around 0.66. However, another variable, <em class="po">window_clamp</em>, is not updated accordingly. <em class="po">window_clamp</em> is the <a class="af nu" href="https://elixir.bootlin.com/linux/v6.7.9/source/include/linux/tcp.h#L256" rel="noopener ugc nofollow" target="_blank">maximum receive window allowed to be advertised</a>, which is also initialized to <em class="po">0.25 * rcvbuf </em>using the initial <em class="po">scaling_ratio</em>. As a result, <strong class="my gu">the receive window size is capped at this value and can’t grow bigger</strong>.</p><h1 id="a8f3" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">The Fix</h1><p id="e049" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">In theory, the fix is to update <em class="po">window_clamp</em> along with <em class="po">scaling_ratio</em>. 
However, in order to have a simple fix that doesn’t introduce other unexpected behaviors, <a class="af nu" href="https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=697a6c8cec03" rel="noopener ugc nofollow" target="_blank">our final fix was to increase the initial <em class="po">scaling_ratio</em> from 25% to 50%</a>. This will make the receive window size backward compatible with the original default <em class="po">sysctl_tcp_adv_win_scale</em>.</p><p id="fb5c" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">Meanwhile, notice that the problem is not only caused by the changed kernel behavior but also by the fact that the application sets <em class="po">SO_RCVBUF</em> and has a 30-second application-level timeout. In fact, the application is Kafka Connect and both settings are the default configurations (<a class="af nu" href="https://kafka.apache.org/documentation/#connectconfigs_receive.buffer.bytes" rel="noopener ugc nofollow" target="_blank"><em class="po">receive.buffer.bytes=64k</em></a> and <a class="af nu" href="https://kafka.apache.org/documentation/#consumerconfigs_request.timeout.ms" rel="noopener ugc nofollow" target="_blank"><em class="po">request.timeout.ms=30s</em></a>). We also<a class="af nu" href="https://issues.apache.org/jira/browse/KAFKA-16496" rel="noopener ugc nofollow" target="_blank"> created a kafka ticket to change receive.buffer.bytes to -1</a> to allow Linux to auto tune the receive window.</p><h1 id="53cb" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">Conclusion</h1><p id="0152" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">This was a very interesting debugging exercise that covered many layers of Netflix’s stack and infrastructure. 
While the “network” technically wasn’t to blame this time, the culprit turned out to be the software that makes up the network (i.e. the TCP implementation in the kernel).</p><p id="d2c0" class="pw-post-body-paragraph mw mx gt my b mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt gm bj">If tackling such technical challenges excites you, consider joining our Cloud Infrastructure Engineering teams. Explore opportunities by visiting <a class="af nu" href="https://jobs.netflix.com/" rel="noopener ugc nofollow" target="_blank">Netflix Jobs</a> and searching for Cloud Engineering positions.</p><h1 id="eb85" class="nv nw gt be nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or os bj">Acknowledgments</h1><p id="c3e7" class="pw-post-body-paragraph mw mx gt my b mz ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np ox nr ns nt gm bj">Special thanks to our stunning colleagues <a class="af nu" href="https://www.linkedin.com/in/alok-tiagi-99205015/" rel="noopener ugc nofollow" target="_blank">Alok Tiagi</a>, <a class="af nu" href="https://www.linkedin.com/in/artemtkachuk/" rel="noopener ugc nofollow" target="_blank">Artem Tkachuk</a>, <a class="af nu" href="https://www.linkedin.com/in/jethanadams/" rel="noopener ugc nofollow" target="_blank">Ethan Adams</a>, <a class="af nu" href="https://www.linkedin.com/in/jorge-rodriguez-12b5595/" rel="noopener ugc nofollow" target="_blank">Jorge Rodriguez</a>, <a class="af nu" href="https://www.linkedin.com/in/nickmahilani/" rel="noopener ugc nofollow" target="_blank">Nick Mahilani</a>, <a class="af nu" href="https://tycho.pizza/" rel="noopener ugc nofollow" target="_blank">Tycho Andersen</a> and <a class="af nu" href="https://www.linkedin.com/in/vinay-rayini/" rel="noopener ugc nofollow" target="_blank">Vinay Rayini</a> for investigating and mitigating this issue. 
We would also like to thank Linux kernel network expert <a class="af nu" href="https://www.linkedin.com/in/eric-dumazet-ba252942/" rel="noopener ugc nofollow" target="_blank">Eric Dumazet</a> for reviewing and applying the patch.</p></div>]]></description>
      <link>https://netflixtechblog.com/investigation-of-a-cross-regional-network-performance-issue-422d6218fdf1</link>
      <guid>https://netflixtechblog.com/investigation-of-a-cross-regional-network-performance-issue-422d6218fdf1</guid>
      <pubDate>Tue, 06 Aug 2024 00:18:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Java 21 Virtual Threads - Dude, Where’s My Lock?]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg fy fz ga gb"><div><div><h2 id="6f13" class="pw-subtitle-paragraph hq gs gt be b hr hs ht hu hv hw hx hy hz ia ib ic id ie if cp dt">Getting real with virtual threads</h2><div></div><p id="5713" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">By <a class="af od" href="https://www.linkedin.com/in/vfilanovsky/" rel="noopener ugc nofollow" target="_blank">Vadim Filanovsky</a>, <a class="af od" href="https://www.linkedin.com/in/mike-huang-a552781/" rel="noopener ugc nofollow" target="_blank">Mike Huang</a>, <a class="af od" href="https://www.linkedin.com/in/danny-thomas-a623413/" rel="noopener ugc nofollow" target="_blank">Danny Thomas</a> and <a class="af od" href="https://www.linkedin.com/in/martinchalupa/" rel="noopener ugc nofollow" target="_blank">Martin Chalupa</a></p><h1 id="2d29" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">Intro</h1><p id="2e2e" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Netflix has an extensive history of using Java as our primary programming language across our vast fleet of microservices. As we pick up newer versions of Java, our JVM Ecosystem team seeks out new language features that can improve the ergonomics and performance of our systems. In a <a class="af od" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/bending-pause-times-to-your-will-with-generational-zgc-256629c9386b">recent article</a>, we detailed how our workloads benefited from switching to generational ZGC as our default garbage collector when we migrated to Java 21. 
Virtual threads is another feature we are excited to adopt as part of this migration.</p><p id="3630" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">For those new to virtual threads, <a class="af od" href="https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html" rel="noopener ugc nofollow" target="_blank">they are described</a> as “lightweight threads that dramatically reduce the effort of writing, maintaining, and observing high-throughput concurrent applications.” Their power comes from their ability to be suspended and resumed automatically via continuations when blocking operations occur, thus freeing the underlying operating system threads to be reused for other operations. Leveraging virtual threads can unlock higher performance when utilized in the appropriate context.</p><p id="3d96" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">In this article we discuss one of the peculiar cases that we encountered along our path to deploying virtual threads on Java 21.</p><h1 id="17c0" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">The problem</h1><p id="3256" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Netflix engineers raised several independent reports of intermittent timeouts and hung instances to the Performance Engineering and JVM Ecosystem teams. Upon closer examination, we noticed a set of common traits and symptoms. In all cases, the apps affected ran on Java 21 with SpringBoot 3 and embedded Tomcat serving traffic on REST endpoints. The instances that experienced the issue simply stopped serving traffic even though the JVM on those instances remained up and running. 
One clear symptom characterizing the onset of this issue is a persistent increase in the number of sockets in <code class="cw pf pg ph pi b">closeWait</code> state as illustrated by the graph below:</p><figure class="pm pn po pp pq pr pj pk paragraph-image"><div role="button" tabindex="0" class="ps pt fi pu bg pv"><div class="pj pk pl"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*b5oZiN2Ew96GEeZ9oIIhPA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*b5oZiN2Ew96GEeZ9oIIhPA.png" /></picture></div></div></figure><h1 id="f9a6" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">Collected diagnostics</h1><p id="4115" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Sockets remaining in <code class="cw pf pg ph pi b">closeWait</code> state indicate that the remote peer closed the socket, but it was never closed on the local instance, presumably because the application failed to do so. This can often indicate that the application is hanging in an abnormal state, in which case application thread dumps may reveal additional insight.</p><p id="b6ef" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">In order to troubleshoot this issue, we first leveraged our <a class="af od" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/improved-alerting-with-atlas-streaming-eval-e691c60dc61e">alerts system</a> to catch an instance in this state. Since we periodically collect and persist thread dumps for all JVM workloads, we can often retroactively piece together the behavior by examining these thread dumps from an instance. However, we were surprised to find that all our thread dumps show a perfectly idle JVM with no clear activity. 
Reviewing recent changes revealed that these impacted services enabled virtual threads, and we knew that virtual thread call stacks do not show up in <code class="cw pf pg ph pi b">jstack</code>-generated thread dumps. To obtain a more complete thread dump containing the state of the virtual threads, we used the “<code class="cw pf pg ph pi b">jcmd Thread.dump_to_file</code>” command instead. As a last-ditch effort to introspect the state of the JVM, we also collected a heap dump from the instance.</p><h1 id="25f0" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">Analysis</h1><p id="165a" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Thread dumps revealed thousands of “blank” virtual threads:</p><pre class="pm pn po pp pq px pi py bo pz ba bj">#119821 "" virtual<br />#119820 "" virtual<br />#119823 "" virtual<br />#120847 "" virtual<br />#119822 "" virtual<br />...</pre><p id="1d0f" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">These are the VTs (virtual threads) for which a thread object is created, but has not started running, and as such, has no stack trace. In fact, there were approximately the same number of blank VTs as the number of sockets in closeWait state. To make sense of what we were seeing, we need to first understand how VTs operate.</p><p id="f352" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">A virtual thread is not mapped 1:1 to a dedicated OS-level thread. Rather, we can think of it as a task that is scheduled to a fork-join thread pool. When a virtual thread enters a blocking call, like waiting for a <code class="cw pf pg ph pi b">Future</code>, it relinquishes the OS thread it occupies and simply remains in memory until it is ready to resume. 
In the meantime, the OS thread can be reassigned to execute other VTs in the same fork-join pool. This allows us to multiplex many VTs onto just a handful of underlying OS threads. In JVM terminology, the underlying OS thread is referred to as the “carrier thread” to which a virtual thread can be “mounted” while it executes and “unmounted” while it waits. A great in-depth description of virtual threads is available in <a class="af od" href="https://openjdk.org/jeps/444" rel="noopener ugc nofollow" target="_blank">JEP 444</a>.</p><p id="fa4e" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">In our environment, we utilize a blocking model for Tomcat, which in effect holds a worker thread for the lifespan of a request. By enabling virtual threads, Tomcat switches to virtual execution. Each incoming request creates a new virtual thread that is simply scheduled as a task on a <a class="af od" href="https://github.com/apache/tomcat/blob/10.1.24/java/org/apache/tomcat/util/threads/VirtualThreadExecutor.java" rel="noopener ugc nofollow" target="_blank">Virtual Thread Executor</a>. 
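To make this scheduling model concrete, here is a minimal, self-contained Java 21 sketch (our own illustration, not Tomcat or Netflix code) of the same one-virtual-thread-per-task pattern: thousands of tasks each block on a sleep, yet only a handful of carrier OS threads are needed because each virtual thread unmounts while it waits.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        AtomicInteger completed = new AtomicInteger();
        // One new virtual thread per submitted task, all multiplexed onto a
        // fork-join pool of carrier threads sized to the CPU count.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    Thread.sleep(10); // blocking call: the VT unmounts from its carrier
                    completed.incrementAndGet();
                    return null;
                });
            }
        } // close() implicitly awaits completion of all submitted tasks
        System.out.println(completed.get()); // 10000
    }
}
```

Running the same loop with platform threads would require ten thousand OS threads; here the JVM typically uses only about as many carriers as there are cores.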
We can see Tomcat creates a <code class="cw pf pg ph pi b">VirtualThreadExecutor</code> <a class="af od" href="https://github.com/apache/tomcat/blob/10.1.24/java/org/apache/tomcat/util/net/AbstractEndpoint.java#L1070-L1071" rel="noopener ugc nofollow" target="_blank">here</a>.</p><p id="d97c" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">Tying this information back to our problem, the symptoms correspond to a state when Tomcat keeps creating a new web worker VT for each incoming request, but there are no available OS threads to mount them onto.</p><h1 id="520c" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">Why is Tomcat stuck?</h1><p id="31ec" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">What happened to our OS threads and what are they busy with? As <a class="af od" href="https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-04C03FFC-066D-4857-85B9-E5A27A875AF9" rel="noopener ugc nofollow" target="_blank">described here</a>, a VT will be pinned to the underlying OS thread if it performs a blocking operation while inside a <code class="cw pf pg ph pi b">synchronized</code> block or method. This is exactly what is happening here. 
Here is a relevant snippet from a thread dump obtained from the stuck instance:</p><pre class="pm pn po pp pq px pi py bo pz ba bj">#119515 "" virtual<br />      java.base/jdk.internal.misc.Unsafe.park(Native Method)<br />      java.base/java.lang.VirtualThread.parkOnCarrierThread(VirtualThread.java:661)<br />      java.base/java.lang.VirtualThread.park(VirtualThread.java:593)<br />      java.base/java.lang.System$2.parkVirtualThread(System.java:2643)<br />      java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)<br />      java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)<br />      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)<br />      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)<br />      java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)<br />      java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)<br />      zipkin2.reporter.internal.CountBoundedQueue.offer(CountBoundedQueue.java:54)<br />      zipkin2.reporter.internal.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:230)<br />      zipkin2.reporter.brave.AsyncZipkinSpanHandler.end(AsyncZipkinSpanHandler.java:214)<br />      brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:98)<br />      brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:48)<br />      brave.internal.recorder.PendingSpans.finish(PendingSpans.java:116)<br />      brave.RealSpan.finish(RealSpan.java:134)<br />      brave.RealSpan.finish(RealSpan.java:129)<br />      io.micrometer.tracing.brave.bridge.BraveSpan.end(BraveSpan.java:117)<br />      io.micrometer.tracing.annotation.AbstractMethodInvocationProcessor.after(AbstractMethodInvocationProcessor.java:67)<br />      
io.micrometer.tracing.annotation.ImperativeMethodInvocationProcessor.proceedUnderSynchronousSpan(ImperativeMethodInvocationProcessor.java:98)<br />      io.micrometer.tracing.annotation.ImperativeMethodInvocationProcessor.process(ImperativeMethodInvocationProcessor.java:73)<br />      io.micrometer.tracing.annotation.SpanAspect.newSpanMethod(SpanAspect.java:59)<br />      java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)<br />      java.base/java.lang.reflect.Method.invoke(Method.java:580)<br />      org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:637)<br />...</pre><p id="627b" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">In this stack trace, we enter the synchronization in <code class="cw pf pg ph pi b">brave.RealSpan.finish(<a class="af od" href="https://github.com/openzipkin/brave/blob/6.0.3/brave/src/main/java/brave/RealSpan.java#L134" rel="noopener ugc nofollow" target="_blank">RealSpan.java:134</a>)</code>. This virtual thread is effectively pinned — it is mounted to an actual OS thread even while it waits to acquire a reentrant lock. There are 3 VTs in this exact state and another VT identified as “<code class="cw pf pg ph pi b">&lt;redacted&gt; @DefaultExecutor - 46542</code>” that also follows the same code path. These 4 virtual threads are pinned while waiting to acquire a lock. Because the app is deployed on an instance with 4 vCPUs, <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/lang/VirtualThread.java#L1102-L1134" rel="noopener ugc nofollow" target="_blank">the fork-join pool that underpins VT execution</a> also contains 4 OS threads. Now that we have exhausted all of them, no other virtual thread can make any progress. 
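Reduced to its essentials, the pinning pattern from the stack trace looks like the following sketch (names are invented for illustration; this is not the actual Brave/Zipkin code): a virtual thread that blocks on a ReentrantLock while holding a monitor stays mounted on its carrier OS thread for the entire wait. On Java 21, running with -Djdk.tracePinnedThreads=full reports such pinning events.

```java
import java.util.concurrent.locks.ReentrantLock;

public class PinningSketch {
    private static final Object MONITOR = new Object();
    private static final ReentrantLock LOCK = new ReentrantLock();

    // Hypothetical stand-in for a span-finishing method: a blocking lock
    // acquisition inside a synchronized block. While parked waiting for
    // LOCK, the virtual thread remains pinned to its carrier OS thread.
    static void finishSpan() {
        synchronized (MONITOR) {
            LOCK.lock();
            try {
                // report the span (elided)
            } finally {
                LOCK.unlock();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Uncontended here, so this completes; with enough contending virtual
        // threads to pin every carrier, the JVM wedges as described above.
        Thread vt = Thread.ofVirtual().start(PinningSketch::finishSpan);
        vt.join();
        System.out.println("done");
    }
}
```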
This explains why Tomcat stopped processing the requests and why the number of sockets in <code class="cw pf pg ph pi b">closeWait</code> state keeps climbing. Indeed, Tomcat accepts a connection on a socket, creates a request along with a virtual thread, and passes this request/thread to the executor for processing. However, the newly created VT cannot be scheduled because all of the OS threads in the fork-join pool are pinned and never released. So these newly created VTs are stuck in the queue, while still holding the socket.</p><h1 id="caf7" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">Who has the lock?</h1><p id="f2d5" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Now that we know VTs are waiting to acquire a lock, the next question is: Who holds the lock? Answering this question is key to understanding what triggered this condition in the first place. Usually a thread dump indicates who holds the lock with either “<code class="cw pf pg ph pi b">- locked &lt;0x…&gt; (at …)</code>” or “<code class="cw pf pg ph pi b">Locked ownable synchronizers</code>,” but neither of these show up in our thread dumps. As a matter of fact, no locking/parking/waiting information is included in the <code class="cw pf pg ph pi b">jcmd</code>-generated thread dumps. This is a limitation in Java 21 and will be addressed in the future releases. Carefully combing through the thread dump reveals that there are a total of 6 threads contending for the same <code class="cw pf pg ph pi b">ReentrantLock</code> and associated <code class="cw pf pg ph pi b">Condition</code>. Four of these six threads are detailed in the previous section. 
Here is another thread:</p><pre class="pm pn po pp pq px pi py bo pz ba bj">#119516 "" virtual<br />      java.base/java.lang.VirtualThread.park(VirtualThread.java:582)<br />      java.base/java.lang.System$2.parkVirtualThread(System.java:2643)<br />      java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)<br />      java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)<br />      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)<br />      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)<br />      java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)<br />      java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)<br />      zipkin2.reporter.internal.CountBoundedQueue.offer(CountBoundedQueue.java:54)<br />      zipkin2.reporter.internal.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:230)<br />      zipkin2.reporter.brave.AsyncZipkinSpanHandler.end(AsyncZipkinSpanHandler.java:214)<br />      brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:98)<br />      brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:48)<br />      brave.internal.recorder.PendingSpans.finish(PendingSpans.java:116)<br />      brave.RealScopedSpan.finish(RealScopedSpan.java:64)<br />      ...</pre><p id="fc4e" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">Note that while this thread seemingly goes through the same code path for finishing a span, it does not go through a <code class="cw pf pg ph pi b">synchronized</code> block. 
Finally here is the 6th thread:</p><pre class="pm pn po pp pq px pi py bo pz ba bj">#107 "AsyncReporter &lt;redacted&gt;"<br />      java.base/jdk.internal.misc.Unsafe.park(Native Method)<br />      java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)<br />      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)<br />      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1761)<br />      zipkin2.reporter.internal.CountBoundedQueue.drainTo(CountBoundedQueue.java:81)<br />      zipkin2.reporter.internal.AsyncReporter$BoundedAsyncReporter.flush(AsyncReporter.java:241)<br />      zipkin2.reporter.internal.AsyncReporter$Flusher.run(AsyncReporter.java:352)<br />      java.base/java.lang.Thread.run(Thread.java:1583)</pre><p id="1f0e" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">This is actually a normal platform thread, not a virtual thread. Paying particular attention to the line numbers in this stack trace, it is peculiar that the thread seems to be blocked within the internal <code class="cw pf pg ph pi b">acquire()</code> method <em class="qf">after</em> <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#L1761" rel="noopener ugc nofollow" target="_blank">completing the wait</a>. In other words, this calling thread owned the lock upon entering <code class="cw pf pg ph pi b">awaitNanos()</code>. We know the lock was explicitly acquired <a class="af od" href="https://github.com/openzipkin/zipkin-reporter-java/blob/3.4.0/core/src/main/java/zipkin2/reporter/internal/CountBoundedQueue.java#L76" rel="noopener ugc nofollow" target="_blank">here</a>. However, by the time the wait completed, it could not reacquire the lock. 
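The awaitNanos() contract explains why the flusher sits inside acquire(): a caller must hold the lock when invoking awaitNanos(), the lock is released for the duration of the wait, and it must be re-acquired before the call returns. Here is a small standalone demonstration of that contract (our own example, not the zipkin code):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class AwaitNanosDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        Condition ready = lock.newCondition();

        lock.lock();
        try {
            // Releases the lock, waits up to 1 ms, then re-acquires the lock
            // before returning. The flusher thread in the stack trace above
            // is stuck in exactly this re-acquire step.
            long remaining = ready.awaitNanos(1_000_000L);
            if (!lock.isHeldByCurrentThread()) {
                throw new AssertionError("awaitNanos must re-acquire the lock");
            }
            System.out.println("woke up with " + remaining + " ns remaining");
        } finally {
            lock.unlock();
        }
    }
}
```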
Summarizing our thread dump analysis:</p><p id="b6e7" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">There are 5 virtual threads and 1 regular thread waiting for the lock. Out of those 5 VTs, 4 of them are pinned to the OS threads in the fork-join pool. There’s still no information on who owns the lock. As there’s nothing more we can glean from the thread dump, our next logical step is to peek into the heap dump and introspect the state of the lock.</p><h1 id="096b" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">Inspecting the lock</h1><p id="a6d9" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Finding the lock in the heap dump was relatively straightforward. Using the excellent <a class="af od" href="https://eclipse.dev/mat/" rel="noopener ugc nofollow" target="_blank">Eclipse MAT</a> tool, we examined the objects on the stack of the <code class="cw pf pg ph pi b">AsyncReporter</code> non-virtual thread to identify the lock object. Reasoning about the current state of the lock was perhaps the trickiest part of our investigation. Most of the relevant code can be found in the <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java" rel="noopener ugc nofollow" target="_blank">AbstractQueuedSynchronizer.java</a>. While we don’t claim to fully understand the inner workings of it, we reverse-engineered enough of it to match against what we see in the heap dump. 
This diagram illustrates our findings:</p></div></div><div class="pr"><div class="ab ca"><div class="mj qj mk qk ml ql ce qm cf qn ch bg"><figure class="pm pn po pp pq pr qp qq paragraph-image"><div role="button" tabindex="0" class="ps pt fi pu bg pv"><div class="pj pk qo"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*6AOJeVdbhmStpb9CRj30nw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*6AOJeVdbhmStpb9CRj30nw.png" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fy fz ga gb"><p id="11f3" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">First off, the <code class="cw pf pg ph pi b">exclusiveOwnerThread</code> field is <code class="cw pf pg ph pi b">null</code> (2), signifying that no one owns the lock. We have an “empty” <code class="cw pf pg ph pi b">ExclusiveNode</code> (3) at the head of the list (<code class="cw pf pg ph pi b">waiter</code> is <code class="cw pf pg ph pi b">null</code> and <code class="cw pf pg ph pi b">status</code> is cleared) followed by another <code class="cw pf pg ph pi b">ExclusiveNode</code> with <code class="cw pf pg ph pi b">waiter</code> pointing to one of the virtual threads contending for the lock — <code class="cw pf pg ph pi b">#119516</code> (4). The only place we found that clears the <code class="cw pf pg ph pi b">exclusiveOwnerThread</code> field is within the <code class="cw pf pg ph pi b">ReentrantLock.Sync.tryRelease()</code> method (<a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/ReentrantLock.java#L178" rel="noopener ugc nofollow" target="_blank">source link</a>). 
There we also set <code class="cw pf pg ph pi b">state = 0</code> matching the state that we see in the heap dump (1).</p><p id="b2bc" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">With this in mind, we traced the <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#L1058-L1064" rel="noopener ugc nofollow" target="_blank">code path</a> to <code class="cw pf pg ph pi b">release()</code> the lock. After successfully calling <code class="cw pf pg ph pi b">tryRelease()</code>, the lock-holding thread attempts to <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#L641-L647" rel="noopener ugc nofollow" target="_blank">signal the next waiter</a> in the list. At this point, the lock-holding thread is still at the head of the list, even though ownership of the lock is <em class="qf">effectively released</em>. The <em class="qf">next </em>node in the list points to the thread that is <em class="qf">about to acquire the lock</em>.</p><p id="a4a6" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">To understand how this signaling works, let’s look at the <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#L670-L765" rel="noopener ugc nofollow" target="_blank">lock acquire path</a> in the <code class="cw pf pg ph pi b">AbstractQueuedSynchronizer.acquire()</code> method. 
Grossly oversimplifying, it’s an infinite loop, where threads attempt to acquire the lock and then park if the attempt was unsuccessful:</p><pre class="pm pn po pp pq px pi py bo pz ba bj">while(true) {<br />   if (tryAcquire()) {<br />      return; // lock acquired<br />   }<br />   park();<br />}</pre><p id="b3b3" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">When the lock-holding thread releases the lock and signals to unpark the next waiter thread, the unparked thread iterates through this loop again, giving it another opportunity to acquire the lock. Indeed, our thread dump indicates that all of our waiter threads are parked on <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#L754" rel="noopener ugc nofollow" target="_blank">line 754</a>. Once unparked, the thread that managed to acquire the lock should end up in <a class="af od" href="https://github.com/openjdk/jdk21u/blob/jdk-21.0.3-ga/src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#L716-L723" rel="noopener ugc nofollow" target="_blank">this code block</a>, effectively resetting the head of the list and clearing the reference to the waiter.</p><p id="8fcd" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">To restate this more concisely, the lock-owning thread is referenced by the head node of the list. Releasing the lock notifies the next node in the list while acquiring the lock resets the head of the list to the current node. This means that what we see in the heap dump reflects the state when one thread has already released the lock but the next thread has yet to acquire it. It’s a weird in-between state that should be transient, but our JVM is stuck here. 
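The park/unpark handoff that this loop relies on can be demonstrated in isolation; this toy example (ours, for illustration) parks a thread and then unparks it the way release() signals the next waiter:

```java
import java.util.concurrent.locks.LockSupport;

public class ParkUnparkDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            // Parks until another thread grants a permit via unpark() --
            // the same primitive the AQS acquire loop parks on.
            LockSupport.park();
            System.out.println("unparked; would retry tryAcquire() now");
        });
        waiter.start();
        Thread.sleep(100);          // give the waiter time to park
        LockSupport.unpark(waiter); // what release() does for the next waiter
        waiter.join();
    }
}
```

Note that unpark() grants a permit even if it runs before park(), so this handoff has no lost-wakeup race; the stuck state in our incident arose not from a missed signal but from the signaled virtual thread having no free carrier to resume on.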
We know thread <code class="cw pf pg ph pi b">#119516</code> was notified and is about to acquire the lock because of the <code class="cw pf pg ph pi b">ExclusiveNode</code> state we identified at the head of the list. However, thread dumps show that thread <code class="cw pf pg ph pi b">#119516</code> continues to wait, just like other threads contending for the same lock. How can we reconcile what we see between the thread and heap dumps?</p><h1 id="2378" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">The lock with no place to run</h1><p id="f1cc" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Knowing that thread <code class="cw pf pg ph pi b">#119516</code> was actually notified, we went back to the thread dump to re-examine the state of the threads. Recall that we have 6 total threads waiting for the lock with 4 of the virtual threads each pinned to an OS thread. These 4 will not yield their OS thread until they acquire the lock and proceed out of the <code class="cw pf pg ph pi b">synchronized</code> block. <code class="cw pf pg ph pi b">#107 “AsyncReporter &lt;redacted&gt;”</code> is a regular platform thread, so nothing should prevent it from proceeding if it acquires the lock. This leaves us with the last thread: <code class="cw pf pg ph pi b">#119516</code>. It is a VT, but it is not pinned to an OS thread. Even if it’s notified to be unparked, it cannot proceed because there are no more OS threads left in the fork-join pool to schedule it onto. That’s exactly what happens here — although <code class="cw pf pg ph pi b">#119516</code> is signaled to unpark itself, it cannot leave the parked state because the fork-join pool is occupied by the 4 other VTs waiting to acquire the same lock. None of those pinned VTs can proceed until they acquire the lock. 
It’s a variation of the <a class="af od" href="https://en.wikipedia.org/wiki/Deadlock" rel="noopener ugc nofollow" target="_blank">classic deadlock problem</a>, but instead of 2 locks we have one lock and a semaphore with 4 permits, represented by the fork-join pool.</p><p id="5cdd" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">Now that we knew exactly what happened, it was easy to come up with a <a class="af od" href="https://gist.github.com/DanielThomas/0b099c5f208d7deed8a83bf5fc03179e" rel="noopener ugc nofollow" target="_blank">reproducible test case</a>.</p><h1 id="ee54" class="oe of gt be og oh oi ht oj ok ol hw om on oo op oq or os ot ou ov ow ox oy oz bj">Conclusion</h1><p id="6f88" class="pw-post-body-paragraph nh ni gt nj b hr pa nl nm hu pb no np nq pc ns nt nu pd nw nx ny pe oa ob oc gm bj">Virtual threads are expected to improve performance by reducing overhead related to thread creation and context switching. Despite some sharp edges as of Java 21, virtual threads largely deliver on their promise. In our quest for more performant Java applications, we see further virtual thread adoption as key to unlocking that goal. We look forward to Java 23 and beyond, which bring a wealth of upgrades and will hopefully address the integration between virtual threads and locking primitives.</p><p id="30c0" class="pw-post-body-paragraph nh ni gt nj b hr nk nl nm hu nn no np nq nr ns nt nu nv nw nx ny nz oa ob oc gm bj">This exploration highlights just one type of issue that performance engineers solve at Netflix. We hope this glimpse into our problem-solving approach proves valuable to others in their future investigations.</p></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d</link>
      <guid>https://netflixtechblog.com/java-21-virtual-threads-dude-wheres-my-lock-3052540e231d</guid>
      <pubDate>Mon, 29 Jul 2024 20:04:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Maestro: Netflix’s Workflow Orchestrator]]></title>
      <description><![CDATA[<div><div></div><p id="0ec7" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">By <a class="af nt" href="https://www.linkedin.com/in/jheua/" rel="noopener ugc nofollow" target="_blank">Jun He</a>, <a class="af nt" href="https://www.linkedin.com/in/natalliadzenisenka/" rel="noopener ugc nofollow" target="_blank">Natallia Dzenisenka</a>, <a class="af nt" href="https://www.linkedin.com/in/praneethy91/" rel="noopener ugc nofollow" target="_blank">Praneeth Yenugutala</a>, <a class="af nt" href="https://www.linkedin.com/in/yingyi-zhang-a0a164111/" rel="noopener ugc nofollow" target="_blank">Yingyi Zhang</a>, and <a class="af nt" href="https://www.linkedin.com/in/anjali-norwood-9521a16" rel="noopener ugc nofollow" target="_blank">Anjali Norwood</a></p><h1 id="4de7" class="nu nv gt be nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or bj">TL;DR</h1><p id="3080" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">We are thrilled to announce that the Maestro source code is now open to the public! Please visit the <a class="af nt" href="https://github.com/Netflix/maestro" rel="noopener ugc nofollow" target="_blank">Maestro GitHub repository</a> to get started. If you find it useful, please <a class="af nt" href="https://github.com/Netflix/maestro" rel="noopener ugc nofollow" target="_blank">give us a star</a>.</p><h2 id="2bef" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">What is Maestro</h2><p id="3756" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines. 
It oversees the entire lifecycle of a workflow, from start to finish, including retries, queuing, and task distribution to compute engines. Users can package their business logic in various formats such as Docker images, notebooks, bash scripts, SQL, Python, and more. Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, such as foreach loops, subworkflows, and conditional branches.</p><h2 id="6c09" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Our Journey with Maestro</h2><p id="f334" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Since we first introduced Maestro in <a class="af nt" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">this blog post</a>, we have successfully migrated hundreds of thousands of workflows to it on behalf of users with minimal interruption. The transition was seamless, and Maestro has met our design goals by handling our ever-growing workloads. Over the past year, we’ve seen a remarkable 87.5% increase in executed jobs. Maestro now launches thousands of workflow instances and runs half a million jobs daily on average, and has completed around 2 million jobs on particularly busy days.</p><h2 id="fd3e" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Scalability and Versatility</h2><p id="5f06" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Maestro is a fully managed workflow orchestrator that provides Workflow-as-a-Service to thousands of end users, applications, and services at Netflix. 
It supports a wide range of workflow use cases, including ETL pipelines, ML workflows, AB test pipelines, pipelines to move data between different storages, etc. Maestro’s horizontal scalability ensures it can manage both a large number of workflows and a large number of jobs within a single workflow.</p><p id="c621" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">At Netflix, workflows are intricately connected. Splitting them into smaller groups and managing them across different clusters adds unnecessary complexity and degrades the user experience. This approach also requires additional mechanisms to coordinate these fragmented workflows. Since Netflix’s data tables are housed in a single data warehouse, we believe a single orchestrator should handle all workflows accessing it.</p><p id="986e" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Join us on this exciting journey by exploring the <a class="af nt" href="https://github.com/Netflix/maestro" rel="noopener ugc nofollow" target="_blank">Maestro GitHub repository</a> and contributing to its ongoing development. Your support and feedback are invaluable as we continue to improve the Maestro project.</p><h1 id="7110" class="nu nv gt be nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or bj">Introducing Maestro</h1><p id="4941" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Netflix Maestro offers a comprehensive set of features designed to meet the diverse needs of both engineers and non-engineers. It includes the common functions and reusable patterns applicable to various use cases in a loosely coupled way.</p><p id="1921" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">A workflow definition is defined in a JSON format. 
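As a purely illustrative sketch of what such a JSON definition might look like (every field name below is invented for this post and is not Maestro's actual schema), a workflow could pair ownership and execution properties with a versioned graph of steps:

```json
{
  "properties": {
    "owner": "some-team",
    "run_strategy": "SEQUENTIAL"
  },
  "workflow": {
    "id": "demo_daily_pipeline",
    "steps": [
      { "id": "extract", "type": "notebook" },
      { "id": "transform", "type": "spark_sql" },
      { "id": "load", "type": "docker" }
    ]
  }
}
```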
Maestro combines user-supplied fields with those managed by Maestro to form a flexible and powerful orchestration definition. An example can be found in the <a class="af nt" href="https://github.com/Netflix/maestro/wiki/Workflow-definition-example" rel="noopener ugc nofollow" target="_blank">Maestro repository wiki</a>.</p><p id="2ac0" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">A Maestro workflow definition comprises two main sections: properties and the versioned workflow, including its metadata. Properties include author and owner information as well as execution settings. Maestro preserves key properties across workflow versions, such as author and owner information, run strategy, and concurrency settings. This consistency simplifies management and aids in troubleshooting. If the ownership of the current workflow changes, the new owner can claim ownership of the workflow without creating a new workflow version. Users can also enable the triggering or alerting features for a given workflow via the properties.</p><p id="656f" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">The versioned workflow includes attributes like a unique identifier, name, description, tags, timeout settings, and criticality levels (low, medium, high) for prioritization. Each workflow change creates a new version, enabling tracking and easy reversion, with the active or the latest version used by default. A workflow consists of steps, which are the nodes in the workflow graph defined by users. Steps can represent jobs, another workflow using a subworkflow step, or a loop using a foreach step. Steps consist of unique identifiers, step types, tags, input and output step parameters, step dependencies, retry policies, failure modes, step outputs, etc. 
Maestro supports configurable retry policies based on error types to enhance step resilience.</p><p id="cf60" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">This high-level overview of Netflix Maestro’s workflow definition and properties highlights its flexibility to define complex workflows. Next, we dive into some of the useful features in the following sections.</p><h2 id="5be6" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Workflow Run Strategy</h2><p id="ae40" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Users want to automate data pipelines while retaining control over the execution order. This is crucial when workflows cannot run in parallel or must halt current executions when new ones occur. Maestro uses predefined run strategies to decide whether a workflow instance should run or not. Here is the list of predefined run strategies Maestro offers.</p><p id="a617" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Sequential Run Strategy</strong><br />This is the default strategy used by maestro, which runs workflows one at a time based on a First-In-First-Out (FIFO) order. With this run strategy, Maestro runs workflows in the order they are triggered. Note that an execution does not depend on the previous states. Once a workflow instance reaches one of the terminal states, whether succeeded or not, Maestro will start the next one in the queue.</p><p id="9d17" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Strict Sequential Run Strategy<br /></strong>With this run strategy, Maestro will run workflows in the order they are triggered but block execution if there’s a blocking error in the workflow instance history. 
Newly triggered workflow instances are queued until the error is resolved, either by manually restarting the failed instances or by marking them unblocked.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fi px bg py"><div class="pm pn po"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*NvdLiYWhhWb0tvL-" alt="image" /></picture></div></div></figure><p id="8320" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">In the above example, run5 fails at 5AM; later runs are queued but do not run. When someone manually marks run5 unblocked or restarts it, the workflow execution resumes. This run strategy is useful for time-insensitive but business-critical workflows. It gives workflow owners the option to review the failures at a later time and unblock the executions after verifying their correctness.</p><p id="b23e" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">First-only Run Strategy</strong><br />With this run strategy, Maestro ensures that the running workflow instance completes before queueing a new one. If a new workflow instance is triggered while the current one is still running, Maestro removes the queued instance. Maestro executes a new workflow instance only if there is no workflow instance currently running, effectively turning off queuing. This approach helps avoid idempotency issues by not queuing new workflow instances.</p><p id="eaa2" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Last-only Run Strategy</strong><br />With this run strategy, Maestro ensures the running workflow is the latest triggered one and keeps only the last instance.
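</p><p>The queueing decisions these run strategies make can be sketched roughly as follows (a simplified Python illustration of the behavior described above, not Maestro’s actual Java implementation; all names are ours):</p>

```python
from collections import deque

# Illustrative sketch: decide what to do with a newly triggered
# workflow instance under the run strategies described above.
def on_trigger(strategy, running, queue, new_instance):
    """Mutates `queue`; returns an instance to stop (LAST_ONLY only)."""
    if strategy == "SEQUENTIAL":
        queue.append(new_instance)        # FIFO: always enqueue
    elif strategy == "FIRST_ONLY":
        if running is None:               # busy -> drop the new instance
            queue.append(new_instance)
    elif strategy == "LAST_ONLY":
        queue.clear()                     # keep only the latest instance
        queue.append(new_instance)
        return running                    # caller stops the current run
    return None

q = deque()
on_trigger("FIRST_ONLY", "run1", q, "run2")        # run2 dropped: run1 busy
to_stop = on_trigger("LAST_ONLY", "run1", q, "run3")
print(to_stop)  # run1 is stopped so run3 can start
```

<p>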
If a new workflow instance is queued while there is an existing workflow instance already running, Maestro stops the running instance and executes the newly triggered one. This is useful if a workflow is designed to always process the latest data, such as processing the latest snapshot of an entire table each time.</p><p id="7523" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Parallel with Concurrency Limit Run Strategy</strong><br />With this run strategy, Maestro runs multiple triggered workflow instances in parallel, constrained by a predefined concurrency limit. This helps to fan out and distribute the execution, enabling the processing of large amounts of data within the time limit. A common use case for this strategy is backfilling old data.</p><h2 id="3d04" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Parameters and Expression Language Support</h2><p id="7591" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">In Maestro, parameters play an important role. Maestro supports dynamic parameters with code injection, which is extremely useful and powerful. This feature significantly enhances the flexibility and dynamism of workflows, allowing users to control execution logic with parameters and to share state between workflows and their steps, as well as between upstream and downstream steps. Together with other Maestro features, it makes workflow definitions dynamic and enables users to define parameterized workflows for complex use cases.</p><p id="efd3" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">However, code injection introduces significant security and safety concerns.
For example, users might unintentionally write an infinite loop that creates an array and appends items to it, eventually crashing the server with out-of-memory (OOM) issues. While one approach could be to ask users to embed the injected code within their business logic instead of the workflow definition, this would impose additional work on users and tightly couple their business logic with the workflow. In certain cases, this approach prevents users from designing complex parameterized workflows.</p><p id="ef02" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">To mitigate these risks and help users build parameterized workflows, we developed our own customized expression language parser for a simple, secure, and safe expression language (SEL). SEL supports code injection while incorporating validations during syntax tree parsing to protect the system. It leverages the Java Security Manager to restrict access, ensuring a secure and controlled environment for code execution.</p><p id="3a2f" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Simple, Secure, and Safe Expression Language (SEL)<br /></strong>SEL is a homegrown expression language built to address the risks associated with code injection within Maestro parameterized workflows. It is a simple expression language whose grammar and syntax follow the <a class="af nt" href="https://docs.oracle.com/javase/specs/" rel="noopener ugc nofollow" target="_blank">Java Language Specification</a> (JLS). SEL supports a subset of the JLS, focusing on Maestro use cases. For example, it supports data types for all Maestro parameter types, raising errors, datetime handling, and many predefined utility methods.
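</p><p>To illustrate the idea behind SEL (this toy Python sketch is not SEL itself, which is a Java-based parser), one can validate an expression’s syntax tree against a whitelist before evaluating it, so that injected code cannot call arbitrary functions:</p>

```python
import ast

# Toy illustration of SEL's approach: whitelist the syntax tree at
# parse time, then evaluate with no builtins available.
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Name, ast.Load, ast.Add, ast.Sub, ast.Mult, ast.Div,
           ast.Compare, ast.Lt, ast.Gt, ast.Eq)

def safe_eval(expr, params):
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):           # validation during tree parsing
        if not isinstance(node, ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return eval(compile(tree, "<sel>", "eval"), {"__builtins__": {}}, params)

print(safe_eval("run_id * 2 + 1", {"run_id": 20}))  # 41
try:
    safe_eval("__import__('os')", {})     # function calls are rejected
except ValueError as err:
    print("rejected:", err)
```

<p>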
SEL also includes additional runtime checks, such as loop iteration limits, array size checks, object memory size limits and so on, to enhance security and reliability. For more details about SEL, please refer to the <a class="af nt" href="https://github.com/Netflix/maestro/blob/main/netflix-sel/docs/index.md#welcome-to-sel" rel="noopener ugc nofollow" target="_blank">Maestro GitHub documentation</a>.</p><p id="0162" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Output Parameters</strong><br />To further enhance parameter support, Maestro allows for callable step execution, which returns output parameters from user execution back to the system. The output data is transmitted to Maestro via its REST API, ensuring that the step runtime does not have direct access to the Maestro database. This approach significantly reduces security concerns.</p><p id="e293" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Parameterized Workflows</strong><br />Thanks to the powerful parameter support, users can easily create parameterized workflows in addition to static ones. Users enjoy defining parameterized workflows because they are easy to manage and troubleshoot while being powerful enough to solve complex use cases.</p><ul class=""><li id="ecc7" class="mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns qa qb qc bj">Static workflows are simple and easy to use but come with limitations. Often, users have to duplicate the same workflow multiple times to accommodate minor changes. Additionally, workflows and jobs cannot share state without using parameters.</li><li id="26fe" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj">On the other hand, completely dynamic workflows can be challenging to manage and support.
They are difficult to debug or troubleshoot and hard for others to reuse.</li><li id="2cc0" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj">Parameterized workflows strike a balance by being initialized step by step at runtime based on user-defined parameters. This approach provides great flexibility for users to control the execution at runtime while remaining easy to manage and understand.</li></ul><p id="76e0" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">As we described in <a class="af nt" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c#360e">the previous Maestro blog post</a>, parameter support enables the creation of complex parameterized workflows, such as backfill data pipelines.</p><h2 id="03a6" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Workflow Execution Patterns</h2><p id="7289" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Maestro provides multiple useful building blocks that allow users to easily define dataflow patterns or other workflow patterns. It supports common patterns directly within the Maestro engine. Direct engine support not only enables us to optimize these patterns but also ensures a consistent approach to implementing them. Next, we will talk about the three major building blocks that Maestro provides.</p><p id="f47a" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Foreach Support</strong><br />In Maestro, the foreach pattern is modeled as a dedicated step within the original workflow definition.
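</p><p>Conceptually, this foreach model might look like the following simplified sketch (hypothetical Python names, not Maestro’s API): each iteration runs the loop body, a sub-graph, as its own instance, while the foreach step only aggregates their statuses.</p>

```python
# Hypothetical sketch of the foreach model: one "instance" per
# iteration, with the foreach step collecting statuses.
def run_subgraph(steps, params):
    # stand-in for executing the sub-graph defined in the foreach block
    for step in steps:
        step(params)
    return "SUCCEEDED"

def foreach_step(loop_values, steps):
    statuses = {}
    for i, value in enumerate(loop_values):
        # loop_index/loop_param names are illustrative only
        params = {"loop_index": i, "loop_param": value}
        statuses[i] = run_subgraph(steps, params)  # one instance per iteration
    return statuses

results = []
statuses = foreach_step([10, 20, 30],
                        [lambda p: results.append(p["loop_param"] * 2)])
print(statuses)  # {0: 'SUCCEEDED', 1: 'SUCCEEDED', 2: 'SUCCEEDED'}
```

<p>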
Each iteration of the foreach loop is internally treated as a separate workflow instance, which scales in the same way as any other Maestro workflow based on the step executions (i.e. a sub-graph) defined within the foreach definition block. The execution of the sub-graph within a foreach step is delegated to a separate workflow instance. The foreach step then monitors and collects the status of these foreach workflow instances, each managing the execution of a single iteration. For more details, please refer to <a class="af nt" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c#360e">our previous Maestro blog post</a>.</p><p id="d95e" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">The foreach pattern is frequently used to repeatedly run the same jobs with different parameters, such as for data backfilling or machine learning model tuning. It would be tedious and time-consuming to ask users to explicitly define each iteration in the workflow definition (potentially hundreds of thousands of iterations). Additionally, users would need to create new workflows if the foreach range changed, further complicating the process.</p><p id="5686" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Conditional Branch Support</strong><br />The conditional branch feature allows subsequent steps to run only if specific conditions in the upstream step are met. These conditions are defined using the SEL expression language and evaluated at runtime. Combined with other building blocks, users can build powerful workflows, e.g.
running a remediation step if the audit check fails and then running the job again.</p><p id="9e17" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Subworkflow Support<br /></strong>The subworkflow feature allows a workflow step to run another workflow, enabling the sharing of common functions across multiple workflows. This effectively enables “workflow as a function” and allows users to build a graph of workflows. For example, we have observed complex workflows consisting of hundreds of subworkflows to process data across hundreds of tables, where the subworkflows are provided by multiple teams.</p><p id="7d57" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">These patterns can be combined to build composite patterns for complex workflow use cases. For instance, we can loop over a set of subworkflows or run nested foreach loops. One example that Maestro users developed is an auto-recovery workflow that utilizes both the conditional branch and subworkflow features to handle errors and retry jobs automatically.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fi px bg py"><div class="pm pn po"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*d7XTqfPjAkuCBv6C" alt="image" /></picture></div></div></figure><p id="add9" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">In this example, subworkflow `job1` runs another workflow consisting of extract-transform-load (ETL) and audit jobs. Next, a status check job leverages Maestro’s parameter and SEL support to retrieve the status of the previous job. Based on this status, it can decide whether to complete the workflow or to run a recovery job to address any data issues.
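</p><p>The auto-recovery pattern can be outlined as follows (an illustrative Python sketch with hypothetical names, not actual Maestro workflow code): run a subworkflow, inspect its status through a parameter, and conditionally branch into a recovery job before rerunning.</p>

```python
# Hypothetical sketch of the auto-recovery pattern: subworkflow,
# status check, conditional recovery branch, then rerun.
def run_etl_and_audit(data):
    # stand-in for subworkflow job1: ETL plus an audit check
    return "FAILED" if data.get("corrupt") else "SUCCEEDED"

def recovery_job(data):
    data["corrupt"] = False          # remediate the data issue

def auto_recovery_workflow(data):
    status = run_etl_and_audit(data)       # subworkflow job1
    if status == "FAILED":                 # conditional branch on status
        recovery_job(data)
        status = run_etl_and_audit(data)   # subworkflow job2: same workflow
    return status

print(auto_recovery_workflow({"corrupt": True}))  # SUCCEEDED after recovery
```

<p>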
After resolving the issue, it then executes subworkflow `job2`, which runs the same workflow as subworkflow `job1`.</p><h2 id="dd55" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Step Runtime and Step Parameter</h2><p id="8600" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj"><strong class="mx gu">Step Runtime Interface<br /></strong>In Maestro, we use the step runtime to describe a job at execution time. The step runtime interface defines two pieces of information:</p><ol class=""><li id="2de4" class="mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns qi qb qc bj">A set of basic APIs to control the behavior of a step instance at execution runtime.</li><li id="7fbd" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qi qb qc bj">Some simple data structures to track the step runtime state and execution result.</li></ol><p id="e253" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Maestro offers a few step runtime implementations, such as the foreach step runtime and the subworkflow step runtime (mentioned in the previous sections). Each implementation defines its own logic for the start, execute, and terminate operations. At runtime, these operations control how a step instance is initialized, how its business logic is performed, and how the execution is terminated under certain conditions (e.g. manual intervention by users).</p><p id="17d4" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Also, the Maestro step runtime internally keeps track of the runtime state as well as the execution result of the step. The runtime state is used to determine the next state transition of the step and to tell whether it has failed or terminated.
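</p><p>As a rough sketch of this interface (Maestro is implemented in Java; the method and field names here are illustrative, not its actual API), a step runtime pairs start/execute/terminate operations with a small state record:</p>

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Illustrative sketch of a step runtime: lifecycle operations plus a
# small record tracking runtime state and execution result.
@dataclass
class StepRuntimeState:
    status: str = "CREATED"              # drives the next state transition
    artifacts: dict = field(default_factory=dict)
    timeline: list = field(default_factory=list)

class StepRuntime(ABC):
    @abstractmethod
    def start(self, state: StepRuntimeState): ...
    @abstractmethod
    def execute(self, state: StepRuntimeState): ...
    @abstractmethod
    def terminate(self, state: StepRuntimeState): ...

class NoOpStepRuntime(StepRuntime):
    def start(self, state):
        state.status = "RUNNING"
        state.timeline.append("Created")
    def execute(self, state):
        state.status = "SUCCEEDED"       # business logic would run here
        state.artifacts["result"] = 42
    def terminate(self, state):
        state.status = "STOPPED"         # e.g. manual intervention

state = StepRuntimeState()
runtime = NoOpStepRuntime()
runtime.start(state)
runtime.execute(state)
print(state.status)  # SUCCEEDED
```

<p>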
The execution result hosts both step artifacts and the timeline of the step execution history, which are accessible by subsequent steps.</p><p id="ddb1" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj"><strong class="mx gu">Step Parameter Merging<br /></strong>To control step behavior in a dynamic way, Maestro supports both runtime parameter and tag injection in the step runtime. This makes a Maestro step flexible enough to absorb runtime changes (e.g. overridden parameters) before actually being started. Maestro internally maintains a step parameter map that is initially empty and is updated by merging step parameters in the order below:</p><ul class=""><li id="8853" class="mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns qa qb qc bj"><strong class="mx gu">Default General Parameters</strong>: Parameter merging starts from the default parameters that every step should generally have. For example, workflow_instance_id, step_instance_uuid, step_attempt_id and step_id are required parameters for each Maestro step. They are internally reserved by Maestro and cannot be passed by users.</li><li id="4d0d" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj"><strong class="mx gu">Injected Parameters</strong>: Maestro then merges injected parameters (if present) into the parameter map. The injected parameters come from the step runtime and are dynamically generated based on the step schema. Each type of step can have its own schema with specific parameters associated with it. The step schema can evolve independently with no need to update Maestro code.</li><li id="54c3" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj"><strong class="mx gu">Default Typed Parameters</strong>: After injecting runtime parameters, Maestro tries to merge the default parameters that are related to a specific type of step.
For example, the foreach step has the loop_params and loop_index default parameters, which are internally set by Maestro and used only for foreach steps.</li><li id="0a3b" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj"><strong class="mx gu">Workflow and Step Info Parameters</strong>: These parameters contain information about the step and the workflow it belongs to, such as identity information (e.g. workflow_id), and are merged into the step parameter map if present.</li><li id="0292" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj"><strong class="mx gu">Undefined New Parameters</strong>: When starting or restarting a Maestro workflow instance, users can specify new step parameters that are not present in the initial step definition. ParamsManager merges these parameters to ensure they are available at execution time.</li><li id="d548" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj"><strong class="mx gu">Step Definition Parameters</strong>: These step parameters are defined by users at definition time and get merged if they are not empty.</li><li id="3f46" class="mv mw gt mx b my qd na nb nc qe ne nf ng qf ni nj nk qg nm nn no qh nq nr ns qa qb qc bj"><strong class="mx gu">Run and Restart Parameters</strong>: When starting or restarting a Maestro workflow instance, users can override defined parameters by providing run or restart parameters.
These two types of parameters are merged at the end so that the step runtime sees the most recent and accurate parameter space.</li></ul><p id="7dbf" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">The parameter merging logic is visualized in the diagram below.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fi px bg py"><div class="pm pn po"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*bARelX8reZTdmFgr" alt="image" /></picture></div></div></figure><h2 id="d5a8" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Step Dependencies and Signals</h2><p id="9493" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Steps in a Maestro workflow graph can express execution dependencies using step dependencies. A step dependency specifies the data-related conditions required by a step to start execution. These conditions are usually defined based on signals, which are messages carrying information such as parameter values; they can be published through step outputs or by external systems such as SNS or Kafka.</p><p id="5a67" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Signals in Maestro serve both the signal trigger pattern and the signal dependency (publisher-subscriber) pattern. One step can publish an output signal (<a class="af nt" href="https://github.com/Netflix/maestro/blob/main/maestro-common/src/testFixtures/resources/fixtures/instances/sample-step-instance-failed.json#L151-L215" rel="noopener ugc nofollow" target="_blank">a sample example</a>) that can unblock the execution of multiple other steps that depend on it.
A <a class="af nt" href="https://github.com/Netflix/maestro/blob/main/maestro-common/src/main/java/com/netflix/maestro/models/definition/SignalOutputsDefinition.java" rel="noopener ugc nofollow" target="_blank">signal definition</a> includes a list of mapped parameters, allowing Maestro to perform “signal matching” on a subset of fields. Additionally, Maestro supports <a class="af nt" href="https://github.com/Netflix/maestro/blob/main/maestro-common/src/main/java/com/netflix/maestro/models/parameter/SignalOperator.java" rel="noopener ugc nofollow" target="_blank">signal operators</a> like &lt;, &gt;, etc., on signal parameter values.</p><p id="652d" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Netflix has built various abstractions on top of the concept of signals. For instance, an ETL workflow can update a table with data and send signals that unblock steps in downstream workflows dependent on that data. Maestro supports “signal lineage,” which allows users to navigate all historical instances of signals and the workflow steps that match (i.e. publish or consume) those signals. Signal triggering guarantees exactly-once execution for the workflow subscribing to a signal or a set of joined signals. This approach is efficient, as it conserves resources by executing the workflow or step only when the specified conditions in the signals are met. A dedicated signal service implements these advanced abstractions.
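</p><p>Signal matching on parameter values with operators can be illustrated with a small sketch (hypothetical Python, not the signal service’s real API): a step dependency names a signal plus conditions on its parameters, and a published signal unblocks the step only if every condition holds.</p>

```python
import operator

# Illustrative sketch of "signal matching" with comparison operators
# applied to signal parameter values.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

def matches(signal_params, conditions):
    """True if the published signal satisfies every dependency condition."""
    return all(OPS[op](signal_params[name], value)
               for name, op, value in conditions)

# A signal published by an upstream ETL step (hypothetical fields):
signal = {"table": "playback_events", "partition_date": 20240105}
deps = [("table", "=", "playback_events"), ("partition_date", ">", 20240101)]
print(matches(signal, deps))  # True -> the dependent step is unblocked
```

<p>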
Please refer to the <a class="af nt" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c#1fdf">Maestro blog</a> for further details on it.</p><h2 id="28e5" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Breakpoint</h2><p id="8764" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Maestro allows users to set breakpoints on workflow steps, functioning similarly to code-level breakpoints in an IDE. When a workflow instance executes and reaches a step with a breakpoint, that step enters a “paused” state. This halts the workflow graph’s progression until a user manually resumes from the breakpoint. If multiple instances of a workflow step are paused at a breakpoint, resuming one instance will only affect that specific instance, leaving the others in a paused state. Deleting the breakpoint will cause all paused step instances to resume.</p><p id="cace" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">This feature is particularly useful during the initial development of a workflow, allowing users to inspect step executions and output data. It is also beneficial when running a step multiple times in a “foreach” pattern with various input parameters. Setting a single breakpoint on a step will cause all iterations of the foreach loop to pause at that step for debugging purposes. Additionally, the breakpoint feature allows human intervention during the workflow execution and can also be used for other purposes, e.g. 
supporting mutation of step states while the workflow is running.</p><h2 id="7ab1" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Timeline</h2><p id="ed4f" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Maestro includes a step execution timeline, capturing all significant events such as execution state machine changes and the reasoning behind them. This feature is useful for debugging, providing insights into the status of a step. For example, it logs transitions such as “Created” and “Evaluating params”. An example of a timeline is included <a class="af nt" href="https://github.com/Netflix/maestro/blob/main/maestro-common/src/testFixtures/resources/fixtures/instances/sample-step-instance-failed.json#L137-L150" rel="noopener ugc nofollow" target="_blank">here</a> for reference. Step runtime implementations can add events to the timeline to surface execution information to end users.</p><h2 id="461d" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Retry Policies</h2><p id="33ba" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Maestro supports retry policies for steps that reach a terminal state due to failure. Users can specify the number of retries and configure retry policies, including delays between retries and exponential backoff strategies, in addition to fixed-interval retries. Maestro distinguishes between two types of retries: “platform” and “user.” Platform retries address platform-level errors unrelated to user logic, while user retries are for user-defined conditions.
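</p><p>A minimal sketch of such per-error-type retry policies with exponential backoff might look as follows (the policy values and names are illustrative, not Maestro’s defaults):</p>

```python
# Illustrative per-error-type retry policies: separate budgets for
# "platform" and "user" errors, with exponential backoff.
POLICIES = {
    "platform": {"max_retries": 10, "base_delay_s": 1, "backoff": 2.0},
    "user":     {"max_retries": 2,  "base_delay_s": 60, "backoff": 1.0},
}

def next_retry_delay(error_type, attempt):
    """Delay in seconds before retry `attempt` (1-based); None to give up."""
    policy = POLICIES[error_type]
    if attempt > policy["max_retries"]:
        return None                      # budget exhausted: fail the step
    return policy["base_delay_s"] * policy["backoff"] ** (attempt - 1)

print(next_retry_delay("platform", 3))   # 4.0 (delays grow 1s, 2s, 4s, ...)
print(next_retry_delay("user", 3))       # None: only 2 user retries allowed
```

<p>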
Each type of retry can have its own set of retry policies.</p><p id="46fa" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Automatic retries are beneficial for handling transient errors that can be resolved without user intervention. Maestro provides the flexibility to set retries to zero for non-idempotent steps to avoid retrying them. This feature ensures that users have control over how retries are managed based on their specific requirements.</p><h2 id="ef2f" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Aggregated View</h2><p id="4c03" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Because a workflow instance can have multiple runs, it is important for users to see an aggregated state of all steps in the workflow instance. The aggregated view is computed by merging the base aggregated view with the step statuses of the current run. For example, in the figure below simulating a simple case, there is a first run where step1 and step2 succeeded, step3 failed, and step4 and step5 never started. When the user restarts the run, run 2 starts from step3, skipping step1 and step2, which succeeded in the previous run.
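</p><p>The merge itself can be sketched in a few lines (a simplified Python illustration of the idea, not Maestro’s implementation): start from the previous runs’ aggregated statuses and overlay whatever the current run executed.</p>

```python
# Illustrative sketch of the aggregated-view computation: the current
# run's statuses override the base view for the steps it actually ran.
def aggregate(base_view, current_run):
    view = dict(base_view)
    view.update(current_run)      # current run wins for steps it executed
    return view

run1 = {"step1": "SUCCEEDED", "step2": "SUCCEEDED", "step3": "FAILED"}
run2 = {"step3": "SUCCEEDED", "step4": "SUCCEEDED", "step5": "SUCCEEDED"}
print(aggregate(run1, run2))
# after run 2, every step shows SUCCEEDED in the aggregated view
```

<p>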
After all steps succeed, the aggregated view shows the run states for all steps.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fi px bg py"><div class="pm pn qj"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*UC1Sj36z5IfvDz9X" alt="image" /></picture></div></div></figure><h2 id="0eeb" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Rollup</h2><p id="6431" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Rollup provides a high-level summary of a workflow instance, detailing the status of each step and the count of steps in each status. It flattens steps across the current instance and any nested non-inline workflows like subworkflows or foreach steps. For instance, if a successful workflow has three steps, one of which is a subworkflow corresponding to a five-step workflow, the rollup will indicate that seven steps succeeded. Only leaf steps are counted in the rollup, as other steps serve merely as pointers to concrete workflows.</p><p id="4d8c" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Rollup also retains references to any non-successful steps, offering a clear overview of step statuses and facilitating easy navigation to problematic steps, even within nested workflows. The aggregated rollup for a workflow instance is calculated by combining the current run’s runtime data with a base rollup. The current state is derived from the statuses of active steps, including aggregated rollups for foreach and subworkflow steps.
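</p><p>The leaf-only counting can be sketched as a small recursion (hypothetical types; Maestro’s real model is richer):</p>

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of rollup counting: only leaf steps are counted,
// and a subworkflow pointer step is flattened into its leaf steps.
public final class RollupSketch {
    public sealed interface Step permits Leaf, Subworkflow {}
    public record Leaf(String status) implements Step {}
    public record Subworkflow(List<Step> steps) implements Step {}

    // Returns status -> count over all leaf steps, flattening nested workflows.
    public static Map<String, Long> rollup(List<Step> steps) {
        Map<String, Long> counts = new HashMap<>();
        for (Step s : steps) {
            if (s instanceof Leaf leaf) {
                counts.merge(leaf.status(), 1L, Long::sum);
            } else if (s instanceof Subworkflow sub) {
                // recurse; the pointer step itself is not counted
                rollup(sub.steps()).forEach((k, v) -> counts.merge(k, v, Long::sum));
            }
        }
        return counts;
    }
}
```

<p>A three-step workflow whose second step points at a five-step subworkflow rolls up to seven leaf steps, matching the example above.</p><p>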
The base rollup is established when the workflow instance begins and includes statuses of inline steps (excluding foreach and subworkflows) from the previous run that are not part of the current run.</p><p id="957b" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">For subworkflow steps, the rollup simply reflects the rollup of the subworkflow instance. For foreach steps, the rollup combines the base rollup of the foreach step with the current state rollup. The base is derived from the previous run’s aggregated rollup, excluding the iterations to be restarted in the new run. The current state is periodically updated by aggregating rollups of running iterations until all iterations reach a terminal state.</p><p id="2610" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">Due to these processes, the rollup model is eventually consistent. While the figure below illustrates a straightforward example of rollup, the calculations can become complex and recursive, especially with multiple levels of nested foreaches and subworkflows.</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fi px bg py"><div class="pm pn po"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*ISib6wPtCLtAbOuU" alt="image" /></picture></div></div></figure><h2 id="fb0a" class="ox nv gt be nw oy oz dx oa pa pb dz oe ng pc pd pe nk pf pg ph no pi pj pk pl bj">Maestro Event Publishing</h2><p id="8676" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">When a workflow definition, workflow instance, or step instance changes, Maestro generates an event, processes it internally, and publishes the processed event to external systems. Maestro has both internal and external events.
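</p><p>The internal-to-external split might be sketched like this (hypothetical event types, not Maestro’s actual classes):</p>

```java
import java.util.Optional;

// Hypothetical sketch: internal job events are processed inside Maestro,
// and only the ones downstream services care about become external events.
public final class EventFlowSketch {
    public enum InternalType { STEP_HEARTBEAT, WORKFLOW_CHANGE, INSTANCE_STATUS_CHANGE }
    public record InternalEvent(InternalType type, String payload) {}
    public record ExternalEvent(String category, String payload) {}

    // Returns an external event only for changes worth publishing downstream.
    public static Optional<ExternalEvent> toExternal(InternalEvent event) {
        return switch (event.type()) {
            case WORKFLOW_CHANGE -> Optional.of(new ExternalEvent("workflow-change", event.payload()));
            case INSTANCE_STATUS_CHANGE -> Optional.of(new ExternalEvent("status-change", event.payload()));
            case STEP_HEARTBEAT -> Optional.empty(); // purely internal, never published
        };
    }
}
```

<p>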
The internal event tracks changes within the life cycle of a workflow, workflow instance, or step instance. It is published to an internal queue and processed within Maestro. After internal events are processed, some of them are transformed into external events and sent to an external queue (e.g. SNS or Kafka). External events carry Maestro status-change information for downstream services. The event publishing flow is illustrated in the diagram below:</p><figure class="pp pq pr ps pt pu pm pn paragraph-image"><div role="button" tabindex="0" class="pv pw fi px bg py"><div class="pm pn po"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*n2Kiea-ngDjnKppJ" alt="image" /></picture></div></div></figure><p id="8ae3" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">As shown in the diagram, the Maestro event processor bridges the two aforementioned kinds of Maestro events. It listens on the internal queue for the published <a class="af nt" href="https://github.com/Netflix/maestro/tree/main/maestro-engine/src/main/java/com/netflix/maestro/engine/jobevents" rel="noopener ugc nofollow" target="_blank">internal events</a>. Within the processor, each internal job event is processed based on its type and, if needed, converted to an <a class="af nt" href="https://github.com/Netflix/maestro/tree/main/maestro-common/src/main/java/com/netflix/maestro/models/events" rel="noopener ugc nofollow" target="_blank">external event</a>. The notification publisher at the end emits the external event for downstream services to consume.</p><p id="da33" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">The downstream services are mostly event-driven. Maestro events carry the information downstream services need to capture changes in Maestro.
In general, these changes can be classified into two categories: workflow change and instance status change. The workflow change event is associated with actions at the workflow level, i.e., the definition or properties of a workflow have changed. Meanwhile, instance status change events track status transitions on workflow instances or step instances.</p><h1 id="9ed3" class="nu nv gt be nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or bj">Get Started with Maestro</h1><p id="5db4" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Maestro has been used extensively within Netflix, and today, we are excited to make the Maestro source code publicly available. We hope that the scalability and usability Maestro offers can expedite workflow development outside Netflix. We invite you to try Maestro, use it within your organization, and contribute to its development.</p><p id="7494" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">You can find the Maestro code repository at <a class="af nt" href="https://github.com/Netflix/maestro" rel="noopener ugc nofollow" target="_blank">github.com/Netflix/maestro</a>. If you have any questions, thoughts, or comments about Maestro, please feel free to create a <a class="af nt" href="https://github.com/Netflix/maestro/issues" rel="noopener ugc nofollow" target="_blank">GitHub issue</a> in the Maestro repository. We are eager to hear from you.</p><p id="f320" class="pw-post-body-paragraph mv mw gt mx b my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns gm bj">We are taking workflow orchestration to the next level and constantly solving new problems and challenges; please stay tuned for updates.
If you are passionate about solving large scale orchestration problems, please <a class="af nt" href="https://jobs.netflix.com/search?team=Data+Platform" rel="noopener ugc nofollow" target="_blank">join us</a>.</p><h1 id="5fe2" class="nu nv gt be nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq or bj">Acknowledgements</h1><p id="d8df" class="pw-post-body-paragraph mv mw gt mx b my os na nb nc ot ne nf ng ou ni nj nk ov nm nn no ow nq nr ns gm bj">Thanks to other Maestro team members, <a class="af nt" href="https://www.linkedin.com/in/binbing-hou/" rel="noopener ugc nofollow" target="_blank">Binbing Hou</a>, <a class="af nt" href="http://linkedin.com/in/zhuoran-d-96848b154" rel="noopener ugc nofollow" target="_blank">Zhuoran Dong</a>, <a class="af nt" href="https://www.linkedin.com/in/brittany-truong-a35b54bb" rel="noopener ugc nofollow" target="_blank">Brittany Truong</a>, <a class="af nt" href="https://www.linkedin.com/in/rdeepak2002/" rel="noopener ugc nofollow" target="_blank">Deepak Ramalingam</a>, <a class="af nt" href="http://linkedin.com/in/moctarba" rel="noopener ugc nofollow" target="_blank">Moctar Ba</a>, for their contributions to the Maestro project. Thanks to our Product Manager <a class="af nt" href="https://www.linkedin.com/in/ashpokh/" rel="noopener ugc nofollow" target="_blank">Ashim Pokharel</a> for driving the strategy and requirements. We’d also like to thank <a class="af nt" href="https://www.linkedin.com/in/andrew-seier/" rel="noopener ugc nofollow" target="_blank">Andrew Seier</a>, <a class="af nt" href="https://www.linkedin.com/in/romain-cledat-4a211a5" rel="noopener ugc nofollow" target="_blank">Romain Cledat</a>, <a class="af nt" href="https://www.linkedin.com/in/agorajek/" rel="noopener ugc nofollow" target="_blank">Olek Gorajek</a>, and other stunning colleagues at Netflix for their contributions to the Maestro project. 
We also thank Prashanth Ramdas, Eva Tse, David Noor, Charles Smith and other leaders of Netflix engineering organizations for their constructive feedback and suggestions on the Maestro project.</p></div>]]></description>
      <link>https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78</link>
      <guid>https://netflixtechblog.com/maestro-netflixs-workflow-orchestrator-ee13a06f9c78</guid>
      <pubDate>Mon, 22 Jul 2024 19:38:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg fy fz ga gb"><div><div><h2 id="14a1" class="pw-subtitle-paragraph hq gs gt be b hr hs ht hu hv hw hx hy hz ia ib ic id ie if cp dt">Applying Quality of Service techniques at the application level</h2><div class="ig ih ii ij ik"></div><p id="0656" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj"><a class="af oi" href="https://www.linkedin.com/in/amendira/" rel="noopener ugc nofollow" target="_blank">Anirudh Mendiratta</a>, <a class="af oi" href="https://www.linkedin.com/in/kzwang" rel="noopener ugc nofollow" target="_blank">Kevin Wang</a>, <a class="af oi" href="https://jolynch.github.io/" rel="noopener ugc nofollow" target="_blank">Joey Lynch</a>, <a class="af oi" href="https://www.linkedin.com/in/ivern" rel="noopener ugc nofollow" target="_blank">Javier Fernandez-Ivern</a>, <a class="af oi" href="https://www.linkedin.com/in/benjamin-fedorka" rel="noopener ugc nofollow" target="_blank">Benjamin Fedorka</a></p><h1 id="37a8" class="oj ok gt be ol om on ht oo op oq hw or os ot ou ov ow ox oy oz pa pb pc pd pe bj">Introduction</h1><p id="8aa0" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">In November 2020, we introduced the concept of prioritized load shedding at the API gateway level in our blog post, <a class="af oi" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94">Keeping Netflix Reliable Using Prioritized Load Shedding</a>. 
Today, we’re excited to dive deeper into how we’ve extended this strategy to the individual service level, focusing on the video streaming control plane and data plane, to further enhance user experience and system resilience.</p><h1 id="5443" class="oj ok gt be ol om on ht oo op oq hw or os ot ou ov ow ox oy oz pa pb pc pd pe bj">The Evolution of Load Shedding at Netflix</h1><p id="b404" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">At Netflix, ensuring a seamless viewing experience for millions of users simultaneously is paramount. Our initial approach for prioritized load shedding was implemented at the Zuul API gateway layer. This system effectively manages different types of network traffic, ensuring that critical playback requests receive priority over less critical telemetry traffic.</p><p id="9d2f" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Building on this foundation, we recognized the need to apply a similar prioritization logic deeper within our architecture, specifically at the service layer where different types of requests within the same service could be prioritized differently. The advantages of applying these techniques at the service level in addition to our edge API gateway are:</p><ol class=""><li id="c118" class="nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh pk pl pm bj">Service teams can own their prioritization logic and can apply finer grained prioritization.</li><li id="32f7" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh pk pl pm bj">This can be used for backend to backend communication, i.e. 
for services not sitting behind our edge API gateway.</li><li id="c6f4" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh pk pl pm bj">Services can use cloud capacity more efficiently by combining different request types into one cluster and shedding low priority requests when necessary instead of maintaining separate clusters for failure isolation.</li></ol><h1 id="d410" class="oj ok gt be ol om on ht oo op oq hw or os ot ou ov ow ox oy oz pa pb pc pd pe bj">Introducing Service-Level Prioritized Load Shedding</h1><p id="4cd7" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">PlayAPI is a critical backend service on the video streaming control plane, responsible for handling device initiated manifest and license requests necessary to start playback. We categorize these requests into two types based on their criticality:</p><ol class=""><li id="daf3" class="nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh pk pl pm bj"><strong class="no gu">User-Initiated Requests (critical):</strong> These requests are made when a user hits play and directly impact the user’s ability to start watching a show or a movie.</li><li id="9efd" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh pk pl pm bj"><strong class="no gu">Pre-fetch Requests (non-critical):</strong> These requests are made optimistically when a user browses content without the user hitting play, to reduce latency should the user decide to watch a particular title. 
A failure in only pre-fetch requests does not result in a playback failure, but slightly increases the latency between pressing play and video appearing on screen.</li></ol></div></div><div class="ps"><div class="ab ca"><div class="mo pt mp pu mq pv ce pw cf px ch bg"><figure class="qb qc qd qe qf ps qg qh paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz qa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*2KByIB47RWng5UNH%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*2KByIB47RWng5UNH%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*2KByIB47RWng5UNH%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*2KByIB47RWng5UNH%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*2KByIB47RWng5UNH%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*2KByIB47RWng5UNH%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*2KByIB47RWng5UNH%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*2KByIB47RWng5UNH 640w, https://miro.medium.com/v2/resize:fit:720/0*2KByIB47RWng5UNH 720w, https://miro.medium.com/v2/resize:fit:750/0*2KByIB47RWng5UNH 750w, https://miro.medium.com/v2/resize:fit:786/0*2KByIB47RWng5UNH 786w, https://miro.medium.com/v2/resize:fit:828/0*2KByIB47RWng5UNH 828w, https://miro.medium.com/v2/resize:fit:1100/0*2KByIB47RWng5UNH 1100w, 
https://miro.medium.com/v2/resize:fit:2000/0*2KByIB47RWng5UNH 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Netflix on Chrome making pre-fetch requests to PlayAPI while the user is browsing content</em></figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fy fz ga gb"><h2 id="ca91" class="qs ok gt be ol qt qu dx oo qv qw dz or nv qx qy qz nz ra rb rc od rd re rf rg bj">The Problem</h2><p id="991e" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">To handle large traffic spikes, high backend latency, or an under-scaled backend service, PlayAPI previously used a concurrency limiter that throttled requests, reducing the availability of user-initiated and pre-fetch requests equally.
This was not ideal because:</p><ol class=""><li id="844a" class="nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh pk pl pm bj">Spikes in pre-fetch traffic reduced availability for user-initiated requests</li><li id="174b" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh pk pl pm bj">Increased backend latency reduced availability for user-initiated and pre-fetch requests equally, even when the system had enough capacity to serve all user-initiated requests.</li></ol><p id="47f1" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Sharding the critical and non-critical requests into separate clusters was an option that addressed problem 1 and added failure isolation between the two types of requests; however, it came with a higher compute cost. Another disadvantage of sharding is that it adds some operational overhead — engineers need to make sure CI/CD, auto-scaling, metrics, and alerts are enabled for the new cluster.</p><figure class="qb qc qd qe qf ps py pz paragraph-image"><div class="py pz rh"><picture><img src="https://miro.medium.com/v2/resize:fit:742/format:webp/0*pNfPHfPFe_k8r-YC" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, 
(min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 371px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*pNfPHfPFe_k8r-YC 640w, https://miro.medium.com/v2/resize:fit:720/0*pNfPHfPFe_k8r-YC 720w, https://miro.medium.com/v2/resize:fit:750/0*pNfPHfPFe_k8r-YC 750w, https://miro.medium.com/v2/resize:fit:786/0*pNfPHfPFe_k8r-YC 786w, https://miro.medium.com/v2/resize:fit:828/0*pNfPHfPFe_k8r-YC 828w, https://miro.medium.com/v2/resize:fit:1100/0*pNfPHfPFe_k8r-YC 1100w, https://miro.medium.com/v2/resize:fit:742/0*pNfPHfPFe_k8r-YC 742w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 371px" /></picture></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><strong class="be ol"><em class="qr">Option 1</em></strong><em class="qr"> — No isolation</em></figcaption></figure><figure class="qb qc qd qe qf ps py pz paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz ri"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*BqZJayMkzt5-ZIHB%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*BqZJayMkzt5-ZIHB%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*BqZJayMkzt5-ZIHB%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*BqZJayMkzt5-ZIHB%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*BqZJayMkzt5-ZIHB%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*BqZJayMkzt5-ZIHB%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*BqZJayMkzt5-ZIHB%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*BqZJayMkzt5-ZIHB 640w, https://miro.medium.com/v2/resize:fit:720/0*BqZJayMkzt5-ZIHB 720w, https://miro.medium.com/v2/resize:fit:750/0*BqZJayMkzt5-ZIHB 750w, https://miro.medium.com/v2/resize:fit:786/0*BqZJayMkzt5-ZIHB 786w, https://miro.medium.com/v2/resize:fit:828/0*BqZJayMkzt5-ZIHB 828w, https://miro.medium.com/v2/resize:fit:1100/0*BqZJayMkzt5-ZIHB 1100w, https://miro.medium.com/v2/resize:fit:1400/0*BqZJayMkzt5-ZIHB 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><strong class="be ol"><em class="qr">Option 2</em></strong><em class="qr"> — Isolation but higher compute cost</em></figcaption></figure><h2 id="c84b" class="qs ok gt be ol qt qu dx oo qv qw dz or nv qx qy qz nz ra rb rc od rd re rf rg bj">Our Solution</h2><p id="8318" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">We implemented a concurrency limiter within PlayAPI that prioritizes user-initiated requests over prefetch requests without physically sharding the two request handlers. This mechanism uses the partitioning functionality of the open source <a class="af oi" href="https://github.com/Netflix/concurrency-limits" rel="noopener ugc nofollow" target="_blank">Netflix/concurrency-limits</a> Java library. We create two partitions in our limiter:</p><ul class=""><li id="28e9" class="nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh rj pl pm bj"><strong class="no gu">User-Initiated Partition:</strong> Guaranteed 100% throughput.</li><li id="542f" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh rj pl pm bj"><strong class="no gu">Pre-fetch Partition:</strong> Utilizes only excess capacity.</li></ul><figure class="qb qc qd qe qf ps py pz paragraph-image"><div class="py pz rk"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*BS1KXcwsikLJ4Zok%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*BS1KXcwsikLJ4Zok%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*BS1KXcwsikLJ4Zok%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*BS1KXcwsikLJ4Zok%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*BS1KXcwsikLJ4Zok%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*BS1KXcwsikLJ4Zok%201100w,%20https://miro.medium.com/v2/resize:fit:822/format:webp/0*BS1KXcwsikLJ4Zok%20822w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 411px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*BS1KXcwsikLJ4Zok 640w, https://miro.medium.com/v2/resize:fit:720/0*BS1KXcwsikLJ4Zok 720w, https://miro.medium.com/v2/resize:fit:750/0*BS1KXcwsikLJ4Zok 750w, https://miro.medium.com/v2/resize:fit:786/0*BS1KXcwsikLJ4Zok 786w, https://miro.medium.com/v2/resize:fit:828/0*BS1KXcwsikLJ4Zok 828w, https://miro.medium.com/v2/resize:fit:1100/0*BS1KXcwsikLJ4Zok 1100w, https://miro.medium.com/v2/resize:fit:822/0*BS1KXcwsikLJ4Zok 822w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and 
(max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 411px" /></picture></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><strong class="be ol"><em class="qr">Option 3</em></strong><em class="qr"> — Single cluster with prioritized load-shedding offers application-level isolation with lower compute cost. Each instance serves both types of requests and has a partition whose size adjusts dynamically to ensure that pre-fetch requests only get excess capacity. This allows user-initiated requests to “steal” pre-fetch capacity when necessary.</em></figcaption></figure><p id="5eea" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">The partitioned limiter is configured as a pre-processing <a class="af oi" href="https://github.com/Netflix/concurrency-limits/blob/master/concurrency-limits-servlet/src/main/java/com/netflix/concurrency/limits/servlet/ConcurrencyLimitServletFilter.java" rel="noopener ugc nofollow" target="_blank">Servlet Filter</a> that uses HTTP headers sent by devices to determine a request’s criticality, thus avoiding the need to read and parse the request body for rejected requests. This ensures that the limiter is not itself a bottleneck and can effectively reject requests while using minimal CPU. 
For example, the filter can be initialized as follows:</p><pre class="qb qc qd qe qf rl rm rn bo ro ba bj">Filter filter = new ConcurrencyLimitServletFilter(<br />        new ServletLimiterBuilder()<br />                .named("playapi")<br />                .partitionByHeader("X-Netflix.Request-Name")<br />                .partition("user-initiated", 1.0)<br />                .partition("pre-fetch", 0.0)<br />                .build());</pre><p id="a211" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Note that in steady state, there is no throttling and the prioritization has no effect on the handling of pre-fetch requests. The prioritization mechanism only kicks in when a server is at the concurrency limit and needs to reject requests.</p><h2 id="24b8" class="qs ok gt be ol qt qu dx oo qv qw dz or nv qx qy qz nz ra rb rc od rd re rf rg bj">Testing</h2><p id="ecef" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">To validate that our load-shedding worked as intended, we used Failure Injection Testing to inject 2 seconds of latency into pre-fetch calls, whose typical p99 latency is &lt; 200 ms. The failure was injected on one baseline instance with regular load shedding and one canary instance with prioritized load shedding. Some internal services that PlayAPI calls use separate clusters for user-initiated and pre-fetch requests and run pre-fetch clusters hotter. 
This test case simulates a scenario where a pre-fetch cluster for a downstream service is experiencing high latency.</p></div></div><div class="ps"><div class="ab ca"><div class="mo pt mp pu mq pv ce pw cf px ch bg"><figure class="qb qc qd qe qf ps qg qh paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz ru"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*oU-FvJW2BCw5Z158%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*oU-FvJW2BCw5Z158%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*oU-FvJW2BCw5Z158%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*oU-FvJW2BCw5Z158%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*oU-FvJW2BCw5Z158%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*oU-FvJW2BCw5Z158%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*oU-FvJW2BCw5Z158%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*oU-FvJW2BCw5Z158 640w, https://miro.medium.com/v2/resize:fit:720/0*oU-FvJW2BCw5Z158 720w, https://miro.medium.com/v2/resize:fit:750/0*oU-FvJW2BCw5Z158 750w, https://miro.medium.com/v2/resize:fit:786/0*oU-FvJW2BCw5Z158 786w, https://miro.medium.com/v2/resize:fit:828/0*oU-FvJW2BCw5Z158 828w, https://miro.medium.com/v2/resize:fit:1100/0*oU-FvJW2BCw5Z158 1100w, https://miro.medium.com/v2/resize:fit:2000/0*oU-FvJW2BCw5Z158 2000w" sizes="(min-resolution: 4dppx) 
and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Baseline — Without prioritized load-shedding. Both pre-fetch and user-initiated see an equal drop in availability</em></figcaption></figure><figure class="mk ps qg qh paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz ru"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*hcY1lYOP4CVxn-LS%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*hcY1lYOP4CVxn-LS%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*hcY1lYOP4CVxn-LS%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*hcY1lYOP4CVxn-LS%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*hcY1lYOP4CVxn-LS%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*hcY1lYOP4CVxn-LS%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*hcY1lYOP4CVxn-LS%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" 
srcset="https://miro.medium.com/v2/resize:fit:640/0*hcY1lYOP4CVxn-LS 640w, https://miro.medium.com/v2/resize:fit:720/0*hcY1lYOP4CVxn-LS 720w, https://miro.medium.com/v2/resize:fit:750/0*hcY1lYOP4CVxn-LS 750w, https://miro.medium.com/v2/resize:fit:786/0*hcY1lYOP4CVxn-LS 786w, https://miro.medium.com/v2/resize:fit:828/0*hcY1lYOP4CVxn-LS 828w, https://miro.medium.com/v2/resize:fit:1100/0*hcY1lYOP4CVxn-LS 1100w, https://miro.medium.com/v2/resize:fit:2000/0*hcY1lYOP4CVxn-LS 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Canary — With prioritized load-shedding. Only pre-fetch availability drops while user-initiated availability stays at 100%</em></figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fy fz ga gb"><p id="2b2e" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Without prioritized load-shedding, both user-initiated and prefetch availability drop when latency is injected. 
However, after adding prioritized load-shedding, user-initiated requests maintain a 100% availability and only prefetch requests are throttled.</p><p id="cc2a" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">We were ready to roll this out to production and see how it performed in the wild!</p><h2 id="d734" class="qs ok gt be ol qt qu dx oo qv qw dz or nv qx qy qz nz ra rb rc od rd re rf rg bj">Real-World Application and Results</h2><p id="8504" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">Netflix engineers work hard to keep our systems available, and it was a while before we had a production incident that tested the efficacy of our solution. A few months after deploying prioritized load shedding, we had an infrastructure outage at Netflix that impacted streaming for many of our users. Once the outage was fixed, we got a 12x spike in pre-fetch requests per second from Android devices, presumably because there was a backlog of queued requests built up.</p><figure class="qb qc qd qe qf ps py pz paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz rv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*0AdiUnX8fdinJTNR%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*0AdiUnX8fdinJTNR%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*0AdiUnX8fdinJTNR%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*0AdiUnX8fdinJTNR%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*0AdiUnX8fdinJTNR%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*0AdiUnX8fdinJTNR%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*0AdiUnX8fdinJTNR%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and 
(max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*0AdiUnX8fdinJTNR 640w, https://miro.medium.com/v2/resize:fit:720/0*0AdiUnX8fdinJTNR 720w, https://miro.medium.com/v2/resize:fit:750/0*0AdiUnX8fdinJTNR 750w, https://miro.medium.com/v2/resize:fit:786/0*0AdiUnX8fdinJTNR 786w, https://miro.medium.com/v2/resize:fit:828/0*0AdiUnX8fdinJTNR 828w, https://miro.medium.com/v2/resize:fit:1100/0*0AdiUnX8fdinJTNR 1100w, https://miro.medium.com/v2/resize:fit:1400/0*0AdiUnX8fdinJTNR 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Spike in Android pre-fetch RPS</em></figcaption></figure><p id="d570" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">This could have resulted in a second outage as our systems weren’t scaled to handle this traffic spike. Did prioritized load-shedding in PlayAPI help us here?</p><p id="8fbe" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Yes! 
While the availability for prefetch requests dropped as low as 20%, the availability for user-initiated requests was &gt; 99.4% due to prioritized load-shedding.</p><figure class="qb qc qd qe qf ps py pz paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz rw"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*gVNG6nlvDevP-53B%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*gVNG6nlvDevP-53B%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*gVNG6nlvDevP-53B%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*gVNG6nlvDevP-53B%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*gVNG6nlvDevP-53B%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*gVNG6nlvDevP-53B%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*gVNG6nlvDevP-53B%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*gVNG6nlvDevP-53B 640w, https://miro.medium.com/v2/resize:fit:720/0*gVNG6nlvDevP-53B 720w, https://miro.medium.com/v2/resize:fit:750/0*gVNG6nlvDevP-53B 750w, https://miro.medium.com/v2/resize:fit:786/0*gVNG6nlvDevP-53B 786w, https://miro.medium.com/v2/resize:fit:828/0*gVNG6nlvDevP-53B 828w, https://miro.medium.com/v2/resize:fit:1100/0*gVNG6nlvDevP-53B 1100w, https://miro.medium.com/v2/resize:fit:1400/0*gVNG6nlvDevP-53B 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Availability of pre-fetch and user-initiated requests</em></figcaption></figure><p id="e7c2" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">At one point we were throttling more than 50% of all requests but the availability of user-initiated requests continued to be &gt; 99.4%.</p><h1 id="60a9" class="oj ok gt be ol om on ht oo op oq hw or os ot ou ov ow ox oy oz pa pb pc pd pe bj">Generic service work prioritization</h1><p id="bbc2" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">Based on the success of this approach, we have created an internal library to enable services to perform prioritized load shedding based on pluggable utilization measures, with multiple priority levels.</p><p id="5062" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Unlike API gateway, which needs to handle a large volume of requests with varying priorities, most microservices typically receive requests with only a few distinct priorities. 
To maintain consistency across different services, we have introduced four predefined priority buckets inspired by the <a class="af oi" href="https://linux.die.net/man/8/tc-prio" rel="noopener ugc nofollow" target="_blank">Linux tc-prio levels</a>:</p><ul class=""><li id="b0ff" class="nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh rj pl pm bj"><strong class="no gu">CRITICAL</strong>: Affects core functionality; these requests are never shed unless the system is in complete failure.</li><li id="36f6" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh rj pl pm bj"><strong class="no gu">DEGRADED</strong>: Affects user experience; these are shed progressively as load increases.</li><li id="21ce" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh rj pl pm bj"><strong class="no gu">BEST_EFFORT</strong>: Does not affect the user; these are served on a best-effort basis and may be shed progressively even during normal operation.</li><li id="2f38" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh rj pl pm bj"><strong class="no gu">BULK</strong>: Background work; expect these to be shed routinely.</li></ul><p id="0a3c" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Services can either adopt the upstream client’s priority <em class="rx">or</em> map incoming requests to one of these priority buckets themselves by examining request attributes, such as HTTP headers or the request body, for more precise control. 
Here is an example of how services can map requests to priority buckets:</p><pre class="qb qc qd qe qf rl rm rn bo ro ba bj">ResourceLimiterRequestPriorityProvider requestPriorityProvider() {<br />    return contextProvider -&gt; {<br />        if (contextProvider.getRequest().isCritical()) {<br />            return PriorityBucket.CRITICAL;<br />        } else if (contextProvider.getRequest().isHighPriority()) {<br />            return PriorityBucket.DEGRADED;<br />        } else if (contextProvider.getRequest().isMediumPriority()) {<br />            return PriorityBucket.BEST_EFFORT;<br />        } else {<br />            return PriorityBucket.BULK;<br />        }<br />    };<br />}</pre><h2 id="46c9" class="qs ok gt be ol qt qu dx oo qv qw dz or nv qx qy qz nz ra rb rc od rd re rf rg bj">Generic CPU based load-shedding</h2><p id="6762" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">Most services at Netflix autoscale on CPU utilization, so it is a natural measure of system load to tie into the prioritized load shedding framework. Once a request is mapped to a priority bucket, services can determine when to shed traffic from a particular bucket based on CPU utilization. To preserve the signal that tells autoscaling more capacity is needed, prioritized shedding only starts <em class="rx">after</em> the target CPU utilization is hit; as system load increases, progressively more critical traffic is shed in an attempt to maintain user experience.</p><p id="fc7c" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">For example, if a cluster targets 60% CPU utilization for auto-scaling, it can be configured to start shedding requests when the CPU utilization exceeds this threshold. 
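A simple way to picture this policy: each bucket gets a CPU threshold at which it starts shedding, and the shed fraction ramps linearly from zero at that threshold to 100% at full CPU. The Java sketch below is a toy model; the thresholds and the Bucket type are invented for illustration, not taken from Netflix&#8217;s internal library:</p>

```java
import java.util.concurrent.ThreadLocalRandom;

// Toy CPU-based prioritized shedding. Thresholds are illustrative:
// less critical buckets start shedding at lower CPU utilization.
enum Bucket {
    CRITICAL(0.80), DEGRADED(0.70), BEST_EFFORT(0.65), BULK(0.60);

    final double shedStart; // CPU fraction (0..1) at which this bucket starts shedding

    Bucket(double shedStart) { this.shedStart = shedStart; }

    /** Fraction of this bucket's requests to shed at the given CPU utilization. */
    double shedProbability(double cpu) {
        if (cpu <= shedStart) return 0.0; // below threshold: serve everything
        // Ramp linearly from 0% at shedStart to 100% at full CPU.
        return Math.min(1.0, (cpu - shedStart) / (1.0 - shedStart));
    }

    /** Randomized decision, so load drops smoothly rather than all-or-nothing. */
    boolean shouldShed(double cpu) {
        return ThreadLocalRandom.current().nextDouble() < shedProbability(cpu);
    }
}
```

<p class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Under this toy model, at 80% CPU a BULK request is shed about half the time while CRITICAL traffic is untouched; only beyond 80% does CRITICAL begin to shed.</p><p class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">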
When a traffic spike causes the cluster’s CPU utilization to significantly surpass this threshold, it will gradually shed low-priority traffic to conserve resources for high-priority traffic. This approach also allows more time for auto-scaling to add additional instances to the cluster. Once more instances are added, CPU utilization will decrease, and low-priority traffic will resume being served normally.</p><figure class="qb qc qd qe qf ps py pz paragraph-image"><div class="py pz ry"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*sdKTOYaSQ_tEjE8r%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*sdKTOYaSQ_tEjE8r%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*sdKTOYaSQ_tEjE8r%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*sdKTOYaSQ_tEjE8r%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*sdKTOYaSQ_tEjE8r%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*sdKTOYaSQ_tEjE8r%201100w,%20https://miro.medium.com/v2/resize:fit:1274/format:webp/0*sdKTOYaSQ_tEjE8r%201274w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 637px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*sdKTOYaSQ_tEjE8r 640w, https://miro.medium.com/v2/resize:fit:720/0*sdKTOYaSQ_tEjE8r 720w, https://miro.medium.com/v2/resize:fit:750/0*sdKTOYaSQ_tEjE8r 750w, https://miro.medium.com/v2/resize:fit:786/0*sdKTOYaSQ_tEjE8r 786w, https://miro.medium.com/v2/resize:fit:828/0*sdKTOYaSQ_tEjE8r 828w, 
https://miro.medium.com/v2/resize:fit:1100/0*sdKTOYaSQ_tEjE8r 1100w, https://miro.medium.com/v2/resize:fit:1274/0*sdKTOYaSQ_tEjE8r 1274w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 637px" /></picture></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Percentage of requests (Y-axis) being load-shed based on CPU utilization (X-axis) for different priority buckets</em></figcaption></figure><h2 id="df97" class="qs ok gt be ol qt qu dx oo qv qw dz or nv qx qy qz nz ra rb rc od rd re rf rg bj">Experiments with CPU based load-shedding</h2><p id="afe6" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">We ran a series of experiments sending a large request volume to a service that normally targets 45% CPU for autoscaling, but was prevented from scaling up so that we could observe CPU-based load shedding under extreme load. The instances were configured to shed noncritical traffic above 60% CPU and critical traffic above 80% CPU.</p><p id="0fe9" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">As RPS was dialed up past 6x the autoscale volume, the service was able to shed first noncritical and then critical requests. 
Latency remained within reasonable limits throughout, and successful RPS throughput remained stable.</p><figure class="qb qc qd qe qf ps py pz paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz rz"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*Wr6bJzQVf3dV4clf%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*Wr6bJzQVf3dV4clf%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*Wr6bJzQVf3dV4clf%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*Wr6bJzQVf3dV4clf%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*Wr6bJzQVf3dV4clf%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Wr6bJzQVf3dV4clf%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*Wr6bJzQVf3dV4clf%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*Wr6bJzQVf3dV4clf 640w, https://miro.medium.com/v2/resize:fit:720/0*Wr6bJzQVf3dV4clf 720w, https://miro.medium.com/v2/resize:fit:750/0*Wr6bJzQVf3dV4clf 750w, https://miro.medium.com/v2/resize:fit:786/0*Wr6bJzQVf3dV4clf 786w, https://miro.medium.com/v2/resize:fit:828/0*Wr6bJzQVf3dV4clf 828w, https://miro.medium.com/v2/resize:fit:1100/0*Wr6bJzQVf3dV4clf 1100w, https://miro.medium.com/v2/resize:fit:1400/0*Wr6bJzQVf3dV4clf 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 
3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Experimental behavior of CPU based load-shedding using synthetic traffic.</em></figcaption></figure><figure class="qb qc qd qe qf ps py pz paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz sa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*DZCzE_AAi2cJXRRr%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*DZCzE_AAi2cJXRRr%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*DZCzE_AAi2cJXRRr%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*DZCzE_AAi2cJXRRr%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*DZCzE_AAi2cJXRRr%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*DZCzE_AAi2cJXRRr%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*DZCzE_AAi2cJXRRr%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*DZCzE_AAi2cJXRRr 640w, https://miro.medium.com/v2/resize:fit:720/0*DZCzE_AAi2cJXRRr 720w, 
https://miro.medium.com/v2/resize:fit:750/0*DZCzE_AAi2cJXRRr 750w, https://miro.medium.com/v2/resize:fit:786/0*DZCzE_AAi2cJXRRr 786w, https://miro.medium.com/v2/resize:fit:828/0*DZCzE_AAi2cJXRRr 828w, https://miro.medium.com/v2/resize:fit:1100/0*DZCzE_AAi2cJXRRr 1100w, https://miro.medium.com/v2/resize:fit:1400/0*DZCzE_AAi2cJXRRr 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">P99 latency stayed within a reasonable range throughout the experiment, even as RPS surpassed 6x the autoscale target.</em></figcaption></figure><h2 id="ffc7" class="qs ok gt be ol qt qu dx oo qv qw dz or nv qx qy qz nz ra rb rc od rd re rf rg bj">Anti-patterns with load-shedding</h2><p id="6712" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj"><strong class="no gu">Anti-pattern 1 — No shedding</strong></p><p id="3332" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">In the above graphs, the limiter does a good job keeping latency low for the successful requests. If there was no shedding here, we’d see latency increase for all requests, instead of a fast failure in some requests that can be retried. 
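A toy queueing calculation makes this concrete. In an M/M/1 model (purely illustrative; real traffic is burstier), the mean time a request spends in the system is 1 / (serviceRate - arrivalRate), so without shedding, latency blows up for everyone as offered load approaches capacity:</p>

```java
// Toy M/M/1 queueing model (illustrative, not from the post): mean time in
// system for one server handling arrivalRps requests/sec with serviceRps capacity.
final class QueueMath {
    static double meanLatencySeconds(double arrivalRps, double serviceRps) {
        if (arrivalRps >= serviceRps) {
            // Overloaded: the queue, and therefore latency, grows without bound.
            return Double.POSITIVE_INFINITY;
        }
        return 1.0 / (serviceRps - arrivalRps);
    }
}
```

<p class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">With 1,000 RPS of capacity, mean latency in this model is about 10 ms at 900 RPS of load, 100 ms at 990 RPS, and unbounded at or beyond 1,000 RPS; shedding a few requests quickly is what keeps the remaining requests fast.</p><p class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">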
Further, this can result in a death spiral where one instance becomes unhealthy, resulting in more load on other instances, resulting in all instances becoming unhealthy before auto-scaling can kick in.</p><figure class="qb qc qd qe qf ps py pz paragraph-image"><div class="py pz cg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*Bp5AKoNtQOfHaExB%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*Bp5AKoNtQOfHaExB%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*Bp5AKoNtQOfHaExB%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*Bp5AKoNtQOfHaExB%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*Bp5AKoNtQOfHaExB%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Bp5AKoNtQOfHaExB%201100w,%20https://miro.medium.com/v2/resize:fit:1360/format:webp/0*Bp5AKoNtQOfHaExB%201360w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 680px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*Bp5AKoNtQOfHaExB 640w, https://miro.medium.com/v2/resize:fit:720/0*Bp5AKoNtQOfHaExB 720w, https://miro.medium.com/v2/resize:fit:750/0*Bp5AKoNtQOfHaExB 750w, https://miro.medium.com/v2/resize:fit:786/0*Bp5AKoNtQOfHaExB 786w, https://miro.medium.com/v2/resize:fit:828/0*Bp5AKoNtQOfHaExB 828w, https://miro.medium.com/v2/resize:fit:1100/0*Bp5AKoNtQOfHaExB 1100w, https://miro.medium.com/v2/resize:fit:1360/0*Bp5AKoNtQOfHaExB 1360w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and 
(max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 680px" /></picture></div></figure><figure class="qb qc qd qe qf ps py pz paragraph-image"><div class="py pz cg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*WPvKjlopcBGixDGB%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*WPvKjlopcBGixDGB%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*WPvKjlopcBGixDGB%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*WPvKjlopcBGixDGB%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*WPvKjlopcBGixDGB%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*WPvKjlopcBGixDGB%201100w,%20https://miro.medium.com/v2/resize:fit:1360/format:webp/0*WPvKjlopcBGixDGB%201360w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 680px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*WPvKjlopcBGixDGB 640w, https://miro.medium.com/v2/resize:fit:720/0*WPvKjlopcBGixDGB 720w, https://miro.medium.com/v2/resize:fit:750/0*WPvKjlopcBGixDGB 750w, https://miro.medium.com/v2/resize:fit:786/0*WPvKjlopcBGixDGB 786w, https://miro.medium.com/v2/resize:fit:828/0*WPvKjlopcBGixDGB 828w, 
https://miro.medium.com/v2/resize:fit:1100/0*WPvKjlopcBGixDGB 1100w, https://miro.medium.com/v2/resize:fit:1360/0*WPvKjlopcBGixDGB 1360w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 680px" /></picture></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">No load-shedding: In the absence of load-shedding, increased latency can degrade all requests instead of rejecting some requests (that can be retried), and can make instances unhealthy</em></figcaption></figure><p id="34a6" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj"><strong class="no gu">Anti-pattern 2 — Congestive failure</strong></p><p id="2d12" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Another anti-pattern to watch out for is congestive failure or shedding too aggressively. If the load-shedding is due to an increase in traffic, the successful RPS should not drop after load-shedding. 
Here is an example of what congestive failure looks like:</p><figure class="qb qc qd qe qf ps py pz paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz sb"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*gPGs2BJ1Oxu9O7TK%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*gPGs2BJ1Oxu9O7TK%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*gPGs2BJ1Oxu9O7TK%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*gPGs2BJ1Oxu9O7TK%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*gPGs2BJ1Oxu9O7TK%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*gPGs2BJ1Oxu9O7TK%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*gPGs2BJ1Oxu9O7TK%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*gPGs2BJ1Oxu9O7TK 640w, https://miro.medium.com/v2/resize:fit:720/0*gPGs2BJ1Oxu9O7TK 720w, https://miro.medium.com/v2/resize:fit:750/0*gPGs2BJ1Oxu9O7TK 750w, https://miro.medium.com/v2/resize:fit:786/0*gPGs2BJ1Oxu9O7TK 786w, https://miro.medium.com/v2/resize:fit:828/0*gPGs2BJ1Oxu9O7TK 828w, https://miro.medium.com/v2/resize:fit:1100/0*gPGs2BJ1Oxu9O7TK 1100w, https://miro.medium.com/v2/resize:fit:1400/0*gPGs2BJ1Oxu9O7TK 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, 
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="qn fe qo py pz qp qq be b bf z dt"><em class="qr">Congestive failure: After 16:57, the service starts rejecting most requests and cannot sustain the 240 successful RPS it was serving before load-shedding kicked in. This occurs with fixed concurrency limiters, or when load-shedding itself consumes so much CPU that no other work can be done</em></figcaption></figure><p id="8f93" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">As shown in the <strong class="no gu">Experiments with CPU based load-shedding</strong> section above, our load-shedding implementation avoids both of these anti-patterns by keeping latency low and sustaining as much successful RPS during load-shedding as before.</p><h1 id="9b9d" class="oj ok gt be ol om on ht oo op oq hw or os ot ou ov ow ox oy oz pa pb pc pd pe bj">Generic IO based load-shedding</h1><p id="cadc" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">Some services are not CPU-bound but instead are IO-bound by backing services or datastores that can apply back pressure via increased latency when they are overloaded in either compute or storage capacity. For these services, we reuse the prioritized load-shedding techniques but introduce new utilization measures to feed into the shedding logic.
Our initial implementation supports two forms of latency-based shedding in addition to standard adaptive concurrency limiters (themselves a measure of average latency):</p><ol class=""><li id="48b6" class="nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh pk pl pm bj">The service can specify per-endpoint target and maximum latencies, which allow it to shed when it is abnormally slow regardless of backend.</li><li id="ece4" class="nm nn gt no b hr pn nq nr hu po nt nu nv pp nx ny nz pq ob oc od pr of og oh pk pl pm bj">The Netflix storage services running on the <a class="af oi" href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6" rel="noopener">Data Gateway</a> return observed storage target and max latency SLO utilization, allowing services to shed when they overload their allocated storage capacity.</li></ol><p id="ace0" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">These utilization measures provide early warning signs that a service is generating too much load on a backend, and allow it to shed low-priority work before it overwhelms that backend. The main advantage of these techniques over concurrency limits alone is that they require less tuning, since our services must already maintain tight latency service-level objectives (SLOs), for example p50 &lt; 10ms and p100 &lt; 500ms. Rephrasing these existing SLOs as utilizations lets us shed low-priority work early to prevent further latency impact on high-priority work.
At the same time, the system <em class="rx">will accept as much work as it can</em> while maintaining SLOs.</p><p id="3a01" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">To create these utilization measures, we count how many requests are processed <em class="rx">slower</em> than our target and maximum latency objectives, and emit the percentage of requests failing to meet those latency goals. For example, our KeyValue storage service offers a 10ms target with a 500ms max latency for each namespace, and all clients receive utilization measures per data namespace to feed into their prioritized load shedding. These measures look like:</p><pre class="qb qc qd qe qf rl rm rn bo ro ba bj">utilization(namespace) = {<br />  overall = 12<br />  latency = {<br />    slo_target = 12,<br />    slo_max = 0<br />  }<br />  system = {<br />    storage = 17,<br />    compute = 10,<br />  }<br />}</pre><p id="d8cd" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">In this case, 12% of requests are slower than the 10ms target, 0% are slower than the 500ms max latency (timeout), and 17% of allocated storage is utilized. Different use cases consult different utilizations in their prioritized shedding; for example, daily batch writes may be shed when system storage utilization approaches capacity, since writing more data would create further instability.</p><p id="6c98" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">An example where the latency utilization is useful is one of our critical file origin services, which accepts writes of new files in the AWS cloud and acts as an origin (serves reads) for those files to our Open Connect CDN infrastructure.
Writes are the most critical and should never be shed by the service, but when the backing datastore is getting overloaded, it is reasonable to progressively shed reads to files which are less critical to the CDN as it can retry those reads and they do not affect the product experience.</p><p id="6791" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">To achieve this goal, the origin service configured a KeyValue latency based limiter that starts shedding reads to files which are less critical to the CDN when the datastore reports a target latency utilization exceeding 40%. We then stress tested the system by generating over 50Gbps of read traffic, some of it to high priority files and some of it to low priority files:</p></div></div><div class="ps"><div class="ab ca"><div class="mo pt mp pu mq pv ce pw cf px ch bg"><figure class="qb qc qd qe qf ps qg qh paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz rz"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*HI2zGO_MOxD-X1cG%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*HI2zGO_MOxD-X1cG%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*HI2zGO_MOxD-X1cG%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*HI2zGO_MOxD-X1cG%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*HI2zGO_MOxD-X1cG%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*HI2zGO_MOxD-X1cG%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*HI2zGO_MOxD-X1cG%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, 
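To make the shedding rule concrete, here is a minimal sketch of a priority-aware limiter driven by the datastore's reported target-latency utilization. This is an illustrative reconstruction under stated assumptions, not Netflix's actual implementation: the class name, the priority levels, and the linear ramp above the 40% threshold are invented for the sketch.

```python
import random
from enum import IntEnum

class Priority(IntEnum):
    LOW = 0       # e.g. retryable CDN cache-fill reads
    HIGH = 1      # reads that affect the product experience
    CRITICAL = 2  # writes: never shed

class LatencyUtilizationLimiter:
    """Sheds LOW-priority requests with increasing probability as the
    backend's target-latency SLO utilization (the percentage of requests
    missing the target latency) rises past shed_start."""

    def __init__(self, shed_start=40.0, shed_all=80.0):
        self.shed_start = shed_start  # utilization % where shedding begins
        self.shed_all = shed_all      # utilization % where all LOW work is shed

    def allow(self, priority, target_latency_utilization):
        if priority >= Priority.HIGH:
            return True  # high-priority reads and critical writes always pass
        if target_latency_utilization <= self.shed_start:
            return True  # backend is healthy: accept all work
        # Rejection probability ramps linearly from 0 at shed_start to 1 at shed_all.
        reject_p = min(1.0, (target_latency_utilization - self.shed_start)
                            / (self.shed_all - self.shed_start))
        return random.random() >= reject_p
```

Note the progressive ramp: just past the 40% threshold almost nothing is rejected, and the rejection probability for low-priority reads grows only as more requests miss their target latency, which helps avoid the congestive-failure anti-pattern described earlier.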
(min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*HI2zGO_MOxD-X1cG 640w, https://miro.medium.com/v2/resize:fit:720/0*HI2zGO_MOxD-X1cG 720w, https://miro.medium.com/v2/resize:fit:750/0*HI2zGO_MOxD-X1cG 750w, https://miro.medium.com/v2/resize:fit:786/0*HI2zGO_MOxD-X1cG 786w, https://miro.medium.com/v2/resize:fit:828/0*HI2zGO_MOxD-X1cG 828w, https://miro.medium.com/v2/resize:fit:1100/0*HI2zGO_MOxD-X1cG 1100w, https://miro.medium.com/v2/resize:fit:2000/0*HI2zGO_MOxD-X1cG 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div></figure><figure class="mk ps qg qh paragraph-image"><div role="button" tabindex="0" class="qi qj fi qk bg ql"><div class="py pz sc"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*AZnhEhtrsp9MEJFA%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*AZnhEhtrsp9MEJFA%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*AZnhEhtrsp9MEJFA%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*AZnhEhtrsp9MEJFA%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*AZnhEhtrsp9MEJFA%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*AZnhEhtrsp9MEJFA%201100w,%20https://miro.medium.com/v2/resize:fit:2000/format:webp/0*AZnhEhtrsp9MEJFA%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and 
(max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*AZnhEhtrsp9MEJFA 640w, https://miro.medium.com/v2/resize:fit:720/0*AZnhEhtrsp9MEJFA 720w, https://miro.medium.com/v2/resize:fit:750/0*AZnhEhtrsp9MEJFA 750w, https://miro.medium.com/v2/resize:fit:786/0*AZnhEhtrsp9MEJFA 786w, https://miro.medium.com/v2/resize:fit:828/0*AZnhEhtrsp9MEJFA 828w, https://miro.medium.com/v2/resize:fit:1100/0*AZnhEhtrsp9MEJFA 1100w, https://miro.medium.com/v2/resize:fit:2000/0*AZnhEhtrsp9MEJFA 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fy fz ga gb"><p id="a68b" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">In this test, there are a nominal number of critical writes and a high number of reads to both low and high priority files. In the top-left graph we ramp to 2000 read/second of ~4MiB files until we can trigger overload of the backend store at over 50Gbps in the top-center graph. 
When that happens, the top-right graph shows that even under significant load, the origin <em class="rx">only</em> sheds low-priority read work to preserve high-priority writes and reads. Before this change, when we hit breaking points, critical writes <em class="rx">and</em> reads would fail along with low-priority reads. During this test, the CPU load of the file-serving service was nominal (&lt;10%), so in this case only IO-based limiters are able to protect the system. It is also important to note that the origin will serve more traffic as long as the backing datastore keeps accepting it with low latency. This prevents the problems we had with concurrency limits in the past, which would either shed too early, when nothing was actually wrong, or too late, after we had already entered congestive failure.</p><h1 id="eb96" class="oj ok gt be ol om on ht oo op oq hw or os ot ou ov ow ox oy oz pa pb pc pd pe bj">Conclusion and Future Directions</h1><p id="e80b" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">The implementation of service-level prioritized load shedding has proven to be a significant step forward in maintaining high availability and an excellent user experience for Netflix customers, even during unexpected system stress.</p><p id="8513" class="pw-post-body-paragraph nm nn gt no b hr np nq nr hu ns nt nu nv nw nx ny nz oa ob oc od oe of og oh gm bj">Stay tuned for more updates as we innovate to keep your favorite shows streaming smoothly, no matter what SLO busters lie in wait.</p><h1 id="ebf8" class="oj ok gt be ol om on ht oo op oq hw or os ot ou ov ow ox oy oz pa pb pc pd pe bj">Acknowledgements</h1><p id="48fd" class="pw-post-body-paragraph nm nn gt no b hr pf nq nr hu pg nt nu nv ph nx ny nz pi ob oc od pj of og oh gm bj">We would like to acknowledge the many members of the Netflix consumer product, platform, and open connect teams who have designed, implemented, and tested these prioritization
techniques. In particular: <a class="af oi" href="https://www.linkedin.com/in/xiaomei-liu-b475711" rel="noopener ugc nofollow" target="_blank">Xiaomei Liu</a>, <a class="af oi" href="https://www.linkedin.com/in/rummadis" rel="noopener ugc nofollow" target="_blank">Raj Ummadisetty</a>, <a class="af oi" href="https://www.linkedin.com/in/shyam-gala-5891224/" rel="noopener ugc nofollow" target="_blank">Shyam Gala</a>, <a class="af oi" href="https://www.linkedin.com/in/justin-guerra-3282262b" rel="noopener ugc nofollow" target="_blank">Justin Guerra</a>, <a class="af oi" href="https://www.linkedin.com/in/william-schor" rel="noopener ugc nofollow" target="_blank">William Schor</a>, <a class="af oi" href="https://www.linkedin.com/in/tonyghita" rel="noopener ugc nofollow" target="_blank">Tony Ghita</a> et al.</p></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/enhancing-netflix-reliability-with-service-level-prioritized-load-shedding-e735e6ce8f7d</link>
      <guid>https://netflixtechblog.com/enhancing-netflix-reliability-with-service-level-prioritized-load-shedding-e735e6ce8f7d</guid>
      <pubDate>Wed, 26 Jun 2024 00:58:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[A Recap of the Data Engineering Open Forum at Netflix]]></title>
      <description><![CDATA[<div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><div><div><h2 id="96bb" class="pw-subtitle-paragraph hq gs gt be b hr hs ht hu hv hw hx hy hz ia ib ic id ie if cp dt">A summary of sessions at the first Data Engineering Open Forum at Netflix on April 18th, 2024</h2><div class="ig ih ii ij ik"></div><figure class="np nq nr ns nt nu nm nn paragraph-image"><div role="button" tabindex="0" class="nv nw fi nx bg ny"><div class="nm nn no"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*k1mwTj0BpJuP0TDi%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*k1mwTj0BpJuP0TDi%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*k1mwTj0BpJuP0TDi%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*k1mwTj0BpJuP0TDi%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*k1mwTj0BpJuP0TDi%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*k1mwTj0BpJuP0TDi%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*k1mwTj0BpJuP0TDi%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*k1mwTj0BpJuP0TDi 640w, https://miro.medium.com/v2/resize:fit:720/0*k1mwTj0BpJuP0TDi 720w, https://miro.medium.com/v2/resize:fit:750/0*k1mwTj0BpJuP0TDi 750w, https://miro.medium.com/v2/resize:fit:786/0*k1mwTj0BpJuP0TDi 786w, https://miro.medium.com/v2/resize:fit:828/0*k1mwTj0BpJuP0TDi 828w, 
https://miro.medium.com/v2/resize:fit:1100/0*k1mwTj0BpJuP0TDi 1100w, https://miro.medium.com/v2/resize:fit:1400/0*k1mwTj0BpJuP0TDi 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="oa fe ob nm nn oc od be b bf z dt">The Data Engineering Open Forum at Netflix on April 18th, 2024.</figcaption></figure><p id="b8c3" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj">At Netflix, we aspire to entertain the world, and our data engineering teams play a crucial role in this mission by enabling data-driven decision-making at scale. Netflix is not the only place where data engineers are solving challenging problems with creative solutions. On April 18th, 2024, we hosted the inaugural Data Engineering Open Forum at our Los Gatos office, bringing together data engineers from various industries to share, learn, and connect.</p><p id="deb9" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj">At the conference, our speakers shared their unique perspectives on modern developments, immediate challenges, and future prospects of data engineering.
We are excited to share the recordings of talks from the conference with the rest of the world.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><h2 id="5081" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Opening Remarks</h2><p id="51b3" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/9NnnYHuH8GQ?si=nYBpQPhGwxX-oo1l" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="887f" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speaker</strong>: <a class="af qg" href="https://www.linkedin.com/in/max-schmeiser/" rel="noopener ugc nofollow" target="_blank">Max Schmeiser</a> (Vice President of Studio and Content Data Science &amp; Engineering)</p><p id="748d" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: Max Schmeiser extends a warm welcome to all attendees, marking the beginning of our inaugural Data Engineering Open Forum.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><h2 id="7d84" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data Platform</h2><p id="50fe" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/0j6b9V9tmKA?si=fMEuLmrIK5ATi52d" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="f8ca" 
class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speakers:</strong></p><ul class=""><li id="9707" class="oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz qh qi qj bj"><a class="af qg" href="https://www.linkedin.com/in/stephanievezich/" rel="noopener ugc nofollow" target="_blank">Stephanie Vezich Tamayo</a> (Senior Machine Learning Engineer at Netflix)</li><li id="f82c" class="oe of gt og b hr qk oi oj hu ql ol om on qm op oq or qn ot ou ov qo ox oy oz qh qi qj bj"><a class="af qg" href="https://www.linkedin.com/in/binbing-hou/" rel="noopener ugc nofollow" target="_blank">Binbing Hou</a> (Senior Software Engineer at Netflix)</li></ul><p id="0f74" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: At Netflix, hundreds of thousands of workflows and millions of jobs are running every day on our big data platform, but diagnosing and remediating job failures can impose considerable operational burdens. To handle errors efficiently, Netflix developed a rule-based classifier for error classification called “Pensive.” However, as the system has increased in scale and complexity, Pensive has been facing challenges due to its limited support for operational automation, especially for handling memory configuration errors and unclassified errors. 
To address these challenges, we have developed a new feature called “Auto Remediation,” which integrates the rule-based classifier with an ML service.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><h2 id="e4ce" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Automating the Data Architect: Generative AI for Enterprise Data Modeling</h2><p id="6c9e" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/DtzIIVJq8wA?si=i5fLXA7G8IMyiF0u" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="0f8f" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speaker</strong>: <a class="af qg" href="https://www.linkedin.com/in/jide-o-87602512/" rel="noopener ugc nofollow" target="_blank">Jide Ogunjobi</a> (Founder &amp; CTO at Context Data)</p><p id="3294" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: As organizations accumulate ever-larger stores of data across disparate systems, efficiently querying and gaining insights from enterprise data remain ongoing challenges. To address this, we propose developing an intelligent agent that can automatically discover, map, and query all data within an enterprise.
This “Enterprise Data Model/Architect Agent” employs generative AI techniques for autonomous enterprise data modeling and architecture.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><figure class="np nq nr ns nt nu nm nn paragraph-image"><div role="button" tabindex="0" class="nv nw fi nx bg ny"><div class="nm nn qp"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*1e3shrfbfV2J6S4-%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*1e3shrfbfV2J6S4-%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*1e3shrfbfV2J6S4-%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*1e3shrfbfV2J6S4-%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*1e3shrfbfV2J6S4-%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*1e3shrfbfV2J6S4-%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*1e3shrfbfV2J6S4-%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*1e3shrfbfV2J6S4- 640w, https://miro.medium.com/v2/resize:fit:720/0*1e3shrfbfV2J6S4- 720w, https://miro.medium.com/v2/resize:fit:750/0*1e3shrfbfV2J6S4- 750w, https://miro.medium.com/v2/resize:fit:786/0*1e3shrfbfV2J6S4- 786w, https://miro.medium.com/v2/resize:fit:828/0*1e3shrfbfV2J6S4- 828w, https://miro.medium.com/v2/resize:fit:1100/0*1e3shrfbfV2J6S4- 1100w, 
https://miro.medium.com/v2/resize:fit:1400/0*1e3shrfbfV2J6S4- 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="oa fe ob nm nn oc od be b bf z dt">Tulika Bhatt, Senior Data Engineer at Netflix, shared how her team manages impression data at scale.</figcaption></figure><h2 id="2127" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Real-Time Delivery of Impressions at Scale</h2><p id="6c5d" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/ARTHgxoJmCE?si=MDx1Qa8W7nNxkA_m" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="e10f" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speaker:</strong> <a class="af qg" href="https://www.linkedin.com/in/tulikabhatt/" rel="noopener ugc nofollow" target="_blank">Tulika Bhatt</a> (Senior Data Engineer at Netflix)</p><p id="0938" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: Netflix generates approximately 18 billion impressions daily. 
These impressions significantly influence a viewer’s browsing experience, as they are essential for powering video ranker algorithms and computing adaptive pages. With the evolution of user interfaces toward greater responsiveness to in-session interactions, coupled with the growing demand for real-time adaptive recommendations, it is imperative that these impressions be delivered in near real time. This talk will delve into the creative solutions Netflix deploys to manage this high-volume, real-time data requirement while balancing scalability and cost.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><h2 id="897c" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Reflections on Building a Data Platform From the Ground Up in a Post-GDPR World</h2><p id="c9a2" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/WdSsneeI6RE?si=-gpe4DprVQKoZt_V" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="95c5" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speaker</strong>: <a class="af qg" href="https://www.linkedin.com/in/jessmlarson/" rel="noopener ugc nofollow" target="_blank">Jessica Larson</a> (Data Engineer &amp; Author of “Snowflake Access Control”)</p><p id="0fc3" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: The requirements for creating a new data warehouse in the post-GDPR world are significantly different from those of the pre-GDPR world, such as the need to prioritize sensitive data protection and regulatory compliance over performance and cost.
In this talk, Jessica Larson shares her takeaways from building a new data platform post-GDPR.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><h2 id="1f14" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Unbundling the Data Warehouse: The Case for Independent Storage</h2><p id="bb16" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/CmEIJ-lagVU?si=Z4VcYL_FBV4bIGJW" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="f92d" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speaker</strong>: <a class="af qg" href="https://www.linkedin.com/in/jasonreid/" rel="noopener ugc nofollow" target="_blank">Jason Reid</a> (Co-founder &amp; Head of Product at Tabular)</p><p id="f7eb" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: Unbundling a data warehouse means splitting it into constituent and modular components that interact via open standard interfaces. 
In this talk, Jason Reid discusses the pros and cons of both data warehouse bundling and unbundling in terms of performance, governance, and flexibility, and he examines how the trend of data warehouse unbundling will impact the data engineering landscape in the next 5 years.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><figure class="np nq nr ns nt nu nm nn paragraph-image"><div role="button" tabindex="0" class="nv nw fi nx bg ny"><div class="nm nn qp"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*ZPAsa_6qxkh4BJDY%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*ZPAsa_6qxkh4BJDY%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*ZPAsa_6qxkh4BJDY%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*ZPAsa_6qxkh4BJDY%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*ZPAsa_6qxkh4BJDY%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*ZPAsa_6qxkh4BJDY%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*ZPAsa_6qxkh4BJDY%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*ZPAsa_6qxkh4BJDY 640w, https://miro.medium.com/v2/resize:fit:720/0*ZPAsa_6qxkh4BJDY 720w, https://miro.medium.com/v2/resize:fit:750/0*ZPAsa_6qxkh4BJDY 750w, https://miro.medium.com/v2/resize:fit:786/0*ZPAsa_6qxkh4BJDY 786w, 
https://miro.medium.com/v2/resize:fit:828/0*ZPAsa_6qxkh4BJDY 828w, https://miro.medium.com/v2/resize:fit:1100/0*ZPAsa_6qxkh4BJDY 1100w, https://miro.medium.com/v2/resize:fit:1400/0*ZPAsa_6qxkh4BJDY 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="oa fe ob nm nn oc od be b bf z dt">Clark Wright, Staff Analytics Engineer at Airbnb, talked about the concept of Data Quality Score at Airbnb.</figcaption></figure><h2 id="b6c5" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Data Quality Score: How We Evolved the Data Quality Strategy at Airbnb</h2><p id="434b" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/Lv-bFDSzrqw?si=SBdnoFcOHjqe34Ve" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="e688" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speaker</strong>: <a class="af qg" href="https://www.linkedin.com/in/clark-wright/" rel="noopener ugc nofollow" target="_blank">Clark Wright</a> (Staff Analytics Engineer at Airbnb)</p><p id="fed1" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: Recently, Airbnb published a post to their Tech Blog called <a class="af qg" 
href="https://medium.com/airbnb-engineering/data-quality-score-the-next-chapter-of-data-quality-at-airbnb-851dccda19c3" rel="noopener">Data Quality Score: The next chapter of data quality at Airbnb</a>. In this talk, Clark Wright shares the narrative of how data practitioners at Airbnb recognized the need for higher-quality data and then proposed, conceptualized, and launched Airbnb’s first Data Quality Score.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><h2 id="3896" class="pi pj gt be pk pl pm dx pn po pp dz pq on pr ps pt or pu pv pw ov px py pz qa bj">Data Productivity at Scale</h2><p id="d942" class="pw-post-body-paragraph oe of gt og b hr qb oi oj hu qc ol om on qd op oq or qe ot ou ov qf ox oy oz gm bj"><a class="af qg" href="https://youtu.be/KP5ml1tOfbY?si=hmyBjQRx422zUg-k" rel="noopener ugc nofollow" target="_blank"><strong class="og gu">Recording</strong></a></p><p id="6e1b" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Speaker</strong>: <a class="af qg" href="https://www.linkedin.com/in/izeigerman/" rel="noopener ugc nofollow" target="_blank">Iaroslav Zeigerman</a> (Co-Founder and Chief Architect at Tobiko Data)</p><p id="d2d0" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj"><strong class="og gu">Summary</strong>: The development and evolution of data pipelines are hindered by outdated tooling compared to software development. Creating new development environments is cumbersome: Populating them with data is compute-intensive, and the deployment process is error-prone, leading to higher costs, slower iteration, and unreliable data. 
SQLMesh, an open-source project born from our collective experience at companies like Airbnb, Apple, Google, and Netflix, is designed to handle the complexities of evolving data pipelines at an internet scale. In this talk, Iaroslav Zeigerman discusses challenges faced by data practitioners today and how core SQLMesh concepts solve them.</p></div></div></div><div class="ab ca pa pb pc pd" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><p id="3594" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj">Last but not least, thank you to the organizers of the Data Engineering Open Forum: <a class="af qg" href="https://www.linkedin.com/in/chris-colburn/" rel="noopener ugc nofollow" target="_blank">Chris Colburn</a>, <a class="af qg" href="https://www.linkedin.com/in/xinranwaibel/" rel="noopener ugc nofollow" target="_blank">Xinran Waibel</a>, <a class="af qg" href="https://www.linkedin.com/in/jaibalani/" rel="noopener ugc nofollow" target="_blank">Jai Balani</a>, <a class="af qg" href="https://www.linkedin.com/in/rashmi-shamprasad-51630b19/" rel="noopener ugc nofollow" target="_blank">Rashmi Shamprasad</a>, and <a class="af qg" href="https://www.linkedin.com/in/patriciapho/" rel="noopener ugc nofollow" target="_blank">Patricia Ho</a>.</p><p id="3a37" class="pw-post-body-paragraph oe of gt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj">Until next time!</p><blockquote class="qq qr qs"><p id="5067" class="oe of qt og b hr oh oi oj hu ok ol om on oo op oq or os ot ou ov ow ox oy oz gm bj">If you are interested in attending a future Data Engineering Open Forum, we highly recommend you join our <a class="af qg" href="https://groups.google.com/g/data-engineering-open-forum" rel="noopener ugc nofollow" target="_blank">Google Group</a> to stay tuned to event 
announcements.</p></blockquote></div></div></div></div></div></div></div></div></div></div></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/a-recap-of-the-data-engineering-open-forum-at-netflix-6b4d4410b88f</link>
      <guid>https://netflixtechblog.com/a-recap-of-the-data-engineering-open-forum-at-netflix-6b4d4410b88f</guid>
      <pubDate>Thu, 20 Jun 2024 17:01:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Video annotator: building video classifiers using vision-language models and active learning]]></title>
      <description><![CDATA[<div><div class="hu hv hw hx hy"></div><p id="f4a6" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><a class="af ny" href="https://www.linkedin.com/in/amirziai/" rel="noopener ugc nofollow" target="_blank">Amir Ziai</a>, <a class="af ny" href="https://www.linkedin.com/in/aneeshvartakavi" rel="noopener ugc nofollow" target="_blank">Aneesh Vartakavi</a>, <a class="af ny" href="https://www.linkedin.com/in/kelli-griggs-32990125/" rel="noopener ugc nofollow" target="_blank">Kelli Griggs</a>, <a class="af ny" href="https://www.linkedin.com/in/eugene-lok-6465045b" rel="noopener ugc nofollow" target="_blank">Eugene Lok</a>, <a class="af ny" href="https://www.linkedin.com/in/yvonne-jukes-814ba04" rel="noopener ugc nofollow" target="_blank">Yvonne Jukes</a>, <a class="af ny" href="https://www.linkedin.com/in/alejandro-alonso-ba733548" rel="noopener ugc nofollow" target="_blank">Alex Alonso</a>, <a class="af ny" href="https://www.linkedin.com/in/vi-pallavika-iyengar-144abb1b/" rel="noopener ugc nofollow" target="_blank">Vi Iyengar</a>, <a class="af ny" href="https://www.linkedin.com/in/anna-pulido-61025063" rel="noopener ugc nofollow" target="_blank">Anna Pulido</a></p><figure class="nz oa ob oc od oe"><div class="of jj l fi"></div></figure><h1 id="791b" class="oi oj gt be ok ol om on oo op oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf bj">Introduction</h1><h2 id="da8a" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Problem</h2><p id="4a96" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">High-quality and consistent <strong class="nc gu">annotations</strong> are fundamental to the successful development of robust machine learning models. Conventional techniques for training machine learning classifiers are <strong class="nc gu">resource intensive</strong>. 
They involve a cycle where domain experts annotate a dataset, which is then transferred to data scientists to train models, review outcomes, and make changes. This labeling process tends to be time-consuming and inefficient, sometimes halting after a few annotation cycles.</p><h2 id="e830" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Implications</h2><p id="16a0" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">Consequently, less effort is invested in annotating high-quality datasets compared to iterating on complex models and algorithmic methods to improve performance and fix edge cases. As a result, ML systems grow rapidly in complexity.</p><p id="15c4" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Furthermore, constraints on time and resources often result in leveraging third-party annotators rather than <strong class="nc gu">domain experts</strong>. These annotators perform the labeling task without a deep <strong class="nc gu">understanding</strong> of the model’s intended deployment or usage, often making consistent labeling of borderline or <strong class="nc gu">hard examples</strong>, especially in more subjective tasks, a challenge.</p><p id="10cf" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">This necessitates multiple review rounds with domain experts, leading to unexpected costs and delays. 
This lengthy cycle can also result in model <strong class="nc gu">drift</strong>, as it takes longer to fix edge cases and deploy new models, potentially hurting usefulness and stakeholder trust.</p><h2 id="5a33" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Solution</h2><p id="3b78" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">We suggest that more direct involvement of domain experts, using a <strong class="nc gu">human-in-the-loop</strong> system, can resolve many of these practical challenges. We introduce a novel framework, <strong class="nc gu">Video Annotator</strong> (VA), which leverages <strong class="nc gu">active learning</strong> techniques and <strong class="nc gu">zero-shot</strong> capabilities of large <strong class="nc gu">vision-language</strong> models to guide users to focus their efforts on progressively harder examples, enhancing the model’s sample efficiency and keeping costs low.</p><p id="11e6" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">VA seamlessly integrates model building into the data annotation process, facilitating user validation of the model before deployment, therefore helping with building <strong class="nc gu">trust</strong> and fostering a sense of <strong class="nc gu">ownership</strong>. 
VA also supports a <strong class="nc gu">continuous</strong> annotation process, allowing users to rapidly deploy models, monitor their quality in production, and swiftly fix any edge cases by annotating a few more examples and deploying a new model version.</p><p id="7386" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">This self-service architecture empowers users to make improvements without active involvement of data scientists or third-party annotators, allowing for fast iteration.</p><h1 id="c608" class="oi oj gt be ok ol om on oo op oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf bj">Video understanding</h1><p id="79ab" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">We design VA to assist in <a class="af ny" href="https://arxiv.org/abs/1904.11451" rel="noopener ugc nofollow" target="_blank">granular video understanding</a> which requires the identification of visuals, concepts, and events within video segments. Video understanding is fundamental for numerous applications such as <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/ava-discovery-view-surfacing-authentic-moments-b8cd145491cc">search and discovery</a>, <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">personalization</a>, and the <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/discovering-creative-insights-in-promotional-artwork-295e4d788db5">creation of promotional assets</a>. 
Our framework allows users to efficiently train machine learning models for video understanding by developing an extensible set of binary video classifiers, which power scalable scoring and retrieval of a vast catalog of content.</p><h2 id="b26b" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Video classification</h2><p id="ccea" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">Video classification is the task of assigning a label to an arbitrary-length video clip, often accompanied by a probability or prediction score, as illustrated in Fig 1.</p><figure class="nz oa ob oc od oe qa qb paragraph-image"><div role="button" tabindex="0" class="qd qe fi qf bg qg"><div class="qa qb qc"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*jXxtYzxoYFThJRs7TgEmqw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*jXxtYzxoYFThJRs7TgEmqw.png" /></picture></div></div><figcaption class="qi fe qj qa qb qk ql be b bf z dt">Fig 1- Functional view of a binary video classifier. A few-second clip from <a class="af ny" href="https://www.netflix.com/title/81130691" rel="noopener ugc nofollow" target="_blank">”Operation Varsity Blues: The College Admissions Scandal”</a> is passed to a binary classifier for detecting the ”establishing shots” label. The classifier outputs a very high score (score is between 0 and 1), indicating that the video clip is very likely an establishing shot. In filmmaking, an establishing shot is a wide shot (i.e. 
video clip between two consecutive cuts) of a building or a landscape that is intended for establishing the time and location of the scene.</figcaption></figure><h2 id="35af" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Video understanding via an extensible set of video classifiers</h2><p id="57e5" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">Binary classification allows for independence and flexibility, allowing us to add or improve one model independent of the others. It also has the additional benefit of being easier to understand and build for our users. Combining the predictions of multiple models allows us a deeper understanding of the video content at various levels of granularity, illustrated in Fig 2.</p><figure class="nz oa ob oc od oe qa qb paragraph-image"><div class="qa qb qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*hqNZjvWBlREV5GIv4uEmHw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*hqNZjvWBlREV5GIv4uEmHw.png" /></picture></div><figcaption class="qi fe qj qa qb qk ql be b bf z dt">Fig 2- Three video clips and the corresponding binary classifier scores for three video understanding labels. Note that these labels are not mutually exclusive. 
Video clips are from <a class="af ny" href="https://www.netflix.com/title/81130691" rel="noopener ugc nofollow" target="_blank">Operation Varsity Blues: The College Admissions Scandal</a>, <a class="af ny" href="https://www.netflix.com/title/81001887" rel="noopener ugc nofollow" target="_blank">6 Underground</a>, and <a class="af ny" href="https://www.netflix.com/title/81314956" rel="noopener ugc nofollow" target="_blank">Leave The World Behind</a>, respectively.</figcaption></figure><h1 id="9f00" class="oi oj gt be ok ol om on oo op oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf bj">Video Annotator (VA)</h1><p id="bd14" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">In this section, we describe VA’s three-step process for building video classifiers.</p><h2 id="d9a6" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Step 1 — search</h2><p id="a5a4" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">Users begin by finding an initial set of examples within a large, diverse corpus to bootstrap the annotation process. We leverage text-to-video search to enable this, powered by video and text encoders from a Vision-Language Model to extract embeddings. 
For example, an annotator working on the <a class="af ny" href="https://en.wikipedia.org/wiki/Establishing_shot" rel="noopener ugc nofollow" target="_blank">establishing shots</a> model may start the process by searching for “wide shots of buildings”, illustrated in Fig 3.</p><figure class="nz oa ob oc od oe qa qb paragraph-image"><div class="qa qb qn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*NxGRRz_J5AlNDi6RZOCdMQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*NxGRRz_J5AlNDi6RZOCdMQ.png" /></picture></div><figcaption class="qi fe qj qa qb qk ql be b bf z dt">Fig 3- Step 1 — Text-to-video search to bootstrap the annotation process.</figcaption></figure><h2 id="48cc" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Step 2 — active learning</h2><p id="e8dd" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">The next stage involves a classic Active Learning loop. VA then builds a lightweight binary classifier over the video embeddings, which is subsequently used to score all clips in the corpus, and presents some examples within feeds for further annotation and refinement, as illustrated in Fig 4.</p><figure class="nz oa ob oc od oe qa qb paragraph-image"><div class="qa qb qo"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*pdm_-ClSOrNOnUUviKZlQA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*pdm_-ClSOrNOnUUviKZlQA.png" /></picture></div><figcaption class="qi fe qj qa qb qk ql be b bf z dt">Fig 4- Step 2 — Active Learning loop. The annotator clicks on build, which initiates classifier training and scoring of all clips in a video corpus. 
Scored clips are organized in four feeds.</figcaption></figure><p id="1a04" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">The top-scoring positive and negative feeds display examples with the highest and lowest scores, respectively. Our users reported that this provided a valuable early indication of whether the classifier had picked up the correct concepts, and helped them spot cases of bias in the training data that they were able to subsequently fix. We also include a feed of “borderline” examples that the model is not confident about. This feed helps with discovering interesting edge cases and inspires the need for labeling additional concepts. Finally, the random feed consists of randomly selected clips and helps to annotate diverse examples, which is important for generalization.</p><p id="ff09" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">The annotator can label additional clips in any of the feeds, build a new classifier, and repeat as many times as desired.</p><h2 id="20b1" class="pg oj gt be ok ph pi dx oo pj pk dz os nl pl pm pn np po pp pq nt pr ps pt pu bj">Step 3 — review</h2><p id="c13f" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">The last step simply presents the user with all annotated clips. It’s a good opportunity to spot annotation mistakes and to identify ideas and concepts for further annotation via search in step 1.
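As a concrete illustration, the search-and-iterate loop of steps 1 and 2 can be sketched in plain Python. Everything below is invented for the example: the toy random "embeddings", the query vector, the labeling proxy, and the choice of a from-scratch logistic regression as the lightweight classifier; the real system uses Vision-Language Model encoders and human annotations.

```python
import math
import random

random.seed(7)
DIM, N_CLIPS, FEED_SIZE = 8, 200, 5

# Hypothetical stand-ins for VLM clip embeddings (real ones come from a video encoder).
corpus = {f"clip_{i}": [random.gauss(0.0, 1.0) for _ in range(DIM)] for i in range(N_CLIPS)}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def search(query_emb, k):
    """Step 1: bootstrap annotation via text-to-video search (nearest embeddings)."""
    return sorted(corpus, key=lambda c: cosine(query_emb, corpus[c]), reverse=True)[:k]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train(labels, epochs=200, lr=0.1):
    """Lightweight binary classifier over the embeddings (logistic regression, SGD)."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for clip, y in labels.items():
            x = corpus[clip]
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y  # dLoss/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def score(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def feeds(w, b):
    """Step 2: score every clip in the corpus and organize the four feeds."""
    ranked = sorted(corpus, key=lambda c: score(w, b, corpus[c]))
    return {
        "top_positive": ranked[-FEED_SIZE:][::-1],
        "top_negative": ranked[:FEED_SIZE],
        "borderline": sorted(corpus, key=lambda c: abs(score(w, b, corpus[c]) - 0.5))[:FEED_SIZE],
        "random": random.sample(list(corpus), FEED_SIZE),
    }

# Pretend text-encoder output for a query such as "wide shots of buildings".
query = [1.0] + [0.0] * (DIM - 1)
# Toy seed labels: search hits marked positive, plus random clips labeled by a
# crude proxy (first embedding dimension high) standing in for the annotator.
labels = {c: 1 for c in search(query, k=15)}
labels.update({c: int(corpus[c][0] > 0.0) for c in random.sample(list(corpus), 15)})

w, b = train(labels)
f = feeds(w, b)
# The annotator would now label clips from these feeds, rebuild, and repeat.
```

In the real system the "build" button triggers exactly this retrain-and-rescore cycle; because only the small head is retrained while the embeddings stay fixed, each iteration is cheap enough to run interactively.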
From this step, users often go back to step 1 or step 2 to refine their annotations.</p><h1 id="2b38" class="oi oj gt be ok ol om on oo op oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf bj">Experiments</h1><p id="ece8" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">To evaluate VA, we asked three video experts to annotate a diverse set of 56 labels across a video corpus of 500k shots. We compared VA to the performance of a few baseline methods, and observed that VA leads to the creation of higher quality video classifiers. Fig 5 compares VA’s performance to baselines as a function of the number of annotated clips.</p><figure class="nz oa ob oc od oe qa qb paragraph-image"><div class="qa qb qp"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*QoExXpMrIQgkHOTs%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*QoExXpMrIQgkHOTs%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*QoExXpMrIQgkHOTs%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*QoExXpMrIQgkHOTs%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*QoExXpMrIQgkHOTs%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*QoExXpMrIQgkHOTs%201100w,%20https://miro.medium.com/v2/resize:fit:1298/format:webp/0*QoExXpMrIQgkHOTs%201298w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 649px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*QoExXpMrIQgkHOTs 640w, 
https://miro.medium.com/v2/resize:fit:720/0*QoExXpMrIQgkHOTs 720w, https://miro.medium.com/v2/resize:fit:750/0*QoExXpMrIQgkHOTs 750w, https://miro.medium.com/v2/resize:fit:786/0*QoExXpMrIQgkHOTs 786w, https://miro.medium.com/v2/resize:fit:828/0*QoExXpMrIQgkHOTs 828w, https://miro.medium.com/v2/resize:fit:1100/0*QoExXpMrIQgkHOTs 1100w, https://miro.medium.com/v2/resize:fit:1298/0*QoExXpMrIQgkHOTs 1298w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 649px" /></picture></div><figcaption class="qi fe qj qa qb qk ql be b bf z dt">Fig 5- Model quality (i.e. Average Precision) as a function of the number of annotated clips for the “establishing shots” label. We observe that all methods outperform the baseline, and that all methods benefit from additional annotated data, albeit to varying degrees.</figcaption></figure><p id="45e9" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">You can find more details about VA and our experiments in <a class="af ny" href="https://arxiv.org/pdf/2402.06560" rel="noopener ugc nofollow" target="_blank">this paper</a>.</p><h1 id="b996" class="oi oj gt be ok ol om on oo op oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf bj">Conclusion</h1><p id="c7e3" class="pw-post-body-paragraph na nb gt nc b nd pv nf ng nh pw nj nk nl px nn no np py nr ns nt pz nv nw nx gm bj">We presented Video Annotator (VA), an interactive framework that addresses many challenges associated with conventional techniques for training machine learning classifiers. 
VA leverages the zero-shot capabilities of large vision-language models and active learning techniques to enhance sample efficiency and reduce costs. It offers a unique approach to annotating, managing, and iterating on video classification datasets, emphasizing the direct involvement of domain experts in a human-in-the-loop system. By enabling these users to rapidly make informed decisions on hard samples during the annotation process, VA increases the system’s overall efficiency. Moreover, it allows for a continuous annotation process, allowing users to swiftly deploy models, monitor their quality in production, and rapidly fix any edge cases.</p><p id="f292" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">This self-service architecture empowers domain experts to make improvements without the active involvement of data scientists or third-party annotators, and fosters a sense of ownership, thereby building trust in the system.</p><p id="45a5" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">We conducted experiments to study the performance of VA, and found that it yields a median 8.3 point improvement in Average Precision relative to the most competitive baseline across a wide-ranging assortment of video understanding tasks. We <a class="af ny" href="https://github.com/netflix/videoannotator" rel="noopener ugc nofollow" target="_blank">release a dataset</a> with 153k labels across 56 video understanding tasks annotated by three professional video editors using VA, and also release <a class="af ny" href="https://github.com/netflix/videoannotator" rel="noopener ugc nofollow" target="_blank">code</a> to replicate our experiments.</p></div>]]></description>
      <link>https://netflixtechblog.com/video-annotator-building-video-classifiers-using-vision-language-models-and-active-learning-8ebdda0b2db4</link>
      <guid>https://netflixtechblog.com/video-annotator-building-video-classifiers-using-vision-language-models-and-active-learning-8ebdda0b2db4</guid>
      <pubDate>Wed, 19 Jun 2024 17:29:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Round 2: A Survey of Causal Inference Applications at Netflix]]></title>
      <description><![CDATA[<div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><div><div class="hu hv hw hx hy"></div><p id="46f5" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">At Netflix, we want to ensure that every current and future member finds content that thrills them today and excites them to come back for more. Causal inference is an essential part of the value that Data Science and Engineering adds towards this mission. We rely heavily on both <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/decision-making-at-netflix-33065fa06481">experimentation</a> and <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/quasi-experimentation-at-netflix-566b57d2e362">quasi-experimentation</a> to help our teams make the best decisions for growing member joy.</p><p id="b9ea" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Building off of our last successful <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f?gi=f599816a7a8b">Causal Inference and Experimentation Summit</a>, we held another week-long internal conference this year to learn from our stunning colleagues. We brought together speakers from across the business to learn about methodological developments and innovative applications.</p><p id="e1f2" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">We covered a wide range of topics and are excited to share five talks from that conference with you in this post. 
This will give you a behind the scenes look at some of the causal inference research happening at Netflix!</p><h1 id="7cf2" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Metrics Projection for Growth A/B Tests</h1><p id="b991" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj"><a class="af ny" href="https://www.linkedin.com/in/tendulkar" rel="noopener ugc nofollow" target="_blank">Mihir Tendulkar</a>, <a class="af ny" href="https://www.linkedin.com/in/simon-ejdemyr-22b920123" rel="noopener ugc nofollow" target="_blank">Simon Ejdemyr</a>, <a class="af ny" href="https://www.linkedin.com/in/dhevi-rajendran-7b736b29" rel="noopener ugc nofollow" target="_blank">Dhevi Rajendran</a>, <a class="af ny" href="https://www.linkedin.com/in/david-hubbard-557a852" rel="noopener ugc nofollow" target="_blank">David Hubbard</a>, <a class="af ny" href="https://www.linkedin.com/in/arushi-tomar" rel="noopener ugc nofollow" target="_blank">Arushi Tomar</a>, <a class="af ny" href="https://www.linkedin.com/in/steve-beckett-cfa-4384a382" rel="noopener ugc nofollow" target="_blank">Steve Beckett</a>, <a class="af ny" href="https://www.linkedin.com/in/jlantos?original_referer=https%3A%2F%2Fwww.google.com%2F" rel="noopener ugc nofollow" target="_blank">Judit Lantos</a>, <a class="af ny" href="https://www.linkedin.com/in/codychapmanucsd" rel="noopener ugc nofollow" target="_blank">Cody Chapman</a>, <a class="af ny" href="https://www.linkedin.com/in/achenzion" rel="noopener ugc nofollow" target="_blank">Ayal Chen-Zion</a>, <a class="af ny" href="https://www.linkedin.com/in/apoorvalal" rel="noopener ugc nofollow" target="_blank">Apoorva Lal</a>, <a class="af ny" href="https://www.linkedin.com/in/kocaguneli" rel="noopener ugc nofollow" target="_blank">Ekrem Kocaguneli</a>, <a class="af ny" href="https://www.linkedin.com/in/kshimada" rel="noopener ugc nofollow" target="_blank">Kyoko 
Shimada</a></p><p id="a186" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Experimentation is in Netflix’s DNA. When we launch a new product feature, we use — where possible — A/B test results to estimate the annualized incremental impact on the business.</p><p id="c120" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Historically, that estimate has come from our Finance, Strategy, &amp; Analytics (FS&amp;A) partners. For each test cell in an experiment, they manually forecast signups, retention probabilities, and cumulative revenue on a one year horizon, using monthly cohorts. The process can be repetitive and time consuming.</p><p id="e0e9" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">We decided to build out a faster, automated approach that boils down to estimating two pieces of missing data. When we run an A/B test, we might allocate users for one month, and monitor results for only two billing periods. In this simplified example, we have one member cohort, and we have two billing period treatment effects (𝜏.cohort1,period1 and 𝜏.cohort1,period2, which we will shorten to 𝜏.1,1 and 𝜏.1,2, respectively).</p><p id="59bb" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">To measure annualized impact, we need to estimate:</p><ol class=""><li id="38e1" class="na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx pc pd pe bj"><strong class="nc gu">Unobserved billing periods</strong>. 
For the first cohort, we don’t have treatment effects (TEs) for their third through twelfth billing periods (𝜏.1,j , where j = 3…12).</li><li id="48e1" class="na nb gt nc b nd pf nf ng nh pg nj nk nl ph nn no np pi nr ns nt pj nv nw nx pc pd pe bj"><strong class="nc gu">Unobserved sign up cohorts</strong>. We only observed one monthly signup cohort, and there are eleven more cohorts in a year. We need to know both the size of these cohorts, and their TEs (𝜏.i,j, where i = 2…12 and j = 1…12).</li></ol><p id="e4c0" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">For the first piece of missing data, we used a<a class="af ny" href="https://research.netflix.com/publication/evaluating-the-surrogate-index-as-a-decision-making-tool-using-200-a-b-tests" rel="noopener ugc nofollow" target="_blank"> surrogate index approach</a>. We make a standard assumption that the causal path from the treatment to the outcome (in this case, Revenue) goes through the surrogate of retention. We leverage our proprietary<a class="af ny" href="https://arxiv.org/pdf/1905.03818" rel="noopener ugc nofollow" target="_blank"> Retention Model</a> and short-term observations — in the above example, 𝜏.1,2 — to estimate 𝜏.1,j , where j = 3…12.</p><p id="69fa" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">For the second piece of missing data, we assume transportability: that each subsequent cohort’s billing-period TE is the same as the first cohort’s TE. 
Note that if you have long-running A/B tests, this is a testable assumption!</p><figure class="paragraph-image"><img src="https://miro.medium.com/v2/resize:fit:1400/0*e8IMSJf7p60mk-WG" alt="image" /><figcaption><em>Fig. 1: Monthly cohort-based activity as measured in an A/B test. In green, we show the allocation window throughout January, while blue represents the January cohort’s observation window. From this, we can directly observe 𝜏.1 and 𝜏.2, and we can project later 𝜏.j forward using the surrogate-based approach. We can transport values from observed cohorts to unobserved cohorts.</em></figcaption></figure><p id="0abd" class="pw-post-body-paragraph">Now, we can put the pieces together. For the first cohort, we project TEs forward. For unobserved cohorts, we transport the TEs from the first cohort and collapse our notation to remove the cohort index: 𝜏.1,1 is now written as just 𝜏.1. We estimate the annualized impact by summing the values from each cohort.</p><p id="882c" class="pw-post-body-paragraph">We empirically validated our results from this method by comparing them to long-running A/B tests and prior results from our FS&amp;A partners.
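Most of this annualization is bookkeeping, and a toy version fits in a few lines. In the sketch below, the projection step naively carries the last observed treatment effect forward; the real pipeline uses the surrogate-index approach with the Retention Model, and every number here is invented for illustration:

```python
# Toy sketch of the annualization bookkeeping. project_te is a naive
# stand-in for the surrogate-index projection (the real version uses
# Netflix's Retention Model); all inputs are hypothetical.

def project_te(observed, n_periods=12):
    """Carry the last observed billing-period TE forward (toy projection)."""
    projected = list(observed)
    while len(projected) < n_periods:
        projected.append(projected[-1])
    return projected

def annualized_impact(observed_te, cohort_sizes):
    """Sum per-member TEs across monthly cohorts, transporting the first
    cohort's billing-period TEs to later cohorts. Cohort i (0-based)
    only accrues 12 - i billing periods within the year."""
    te = project_te(observed_te)  # tau.1 .. tau.12 for the first cohort
    total = 0.0
    for i, size in enumerate(cohort_sizes):
        total += size * sum(te[: 12 - i])
    return total

# Example: tau.1 = 0.10 and tau.2 = 0.08 revenue per member per period,
# with twelve equal monthly cohorts of 1000 forecast signups each.
impact = annualized_impact([0.10, 0.08], [1000] * 12)
```

The transportability assumption shows up as reusing the same `te` list for every cohort; a long-running test would let you check that directly.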
Now we can provide quicker and more accurate estimates of the longer term value our product features are delivering to members.</p><h1 id="c047" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">A Systematic Framework for Evaluating Game Events</h1><p id="fc61" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj"><a class="af ny" href="https://www.linkedin.com/in/clairewilleck?original_referer=https%3A%2F%2Fwww.google.com%2F" rel="noopener ugc nofollow" target="_blank">Claire Willeck</a>, <a class="af ny" href="https://www.linkedin.com/in/yimeng-tang-49566b207" rel="noopener ugc nofollow" target="_blank">Yimeng Tang</a></p><p id="1360" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">In Netflix Games DSE, we are asked many causal inference questions after an intervention has been implemented. For example, how did a product change impact a game’s performance? Or how did a player acquisition campaign impact a key metric?</p><p id="a49a" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">While we would ideally conduct AB tests to measure the impact of an intervention, it is not always practical to do so. In the first scenario above, A/B tests were not planned before the intervention’s launch, so we needed to use observational causal inference to assess its effectiveness. 
In the second scenario, the campaign is at the country level, meaning everyone in the country is in the treatment group, which makes a traditional A/B test infeasible.</p><p id="c8c6" class="pw-post-body-paragraph">To evaluate the impacts of various game events and updates, and to help our team scale, we designed a framework and package around variations of synthetic control.</p><p id="ea25" class="pw-post-body-paragraph">For most questions in Games, we have game-level or country-level interventions and relatively little data. This means most pre-existing packages that rely on time-series forecasting, unit-level data, or instrumental variables are not useful.</p><p id="f2c8" class="pw-post-body-paragraph">Our framework uses a variety of synthetic control (SC) models, including Augmented SC, Robust SC, Penalized SC, and synthetic difference-in-differences, since different approaches can work best in different cases. We use a scale-free metric to evaluate the performance of each model and select the one that minimizes pre-treatment bias.
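That selection step can be sketched as follows. The candidate model outputs and the choice of mean absolute percentage error as the scale-free metric are illustrative assumptions, not the internals of the actual package:

```python
# Hypothetical sketch: fit each synthetic-control variant on a pre-treatment
# training window, then keep the model with the smallest scale-free error on
# a held-out pre-treatment validation window. MAPE is used here as one
# possible scale-free metric; the fitted series below are made up.

def mape(actual, predicted):
    """Mean absolute percentage error (scale-free across metrics)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def select_sc_model(fits, actual_validation):
    """fits: model name -> predicted treated-unit series on the
    pre-treatment validation window. Returns (best name, its MAPE)."""
    scores = {name: mape(actual_validation, pred) for name, pred in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Treated unit's actual pre-treatment validation window, plus each
# candidate model's prediction for the same window (invented numbers):
actual = [100, 105, 110, 108]
fits = {
    "augmented_sc": [101, 104, 109, 109],
    "robust_sc":    [95, 100, 118, 100],
    "sdid":         [103, 108, 106, 112],
}
best_model, best_err = select_sc_model(fits, actual)
```

Whichever variant wins the validation window is then refit and used for the post-treatment counterfactual, with backdating as a further robustness check.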
Additionally, we conduct robustness tests like backdating and apply inference measures based on the number of control units.</p><figure class="paragraph-image"><img src="https://miro.medium.com/v2/resize:fit:1400/0*5yOWKQ4AUPmwWOfP" alt="image" /><figcaption>Fig. 2: Example of an Augmented Synthetic Control model used to reduce pre-treatment bias by fitting the model in the training period and evaluating performance in the validation period. In this example, the Augmented Synthetic Control model reduced the pre-treatment bias in the validation period more than the other synthetic control variations.</figcaption></figure><p id="4695" class="pw-post-body-paragraph">This framework and package allow our team, and other teams, to tackle a broad set of causal inference questions using a consistent approach.</p><h1 id="0d01">Double Machine Learning for Weighing Metrics Tradeoffs</h1><p id="f5dd" class="pw-post-body-paragraph"><a href="https://www.linkedin.com/in/apoorvalal" rel="noopener ugc nofollow" target="_blank">Apoorva Lal</a>, <a href="https://www.linkedin.com/in/winston-chou-6491b0168" rel="noopener ugc nofollow" target="_blank">Winston Chou</a>, <a href="https://www.linkedin.com/in/jjschafer" rel="noopener ugc nofollow" target="_blank">Jordan Schafer</a></p><p id="463e" class="pw-post-body-paragraph">As Netflix expands into new business verticals, we’re increasingly seeing examples of metric tradeoffs in
A/B tests — for example, an increase in games metrics may occur alongside a decrease in streaming metrics. To help decision-makers navigate scenarios where metrics disagree, we developed a method to compare the relative importance of different metrics (viewed as “treatments”) in terms of their causal effect on the north-star metric (Retention) using Double Machine Learning (DML).</p><p id="1522" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">In our first pass at this problem, we found that ranking treatments according to their Average Treatment Effects using DML with a Partially Linear Model (PLM) could yield an incorrect ranking when treatments have different marginal distributions. The PLM ranking <em class="qe">would</em> be correct if treatment effects were constant and additive. However, when treatment effects are heterogeneous, PLM upweights the effects for members whose treatment values are most unpredictable. This is problematic for comparing treatments with different baselines.</p><p id="a131" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Instead, we discretized each treatment into bins and fit a multiclass propensity score model. This lets us estimate multiple Average Treatment Effects (ATEs) using Augmented Inverse-Propensity-Weighting (AIPW) to reflect different treatment contrasts, for example the effect of low versus high exposure.</p><p id="a972" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">We then weight these treatment effects by the baseline distribution. 
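A stylized sketch of that reweighting step is below. The per-bin AIPW contrasts (the effect on Retention of shifting exposure up one bin, for members starting in each bin) are assumed to be already estimated; that estimation is the hard part and is omitted here, and every name and number is invented:

```python
# Hypothetical final step: score each treatment metric by weighting its
# per-bin AIPW contrast estimates by the baseline exposure distribution,
# so all treatments are compared on the same overall population.

def population_weighted_effect(bin_effects, baseline_shares):
    """Average per-bin AIPW contrasts, weighted by the share of members
    who start in each exposure bin."""
    assert abs(sum(baseline_shares) - 1.0) < 1e-9
    return sum(e * w for e, w in zip(bin_effects, baseline_shares))

# metric -> (AIPW contrast per exposure bin, baseline bin shares)
treatments = {
    "games_minutes":   ([0.004, 0.002, 0.001], [0.7, 0.2, 0.1]),
    "streaming_hours": ([0.003, 0.003, 0.002], [0.2, 0.5, 0.3]),
}

# Rank treatments by their population-weighted effect on Retention.
ranked = sorted(treatments,
                key=lambda m: population_weighted_effect(*treatments[m]),
                reverse=True)
```

Because every metric is averaged over the same baseline population, the resulting ordering is not distorted by how unpredictable each treatment happens to be, which is the failure mode of the PLM ranking.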
This yields an “apples-to-apples” ranking of treatments based on their ATE on the same overall population.</p><figure class="paragraph-image"><img src="https://miro.medium.com/v2/resize:fit:1310/0*M2cujmiqIfahtPH1" alt="image" /><figcaption>Fig. 3: Comparison of PLMs vs. AIPW in estimating treatment effects. Because PLMs do not estimate average treatment effects when effects are heterogeneous, they do not rank metrics by their Average Treatment Effects, whereas AIPW does.</figcaption></figure><p id="c76a" class="pw-post-body-paragraph">In the example above, we see that PLM ranks Treatment 1 above Treatment 2, while AIPW correctly ranks the treatments in order of their ATEs. This is because PLM upweights the Conditional Average Treatment Effect for units that have more unpredictable treatment assignment (in this example, the group defined by x = 1), whereas AIPW targets the ATE.</p><h1 id="1df6">Survey A/B Tests with Heterogeneous Non-Response Bias</h1><p id="3795" class="pw-post-body-paragraph"><a href="https://www.linkedin.com/in/andreasaristidou" rel="noopener ugc nofollow" target="_blank">Andreas Aristidou</a>, <a href="https://www.linkedin.com/in/carolyn-chu-263147a9" rel="noopener ugc nofollow" target="_blank">Carolyn Chu</a></p><p id="050c" class="pw-post-body-paragraph">To improve the quality and reach of Netflix’s survey research, we leverage a research-on-research program that utilizes tools such
as survey A/B tests. Such experiments allow us to directly test and validate new ideas like providing incentives for survey completion, varying the invitation’s subject line, message design, time of day to send, and many other things.</p><p id="1e27" class="pw-post-body-paragraph">In our experimentation program we investigate treatment effects not only on primary success metrics, but also on guardrail metrics. A challenge we face is that, in many of our tests, the intervention (e.g. providing higher incentives) and success metrics (e.g. percent of invited members who begin the survey) are upstream of guardrail metrics such as answers to specific questions designed to measure data quality (e.g. survey straightlining).</p><p id="7992" class="pw-post-body-paragraph">In such a case, the intervention may (and, in fact, we expect it to) distort upstream metrics (especially sample mix), the balance of which is a necessary component for the identification of our downstream guardrail metrics. This is a consequence of non-response bias, a common external validity concern with surveys that impacts how generalizable the results can be.</p><p id="4f7a" class="pw-post-body-paragraph">For example, if one group of members — group X — responds to our survey invitations at a significantly lower rate than another group — group Y — then average treatment effects will be skewed towards the behavior of group Y. Further, in a survey A/B test, the type of non-response bias can differ between control and treatment groups (e.g. different groups of members may be over- or under-represented in different cells of the test), thus threatening the internal validity of our test by introducing a covariate imbalance.
We call this combination heterogeneous non-response bias.</p><p id="dbd9" class="pw-post-body-paragraph">To overcome this identification problem and investigate treatment effects on downstream metrics, we leverage a combination of several techniques. First, we look at conditional average treatment effects (CATE) for particular sub-populations of interest where confounding covariates are balanced within each stratum.</p><p id="2eec" class="pw-post-body-paragraph">To examine the average treatment effects, we leverage a combination of propensity scores to correct for internal validity issues and iterative proportional fitting to correct for external validity issues. With these techniques, we can ensure that our surveys are of the highest quality and that they accurately represent our members’ opinions, thus helping us build products that they want to see.</p><figure class="paragraph-image"><img src="https://miro.medium.com/v2/resize:fit:1400/0*JD_5cvqcKV0yZYMz" alt="image" /></figure><h1 id="595d">Design: The Intersection of Humans and Technology</h1><p id="e8e0" class="pw-post-body-paragraph"><a href="https://www.linkedin.com/in/rinachang" rel="noopener ugc nofollow" target="_blank">Rina Chang</a></p><p id="d243" class="pw-post-body-paragraph">A design talk at a causal inference conference? Why, yes!
Because <strong>design is about how a product works</strong>, it is fundamentally interwoven into the experimentation platform at Netflix. Our product serves a huge variety of internal users at Netflix who run — and consume the results of — A/B tests. Thus, choosing how we enable our users to take action and how we present data in the product is critical to decision-making via experimentation.</p><p id="0f45" class="pw-post-body-paragraph">If you were to display some numbers and text, you might opt to show them in a tabular format.</p><figure class="paragraph-image"><img src="https://miro.medium.com/v2/resize:fit:1400/0*QfNTb6td1ROVJ1PV" alt="image" /></figure><p id="cf68" class="pw-post-body-paragraph">While there is nothing inherently <strong><em>wrong</em></strong> with this presentation, it is not as easily digested as something more visual.</p><p id="6432" class="pw-post-body-paragraph">If your goal is to illustrate that those three numbers add up to 100%, and thus are parts of a whole, then you might choose a pie chart.</p><figure class="paragraph-image"><img src="https://miro.medium.com/v2/resize:fit:1400/0*1NYJPLPl5PnhKV6H" alt="image" /></figure><p id="2366" class="pw-post-body-paragraph">If you wanted to show how these three numbers combine to illustrate progress toward a goal, then you might choose a stacked bar chart.</p><figure class="paragraph-image"><img src="https://miro.medium.com/v2/resize:fit:1400/0*bD3cJNsWjDpUatXX" alt="image" /></figure><p id="e9ed" class="pw-post-body-paragraph">Alternatively, if your goal was to compare these three numbers against each other, then you might choose a bar chart instead.</p><figure class="pn po pp pq pr ps pk pl paragraph-image"><div role="button" tabindex="0" class="pt pu fi pv bg pw"><div class="pk pl qg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*6wmCdjWrfbq65PuS%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*6wmCdjWrfbq65PuS%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*6wmCdjWrfbq65PuS%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*6wmCdjWrfbq65PuS%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*6wmCdjWrfbq65PuS%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*6wmCdjWrfbq65PuS%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*6wmCdjWrfbq65PuS%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and
(max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*6wmCdjWrfbq65PuS 640w, https://miro.medium.com/v2/resize:fit:720/0*6wmCdjWrfbq65PuS 720w, https://miro.medium.com/v2/resize:fit:750/0*6wmCdjWrfbq65PuS 750w, https://miro.medium.com/v2/resize:fit:786/0*6wmCdjWrfbq65PuS 786w, https://miro.medium.com/v2/resize:fit:828/0*6wmCdjWrfbq65PuS 828w, https://miro.medium.com/v2/resize:fit:1100/0*6wmCdjWrfbq65PuS 1100w, https://miro.medium.com/v2/resize:fit:1400/0*6wmCdjWrfbq65PuS 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="681e" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">All of these show the same information, but the choice of presentation changes how <strong class="nc gu"><em class="qe">easily</em></strong> a consumer of an infographic understands the “so what?” of the point you’re trying to convey. Note that there is no “right” solution here; rather, it depends on the desired takeaway.</p><p id="8556" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Thoughtful design applies not only to static representations of data, but also to interactive experiences. 
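As an aside, the three chart choices above are easy to prototype. The matplotlib sketch below (not from the original post; the numbers 60/25/15 are made up for illustration) renders the same three values as a pie chart, a stacked bar, and a plain bar chart:

```python
# Sketch: one dataset, three presentations. Values are invented for illustration.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

labels = ["Completed", "In progress", "Not started"]
values = [60, 25, 15]  # parts of a whole: they sum to 100%

fig, (ax_pie, ax_stack, ax_bar) = plt.subplots(1, 3, figsize=(12, 3))

# Pie chart: emphasizes parts of a whole.
ax_pie.pie(values, labels=labels, autopct="%d%%")

# Stacked horizontal bar: emphasizes progress toward a goal (100%).
left = 0
for label, value in zip(labels, values):
    ax_stack.barh(0, value, left=left, label=label)
    left += value
ax_stack.set_xlim(0, 100)
ax_stack.legend()

# Plain bar chart: emphasizes comparison between the three numbers.
ax_bar.bar(labels, values)

fig.savefig("three_views.png")
```

Which panel you would keep depends, as noted, on the takeaway you want the reader to walk away with.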
In this example, a single item within a long form could be represented by having a pre-filled value.</p><figure class="pn po pp pq pr ps pk pl paragraph-image"><div class="pk pl qh"><picture><img src="https://miro.medium.com/v2/resize:fit:1224/0*j0Urc5pk_zDbqYhs" alt="image" /></picture></div></figure><p id="7281" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Alternatively, the same functionality could be achieved by displaying a default value in text, with the ability to edit it.</p><figure class="pn po pp pq pr ps pk pl paragraph-image"><div class="pk pl qi"><picture><img src="https://miro.medium.com/v2/resize:fit:1226/0*pbXqoiaarsdchZZn" alt="image" /></picture></div></figure><p id="601b" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">While functionally equivalent, this UI change shifts the user’s narrative from “Is this value correct?” to “Do I need to do something that is not ‘normal’?” — which is a much easier question to answer. Zooming out further, thoughtful design also addresses product-level choices, such as whether a person knows where to go to accomplish a task. In this way, thoughtful design influences product strategy.</p><p id="5290" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Design permeates all aspects of our experimentation product at Netflix, from small choices like color to strategic choices like our roadmap. 
By thoughtfully approaching design, we can ensure that tools help the team learn the most from our experiments.</p><h1 id="9ce5" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">External Speaker: Kosuke Imai</h1><p id="a05b" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj">In addition to the amazing talks by Netflix employees, we also had the privilege of hearing from <a class="af ny" href="https://imai.fas.harvard.edu/" rel="noopener ugc nofollow" target="_blank">Kosuke Imai</a>, Professor of Government and Statistics at Harvard, who delivered our keynote talk. He introduced the “<a class="af ny" href="https://arxiv.org/abs/2403.07031" rel="noopener ugc nofollow" target="_blank">cram method</a>,” a powerful and efficient approach to learning and evaluating treatment policies using generic machine learning algorithms.</p></div></div></div><div class="ab ca qj qk ql qm" role="separator"><div class="gm gn go gp gq"><div class="ab ca"><div class="ch bg fy fz ga gb"><p id="675c" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Measuring causality is a large part of the data science culture at Netflix, and we are proud to have many stunning colleagues who leverage both experimentation and quasi-experimentation to drive member impact. 
The conference was a great way to celebrate each other’s work and highlight the ways in which causal methodology can create value for the business.</p><p id="da5c" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">To stay up to date on our work, follow the <a class="af ny" href="https://netflixtechblog.com/" rel="noopener ugc nofollow" target="_blank">Netflix Tech Blog</a>, and if you are interested in joining us, we are currently looking for <a class="af ny" href="https://jobs.netflix.com/search?q=data+science&amp;team=Data+Science+and+Engineering" rel="noopener ugc nofollow" target="_blank">new stunning colleagues</a> to help us entertain the world!</p></div></div></div></div></div>]]></description>
      <link>https://netflixtechblog.com/round-2-a-survey-of-causal-inference-applications-at-netflix-fd78328ee0bb</link>
      <guid>https://netflixtechblog.com/round-2-a-survey-of-causal-inference-applications-at-netflix-fd78328ee0bb</guid>
      <pubDate>Thu, 06 Jun 2024 22:10:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[The Making of VES: the Cosmos Microservice for Netflix Video Encoding]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/liwei-guo/">Liwei Guo</a>, <a href="https://www.linkedin.com/in/carvalhovinicius/">Vinicius Carvalho</a>, <a href="https://www.linkedin.com/in/anush-moorthy-b8451142/">Anush Moorthy</a>, <a href="https://www.linkedin.com/in/aditya-mavlankar-7139791/">Aditya Mavlankar</a>, <a href="https://www.linkedin.com/in/lishan-z-51302abb/">Lishan Zhu</a></p><p><em>This is the second post in a multi-part series from Netflix. See </em><a href="https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359"><em>here</em></a><em> for Part 1 which provides an overview of our efforts in rebuilding the Netflix video processing pipeline with microservices. This blog dives into the details of building our Video Encoding Service (VES), and shares our learnings.</em></p><p><a href="https://netflixtechblog.com/the-netflix-cosmos-platform-35c14d9351ad">Cosmos</a> is the next generation media computing platform at Netflix. Combining microservice architecture with asynchronous workflows and serverless functions, Cosmos aims to modernize Netflix’s media processing pipelines with improved flexibility, efficiency, and developer productivity. In the past few years, the video team within Encoding Technologies (ET) has been working on rebuilding the entire video pipeline on Cosmos.</p><p>This new pipeline is composed of a number of microservices, each dedicated to a single functionality. One such microservice is Video Encoding Service (VES). Encoding is an essential component of the video pipeline. At a high level, it takes an ingested mezzanine and encodes it into a video stream that is suitable for Netflix streaming or serves some studio/production use case. 
In the case of Netflix, there are a number of requirements for this service:</p><ul><li>Given the wide range of devices from mobile phones to browsers to Smart TVs, multiple codec formats, resolutions, and quality levels need to be supported.</li><li>Chunked encoding is a must to meet the latency requirements of our business needs, and use cases with different levels of latency sensitivity need to be accommodated.</li><li>The capability of continuous release is crucial for enabling fast product innovation in both streaming and studio spaces.</li><li>There is a huge volume of encoding jobs every day. The service needs to be cost-efficient and make the best use of available resources.</li></ul><p>In this tech blog, we will walk through how we built VES to achieve the above goals and will share a number of lessons we learned from building microservices. Please note that for simplicity, we have chosen to omit certain Netflix-specific details that are not integral to the primary message of this blog post.</p><h3>Building Video Encoding Service on Cosmos</h3><p>A Cosmos microservice consists of three layers: an API layer (Optimus) that takes in requests, a workflow layer (Plato) that orchestrates the media processing flows, and a serverless computing layer (Stratum) that processes the media. These three layers communicate asynchronously through a home-grown, priority-based messaging system called <a href="https://netflixtechblog.com/timestone-netflixs-high-throughput-low-latency-priority-queueing-system-with-built-in-support-1abf249ba95f">Timestone</a>. We chose Protobuf as the payload format for its high efficiency and mature cross-platform support.</p><p>To help service developers get a head start, the Cosmos platform provides a powerful service generator. This generator features an intuitive UI. 
With a few clicks, it creates a basic yet complete Cosmos service: code repositories for all 3 layers are created; all platform capabilities, including discovery, logging, tracing, etc., are enabled; release pipelines are set up and dashboards are readily accessible. We can immediately start adding video encoding logic and deploy the service to the cloud for experimentation.</p><h4>Optimus</h4><p>As the API layer, Optimus serves as the gateway into VES, meaning service users can only interact with VES through Optimus. The defined API interface is a strong contract between VES and the external world. As long as the API is stable, users are shielded from internal changes in VES. This decoupling is instrumental in enabling faster iterations of VES internals.</p><p>As a single-purpose service, the API of VES is quite clean. We defined an endpoint<em> encodeVideo</em> that takes an <em>EncodeRequest</em> and returns an <em>EncodeResponse</em> (in an async way through Timestone messages). The <em>EncodeRequest</em> object contains information about the source video as well as the encoding recipe. All the requirements of the encoded video (codec, resolution, etc.) 
as well as the controls for latency (chunking directives) are exposed through the data model of the encoding recipe.</p><pre>//protobuf definition <br><br>message EncodeRequest {<br>    VideoSource video_source = 1; //source to be encoded<br>    Recipe recipe = 2; //including encoding format, resolution, etc.<br>}<br><br>message EncodeResponse {<br>    OutputVideo output_video = 1; //encoded video<br>    Error error = 2; //error message (optional)<br>}<br><br>message Recipe {<br>    Codec codec = 1; //including codec format, profile, level, etc.<br>    Resolution resolution = 2;<br>    ChunkingDirectives chunking_directives = 3;<br>    ...<br>}</pre><p>As with any other Cosmos service, the platform automatically generates an RPC client based on the VES API data model, which users can use to build the request and invoke VES. Once an incoming request is received, Optimus performs validations, and (when applicable) converts the incoming data into an internal data model before passing it to the next layer, Plato.</p><h4>Plato</h4><p>The workflow layer, Plato, governs the media processing steps. The Cosmos platform supports two programming paradigms for Plato: a forward chaining rule engine and a Directed Acyclic Graph (DAG). VES has a linear workflow, so we chose DAG for its simplicity.</p><p>In a DAG, the workflow is represented by nodes and edges. Nodes represent stages in the workflow, while edges signify dependencies — a stage is only ready to execute when all its dependencies have been completed. VES requires parallel encoding of video chunks to meet its latency and resilience goals. 
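The readiness rule just described (a stage executes only once every one of its dependencies has completed) can be sketched in a few lines. This is purely illustrative, not Cosmos code, and the node names are generic:

```python
# Minimal sketch of the DAG readiness rule: a node is ready to run once
# every node it depends on has completed (and it has not yet run itself).
deps = {
    "A": set(),        # no dependencies: ready immediately
    "B": {"A"},
    "C": {"A", "B"},   # ready only after both A and B complete
}

def ready_nodes(completed: set) -> set:
    """Nodes whose dependencies are all satisfied and that have not yet run."""
    return {n for n, d in deps.items() if d <= completed and n not in completed}
```

Running the workflow amounts to repeatedly executing whatever `ready_nodes` returns until nothing is left.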
This workflow-level parallelism is facilitated by the DAG through a MapReduce mode. Nodes can be annotated to indicate this relationship, and a Reduce node will only be triggered when all its associated Map nodes are ready.</p><p>For the VES workflow, we defined five Nodes and their associated edges, which are visualized in the following graph:</p><ul><li>Splitter Node: This node divides the video into chunks based on the chunking directives in the recipe.</li><li>Encoder Node: This node encodes a video chunk. It is a Map node.</li><li>Assembler Node: This node stitches the encoded chunks together. It is a Reduce node.</li><li>Validator Node: This node performs the validation of the encoded video.</li><li>Notifier Node: This node notifies the API layer once the entire workflow is completed.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7n3hN3lYhe89Ezk6"></figure><p>In this workflow, nodes such as the Notifier perform very lightweight operations and can be directly executed in the Plato runtime. However, resource-intensive operations need to be delegated to the computing layer (Stratum), or another service. Plato invokes Stratum functions for tasks such as encoding and assembling, where the nodes (Encoder and Assembler) post messages to the corresponding message queues. The Validator node calls another Cosmos service, the Video Validation Service, to validate the assembled encoded video.</p><h4>Stratum</h4><p>The computing layer, Stratum, is where media samples can be accessed. Developers of Cosmos services create Stratum Functions to process the media. They can bring their own media processing tools, which are packaged into Docker images of the Functions. These Docker images are then published to our internal Docker registry, part of <a href="https://medium.com/p/f868c9fb5436">Titus</a>. 
In production, Titus automatically scales instances based on the depths of job queues.</p><p>VES needs to support encoding source videos into a variety of codec formats, including AVC, AV1, and VP9, to name a few. We use different encoder binaries (referred to simply as “encoders”) for different codec formats. For AVC, a format that is now 20 years old, the encoder is quite stable. On the other hand, <a href="https://netflixtechblog.com/bringing-av1-streaming-to-netflix-members-tvs-b7fc88e42320">the newest addition to Netflix streaming</a>, AV1, is continuously going through active improvements and experimentation, necessitating more frequent encoder upgrades. To effectively manage this variability, we decided to create multiple Stratum Functions, each of which is dedicated to a specific codec format and can be released independently. This approach ensures that upgrading one encoder will not impact the VES service for other codec formats, maintaining stability and performance across the board.</p><p>Within the Stratum Function, the Cosmos platform provides abstractions for common media access patterns. Regardless of file formats, sources are uniformly presented as locally mounted frames. Similarly, for output that needs to be persisted in the cloud, the platform presents the process as writing to a local file. All details, such as streaming of bytes and retrying on errors, are abstracted away. With the platform taking care of the complexity of the infrastructure, the essential code for video encoding in the Stratum Function could be as simple as follows.</p><pre>ffmpeg -i input/source%08d.j2k -vf ... -c:v libx264 ... output/encoding.264</pre><p>Encoding is a resource-intensive process, and the resources required are closely related to the codec format and the encoding recipe. We conducted benchmarking to understand the resource usage pattern, particularly CPU and RAM, for different encoding recipes. 
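One way such benchmark results can feed back into scheduling is a simple lookup from job characteristics to a resource allocation. The sketch below is purely illustrative — every name and number is invented, not the actual Cosmos API:

```python
# Hypothetical sketch: route an encoding job to a resource allocation based
# on benchmark-informed rules keyed by (codec, resolution). Illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class ContainerShape:
    num_cpus: int
    memory_mb: int

# Benchmark-informed routing table: heavier codecs/resolutions get bigger shapes.
ROUTING = {
    ("avc", "1080p"): ContainerShape(num_cpus=2, memory_mb=4000),
    ("avc", "2160p"): ContainerShape(num_cpus=4, memory_mb=8000),
    ("av1", "2160p"): ContainerShape(num_cpus=8, memory_mb=16000),
}

DEFAULT_SHAPE = ContainerShape(num_cpus=2, memory_mb=4000)

def shape_for(codec: str, resolution: str) -> ContainerShape:
    """Pick the resource allocation for an encoding job."""
    return ROUTING.get((codec, resolution), DEFAULT_SHAPE)
```

The point of the table is that allocations are derived from measured usage rather than guessed per job.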
Based on the results, we leveraged the “container shaping” feature from the Cosmos platform.</p><p>We defined a number of different “container shapes”, specifying the allocations of resources like CPU and RAM.</p><pre># an example definition of container shape<br>group: containerShapeExample1<br>resources:<br>  numCpus: 2<br>  memoryInMB: 4000<br>  networkInMbp: 750<br>  diskSizeInMB: 12000</pre><p>Routing rules are created to assign encoding jobs to different shapes based on the combination of codec format and encoding resolution. This helps the platform perform “bin packing”, thereby maximizing resource utilization.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/425/1*UZ7G9SlWnqLXd4umutzhcQ.png"><figcaption>An example of “bin-packing”. The circles represent CPU cores and the area represents the RAM. This 16-core EC2 instance is packed with 5 encoding containers (rectangles) of 3 different shapes (indicated by different colors).</figcaption></figure><h3>Continuous Release</h3><p>After we completed the development and testing of all three layers, VES was launched in production. However, this did not mark the end of our work. Quite the contrary, we believed, and still do, that a significant part of a service’s value is realized through iterations: supporting new business needs, enhancing performance, and improving resilience. An important piece of our vision was for Cosmos services to have the ability to continuously release code changes to production in a safe manner.</p><p>Because VES focuses on a single functionality, code changes pertaining to a single feature addition are generally small and cohesive, making them easy to review. Since callers can only interact with VES through its API, internal code is truly “implementation details” that are safe to change. The explicit API contract limits the test surface of VES. 
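The “bin packing” mentioned earlier can be illustrated with a small first-fit-decreasing sketch. The container sizes and the 16-core instance echo the figure, but the code is ours, not the platform's:

```python
# Illustrative first-fit-decreasing bin packing of container CPU requests
# onto 16-core instances. Not Netflix/Cosmos code; sizes are from the figure.
from typing import List

INSTANCE_CORES = 16

def pack(jobs_cores: List[int]) -> List[List[int]]:
    """Place each job (largest first) into the first instance with room."""
    instances: List[List[int]] = []
    for cores in sorted(jobs_cores, reverse=True):
        for inst in instances:
            if sum(inst) + cores <= INSTANCE_CORES:
                inst.append(cores)
                break
        else:
            instances.append([cores])  # no instance had room: open a new one
    return instances

# Five containers of three shapes, e.g. 4+4+4+2+2 cores: they fit on one instance.
placement = pack([4, 4, 2, 4, 2])
```

A production scheduler does far more (RAM, network, disk, live arrivals), but the utilization benefit of packing mixed shapes is the same idea.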
Additionally, the Cosmos platform provides a <a href="https://martinfowler.com/articles/practical-test-pyramid.html">pyramid</a>-based testing framework to guide developers in creating tests at different levels.</p><p>After testing and code review, changes are merged and are ready for release. The release pipeline is fully automated: after the merge, the pipeline checks out code, compiles, builds, runs unit/integration/end-to-end tests as prescribed, and proceeds to full deployment if no issues are encountered. Typically, it takes around 30 minutes from code merge to feature landing (a process that took 2–4 weeks in our previous generation platform!). The short release cycle provides faster feedback to developers and helps them make necessary updates while the context is still fresh.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1020/1*fgKHQg4IZVMkWQlHGaCT2Q.png"><figcaption>Screenshot of a release pipeline run in our production environment</figcaption></figure><p>When running in production, the service constantly emits metrics and logs. They are collected by the platform to visualize dashboards and to drive monitoring/alerting systems. Metrics deviating too much from the baseline will trigger alerts and can lead to automatic service rollback (when the “canary” feature is enabled).</p><h3>The Learnings</h3><p>VES was the very first microservice that our team built. We started with basic knowledge of microservices and learned a multitude of lessons along the way. These learnings deepened our understanding of microservices and have helped us improve our design choices and decisions.</p><h4>Define a Proper Service Scope</h4><p>A principle of microservice architecture is that a service should be built for a single functionality. This sounds straightforward, but what exactly qualifies as a “single functionality”? 
“Encoding video” sounds good, but wouldn’t “encode video into the AVC format” be an even more specific single functionality?</p><p>When we started building VES, we took the approach of creating a separate encoding service for each codec format. While this has advantages such as decoupled workflows, we were quickly overwhelmed by the development overhead. Imagine that a user asked us to add watermarking capability to the encoding. We needed to make changes to multiple microservices. Worse, the changes across all these services were very similar: essentially, we were adding the same code (and tests) again and again. This kind of repetitive work can easily wear out developers.</p><p>The service presented in this blog is our second iteration of VES (yes, we already went through one iteration). In this version, we consolidated encodings for different codec formats into a single service. They share the same API and workflow, while each codec format has its own Stratum Functions. So far this seems to strike a good balance: the common API and workflow reduce code repetition, while separate Stratum Functions guarantee independent evolution of each codec format.</p><p>The changes we made are not irreversible. If, someday in the future, the encoding of one particular codec format evolves into a totally different workflow, we have the option to spin it off into its own microservice.</p><h4>Be Pragmatic about Data Modeling</h4><p>In the beginning, we were very strict about data model separation — we had a strong belief that sharing equates to coupling, and coupling could lead to potential disasters in the future. To avoid this, we defined a separate data model for each service, as well as for each of the three layers within a service, and built converters to translate between the different data models.</p><p>We ended up creating multiple data models for aspects such as bit-depth and resolution across our system. To be fair, this does have some merits. 
For example, our encoding pipeline supports different bit-depths for AVC encoding (8-bit) and AV1 encoding (10-bit). By defining both <em>AVC.BitDepth</em> and <em>AV1.BitDepth</em>, constraints on the bit-depth can be built into the data models. However, it is debatable whether the benefits of this differentiation outweigh the downsides, namely multiple data model translations.</p><p>Eventually, we created a library to host data models for common concepts in the video domain. Examples of such concepts include frame rate, scan type, color space, etc. As you can see, they are extremely common and stable. This “common” data model library is shared across all services owned by the video team, avoiding unnecessary duplication and data conversions. Within each service, additional data models are defined for service-specific objects.</p><h4>Embrace Service API Changes</h4><p>This may sound contradictory. We have been saying that an API is a strong contract between the service and its users, and keeping an API stable shields users from internal changes. This is absolutely true. However, none of us had a crystal ball when we were designing the very first version of the service API. It is inevitable that at a certain point, this API becomes inadequate. If we cling too tightly to the belief that “the API cannot change”, developers will be forced to find workarounds, which are almost certainly sub-optimal.</p><p>There are many great tech articles about gracefully evolving APIs. We believe we also have a unique advantage: VES is a service internal to Netflix Encoding Technologies (ET). Our two users, the Streaming Workflow Orchestrator and the Studio Workflow Orchestrator, are owned by the workflow team within ET. Our teams share the same contexts and work towards common goals. If we believe updating the API is in the best interest of Netflix, we meet with them to seek alignment. 
Once a consensus to update the API is reached, teams collaborate to ensure a smooth transition.</p><h3>Stay Tuned…</h3><p>This is the second part of our tech blog series Rebuilding Netflix Video Pipeline with Microservices. In this post, we described in detail how we built the Video Encoding Service (VES), as well as the lessons we learned along the way. Our pipeline includes a few other services that we plan to write about as well. Stay tuned for our future blogs on this topic of microservices!</p><hr><p><a href="https://netflixtechblog.com/the-making-of-ves-the-cosmos-microservice-for-netflix-video-encoding-946b9b3cd300">The Making of VES: the Cosmos Microservice for Netflix Video Encoding</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/the-making-of-ves-the-cosmos-microservice-for-netflix-video-encoding-946b9b3cd300</link>
      <guid>https://netflixtechblog.com/the-making-of-ves-the-cosmos-microservice-for-netflix-video-encoding-946b9b3cd300</guid>
      <pubDate>Wed, 10 Apr 2024 00:12:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Reverse Searching Netflix’s Federated Graph]]></title>
      <description><![CDATA[<div><div class="hu hv hw hx hy"></div><p id="89dd" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">By <a class="af ny" href="https://www.linkedin.com/in/rickygardiner/" rel="noopener ugc nofollow" target="_blank">Ricky Gardiner</a>, <a class="af ny" href="https://www.linkedin.com/in/ahutter/" rel="noopener ugc nofollow" target="_blank">Alex Hutter</a>, and <a class="af ny" href="https://www.linkedin.com/in/katielefevre/" rel="noopener ugc nofollow" target="_blank">Katie Lefevre</a></p><p id="cd91" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Since our previous posts regarding Content Engineering’s role in enabling search functionality within Netflix’s federated graph (<a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf">the first post</a>, where we identify the issue and elaborate on the indexing architecture, and <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-part-2-49348511c06c">the second post</a>, where we detail how we facilitate querying) there have been significant developments. We’ve opened up Studio Search beyond Content Engineering to the entirety of the Engineering organization at Netflix and renamed it Graph Search. There are over 100 applications integrated with Graph Search and nearly 50 indices we support. We continue to add functionality to the service. As promised in the previous post, we’ll share how we partnered with one of our Studio Engineering teams to build reverse search. 
Reverse search inverts the standard querying pattern: rather than finding documents that match a query, it finds queries that match a document.</p><h1 id="7f88" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Intro</h1><p id="b5d6" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj">Tiffany is a Netflix Post Production Coordinator who oversees a slate of nearly a dozen movies in various states of pre-production, production, and post-production. Tiffany and her team work with various cross-functional partners, including Legal, Creative, and Title Launch Management, tracking the progression and health of her movies.</p><p id="0c81" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">So Tiffany subscribes to notifications and calendar updates specific to certain areas of concern, like “movies shooting in Mexico City which don’t have a key role assigned”, or “movies that are at risk of not being ready by their launch date”.</p><p id="4dfd" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><strong class="nc gu"><em class="pc">Tiffany is not subscribing to updates of particular movies, but subscribing to queries that return a dynamic subset of movies. </em>This poses an issue for those of us responsible for sending her those notifications. When a movie changes, we don’t know who to notify, since there’s no association between employees and the movies they’re interested in.</strong></p><p id="ef81" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">We could save these searches, and then repeatedly query for the results of every search, but because we’re part of a large federated graph, this would have heavy traffic implications for every service we’re connected to. 
We’d have to decide if we wanted timely notifications or less load on our graph.</p><p id="9147" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">If we could answer the question “would this movie be returned by this query”, we could re-query based on change events with laser precision and not impact the broader ecosystem.</p><h1 id="6ed1" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">The Solution</h1><p id="426d" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj">Graph Search is built on top of Elasticsearch, which has the exact capabilities we require:</p><ul class=""><li id="f67a" class="na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx pd pe pf bj"><a class="af ny" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/percolator.html#:~:text=The%20percolator%20field%20type%20parses,to%20be%20a%20percolator%20field." 
rel="noopener ugc nofollow" target="_blank">percolator fields</a> that can be used to index Elasticsearch queries</li><li id="3bdd" class="na nb gt nc b nd pg nf ng nh ph nj nk nl pi nn no np pj nr ns nt pk nv nw nx pd pe pf bj"><a class="af ny" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html" rel="noopener ugc nofollow" target="_blank">percolate queries</a> that can be used to determine which indexed queries match an input document.</li></ul><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv fi pw bg px"><div class="pl pm pn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*GCZRoNqT8seObcUFzYthXg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*GCZRoNqT8seObcUFzYthXg.png" /></picture></div></div></figure><p id="55ac" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Instead of taking a search (like “spanish-language movies shot in Mexico City”) and returning the documents that match (One for Roma, one for Familia), a percolate query takes a document (one for Roma) and returns the searches that match that document, like “spanish-language movies” and “scripted dramas”.</p><p id="8054" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">We’ve communicated this functionality as the ability to save a search, called <code class="cw pz qa qb qc b">SavedSearches</code>, which is a persisted filter on an existing index.</p><pre class="po pp pq pr ps qd qc qe bo qf ba bj">type SavedSearch {<br />  id: ID!<br />  filter: String<br />  index: SearchIndex!<br />}</pre><p id="a0e9" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">That filter, written in Graph Search DSL, is converted to an Elasticsearch query and 
indexed in a percolator field. To learn more about Graph Search DSL and why we created it rather than using Elasticsearch query language directly, <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-part-2-49348511c06c">see the Query Language section of “How Netflix Content Engineering makes a federated graph searchable (Part 2)”</a>.</p><p id="05d9" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">We’ve called the process of finding matching saved searches <code class="cw pz qa qb qc b">ReverseSearch</code>. This is the most straightforward part of this offering. We added a new resolver to the <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2">Domain Graph Service</a> (DGS) for Graph Search. It takes the index of interest and a document, and returns all the saved searches that match the document by issuing a percolate query.</p><pre class="po pp pq pr ps qd qc qe bo qf ba bj">"""<br />Query for retrieving all the registered saved searches, in a given index,<br />based on a provided document. The document in this case is an ElasticSearch<br />document that is generated based on the configuration of the index.<br />"""<br />reverseSearch(<br />  after: String,<br />  document: JSON!,<br />  first: Int!,<br />  index: SearchIndex!): SavedSearchConnection</pre><p id="6b23" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Persisting a <code class="cw pz qa qb qc b">SavedSearch</code> is implemented as a new mutation on the Graph Search DGS. 
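</p><p>To make the relationship between saving a search and reverse searching concrete, here is a toy, in-memory stand-in for percolation (purely illustrative; in the real system the queries are stored in Elasticsearch percolator fields, not as Python predicates):</p>

```python
# Reverse search in miniature: store queries, then ask which queries
# match a given document (the inverse of a normal search).

saved_searches = {}  # id -> filter predicate over a document dict

def upsert_saved_search(search_id, predicate):
    """Persist (or update) a saved search."""
    saved_searches[search_id] = predicate

def reverse_search(document):
    """Return the ids of all saved searches matching the document."""
    return [sid for sid, pred in saved_searches.items() if pred(document)]

upsert_saved_search("spanish-language-movies",
                    lambda d: d.get("language") == "es")
upsert_saved_search("mexico-city-shoots",
                    lambda d: d.get("shootLocation") == "Mexico City")

roma = {"movieTitle": "Roma", "language": "es",
        "shootLocation": "Mexico City"}
matches = reverse_search(roma)  # both saved searches match this document
```

<p>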
This ultimately triggers the indexing of an Elasticsearch query in a percolator field.</p><pre class="po pp pq pr ps qd qc qe bo qf ba bj">"""<br />Mutation for registering and updating a saved search. They need to be updated<br />any time a user adjusts their search criteria.<br />"""<br />upsertSavedSearch(input: UpsertSavedSearchInput!): UpsertSavedSearchPayload</pre><p id="714b" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Supporting percolator fields fundamentally changed how we provision the indexing pipelines for Graph Search (<a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf">see Architecture section of How Netflix Content Engineering makes a federated graph searchable)</a>. Rather than having a single indexing pipeline per Graph Search index we now have two: one to index documents and one to index saved searches to a percolate index. We chose to add percolator fields to a separate index in order to tune performance for the two types of queries separately.</p><p id="3c95" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Elasticsearch requires the percolate index to have a mapping that matches the structure of the queries it stores and therefore must match the mapping of the document index. Index templates define mappings that are applied when creating new indices. By using the <a class="af ny" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates-v1.html#put-index-template-v1-api-request-body" rel="noopener ugc nofollow" target="_blank">index_patterns</a> functionality of index templates, we’re able to share the mapping for the document index between the two. 
index_patterns also gives us an easy way to add a percolator field to every percolate index we create.</p><p id="20fc" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><strong class="nc gu">Example of document index mapping</strong></p><p id="d14a" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><strong class="nc gu">Index pattern — application_*</strong></p><pre class="po pp pq pr ps qd qc qe bo qf ba bj">{<br />  "order": 1,<br />  "index_patterns": ["application_*"],<br />  "mappings": {<br />  "properties": {<br />    "movieTitle": {<br />      "type": "keyword"<br />    },<br />    "isArchived": {<br />      "type": "boolean"<br />    }<br />  }<br />}</pre><p id="79f1" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><strong class="nc gu">Example of percolate index mappings</strong></p><p id="a312" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><strong class="nc gu">Index pattern — *_percolate</strong></p><pre class="po pp pq pr ps qd qc qe bo qf ba bj">{<br />  "order": 2,<br />  "index_patterns": ["*_percolate*"],<br />  "mappings": {<br />    "properties": {<br />      "percolate_query": {<br />        "type": "percolator"<br />      }<br />    }<br />  }<br />}</pre><p id="e87a" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><strong class="nc gu">Example of generated mapping</strong></p><p id="4cef" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj"><strong class="nc gu">Percolate index name is application_v1_percolate</strong></p><pre class="po pp pq pr ps qd qc qe bo qf ba bj">{<br />  "application_v1_percolate": {<br />    "mappings": {<br />      "_doc": {<br />     
   "properties": {<br />          "movieTitle": {<br />            "type": "keyword"<br />          },<br />          "isArchived": {<br />            "type": "boolean"<br />          },<br />          "percolate_query": {<br />            "type": "percolator"<br />          }<br />        }<br />      }<br />    }<br />  }<br />}</pre><h1 id="d6ba" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Percolate Indexing Pipeline</h1><p id="aa7f" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj">The percolate index isn’t as simple as taking the input from the GraphQL mutation, translating it to an Elasticsearch query, and indexing it. Versioning, which we’ll talk more about shortly, reared its ugly head and made things a bit more complicated. Here is the way the percolate indexing pipeline is set up.</p><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv fi pw bg px"><div class="pl pm ql"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*KSZuvPeOxDOKNrPiNnNvFg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*KSZuvPeOxDOKNrPiNnNvFg.png" /></picture></div></div><figcaption class="qm fe qn pl pm qo qp be b bf z dt"><em class="qq">See</em> <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873">Data Mesh — A Data Movement and Processing Platform @ Netflix</a> to learn more about Data Mesh.</figcaption></figure><ol class=""><li id="b0e6" class="na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx qr pe pf bj">When <code class="cw pz qa qb qc b">SavedSearches</code> are modified, we store them in our CockroachDB, and the source connector for the Cockroach database emits CDC events.</li><li id="630e" class="na nb gt nc b nd 
pg nf ng nh ph nj nk nl pi nn no np pj nr ns nt pk nv nw nx qr pe pf bj">A single table is shared for the storage of all <code class="cw pz qa qb qc b">SavedSearches</code>, so the next step is filtering down to just those that are for *this* index using a filter processor.</li><li id="cf24" class="na nb gt nc b nd pg nf ng nh ph nj nk nl pi nn no np pj nr ns nt pk nv nw nx qr pe pf bj">As previously mentioned, what is stored in the database is our custom Graph Search filter DSL, which is not the same as the Elasticsearch DSL, so we cannot directly index the event to the percolate index. Instead, we issue a mutation to the Graph Search DGS. The Graph Search DGS translates the DSL to an Elasticsearch query.</li><li id="2a9e" class="na nb gt nc b nd pg nf ng nh ph nj nk nl pi nn no np pj nr ns nt pk nv nw nx qr pe pf bj">Then we index the Elasticsearch query as a percolate field in the appropriate percolate index.</li><li id="1318" class="na nb gt nc b nd pg nf ng nh ph nj nk nl pi nn no np pj nr ns nt pk nv nw nx qr pe pf bj">The success or failure of the indexing of the <code class="cw pz qa qb qc b">SavedSearch</code> is returned. On failure, the <code class="cw pz qa qb qc b">SavedSearch</code> events are sent to a Dead Letter Queue (DLQ) that can be used to address any failures, such as fields referenced in the search query being removed from the index.</li></ol><p id="bf37" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Now a bit on versioning to explain why the above is necessary. Imagine we’ve started tagging movies that have animals. If we want users to be able to create views of “movies with animals”, we need to add this new field to the existing search index to flag movies as such. However, the mapping in the current index doesn’t include it, so we can’t filter on it. 
To solve for this we have index versions.</p><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv fi pw bg px"><div class="pl pm qs"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*AnB07zkL_g4a30TUgAHGSA.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*AnB07zkL_g4a30TUgAHGSA.jpeg" /></picture></div></div><figcaption class="qm fe qn pl pm qo qp be b bf z dt">Dalia &amp; Forrest from the series <a class="af ny" href="https://www.netflix.com/title/81730862" rel="noopener ugc nofollow" target="_blank">Baby Animal Cam</a></figcaption></figure><p id="9530" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">When a change is made to an index definition that necessitates a new mapping, like when we add the animal tag, Graph Search creates a new version of the Elasticsearch index and a new pipeline to populate it. This new pipeline reads from a log-compacted Kafka topic in Data Mesh — this is how we can reindex the entire corpus without asking the data sources to resend all the old events. The new pipeline and the old pipeline run side by side, until the new pipeline has processed the backlog, at which point Graph Search cuts over to the version using Elasticsearch index aliases.</p><p id="4b0c" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">Creating a new index for our documents means we also need to create a new percolate index for our queries so they can have consistent index mappings. This new percolate index also needs to be backfilled when we change versions. 
This is why the pipeline works the way it does — we can again utilize the log-compacted topics in Data Mesh to reindex the corpus of <code class="cw pz qa qb qc b">SavedSearches</code> when we spin up a new percolate indexing pipeline.</p><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv fi pw bg px"><div class="pl pm qt"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*bAxTCeTOeZ_g4ueiN0cLiQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*bAxTCeTOeZ_g4ueiN0cLiQ.png" /></picture></div></div><figcaption class="qm fe qn pl pm qo qp be b bf z dt"><em class="qq">We persist the user-provided filter DSL to the database rather than immediately translating it to Elasticsearch query language. This enables us to make changes or fixes when we translate the saved search DSL to an Elasticsearch query. We can deploy those changes by creating a new version of the index, as the bootstrapping process will re-translate every saved search.</em></figcaption></figure><h1 id="c32c" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Another Use Case</h1><p id="d07d" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj">We hoped reverse search functionality would eventually be useful for other engineering teams. We were approached almost immediately with a problem that reverse searching could solve.</p><p id="6cd0" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">The way you make a movie can be very different based on the type of movie it is. One movie might go through a set of phases that are not applicable to another, or might need to schedule certain events that another movie doesn’t require. 
Instead of manually configuring the workflow for a movie based on its classifications, we should be able to define the means of classifying movies and use that to automatically assign them to workflows. But determining the classification of a movie is challenging: you could define these movie classifications based on genre alone, like “Action” or “Comedy”, but you likely require more complex definitions. Maybe it’s defined by the genre, region, format, language, or some nuanced combination thereof. The Movie Matching service provides a way to classify a movie based on any combination of matching criteria. Under the hood, the matching criteria are stored as reverse searches, and to determine which criteria a movie matches against, the movie’s document is submitted to the reverse search endpoint.</p><p id="f1aa" class="pw-post-body-paragraph na nb gt nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gm bj">In short, reverse search is powering an externalized criteria matcher. It’s being used for movie criteria now, but since every Graph Search index is now reverse-search capable, any index could use this pattern.</p><h1 id="9c4d" class="nz oa gt be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">A Possible Future: Subscriptions</h1><p id="91de" class="pw-post-body-paragraph na nb gt nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gm bj">Reverse searches also look like a promising foundation for creating more responsive UIs. Rather than fetching results once as a query, the search results could be provided via a GraphQL subscription. These subscriptions could be associated with a <code class="cw pz qa qb qc b">SavedSearch</code> and, as index changes come in, reverse search can be used to determine when to update the set of keys returned by the subscription.</p></div>]]></description>
      <link>https://netflixtechblog.com/reverse-searching-netflixs-federated-graph-222ac5d23576</link>
      <guid>https://netflixtechblog.com/reverse-searching-netflixs-federated-graph-222ac5d23576</guid>
      <pubDate>Thu, 04 Apr 2024 23:26:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Sequential Testing Keeps the World Streaming Netflix Part 2: Counting Processes]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/michaelslindon/">Michael Lindon</a>, <a href="https://www.linkedin.com/in/csanden/">Chris Sanden</a>, <a href="https://www.linkedin.com/in/vshirikian/">Vache Shirikian</a>, <a href="https://www.linkedin.com/in/liuyanjun/">Yanjun Liu</a>, <a href="https://www.linkedin.com/in/minalmishra/">Minal Mishra</a>, <a href="https://www.linkedin.com/in/martintingley/">Martin Tingley</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7DNyGA0x7r7msS7w1Zpvpw.jpeg"></figure><p>Have you ever encountered a bug while streaming Netflix? Did your title stop unexpectedly, or not start at all? In the first installment of this blog series on sequential testing, we described our <a href="https://medium.com/p/cba6c7ed49df">canary testing methodology for continuous metrics such as <em>play-delay</em></a>. One of our readers commented:</p><blockquote>What if the new release is not related to a new play/streaming feature? For example, what if the new release includes modified login functionality? Will you still monitor the “play-delay” metric?</blockquote><p>Netflix monitors a large suite of metrics, many of which can be classified as counts. These include metrics such as the number of logins, errors, successful play starts, and even the number of customer call center contacts. In this second installment, we describe our sequential methodology for testing count metrics, outlined in the NeurIPS paper <a href="https://openreview.net/forum?id=a4zg0jiuVi"><em>Anytime Valid Inference for Multinomial Count Data</em></a>.</p><h4><strong>Spot the Difference</strong></h4><p>Suppose we are about to deploy new code that changes the login behavior. To de-risk the software rollout, we A/B test the new code; this is also known as a canary test. Whenever an event such as a login occurs, a log flows through our real-time backend and the corresponding timestamp is recorded. 
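</p><p>Before looking at real data, it helps to see how such timestamp streams and their counting processes can be simulated. The sketch below is illustrative only (it assumes constant event rates, whereas real traffic is time-varying); the names and rates are made up:</p>

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_timestamps(rate, horizon):
    """Homogeneous Poisson process on [0, horizon): exponential inter-arrival gaps."""
    gaps = rng.exponential(1.0 / rate, size=int(2 * rate * horizon))
    times = np.cumsum(gaps)
    return times[times < horizon]

control = simulate_timestamps(rate=50.0, horizon=10.0)    # existing build
treatment = simulate_timestamps(rate=40.0, horizon=10.0)  # buggy build drops events

def counting_process(timestamps, t_grid):
    """N(t): the number of events observed by each time in t_grid."""
    return np.searchsorted(np.sort(timestamps), t_grid, side="right")

t_grid = np.linspace(0.0, 10.0, 101)
n_control = counting_process(control, t_grid)
n_treatment = counting_process(treatment, t_grid)
```

<p>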
Figure 1 illustrates the sequences of timestamps generated by devices assigned to the new (treatment) and existing (control) software versions. A question that naturally concerns us is whether there are fewer login events in the treatment. Can you tell?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AGdASLgCaQCNo72VV7rOFg.gif"><figcaption>Figure 1: Timestamps of events occurring in control and treatment</figcaption></figure><p>The answer is not obvious from simple inspection of the point processes in Figure 1, but the difference becomes immediately apparent when we visualize the observed <a href="https://en.wikipedia.org/wiki/Counting_process#:~:text=Counting%20processes%20deal%20with%20the,be%20a%20Markov%20counting%20process.">counting processes</a>, shown in Figure 2.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hAQupmPzmKi7squd4iEwig.png"><figcaption>Figure 2: Visualizing the counting processes — the number of events observed by time t</figcaption></figure><p>The counting processes are functions that increment by 1 whenever a new event arrives. Clearly, there are fewer events occurring in the treatment than in the control. If these were login events, this would suggest that the new code contains a bug that prevents some users from logging in successfully.</p><p>This is a common situation when dealing with event timestamps. To give another example, if events corresponded to errors or crashes, we would like to know whether these are accruing faster in the treatment than in the control. 
Moreover, we want to answer that question <em>as quickly as possible</em> to prevent any further disruption to the service. This necessitates the sequential testing techniques introduced in <a href="https://medium.com/p/cba6c7ed49df">part 1</a>.</p><h4><strong>Time-Inhomogeneous Poisson Process</strong></h4><p>Our data for each treatment group is a realization of a one-dimensional point process, that is, a sequence of timestamps. As the rate at which the events arrive is time-varying (in both treatment and control), we model the point process as a time-inhomogeneous <a href="https://en.wikipedia.org/wiki/Poisson_point_process#Inhomogeneous_Poisson_point_process">Poisson point process</a>. This point process is defined by an intensity function λ: ℝ → [0, ∞). The number of events in the interval [0,t), denoted N(t), has the following Poisson distribution:</p><p>N(t) ~ Poisson(Λ(t)), where Λ(t) = ∫₀ᵗ λ(s) ds.</p><p>We seek to test the null hypothesis H₀: λᴬ(t) = λᴮ(t) for all t, i.e., that the intensity functions for control (A) and treatment (B) are the same. This can be done semiparametrically, without making any assumptions about the intensity functions λᴬ and λᴮ. Moreover, the novelty of the research is that this can be done sequentially, as described in <a href="https://openreview.net/pdf?id=a4zg0jiuVi">section 4</a> of our paper. Conveniently, the only data required to test this hypothesis at time t is Nᴬ(t) and Nᴮ(t), the total number of events observed so far in control and treatment. In other words, all you need to test the null hypothesis is two integers, which can easily be updated as new events arrive. 
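</p><p>To show how little machinery is required, here is a minimal sketch of such a two-integer test, using a Dirichlet-multinomial mixture e-value in the spirit of the paper. The choice of prior here is illustrative, so the exact p-values will differ from those of the reference implementation linked at the end of this post:</p>

```python
from math import exp, lgamma, log

def sequential_p_value(counts, probs, prior=1.0):
    """Anytime-valid p-value for H0: events are split across groups
    according to `probs`. Illustrative sketch with a flat Dirichlet prior:
    e-value = DirichletMultinomial(counts) / Multinomial(counts | probs).
    """
    alphas = [prior] * len(counts)

    def log_beta(v):  # log of the multivariate beta function
        return sum(lgamma(x) for x in v) - lgamma(sum(v))

    log_e = (log_beta([a + c for a, c in zip(alphas, counts)])
             - log_beta(alphas)
             - sum(c * log(p) for c, p in zip(counts, probs)))
    # By Ville's inequality, min(1, 1/e) is a valid sequential p-value.
    return 1.0 if log_e <= 0 else exp(-log_e)

balanced = sequential_p_value([100, 101], [0.5, 0.5])    # no evidence: 1.0
imbalanced = sequential_p_value([100, 201], [0.5, 0.5])  # strong evidence
```

<p>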
Here is an example from a simulated A/A test, in which we know by design that the intensity function is the same for the control (A) and the treatment (B), albeit nonstationary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RDdOPECxhDmLt9FkAOuWog.png"><figcaption>Figure 3: (Left) An A/A simulation of two inhomogeneous Poisson point processes. (Right) Confidence sequence on the log-difference of intensity functions, and sequential p-value.</figcaption></figure><p>Figure 3 provides an illustration of an A/A setting. The left figure presents the raw data and the intensity functions, and the right figure presents the sequential statistical analysis. The blue and red rug plots indicate the observed arrival timestamps of events from the treatment and control streams, respectively. The dashed lines are the observed counting processes. As this data is simulated under the null, the intensity functions are identical and overlay each other. The left axis of the right figure visualizes the evolution of the confidence sequence on the log-difference of intensity functions. The right axis of the right figure visualizes the evolution of the sequential p-value. We can make the following two observations:</p><ul><li>Under the null, the difference of log intensities is zero, which is correctly covered by the 0.95 confidence sequence at all times.</li><li>The sequential p-value is greater than 0.05 at all times.</li></ul><p>Now let’s consider an illustration of an A/B setting. Figure 4 shows observed arrival times for treatment and control when the intensity functions differ. As this is a simulation, the true difference between log intensities is known.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KChDT2L5gSw3BEDvH2Pu0g.png"><figcaption>Figure 4: (Left) An A/B simulation of two inhomogeneous Poisson point processes. 
(Right) Confidence sequence on the difference of log intensity functions, and sequential p-value.</figcaption></figure><p>We can make the following observations:</p><ul><li>The 0.95 confidence sequence covers the true log-difference at all times.</li><li>The sequential p-value falls below 0.05 at the same time the 0.95 confidence sequence excludes the null value of zero.</li></ul><p>Now we present several case studies in which this methodology rapidly detected serious problems in count metrics.</p><h4>Case Study 1: Drop in Successful Title Starts</h4><p>Figure 2 actually presents counts of title start events from a real canary test. Whenever a title starts successfully, an <a href="https://netflixtechblog.com/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a">event</a> is sent from the device to Netflix. We have a stream of title start events from treatment devices and a stream of title start events from control devices. Whenever fewer title starts are observed among treatment devices, there is usually a bug in the new client preventing playback.</p><p>In this case, the canary test detected a bug that was later determined to have prevented approximately 60% of treatment devices from being able to start their streams. The confidence sequence is shown in Figure 5, in addition to the (sequential) p-value. While the exact units of time have been omitted, this bug was detected at the <em>sub-second</em> level.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mfkXe5aNK3Y8X7aAXbxhJw.png"><figcaption>Figure 5: 0.99 Confidence sequence on the difference of log-intensities with sequential p-value.</figcaption></figure><h4>Case Study 2: Increase in Abnormal Shutdowns</h4><p>In addition to title start events, we also monitor whenever the Netflix client shuts down unexpectedly. As before, we have two streams of abnormal shutdown events, one from treatment devices and one from control devices. 
The following screenshots are taken directly from our <a href="https://netflixtechblog.com/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c">Lumen</a> dashboards.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*48VrtyraTis6BFnls0hECA.png"><figcaption>Figure 6: Counts of Abnormal Shutdowns over time, cumulative and non-cumulative. Treatment (Black) and Control (Blue)</figcaption></figure><p>Figure 6 illustrates two important points. There is clearly nonstationarity in the arrival of abnormal shutdown events. It is also hard to see any difference between treatment and control in the non-cumulative view. The difference is, however, much easier to see in the cumulative view by observing the counting process. There is a small but visible increase in the number of abnormal shutdowns in the treatment. Figure 7 shows how our sequential statistical methodology is able to identify even such small differences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*reK3Rxz4nybHPEHYd6LZrQ.png"><figcaption>Figure 7: Abnormal Shutdowns. (Top Panel) Confidence sequences on λᴮ(t)/λᴬ(t) (shaded blue) with observed counting processes for treatment (black dashed) and control (blue dashed). (Bottom Panel) sequential p-values.</figcaption></figure><h4>Case Study 3: Increase in Errors</h4><p>Netflix also monitors the number of errors produced by treatment and control. This is a high-cardinality metric, as every error is annotated with a code indicating the type of error. Monitoring errors segmented by code helps developers diagnose issues quickly. Figure 8 shows the sequential p-values, on the log scale, for a set of error codes that Netflix monitors during client rollouts. In this example, we have detected a higher volume of <a href="https://help.netflix.com/en/node/100573?q=3.1.18">3.1.18</a> errors being produced by treatment devices. 
Devices experiencing this error are presented with the following message:</p><blockquote>“We’re having trouble playing this title right now”</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pef_CT_X6avRIALRT8Q7MA.png"><figcaption>Figure 8: Sequential p-values for start play errors by error code</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*byxD-v65SdLe8HRaAPGB4g.png"><figcaption>Figure 9: Observed error-3.1.18 timestamps and counting processes for treatment (blue) and control (red)</figcaption></figure><p>Knowing <em>which</em> errors increased can streamline the process of identifying the bug for our developers. We immediately send developers alerts through Slack integrations, such as the following:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/565/1*moE4xE6_A-o0l_6Z4cAK1A.png"><figcaption>Figure 10: Notifications via Slack Integrations</figcaption></figure><p>The next time you are watching Netflix and encounter an error, know that we’re on it!</p><h4>Try it Out!</h4><p>The statistical approach outlined in our <a href="https://openreview.net/pdf?id=a4zg0jiuVi">paper</a> is remarkably easy to implement in practice. All you need are two integers: the numbers of events observed so far in treatment and control. The code is available in this short <a href="https://gist.github.com/michaellindon/5ce04c744d20755c3f653fbb58c2f4dd">GitHub gist</a>. Here are two usage examples:</p><pre>&gt; counts = [100, 101]<br>&gt; assignment_probabilities = [0.5, 0.5]<br>&gt; sequential_p_value(counts, assignment_probabilities)<br>  1<br><br>&gt; counts = [100, 201]<br>&gt; assignment_probabilities = [0.5, 0.5]<br>&gt; sequential_p_value(counts, assignment_probabilities)<br>  5.06061172163498e-06</pre><p>The <a href="https://gist.github.com/michaellindon/5ce04c744d20755c3f653fbb58c2f4dd">code</a> generalizes to more than just two treatment groups. 
For full details, including hyperparameter tuning, see <a href="https://openreview.net/pdf?id=a4zg0jiuVi">section 4</a> of the paper.</p><h4>Further Reading</h4><ul><li><a href="https://openreview.net/forum?id=a4zg0jiuVi">Anytime Valid Inference for Multinomial Count Data</a></li><li><a href="https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df">Sequential A/B Testing Keeps the World Streaming Netflix Part 1: Continuous Data</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=da6805341642" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/sequential-testing-keeps-the-world-streaming-netflix-part-2-counting-processes-da6805341642">Sequential Testing Keeps the World Streaming Netflix Part 2: Counting Processes</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/sequential-testing-keeps-the-world-streaming-netflix-part-2-counting-processes-da6805341642</link>
      <guid>https://netflixtechblog.com/sequential-testing-keeps-the-world-streaming-netflix-part-2-counting-processes-da6805341642</guid>
      <pubDate>Mon, 18 Mar 2024 13:46:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Supporting Diverse ML Systems at Netflix]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/david-j-berg/"><em>David J. Berg</em></a><em>, </em><a href="https://www.linkedin.com/in/romain-cledat-4a211a5/"><em>Romain Cledat</em></a><em>, </em><a href="https://www.linkedin.com/in/seeleykayla/"><em>Kayla Seeley</em></a><em>, </em><a href="https://www.linkedin.com/in/shashanksrikanth/"><em>Shashank Srikanth</em></a><em>, </em><a href="https://www.linkedin.com/in/chaoying-wang/"><em>Chaoying Wang</em></a><em>, </em><a href="https://www.linkedin.com/in/zitingyu/"><em>Darin Yu</em></a></p><p>Netflix uses data science and machine learning across all facets of the company, powering a wide range of business applications from <a href="https://netflixtechblog.com/evolving-from-rule-based-classifier-machine-learning-powered-auto-remediation-in-netflix-data-039d5efd115b">our internal infrastructure</a> and <a href="https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f">content demand modeling</a> to <a href="https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243">media understanding</a>. The Machine Learning Platform (MLP) team at Netflix provides an entire ecosystem of tools around <a href="https://metaflow.org/">Metaflow</a>, an open source machine learning infrastructure framework we started, to empower data scientists and machine learning practitioners to build and manage a variety of ML systems.</p><p><a href="https://netflixtechblog.com/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9">Since its inception</a>, Metaflow has been designed to provide a human-friendly API for building data and ML (and today AI) applications and deploying them in our production infrastructure frictionlessly. While human-friendly APIs are delightful, it is really the integrations to our production systems that give Metaflow its superpowers. 
Without these integrations, projects would be stuck at the prototyping stage, or they would have to be maintained as outliers outside the systems maintained by our engineering teams, incurring unsustainable operational overhead.</p><p>Given the very diverse set of ML and AI use cases we support — today we have hundreds of Metaflow projects deployed internally — we don’t expect all projects to follow the same path from prototype to production. Instead, we provide a robust foundational layer with integrations to our company-wide data, compute, and orchestration platform, as well as various paths to deploy applications to production smoothly. On top of this, teams have built their own domain-specific libraries to support their specific use cases and needs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4hoAg4FX6oeua708alTMlA.png"></figure><p>In this article, we cover a few key integrations that we provide for various layers of the Metaflow stack at Netflix, as illustrated above. We will also showcase real-life ML projects that rely on them, to give an idea of the breadth of projects we support. Note that all projects leverage multiple integrations, but we highlight them in the context of the integration that they use most prominently. Importantly, all the use cases were engineered by practitioners themselves.</p><p>These integrations are implemented through <a href="https://github.com/Netflix/metaflow-extensions-template">Metaflow’s extension mechanism</a> which is publicly available but subject to change, and hence not a part of Metaflow’s stable API yet. 
If you are curious about implementing your own extensions, get in touch with us on <a href="http://chat.metaflow.org/">the Metaflow community Slack</a>.</p><p>Let’s go over the stack layer by layer, starting with the most foundational integrations.</p><h3>Data: Fast Data</h3><p>Our main data lake is <a href="https://www.youtube.com/watch?v=jMFMEk8jFu8">hosted on S3, organized as Apache Iceberg tables</a>. For ETL and other heavy lifting of data, we mainly rely on Apache Spark. In addition to Spark, we want to support last-mile data processing in Python, addressing use cases such as feature transformations, batch inference, and training. Occasionally, these use cases involve terabytes of data, so we have to pay attention to performance.</p><p>To enable fast, scalable, and robust access to the Netflix data warehouse, we have developed a <em>Fast Data</em> library for Metaflow, which leverages high-performance components from the Python data ecosystem:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OGn9AcNNdMXAhxLq8WugkQ.png"></figure><p>As depicted in the diagram, the Fast Data library consists of two main interfaces:</p><ul><li>The Table object is responsible for interacting with the Netflix data warehouse, which includes parsing Iceberg (or legacy Hive) table metadata and resolving the partitions and Parquet files to read. Recently, we added support for the write path, so tables can be updated as well using the library.</li><li>Once we have discovered the Parquet files to be processed, MetaflowDataFrame takes over: it downloads data using Metaflow’s high-throughput S3 client directly to the process’ memory, which <a href="https://outerbounds.com/blog/metaflow-fast-data/">often outperforms reading of local files</a>.</li></ul><p>We use <a href="https://arrow.apache.org/">Apache Arrow</a> to decode Parquet and to host an in-memory representation of data. 
The user can choose the most suitable tool for manipulating data, such as <a href="https://pandas.pydata.org/">Pandas</a> or <a href="https://pola.rs/">Polars</a> to use a dataframe API, or one of our internal C++ libraries for various high-performance operations. Thanks to Arrow, data can be accessed through these libraries in a zero-copy fashion.</p><p>We also pay attention to dependency issues: (Py)Arrow is a dependency of many ML and data libraries, so we don’t want our custom C++ extensions to depend on a specific version of Arrow, which could easily lead to unresolvable dependency graphs. Instead, in the style of <a href="https://github.com/apache/arrow-nanoarrow">nanoarrow</a>, our Fast Data library only relies on <a href="https://arrow.apache.org/docs/format/CDataInterface.html">the stable Arrow C data interface</a>, producing a hermetically sealed library with no external dependencies.</p><h4>Example use case: Content Knowledge Graph</h4><p>Our knowledge graph of the entertainment world encodes relationships between titles, actors and other attributes of a film or series, supporting all aspects of business at Netflix.</p><p>A key challenge in creating a knowledge graph is entity resolution. There may be many different representations of slightly different or conflicting information about a title, which must be resolved. 
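As a toy illustration of pairwise entity resolution (the real pipeline uses learned matchers; a stdlib string similarity stands in here), candidate records are compared pairwise and high-scoring pairs are declared matches:

```python
# Illustrative only (not the Netflix pipeline): resolve duplicate title
# records by scoring candidate pairs with a simple string similarity.
# Real systems use learned matchers; difflib stands in here.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": "a", "title": "The Crown"},
    {"id": "b", "title": "Crown, The"},
    {"id": "c", "title": "Stranger Things"},
]

def similarity(x: str, y: str) -> float:
    # Normalize lightly so "Crown, The" and "The Crown" compare well.
    def norm(s):
        return " ".join(sorted(s.lower().replace(",", "").split()))
    return SequenceMatcher(None, norm(x), norm(y)).ratio()

# Pairwise matching is quadratic in the number of candidates, which is
# why the real workload is sharded across many parallel tasks.
matches = [
    (r1["id"], r2["id"])
    for r1, r2 in combinations(records, 2)
    if similarity(r1["title"], r2["title"]) > 0.9
]
```

The quadratic blow-up in candidate pairs is exactly what motivates the sharded, parallel execution described next.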
This is typically done through a pairwise matching procedure for each entity, which becomes non-trivial at scale.</p><p>This project leverages Fast Data and horizontal scaling with <a href="https://docs.metaflow.org/v/r/metaflow/basics#foreach">Metaflow’s foreach construct</a> to load large amounts of title information — approximately a billion pairs — stored in the Netflix Data Warehouse, so the pairs can be matched in parallel across many Metaflow tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KG7TUCqTF3uRU6lUncCHtg.png"></figure><p>We use metaflow.Table to resolve all input shards, which are distributed to Metaflow tasks that collectively process terabytes of data. Each task loads the data using metaflow.MetaflowDataFrame, performs matching using Pandas, and populates a corresponding shard in an output Table. Finally, when all matching is done and the data is written, the new table is committed so it can be read by other jobs.</p><h3>Compute: Titus</h3><p>Whereas open-source users of Metaflow rely on <a href="https://docs.metaflow.org/scaling/remote-tasks/introduction">AWS Batch or Kubernetes as the compute backend</a>, we rely on <a href="https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436">our centralized compute platform, Titus</a>. 
Under the hood, Titus is <a href="https://www.slideshare.net/aspyker/herding-kats-netflixs-journey-to-kubernetes-public">powered by Kubernetes</a>, but it provides a thick layer of enhancements over off-the-shelf Kubernetes, to <a href="https://netflixtechblog.com/kubernetes-and-kernel-panics-ed620b9c6225">make it more observable</a>, <a href="https://netflixtechblog.com/evolving-container-security-with-linux-user-namespaces-afbe3308c082">secure</a>, <a href="https://netflixtechblog.com/auto-scaling-production-services-on-titus-1f3cd49f5cd7">scalable</a>, and <a href="https://netflixtechblog.com/predictive-cpu-isolation-of-containers-at-netflix-91f014d856c7">cost-efficient</a>.</p><p>By targeting @titus, Metaflow tasks benefit from these battle-hardened features out of the box, with no in-depth technical knowledge or engineering required from ML engineers or data scientists. However, in order to benefit from scalable compute, we need to help the developer package and rehydrate the whole execution environment of a project in a remote pod in a reproducible manner (preferably quickly). Specifically, we don’t want to ask developers to manually manage their own Docker images, which quickly results in more problems than it solves.</p><p>This is why <a href="https://docs.metaflow.org/scaling/dependencies">Metaflow provides support for dependency management</a> out of the box. Originally, we supported only @conda, but based on our work on <a href="https://github.com/Netflix/metaflow-nflx-extensions">Portable Execution Environments</a>, open-source <a href="https://outerbounds.com/blog/pypi-announcement/">Metaflow gained support for @pypi</a> a few months ago as well.</p><h4>Example use case: Building model explainers</h4><p>Here’s a fascinating example of the usefulness of portable execution environments. For many of our applications, model explainability matters. 
Stakeholders like to understand why models produce a certain output and why their behavior changes over time.</p><p>There are several ways to provide explainability to models, but one way is to train an explainer model based on each trained model. Without going into the details of how this is done exactly, suffice it to say that Netflix trains a lot of models, so we need to train a lot of explainers too.</p><p>Thanks to Metaflow, we can allow each application to choose the best modeling approach for their use cases. Correspondingly, each application brings its own bespoke set of dependencies. Training an explainer model therefore requires:</p><ol><li>Access to the original model and its training environment, and</li><li>Dependencies specific to building the explainer model.</li></ol><p>This poses an interesting challenge in dependency management: we need a higher-order training system, “Explainer flow” in the figure below, which is able to take a full execution environment of another training system as an input and produce a model based on it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WoHEm2yvuo22NRp4qf0W9g.png"></figure><p>Explainer flow is event-triggered by an upstream flow, such as the Model A, B, and C flows in the illustration. The build_environment step uses the metaflow environment command provided by <a href="https://github.com/Netflix/metaflow-nflx-extensions">our portable environments</a>, to build an environment that includes both the requirements of the input model as well as those needed to build the explainer model itself.</p><p>The built environment is given a unique name that depends on the run identifier (to provide uniqueness) as well as the model type. Given this environment, the train_explainer step is then able to refer to this uniquely named environment and operate in an environment that can both access the input model and train the explainer model. 
Note that, unlike typical flows using vanilla @conda or @pypi, the portable environments extension also allows users to fetch those environments at execution time rather than at deploy time, which makes it possible, as in this case, to resolve the environment right before using it in the next step.</p><h3>Orchestration: Maestro</h3><p>If data is the fuel of ML and the compute layer is the muscle, then the nerves must be the orchestration layer. We have talked about the importance of a production-grade workflow orchestrator in the context of Metaflow when <a href="https://netflixtechblog.com/unbundling-data-science-workflows-with-metaflow-and-aws-step-functions-d454780c6280">we released support for AWS Step Functions</a> years ago. Since then, open-source Metaflow has gained support for <a href="https://outerbounds.com/blog/human-centric-data-science-on-kubernetes-with-metaflow/">Argo Workflows</a>, a Kubernetes-native orchestrator, as well as <a href="https://outerbounds.com/blog/better-airflow-with-metaflow/">support for Airflow</a>, which is still widely used by data engineering teams.</p><p>Internally, we use <a href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">a production workflow orchestrator called Maestro</a>. The Maestro post shares details about how the system supports scalability, high availability, and usability, which provide the backbone for all of our Metaflow projects in production.</p><p>A hugely important detail that often goes overlooked is <a href="https://docs.metaflow.org/production/event-triggering">event-triggering</a>: it allows a team to integrate their Metaflow flows with surrounding systems upstream (e.g. ETL workflows), as well as downstream (e.g. 
flows managed by other teams), using a protocol shared by the whole organization, as exemplified by the example use case below.</p><h4>Example use case: Content decision making</h4><p>One of the most business-critical systems running on Metaflow <a href="https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f">supports our content decision making</a>, that is, the question of what content Netflix should bring to the service. We support a massive scale of over 260M subscribers spanning over 190 countries representing hugely diverse cultures and tastes, all of whom we want to delight with our content slate. Reflecting the breadth and depth of the challenge, the systems and models focusing on the question have grown to be very sophisticated.</p><p>We approach the question from multiple angles but we have a core set of data pipelines and models that provide a foundation for decision making. To illustrate the complexity of just the core components, consider this high-level diagram:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rp4sF-nIWgTt8kdt"></figure><p>In this diagram, gray boxes represent integrations to partner teams downstream and upstream, green boxes are various ETL pipelines, and blue boxes are Metaflow flows. These boxes encapsulate hundreds of advanced models and intricate business logic, handling massive amounts of data daily.</p><p>Despite its complexity, the system is managed by a relatively small team of engineers and data scientists autonomously. This is made possible by a few key features of Metaflow:</p><ul><li>All the boxes are event-triggered, orchestrated by Maestro. 
Dependencies between Metaflow flows are triggered via <a href="https://docs.metaflow.org/production/event-triggering/flow-events">@trigger_on_finish</a>, and dependencies on external systems via <a href="https://docs.metaflow.org/production/event-triggering/external-events">@trigger</a>.</li><li>Rapid development is enabled via <a href="https://docs.metaflow.org/scaling/tagging">Metaflow namespaces</a>, so individual developers can develop without interfering with production deployments.</li><li><a href="https://docs.metaflow.org/production/coordinating-larger-metaflow-projects">Branched development and deployment are managed via @project</a>, which also <a href="https://docs.metaflow.org/production/event-triggering/project-events">isolates events between different branches</a>.</li></ul><p>The team has also developed their own domain-specific libraries and configuration management tools, which help them improve and operate the system.</p><h3>Deployment: Cache</h3><p>To produce business value, all our Metaflow projects are deployed to work with other production systems. In many cases, the integration might be via shared tables in our data warehouse. In other cases, it is more convenient to share the results via a low-latency API.</p><p>Notably, not all API-based deployments require real-time evaluation, which we cover in the section below. We have a number of business-critical applications where some or all predictions can be precomputed, guaranteeing the lowest possible latency and operationally simple high availability at the global scale.</p><p>We have developed an officially supported pattern to cover such use cases. 
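In outline (all names below are hypothetical, not the internal metaflow.Cache API), the pattern is: a scheduled batch job precomputes every result that might be requested and writes it to a key-value store, and the online path is a pure lookup with no model evaluation:

```python
# Hypothetical sketch of the precompute-and-cache deployment pattern
# (not Netflix's metaflow.Cache API): a daily batch job writes
# predictions to a key-value store; the online path is a pure lookup.
class KeyValueStore:
    """Stand-in for a cache service such as ElastiCache or DynamoDB."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

def score_model(member_id: int) -> float:
    # Placeholder model; the real computation is arbitrarily expensive.
    return (member_id * 37 % 100) / 100.0

def batch_precompute(store: KeyValueStore, member_ids):
    # Offline: compute every prediction we might need and cache it.
    for member_id in member_ids:
        store.put(f"prediction:{member_id}", score_model(member_id))

def serve(store: KeyValueStore, member_id: int) -> float:
    # Online: lowest possible latency, no model evaluation on this path.
    return store.get(f"prediction:{member_id}", default=0.0)

store = KeyValueStore()
batch_precompute(store, member_ids=range(5))
```

Because the serving path only reads from the store, availability reduces to the availability of the cache itself, which is what makes this pattern operationally simple at global scale.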
While the system relies on our internal caching infrastructure, you could follow the same pattern using services like <a href="https://aws.amazon.com/elasticache/">Amazon ElastiCache</a> or <a href="https://aws.amazon.com/dynamodb/">DynamoDB</a>.</p><h4>Example use case: Content performance visualization</h4><p>The historical performance of titles is used by decision makers to understand and improve the film and series catalog. Performance metrics can be complex and are often best understood by humans with visualizations that break down the metrics across parameters of interest interactively. Content decision makers are equipped with self-serve visualizations through a real-time web application built with metaflow.Cache, which is accessed through an API provided with metaflow.Hosting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i3YtLUvobXxEYE6crwQy_w.png"></figure><p>A daily scheduled Metaflow job computes aggregate quantities of interest in parallel. The job writes a large volume of results to an online key-value store using metaflow.Cache. A <a href="https://streamlit.io/">Streamlit</a> app houses the visualization software and data aggregation logic. Users can dynamically change parameters of the visualization application, and in real time a message is sent to a simple <a href="https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d#a890">Metaflow hosting service</a>, which looks up values in the cache, performs computation, and returns the results as a JSON blob to the Streamlit application.</p><h3>Deployment: Metaflow Hosting</h3><p>For deployments that require an API and real-time evaluation, we provide an integrated model hosting service, Metaflow Hosting. 
Although details have evolved a lot, <a href="https://www.youtube.com/watch?v=sBM5cSBGZS4">this old talk still gives a good overview of the service</a>.</p><p>Metaflow Hosting is specifically geared towards hosting artifacts or models produced in Metaflow. This provides an easy-to-use interface on top of Netflix’s existing microservice infrastructure, allowing data scientists to quickly move their work from experimentation to a production-grade web service that can be consumed over an HTTP REST API with minimal overhead.</p><p>Its key benefits include:</p><ul><li>Simple decorator syntax to create RESTful endpoints.</li><li>The back-end auto-scales the number of instances used to back your service based on traffic.</li><li>The back-end will scale to zero if no requests are made to it after a specified amount of time, thereby saving cost, particularly if your service requires GPUs to effectively produce a response.</li><li>Request logging, alerts, monitoring, and tracing hooks to Netflix infrastructure.</li></ul><p>Consider the service similar to managed model hosting services like <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html">AWS SageMaker Model Hosting</a>, but tightly integrated with our microservice infrastructure.</p><h4>Example use case: Media</h4><p>We have a long history of using machine learning to process media assets, for instance, to <a href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">personalize artwork</a> and to help our <a href="https://netflixtechblog.com/new-series-creating-media-with-machine-learning-5067ac110bcd">creatives create promotional content</a> efficiently. 
Processing large amounts of media assets is technically non-trivial and computationally expensive, so over the years, we have developed plenty of <a href="https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359">specialized infrastructure</a> dedicated to this purpose in general, and <a href="https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243">infrastructure supporting media ML use cases</a> in particular.</p><p>To demonstrate the benefits of Metaflow Hosting, which provides a general-purpose API layer supporting both synchronous and asynchronous queries, consider this use case involving <a href="https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243">Amber, our feature store for media</a>.</p><p>While Amber is a feature <em>store</em>, precomputing and storing all media features in advance would be infeasible. Instead, we compute and cache features on an on-demand basis, as depicted below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SP7GASNee-YB_dDu35ldJA.png"></figure><p>When a service requests a feature from Amber, it computes the feature dependency graph and then sends one or more asynchronous requests to Metaflow Hosting, which places the requests in a queue, eventually triggering feature computations when compute resources become available. Metaflow Hosting caches the response, so Amber can fetch it after a while. We could have built a dedicated microservice just for this use case, but thanks to the flexibility of Metaflow Hosting, we were able to ship the feature faster with no additional operational burden.</p><h3>Future Work</h3><p>Our appetite to apply ML in diverse use cases is only increasing, so our Metaflow platform will keep expanding its footprint correspondingly and continue to provide delightful integrations to systems built by other teams at Netflix. 
For instance, we have plans to work on improvements in the versioning layer, which wasn’t covered by this article, by giving more options for artifact and model management.</p><p>We also plan on building more integrations with other systems that are being developed by sister teams at Netflix. As an example, Metaflow Hosting models are currently not well integrated into model logging facilities — we plan on working on improving this to make models developed with Metaflow more integrated with the feedback loop critical in training new models. We hope to do this in a pluggable manner that would allow other users to integrate with their own logging systems.</p><p>Additionally, we want to supply more ways for Metaflow artifacts and models to be integrated into non-Metaflow environments and applications, e.g. JVM-based edge services, so that Python-based data scientists can contribute to non-Python engineering systems easily. This would allow us to better bridge the gap between the quick iteration that Metaflow provides (in Python) and the requirements and constraints imposed by the infrastructure serving Netflix member-facing requests.</p><p>If you are building business-critical ML or AI systems in your organization, <a href="http://chat.metaflow.org/">join the Metaflow Slack community</a>! We are happy to share experiences, answer any questions, and welcome you to contribute to Metaflow.</p><h4>Acknowledgements:</h4><p>Thanks to Wenbing Bai, Jan Florjanczyk, Michael Li, Aliki Mavromoustaki, and Sejal Rai for help with use cases and figures. 
Thanks to our OSS contributors for making Metaflow a better product.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=2d2e6b6d205d" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d">Supporting Diverse ML Systems at Netflix</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d</link>
      <guid>https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d</guid>
      <pubDate>Thu, 07 Mar 2024 19:33:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Bending pause times to your will with Generational ZGC]]></title>
<description><![CDATA[<p><em>The surprising and not so surprising benefits of generations in the Z Garbage Collector.</em></p><p>By Danny Thomas, JVM Ecosystem Team</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GuEZ-RMhzNnYgLQd"></figure><p>The latest long-term support release of the JDK delivers generational support for the <a href="https://docs.oracle.com/en/java/javase/21/gctuning/z-garbage-collector.html">Z Garbage Collector</a>.</p><p>More than half of our critical streaming video services are now running on JDK 21 with Generational ZGC, so it’s a good time to talk about our experience and the benefits we’ve seen. If you’re interested in how we use Java at Netflix, Paul Bakker’s talk <a href="https://www.infoq.com/presentations/netflix-java/">How Netflix Really Uses Java</a> is a great place to start.</p><h3>Reduced tail latencies</h3><p>In both our gRPC and <a href="https://netflix.github.io/dgs/">DGS Framework</a> services, GC pauses are a significant source of tail latencies. That’s particularly true of our gRPC clients and servers, where request cancellations due to timeouts interact with reliability features such as retries, hedging and fallbacks. Each of these errors is a canceled request resulting in a retry, so this reduction further reduces overall service traffic by this rate:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SCVt4VGlA517hZDi"><figcaption>Error rates per second. Previous week in white vs current cancellation rate in purple, as ZGC was enabled on a service cluster on November 15</figcaption></figure><p>Removing the noise of pauses also allows us to identify actual sources of latency end-to-end, which would otherwise be hidden in the noise, as maximum pause time outliers can be significant:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rW029WscxSKDQRQ6"><figcaption>Maximum GC pause times by cause, for the same service cluster as above. 
Yes, those ZGC pauses really are usually under one millisecond</figcaption></figure><h3>Efficiency</h3><p>Even after we saw very promising results in our evaluation, we expected the adoption of ZGC to be a trade-off: a little less application throughput, due to store and load barriers, work performed in thread local handshakes, and the GC competing with the application for resources. We considered that an acceptable trade-off, as avoiding pauses provided benefits that would outweigh that overhead.</p><p>In fact, we’ve found for our services and architecture that there is no such trade-off. For a given CPU utilization target, ZGC improves both average and P99 latencies with equal or better CPU utilization when compared to G1.</p><p>The consistency in request rates, request patterns, response time and allocation rates we see in many of our services certainly helps ZGC, but we’ve found it’s equally capable of handling less consistent workloads (with exceptions, of course; more on that below).</p><h3>Operational simplicity</h3><p>Service owners often reach out to us with questions about excessive pause times and for help with tuning. We have several frameworks that periodically refresh large amounts of on-heap data to avoid external service calls for efficiency. These periodic refreshes of on-heap data are great at taking G1 by surprise, resulting in pause time outliers well beyond the default pause time goal.</p><p>This long-lived on-heap data was the major contributor to us not adopting non-generational ZGC previously. In the worst case we evaluated, non-generational ZGC caused 36% more CPU utilization than G1 for the same workload. That became a nearly 10% improvement with generational ZGC.</p><p>Half of all services required for streaming video use our <a href="https://hollow.how/">Hollow</a> library for on-heap metadata.
Removing pauses as a concern allowed us to <a href="https://github.com/Netflix/hollow/commit/4f21ab593543bb622d9ccea2f8e6295eae5e8080">remove array pooling mitigations</a>, freeing hundreds of megabytes of memory for allocations.</p><p>Operational simplicity also stems from ZGC’s heuristics and defaults. No explicit tuning has been required to achieve these results. Allocation stalls are rare, typically coinciding with abnormal spikes in allocation rates, and are shorter than the average pause times we saw with G1.</p><h3>Memory overhead</h3><p>We expected that losing <a href="https://shipilev.net/jvm/anatomy-quarks/23-compressed-references/">compressed references</a> on heaps &lt; 32G, due to <a href="https://youtu.be/YyXjC68l8mw?t=816">colored pointers requiring 64-bit object pointers</a>, would be a major factor in the choice of a garbage collector.</p><p>We’ve found that while that’s an important consideration for stop-the-world GCs, that’s not the case for ZGC, where even on small heaps the increase in allocation rate is amortized by the efficiency and operational improvements. Our thanks to Erik Österlund at Oracle for explaining the less intuitive benefits of colored pointers when it comes to concurrent garbage collectors, which led us to evaluate ZGC more broadly than initially planned.</p><p>In the majority of cases ZGC is also able to consistently make more memory available to the application:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3eTNEdI2mHfL1Yvk"><figcaption>Used vs available heap capacity following each GC cycle, for the same service cluster as above</figcaption></figure><p>ZGC has a fixed overhead of 3% of the heap size, requiring more native memory than G1. Except in a couple of cases, there’s been no need to lower the maximum heap size to allow for more headroom, and those were services with greater-than-average native memory needs.</p><p>Reference processing is also only performed in major collections with ZGC.
We paid particular attention to deallocation of direct byte buffers, but we haven’t seen any impact thus far. This difference in reference processing did cause a <a href="https://bugs.openjdk.org/browse/JDK-8321178">performance problem with JSON thread dump support</a>, but that’s an unusual situation caused by a framework accidentally creating an unused ExecutorService instance for every request.</p><h3>Transparent huge pages</h3><p>Even if you’re not using ZGC, you probably should be using huge pages, and <a href="https://shipilev.net/jvm/anatomy-quarks/2-transparent-huge-pages/">transparent huge pages</a> is the most convenient way to use them.</p><p>ZGC uses shared memory for the heap, and many Linux distributions configure shmem_enabled to <em>never</em>, which silently prevents ZGC from using huge pages with -XX:+UseTransparentHugePages.</p><p>Here we have a service deployed with no other change but shmem_enabled going from never to advise, reducing CPU utilization significantly:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bGoc3W9P_E2kjghe"><figcaption>Deployment moving from 4k to 2m pages. Ignore the gap, that’s our immutable deployment process temporarily doubling the cluster capacity</figcaption></figure><p>Our default configuration:</p><ul><li>Sets heap minimum and maximum to equal size</li><li>Configures -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch</li><li>Uses the following transparent_hugepage configuration:</li></ul><pre>echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled<br>echo advise | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled<br>echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag<br>echo 1 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/defrag</pre><h3>What workloads weren’t a good fit?</h3><p>There is no best garbage collector.
Each trades off collection throughput, application latency and resource utilization depending on the goal of the garbage collector.</p><p>For the workloads that have performed better with G1 vs ZGC, we’ve found that they tend to be more throughput-oriented, with very spiky allocation rates and long-running tasks holding objects for unpredictable periods.</p><p>A notable example was a service with very spiky allocation rates and large numbers of long-lived objects, which happened to be a particularly good fit for G1’s pause time goal and old region collection heuristics. It allowed G1 to avoid unproductive work in GC cycles that ZGC couldn’t.</p><p>The switch to ZGC by default has provided the perfect opportunity for application owners to think about their choice of garbage collector. Several batch/precompute cases had been using G1 by default, where they would have seen better throughput from the parallel collector. In one large precompute workload we saw a 6–8% improvement in application throughput, shaving an hour off the batch time, versus G1.</p><h3>Try it for yourself!</h3><p>Left unquestioned, assumptions and expectations could have caused us to miss one of the most impactful changes we’ve made to our operational defaults in a decade. We’d encourage you to try generational ZGC for yourself. It might surprise you as much as it surprised us.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=256629c9386b" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/bending-pause-times-to-your-will-with-generational-zgc-256629c9386b">Bending pause times to your will with Generational ZGC</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/bending-pause-times-to-your-will-with-generational-zgc-256629c9386b</link>
      <guid>https://netflixtechblog.com/bending-pause-times-to-your-will-with-generational-zgc-256629c9386b</guid>
      <pubDate>Wed, 06 Mar 2024 02:35:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…]]></title>
      <description><![CDATA[<h3>Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data Platform</h3><p>by <a href="https://www.linkedin.com/in/binbing-hou/overlay/about-this-profile/">Binbing Hou</a>, <a href="https://www.linkedin.com/in/stephanievezich/overlay/about-this-profile/">Stephanie Vezich Tamayo</a>, <a href="https://www.linkedin.com/in/chenxiao000/overlay/about-this-profile/">Xiao Chen</a>, <a href="https://www.linkedin.com/in/liangtian/overlay/about-this-profile/">Liang Tian</a>, <a href="https://www.linkedin.com/in/troy-ristow-4899b49/overlay/about-this-profile/">Troy Ristow</a>, <a href="https://www.linkedin.com/in/haoyuanwang/overlay/about-this-profile/">Haoyuan Wang</a>, <a href="https://www.linkedin.com/in/snehalchennuru/overlay/about-this-profile/">Snehal Chennuru</a>, <a href="https://www.linkedin.com/in/pawan-dixit-b4307b2/overlay/about-this-profile/">Pawan Dixit</a></p><p><em>This is the first of the series of our work at Netflix on leveraging data insights and Machine Learning (ML) to improve the operational automation around the performance and cost efficiency of big data jobs. Operational automation–including but not limited to, auto diagnosis, auto remediation, auto configuration, auto tuning, auto scaling, auto debugging, and auto testing–is key to the success of modern data platforms. In this blog post, we present our project on Auto Remediation, which integrates the currently used rule-based classifier with an ML service and aims to automatically remediate failed jobs without human intervention. 
We have deployed Auto Remediation in production for handling memory configuration errors and unclassified errors of Spark jobs and observed its efficiency and effectiveness (e.g., automatically remediating 56% of memory configuration errors and saving 50% of the monetary costs caused by all errors) and great potential for further improvements.</em></p><h3>Introduction</h3><p>At Netflix, hundreds of thousands of workflows and millions of jobs are running per day across multiple layers of the big data platform. Given the extensive scope and intricate complexity inherent to such a distributed, large-scale system, even if the failed jobs account for a tiny portion of the total workload, diagnosing and remediating job failures can cause considerable operational burdens.</p><p>For efficient error handling, Netflix developed an error classification service, called Pensive, which leverages a rule-based classifier for error classification. The rule-based classifier classifies job errors based on a set of predefined rules and provides insights for schedulers to decide whether to retry the job and for engineers to diagnose and remediate the job failure.</p><p>However, as the system has increased in scale and complexity, the rule-based classifier has been facing challenges due to its limited support for operational automation, especially for handling memory configuration errors and unclassified errors. Therefore, the operational cost increases linearly with the number of failed jobs. In some cases–for example, diagnosing and remediating job failures caused by Out-Of-Memory (OOM) errors–joint effort across teams is required, involving not only the users themselves, but also the support engineers and domain experts.</p><p>To address these challenges, we have developed a new feature, called <em>Auto Remediation</em>, which integrates the rule-based classifier with an ML service. 
Based on the classification from the rule-based classifier, Auto Remediation uses an ML service to predict the retry success probability and retry cost and to select the best candidate configuration as a recommendation, and a configuration service to automatically apply the recommendations. Its major advantages are as follows:</p><ul><li><strong>Integrated intelligence. </strong>Instead of completely deprecating the current rule-based classifier, Auto Remediation integrates the classifier with an ML service so that it can leverage the merits of both: the rule-based classifier provides static, deterministic classification results per error class, based on the knowledge of domain experts; the ML service provides performance- and cost-aware recommendations per job, leveraging the power of ML. With the integrated intelligence, we can properly meet the requirements of remediating different errors.</li><li><strong>Fully automated.</strong> The pipeline of classifying errors, getting recommendations, and applying recommendations is fully automated. It provides the recommendations together with the retry decision to the scheduler, and particularly uses an online configuration service to store and apply recommended configurations. In this way, no human intervention is required in the remediation process.</li><li><strong>Multi-objective optimizations. </strong>Auto Remediation generates recommendations by considering both performance (i.e., the retry success probability) and compute cost efficiency (i.e., the monetary costs of running the job) to avoid blindly recommending configurations with excessive resource consumption. For example, for memory configuration errors, it searches multiple parameters related to the memory usage of job execution and recommends the combination that minimizes a linear combination of failure probability and compute cost.</li></ul><p>These advantages have been verified by the production deployment for remediating Spark jobs’ failures.
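</p><p>As a concrete illustration, the multi-objective trade-off can be sketched as a single score to minimize. This is a hypothetical sketch: the weight, candidate names, and predicted numbers below are illustrative placeholders, not Netflix’s actual model outputs.</p>

```python
# Hypothetical sketch of the multi-objective recommendation step: score each
# candidate Spark configuration by a linear combination of predicted retry
# failure probability and predicted retry cost, then pick the minimum.
# The weight, candidate names, and predictions are illustrative only.

def remediation_score(p_fail, cost_dollars, cost_weight=0.1):
    """Lower is better: balances failure probability against compute cost."""
    return p_fail + cost_weight * cost_dollars

candidates = {
    "8g_4cores": (0.40, 2.0),   # (predicted failure prob, predicted cost in $)
    "12g_4cores": (0.15, 3.0),
    "16g_8cores": (0.10, 6.0),
}
# Pick the candidate configuration with the lowest combined score.
best = min(candidates, key=lambda name: remediation_score(*candidates[name]))
```

<p>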
Our observations indicate that Auto Remediation can successfully remediate about 56% of all memory configuration errors by applying the recommended memory configurations online without human intervention, and meanwhile reduce costs by about 50% due to its ability to recommend new configurations that make retries successful and to disable unnecessary retries for unclassified errors. We have also noted great potential for further improvement by model tuning (see the Rollout in Production section).</em></p><h3>Rule-based Classifier: Basics and Challenges</h3><h4>Basics</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pnViNRB4q-LX7rcdn6MgHA.png"></figure><p>Figure 1 illustrates the error classification service, i.e., Pensive, in the data platform. It leverages the rule-based classifier and is composed of three components:</p><ul><li><strong>Log Collector</strong> is responsible for pulling logs from different platform layers for error classification (e.g., the scheduler, job orchestrator, and compute clusters).</li><li><strong>Rule Execution Engine</strong> is responsible for matching the collected logs against a set of predefined rules. A rule includes (1) the name, source, log, and summary of the error and whether the error is restartable; and (2) the regex to identify the error from the log. For example, the rule with the name SparkDriverOOM includes the information indicating that if the stdout log of a Spark job can match the regex <em>SparkOutOfMemoryError:</em>, then this error is classified as a user error, not restartable.</li><li><strong>Result Finalizer</strong> is responsible for finalizing the error classification result based on the matched rules. If one or multiple rules are matched, then the classification of the first matched rule determines the final classification result (the rule priority is determined by the rule ordering, and the first rule has the highest priority).
On the other hand, if no rules are matched, then this error will be considered unclassified.</li></ul><h4>Challenges</h4><p>While the rule-based classifier is simple and has been effective, it is facing challenges due to its limited ability to handle the errors caused by misconfigurations and classify new errors:</p><ul><li><strong>Memory configuration errors. </strong>The rule-based classifier provides error classification results indicating whether to restart the job; however, for non-transient errors, it still relies on engineers to manually remediate the job. The most notable example is memory configuration errors. Such errors are generally caused by the misconfiguration of job memory. Setting an excessively small memory can result in Out-Of-Memory (OOM) errors, while setting an excessively large memory can waste cluster memory resources. What’s more challenging is that some memory configuration errors require changing the configurations of multiple parameters. Thus, setting a proper memory configuration requires not only manual operation but also expertise in Spark job execution. In addition, even if a job’s memory configuration is initially well tuned, changes in data size or job definition can cause performance to degrade. Given that about 600 memory configuration errors per month are observed in the data platform, timely remediation of memory configuration errors alone requires non-trivial engineering efforts.</li><li><strong>Unclassified errors. </strong>The rule-based classifier relies on data platform engineers to manually add rules for recognizing errors based on the known context; otherwise, the errors will be unclassified. Due to the migrations of different layers of the data platform and the diversity of applications, existing rules can be invalid, and adding new rules requires engineering efforts and also depends on the deployment cycle.
More than 300 rules have been added to the classifier, yet about 50% of all failures remain unclassified. For unclassified errors, the job may be retried multiple times with the default retry policy. If the error is non-transient, these failed retries incur unnecessary job running costs.</li></ul><h3>Evolving to Auto Remediation: Service Architecture</h3><h4>Methodology</h4><p>To address the above-mentioned challenges, our basic methodology is to integrate the rule-based classifier with an ML service to generate recommendations, and use a configuration service to apply the recommendations automatically:</p><ul><li><strong>Generating recommendations. </strong>We use the rule-based classifier as the first pass to classify all errors based on predefined rules, and the ML service as the second pass to provide recommendations for memory configuration errors and unclassified errors.</li><li><strong>Applying recommendations. </strong>We use an online configuration service to store and apply the recommended configurations. The pipeline is fully automated, and the services used to generate and apply recommendations are decoupled.</li></ul><h4>Service Integrations</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2eENd1mhwyGpMWNccEwqlQ.png"></figure><p>Figure 2 illustrates the integration of the services generating and applying the recommendations in the data platform. The major services are as follows:</p><ul><li><strong>Nightingale</strong> is a service running the ML model trained using <a href="https://metaflow.org/">Metaflow</a> and is responsible for generating a retry recommendation. The recommendation includes (1) whether the error is restartable; and (2) if so, the recommended configurations to restart the job.</li><li><strong>ConfigService</strong> is an online configuration service. 
The recommended configurations are saved in <strong>ConfigService</strong> as a JSON patch with a scope defined to specify the jobs that can use the recommended configurations. When <strong>Scheduler</strong> calls <strong>ConfigService</strong> to get recommended configurations, <strong>Scheduler</strong> passes the original configurations to <strong>ConfigService</strong> and <strong>ConfigService</strong> returns the mutated configurations by applying the JSON patch to the original configurations. <strong>Scheduler</strong> can then restart the job with the mutated configurations (including the recommended configurations).</li><li><strong>Pensive</strong> is an error classification service that leverages the rule-based classifier. It calls <strong>Nightingale</strong> to get recommendations and stores the recommendations in <strong>ConfigService</strong> so that they can be picked up by <strong>Scheduler</strong> to restart the job.</li><li><strong>Scheduler </strong>is the service scheduling jobs (our current implementation is with <a href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">Netflix Maestro</a>). Each time a job fails, it calls <strong>Pensive</strong> to get the error classification to decide whether to restart the job and calls <strong>ConfigService</strong> to get the recommended configurations for restarting the job.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gyXv3JyvhUODQWecQqy1zg.png"></figure><p>Figure 3 illustrates the sequence of service calls with Auto Remediation:</p><ol><li>Upon a job failure, <strong>Scheduler</strong> calls <strong>Pensive</strong> to get the error classification.</li><li><strong>Pensive</strong> classifies the error based on the rule-based classifier.
If the error is identified to be a memory configuration error or an unclassified error, it calls <strong>Nightingale</strong> to get recommendations.</li><li>With the obtained recommendations, <strong>Pensive</strong> updates the error classification result and saves the recommended configurations to <strong>ConfigService</strong>; and then returns the error classification result to <strong>Scheduler</strong>.</li><li>Based on the error classification result received from <strong>Pensive, Scheduler</strong> determines whether to restart the job.</li><li>Before restarting the job, <strong>Scheduler</strong> calls <strong>ConfigService</strong> to get the recommended configuration and retries the job with the new configuration.</li></ol><h3>Evolving to Auto Remediation: ML Service</h3><h4>Overview</h4><p>The ML service, i.e., Nightingale, aims to generate a retry policy for a failed job that trades off between retry success probability and job running costs. It consists of two major components:</p><ul><li><strong>A prediction model</strong> that jointly estimates a) probability of retry success, and b) retry cost in dollars, conditional on properties of the retry.</li><li><strong>An optimizer</strong> which explores the Spark configuration parameter space to recommend a configuration which minimizes a linear combination of retry failure probability and cost.</li></ul><p>The prediction model is retrained offline daily, and is called by the optimizer to evaluate each candidate set of configuration parameter values. The optimizer runs in a RESTful service which is called upon job failure. If there is a feasible configuration solution from the optimization, the response includes this recommendation, which ConfigService uses to mutate the configuration for the retry. 
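</p><p>The configuration mutation step can be pictured as a small patch applied over the job’s original settings. This is an illustrative sketch only: a plain key-overwrite merge stands in for ConfigService’s actual JSON patch semantics, and the keys and values are hypothetical.</p>

```python
# Illustrative sketch of applying a recommended-configuration patch to a
# job's original configuration. A plain key-overwrite merge stands in for
# ConfigService's JSON patch mechanism; keys and values are hypothetical.

def apply_recommendation(original, patch):
    """Return a mutated copy in which recommended keys override originals."""
    mutated = dict(original)
    mutated.update(patch)
    return mutated

original = {"spark.executor.memory": "4g", "spark.executor.cores": "4"}
recommended = {"spark.executor.memory": "7g"}
# The retry runs with the mutated configuration; the original is untouched.
mutated = apply_recommendation(original, recommended)
```

<p>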
If there is no feasible solution–in other words, it is unlikely the retry will succeed by changing Spark configuration parameters alone–the response includes a flag to disable retries and thus eliminate wasted compute cost.</p><h4>Prediction Model</h4><p>Given that we want to explore how retry success and retry cost might change under different configuration scenarios, we need some way to predict these two values using the information we have about the job. Data Platform logs both retry success outcome and execution cost, giving us reliable labels to work with. Since we use a shared feature set to predict both targets, have good labels, and need to run inference quickly online to meet SLOs, we decided to formulate the problem as a multi-output supervised learning task. In particular, we use a simple Feedforward Multilayer Perceptron (MLP) with two heads, one to predict each outcome.</p><p><strong>Training: </strong>Each record in the training set represents a potential retry which previously failed due to memory configuration errors or unclassified errors. The labels are: a) did retry fail, b) retry cost. The raw feature inputs are largely unstructured metadata about the job such as the Spark execution plan, the user who ran it, and the Spark configuration parameters and other job properties. We split these features into those that can be parsed into numeric values (e.g., Spark executor memory parameter) and those that cannot (e.g., user name). We used feature hashing to process the non-numeric values because they come from a high cardinality and dynamic set of values. We then create a lower dimensionality embedding which is concatenated with the normalized numeric values and passed through several more layers.</p><p><strong>Inference: </strong>Upon passing validation audits, each new model version is stored in <a href="https://metaflow.org/">Metaflow</a> Hosting, a service provided by our internal ML Platform. 
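</p><p>The feature hashing described above for high-cardinality inputs can be illustrated with a minimal sketch; the bucket count and example feature strings are assumptions for illustration, not the production encoding.</p>

```python
import hashlib

# Minimal sketch of feature hashing: map unbounded string features (user
# names, plan fragments, ...) into a fixed-size count vector. The bucket
# count and example features are hypothetical.

def feature_bucket(value, n_buckets=16):
    """Deterministically map a string feature to a bucket index."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_features(values, n_buckets=16):
    """Encode a list of string features as a fixed-length count vector."""
    vec = [0] * n_buckets
    for value in values:
        vec[feature_bucket(value, n_buckets)] += 1
    return vec

# The vector always has length n_buckets, however many distinct values exist.
vec = hash_features(["user:alice", "plan:HashAggregate"])
```

<p>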
The optimizer makes several calls to the model prediction function for each incoming configuration recommendation request, described in more detail below.</p><h4>Optimizer</h4><p>When a job attempt fails, a request is sent to Nightingale with a job identifier. From this identifier, the service constructs the feature vector to be used in inference calls. As described previously, some of these features are Spark configuration parameters that are candidates to be mutated (e.g., spark.executor.memory, spark.executor.cores). The set of Spark configuration parameters was based on distilled knowledge of domain experts who work on Spark performance tuning extensively. We use Bayesian Optimization (implemented via Meta’s <a href="https://ax.dev/">Ax library</a>) to explore the configuration space and generate a recommendation. At each iteration, the optimizer generates a candidate parameter value combination (e.g., spark.executor.memory=7192 mb, spark.executor.cores=8), then evaluates that candidate by calling the prediction model to estimate retry failure probability and cost using the candidate configuration (i.e., mutating their values in the feature vector). After a fixed number of iterations is exhausted, the optimizer returns the “best” configuration solution (i.e., that which minimized the combined retry failure and cost objective) for ConfigService to use if it is feasible. If no feasible solution is found, we disable retries.</p><p>One downside of the iterative design of the optimizer is that any bottleneck can block completion and cause a timeout, which we initially observed in a non-trivial number of cases. Upon further profiling, we found that most of the latency came from the candidate generation step (i.e., figuring out which directions to step in the configuration space after the previous iteration’s evaluation results).
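</p><p>The evaluate-and-select loop can be sketched as follows, with random sampling standing in for the Bayesian optimization performed by Ax. Everything here is a hypothetical placeholder: the predict() stub, the parameter grids, and the weights are not the production model.</p>

```python
import random

# Hypothetical sketch of the optimizer loop: sample candidate Spark settings,
# score each with a stub prediction model, and keep the best-scoring one.
# Random sampling stands in for Bayesian optimization; predict() and all
# numbers are placeholders, not the production model.

def predict(executor_memory_mb, executor_cores):
    """Stub model: returns (retry failure probability, retry cost in $)."""
    p_fail = max(0.05, 1.0 - executor_memory_mb / 16384)
    cost = 0.0004 * executor_memory_mb * executor_cores
    return p_fail, cost

def recommend(iterations=50, cost_weight=0.01, seed=0):
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(iterations):
        cfg = (rng.choice([4096, 8192, 12288, 16384]), rng.choice([2, 4, 8]))
        p_fail, cost = predict(*cfg)
        score = p_fail + cost_weight * cost
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg  # None would mean "no feasible retry"

cfg = recommend()
```

<p>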
We found that this issue had been raised to Ax library owners, who <a href="https://github.com/facebook/Ax/issues/810">added GPU acceleration options in their API</a>. Leveraging this option decreased our timeout rate substantially.</p><h3>Rollout in Production</h3><p>We have deployed Auto Remediation in production to handle memory configuration errors and unclassified errors for Spark jobs. Besides the retry success probability and cost efficiency, the impact on user experience is the major concern:</p><ul><li><strong>For memory configuration errors: </strong>Auto remediation improves user experience because a retry of a memory configuration error is rarely successful without a new configuration. This means that a successful retry with the recommended configurations can reduce the operational load and save job running costs, while a failed retry does not make the user experience worse.</li><li><strong>For unclassified errors: </strong>Auto remediation recommends whether to restart the job if the error cannot be classified by existing rules in the rule-based classifier. In particular, if the ML model predicts that the retry is very likely to fail, it will recommend disabling the retry, which can save the job running costs for unnecessary retries. For cases in which the job is business-critical and the user prefers always retrying the job even if the retry success probability is low, we can add a new rule to the rule-based classifier so that the same error will be classified by the rule-based classifier next time, skipping the recommendations of the ML service. This demonstrates the advantage of the integrated intelligence of the rule-based classifier and the ML service.</li></ul><p>The deployment in production has demonstrated that Auto Remediation can provide effective configurations for memory configuration errors, successfully remediating about 56% of all memory configuration errors without human intervention.
It also decreases the compute cost of these jobs by about 50% because it can either recommend new configurations to make the retry successful or disable unnecessary retries. As tradeoffs between performance and cost efficiency are tunable, we can decide to achieve a higher success rate or more cost savings by tuning the ML service.</p><p>It is worth noting that the ML service is currently adopting a conservative policy to disable retries. As discussed above, this is to avoid affecting cases in which users prefer to always retry the job upon failure. Although these cases are expected and can be addressed by adding new rules to the rule-based classifier, we believe that incrementally tuning the objective function to gradually disable more retries will help maintain a desirable user experience. Given that the current policy for disabling retries is conservative, Auto Remediation has great potential to eventually bring much larger cost savings without affecting the user experience.</p><h3>Beyond Error Handling: Towards Right Sizing</h3><p>Auto Remediation is our first step in leveraging data insights and Machine Learning (ML) for improving user experience, reducing the operational burden, and improving cost efficiency of the data platform. It focuses on automating the remediation of failed jobs, but also paves the way to automating operations other than error handling.</p><p>One of the initiatives we are taking, called <em>Right Sizing</em>, is to reconfigure scheduled big data jobs to request the proper resources for job execution. For example, we have noted that the average requested executor memory of Spark jobs is about four times their max used memory, indicating significant overprovisioning. In addition to the configurations of the job itself, the resource overprovision of the container that is requested to execute the job can also be reduced for cost savings.
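</p><p>As a simple illustration of that direction, a right-sizing heuristic might cap the request at observed peak usage plus headroom. The 25% headroom factor and the sample numbers below are assumptions for illustration, not Netflix’s actual policy.</p>

```python
# Hypothetical right-sizing heuristic: shrink a job's requested executor
# memory toward its observed peak usage plus a safety headroom. The 25%
# headroom factor and the example numbers are illustrative assumptions.

def right_size_memory_mb(requested_mb, max_used_mb, headroom=1.25):
    """Recommend the smaller of the current request and peak usage + headroom."""
    recommended = int(max_used_mb * headroom)
    return min(requested_mb, recommended)

# A job requesting 4x its peak usage, matching the average overprovision
# observed above:
new_request = right_size_memory_mb(requested_mb=16384, max_used_mb=4096)
```

<p>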
With heuristic- and ML-based methods, we can infer the proper configurations of job execution to minimize resource overprovisioning and save millions of dollars per year without affecting performance. Similar to Auto Remediation, these configurations can be automatically applied via ConfigService without human intervention. Right Sizing is in progress and will be covered in more detail in a dedicated technical blog post later. Stay tuned.</p><h3>Acknowledgements</h3><p>Auto Remediation is joint work by engineers from different teams and organizations. This work would not have been possible without solid, in-depth collaboration. We would like to thank everyone, including Spark experts, data scientists, ML engineers, the scheduler and job orchestrator engineers, data engineers, and support engineers, for sharing the context and providing constructive suggestions and valuable feedback (e.g., <a href="https://www.linkedin.com/in/jzhuge/">John Zhuge</a>, <a href="https://www.linkedin.com/in/jheua/">Jun He</a>, <a href="https://www.linkedin.com/in/holdenkarau/">Holden Karau</a>, <a href="https://www.linkedin.com/in/samarthjain11/">Samarth Jain</a>, <a href="https://www.linkedin.com/in/julianjaffe/">Julian Jaffe</a>, <a href="https://www.linkedin.com/in/batul-shajapurwala-3274b863/">Batul Shajapurwala</a>, <a href="https://www.linkedin.com/in/michael-sachs-b2453b/overlay/about-this-profile/">Michael Sachs</a>, <a href="https://www.linkedin.com/in/fzsiddiqi/overlay/about-this-profile/">Faisal Siddiqi</a>).</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=039d5efd115b" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/evolving-from-rule-based-classifier-machine-learning-powered-auto-remediation-in-netflix-data-039d5efd115b">Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data…</a> was originally published in <a
href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/evolving-from-rule-based-classifier-machine-learning-powered-auto-remediation-in-netflix-data-039d5efd115b</link>
      <guid>https://netflixtechblog.com/evolving-from-rule-based-classifier-machine-learning-powered-auto-remediation-in-netflix-data-039d5efd115b</guid>
      <pubDate>Mon, 04 Mar 2024 19:01:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Announcing bpftop: Streamlining eBPF performance optimization]]></title>
      <description><![CDATA[<p><em>By </em><a href="https://www.linkedin.com/in/josefernandezmn/"><em>Jose Fernandez</em></a></p><p>Today, we are thrilled to announce the release of <a href="https://github.com/Netflix/bpftop">bpftop</a>, a command-line tool designed to streamline the performance optimization and monitoring of eBPF applications. As Netflix increasingly adopts eBPF [<a href="https://netflixtechblog.com/extending-vector-with-ebpf-to-inspect-host-and-container-performance-5da3af4c584b">1</a>, <a href="https://netflixtechblog.com/how-netflix-uses-ebpf-flow-logs-at-scale-for-network-insight-e3ea997dca96">2</a>], applying the same rigor to these applications as we do to other managed services is imperative. Striking a balance between eBPF’s benefits and system load is crucial, ensuring it enhances rather than hinders our operational efficiency. This tool enables Netflix to embrace eBPF’s potential.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/135/1*pum8HfOu5XkB9XEwsKl_hw.png"></figure><h3>Introducing bpftop</h3><p>bpftop provides a dynamic real-time view of running eBPF programs. It displays the average execution runtime, events per second, and estimated total CPU % for each program. This tool minimizes overhead by enabling performance statistics only while it is active.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hLvLcNDb6RljixhN8la-lg.gif"></figure><p>bpftop simplifies the performance optimization process for eBPF programs by enabling an efficient cycle of benchmarking, code refinement, and immediate feedback. Without bpftop, optimization efforts would require manual calculations, adding unnecessary complexity to the process. With bpftop, users can quickly establish a baseline, implement improvements, and verify enhancements, streamlining the process.</p><p>A standout feature of this tool is its ability to display the statistics in time series graphs. 
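</p><p>The per-interval arithmetic behind those numbers is simple. The following sketch is an illustrative Python rendering (bpftop itself is written in Rust; this is not its actual source) of how two consecutive samples of the kernel’s cumulative per-program counters yield the displayed statistics:</p>

```python
# Illustrative math only: with statistics enabled, the kernel accumulates
# run_time_ns and run_cnt per eBPF program. Sampling them once per second
# yields the average runtime, events/sec, and estimated CPU% per interval.

def interval_stats(prev, curr, interval_ns=1_000_000_000, num_cpus=1):
    """Per-interval stats from two cumulative (run_time_ns, run_cnt) samples."""
    d_time_ns = curr["run_time_ns"] - prev["run_time_ns"]
    d_cnt = curr["run_cnt"] - prev["run_cnt"]
    avg_runtime_ns = d_time_ns / d_cnt if d_cnt else 0.0
    events_per_sec = d_cnt / (interval_ns / 1e9)
    cpu_percent = 100.0 * d_time_ns / (interval_ns * num_cpus)
    return avg_runtime_ns, events_per_sec, cpu_percent

prev = {"run_time_ns": 500_000, "run_cnt": 1_000}
curr = {"run_time_ns": 2_500_000, "run_cnt": 5_000}  # sampled one second later
print(interval_stats(prev, curr))  # (500.0, 4000.0, 0.2)
```

<p>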
This approach can uncover patterns and trends that could be missed otherwise.</p><h3>How it works</h3><p>bpftop uses the <a href="https://elixir.bootlin.com/linux/v6.6.16/source/include/uapi/linux/bpf.h#L792">BPF_ENABLE_STATS</a> syscall command to enable global eBPF runtime statistics gathering, which is disabled by default to reduce performance overhead. It collects these statistics every second, calculating the average runtime, events per second, and estimated CPU utilization for each eBPF program within that sample period. This information is displayed in a top-like tabular format or a time series graph over a 10s moving window. Once bpftop terminates, it turns off the statistics-gathering function. The tool is written in Rust, leveraging the <a href="https://github.com/libbpf/libbpf-rs">libbpf-rs</a> and <a href="https://github.com/ratatui-org/ratatui">ratatui</a> crates.</p><h3>Getting started</h3><p>Visit the project’s <a href="https://github.com/Netflix/bpftop">GitHub page</a> to learn more about using the tool. We’ve open-sourced bpftop under the Apache 2 license and look forward to contributions from the community.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=6a727c1ae2e5" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/announcing-bpftop-streamlining-ebpf-performance-optimization-6a727c1ae2e5">Announcing bpftop: Streamlining eBPF performance optimization</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/announcing-bpftop-streamlining-ebpf-performance-optimization-6a727c1ae2e5</link>
      <guid>https://netflixtechblog.com/announcing-bpftop-streamlining-ebpf-performance-optimization-6a727c1ae2e5</guid>
      <pubDate>Mon, 26 Feb 2024 17:43:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Sequential A/B Testing Keeps the World Streaming Netflix
Part 1: Continuous Data]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/michaelslindon/">Michael Lindon</a>, <a href="https://www.linkedin.com/in/csanden/">Chris Sanden</a>, <a href="https://www.linkedin.com/in/vshirikian/">Vache Shirikian</a>, <a href="https://www.linkedin.com/in/liuyanjun/">Yanjun Liu</a>, <a href="https://www.linkedin.com/in/minalmishra/">Minal Mishra</a>, <a href="https://www.linkedin.com/in/martintingley/">Martin Tingley</a></p><figure><img alt="Using sequential anytime-valid hypothesis testing procedures to safely release software" src="https://cdn-images-1.medium.com/max/1024/0*mK01JWbQB9QlCEsL"></figure><p><strong>1. Spot the Difference</strong></p><p>Can you spot any difference between the two data streams below? Each observation is the time interval between a Netflix member hitting the play button and playback commencing, i.e., <em>play-delay</em>. These observations are from a particular type of A/B test that Netflix runs called a software canary or regression-driven experiment. More on that below — for now, what’s important is that we want to <strong>quickly</strong> and <strong>confidently</strong> identify any difference in the distribution of play-delay — or conclude that, within some tolerance, there is no difference.</p><p>In this blog post, we will develop a statistical procedure to do just that, and describe the impact of these developments at Netflix. The key idea is to switch from a “fixed time horizon” to an “any-time valid” framing of the problem.</p><figure><img alt="Sequentially comparing two streams of measurements from treatment and control" src="https://cdn-images-1.medium.com/max/1024/1*yDCF303-R9uqqH_zo7F4ug.gif"><figcaption>Figure 1. An example data stream for an A/B test where each observation represents play-delay for the control (left) and treatment (right). Can you spot any differences in the statistical distributions between the two data streams?</figcaption></figure><p><strong>2. 
Safe software deployment, canary testing, and play-delay</strong></p><p>Software engineering readers of this blog are likely familiar with unit, integration and load testing, as well as other testing practices that aim to prevent bugs from reaching production systems. Netflix also performs canary tests — software A/B tests between current and newer software versions. To learn more, see our previous blog post on <a href="https://netflixtechblog.com/safe-updates-of-client-applications-at-netflix-1d01c71a930c">Safe Updates of Client Applications</a>.</p><p>The purpose of a canary test is twofold: to act as a quality-control gate that catches bugs prior to full release, and to measure performance of the new software in the wild. This is carried out by performing a randomized controlled experiment on a small subset of users, where the treatment group receives the new software update and the control group continues to run the existing software. If any bugs or performance regressions are observed in the treatment group, then the full-scale release can be prevented, limiting the “impact radius” among the user base.</p><p>One of the metrics Netflix monitors in canary tests is how long it takes for the video stream to start when a title is requested by a user. Monitoring this “play-delay” metric throughout releases ensures that the streaming performance of Netflix only ever improves as we release newer versions of the Netflix client. In Figure 1, the left side shows a real-time stream of play-delay measurements from users running the existing version of the Netflix client, while the right side shows play-delay measurements from users running the updated version. We ask ourselves: Are users of the updated client experiencing longer play-delays?</p><p>We consider any increase in play-delay to be a serious performance regression and would prevent the release if we detect an increase. 
Critically, testing for differences in means or medians is not sufficient and does not provide a complete picture. For example, one situation we might face is that the median or mean play-delay is the same in treatment and control, but the treatment group experiences an increase in the upper quantiles of play-delay. This corresponds to the Netflix experience being degraded for those who already experience high play delays — likely our members on slow or unstable internet connections. Such changes should not be ignored by our testing procedure.</p><p>For a complete picture, we need to be able to reliably and quickly detect an upward shift <em>in any part of the play-delay distribution</em>. That is, we must do inference on and test for any differences between the distributions of play-delay in treatment and control.</p><p>To summarize, here are the design requirements of our canary testing system:</p><ol><li>Identify bugs and performance regressions, as measured by play-delay, as quickly as possible. <strong><em>Rationale</em></strong>: To minimize member harm, if there is any problem with the streaming quality experienced by users in the treatment group we need to abort the canary and roll back the software change as quickly as possible.</li><li>Strictly control false positive (false alarm) probabilities. <strong><em>Rationale</em></strong>: This system is part of a semi-automated process for all client deployments. A false positive test unnecessarily interrupts the software release process, reducing the velocity of software delivery and sending developers looking for bugs that do not exist.</li><li>This system should be able to detect any change in the distribution. <strong><em>Rationale</em></strong><em>: </em>We care not only about changes in the mean or median, but also about changes in tail behaviour and other quantiles.</li></ol><p>We now build out a sequential testing procedure that meets these design requirements.</p><p><strong>3. 
Sequential Testing: The Basics</strong></p><p>Standard statistical tests are fixed-n or fixed-time horizon: the analyst waits until some pre-set amount of data is collected, and then performs the analysis a single time. The classic t-test, the Kolmogorov-Smirnov test, and the Mann-Whitney test are all examples of fixed-n tests. A limitation of fixed-n tests is that they can only be performed once — yet in situations like the above, we want to be testing frequently to detect differences as soon as possible. If you apply a fixed-n test more than once, then you forfeit the Type-I error or false positive guarantee.</p><p>Here’s a quick illustration of how fixed-n tests fail under repeated analysis. In the following figure, each red line traces out the p-value when the Mann-Whitney test is repeatedly applied to a data set as 10,000 observations accrue in both treatment and control. Each red line shows an independent simulation, and in each case, there is no difference between treatment and control: these are simulated A/A tests.</p><p>The black dots mark where the p-value falls below the standard 0.05 rejection threshold. An alarming <strong>70% of simulations </strong>declare a significant difference at some point in time, even though, by construction, there is no difference: the actual false positive rate is much higher than the nominal 0.05. Exactly the same behaviour would be observed for the Kolmogorov-Smirnov test.</p><figure><img alt="increased false positives when peeking at mann-whitney test" src="https://cdn-images-1.medium.com/max/1024/1*fhHzjEOV5Iak564vOSLq7g.png"><figcaption>Figure 2. 100 Sample paths of the p-value process simulated under the null hypothesis shown in red. The dotted black line indicates the nominal alpha=0.05 level. Black dots indicate where the p-value process dips below the alpha=0.05 threshold, indicating a false rejection of the null hypothesis. 
A total of 66 out of 100 A/A simulations falsely rejected the null hypothesis.</figcaption></figure><p>This is a manifestation of “peeking”, and much has been written about the downside risks of this practice (see, for example, <a href="https://dl.acm.org/doi/abs/10.1145/3097983.3097992">Johari <em>et al. </em>2017</a>). If we restrict ourselves to correctly applied fixed-n statistical tests, where we analyze the data exactly once, we face a difficult tradeoff:</p><ul><li>Perform the test early on, after a small amount of data has been collected. In this case, we will only be powered to detect larger regressions. Smaller performance regressions will not be detected, and we run the risk of steadily eroding the member experience as small regressions accrue.</li><li>Perform the test later, after a large amount of data has been collected. In this case, we are powered to detect small regressions — but in the case of large regressions, we expose members to a bad experience for an unnecessarily long period of time.</li></ul><p>Sequential, or “any-time valid”, statistical tests overcome these limitations. They permit peeking (in fact, they can be applied after every new data point arrives) while providing false positive, or Type-I error, guarantees that hold throughout time. As a result, we can continuously monitor data streams like those in the image above, using <em>confidence sequences</em> or <em>sequential p-values</em>, and rapidly detect large regressions while eventually detecting small ones.</p><p>Despite relatively recent adoption in the context of digital experimentation, these methods have a long academic history, with initial ideas dating back to Abraham Wald’s <a href="https://www.jstor.org/stable/2235829"><em>Sequential Tests of Statistical Hypotheses</em></a><em> </em>from 1945. 
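</p><p>The repeated-testing failure in Figure 2 is easy to reproduce. In the toy simulation below, a two-sample z-test stands in for the Mann-Whitney test so that only the standard library is needed; the effect is the same, as peeking after every batch of A/A data inflates the false positive rate well above the nominal 5%:</p>

```python
# Toy peeking simulation: repeatedly test A/A data (no true difference) and
# stop at the first p-value below 0.05. A z-test on unit-variance Gaussian
# data stands in for the Mann-Whitney test used in Figure 2.
import math
import random

def z_test_p(mean_diff, n_per_arm):
    """Two-sided z-test p-value for equal means of two unit-variance samples."""
    z = mean_diff / math.sqrt(2 / n_per_arm)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
n_sims, n_batches, batch = 200, 100, 20
false_positives = 0
for _ in range(n_sims):
    sum_x = sum_y = 0.0
    n = 0
    for _ in range(n_batches):  # "peek" after every batch of new observations
        sum_x += sum(random.gauss(0, 1) for _ in range(batch))
        sum_y += sum(random.gauss(0, 1) for _ in range(batch))
        n += batch
        if z_test_p(sum_x / n - sum_y / n, n) < 0.05:
            false_positives += 1  # any rejection on A/A data is a false alarm
            break

print(f"false positive rate under peeking: {false_positives / n_sims:.0%}")
```

<p>A single correctly applied fixed-n analysis would come in near 5%; under repeated peeking the rate is several times higher.</p><p>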
Research in this area remains active, and Netflix has made a number of contributions in the last few years (see the references in these papers for a more complete literature review):</p><ul><li><a href="https://dl.acm.org/doi/abs/10.1145/3534678.3539099">Rapid Regression Detection in Software Deployments</a></li><li><a href="https://openreview.net/forum?id=a4zg0jiuVi">Anytime-Valid Inference For Multinomial Count Data</a></li><li><a href="https://arxiv.org/abs/2210.08589">Anytime-Valid Linear Models and Regression Adjusted Causal Inference in Randomized Experiments</a></li><li><a href="https://arxiv.org/abs/2210.08639">Design-Based Confidence Sequences: A General Approach to Risk Mitigation in Online Experimentation</a></li><li><a href="https://arxiv.org/abs/2212.14411">Near-Optimal Non-Parametric Sequential Tests and Confidence Sequences with Possibly Dependent Observations</a></li></ul><p>In this and following blogs, we will describe both the methods we’ve developed and their applications at Netflix. The remainder of this post discusses the first paper above, which was published at KDD ’22 (and available on <a href="https://arxiv.org/abs/2205.14762">ArXiV</a>). We will keep it high level — readers interested in the technical details can consult the paper.</p><p><strong>4. A sequential testing solution</strong></p><p><strong>Differences in Distributions</strong></p><p>At any point in time, we can estimate the empirical quantile functions for both treatment and control, based on the data observed so far.</p><figure><img alt="empirical quantile functions for treatment and control data" src="https://cdn-images-1.medium.com/max/1024/0*tKe66EiIrN9R8SST"><figcaption>Figure 3: Empirical quantile function for control (left) and treatment (right) at a snapshot in time after starting the canary experiment. 
This is from actual Netflix data, so we’ve suppressed numerical values on the y-axis.</figcaption></figure><p>These two plots look pretty close, but we can do better than an eyeball comparison — and we want the computer to be able to continuously evaluate if there is any significant difference between the distributions. Per the design requirements, we also wish to detect large effects early, while preserving the ability to detect small effects eventually — and we want to maintain the false positive probability at a nominal level while permitting continuous analysis (aka peeking).</p><p><strong>That is, we need a sequential test on the difference in distributions</strong>.</p><p>Obtaining “fixed-horizon” confidence bands for the quantile function can be achieved using the <a href="https://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality">DKWM inequality</a>. To obtain time-uniform confidence bands, however, we use the anytime-valid confidence sequences from <a href="https://projecteuclid.org/journals/bernoulli/volume-28/issue-3/Sequential-estimation-of-quantiles-with-applications-to-A-B-testing/10.3150/21-BEJ1388.short">Howard and Ramdas (2022)</a> [<a href="https://arxiv.org/abs/1906.09712">arxiv version</a>]. As the coverage guarantee from these confidence bands holds uniformly across time, we can watch them become tighter without being concerned about <a href="https://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it">peeking</a>. 
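</p><p>For intuition, the fixed-horizon DKWM band is only a few lines of code: the empirical CDF plus or minus a half-width of sqrt(ln(2/alpha) / (2n)). The sketch below is illustrative Python, not our production tooling; note that its guarantee holds at a single pre-chosen sample size, which is exactly the restriction the time-uniform bands remove:</p>

```python
# Illustrative DKWM (fixed-horizon) confidence band around the empirical CDF.
# Valid only at one pre-chosen n, unlike the time-uniform bands in Figure 4.
import math

def dkwm_half_width(n, alpha=0.05):
    """Half-width of the level (1 - alpha) DKWM band at sample size n."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

def empirical_cdf_band(sample, x, alpha=0.05):
    """Lower/upper DKWM confidence limits for F(x), clipped to [0, 1]."""
    n = len(sample)
    f_hat = sum(v <= x for v in sample) / n
    eps = dkwm_half_width(n, alpha)
    return max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)

# The band shrinks like 1/sqrt(n): about 0.136 at n=100, 0.014 at n=10,000.
print(round(dkwm_half_width(100), 3), round(dkwm_half_width(10_000), 3))
```

<p>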
As more data points stream in, these sequential confidence bands continue to shrink in width, which means any difference in the distribution functions — if it exists — will eventually become apparent.</p><figure><img alt="Anytime-valid confidence bands on treatment and control quantile functions" src="https://cdn-images-1.medium.com/max/1024/1*kUcLygkzrpSiHcQI9iA-qw.gif"><figcaption>Figure 4: 97.5% Time-Uniform Confidence bands on the quantile function for control (left) and treatment (right)</figcaption></figure><p>Note each frame corresponds to a point in time after the experiment began, not sample size. In fact, there is no requirement that each treatment group has the same sample size.</p><p>Differences are easier to see by visualizing the difference between the treatment and control quantile functions.</p><figure><img alt="Confidence sequences on quantile differences and sequential p-value" src="https://cdn-images-1.medium.com/max/1024/1*FBi_sDHmfhXFp3p1ZOcodw.gif"><figcaption>Figure 5: 95% Time-Uniform confidence band on the quantile difference function Q_b(p) — Q_a(p) (left). The sequential p-value (right).</figcaption></figure><p>As the sequential confidence band on the treatment effect quantile function is anytime-valid, the inference procedure becomes rather intuitive. We can continue to watch these confidence bands tighten, and if at any point the band no longer covers zero at any quantile, we can conclude that the distributions are different and stop the test. In addition to the sequential confidence bands, we can also construct a sequential p-value for testing that the distributions differ. Note from the animation that the moment the 95% confidence band over quantile treatment effects excludes zero is the same moment that the sequential p-value falls below 0.05: as with fixed-n tests, there is consistency between confidence intervals and p-values.</p><p>There are many multiple testing concerns in this application. 
Our solution controls Type-I error across all quantiles, all treatment groups, and all joint sample sizes simultaneously (see <a href="https://arxiv.org/pdf/2205.14762.pdf">our paper</a> or <a href="https://projecteuclid.org/journals/bernoulli/volume-28/issue-3/Sequential-estimation-of-quantiles-with-applications-to-A-B-testing/10.3150/21-BEJ1388.short">Howard and Ramdas</a> for details). Results hold for all quantiles, and for all times.</p><p><strong>5. Impact at Netflix</strong></p><p>Releasing new software always carries risk, and we always want to reduce the risk of service interruptions or degradation to the member experience. Our canary testing approach is another layer of protection for preventing bugs and performance regressions from slipping into production. It’s fully automated and has become an integral part of the software delivery process at Netflix. Developers can push to production with peace of mind, knowing that bugs and performance regressions will be rapidly caught. The additional confidence empowers developers to push to production more frequently, reducing the time to market for upgrades to the Netflix client and increasing our rate of software delivery.</p><p>So far this system has successfully prevented a number of serious bugs from reaching our end users. We detail one example.</p><p><strong>Case study: Safe Rollout of Netflix Client Application</strong></p><p>Figures 3–5 are taken from a canary test in which the behaviour of the client application was modified (actual numerical values of play-delay have been suppressed). As we can see, the canary test revealed that the new version of the client increases a number of quantiles of play-delay, with the median and 75th percentile of play-delay experiencing relative increases of at least 0.5% and 1%, respectively. The time series of the sequential p-value shows that, in this case, we were able to reject the null of no change in distribution at the 0.05 level after about 60 seconds. 
This provides rapid feedback in the software delivery process, allowing developers to test the performance of new software and quickly iterate.</p><p><strong>6. What’s next?</strong></p><p>If you are curious about the technical details of the sequential tests for quantiles developed here, you can learn all about the math in our <a href="https://dl.acm.org/doi/abs/10.1145/3534678.3539099">KDD paper</a> (<a href="https://arxiv.org/pdf/2205.14762.pdf">also available on arxiv</a>).</p><p>You might also be wondering what happens if the data are not continuous measurements. Errors and exceptions are critical metrics to log when deploying software, as are many other metrics which are best defined in terms of counts. Stay tuned — our next post will develop sequential testing procedures for count data.</p><img src="https://medium.com/_/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=cba6c7ed49df" width="1" height="1" alt=""><hr><p><a href="https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df">Sequential A/B Testing Keeps the World Streaming Netflix
Part 1: Continuous Data</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df</link>
      <guid>https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df</guid>
      <pubDate>Tue, 13 Feb 2024 20:10:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing SafeTest: A Novel Approach to Front End Testing]]></title>
      <description><![CDATA[<p>by <a href="https://medium.com/u/a155da075195">Moshe Kolodny</a></p><p>In this post, we’re excited to introduce SafeTest, a revolutionary library that offers a fresh perspective on End-To-End (E2E) tests for web-based User Interface (UI) applications.</p><h3>The Challenges of Traditional UI Testing</h3><p>Traditionally, UI tests have been conducted through either unit testing or integration testing (also referred to as End-To-End (E2E) testing). However, each of these methods presents a unique trade-off: you have to choose between controlling the test fixture and setup, or controlling the test driver.</p><p>For instance, when using <a href="https://testing-library.com/docs/react-testing-library/intro/">react-testing-library</a>, a unit testing solution, you maintain complete control over what to render and how the underlying services and imports should behave. However, you lose the ability to interact with an actual page, which can lead to a myriad of pain points:</p><ul><li>Difficulty in interacting with complex UI elements like &lt;Dropdown /&gt; components.</li><li>Inability to test CORS setup or GraphQL calls.</li><li>Lack of visibility into z-index issues affecting click-ability of buttons.</li><li>Complex and unintuitive authoring and debugging of tests.</li></ul><p>Conversely, using integration testing tools like Cypress or Playwright provides control over the page, but sacrifices the ability to instrument the bootstrapping code for the app. These tools operate by remotely controlling a browser to visit a URL and interact with the page. 
This approach has its own set of challenges:</p><ul><li>Difficulty in making calls to an alternative API endpoint without implementing custom network layer API rewrite rules.</li><li>Inability to make assertions on spies/mocks or execute code within the app.</li><li>Testing something like dark mode entails clicking the theme switcher or knowing the localStorage mechanism to override.</li><li>Inability to test segments of the app; for example, if a component is only visible after clicking a button and waiting for a 60-second timer to count down, the test will need to run those actions and will be at least a minute long.</li></ul><p>Recognizing these challenges, solutions like E2E Component Testing have emerged, with offerings from <a href="https://docs.cypress.io/guides/component-testing/overview">Cypress</a> and <a href="https://playwright.dev/docs/test-components">Playwright</a>. While these tools attempt to rectify the shortcomings of traditional integration testing methods, they have other limitations due to their architecture. They start a dev server with bootstrapping code to load the component and/or setup code you want, which limits their ability to handle complex enterprise applications that might have OAuth or a complex build pipeline. Moreover, updating TypeScript usage could break your tests until the Cypress/Playwright team updates their runner.</p><h3>Welcome to SafeTest</h3><p>SafeTest aims to address these issues with a novel approach to UI testing. The main idea is to have a <a href="https://www.npmjs.com/package/safetest#bootstrapping-your-application">snippet of code in our application bootstrapping stage that injects hooks to run our tests</a> (see the <a href="https://www.npmjs.com/package/safetest#how-safetest-works">How Safetest Works</a> section for more info on what this is doing). 
<strong>Note that this has no measurable impact on the regular usage of your app, since SafeTest leverages lazy loading to load the tests only when they are run (in the README example, the tests aren’t in the production bundle at all).</strong> Once that’s in place, we can use Playwright to run regular tests, thereby achieving the ideal browser control we want for our tests.</p><p>This approach also unlocks some exciting features:</p><ul><li>Deep linking to a specific test without needing to run a node test server.</li><li>Two-way communication between the browser and test (node) context.</li><li>Access to all the DX features that come with Playwright (excluding the ones that come with @playwright/test).</li><li>Video recording of tests, trace viewing, and pause page functionality for trying out different page selectors/actions.</li><li>Ability to make assertions in node on spies in the browser, matching snapshots of the calls made within the browser.</li></ul><h3>Test Examples with SafeTest</h3><p>SafeTest is designed to feel familiar to anyone who has conducted UI tests before, as it leverages the best parts of existing solutions. 
Here’s an example of how to test an entire application:</p><pre>import { describe, it, expect } from 'safetest/jest';<br>import { render } from 'safetest/react';<br><br>describe('my app', () =&gt; {<br>  it('loads the main page', async () =&gt; {<br>    const { page } = await render();<br><br>    await expect(page.getByText('Welcome to the app')).toBeVisible();<br>    expect(await page.screenshot()).toMatchImageSnapshot();<br>  });<br>});</pre><p>We can just as easily test a specific component:</p><pre>import { describe, it, expect, browserMock } from 'safetest/jest';<br>import { render } from 'safetest/react';<br><br>describe('Header component', () =&gt; {<br>  it('has a normal mode', async () =&gt; {<br>    const { page } = await render(&lt;Header /&gt;);<br><br>    await expect(page.getByText('Admin')).not.toBeVisible();<br>  });<br><br>  it('has an admin mode', async () =&gt; {<br>    const { page } = await render(&lt;Header admin={true} /&gt;);<br><br>    await expect(page.getByText('Admin')).toBeVisible();<br>  });<br><br>  it('calls the logout handler when signing out', async () =&gt; {<br>    const spy = browserMock.fn();<br>    const { page } = await render(&lt;Header handleLogout={spy} /&gt;);<br><br>    await page.getByText('logout').click();<br>    expect(await spy).toHaveBeenCalledWith();<br>  });<br>});</pre><h3>Leveraging Overrides</h3><p>SafeTest utilizes React Context to allow for value overrides during tests. For an example of how this works, let’s assume we have a fetchPerson function used in a component:</p><pre>import { useAsync } from 'react-use';<br>import { fetchPerson } from './api/person';<br><br>export const People: React.FC = () =&gt; {<br>  const { data: people, loading, error } = useAsync(fetchPerson);<br>  <br>  if (loading) return &lt;Loader /&gt;;<br>  if (error) return &lt;ErrorPage error={error} /&gt;;<br>  return &lt;Table data={people} rows={[...]} 
/&gt;;<br>}</pre><p>We can modify the People component to use an Override:</p><pre> import { fetchPerson } from './api/person';<br>+import { createOverride } from 'safetest/react';<br><br>+const FetchPerson = createOverride(fetchPerson);<br><br> export const People: React.FC = () =&gt; {<br>+  const fetchPerson = FetchPerson.useValue();<br>   const { data: people, loading, error } = useAsync(fetchPerson);<br>  <br>   if (loading) return &lt;Loader /&gt;;<br>   if (error) return &lt;ErrorPage error={error} /&gt;;<br>   return &lt;Table data={people} rows={[...]} /&gt;;<br> }</pre><p>Now, in our test, we can override the response for this call:</p><pre>const pending = new Promise(r =&gt; { /* Do nothing */ });<br>const resolved = [{ name: 'Foo', age: 23 }, { name: 'Bar', age: 32 }];<br>const error = new Error('Whoops');<br><br>describe('People', () =&gt; {<br>  it('has a loading state', async () =&gt; {<br>    const { page } = await render(<br>      &lt;FetchPerson.Override with={() =&gt; () =&gt; pending}&gt;<br>        &lt;People /&gt;<br>      &lt;/FetchPerson.Override&gt;<br>    );<br><br>    await expect(page.getByText('Loading')).toBeVisible();<br>  });<br><br>  it('has a loaded state', async () =&gt; {<br>    const { page } = await render(<br>      &lt;FetchPerson.Override with={() =&gt; async () =&gt; resolved}&gt;<br>        &lt;People /&gt;<br>      &lt;/FetchPerson.Override&gt;<br>    );<br><br>    await expect(page.getByText('User: Foo, name: 23')).toBeVisible();<br>  });<br><br>  it('has an error state', async () =&gt; {<br>    const { page } = await render(<br>      &lt;FetchPerson.Override with={() =&gt; async () =&gt; { throw error }}&gt;<br>        &lt;People /&gt;<br>      &lt;/FetchPerson.Override&gt;<br>    );<br><br>    await expect(page.getByText('Error getting users: "Whoops"')).toBeVisible();<br>  });<br>});</pre><p>The render function also accepts a function that will be passed the initial app component, allowing for the injection of any desired 
elements anywhere in the app:</p><pre>it('has a people loaded state', async () =&gt; {<br>  const { page } = await render(app =&gt;<br>    &lt;FetchPerson.Override with={() =&gt; async () =&gt; resolved}&gt;<br>      {app}<br>    &lt;/FetchPerson.Override&gt;<br>  );<br><br>  await expect(page.getByText('User: Foo, name: 23')).toBeVisible();<br>});</pre><p>With overrides, we can write complex test cases, such as ensuring that a service method which combines API requests from /foo, /bar, and /baz retries just the failed API requests and still maps the return value correctly. So if /bar takes 3 attempts to resolve, the method will make a total of 5 API calls.</p><p>Overrides aren’t limited to just API calls (since we can also use <a href="https://playwright.dev/docs/api/class-page#page-route">page.route</a>); we can also override specific app-level values like feature flags or some static value:</p><pre>+const UseFlags = createOverride(useFlags);<br> export const Admin = () =&gt; {<br>+  const useFlags = UseFlags.useValue();<br>   const { isAdmin } = useFlags();<br>   if (!isAdmin) return &lt;div&gt;Permission error&lt;/div&gt;;<br>   // ...<br> }<br><br>+const Language = createOverride(navigator.language);<br> export const LanguageChanger = () =&gt; {<br>-  const language = navigator.language;<br>+  const language = Language.useValue();<br>   return &lt;div&gt;Current language is { language } &lt;/div&gt;;<br> }<br><br> describe('Admin', () =&gt; {<br>   it('works with admin flag', async () =&gt; {<br>     const { page } = await render(<br>       &lt;UseFlags.Override with={oldHook =&gt; () =&gt; {<br>         const oldFlags = oldHook();<br>         return { ...oldFlags, isAdmin: true };<br>       }}&gt;<br>         &lt;MyComponent /&gt;<br>       &lt;/UseFlags.Override&gt;<br>     );<br><br>     await expect(page.getByText('Permission error')).not.toBeVisible();<br>   });<br> });<br><br> describe('Language', () =&gt; {<br>   
it('displays', async () =&gt; {<br>     const { page } = await render(<br>       &lt;Language.Override with={old =&gt; 'abc'}&gt;<br>         &lt;LanguageChanger /&gt;<br>       &lt;/Language.Override&gt;<br>     );<br><br>     await expect(page.getByText('Current language is abc')).toBeVisible();<br>   });<br> });</pre><p>Overrides are a powerful feature of SafeTest and the examples here only scratch the surface. For more information and examples, refer to the <a href="https://www.npmjs.com/package/safetest#overrides">Overrides section</a> on the <a href="https://github.com/kolodny/safetest/blob/main/README.md">README</a>.</p><h3>Reporting</h3><p>SafeTest comes out of the box with powerful reporting capabilities, such as automatic linking of video replays, the Playwright trace viewer, and even <a href="https://safetest-two.vercel.app/vite-react-ts/?test_path=./Another.safetest&amp;test_name=Main2+can+do+many+interactions+fast">deep links directly to the mounted tested component</a>. The SafeTest repo <a href="https://github.com/kolodny/safetest/blob/main/README.md">README</a> links to all the <a href="https://safetest-two.vercel.app/">example apps</a> as well as the <a href="https://safetest-two.vercel.app/report.html#results=vite-react-ts/artifacts/results.json&amp;url=vite-react-ts/">reports</a>.</p><figure><img alt="Image of SafeTest report showing a video of a test run" src="https://cdn-images-1.medium.com/max/995/1*OFmV3PX7Is8X48-V9ryeig.png"></figure><h3>SafeTest in Corporate Environments</h3><p>Many large corporations need a form of authentication to use the app. Typically, navigating to localhost:3000 just results in a perpetually loading page. You need to go to a different port, like localhost:8000, which has a proxy server to check and/or inject auth credentials into underlying service calls. 
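</p><p>As a concrete illustration of the setup just described, here is a minimal sketch of such an auth-injecting proxy. The port numbers, cookie name, and <code>withAuth</code> helper are hypothetical, not Netflix’s actual gateway:</p>

```typescript
import http from 'node:http';

// Hypothetical helper: copy the incoming headers and append the auth
// cookie a corporate gateway might expect. Pure, so it is easy to test.
function withAuth(
  headers: http.IncomingHttpHeaders,
  token: string,
): http.OutgoingHttpHeaders {
  const cookie = [headers.cookie, `authToken=${token}`].filter(Boolean).join('; ');
  return { ...headers, cookie };
}

// A proxy on :8000 that forwards every request to the app on :3000,
// injecting credentials into each underlying service call.
const proxy = http.createServer((req, res) => {
  const upstream = http.request(
    {
      host: 'localhost',
      port: 3000,
      path: req.url,
      method: req.method,
      headers: withAuth(req.headers, process.env.AUTH_TOKEN ?? 'dev-token'),
    },
    upstreamRes => {
      res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(res);
    },
  );
  req.pipe(upstream);
});

// proxy.listen(8000); // then browse to localhost:8000 instead of :3000
```

<p>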
This limitation is one of the main reasons that Cypress/Playwright Component Tests aren’t suitable for use at Netflix.</p><p>However, there’s usually a service that can generate test users whose credentials we can use to log in and interact with the application. This facilitates creating a light wrapper around SafeTest to automatically generate and assume that test user. For instance, here’s basically how we do it at Netflix:</p><pre>import { setup } from 'safetest/setup';<br>import { createTestUser, addCookies } from 'netflix-test-helper';<br><br>type Setup = Parameters&lt;typeof setup&gt;[0] &amp; {<br>  extraUserOptions?: UserOptions;<br>};<br><br>export const setupNetflix = (options: Setup) =&gt; {<br>  setup({<br>    ...options,<br>    hooks: { beforeNavigate: [async page =&gt; addCookies(page)] },<br>  });<br><br>  beforeAll(async () =&gt; {<br>    await createTestUser(options.extraUserOptions);<br>  });<br>};</pre><p>After setting this up, we simply import the above package in place of where we would have used safetest/setup.</p><h3>Beyond React</h3><p>While this post focused on how SafeTest works with React, it’s not limited to just React. SafeTest also works with Vue, Svelte, Angular, and can even run on NextJS or Gatsby. It also runs using either Jest or Vitest based on which test runner your scaffolding started you off with. The <a href="https://github.com/kolodny/safetest/tree/main/examples">examples folder</a> demonstrates how to use SafeTest with different tooling combinations, and we encourage contributions to add more cases.</p><p>At its core, SafeTest is an intelligent glue for a test runner, a UI library, and a browser runner. Though the most common usage at Netflix employs Jest/React/Playwright, it’s easy to add more adapters for other options.</p><h3>Conclusion</h3><p>SafeTest is a powerful testing framework that’s being adopted within Netflix. 
It allows for easy authoring of tests and provides comprehensive reports showing when and how any failures occurred, complete with links to view a playback video or manually run the test steps to see what broke. We’re excited to see how it will revolutionize UI testing and look forward to your feedback and contributions.</p><hr><p><a href="https://netflixtechblog.com/introducing-safetest-a-novel-approach-to-front-end-testing-37f9f88c152d">Introducing SafeTest: A Novel Approach to Front End Testing</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/introducing-safetest-a-novel-approach-to-front-end-testing-37f9f88c152d</link>
      <guid>https://netflixtechblog.com/introducing-safetest-a-novel-approach-to-front-end-testing-37f9f88c152d</guid>
      <pubDate>Tue, 13 Feb 2024 17:07:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Rebuilding Netflix Video Processing Pipeline with Microservices]]></title>
      <description><![CDATA[<p><a href="https://www.linkedin.com/in/liwei-guo-a5aa6311/">Liwei Guo</a>, <a href="https://www.linkedin.com/in/anush-moorthy-b8451142/">Anush Moorthy</a>, <a href="https://www.linkedin.com/in/li-heng-chen-a75458a2/">Li-Heng Chen</a>, <a href="https://www.linkedin.com/in/carvalhovinicius/">Vinicius Carvalho</a>, <a href="https://www.linkedin.com/in/aditya-mavlankar-7139791/">Aditya Mavlankar</a>, <a href="https://www.linkedin.com/in/agataopalach/">Agata Opalach</a>, <a href="https://www.linkedin.com/in/adithyaprakash/">Adithya Prakash</a>, Kyle Swanson, <a href="https://www.linkedin.com/in/jessicatweneboah/">Jessica Tweneboah</a>, <a href="https://www.linkedin.com/in/subbu-venkatrav-126172a/">Subbu Venkatrav</a>, <a href="https://www.linkedin.com/in/lishan-z-51302abb/">Lishan Zhu</a></p><p><em>This is the first blog in a multi-part series on how Netflix rebuilt its video processing pipeline with microservices, so we can maintain our rapid pace of innovation and continuously improve the system for member streaming and studio operations. This introductory blog focuses on an overview of our journey. Future blogs will provide deeper dives into each service, sharing insights and lessons learned from this process.</em></p><p>The Netflix video processing pipeline went live with the launch of our streaming service in 2007. Since then, the video pipeline has undergone substantial improvements and broad expansions:</p><ul><li>Starting with Standard Dynamic Range (SDR) at <a href="https://en.wikipedia.org/wiki/Display_resolution">Standard-Definitions</a>, we expanded the encoding pipeline to 4K and High Dynamic Range (HDR) which enabled support for our premium offering.</li><li>We moved from centralized linear encoding to <a href="https://netflixtechblog.com/high-quality-video-encoding-at-scale-d159db052746">distributed chunk-based encoding</a>. 
This architecture shift greatly reduced the processing latency and increased system resiliency.</li><li>Moving away from the use of dedicated instances that were constrained in quantity, we tapped into Netflix’s <a href="https://netflixtechblog.com/creating-your-own-ec2-spot-market-6dd001875f5">internal trough</a> created due to autoscaling microservices, leading to significant improvements in computation elasticity as well as resource utilization efficiency.</li><li>We rolled out encoding innovations such as <a href="https://medium.com/netflix-techblog/per-title-encode-optimization-7e99442b62a2">per-title</a> and <a href="https://netflixtechblog.com/optimized-shot-based-encodes-now-streaming-4b9464204830">per-shot</a> optimizations, which provided significant quality-of-experience (QoE) improvement to Netflix members.</li><li>By integrating with studio content systems, we enabled the pipeline to leverage rich metadata from the creative side and create more engaging member experiences like <a href="https://en.wikipedia.org/wiki/Black_Mirror:_Bandersnatch">interactive storytelling</a>.</li><li>We expanded pipeline support to serve our studio/content-development use cases, which had different latency and resiliency requirements as compared to the traditional streaming use case.</li></ul><p>Our experience of the last decade-and-a-half has reinforced our conviction that an efficient, flexible video processing pipeline that allows us to innovate and support our streaming service, as well as our studio partners, is critical to the continued success of Netflix. 
To that end, the Video and Image Encoding team in Encoding Technologies (ET) has spent the last few years rebuilding the video processing pipeline on our next-generation microservice-based computing platform <a href="https://netflixtechblog.com/the-netflix-cosmos-platform-35c14d9351ad">Cosmos</a>.</p><h3>From Reloaded to Cosmos</h3><h4>Reloaded</h4><p>Starting in 2014, we developed and operated the video processing pipeline on our third-generation platform <a href="https://www.youtube.com/watch?v=JouA10QJiNc">Reloaded</a>. Reloaded was well-architected, providing good stability, scalability, and a reasonable level of flexibility. It served as the foundation for numerous encoding innovations developed by our team.</p><p>When Reloaded was designed, we focused on a single use case: converting high-quality media files (also known as mezzanines) received from studios into compressed assets for Netflix streaming. Reloaded was created as a single monolithic system, where developers from various media teams in ET and our platform partner team Content Infrastructure and Solutions (CIS)¹ worked on the same codebase, building a single system that handled all media assets. Over the years, the system expanded to support various new use cases. This led to a significant increase in system complexity, and the limitations of Reloaded began to show:</p><ul><li><em>Coupled functionality:</em> Reloaded was composed of a number of worker modules and an orchestration module. The setup of a new Reloaded module and its integration with the orchestration required a non-trivial amount of effort, which led to a bias towards augmentation rather than creation when developing new functionalities. For example, in Reloaded <a href="https://netflixtechblog.com/netflix-video-quality-at-scale-with-cosmos-microservices-552be631c113">the video quality calculation was implemented inside the video encoder module</a>. 
With this implementation, it was extremely difficult to recalculate video quality without re-encoding.</li><li><em>Monolithic structure</em>: Since Reloaded modules were often co-located in the same repository, it was easy to overlook code-isolation rules and there was quite a bit of unintended reuse of code across what should have been strong boundaries. Such reuse created tight coupling and reduced development velocity. The tight coupling among modules further forced us to deploy all modules together.</li><li><em>Long release cycles</em>: The joint deployment meant that there was increased fear of unintended production outages as debugging and rollback can be difficult for a deployment of this size. This drove the approach of the “release train”. Every two weeks, a “snapshot” of all modules was taken, and promoted to be a “release candidate”. This release candidate then went through exhaustive testing which attempted to cover as large a surface area as possible. This testing stage took about two weeks. Thus, depending on when the code change was merged, it could take anywhere between two and four weeks to reach production.</li></ul><p>As time progressed and functionalities grew, the rate of new feature contributions in Reloaded dropped. Several promising ideas were abandoned owing to the outsized work needed to overcome architectural limitations. The platform that had once served us well was now becoming a drag on development.</p><h4>Cosmos</h4><p>As a response, in 2018 the CIS and ET teams started developing the next-generation platform, Cosmos. In addition to the scalability and the stability that the developers already enjoyed in Reloaded, Cosmos aimed to significantly increase system flexibility and feature development velocity. To achieve this, Cosmos was developed as a computing platform for workflow-driven, media-centric microservices.</p><p>The microservice architecture provides strong decoupling between services. 
Per-microservice workflow support eases the burden of implementing complex media workflow logic. Finally, relevant abstractions allow media algorithm developers to focus on the manipulation of video and audio signals rather than on infrastructural concerns. A comprehensive list of benefits offered by Cosmos can be found in the linked <a href="https://netflixtechblog.com/the-netflix-cosmos-platform-35c14d9351ad">blog</a>.</p><h3>Building the Video Processing Pipeline in Cosmos</h3><h4>Service Boundaries</h4><p>In the microservice architecture, a system is composed of a number of fine-grained services, with each service focusing on a single functionality. So the first (and arguably the most important) thing is to identify boundaries and define services.</p><p>In our pipeline, as media assets travel through creation to ingest to delivery, they go through a number of processing steps such as analyses and transformations. We analyzed these processing steps to identify “boundaries” and grouped them into different domains, which in turn became the building blocks of the microservices we engineered.</p><p>As an example, in Reloaded, the video encoding module bundles 5 steps:</p><p>1. divide the input video into small chunks</p><p>2. encode each chunk independently</p><p>3. calculate the quality score (<a href="https://netflixtechblog.com/vmaf-the-journey-continues-44b51ee9ed12">VMAF</a>) of each chunk</p><p>4. assemble all the encoded chunks into a single encoded video</p><p>5. aggregate quality scores from all chunks</p><p>From a system perspective, the assembled encoded video is of primary concern while the internal chunking and separate chunk encodings exist in order to fulfill certain latency and resiliency requirements. 
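</p><p>To make the coupling concrete, the five bundled steps can be sketched as follows. The types, the stand-in encoder, and the VMAF numbers are purely illustrative, not Netflix’s actual implementation:</p>

```typescript
// Toy model of the five steps Reloaded's video encoding module bundled.
type Chunk = { index: number; data: string };
type EncodedChunk = { index: number; bits: string; vmaf: number };

// 1. Divide the input video into small chunks.
function splitIntoChunks(mezzanine: string, chunkSize: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (let i = 0; i * chunkSize < mezzanine.length; i++) {
    chunks.push({ index: i, data: mezzanine.slice(i * chunkSize, (i + 1) * chunkSize) });
  }
  return chunks;
}

// 2 + 3. Encode each chunk independently and score it (stand-ins for a
// real encoder and a real VMAF computation).
function encodeChunk(c: Chunk): EncodedChunk {
  return { index: c.index, bits: `enc(${c.data})`, vmaf: 90 + (c.index % 10) / 10 };
}

function encodeVideo(mezzanine: string) {
  const chunks = splitIntoChunks(mezzanine, 4);
  const encoded = chunks.map(encodeChunk); // done in parallel in practice
  encoded.sort((a, b) => a.index - b.index);
  // 4. Assemble all encoded chunks into a single encoded video.
  const video = encoded.map(e => e.bits).join('|');
  // 5. Aggregate quality scores from all chunks.
  const vmaf = encoded.reduce((s, e) => s + e.vmaf, 0) / encoded.length;
  return { video, vmaf };
}
```

<p>Because quality scoring is buried inside step 3 of this single function, recomputing VMAF means re-running the whole encode, which is exactly the coupling described above.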
Further, as alluded to above, the video quality calculation provides a totally separate functionality as compared to the encoding service.</p><p>Thus, in Cosmos, we created two independent microservices: Video Encoding Service (VES) and <a href="https://netflixtechblog.com/netflix-video-quality-at-scale-with-cosmos-microservices-552be631c113">Video Quality Service (VQS)</a>, each of which serves a clear, decoupled function. As implementation details, the chunked encoding and the assembling were abstracted away into the VES.</p><h4>Video Services</h4><p>The approach outlined above was applied to the rest of the video processing pipeline to identify functionalities and hence service boundaries, leading to the creation of the following video services².</p><ol><li>Video Inspection Service (VIS): This service takes a mezzanine as the input and performs various inspections. It extracts metadata from different layers of the mezzanine for downstream services. In addition, the inspection service flags issues if invalid or unexpected metadata is observed and provides actionable feedback to the upstream team.</li><li>Complexity Analysis Service (CAS): The optimal encoding recipe is highly content-dependent. This service takes a mezzanine as the input and performs analysis to understand the content complexity. It calls Video Encoding Service for <a href="https://netflixtechblog.com/dynamic-optimizer-a-perceptual-video-encoding-optimization-framework-e19f1e3a277f">pre-encoding</a> and Video Quality Service for quality evaluation. The results are saved to a database so they can be reused.</li><li>Ladder Generation Service (LGS): This service creates an entire bitrate ladder for a given encoding family (H.264, AV1, etc.). It fetches the complexity data from CAS and runs the optimization algorithm to create encoding recipes. 
The CAS and LGS cover much of the innovations that we have previously presented in our tech blogs (<a href="http://techblog.netflix.com/2015/12/per-title-encode-optimization.html">per-title</a>, <a href="http://techblog.netflix.com/2016/12/more-efficient-mobile-encodes-for.html">mobile encodes</a>, <a href="https://netflixtechblog.com/optimized-shot-based-encodes-now-streaming-4b9464204830">per-shot</a>, <a href="https://netflixtechblog.com/optimized-shot-based-encodes-for-4k-now-streaming-47b516b10bbb">optimized 4K encoding</a>, etc.). By wrapping ladder generation into a separate microservice (LGS), we decouple the ladder optimization algorithms from the creation and management of complexity analysis data (which resides in CAS). We expect this to give us greater freedom for experimentation and a faster rate of innovation.</li><li>Video Encoding Service (VES): This service takes a mezzanine and an encoding recipe and creates an encoded video. The recipe includes the desired encoding format and properties of the output, such as resolution, bitrate, etc. The service also provides options that allow fine-tuning latency, throughput, etc., depending on the use case.</li><li>Video Validation Service (VVS): This service takes an encoded video and a list of expectations about the encode. These expectations include attributes specified in the encoding recipe as well as conformance requirements from the codec specification. VVS analyzes the encoded video and compares the results against the indicated expectations. 
Any discrepancy is flagged in the response to alert the caller.</li><li><a href="https://netflixtechblog.com/netflix-video-quality-at-scale-with-cosmos-microservices-552be631c113">Video Quality Service (VQS)</a>: This service takes the mezzanine and the encoded video as input, and calculates the quality score (VMAF) of the encoded video.</li></ol><h4>Service Orchestration</h4><p>Each video service provides a dedicated functionality and they work together to generate the needed video assets. Currently, the two main use cases of the Netflix video pipeline are producing assets for member streaming and for studio operations. For each use case, we created a dedicated workflow orchestrator so the service orchestration can be customized to best meet the corresponding business needs.</p><p>For the streaming use case, the generated videos are deployed to our content delivery network (CDN) for Netflix members to consume. These videos can easily be watched millions of times. The Streaming Workflow Orchestrator utilizes almost all video services to create streams for an impeccable member experience. It leverages VIS to detect and reject non-conformant or low-quality mezzanines, invokes LGS for encoding recipe optimization, encodes video using VES, and calls VQS for quality measurement where the quality data is further fed to Netflix’s data pipeline for analytics and monitoring purposes. In addition to video services, the Streaming Workflow Orchestrator uses audio and timed text services to generate audio and text assets, and packaging services to “containerize” assets for streaming.</p><p>For the studio use case, some example video assets are marketing clips and daily production editorial proxies. The requests from the studio side are generally latency-sensitive. For example, someone from the production team may be waiting for the video to review so they can decide the shooting plan for the next day. 
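</p><p>In sketch form, the two orchestrators compose the same services differently. The service signatures below are invented for illustration; the real APIs are considerably richer:</p>

```typescript
// Hypothetical interfaces for the services described above.
interface Services {
  vis(mezzanine: string): { ok: boolean; metadata: string }; // inspection
  lgs(metadata: string): string[];                 // encoding recipes, one per ladder rung
  ves(mezzanine: string, recipe: string): string;  // encoded video
  vqs(mezzanine: string, encoded: string): number; // VMAF score
}

// Streaming workflow: inspect, build the bitrate ladder, encode every
// rung, and measure quality for analytics.
function streamingWorkflow(svc: Services, mezzanine: string) {
  const inspection = svc.vis(mezzanine);
  if (!inspection.ok) throw new Error('mezzanine rejected by VIS');
  return svc.lgs(inspection.metadata).map(recipe => {
    const encoded = svc.ves(mezzanine, recipe);
    return { recipe, encoded, vmaf: svc.vqs(mezzanine, encoded) };
  });
}

// Studio workflow: latency-sensitive, so skip ladder optimization and
// encode with a predefined recipe.
function studioWorkflow(svc: Services, mezzanine: string, recipe: string) {
  const { metadata } = svc.vis(mezzanine);
  return { metadata, encoded: svc.ves(mezzanine, recipe) };
}
```

<p>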
Because of this, the Studio Workflow Orchestrator optimizes for fast turnaround and focuses on core media processing services. At this time, the Studio Workflow Orchestrator calls VIS to extract metadata of the ingested assets and calls VES with predefined recipes. Compared to member streaming, studio operations have different and unique requirements for video processing. Therefore, the Studio Workflow Orchestrator is the exclusive user of some encoding features like forensic watermarking and timecode/text burn-in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Wen2M8BgqhC7Hacc1O2Nqg.png"></figure><h3>Where we are now</h3><p>We have had the new video pipeline running alongside Reloaded in production for a few years now. During this time, we completed the migration of all necessary functionalities from Reloaded, began gradually shifting over traffic one use case at a time, and completed the switchover in September of 2023.</p><p>While it is still early days, we have already seen the benefits of the new platform, specifically the ease of feature delivery. Notably, Netflix launched the <a href="https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba">Advertising-supported plan</a> in November 2022. Processing Ad creatives posed some new challenges: media formats of Ads are quite different from movie and TV mezzanines that the team was familiar with, and there was a new set of media processing requirements related to the business needs of Ads. With the modularity and developer productivity benefits of Cosmos, we were able to quickly iterate the pipeline to keep up with the changing requirements and support a successful product launch.</p><h3>Summary</h3><p>Rebuilding the video pipeline was a huge undertaking for the team. We are very proud of what we have achieved, and also eager to share our journey with the technical community. 
This blog has focused on providing an overview: a brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses. In the next blog, we are going to delve into the details of the Video Encoding Service (VES), explaining step-by-step the service creation, and sharing lessons learned (we have A LOT!). We also plan to cover other video services in future tech blogs. Follow the<a href="https://netflixtechblog.com/"> Netflix Tech Blog</a> to stay up to date.</p><h3><strong>Acknowledgments</strong></h3><p>A big shout out to the CIS team for their outstanding work in building the Cosmos platform and their receptiveness to feedback from service developers.</p><p>We want to express our appreciation to our users, the Streaming Encoding Pipeline team, and the Video Engineering team. Just like our feedback helps iron out the platform, the feedback from our users has been instrumental in building high-quality services.</p><p>We also want to thank Christos Bampis and Zhi Li for their significant contributions to video services, and our two former team members, Chao Chen and Megha Manohara for contributing to the early development of this project.</p><h4>Footnotes</h4><ol><li>Formerly known as Media Cloud Engineering/MCE team.</li><li>The actual number of video services is more than listed here. 
Some of them are Netflix-specific and thus omitted from this blog.</li></ol><hr><p><a href="https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359">Rebuilding Netflix Video Processing Pipeline with Microservices</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
      <link>https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359</link>
      <guid>https://netflixtechblog.com/rebuilding-netflix-video-processing-pipeline-with-microservices-4e5e6310e359</guid>
      <pubDate>Thu, 11 Jan 2024 00:20:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Our First Netflix Data Engineering Summit]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="8319" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://www.linkedin.com/in/holdenkarau" rel="noopener ugc nofollow" target="_blank">Holden Karau</a> <a class="af ny" href="https://www.linkedin.com/in/elizabeth-stone-608a754" rel="noopener ugc nofollow" target="_blank">Elizabeth Stone</a> <a class="af ny" href="https://www.linkedin.com/in/pmd323/" rel="noopener ugc nofollow" target="_blank">Pedro Duarte</a> <a class="af ny" href="https://www.linkedin.com/in/cjstep" rel="noopener ugc nofollow" target="_blank">Chris Stephens</a> <a class="af ny" href="https://www.linkedin.com/in/pallavi-phadnis-75280b20/" rel="noopener ugc nofollow" target="_blank">Pallavi Phadnis</a> <a class="af ny" href="https://www.linkedin.com/in/leewoodridge/" rel="noopener ugc nofollow" target="_blank">Lee Woodridge</a> <a class="af ny" href="https://www.linkedin.com/in/markcho" rel="noopener ugc nofollow" target="_blank">Mark Cho </a><a class="af ny" href="https://www.linkedin.com/in/guilhermesmi" rel="noopener ugc nofollow" target="_blank">Guil Pires</a> <a class="af ny" href="https://www.linkedin.com/in/sujayjain" rel="noopener ugc nofollow" target="_blank">Sujay Jain</a> <a class="af ny" href="https://www.linkedin.com/in/tristanreid" rel="noopener ugc nofollow" target="_blank">Tristan Reid</a> <a class="af ny" href="https://www.linkedin.com/in/senthilnathan-rajagopalan-athinarayanan-26a97138" rel="noopener ugc nofollow" target="_blank">Senthilnathan Athinarayanan</a> <a class="af ny" href="https://www.linkedin.com/in/bharath-chandra-mummadisetty-27591a88" rel="noopener ugc nofollow" target="_blank">Bharath Mummadisetty</a> <a class="af ny" href="https://www.linkedin.com/in/abhinaya-shetty-ab871418" rel="noopener ugc nofollow" target="_blank">Abhinaya Shetty</a> <a class="af ny" href="https://www.linkedin.com/in/jlantos/" 
rel="noopener ugc nofollow" target="_blank">Judit Lantos</a> <a class="af ny" href="https://www.linkedin.com/in/amanuel-kahsay-81ab29153/" rel="noopener ugc nofollow" target="_blank">Amanuel Kahsay</a> <a class="af ny" href="https://www.linkedin.com/in/daomi" rel="noopener ugc nofollow" target="_blank">Dao Mi</a> <a class="af ny" href="https://www.linkedin.com/in/mdreeling" rel="noopener ugc nofollow" target="_blank">Mick Dreeling</a> <a class="af ny" href="https://www.linkedin.com/in/chris-colburn" rel="noopener ugc nofollow" target="_blank">Chris Colburn</a> and <a class="af ny" href="https://www.linkedin.com/in/agatagrzybek" rel="noopener ugc nofollow" target="_blank">Agata Grzybek</a></p><figure class="oc od oe of og oh nz oa paragraph-image"><div role="button" tabindex="0" class="oi oj fg ok bg ol"><div class="nz oa ob"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*0EF4tpjprbMxn8rF_pKyFQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*0EF4tpjprbMxn8rF_pKyFQ.png" /></picture></div></div></figure><h1 id="6d9e" class="on oo gr be op oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk bj">Introduction</h1><p id="555e" class="pw-post-body-paragraph na nb gr nc b nd pl nf ng nh pm nj nk nl pn nn no np po nr ns nt pp nv nw nx gk bj">Earlier this summer Netflix held our first-ever Data Engineering Forum. Engineers from across the company came together to share best practices on everything from Data Processing Patterns to Building Reliable Data Pipelines. 
The result was a series of talks which we are now sharing with the rest of the Data Engineering community!</p><p id="fdbd" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">You can find each of the talks below with a short description of each, or you can go straight to the playlist on YouTube <a class="af ny" href="https://www.youtube.com/watch?v=SJxRd1uHAkA&amp;list=PLSECvWLlUYeF06QK5FOOELvgKdap3cQf0" rel="noopener ugc nofollow" target="_blank">here</a>.</p><h1 id="e22d" class="on oo gr be op oq or os ot ou ov ow ox oy oz pa pb pc pd pe pf pg ph pi pj pk bj">The Talks</h1><p id="234f" class="pw-post-body-paragraph na nb gr nc b nd pl nf ng nh pm nj nk nl pn nn no np po nr ns nt pp nv nw nx gk bj"><a class="af ny" href="https://youtu.be/QxaOlmv79ls" rel="noopener ugc nofollow" target="_blank">The Netflix Data Engineering Stack</a></p><p id="8780" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Chris Stephens, Data Engineer, Content &amp; Studio and Pedro Duarte, Software Engineer, Consolidated Logging walk engineers new to Netflix through the building blocks of the Netflix Data Engineering stack. 
Learn more about how batch and streaming data pipelines are built at Netflix.</p><p id="d91c" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://www.youtube.com/watch?v=vuyjK2TFZNk&amp;list=PLSECvWLlUYeF06QK5FOOELvgKdap3cQf0&amp;index=3" rel="noopener ugc nofollow" target="_blank">Data Processing Patterns</a></p><p id="a4d0" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Lee Woodridge and Pallavi Phadnis, Data Engineers at Netflix, talk about how you can apply different processing strategies for your batch pipelines by implementing generic abstractions to help scale, be more efficient, handle late-arriving data, and be more fault tolerant.</p><p id="8737" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://www.youtube.com/watch?v=TwcWvwU7B64&amp;list=PLSECvWLlUYeF06QK5FOOELvgKdap3cQf0&amp;index=4" rel="noopener ugc nofollow" target="_blank">Streaming SQL on Data Mesh using Apache Flink</a></p><p id="f912" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Mark Cho, Guil Pires and Sujay Jain, Engineers from the Netflix Data Platform talk about how a managed Streaming SQL using Apache Flink can help unlock new Stream Processing use cases at Netflix. 
You can read more about Data Mesh, Netflix’s next-generation stream processing platform,<a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873"> here</a></p><p id="c727" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://www.youtube.com/watch?v=uWmJxbhI304&amp;list=PLSECvWLlUYeF06QK5FOOELvgKdap3cQf0&amp;index=5" rel="noopener ugc nofollow" target="_blank">Building Reliable Data Pipelines</a></p><p id="970f" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Holden Karau, OSS Engineer, Data Platform Engineering, talks about the importance of reliable data pipelines and how to build them covering tools from testing to validation and auditing. The talk uses Apache Spark as an example, but the concepts generalize regardless of your specific tools.</p><p id="44a6" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://www.youtube.com/watch?v=F4N8AmScZ-w&amp;list=PLSECvWLlUYeF06QK5FOOELvgKdap3cQf0&amp;index=6" rel="noopener ugc nofollow" target="_blank">Knowledge Management — Leveraging Institutional Data</a></p><p id="bf04" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Tristan Reid, software engineer, shares experiences about the Knowledge Management project at Netflix, which seeks to leverage language modeling techniques and metadata from internal systems to improve the impact of the &gt;100K memos that circulate within the company.</p><p id="6490" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" 
href="https://www.youtube.com/watch?v=jRckeOedtx0&amp;list=PLSECvWLlUYeF06QK5FOOELvgKdap3cQf0&amp;index=8" rel="noopener ugc nofollow" target="_blank">Psyberg, An Incremental ETL Framework Using Iceberg</a></p><p id="40bb" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Abhinaya Shetty and Bharath Mummadisetty, Data Engineers from Netflix’s Membership Data Engineering team, introduce Psyberg, an incremental ETL framework. Learn about how Psyberg leverages Iceberg metadata to handle late-arriving data, and improves data pipelines while simplifying on-call life!</p><p id="e542" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://www.youtube.com/watch?v=Dr8LMn-nJGc&amp;list=PLSECvWLlUYeF06QK5FOOELvgKdap3cQf0&amp;index=9" rel="noopener ugc nofollow" target="_blank">Start/Stop/Continue for optimizing complex ETL jobs</a></p><p id="5033" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Judit Lantos, Data Engineer, Member Experience Data Engineering, shares a case study to demonstrate an effective approach for optimizing complex ETL jobs.</p><p id="22b0" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://youtu.be/1gGi3NBZk7M" rel="noopener ugc nofollow" target="_blank">Media Data for ML Studio Creative Production</a></p><p id="0c09" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In the last 2 decades, Netflix has revolutionized the way video content is consumed; however, there is significant work to be done in revolutionizing how movies and TV shows are made. In this video, Sr. 
Data Engineers Amanual Kahsay and Dao Mi showcase how data and insights are being utilized to accomplish such a vision.</p><p id="e317" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We hope that our fellow members of the Data Engineering Community find these videos useful and engaging. Please follow our Netflix Data<a class="af ny" href="https://twitter.com/netflixdata" rel="noopener ugc nofollow" target="_blank"> Twitter account</a> for updates and notifications of future Data Engineering Summits!</p><p id="804c" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><a class="af ny" href="https://www.linkedin.com/in/mdreeling/" rel="noopener ugc nofollow" target="_blank">Mick Dreeling</a>, <a class="af ny" href="https://www.linkedin.com/in/chris-colburn/" rel="noopener ugc nofollow" target="_blank">Chris Colburn</a></p></div>]]></description>
      <link>https://netflixtechblog.com/our-first-netflix-data-engineering-summit-f326b0589102</link>
      <guid>https://netflixtechblog.com/our-first-netflix-data-engineering-summit-f326b0589102</guid>
      <pubDate>Thu, 14 Dec 2023 17:54:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[All of Netflix’s HDR video streaming is now dynamically optimized]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg fw fx fy fz"><div><div class="hs ht hu hv hw"></div><p id="338d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">by <a class="af ny" href="https://www.linkedin.com/in/aditya-mavlankar-7139791/" rel="noopener ugc nofollow" target="_blank">Aditya Mavlankar</a>, <a class="af ny" href="https://www.linkedin.com/in/henryzhili/" rel="noopener ugc nofollow" target="_blank">Zhi Li</a>, <a class="af ny" href="https://www.linkedin.com/in/luk%C3%A1%C5%A1-krasula-a0171b6a/" rel="noopener ugc nofollow" target="_blank">Lukáš Krasula</a> and <a class="af ny" href="https://www.linkedin.com/in/christosbampis/" rel="noopener ugc nofollow" target="_blank">Christos Bampis</a></p><p id="7b58" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj nz">High dynamic range (<a class="af ny" href="https://developer.apple.com/videos/play/tech-talks/502/" rel="noopener ugc nofollow" target="_blank">HDR</a>) video brings a wider range of luminance and a wider gamut of colors, paving the way for a stunning viewing experience. Separately, our invention of Dynamically Optimized (<a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/dynamic-optimizer-a-perceptual-video-encoding-optimization-framework-e19f1e3a277f">DO</a>) encoding helps achieve optimized bitrate-quality tradeoffs depending on the complexity of the content.</p><p id="64cc" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">HDR was launched at Netflix in 2016 and the number of titles available in HDR has been growing ever since. 
We were, however, missing the systematic ability to measure perceptual quality (<a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/vmaf-the-journey-continues-44b51ee9ed12">VMAF</a>) of HDR streams since VMAF was limited to standard dynamic range (SDR) video signals.</p><p id="269f" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">As noted in <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/optimized-shot-based-encodes-for-4k-now-streaming-47b516b10bbb">an earlier blog post</a>, we began developing an HDR variant of VMAF; let’s call it HDR-VMAF. A vital aspect of such development is subjective testing with HDR encodes in order to generate training data. The pandemic, however, posed unique challenges in conducting a conventional in-lab subjective test with HDR encodes. We improvised as part of a collaborative effort with Dolby Laboratories and conducted subjective tests with 4K-HDR content using high-end OLED panels in calibrated conditions created in participants’ homes [1],[2]. Details pertaining to HDR-VMAF exceed the scope of this article and will be covered in a future blog post; for now, suffice it to say that the first version of HDR-VMAF landed internally in 2021 and we have been improving the metric ever since.</p><p id="16ba" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The arrival of HDR-VMAF allowed us to create HDR streams with DO applied, i.e., HDR-DO encodes. Prior to that, we were using a fixed ladder with predetermined bitrates — regardless of content characteristics — for HDR video streaming. We A/B tested HDR-DO encodes in production in Q3-Q4 2021, followed by improving the ladder generation algorithm further in early 2022. We started backfilling HDR-DO encodes for existing titles from Q2 2022. 
By June 2023 the entire HDR catalog was optimized. The graphic below (Fig. 1) depicts the migration of traffic from fixed bitrates to DO encodes.</p></div></div><div class="oi"><div class="ab ca"><div class="md oj me ok mf ol ce om cf on ch bg"><figure class="or os ot ou ov oi ow ox paragraph-image"><div role="button" tabindex="0" class="oy oz fg pa bg pb"><div class="oo op oq"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*B5i80MShTsFKuNJNGXjYZQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*B5i80MShTsFKuNJNGXjYZQ.png" /></picture></div></div><figcaption class="pd fc pe oo op pf pg be b bf z dt">Fig. 1: Migration of traffic from fixed-ladder encodes to DO encodes.</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><h1 id="6b4a" class="ph pi gr be pj pk pl pm pn po pp pq pr ps pt pu pv pw px py pz qa qb qc qd qe bj">Bitrate versus quality comparison</h1><p id="2230" class="pw-post-body-paragraph na nb gr nc b nd qf nf ng nh qg nj nk nl qh nn no np qi nr ns nt qj nv nw nx gk bj">HDR-VMAF is designed to be format-agnostic — it measures the perceptual quality of HDR video signal regardless of its container format, for example, Dolby Vision or HDR10. HDR-VMAF focuses on the signal characteristics (as a result of lossy encoding) instead of display characteristics, and thus it does not include display mapping in its pipeline. Display mapping is the specific tone mapping applied by the display based on its own characteristics — peak luminance, black level, color gamut, etc. 
— and based on content characteristics and/or metadata signaled in the bitstream.</p><p id="8e22" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Two ways that HDR10 and Dolby Vision differ are: <strong class="nc gs">1)</strong> the preprocessing applied to the signal before encoding <strong class="nc gs">2)</strong> the metadata informing the display mapping on different displays. So, HDR-VMAF will capture the effect of <strong class="nc gs">1)</strong> but ignore the effect of <strong class="nc gs">2)</strong>. Display capabilities vary a lot among the heterogeneous population of devices that stream HDR content — this aspect is similar to other factors that vary session to session such as ambient lighting, viewing distance, upscaling algorithm on the device, etc. “VMAF not incorporating display mapping” implies the scores are computed for an “ideal display” that’s capable of representing the entire luminance range and the entire color gamut spanned by the video signal — thus not requiring display mapping. This background is useful to have before looking at rate vs quality curves pertaining to these two formats.</p><p id="9d45" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Shown below are rate versus quality examples for a couple of titles from our HDR catalog. We present two sets. Within each set we show curves for both Dolby Vision and HDR10. The first set (Fig. 2) corresponds to an episode from a gourmet cooking show incorporating fast-paced scenes from around the world. The second set (Fig. 3) corresponds to an episode from a relatively slower drama series; slower in terms of camera action. 
The optimized encodes are chosen from the convex hull formed by various rate-quality points corresponding to different bitrates, spatial resolutions and encoding recipes.</p><p id="16c3" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">For brevity, we skipped annotating ladder points with their spatial resolutions, but the overall observations from our <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/optimized-shot-based-encodes-for-4k-now-streaming-47b516b10bbb">previous article on SDR-4K encode optimization</a> apply here as well. The fixed ladder is slow in ramping up spatial resolution, so the quality stays almost flat between two successive 1080p points or two successive 4K points. On the other hand, the optimized ladder presents a sharper increase in quality with increasing bitrate.</p><p id="9be5" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The fixed ladder has predetermined 4K bitrates — 8, 10, 12 and 16 Mbps — so it deterministically maxes out at 16 Mbps. On the other hand, the optimized ladder targets very high levels of quality on the top rung of the bitrate ladder, even at the cost of higher bitrates if the content is complex, thereby satisfying the most discerning viewers. In spite of reaching higher qualities than the fixed ladder, the HDR-DO ladder, on average, occupies only 58% of the storage space compared to the fixed-bitrate ladder. This is achieved by more efficiently spacing the ladder points, especially in the high-bitrate region.
After all, there is little to no benefit in packing multiple high-bitrate points so close to each other — for example, 3 QHD (2560x1440) points placed in the 6 to 7.5 Mbps range followed by the four 4K points at 8, 10, 12 and 16 Mbps, as was done on the fixed ladder.</p></div></div><div class="oi"><div class="ab ca"><div class="md oj me ok mf ol ce om cf on ch bg"><figure class="or os ot ou ov oi ow ox paragraph-image"><div role="button" tabindex="0" class="oy oz fg pa bg pb"><div class="oo op qk"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*fLwwhH92O-BRL4wHRQVsTA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*fLwwhH92O-BRL4wHRQVsTA.png" /></picture></div></div></figure><figure class="lv oi ow ox paragraph-image"><div role="button" tabindex="0" class="oy oz fg pa bg pb"><div class="oo op qk"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*M9UlDHYGgigS5dBRzrPlzw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*M9UlDHYGgigS5dBRzrPlzw.png" /></picture></div></div><figcaption class="pd fc pe oo op pf pg be b bf z dt">Fig. 
2: Rate-quality curves comparing fixed and optimized ladders corresponding to an episode from a gourmet cooking show incorporating fast-paced scenes from around the world.</figcaption></figure><figure class="lv oi ow ox paragraph-image"><div role="button" tabindex="0" class="oy oz fg pa bg pb"><div class="oo op qk"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*IYwzs6A898-cyecPNSwn8A.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*IYwzs6A898-cyecPNSwn8A.png" /></picture></div></div></figure><figure class="lv oi ow ox paragraph-image"><div role="button" tabindex="0" class="oy oz fg pa bg pb"><div class="oo op qk"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*uEAQep8dh_iIglpAwpWpyw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*uEAQep8dh_iIglpAwpWpyw.png" /></picture></div></div><figcaption class="pd fc pe oo op pf pg be b bf z dt">Fig. 3: Rate-quality curves comparing fixed and optimized ladders corresponding to an episode from a drama series, which is slower in terms of camera action.</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="2dd7" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">It is important to note that the fixed-ladder encodes had constant duration group-of-pictures (GoPs) and suffered from some inefficiency due to shot boundaries not aligning with Instantaneous Decoder Refresh (IDR) frames. The DO encodes are shot-based and so the IDR frames align with shot boundaries. For a given rate-quality operating point, the DO process helps allocate bits among the various shots while maximizing an overall objective function. 
Also thanks to the DO framework, within a given rate-quality operating point, <em class="ql">challenging shots</em> can and do burst in bitrate up to the <em class="ql">codec level limit</em> associated with that point.</p><h1 id="3fe0" class="ph pi gr be pj pk pl pm pn po pp pq pr ps pt pu pv pw px py pz qa qb qc qd qe bj">Member benefits</h1><p id="4c66" class="pw-post-body-paragraph na nb gr nc b nd qf nf ng nh qg nj nk nl qh nn no np qi nr ns nt qj nv nw nx gk bj">We A/B tested the fixed and optimized ladders; first and foremost to make sure that devices in the field can handle the new streams and serving new streams doesn’t cause unintended playback issues. A/B testing also allows us to get a read on the improvement in quality of experience (QoE). Overall, the improvements can be summarized as:</p><ul class=""><li id="c8c0" class="na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx qm qn qo bj">40% fewer rebuffers</li><li id="d123" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">Higher video quality for both bandwidth-constrained as well as unconstrained sessions</li><li id="c414" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">Lower initial bitrate</li><li id="865a" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">Higher initial quality</li><li id="1edf" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">Lower play delay</li><li id="826f" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">Less variation in delivered video quality</li><li id="2f8c" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">Lower Internet data usage, especially on mobiles and tablets</li></ul><h1 id="f4e9" class="ph pi gr be pj pk pl pm pn po pp pq pr ps pt pu pv pw px py pz qa qb 
qc qd qe bj">Will HDR-VMAF be open-source?</h1><p id="5f13" class="pw-post-body-paragraph na nb gr nc b nd qf nf ng nh qg nj nk nl qh nn no np qi nr ns nt qj nv nw nx gk bj">Yes, we are committed to supporting the open-source community. The current implementation, however, is largely tailored to our internal pipelines. We are working to ensure it is versatile, stable, and easy to use for the community. Additionally, the current version has some algorithmic limitations that we are in the process of improving before the official release. When we do release it, HDR-VMAF will have higher accuracy in perceptual quality prediction, and be easier to use “out of the box”.</p><h1 id="6553" class="ph pi gr be pj pk pl pm pn po pp pq pr ps pt pu pv pw px py pz qa qb qc qd qe bj">Summary</h1><p id="8182" class="pw-post-body-paragraph na nb gr nc b nd qf nf ng nh qg nj nk nl qh nn no np qi nr ns nt qj nv nw nx gk bj">Thanks to the arrival of HDR-VMAF, we were able to optimize our HDR encodes. Fixed-ladder HDR encodes have been fully replaced by optimized ones, reducing storage footprint and Internet data usage — and most importantly, improving the video quality for our members. Improvements have been seen across all device categories ranging from TVs to mobiles and tablets.</p><h1 id="5624" class="ph pi gr be pj pk pl pm pn po pp pq pr ps pt pu pv pw px py pz qa qb qc qd qe bj">Acknowledgments</h1><p id="22f0" class="pw-post-body-paragraph na nb gr nc b nd qf nf ng nh qg nj nk nl qh nn no np qi nr ns nt qj nv nw nx gk bj">We thank all the volunteers who participated in the subjective experiments.
We also want to acknowledge the contributions of our colleagues from Dolby, namely Anustup Kumar Choudhury, Scott Daly, Robin Atkins, Ludovic Malfait, and Suzanne Farrell, who helped with preparing and conducting the subjective tests.</p><p id="3c12" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We thank Matthew Donato, Adithya Prakash, Rich Gerber, Joe Drago, Benbuck Nason and Joseph McCormick for all the interesting discussions on HDR video.</p><p id="3dd4" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We thank various internal teams at Netflix for the crucial roles they play:</p><ul class=""><li id="46b0" class="na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx qm qn qo bj">The various <a class="af ny" href="https://jobs.netflix.com/team?slug=client-and-ui-engineering" rel="noopener ugc nofollow" target="_blank">client device and UI engineering</a> teams at Netflix that manage the Netflix experience on various device platforms</li><li id="3178" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">The <a class="af ny" href="https://jobs.netflix.com/team?slug=data-science-and-engineering" rel="noopener ugc nofollow" target="_blank">data science and engineering</a> teams at Netflix that help us run and analyze A/B tests; we thank Chris Pham in particular for generating various data insights for the encoding team</li><li id="fb7b" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">The <a class="af ny" href="https://www.youtube.com/watch?v=5ju4W9KAzcY" rel="noopener ugc nofollow" target="_blank">Playback Systems</a> team that steers the Netflix experience for every client device including the experience served in various encoding A/B tests</li><li id="c85d" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no
np qs nr ns nt qt nv nw nx qm qn qo bj">The <a class="af ny" href="https://openconnect.netflix.com/en/" rel="noopener ugc nofollow" target="_blank">Open Connect</a> team that manages Netflix’s own content delivery network</li><li id="07f0" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">The Content Infrastructure and Solutions team that manages the <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/the-netflix-cosmos-platform-35c14d9351ad">compute platform</a> that enables us to execute video encoding at scale</li><li id="70c3" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj">The Streaming Encoding Pipeline team that helps us orchestrate the generation of various streaming assets</li></ul><p id="59f7" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><em class="ql">Find our work interesting? Join us and be a part of the amazing team that brought you this tech-blog; open positions:</em></p><ul class=""><li id="f7d0" class="na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx qm qn qo bj"><a class="af ny" href="https://jobs.netflix.com/jobs/305482718" rel="noopener ugc nofollow" target="_blank"><em class="ql">Software Engineer, Cloud Gaming</em></a></li><li id="c46a" class="na nb gr nc b nd qp nf ng nh qq nj nk nl qr nn no np qs nr ns nt qt nv nw nx qm qn qo bj"><a class="af ny" href="https://jobs.netflix.com/jobs/296600425" rel="noopener ugc nofollow" target="_blank"><em class="ql">Software Engineer, Live Streaming</em></a></li></ul><h1 id="9a09" class="ph pi gr be pj pk pl pm pn po pp pq pr ps pt pu pv pw px py pz qa qb qc qd qe bj">References</h1><p id="c61d" class="pw-post-body-paragraph na nb gr nc b nd qf nf ng nh qg nj nk nl qh nn no np qi nr ns nt qj nv nw nx gk bj"><strong class="nc gs">[1]</strong> L. Krasula, A. Choudhury, S. 
Daly, Z. Li, R. Atkins, L. Malfait, A. Mavlankar, “Subjective video quality for 4K HDR-WCG content using a browser-based approach for “at-home” testing,” Electronic Imaging, vol. 35, pp. 263–1–8 (2023) [<a class="af ny" href="https://library.imaging.org/admin/apis/public/api/ist/website/downloadArticle/ei/35/8/IQSP-263" rel="noopener ugc nofollow" target="_blank">online</a>]<br /><strong class="nc gs">[2]</strong> A. Choudhury, L. Krasula, S. Daly, Z. Li, R. Atkins, L. Malfait, “Testing 4K HDR-WCG professional video content for subjective quality using a remote testing approach,” SMPTE Media Technology Summit 2023</p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/all-of-netflixs-hdr-video-streaming-is-now-dynamically-optimized-e9e0cb15f2ba</link>
      <guid>https://netflixtechblog.com/all-of-netflixs-hdr-video-streaming-is-now-dynamically-optimized-e9e0cb15f2ba</guid>
      <pubDate>Wed, 29 Nov 2023 22:26:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Netflix Original Research: MIT CODE 2023]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="7889" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Netflix was thrilled to be the premier sponsor for the 2nd year in a row at the <a class="af ny" href="https://ide.mit.edu/events/2023-conference-on-digital-experimentation-mit-codemit/" rel="noopener ugc nofollow" target="_blank">2023 Conference on Digital Experimentation</a> (CODE@MIT) in Cambridge, MA. The conference features a balanced blend of academic and industry research from some<em class="nz"> wicked smart</em> folks, and we’re proud to have contributed a number of talks and posters along with a plenary session.</p><p id="5823" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Our contributions kicked off with a concept that is crucial to our understanding of A/B tests: surrogates!</p><p id="f3e3" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Our first talk was given by <a class="af ny" href="https://www.linkedin.com/in/aurelien-bibaut/" rel="noopener ugc nofollow" target="_blank">Aurelien Bibaut</a> (with co-authors <a class="af ny" href="https://www.linkedin.com/in/kallus/" rel="noopener ugc nofollow" target="_blank">Nathan Kallus</a>, <a class="af ny" href="https://www.linkedin.com/in/simon-ejdemyr-22b920123/" rel="noopener ugc nofollow" target="_blank">Simon Ejdemyr</a> and <a class="af ny" href="https://www.linkedin.com/in/mfzhao/" rel="noopener ugc nofollow" target="_blank">Michael Zhao</a>) in which we discussed how to confidently measure long-term outcomes using short term surrogates in the presence of bias. For example, how do we estimate the effects of innovations on retention a year later without running all our experiments for a year? 
We proposed an estimation method using cross-fold procedures and constructed valid confidence intervals for long-term effects before those effects are fully observed.</p><figure class="od oe of og oh oi oa ob paragraph-image"><div role="button" tabindex="0" class="oj ok fg ol bg om"><div class="oa ob oc"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*fM3Bn8JL-cpOzTM2aXKBHg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*fM3Bn8JL-cpOzTM2aXKBHg.png" /></picture></div></div></figure><p id="ffac" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Later on, <a class="af ny" href="https://www.linkedin.com/in/mfzhao/" rel="noopener ugc nofollow" target="_blank">Michael Zhao</a> (with <a class="af ny" href="https://www.linkedin.com/in/zhangvickie/" rel="noopener ugc nofollow" target="_blank">Vickie Zhang</a>, <a class="af ny" href="https://www.linkedin.com/in/anhqle/" rel="noopener ugc nofollow" target="_blank">Anh Le</a> and <a class="af ny" href="https://www.linkedin.com/in/kallus/" rel="noopener ugc nofollow" target="_blank">Nathan Kallus</a>) spoke about the evaluation of surrogate index models for product decision making. Using 200 real A/B tests performed at Netflix, we showed that surrogate-index models, constructed using only 2 weeks of data, led to the same product ship decisions ~95% of the time when compared to making a call based on 2 months of data. This means we can reliably run shorter tests with confidence without needing to wait months for results!</p><p id="55dd" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Our next topic focused on how to understand and balance competing engagement metrics; for example, should 1 hour of gaming equal 1 hour of streaming?
<a class="af ny" href="https://www.linkedin.com/in/mfzhao/" rel="noopener ugc nofollow" target="_blank">Michael Zhao</a> and <a class="af ny" href="https://www.linkedin.com/in/jjschafer/" rel="noopener ugc nofollow" target="_blank">Jordan Schafer</a> shared a poster on how they built an Overall Evaluation Criterion (OEC) metric that provides holistic evaluation for A/B tests, appropriately weighting different engagement metrics to serve a single overall objective. This new framework has enabled fast and confident decision making in tests, and is being actively adapted as our business continues to expand into new areas.</p><figure class="od oe of og oh oi oa ob paragraph-image"><div role="button" tabindex="0" class="oj ok fg ol bg om"><div class="oa ob oo"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*3i3cVtVNRvX4xNd3%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*3i3cVtVNRvX4xNd3%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*3i3cVtVNRvX4xNd3%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*3i3cVtVNRvX4xNd3%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*3i3cVtVNRvX4xNd3%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*3i3cVtVNRvX4xNd3%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*3i3cVtVNRvX4xNd3%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*3i3cVtVNRvX4xNd3 640w, 
https://miro.medium.com/v2/resize:fit:720/0*3i3cVtVNRvX4xNd3 720w, https://miro.medium.com/v2/resize:fit:750/0*3i3cVtVNRvX4xNd3 750w, https://miro.medium.com/v2/resize:fit:786/0*3i3cVtVNRvX4xNd3 786w, https://miro.medium.com/v2/resize:fit:828/0*3i3cVtVNRvX4xNd3 828w, https://miro.medium.com/v2/resize:fit:1100/0*3i3cVtVNRvX4xNd3 1100w, https://miro.medium.com/v2/resize:fit:1400/0*3i3cVtVNRvX4xNd3 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="3854" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In the second plenary session of the day, <a class="af ny" href="https://www.linkedin.com/in/martintingley/" rel="noopener ugc nofollow" target="_blank">Martin Tingley</a> took us on a compelling and fun journey of complexity, exploring key challenges in digital experimentation and how they differ from the challenges faced by agricultural researchers a century ago. 
He highlighted different areas of complexity and provided perspectives on how to tackle the right challenges based on business objectives.</p><figure class="od oe of og oh oi oa ob paragraph-image"><div role="button" tabindex="0" class="oj ok fg ol bg om"><div class="oa ob op"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*ImGNPn14QO4HtOc3%20640w,%20https://miro.medium.com/v2/resize:fit:720/format:webp/0*ImGNPn14QO4HtOc3%20720w,%20https://miro.medium.com/v2/resize:fit:750/format:webp/0*ImGNPn14QO4HtOc3%20750w,%20https://miro.medium.com/v2/resize:fit:786/format:webp/0*ImGNPn14QO4HtOc3%20786w,%20https://miro.medium.com/v2/resize:fit:828/format:webp/0*ImGNPn14QO4HtOc3%20828w,%20https://miro.medium.com/v2/resize:fit:1100/format:webp/0*ImGNPn14QO4HtOc3%201100w,%20https://miro.medium.com/v2/resize:fit:1400/format:webp/0*ImGNPn14QO4HtOc3%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*ImGNPn14QO4HtOc3 640w, https://miro.medium.com/v2/resize:fit:720/0*ImGNPn14QO4HtOc3 720w, https://miro.medium.com/v2/resize:fit:750/0*ImGNPn14QO4HtOc3 750w, https://miro.medium.com/v2/resize:fit:786/0*ImGNPn14QO4HtOc3 786w, https://miro.medium.com/v2/resize:fit:828/0*ImGNPn14QO4HtOc3 828w, https://miro.medium.com/v2/resize:fit:1100/0*ImGNPn14QO4HtOc3 1100w, https://miro.medium.com/v2/resize:fit:1400/0*ImGNPn14QO4HtOc3 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and 
(max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="c255" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Our final talk was given by <a class="af ny" href="https://www.linkedin.com/in/apoorvalal/" rel="noopener ugc nofollow" target="_blank">Apoorva Lal</a> (with co-authors <a class="af ny" href="https://www.linkedin.com/in/samir-khan-9536a9175/" rel="noopener ugc nofollow" target="_blank">Samir Khan</a> and <a class="af ny" href="https://www.linkedin.com/in/jugander/" rel="noopener ugc nofollow" target="_blank">Johan Ugander</a>) in which we show how partial identification of the dose-response function (DRF) under non-parametric assumptions can be used to provide more insightful analyses of experimental data than the standard ATE analysis does. 
We revisited a study that reduced like-minded content algorithmically, and showed how we could extend the binary ATE learning to answer how the amount of like-minded content a user sees affects their political attitudes.</p><p id="663f" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We had a blast connecting with the CODE@MIT community and bonding over our shared enthusiasm for not only rigorous measurement in experimentation, but also stats-themed stickers and swag!</p><figure class="od oe of og oh oi oa ob paragraph-image"><div class="oa ob oq"><picture><img src="https://miro.medium.com/v2/resize:fit:1280/0*87ZdDoYnh5R3XGhS" alt="image" /></picture></div><figcaption class="or fc os oa ob ot ou be b bf z dt"><em class="ov">One of our stickers this year. Can you guess what it is showing?</em></figcaption></figure><p id="bdd4" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We look forward to next year’s iteration of the conference and hope to see you there!</p><p id="6b6a" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><em class="nz">Psst! We’re hiring Data Scientists across a variety of domains at Netflix — check out our </em><a class="af ny" href="https://jobs.netflix.com/search?q=data+scientist" rel="noopener ugc nofollow" target="_blank"><em class="nz">open roles.</em></a></p></div>]]></description>
      <link>https://netflixtechblog.com/netflix-original-research-mit-code-2023-9340b879176a</link>
      <guid>https://netflixtechblog.com/netflix-original-research-mit-code-2023-9340b879176a</guid>
      <pubDate>Mon, 27 Nov 2023 21:59:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Causal Machine Learning for Creative Insights]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="a99e" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">A framework to identify the causal impact of successful visual components.</strong></p><p id="ab02" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">By <a class="af nz" href="https://www.linkedin.com/in/billurengin/" rel="noopener ugc nofollow" target="_blank">Billur Engin</a>, <a class="af nz" href="https://www.linkedin.com/in/yinghong-lan-2368656b/" rel="noopener ugc nofollow" target="_blank">Yinghong Lan</a>, <a class="af nz" href="https://www.linkedin.com/in/tsmgrace/" rel="noopener ugc nofollow" target="_blank">Grace Tang</a>, <a class="af nz" href="https://www.linkedin.com/in/cristinasegalin/" rel="noopener ugc nofollow" target="_blank">Cristina Segalin</a>, <a class="af nz" href="https://www.linkedin.com/in/kelli-griggs-32990125/" rel="noopener ugc nofollow" target="_blank">Kelli Griggs</a>, <a class="af nz" href="https://www.linkedin.com/in/vi-pallavika-iyengar-144abb1b/" rel="noopener ugc nofollow" target="_blank">Vi Iyengar</a></p><p id="fb24" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Introduction</strong></p><p id="d5c7" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">At Netflix, we want our viewers to easily find TV shows and movies that resonate and engage. Our creative team helps make this happen by designing promotional artwork that best represents each title featured on our platform. What if we could use machine learning and computer vision to support our creative team in this process? 
Through identifying the components that contribute to a successful artwork — one that leads a member to choose and watch it — we can give our creative team data-driven insights to incorporate into their creative strategy, and help in their selection of which artwork to feature.</p><p id="5bfe" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">We begin with the assumption that the presence of a specific component contributes to an artwork’s success. We will discuss a causal framework that helps us find and summarize the successful components as creative insights, and hypothesize and estimate their impact.</p><p id="ac1e" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">The Challenge</strong></p><p id="71fe" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Given Netflix’s vast and increasingly diverse catalog, it is a challenge to design experiments that both work within an A/B test framework and are representative of all genres, plots, artists, and more. In the past, we have attempted to design A/B tests that investigate one aspect of artwork at a time, often within one particular genre. However, this approach has a major drawback: it is not scalable, because we either have to label images manually or create new asset variants differing only in the feature under investigation. The manual nature of these tasks means that we cannot test many titles at a time. Furthermore, given the multidimensional nature of artwork, we might be missing many other factors that could explain an artwork’s success, such as figure orientation, the color of the background, facial expressions, etc. 
Since we want to ensure that our testing framework allows for maximum creative freedom, and avoid any interruption to the design process, we decided to try an alternative approach.</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div role="button" tabindex="0" class="om on fg oo bg op"><div class="oj ok ol"><picture><img src="https://miro.medium.com/v2/resize:fit:1400/0*sVVL2kJmGUVxZQnO" alt="image" /></picture></div></div></figure><p id="d6c0" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Figure. </strong>Given the multidimensional nature of artwork, it is challenging to design an A/B test that investigates one aspect of artwork at a time. We could be missing many other factors that might explain an artwork’s success, such as figure orientation, the color of the background, facial expressions, etc.</p><p id="317a" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">The Causal Framework</strong></p><p id="e488" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Thanks to our <a class="af nz" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">Artwork Personalization System</a> and vision algorithms (some of which are <a class="af nz" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/ava-the-art-and-science-of-image-discovery-at-netflix-a442f163af6">exemplified here</a>), we have a rich dataset of promotional artwork components and user engagement data on which to build a causal framework. Utilizing this dataset, we have developed a framework to test creative insights and estimate their causal impact on an artwork’s performance via the dataset generated through our recommendation system. 
In other words, we can learn which attributes led to a title’s successful selection based on its artwork.</p><p id="e4ba" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Let’s first explore the workflow of the causal framework, as well as the data and success metrics that power it.</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div role="button" tabindex="0" class="om on fg oo bg op"><div class="oj ok or"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*5-vn-ifo4UUicfRgLjg0FQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*5-vn-ifo4UUicfRgLjg0FQ.png" /></picture></div></div></figure><p id="4152" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">We represent the success of an artwork with the take rate: the probability that an average user watches the promoted title after seeing its promotional artwork, adjusted for the popularity of the title. Every show on our platform has multiple promotional artwork assets. Using Netflix’s <a class="af nz" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">Artwork Personalization</a>, we serve these assets to hundreds of millions of members every day. 
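As a hedged illustration of a popularity-adjusted take rate: the exact Netflix adjustment is not described here, so the normalization and every number below are hypothetical.

```python
import numpy as np

# Hypothetical impression log: one row per promotional asset.
plays = np.array([120, 80, 400])                     # plays attributed to each asset
impressions = np.array([10_000, 9_000, 25_000])      # times each asset was shown
title_popularity = np.array([0.010, 0.010, 0.018])   # baseline play rate of each title

raw_take_rate = plays / impressions
# Divide by the title's baseline popularity so that assets promoting very
# popular titles and niche titles become comparable: values above 1 mean the
# asset outperformed its title's baseline.
adjusted_take_rate = raw_take_rate / title_popularity
print(adjusted_take_rate.round(2))
```

Under this toy normalization, the first asset outperforms its title's baseline while the other two underperform theirs.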
To power this recommendation system, we look at user engagement patterns and see whether these engagements with artworks resulted in a successful title selection.</p><p id="9daa" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">With the capability to annotate a given image — in this case, an artwork asset (some of these techniques are mentioned in <a class="af nz" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/ava-the-art-and-science-of-image-discovery-at-netflix-a442f163af6#:~:text=editorial%20image%20candidates-,frame%20annotation,-As%20part%20of">an earlier post</a>) — we use a series of computer vision algorithms to gather objective image metadata, a latent representation of the image, as well as some of the contextual metadata that a given image contains. This process allows our dataset to consist of both image features and user data, all in an effort to understand which image components lead to successful user engagement. We also utilize machine learning algorithms, consumer insights¹, and correlational analysis to discover high-level associations between image features and an artwork’s success. These statistically significant associations become our hypotheses for the next phase.</p><p id="e164" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Once we have a specific hypothesis, we can test it by deploying causal machine learning algorithms. This framework reduces our experimental effort to uncover causal relationships, while taking into account confounding among the high-level variables (i.e. 
the variables that may influence both the treatment / intervention and outcome).</p><p id="d911" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">The Hypothesis and Assumptions</strong></p><p id="0268" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">We will use the following hypothesis in the rest of this post: <em class="os">the presence of a face in an artwork causally improves the asset performance</em>. (We know that <a class="af nz" href="https://about.netflix.com/en/news/the-power-of-a-picture#:~:text=emotions%20are%20an%20efficient%20way%20of%20conveying%20complex%20nuances" rel="noopener ugc nofollow" target="_blank">faces work well in artwork</a>, especially <a class="af nz" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/selecting-the-best-artwork-for-videos-through-a-b-testing-f6155c4595f6#:~:text=Unbreakable%20Kimmy%20Schmidt-,conclusion,-Over%20the%20course">images with an expressive facial emotion that’s in line with the tone of the title.</a>)</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok ot"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*S6h7dLUsWKqjRJ6HKbD6vw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*S6h7dLUsWKqjRJ6HKbD6vw.png" /></picture></div></figure><p id="579d" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Here are two promotional artwork assets from <em class="os">Unbreakable Kimmy Schmidt</em>. We know that the image on the left performed better than the image on the right. However, the difference between them is not only the presence of a face. There are many other differences, such as the background, text placement, font size, face size, etc. 
Causal Machine Learning makes it possible for us to understand an artwork’s performance based on the causal impact of its treatment.</p><p id="ac3f" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">To make sure our hypothesis is fit for the causal framework, it’s important we go over the <em class="os">identification assumptions</em>.</p><ul class=""><li id="b805" class="nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny ou ov ow bj"><strong class="nd gs">Consistency:</strong> The treatment component is sufficiently well-defined.</li></ul><p id="5f91" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">We use machine learning algorithms to predict whether or not the artwork contains a face. That’s why the first assumption we make is that our face detection algorithm is mostly accurate (~92% average precision).</p><ul class=""><li id="add7" class="nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny ou ov ow bj"><strong class="nd gs">Positivity / Probabilistic Assignment:</strong> Every unit (an artwork) has some chance of getting treated.</li></ul><p id="dcd8" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">We calculate the propensity score (the probability of receiving the treatment based on certain baseline characteristics) of having a face for samples with different covariates. 
If a certain subset of artwork (such as artwork from a certain genre) has close to a 0 or 1 propensity score for having a face, then we discard these samples from our analysis.</p><ul class=""><li id="3adb" class="nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny ou ov ow bj"><strong class="nd gs">Individualistic Assignment / SUTVA (stable unit treatment value assumption):</strong> The potential outcomes of a unit do not depend on the treatments assigned to others.</li></ul><p id="3fde" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Creatives make the decision to create artwork with or without faces based on considerations limited to the title of interest itself. This decision is not dependent on whether other assets have a face in them or not.</p><ul class=""><li id="2051" class="nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny ou ov ow bj"><strong class="nd gs">Conditional exchangeability (Unconfoundedness):</strong> There are no unmeasured confounders.</li></ul><p id="d428" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">This assumption is by definition not testable. Given a dataset, we can’t know if there has been an unobserved confounder. However, we can test the sensitivity of our conclusions toward the violation of this assumption in various different ways.</p><p id="5843" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">The Models</strong></p><p id="a1d9" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Now that we have established our hypothesis to be a causal inference problem, we can focus on the Causal Machine Learning Application. 
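The positivity screening described under the assumptions above can be sketched on simulated data; this is a minimal illustration, not our production setup, and the 0.05/0.95 cutoffs and the gradient-boosted propensity model are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 3000
W = rng.normal(size=(n, 4))                   # covariates, e.g. genre/country features
p_true = 1 / (1 + np.exp(-2 * W[:, 0]))       # faces are much likelier for some covariates
T = rng.binomial(1, p_true)                   # treatment: artwork contains a face

# Out-of-fold propensity scores, so each unit is scored by a model
# that never saw it during training.
p_hat = cross_val_predict(
    GradientBoostingClassifier(random_state=0), W, T, cv=5, method="predict_proba"
)[:, 1]

# Positivity check: discard units whose estimated propensity is close to 0 or 1,
# i.e. subsets of artwork that essentially always (or never) receive the treatment.
keep = (p_hat > 0.05) & (p_hat < 0.95)
W_kept, T_kept = W[keep], T[keep]
print(f"kept {keep.mean():.0%} of samples")
```

In practice the trimming thresholds are an analysis choice; tighter cutoffs trade sample size for a safer overlap region.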
Predictive Machine Learning (ML) models are great at finding patterns and associations in order to predict outcomes; however, they are not great at explaining cause-effect relationships, as their model structure does not reflect causality (the relationship between cause and effect). As an example, let’s say we looked at the price of Broadway theater tickets and the number of tickets sold. An ML algorithm may find a correlation between price increases and ticket sales. If we used this algorithm for decision making, we could falsely conclude that increasing the ticket price leads to higher ticket sales if we do not consider the confounder of show popularity, which clearly impacts both ticket prices and sales. It is understandable that a Broadway musical ticket may be more expensive if the show is a hit; however, simply increasing ticket prices to gain more customers would not work.</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok ox"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*xSGAfYbgHJSvzEKtEWz6tA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*xSGAfYbgHJSvzEKtEWz6tA.png" /></picture></div></figure><p id="d4b3" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Causal ML helps us estimate treatment effects from observational data, where it is challenging to conduct clean randomizations. 
Back-to-back publications on Causal ML, such as <a class="af nz" href="https://arxiv.org/abs/1608.00060" rel="noopener ugc nofollow" target="_blank">Double ML</a>, <a class="af nz" href="https://arxiv.org/abs/1510.04342" rel="noopener ugc nofollow" target="_blank">Causal Forests</a>, <a class="af nz" href="https://arxiv.org/pdf/1906.02120.pdf" rel="noopener ugc nofollow" target="_blank">Causal Neural Networks</a>, and many more, showcased a toolset for investigating treatment effects, via combining domain knowledge with ML in the learning system. Unlike predictive ML models, Causal ML explicitly controls for confounders, by modeling both treatment of interest as a function of confounders (i.e., propensity scores) as well as the impact of confounders on the outcome of interest. In doing so, Causal ML isolates out the <em class="os">causal </em>impact of treatment on outcome. Moreover, the estimation steps of Causal ML are carefully set up to achieve better error bounds for the estimated treatment effects, another consideration often overlooked in predictive ML. Compared to more traditional Causal Inference methods anchored on linear models, Causal ML leverages the latest ML techniques to not only better control for confounders (when propensity or outcome models are hard to capture by linear models) but also more flexibly estimate treatment effects (when treatment effect heterogeneity is nonlinear). 
In short, by utilizing machine learning algorithms, Causal ML provides researchers with a framework for understanding causal relationships with flexible ML methods.</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*uMVbtKsR3Wn8rHU1dYcA4g.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*uMVbtKsR3Wn8rHU1dYcA4g.png" /></picture></div></figure><p id="f3ee" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Y : outcome variable (take rate)<br />T : binary treatment variable (presence of a face or not)<br />W: a vector of covariates (features of the title and artwork)<br />X ⊆ W: a vector of covariates (a subset of W) along which treatment effect heterogeneity is evaluated</p><p id="7c41" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Let’s dive more into the causal ML (Double ML to be specific) application steps for creative insights.</p><ol class=""><li id="06cb" class="nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny oz ov ow bj">Build a propensity model to predict treatment probability (T) given the W covariates.</li></ol><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok pa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*BnL2Hs4-uFhPKc9mWQfSmw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*BnL2Hs4-uFhPKc9mWQfSmw.png" /></picture></div></figure><p id="95b6" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">2. 
Build a potential outcome model to predict Y given the W covariates.</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok pb"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*hwtidaJAN_JsJz7JNCDbyw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*hwtidaJAN_JsJz7JNCDbyw.png" /></picture></div></figure><p id="cb15" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">3. Residualization of</p><ul class=""><li id="fc1a" class="nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny ou ov ow bj">The treatment (observed T — predicted T via propensity model)</li><li id="aba4" class="nb nc gr nd b ne pc ng nh ni pd nk nl nm pe no np nq pf ns nt nu pg nw nx ny ou ov ow bj">The outcome (observed Y — predicted Y via potential outcome model)</li></ul><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok ph"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*85tC7OaDneD3qswh0K6Whw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*85tC7OaDneD3qswh0K6Whw.png" /></picture></div></figure><p id="0da9" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">4. Fit a third model on the residuals to predict the average treatment effect (ATE) or conditional average treatment effect (CATE).</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok pi"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*IyLMy_m47tTIakcqwPlLCQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*IyLMy_m47tTIakcqwPlLCQ.png" /></picture></div></figure><p id="e8de" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Where ε and η are stochastic errors and we assume that <strong class="nd gs">E[ε|T,W] = 0</strong>, <strong class="nd gs">E[η|W] = 0</strong>.</p><p id="7946" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">For the estimation of the nuisance functions (i.e., the propensity score model and the outcome model), we implemented the propensity model as a classifier (as we have a binary treatment variable — the presence of a face) and the potential outcome model as a regressor (as we have a continuous outcome variable — the adjusted take rate). We used grid search to tune the XGBoost classifier and regressor hyperparameters, and k-fold cross-validation to avoid overfitting. Finally, we used a causal forest on the residuals of the treatment and outcome variables to capture the ATE, as well as the CATE for different genres and countries.</p><p id="e414" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Mediation and Moderation</strong></p><p id="59df" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">The ATE reveals the impact of the treatment — in this case, having a face in the artwork — across the board. The result answers the question of whether it is worth applying this approach to all titles across our catalog, regardless of potential conditioning variables, e.g. genre, country, etc. Another advantage of our multi-feature dataset is that we get to deep dive into the relationships between attributes. 
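The four Double ML steps can be sketched end-to-end on simulated data. This is a minimal illustration, not our production pipeline: it uses scikit-learn's gradient boosting in place of XGBoost, and a plain regression of residuals on residuals (which recovers the ATE) in place of a causal forest.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 4000
W = rng.normal(size=(n, 3))                       # confounders (title/artwork features)
T = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))   # treatment: face present
tau = 0.5                                         # true ATE, chosen for the simulation
Y = tau * T + W[:, 0] + 0.1 * rng.normal(size=n)  # outcome: adjusted take rate

# Steps 1-2: nuisance models, with out-of-fold (cross-fitted) predictions.
t_hat = cross_val_predict(GradientBoostingClassifier(random_state=0), W, T,
                          cv=5, method="predict_proba")[:, 1]         # propensity model
y_hat = cross_val_predict(GradientBoostingRegressor(random_state=0), W, Y, cv=5)

# Step 3: residualize treatment and outcome.
t_res, y_res = T - t_hat, Y - y_hat

# Step 4: regress outcome residuals on treatment residuals; the slope is the ATE.
ate = LinearRegression(fit_intercept=False).fit(t_res.reshape(-1, 1), y_res).coef_[0]
print(f"estimated ATE: {ate:.2f} (true: {tau})")
```

Swapping the final regression for a forest fit on the same residuals, as in our setup, yields heterogeneous (CATE) estimates instead of a single slope.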
To do this, we can employ two methods: mediation and moderation.</p><p id="b7b6" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">In their classic paper, <a class="af nz" href="https://www2.psych.ubc.ca/~schaller/528Readings/BaronKenny1986.pdf" rel="noopener ugc nofollow" target="_blank">Baron &amp; Kenny</a> define a moderator as “a qualitative (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation between an independent or predictor variable and a dependent or criterion variable.”. We can investigate suspected moderators to uncover Conditional Average Treatment Effects (CATE). For example, we might suspect that the effect of the presence of a face in artwork varies across genres (e.g. certain genres, like nature documentaries, probably benefit less from the presence of a human face since titles in those genres tend to focus more on non-human subject matter). We can investigate these relationships by including an interaction term between the suspected moderator and the independent variable. If the interaction term is significant, we can conclude that the third variable is a moderator of the relationship between the independent and dependent variables.</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok pj"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*a2SdHwnAF9DxAPJ7spIxoQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*a2SdHwnAF9DxAPJ7spIxoQ.png" /></picture></div></figure><p id="c0b9" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Mediation, on the other hand, occurs when a third variable explains the relationship between an independent and dependent variable. 
To quote Baron &amp; Kenny once more, “whereas moderator variables specify when certain effects will hold, mediators speak to how or why such effects occur.”</p><p id="77ca" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">For example, we observed that the <a class="af nz" href="https://about.netflix.com/en/news/the-power-of-a-picture#:~:text=dropped%20when%20it%20contained%20more%20than%203%20people" rel="noopener ugc nofollow" target="_blank">presence of more than 3 people tends to negatively impact performance</a>. It could be that higher numbers of faces make it harder for a user to focus on any one face in the asset. However, since face count and face size tend to be negatively correlated (since we fit more information in an image of fixed size, each individual piece of information tends to be smaller), one could also hypothesize that the negative correlation with face count is not driven so much from the number of people featured in the artwork, but rather the size of each individual person’s face, which may affect how visible each person is. 
To test this, we can run a mediation analysis to see if face size is mediating the effect of face count on the asset’s performance.</p><figure class="oa ob oc od oe of oj ok paragraph-image"><div class="oj ok pk"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*Co-lTEL5F9tg7UnGYF1szg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*Co-lTEL5F9tg7UnGYF1szg.png" /></picture></div></figure><p id="26bd" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">The steps of the mediation analysis are as follows: We have already detected a correlation between the independent variable (number of faces) and the outcome variable (user engagement) — in other words, we observed that a higher number of faces is associated with lower user engagement. But, we also observe that the number of faces is negatively correlated with average face size — faces tend to be smaller when more faces are fit into the same fixed-size canvas. To find out the degree to which face size mediates the effect of face count, we regress user engagement on both average face size and the number of faces. If 1) face size is a significant predictor of engagement, and 2) the significance of the predictive contribution of the number of people drops, we can conclude that face size mediates the effect of the number of people in artwork user engagement. If the coefficient for the number of people is no longer significant, it shows that face size <em class="os">fully</em> mediates the effect of the number of faces on engagement.</p><p id="c2b4" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">In this dataset, we found that face size only partially mediates the effect of face count on asset effectiveness. 
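The regression-based mediation check described above can be sketched on simulated data; the effect sizes below are hypothetical and chosen so that face size only partially mediates the face-count effect, mirroring our finding.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
face_count = rng.integers(1, 6, size=n).astype(float)
# More faces on a fixed-size canvas -> smaller faces (negative correlation).
face_size = 1.0 / face_count + 0.05 * rng.normal(size=n)
# Simulated engagement: larger faces help, and extra faces also hurt directly,
# so face size only *partially* mediates the face-count effect.
engagement = 0.8 * face_size - 0.05 * face_count + 0.02 * rng.normal(size=n)

# Total effect of face count (mediator omitted).
total = LinearRegression().fit(face_count.reshape(-1, 1), engagement).coef_[0]
# Direct effect of face count and effect of the mediator (both included).
direct, size_effect = LinearRegression().fit(
    np.column_stack([face_count, face_size]), engagement).coef_

# Partial mediation: the face-count coefficient shrinks toward zero once
# face size is controlled for, but remains negative.
print(f"total: {total:.3f}, direct: {direct:.3f}, mediator: {size_effect:.3f}")
```

If the direct coefficient dropped all the way to zero here, that would indicate full rather than partial mediation.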
This implies that both factors have an impact on asset effectiveness — fewer faces tend to be more effective even if we control for the effect of face size.</p><p id="9c1f" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Sensitivity Analysis</strong></p><p id="e116" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">As alluded to above, the conditional exchangeability assumption (unconfoundedness) is not testable by definition. It is thus crucial to evaluate how sensitive our findings and insights are to the violation of this assumption. Inspired by prior <a class="af nz" href="https://medium.com/data-science-at-microsoft/causal-inference-part-3-of-3-model-validation-and-applications-c84764156a29" rel="noopener">work</a>, we conducted a suite of sensitivity analyses that stress-tested this assumption from multiple different angles. In addition, we leveraged ideas from academic research (most notably the <a class="af nz" href="https://www.acpjournals.org/doi/abs/10.7326/m16-2607" rel="noopener ugc nofollow" target="_blank">E-value</a>) and concluded that our estimates are robust even when the unconfoundedness assumption is violated. 
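For intuition on that last point, the E-value linked above has a simple closed form: for a risk ratio RR at or above 1, it equals RR plus the square root of RR times (RR minus 1). A minimal sketch (the risk ratios passed in are illustrative, not our actual estimates):

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio (VanderWeele and Ding, 2017): the minimum
    strength of association an unmeasured confounder would need with both
    treatment and outcome to fully explain away the observed effect."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:  # protective effects: invert first
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

print(round(e_value(2.0), 2))  # 3.41: a confounder would need RR >= 3.41
print(round(e_value(0.5), 2))  # 3.41: same by symmetry
```

A large E-value relative to plausible confounder strength is what lets us say an estimate is robust to moderate violations of unconfoundedness.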
We are actively working on designing and implementing a standardized framework for sensitivity analysis and will share the various applications in an upcoming blog post — stay tuned for a more detailed discussion!</p><p id="d382" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Finally, we also compared our estimated treatment effects with known effects for specific genres that were derived with other methods; the consistency across methods further validated our estimates.</p><p id="91df" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Conclusion</strong></p><p id="61c8" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Using the causal machine learning framework, we can potentially test and identify the various components of promotional artwork and gain invaluable creative insights. With this post, we have just scratched the surface of this interesting challenge. In the upcoming posts in this series, we will share alternative machine learning and computer vision approaches that can provide insights from a causal perspective. These insights will guide and assist our team of talented strategists and creatives in selecting and generating the most attractive artwork, leveraging the attributes that these models selected, down to a specific genre. Ultimately, this will give Netflix members a better and more personalized experience.</p><p id="cbc8" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">If these types of challenges interest you, please let us know! 
We are always looking for great people who are inspired by causal inference, <a class="af nz" href="https://jobs.netflix.com/search?q=%22machine+learning%22" rel="noopener ugc nofollow" target="_blank">machine learning</a>, and <a class="af nz" href="https://jobs.netflix.com/search?q=%22computer+vision%22" rel="noopener ugc nofollow" target="_blank">computer vision</a> to join our team.</p><p id="45f7" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Contributions</strong></p><p id="c11a" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">The authors contributed to the post as follows.</p><p id="1f07" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">Billur Engin was the main driver of this blog post; she worked on the causal machine learning theory and its application in the artwork space. Yinghong Lan contributed equally to the causal machine learning theory. Grace Tang worked on the mediation analysis. Cristina Segalin engineered and extracted the visual features at scale from artworks used in the analysis. 
Grace Tang and Cristina Segalin initiated and conceptualized the problem space that is being used as the illustrative example in this post (studying factors affecting user engagement with a broad multivariate analysis of artwork features), curated the data, and performed initial statistical analysis and construction of predictive models supporting this work.</p><p id="33c8" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Acknowledgments</strong></p><p id="ddfb" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">We would like to thank <a class="af nz" href="https://www.linkedin.com/in/shiva-chaitanya-05a93b5/" rel="noopener ugc nofollow" target="_blank">Shiva Chaitanya</a> for reviewing this work, and a special thanks to <a class="af nz" href="https://www.linkedin.com/in/shaun-wright-28b74248/" rel="noopener ugc nofollow" target="_blank">Shaun Wright</a>, <a class="af nz" href="https://www.linkedin.com/in/luca-aldag/" rel="noopener ugc nofollow" target="_blank">Luca Aldag</a>, <a class="af nz" href="https://www.linkedin.com/in/sarah-soquel-morhaim-3875831a3/" rel="noopener ugc nofollow" target="_blank">Sarah Soquel Morhaim</a>, and <a class="af nz" href="https://www.linkedin.com/in/anna-pulido-61025063/" rel="noopener ugc nofollow" target="_blank">Anna Pulido</a> who helped make this possible.</p><p id="20b7" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj"><strong class="nd gs">Footnotes</strong></p><p id="e347" class="pw-post-body-paragraph nb nc gr nd b ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny gk bj">¹The Consumer Insights team at Netflix seeks to understand members and non-members through a wide range of quantitative and qualitative research methods.</p></div>]]></description>
      <link>https://netflixtechblog.com/causal-machine-learning-for-creative-insights-4b0ce22a8a96</link>
      <guid>https://netflixtechblog.com/causal-machine-learning-for-creative-insights-4b0ce22a8a96</guid>
      <pubDate>Sat, 25 Nov 2023 02:27:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Incremental Processing using Netflix Maestro and Apache Iceberg]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="bd3d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">by <a class="af ny" href="https://www.linkedin.com/in/jheua/" rel="noopener ugc nofollow" target="_blank">Jun He</a>, <a class="af ny" href="https://www.linkedin.com/in/yingyi-zhang-a0a164111/" rel="noopener ugc nofollow" target="_blank">Yingyi Zhang</a>, and <a class="af ny" href="https://www.linkedin.com/in/pawan-dixit-b4307b2/" rel="noopener ugc nofollow" target="_blank">Pawan Dixit</a></p><p id="5fe6" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Incremental processing is an approach to process new or changed data in workflows. The key advantage is that it processes only the data that was newly added or updated in a dataset, instead of re-processing the complete dataset. This not only reduces the cost of compute resources but also significantly reduces execution time. When workflow execution has a shorter duration, the chances of failure and the need for manual intervention decrease. It also improves engineering productivity by simplifying existing pipelines and unlocking new patterns.</p><p id="3fd1" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In this blog post, we talk about the landscape and the challenges in workflows at Netflix. We will show how we are building a clean and efficient incremental processing solution (IPS) by using Netflix Maestro and Apache Iceberg. IPS provides incremental processing support with data accuracy, data freshness, and backfill for users, and addresses many of the challenges in workflows. 
IPS enables users to continue using their existing data processing patterns with minimal changes.</p><h1 id="3e17" class="nz oa gr be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Introduction</h1><p id="79ba" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">Netflix relies on data to power its business in all phases. Whether in analyzing A/B tests, optimizing studio production, training algorithms, investing in content acquisition, detecting security breaches, or optimizing payments, well-structured and accurate data is foundational. As our business scales globally, the demand for data is growing and the need for scalable, low-latency incremental processing has begun to emerge. There are three common issues that dataset owners usually face.</p><ul class=""><li id="f3d6" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj"><strong class="nc gs">Data Freshness: </strong>Large datasets from Iceberg tables need to be processed quickly and accurately to generate insights to enable faster product decisions. The hourly processing semantics, along with the valid–through-timestamp watermark or data signals provided by the Data Platform toolset today, satisfy many use cases but are not ideal for low-latency batch processing. Before IPS, the Data Platform did not have a solution for tracking the state and progression of data sets as a single, easy-to-use offering. This has led to a few internal solutions such as <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/2-diving-deeper-into-psyberg-stateless-vs-stateful-data-processing-1d273b3aaefb">Psyberg</a>. These internal libraries process data by capturing the changed partitions, which works only for specific use cases. 
Additionally, the libraries are tightly coupled to the user business logic, which often incurs higher migration and maintenance costs and requires heavy coordination with the Data Platform team.</li><li id="d1d6" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">Data Accuracy:</strong> Late arriving data causes datasets processed in the past to become incomplete and, as a result, inaccurate. To compensate for that, ETL workflows often use a lookback window, based on which they reprocess the data in that certain time window. For example, a job would reprocess aggregates for the past 3 days because it assumes that there would be late arriving data, but data prior to 3 days isn’t worth the cost of reprocessing.</li><li id="4fd1" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">Backfill:</strong> Backfilling datasets is a common operation in big data processing. This requires repopulating data for a historical time period before the scheduled processing. The need for backfilling could be due to a variety of factors, e.g. (1) upstream data sets got repopulated due to changes in the business logic of their data pipelines, (2) business logic was changed in a data pipeline, (3) a new metric was created that needs to be populated for historical time ranges, (4) historical data was found missing, etc.</li></ul><p id="ed1e" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">These challenges are currently addressed in suboptimal and less cost-efficient ways by individual local teams to fulfill the needs, such as</p><ul class=""><li id="b3c8" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj"><strong class="nc gs">Lookback: </strong>This is a generic and simple approach that data engineers use to solve the data accuracy problem. 
Users configure the workflow to read the data in a window (e.g. past 3 hours or 10 days). The window is set based on users’ domain knowledge so that users have high confidence that the late arriving data will be included or will not matter (i.e. data arrives too late to be useful). It ensures correctness, but at a high cost in time and compute resources.</li><li id="b44c" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">Foreach pattern: </strong>Users build backfill workflows using <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c#7d0f">Maestro foreach support</a>. It works well to backfill data produced by a single workflow. If the pipeline has multiple stages or many downstream workflows, users have to manually create backfill workflows for each of them, which requires significant manual work.</li></ul><p id="ccd1" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The incremental processing solution (IPS) described here has been designed to address the above problems. The design goal is to provide a clean and easy-to-adopt solution for incremental processing to ensure data freshness, data accuracy, and easy backfill support.</p><ul class=""><li id="a5d8" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj"><strong class="nc gs">Data Freshness:</strong> provide the support for scheduling workflows in a <strong class="nc gs">micro batch </strong>fashion (e.g. 
15 min interval) with state tracking functionality</li><li id="1e72" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">Data Accuracy:</strong> provide the support to process all late arriving data to achieve data accuracy needed by the business with significantly improved performance in terms of multifold <strong class="nc gs">time and cost efficiency</strong></li><li id="5ce3" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">Backfill: </strong>provide managed backfill support to build, monitor, and validate the backfill, including automatically propagating changes from upstream to downstream workflows, to greatly improve <strong class="nc gs">engineering productivity</strong> (i.e. a few days or weeks of engineering work to build backfill workflows vs one click for managed backfill)</li></ul><h1 id="75e1" class="nz oa gr be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Approach Overview</h1><h2 id="5199" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">General Concept</h2><p id="789a" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj"><strong class="nc gs">Incremental processing</strong> is an approach to process data in batch — but only on new or changed data. To support incremental processing, we need an approach for not only capturing incremental data changes but also tracking their states (i.e. whether a change is processed by a workflow or not). It must be aware of the change and can capture the changes from the source table(s) and then keep tracking those changes. Here, changes mean more than just new data itself. For example, a row in an aggregation target table needs all the rows from the source table associated with the aggregation row. 
Also, if there are multiple source tables, usually the union of the changed data ranges from all input tables gives the full change data set. Thus, change information captured must include all related data including those unchanged rows in the source table as well. Due to previously mentioned complexities, change tracking cannot be simply achieved by using a single watermark. IPS has to track those captured changes in finer granularity.</p><p id="b2fa" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The changes from the source tables might affect the transformed result in the target table in various ways.</p><ul class=""><li id="79e3" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj">If one row in the target table is derived from one row in the source table, newly captured data change will be the complete input dataset for the workflow pipeline.</li><li id="6303" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">If one row in the target table is derived from multiple rows in the source table, capturing new data will only tell us the rows have to be re-processed. But the dataset needed for ETL is beyond the change data itself. For example, an aggregation based on account id requires all rows from the source table about an account id. The change dataset will tell us which account ids are changed and then the user business logic needs to load all data associated with those account ids found in the change data.</li><li id="e088" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">If one row in the target table is derived based on the data beyond the changed data set, e.g. joining source table with other tables, newly captured data is still useful and can indicate a range of data to be affected. Then the workflow will re-process the data based on the range. 
For example, assume we have a table that keeps the accumulated view time for a given account, partitioned by day. If the view time from 3 days ago is updated now due to late arriving data, then the view time for the following two days has to be re-calculated for this account. In this case, the captured late arriving data will tell us the start of the re-calculation, which is much more accurate than recomputing everything for the past X days by guesstimate, where X is a cutoff lookback window decided by business domain knowledge.</li></ul><p id="0e7a" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Once the change information (data or range) is captured, a workflow has to write the data to the target table in a slightly more complicated way because the simple <strong class="nc gs">INSERT OVERWRITE</strong> mechanism won’t work well. There are two alternatives:</p><ul class=""><li id="54bf" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj"><strong class="nc gs">Merge pattern:</strong> Some compute frameworks, e.g. Spark 3, support MERGE INTO to allow new data to be merged into the existing data set. That solves the write problem for incremental processing. Note that the workflow/step can be safely restarted without worrying about duplicate data being inserted when using MERGE INTO.</li><li id="4919" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">Append pattern:</strong> Users can also use an append-only write (e.g. INSERT INTO) to add the new data to the existing data set. Once the processing is completed, the append data is committed to the table. If users want to re-run or re-build the data set, they will run a backfill workflow to completely overwrite the target data set (e.g. 
INSERT OVERWRITE).</li></ul><p id="3a41" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Additionally, IPS naturally supports backfill in many cases. Downstream workflows (if there is no business logic change) will be triggered by the data change due to backfill. This enables auto propagation of backfill data in multi-stage pipelines. Note that backfill support is not covered in detail in this blog; we will talk about IPS backfill support in a follow-up blog post.</p><h2 id="b410" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">Netflix Maestro</h2><p id="34b8" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj"><a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">Maestro</a> is the Netflix data workflow orchestration platform built to meet the current and future needs of Netflix. It is a general-purpose workflow orchestrator that provides a fully managed workflow-as-a-service (WAAS) to the data platform users at Netflix. It serves thousands of users, including data scientists, data engineers, machine learning engineers, software engineers, content producers, and business analysts, in various use cases. Maestro is highly scalable and extensible to support existing and new use cases and offers enhanced usability to end users.</p><p id="e7ad" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Since the last blog on <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">Maestro</a>, we have migrated all the workflows to it on behalf of users with minimal interruption. 
Maestro has been fully deployed in production with 100% of the workload running on it.</p><p id="f55f" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">IPS is built upon Maestro as an extension by adding two building blocks, i.e. a new trigger mechanism and a new step job type, to enable incremental processing for all workflows. It is seamlessly integrated into the whole Maestro ecosystem with minimal onboarding cost.</p><h2 id="9937" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">Apache Iceberg</h2><p id="4ce5" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj"><a class="af ny" href="https://iceberg.apache.org/" rel="noopener ugc nofollow" target="_blank">Iceberg</a> is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time. It supports expressive SQL, full schema evolution, hidden partitioning, data compaction, and time travel &amp; rollback. In IPS, we leverage the rich features provided by Apache Iceberg to develop a lightweight approach to capture table changes.</p><h2 id="bb79" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">Incremental Change Capture Design</h2><p id="c94f" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">Using Netflix Maestro and Apache Iceberg, we created a novel solution for incremental processing, which provides the incremental change (data and range) capture in a super lightweight way without copying any data. 
During our exploration, we saw a huge opportunity to improve cost efficiency and engineering productivity using incremental processing.</p><p id="eb7f" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Here is our solution to achieve incremental change capture built upon Apache Iceberg features. As we know, an Iceberg table contains a list of snapshots with a set of metadata. Snapshots include references to the actual immutable data files. A snapshot can contain data files from different partitions.</p><figure class="qf qg qh qi qj qk qc qd paragraph-image"><div role="button" tabindex="0" class="ql qm fg qn bg qo"><div class="qc qd qe"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*iIxEBzgYTiqc6J2lkA2lSQ.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*iIxEBzgYTiqc6J2lkA2lSQ.gif" /></picture></div></div></figure><p id="92b0" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The graph above shows that s0 contains data for Partition P0 and P1 at T1. Then at T2, a new snapshot s1 is committed to the table with a list of new data files, which includes late arriving data for partition P0 and P1 and data for P2.</p><p id="4b32" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We implemented a lightweight approach to create an Iceberg table (called an ICDC table), which has its own snapshot but only includes the new data file references from the original table without copying the data files. It is highly efficient with a low cost. Then workflow pipelines can just load the ICDC table to process only the change data from partitions P0, P1, and P2 without reprocessing the unchanged data in P0 and P1. 
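The idea can be illustrated with a toy, in-memory model of snapshots and data-file references (pure Python; the class names and fields below are invented for illustration and are not the Iceberg or IPS API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str
    lower_bound: int  # min value of the tracked field in this file
    upper_bound: int  # max value of the tracked field in this file

@dataclass
class Snapshot:
    snapshot_id: int
    files: frozenset  # immutable references to data files, never copies

def icdc_view(table, last_processed_id):
    """Build an ICDC 'table': references to data files added after the last
    snapshot a workflow processed -- no data files are copied."""
    seen = set()
    for snap in table:
        if snap.snapshot_id <= last_processed_id:
            seen |= snap.files
    new_files = table[-1].files - seen
    partitions = sorted({f.partition for f in new_files})
    lo = min((f.lower_bound for f in new_files), default=None)
    hi = max((f.upper_bound for f in new_files), default=None)
    return new_files, partitions, (lo, hi)

# s0 at T1: data for partitions P0 and P1
f1 = DataFile("f1.parquet", "P0", 100, 199)
f2 = DataFile("f2.parquet", "P1", 200, 299)
s0 = Snapshot(0, frozenset({f1, f2}))
# s1 at T2: late arriving data for P0/P1 plus a new partition P2
f3 = DataFile("f3.parquet", "P0", 150, 180)
f4 = DataFile("f4.parquet", "P1", 210, 260)
f5 = DataFile("f5.parquet", "P2", 300, 399)
s1 = Snapshot(1, frozenset(s0.files | {f3, f4, f5}))

changed, parts, (lo, hi) = icdc_view([s0, s1], last_processed_id=0)
print(parts)     # ['P0', 'P1', 'P2'] -- only new files; old f1/f2 excluded
print((lo, hi))  # (150, 399) -- change range from file-level bounds
```

The toy mirrors the design: the "ICDC table" is just a set of new file references plus the per-file lower/upper bounds that Iceberg metadata already tracks, which is why no data needs to be copied.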
Meanwhile, the change range is also captured for the specified data field as the Iceberg table metadata contains the upper and lower bound information of each data field for each data file. Moreover, IPS will track the changes in data file granularity for each workflow.</p><p id="5b3f" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">This lightweight approach is seamlessly integrated with Maestro to allow all (thousands of) scheduler users to use this new building block (i.e. incremental processing) in their tens of thousands of workflows. Each workflow using IPS will be injected with a table parameter, which is the table name of the lightweight ICDC table. The ICDC table contains only the change data. Additionally, if the workflow needs the change range, a list of parameters will be injected into the user workflow to include the change range information. The incremental processing can be enabled by a new step job type (ICDC) and/or a new incremental trigger mechanism. Users can use them together with all existing Maestro features, e.g. foreach patterns, step dependencies based on valid–through-timestamp watermark, write-audit-publish templatized pattern, etc.</p><h2 id="1aab" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">Main Advantages</h2><p id="2d08" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">With this design, user workflows can adopt incremental processing with very little effort. The user business logic is also decoupled from the IPS implementation. Multi-stage pipelines can also mix the incremental processing workflows with existing normal workflows. 
We also found that user workflows can be simplified after using IPS by removing additional steps to handle the complexity of the lookback window or calling some internal libraries.</p><p id="a688" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Adding incremental processing features into Netflix Maestro as new features/building blocks for users will enable users to build their workflows in a much more efficient way and bridge the gaps to solve many challenging problems (e.g. dealing with late arriving data) in a much simpler way.</p><h1 id="402d" class="nz oa gr be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Emerging Incremental Processing Patterns</h1><p id="bb3a" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">While onboarding user pipelines to IPS, we have discovered a few incremental processing patterns:</p><h2 id="280e" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">Incrementally process the captured incremental change data and directly append them to the target table</h2><figure class="qf qg qh qi qj qk qc qd paragraph-image"><div role="button" tabindex="0" class="ql qm fg qn bg qo"><div class="qc qd qq"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*n9LhzHUTIQPVtLywo3vF7g.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*n9LhzHUTIQPVtLywo3vF7g.png" /></picture></div></div></figure><p id="02b3" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">This is the straightforward incremental processing use case, where the change data carries all the information needed for the data processing. 
Upstream changes (usually from a single source table) are propagated to the downstream (usually another target table), and the workflow pipeline only needs to process the change data (which might join with other dimension tables) and then merge it (usually append) into the target table. This pattern will replace lookback window patterns to take care of late arriving data. Instead of overwriting the past X days of data completely by using a lookback window pattern, user workflows just need to MERGE the change data (including late arriving data) into the target table by processing the ICDC table.</h2></h2><figure></figure>
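A toy sketch of this filter pattern (pure Python standing in for what would really be a join against the ICDC table in Spark SQL; the table contents are made up):

```python
# full source table: (account_id, view_seconds) rows
source = [
    ("a1", 100), ("a1", 50), ("a2", 80), ("a3", 40), ("a3", 60),
]
# existing aggregates in the target table, keyed by account_id
target = {"a1": 150, "a2": 80, "a3": 100}

# ICDC table: only newly landed rows (late arriving data for a1, new a4)
icdc = [("a1", 25), ("a4", 70)]
source += icdc  # the new rows have also landed in the source table

# the group-by keys needing re-aggregation come from the change data alone
changed_keys = {acct for acct, _ in icdc}

# re-aggregate ONLY the changed keys, scanning source filtered by them;
# unchanged keys (a2, a3) are never touched
for key in changed_keys:
    target[key] = sum(v for acct, v in source if acct == key)

print(target)  # {'a1': 175, 'a2': 80, 'a3': 100, 'a4': 70}
```

The business transform (the `sum`) is unchanged; the ICDC-derived key set merely shrinks the data it runs over, which is exactly why no ETL re-design is needed.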
ETL pipelines keep all the benefits of batch workflows.</p><h2 id="9c36" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">Use the captured range parameters in the business logic</h2><figure class="qf qg qh qi qj qk qc qd paragraph-image"><div role="button" tabindex="0" class="ql qm fg qn bg qo"><div class="qc qd qr"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*Tra9tGljtXOGhJ9l4NvaOQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*Tra9tGljtXOGhJ9l4NvaOQ.png" /></picture></div></div></figure><p id="d2df" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">This pattern is usually used in complicated use cases, such as joining multiple tables and doing complex processing. In this case, the change data do not give the full picture of the input needed by the ETL workflow. Instead, the change data indicates a range of changed data sets for a specific set of fields (which might be partition keys) in a given input table or, usually, multiple input tables. Then, the union of the change ranges from all input tables gives the full change data set needed by the workflow. Additionally, the whole range of data usually has to be overwritten because the transformation is not stateless and depends on the outcome result from the previous ranges. Another example is that the aggregated record in the target table or a window function in the query has to be updated based on the whole data set in the partition (e.g. calculating a median across the whole partition). 
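The accumulated view-time example from earlier can illustrate this range pattern as a sketch (pure Python; the day numbers and view times are invented):

```python
# a running (accumulated) view-time table must be rebuilt from the
# earliest changed day onward, because each day depends on the previous one
daily = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50}    # per-day view time
accum = {1: 10, 2: 30, 3: 60, 4: 100, 5: 150}  # accumulated target table

# late arriving data updates day 3; the captured change range starts there
daily[3] = 35
range_start = 3  # lower bound reported by the change capture

# recompute only from range_start forward, seeded by the prior day's total;
# days 1-2 are outside the captured range and are left untouched
running = accum.get(range_start - 1, 0)
for day in sorted(d for d in daily if d >= range_start):
    running += daily[day]
    accum[day] = running

print(accum)  # {1: 10, 2: 30, 3: 65, 4: 105, 5: 155}
```

The captured range replaces a guesstimated lookback window: the recomputation starts exactly at the earliest affected day rather than at an arbitrary X days back.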
Basically, the range derived from the change data indicates the dataset to be re-processed.</p><h1 id="608b" class="nz oa gr be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Use cases</h1><p id="3f0e" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">Data workflows at Netflix usually have to deal with late arriving data, which is commonly handled with the lookback window pattern because of its simplicity and ease of implementation. In the lookback pattern, the ETL pipeline always consumes the past X partitions of data from the source table and then overwrites the target table on every run. Here, X is a number decided by the pipeline owners based on their domain expertise. The drawback is the cost of computation and execution time: the pipeline usually costs almost X times more than one that ignores late arriving data. Given that late arriving data is sparse, the majority of the processing is spent on data that has already been processed, which is unnecessary. Also note that this approach relies on domain knowledge and is sometimes subject to changes in the business environment or in the domain expertise of data engineers. In certain cases, it is challenging to come up with a good constant number.</p><p id="7537" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Below, we will use a two-stage data pipeline to illustrate how to rebuild it with IPS to improve cost efficiency. We will observe a significant cost reduction (&gt; 80%) with only small changes in the business logic. 
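To make the lookback cost concrete, here is a back-of-the-envelope sketch. The 2% late-data fraction is an illustrative assumption, not a measured number from these pipelines:

```python
# Sketch: rough cost of the lookback-window pattern vs incremental processing.
# With a lookback of X days, every daily run re-reads X partitions even though
# late-arriving data touches only a few of them.
lookback_days = 14
late_fraction = 0.02   # assumption: late-arriving data is sparse (a few percent)

partitions_read_lookback = lookback_days   # re-read all X partitions every run
# Incremental: read the current day plus only the sparse late rows of older days.
partitions_read_incremental = 1 + late_fraction * (lookback_days - 1)

ratio = partitions_read_lookback / partitions_read_incremental
print(f"lookback reads ~{ratio:.0f}x more partition data")   # ~11x
```

The exact multiple depends on X and on how sparse the late data really is, but the gap grows roughly linearly with the lookback window.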
In this use case, we set the lookback window size X to 14 days; the actual value varies across real pipelines.</p><h2 id="a22a" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">Original Data Pipeline with Lookback Window</h2><figure class="qf qg qh qi qj qk qc qd paragraph-image"><div role="button" tabindex="0" class="ql qm fg qn bg qo"><div class="qc qd qs"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*LmdM1TjBaHKBkbhctsP28w.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*LmdM1TjBaHKBkbhctsP28w.png" /></picture></div></div></figure><ul class=""><li id="e42d" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj"><strong class="nc gs">playback_table</strong>: an iceberg table holding playback events from user devices ingested by streaming pipelines, with late arriving data that is sparse (only a few percent of the data arrives late).</li><li id="f152" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">playback_daily_workflow</strong>: a daily scheduled workflow that processes the past X days of playback_table data and writes the transformed data to the target table for the past X days</li><li id="b92a" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">playback_daily_table</strong>: the target table of playback_daily_workflow; it gets overwritten every day for the past X days</li><li id="244e" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">playback_daily_agg_workflow</strong>: a daily scheduled workflow that processes the past X days’ playback_daily_table data and writes the aggregated data to the target table for the past X days</li><li id="cf8f" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj"><strong class="nc gs">playback_daily_agg_table</strong>: the target table of playback_daily_agg_workflow; it gets overwritten every day for the past X days.</li></ul><p id="bb22" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We ran this pipeline on a sample dataset using the real business logic; here are the average results of the sample runs:</p><ul class=""><li id="cf8e" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj">The first stage workflow takes about 7 hours to process playback_table data</li><li id="8eca" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">The second stage workflow takes about 3.5 hours to process playback_daily_table data</li></ul><h2 id="bf9f" class="pn oa gr be ob po pp dx of pq pr dz oj nl ps pt pu np pv pw px nt py pz qa qb bj">New Data Pipeline with Incremental Processing</h2><p id="e582" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">Using IPS, we rewrite the pipeline to avoid re-processing data as much as possible. 
The new pipeline is shown below.</p><figure class="qf qg qh qi qj qk qc qd paragraph-image"><div role="button" tabindex="0" class="ql qm fg qn bg qo"><div class="qc qd qt"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*vBPuSEs0XmgNyvOyuQrgQw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*vBPuSEs0XmgNyvOyuQrgQw.png" /></picture></div></div></figure><p id="b82b" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Stage 1:</strong></p><ul class=""><li id="6ce8" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj"><strong class="nc gs">ips_playback_daily_workflow</strong>: the updated version of playback_daily_workflow.</li><li id="395f" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">The workflow’s Spark SQL job reads an incremental change data capture (ICDC) iceberg table (i.e. <strong class="nc gs">playback_icdc_table</strong>), which includes only the new data added to playback_table. 
It includes the late arriving data but does not include any unchanged data from playback_table.</li><li id="c2c9" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">The business logic replaces the <strong class="nc gs">INSERT OVERWRITE</strong> query with a <strong class="nc gs">MERGE INTO</strong> query, and the new data is merged into playback_daily_table.</li></ul><figure class="qf qg qh qi qj qk qc qd paragraph-image"><div role="button" tabindex="0" class="ql qm fg qn bg qo"><div class="qc qd qu"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*1MzFisgq-Nk0eKCAEpI17w.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*1MzFisgq-Nk0eKCAEpI17w.png" /></picture></div></div></figure><p id="31d8" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Stage 2:</strong></p><ul class=""><li id="82b6" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj">IPS captures the changed data of playback_daily_table and keeps the change data in an ICDC source table (<strong class="nc gs">playback_daily_icdc_table</strong>), so we don’t need to hard-code the lookback window in the business logic. If only Y days have changed data in playback_daily_table, the workflow only needs to load data for those Y days.</li><li id="1aaa" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">In <strong class="nc gs">ips_playback_daily_agg_workflow</strong>, the business logic remains the same for the current day’s partition. 
We then update the business logic to handle late arriving data as follows:</li><li id="6015" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">JOIN playback_daily_table with playback_daily_icdc_table on the aggregation group-by keys for days 2 to X, excluding the current day (i.e. day 1)</li><li id="559c" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">Because late arriving data is sparse, the JOIN narrows down the playback_daily_table data set so that only a very small portion of it is processed.</li><li id="2b70" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">The business logic uses a <strong class="nc gs">MERGE INTO</strong> query, and the change is propagated to the downstream target table</li><li id="5054" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">For the current day, the business logic remains the same: it consumes the data from playback_daily_table and writes the outcome to the target table playback_daily_agg_table using <strong class="nc gs">INSERT OVERWRITE</strong>, because there is no need to join with the ICDC table.</li></ul><figure class="qf qg qh qi qj qk qc qd paragraph-image"><div role="button" tabindex="0" class="ql qm fg qn bg qo"><div class="qc qd qv"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*kuwRjmjojQ8k5-SlCsHg8Q.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*kuwRjmjojQ8k5-SlCsHg8Q.png" /></picture></div></div></figure><p id="527d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">With these small changes, the data pipeline efficiency is greatly improved. 
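Put together, the two rewritten stages can be sketched with plain Python dicts standing in for the Iceberg tables and the MERGE INTO query. Table names follow the post; the data itself is illustrative:

```python
# Sketch of the two rewritten IPS stages.

def merge_into(target, change_rows):
    """MERGE INTO-style upsert: only the change data is written, instead of
    overwriting the whole lookback window."""
    target.update(change_rows)
    return target

# Stage 1: playback_icdc_table holds only new + late rows; merge them in.
playback_daily_table = {("day1", "u1"): 3}                    # existing target
playback_icdc_table = {("day1", "u2"): 1, ("day2", "u1"): 5}  # change data only
merge_into(playback_daily_table, playback_icdc_table)

# Stage 2: join with the daily ICDC on the group-by key (day) so that only
# changed days are re-aggregated, then merge the result downstream.
changed_days = {day for day, _ in playback_icdc_table}
playback_daily_agg_table = {"day1": 3}                        # stale aggregate
merge_into(
    playback_daily_agg_table,
    {
        day: sum(v for (d, _), v in playback_daily_table.items() if d == day)
        for day in changed_days
    },
)
print(playback_daily_agg_table)   # {'day1': 4, 'day2': 5}
```

Only days present in the change data are touched; everything else in the target tables is left as published.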
In our sample run,</p><ul class=""><li id="1ba3" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pc nn no np pd nr ns nt pe nv nw nx pf pg ph bj">The first stage workflow takes just about 30 minutes to process X days’ change data from playback_table.</li><li id="dfc1" class="na nb gr nc b nd pi nf ng nh pj nj nk nl pk nn no np pl nr ns nt pm nv nw nx pf pg ph bj">The second stage workflow takes about 15 minutes to process change data from day 2 to day X from playback_daily_table by joining with playback_daily_icdc_table data, and takes another 15 minutes to process the current day (i.e. day 1) playback_daily_table change data.</li></ul><p id="9b53" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The Spark job settings are the same in the original and new pipelines. In total, the new IPS-based pipeline needs around <strong class="nc gs">10%</strong> of the resources (measured by execution time) to finish.</p><h1 id="2bcb" class="nz oa gr be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Looking Forward</h1><p id="e6cb" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">We will improve IPS to support more complicated cases beyond the append-only case. IPS will be able to keep track of the progress of the table changes and support multiple Iceberg table change types (e.g. append, overwrite, etc.). We will also add managed backfill support into IPS to help users build, monitor, and validate the backfill.</p><p id="45fd" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">We are taking Big Data Orchestration to the next level and constantly solving new problems and challenges; please stay tuned. 
If you are motivated to solve large scale orchestration problems, please <a class="af ny" href="https://jobs.netflix.com/search?team=Data+Platform" rel="noopener ugc nofollow" target="_blank">join us</a>.</p><h1 id="1d0e" class="nz oa gr be ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bj">Acknowledgements</h1><p id="9269" class="pw-post-body-paragraph na nb gr nc b nd ox nf ng nh oy nj nk nl oz nn no np pa nr ns nt pb nv nw nx gk bj">Thanks to our Product Manager <a class="af ny" href="https://www.linkedin.com/in/ashpokh/" rel="noopener ugc nofollow" target="_blank">Ashim Pokharel</a> for driving the strategy and requirements. We’d also like to thank Andy Chu, Kyoko Shimada, Abhinaya Shetty, Bharath Mummadisetty, John Zhuge, Rakesh Veeramacheneni, and other stunning colleagues at Netflix for their suggestions and feedback while developing IPS. We’d also like to thank Prashanth Ramdas, Eva Tse, Charles Smith, and other leaders of Netflix engineering organizations for their constructive feedback and suggestions on the IPS architecture and design.</p></div>]]></description>
      <link>https://netflixtechblog.com/incremental-processing-using-netflix-maestro-and-apache-iceberg-b8ba072ddeeb</link>
      <guid>https://netflixtechblog.com/incremental-processing-using-netflix-maestro-and-apache-iceberg-b8ba072ddeeb</guid>
      <pubDate>Tue, 21 Nov 2023 06:49:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[3. Psyberg: Automated end to end catch up]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg fw fx fy fz"><div><div class="hs ht hu hv hw"></div><p id="5471" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">By <a class="af ny" href="https://www.linkedin.com/in/abhinaya-shetty-ab871418/" rel="noopener ugc nofollow" target="_blank"><em class="nz">Abhinaya Shetty</em></a>, <a class="af ny" href="https://www.linkedin.com/in/bharath-chandra-mummadisetty-27591a88/" rel="noopener ugc nofollow" target="_blank"><em class="nz">Bharath Mummadisetty</em></a></p><p id="8101" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">This blog post will cover how Psyberg helps automate the end-to-end catchup of different pipelines, including dimension tables.</p><p id="4baf" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In the previous installments of this series, we <a class="af ny" href="https://netflixtechblog.medium.com/f68830617dd1" rel="noopener">introduced Psyberg</a> and delved into its core operational modes: <a class="af ny" href="https://netflixtechblog.medium.com/1d273b3aaefb" rel="noopener">Stateless and Stateful Data Processing</a>. Now, let’s explore the state of our pipelines after incorporating Psyberg.</p><h1 id="d3bf" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Pipelines After Psyberg</h1><p id="cfce" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Let’s explore how different modes of Psyberg could help with a multistep data pipeline. 
We’ll return to the sample customer lifecycle:</p><figure class="pg ph pi pj pk pl pd pe paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe pf"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*C1VqEWTnxQ4O-M8ScWoLvA.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*C1VqEWTnxQ4O-M8ScWoLvA.png" /></picture></div></div></figure><p id="8d82" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Processing Requirement</strong>: <br />Keep track of the end-of-hour state of accounts, e.g., <strong class="nc gs">Active/Upgraded/Downgraded/Canceled.</strong></p><p id="fd59" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Solution</strong>: <br />One potential approach here would be as follows</p><ol class=""><li id="2618" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pr nn no np ps nr ns nt pt nv nw nx pu pv pw bj">Create <strong class="nc gs">two stateless</strong> <strong class="nc gs">fact</strong> tables :<br />a. Signups<br />b. Account Plans</li><li id="ec4d" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx pu pv pw bj">Create <strong class="nc gs">one stateful</strong> <strong class="nc gs">fact</strong> table:<br />a. 
Cancels</li><li id="81d2" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx pu pv pw bj">Create a <strong class="nc gs">stateful dimension</strong> that reads the above fact tables every hour and derives the latest account state.</li></ol><p id="d1ab" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Let’s look at how this can be integrated with Psyberg to auto-handle late-arriving data and corresponding end-to-end data catchup.</p><h1 id="902f" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj"><strong class="al">Navigating the Workflow: How Psyberg Handles Late-Arriving Data</strong></h1><p id="9b8c" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">We follow a generic workflow structure for both stateful and stateless processing with Psyberg; this helps maintain consistency and makes debugging and understanding these pipelines easier. The following is a concise overview of the various stages involved; for a more detailed exploration of the workflow specifics, please turn to the <a class="af ny" href="https://netflixtechblog.medium.com/1d273b3aaefb" rel="noopener">second installment</a> of this series.</p><h1 id="71de" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">1. 
Psyberg Initialization</h1><p id="af8c" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">The workflow starts with the Psyberg initialization (init) step.</p><ul class=""><li id="f2f0" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pr nn no np ps nr ns nt pt nv nw nx qc pv pw bj"><strong class="nc gs">Input</strong>: List of source tables and required processing mode</li><li id="8223" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj"><strong class="nc gs">Output</strong>: Psyberg identifies new events that have occurred since the last high watermark (HWM) and records them in the session metadata table.</li></ul><p id="abef" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The session metadata table can then be read to determine the pipeline input.</p><figure class="pg ph pi pj pk pl pd pe paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe qd"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*NhFne1sCHTTW8ZzahSZqAQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*NhFne1sCHTTW8ZzahSZqAQ.png" /></picture></div></div></figure><h1 id="e190" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">2. Write-Audit-Publish (WAP) Process</h1><p id="77dc" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">This is the general pattern we use in our ETL pipelines.</p><p id="c675" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">a. 
Write<br /></strong>Apply the ETL business logic to the input data identified in Step 1 and write to an unpublished iceberg snapshot based on the Psyberg mode</p><figure class="pg ph pi pj pk pl pd pe paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe qd"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*HbO2Zm9IEeoRwEYOnrBRuQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*HbO2Zm9IEeoRwEYOnrBRuQ.png" /></picture></div></div></figure><p id="d7f6" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">b</strong>. <strong class="nc gs">Audit<br /></strong>Run various quality checks on the staged data. Psyberg’s metadata session table is used to identify the partitions included in a batch run. Several audits, such as verifying source and target counts, are performed on this batch of data.</p><p id="6283" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">c. Publish<br /></strong>If the audits are successful, cherry-pick the staging snapshot to publish the data to production.</p><h1 id="b65f" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">3. Psyberg Commit</h1><p id="d6f4" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Now that the data pipeline has been executed successfully, the new high watermark identified in the initialization step is committed to Psyberg’s high watermark metadata table. 
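The init-and-commit cycle around the high watermark can be sketched as follows. Plain Python stands in for Psyberg's metadata tables; the class and method names are illustrative, not Psyberg's actual API:

```python
# Sketch of the init -> write/audit/publish -> commit cycle around the
# high watermark (HWM).

class HighWatermark:
    def __init__(self):
        self.committed = 0          # last successfully processed event time

    def init_session(self, source_events):
        """Init step: find events newer than the committed HWM."""
        batch = [e for e in source_events if e > self.committed]
        candidate_hwm = max(batch, default=self.committed)
        return batch, candidate_hwm

    def commit(self, candidate_hwm):
        """Commit step: only runs after audits pass and data is published."""
        self.committed = candidate_hwm

hwm = HighWatermark()
batch, candidate = hwm.init_session([3, 7, 5])
assert batch == [3, 7, 5] and candidate == 7
hwm.commit(candidate)                 # pipeline + audits succeeded
batch, candidate = hwm.init_session([3, 7, 5, 9])
assert batch == [9]                   # the next run sees only newer events
```

If the pipeline or an audit fails, the commit step never runs, so the next run re-reads the same batch from the old watermark.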
This ensures that the next instance of the workflow will pick up newer updates.</p><h1 id="441c" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Callouts</h1><ul class=""><li id="f3d2" class="na nb gr nc b nd oy nf ng nh oz nj nk nl qe nn no np qf nr ns nt qg nv nw nx qc pv pw bj">Having the Psyberg step isolated from the core data pipeline allows us to maintain a consistent pattern that can be applied across stateless and stateful processing pipelines with varying requirements.</li><li id="2109" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj">This also enables us to update the Psyberg layer without touching the workflows.</li><li id="858b" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj">This is compatible with both Python and Scala Spark.</li><li id="1de2" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj">Debugging/figuring out what was loaded in every run is made easy with the help of workflow parameters and Psyberg Metadata.</li></ul><h1 id="1700" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">The Setup: Automated end-to-end catchup</h1><p id="a0c1" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Let’s go back to our customer lifecycle example. 
Once we integrate all four components with Psyberg, here’s how we would set it up for automated catchup.</p></div></div><div class="pl"><div class="ab ca"><div class="md qh me qi mf qj ce qk cf ql ch bg"><figure class="pg ph pi pj pk pl qn qo paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*10mLxGw1SkgT1z3W%20640w,%20https://miro.medium.com/v2/resize:fit:720/0*10mLxGw1SkgT1z3W%20720w,%20https://miro.medium.com/v2/resize:fit:750/0*10mLxGw1SkgT1z3W%20750w,%20https://miro.medium.com/v2/resize:fit:786/0*10mLxGw1SkgT1z3W%20786w,%20https://miro.medium.com/v2/resize:fit:828/0*10mLxGw1SkgT1z3W%20828w,%20https://miro.medium.com/v2/resize:fit:1100/0*10mLxGw1SkgT1z3W%201100w,%20https://miro.medium.com/v2/resize:fit:2000/0*10mLxGw1SkgT1z3W%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*10mLxGw1SkgT1z3W 640w, https://miro.medium.com/v2/resize:fit:720/0*10mLxGw1SkgT1z3W 720w, https://miro.medium.com/v2/resize:fit:750/0*10mLxGw1SkgT1z3W 750w, https://miro.medium.com/v2/resize:fit:786/0*10mLxGw1SkgT1z3W 786w, https://miro.medium.com/v2/resize:fit:828/0*10mLxGw1SkgT1z3W 828w, https://miro.medium.com/v2/resize:fit:1100/0*10mLxGw1SkgT1z3W 1100w, https://miro.medium.com/v2/resize:fit:2000/0*10mLxGw1SkgT1z3W 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="ecb6" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The three fact tables, comprising the signup and plan facts encapsulated in Psyberg’s stateless mode, along with the cancel fact in stateful mode, serve as inputs for the stateful sequential load ETL pipeline. This data pipeline monitors the various stages in the customer lifecycle.</p><p id="86c6" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In the sequential load ETL, we have the following features:</p><ul class=""><li id="a400" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pr nn no np ps nr ns nt pt nv nw nx qc pv pw bj"><strong class="nc gs">Catchup Threshold</strong>: This defines the lookback period for the data being read. 
For instance, only consider the last 12 hours of data.</li><li id="eb05" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj"><strong class="nc gs">Data Load Type</strong>: The ETL can either load the missed/new data specifically or reload the entire specified range.</li><li id="0814" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj"><strong class="nc gs">Metadata Recording</strong>: Metadata is persisted for traceability.</li></ul><p id="4c9e" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Here is a <strong class="nc gs">walkthrough</strong> on how this system would automatically catch up in the event of late-arriving data:</p><p id="6f5b" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Premise:</strong> All the tables were last loaded up to hour 5, meaning that any data from hour 6 onwards is considered new, and anything before that is classified as late data (as indicated in red above)</p><p id="28de" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Fact level catchup</strong>:</p><ol class=""><li id="1ed6" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pr nn no np ps nr ns nt pt nv nw nx pu pv pw bj">During the Psyberg initialization phase, the signup and plan facts identify the late data from hours 2 and 3, as well as the most recent data from hour 6. The ETL then appends this data to the corresponding partitions within the fact tables.</li><li id="a203" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx pu pv pw bj">The Psyberg initialization for the cancel fact identifies late data from hour 5 and additional data from hours 6 and 7. 
Since this ETL operates in stateful mode, the data in the target table from hours 5 to 7 will be overwritten with the new data.</li><li id="1c14" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx pu pv pw bj">By focusing solely on updates and avoiding reprocessing of data based on a fixed lookback window, both Stateless and Stateful Data Processing maintain a minimal change footprint. This approach ensures data processing is both efficient and accurate.</li></ol><p id="057d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Dimension level catchup</strong>:</p><ol class=""><li id="bde3" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pr nn no np ps nr ns nt pt nv nw nx pu pv pw bj">The Psyberg wrapper for this stateful ETL looks at the updates to the upstream Psyberg powered fact tables to determine the date-hour range to reprocess. Here’s how it would calculate the above range:<br /><strong class="nc gs">MinHr = least(min processing hour from each source table)</strong><br />This ensures that we don’t miss out on any data, including late-arriving data. In this case, the minimum hour to process the data is hour 2.<br /><strong class="nc gs">MaxHr = least(max processing hour from each source table)<br /></strong>This ensures we do not process partial data, i.e., hours for which data has not been loaded into all source tables. In this case, the maximum hour to process the data is hour 6.</li><li id="0585" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx pu pv pw bj">The ETL process uses this time range to compute the state in the changed partitions and overwrite them in the target table. 
This helps overwrite data only when required and minimizes unnecessary reprocessing.</li></ol><p id="8304" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">As seen above, by chaining these Psyberg workflows, we could automate the catchup for late-arriving data from hours 2 to 6. The Data Engineer does not need to perform any manual intervention in this case and can thus focus on more important things!</p><h1 id="a12f" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">The Impact: How Psyberg Transformed Our Workflows</h1><p id="385f" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">The introduction of Psyberg into our workflows has been valuable in enhancing accuracy and performance. The following are key areas that have seen improvements from using Psyberg:</p><ul class=""><li id="eb1a" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pr nn no np ps nr ns nt pt nv nw nx qc pv pw bj"><strong class="nc gs">Computational Resources Used: <br /></strong>In certain instances, we’ve noticed a significant reduction in resource utilization, with the number of Spark cores used dropping by 90% following the implementation of Psyberg, compared to using fixed lookback windows</li><li id="b515" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj"><strong class="nc gs">Workflow and Table Onboarding: <br /></strong>We have onboarded 30 tables and 13 workflows into incremental processing since implementing Psyberg</li><li id="a583" class="na nb gr nc b nd px nf ng nh py nj nk nl pz nn no np qa nr ns nt qb nv nw nx qc pv pw bj"><strong class="nc gs">Reliability and Accuracy: <br /></strong>Since onboarding workflows to Psyberg, we have experienced zero manual catchups or missing data incidents</li><li id="a4b1" class="na nb gr nc b nd px nf ng nh py nj nk nl pz 
nn no np qa nr ns nt qb nv nw nx qc pv pw bj"><strong class="nc gs">Bootstrap template: <br /></strong>The process of integrating new tables into incremental processing has been made more accessible and now requires minimal effort using Psyberg</li></ul><p id="dd9b" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">These performance metrics suggest that adopting Psyberg has been beneficial to the efficiency of our data processing workflows.</p><h1 id="bb51" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Next Steps and Conclusion</h1><p id="cc4d" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Integrating Psyberg into our operations has improved our data workflows and opened up exciting possibilities for the future. As we continue to innovate, Netflix’s data platform team is focused on creating a comprehensive solution for incremental processing use cases. This platform-level solution is intended to enhance our data processing capabilities across the organization. Stay tuned for a new post on this!</p><p id="ec4e" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In conclusion, Psyberg has proven to be a reliable and effective solution for our data processing needs. As we look to the future, we’re excited about the potential for further advancements in our data platform capabilities.</p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/3-psyberg-automated-end-to-end-catch-up-260fbe366fe2</link>
      <guid>https://netflixtechblog.com/3-psyberg-automated-end-to-end-catch-up-260fbe366fe2</guid>
      <pubDate>Wed, 15 Nov 2023 04:25:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[2. Diving Deeper into Psyberg: Stateless vs Stateful Data Processing]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg fw fx fy fz"><div><div class="hs ht hu hv hw"></div><p id="9782" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">By <a class="af ny" href="https://www.linkedin.com/in/abhinaya-shetty-ab871418/" rel="noopener ugc nofollow" target="_blank"><em class="nz">Abhinaya Shetty</em></a>, <a class="af ny" href="https://www.linkedin.com/in/bharath-chandra-mummadisetty-27591a88/" rel="noopener ugc nofollow" target="_blank"><em class="nz">Bharath Mummadisetty</em></a></p><p id="8a19" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In the <a class="af ny" href="https://netflixtechblog.medium.com/f68830617dd1" rel="noopener">inaugural blog post</a> of this series, we introduced you to the state of our pipelines before Psyberg and the challenges with incremental processing that led us to create the Psyberg framework within Netflix’s Membership and Finance data engineering team. In this post, we will delve into a more detailed exploration of Psyberg’s two primary operational modes: stateless and stateful.</p><h1 id="6401" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Modes of Operation of Psyberg</h1><p id="59f1" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Psyberg has two main modes of operation or patterns, as we call them. 
Understanding the nature of the late-arriving data and processing requirements will help decide which pattern is most appropriate for a use case.</p><ol class=""><li id="79de" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pd nn no np pe nr ns nt pf nv nw nx pg ph pi bj"><strong class="nc gs">Stateless Data Processing</strong>: As the name suggests, one should use this pattern in scenarios where the columns in the target table solely depend on the content of the incoming events, irrespective of their order of occurrence. For instance, consider a scenario where we need to keep track of all the customer signups over time. In this case, the order of signups wouldn’t matter, and individual signup records are independent of each other. This information has only one source, and we can append new/late records to the fact table as and when the events are received.</li><li id="b4ad" class="na nb gr nc b nd pj nf ng nh pk nj nk nl pl nn no np pm nr ns nt pn nv nw nx pg ph pi bj"><strong class="nc gs">Stateful Data Processing</strong>: This pattern is useful when the output depends on a sequence of events across one or more input streams. For example, the customer account lifecycle in a business might involve multiple stages, such as account creation, plan upgrades, downgrades, and cancellation. To derive attributes like the lifetime of an account or the latest plan the account is on, we need to track the sequence of these events across different input streams. A missed event in such a scenario would result in incorrect analysis due to a wrong derived state. Late-arriving data in such cases requires overwriting data that was previously processed to ensure all events are accounted for.</li></ol><p id="f0bb" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Let’s visualize how these two modes work within our data processing pipeline using a general workflow for loading a fact table. 
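To make the two patterns concrete, here is a minimal plain-Python sketch (hypothetical structures, not Psyberg's actual interfaces): under the stateless pattern a late record is simply appended, while under the stateful pattern every hour from the late event onward is overwritten and the account state re-derived from the full event sequence.

```python
# Hypothetical sketch of the two processing patterns. All names and data
# shapes are illustrative, not Psyberg's real interfaces.

def stateless_load(fact_table, late_events):
    """Order doesn't matter: append new/late records as they are received."""
    fact_table.extend(late_events)
    return fact_table

def stateful_load(events_by_hour, target, late_hour, latest_hour):
    """Order matters: overwrite every partition from the late hour onward,
    re-deriving each hour's state from the previous hour plus its events."""
    for hour in range(late_hour, latest_hour + 1):
        events = sorted(events_by_hour.get(hour, []), key=lambda e: e["ts"])
        target[hour] = derive_state(target.get(hour - 1), events)
    return target

def derive_state(previous_state, events):
    """Fold a sequence of events (e.g. signup -> upgrade -> cancel) into
    a per-account state mapping."""
    state = dict(previous_state or {})
    for e in events:
        state[e["account"]] = e["type"]
    return state
```

Note how a single missed signup in an early hour would, in the stateful case, corrupt every later hour's derived state, which is why the overwrite must start at the late hour rather than just the latest one.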
If you would like to learn more about how the workflows are orchestrated in Netflix Maestro scheduler, please check out this <a class="af ny" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">blog post</a> from our data platform team.</p><figure class="pr ps pt pu pv pw po pp paragraph-image"><div class="po pp pq"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*RnFDv0pCKpSxEiWBF_e2kw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*RnFDv0pCKpSxEiWBF_e2kw.png" /></picture></div></figure><p id="8ab6" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">With this illustration as our guide, let’s explore each mode in more detail.</p><h1 id="8075" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">The Psyberg Initialization Phase</h1><p id="5c60" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">This step invokes Psyberg with the required parameters. 
Based on these parameters, Psyberg then computes the correct data range for the pipeline processing needs.</p><p id="4237" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Input parameters in this step include the following:</p></div></div><div class="pw"><div class="ab ca"><div class="md py me pz mf qa ce qb cf qc ch bg"><figure class="pr ps pt pu pv pw qe qf paragraph-image"><div role="button" tabindex="0" class="qg qh fg qi bg qj"><div class="po pp qd"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*euArkhJ5Fx-lVgoZszxFEg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*euArkhJ5Fx-lVgoZszxFEg.png" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><h1 id="2f59" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Initialization for Stateless Data Processing</h1><p id="8fe2" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Let’s use the signup fact table as an example here. 
This table’s workflow runs hourly, with the main input source being an Iceberg table storing all raw signup events partitioned by landing date, hour, and batch id.</p><p id="87aa" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Here’s a YAML snippet outlining the configuration for this during the Psyberg initialization step:</p><pre class="pr ps pt pu pv qk ql qm bo qn ba bj">- job:<br />   id: psyberg_session_init<br />   type: Spark<br />   spark:<br />     app_args:<br />       - --process_name=signup_fact_load<br />       - --src_tables=raw_signups<br />       - --psyberg_session_id=20230914061001<br />       - --psyberg_hwm_table=high_water_mark_table<br />       - --psyberg_session_table=psyberg_session_metadata<br />       - --etl_pattern_id=1</pre><p id="a6b0" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Behind the scenes, Psyberg identifies that this pipeline is configured for a stateless pattern since <strong class="nc gs">etl_pattern_id=1</strong>.</p><p id="2420" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Psyberg also uses the provided inputs to detect the Iceberg snapshots that persisted after the latest high watermark available in the watermark table. 
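As a rough illustration of this detection step, the sketch below uses plain Python standing in for a query against the table's Iceberg `snapshots` metadata table; the `changed-partitions` summary key is a made-up simplification (real snapshot summaries describe changed data files, from which partitions are parsed):

```python
import json

def snapshots_since_hwm(snapshots, high_watermark_ts):
    """Return processing URIs for snapshots committed after the high watermark.

    Illustrative sketch: each dict mimics a row of Iceberg's `snapshots`
    metadata table (a commit timestamp plus a summary map). Here the changed
    partitions are assumed to be stored directly as a JSON string under a
    hypothetical `changed-partitions` key.
    """
    processing_uris = []
    for snap in snapshots:
        if snap["committed_at"] > high_watermark_ts:
            # Combinations of landing date, hour, and batch id per snapshot.
            processing_uris.extend(json.loads(snap["summary"]["changed-partitions"]))
    return processing_uris
```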
Using the <strong class="nc gs">summary column in snapshot metadata</strong> [see the <a class="af ny" href="https://netflixtechblog.medium.com/f68830617dd1" rel="noopener">Iceberg Metadata section in post 1</a> for more details], we parse out the partition information for each Iceberg snapshot of the source table.</p><p id="c24b" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Psyberg then retains these processing URIs (an array of JSON strings containing combinations of landing date, hour, and batch IDs) as determined by the snapshot changes. This information and other calculated metadata are stored in the <strong class="nc gs">psyberg_session_f</strong> table. This stored data is then available for the subsequent <strong class="nc gs">LOAD.FACT_TABLE</strong> job in the workflow to utilize, as well as for analysis and debugging purposes.</p><h1 id="7ba9" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Initialization for Stateful Data Processing</h1><p id="4bf1" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Stateful Data Processing is used when the output depends on a sequence of events across one or more input streams.</p><p id="c06e" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Let’s consider the example of creating a cancel fact table, which takes the following as input:</p><ol class=""><li id="2910" class="na nb gr nc b nd ne nf ng nh ni nj nk nl pd nn no np pe nr ns nt pf nv nw nx pg ph pi bj"><strong class="nc gs">Raw cancellation events</strong> indicating when the customer account was canceled</li><li id="7c9a" class="na nb gr nc b nd pj nf ng nh pk nj nk nl pl nn no np pm nr ns nt pn nv nw nx pg ph pi bj">A fact table that stores incoming <strong class="nc gs">customer requests </strong>to cancel their subscription at the 
end of the billing period</li></ol><p id="0ab9" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">These inputs help derive additional stateful analytical attributes like the type of churn i.e. voluntary or involuntary, etc.</p><p id="9e44" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The initialization step for Stateful Data Processing differs slightly from Stateless. Psyberg offers additional configurations according to the pipeline needs. Here’s a YAML snippet outlining the configuration for the cancel fact table during the Psyberg initialization step:</p><pre class="pr ps pt pu pv qk ql qm bo qn ba bj">- job:<br />   id: psyberg_session_init<br />   type: Spark<br />   spark:<br />     app_args:<br />       - --process_name=cancel_fact_load<br />       - --src_tables=raw_cancels|processing_ts,cancel_request_fact<br />       - --psyberg_session_id=20230914061501<br />       - --psyberg_hwm_table=high_water_mark_table<br />       - --psyberg_session_table=psyberg_session_metadata<br />       - --etl_pattern_id=2</pre><p id="c150" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Behind the scenes, Psyberg identifies that this pipeline is configured for a stateful pattern since <strong class="nc gs">etl_pattern_id</strong> is 2.</p><p id="34b4" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Notice the additional detail in the src_tables list corresponding to raw_cancels above. The <strong class="nc gs">processing_ts</strong> here represents the event processing timestamp which is different from the regular Iceberg snapshot commit timestamp i.e. 
<strong class="nc gs">event_landing_ts</strong> as described in <a class="af ny" href="https://netflixtechblog.medium.com/f68830617dd1" rel="noopener">part 1</a> of this series.</p><p id="f22c" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">It is important to capture the range of a consolidated batch of events from all the sources i.e. both raw_cancels and cancel_request_fact, while factoring in late-arriving events. Changes to the source table snapshots can be tracked using different timestamp fields. Knowing which timestamp field to use i.e. <strong class="nc gs">event_landing_ts</strong> or something like <strong class="nc gs">processing_ts</strong> helps avoid missing events.</p><p id="e35d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Similar to the approach in stateless data processing, Psyberg uses the provided inputs to parse out the partition information for each Iceberg snapshot of the source table.</p></div></div><div class="pw"><div class="ab ca"><div class="md py me pz mf qa ce qb cf qc ch bg"><figure class="pr ps pt pu pv pw qe qf paragraph-image"><div role="button" tabindex="0" class="qg qh fg qi bg qj"><div class="po pp qt"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*m7WRiJ012nCinKoZr1Yglg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*m7WRiJ012nCinKoZr1Yglg.png" /></picture></div></div><figcaption class="qu fc qv po pp qw qx be b bf z dt">Sample parsed input for target snapshot_date 20230914 and snapshot_hour 9</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="298e" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">This is then used to query the partitions metadata table which has the min and max range for each column in the 
source table. In this case, we look at the min and max range of the <strong class="nc gs">processing_ts</strong> column to determine actual partitions for any late-arriving events. The minimum value here helps determine the lower limit of the data to be processed i.e. the derived minimum date and hour based on the input epoch timestamp.</p></div></div><div class="pw"><div class="ab ca"><div class="md py me pz mf qa ce qb cf qc ch bg"><figure class="pr ps pt pu pv pw qe qf paragraph-image"><div role="button" tabindex="0" class="qg qh fg qi bg qj"><div class="po pp qy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*EwHHXNgpEvPVDyMmW0rEvg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*EwHHXNgpEvPVDyMmW0rEvg.png" /></picture></div></div><figcaption class="qu fc qv po pp qw qx be b bf z dt">Lower Limit to be processed = least ( “min” <strong class="be oc">event_processing_ts)</strong></figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="9379" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">It also tracks the VTTS (Valid To TimeStamp) of all the input streams and determines the minimum VTTS of all the streams together. 
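Combining the two bounds, the range derivation can be sketched as follows (plain Python; the per-source minimum `processing_ts` values and per-stream VTTS values are hypothetical stand-ins for the metadata lookups described above):

```python
def processing_range(source_min_ts, stream_vtts):
    """Derive the timestamp range to process across all input streams.

    source_min_ts: per-source minimum observed processing_ts among the
        late/changed partitions (epoch seconds, illustrative shape).
    stream_vtts:   per-stream Valid To TimeStamp, i.e. how far each
        stream's data is known to be complete.
    """
    # Lower limit: the earliest processing_ts seen across all sources,
    # so no late-arriving event is missed.
    lower_limit = min(source_min_ts.values())
    # Upper limit: the minimum VTTS across all streams, so we only load
    # up to the point where *every* input is complete.
    upper_limit = min(stream_vtts.values())
    return lower_limit, upper_limit
```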
This helps determine the upper limit of data to be processed, thus restricting the data load based on data completeness of all the streams combined.</p></div></div><div class="pw"><div class="ab ca"><div class="md py me pz mf qa ce qb cf qc ch bg"><figure class="pr ps pt pu pv pw qe qf paragraph-image"><div role="button" tabindex="0" class="qg qh fg qi bg qj"><div class="po pp qz"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*QRQuvqnfQ6BbzMceQfC5IQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*QRQuvqnfQ6BbzMceQfC5IQ.png" /></picture></div></div><figcaption class="qu fc qv po pp qw qx be b bf z dt">Upper Limit to be processed = least (<strong class="be oc">vtts date-hour)</strong></figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="b5f6" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Using this metadata from different streams, Psyberg calculates several parameters like minimum/maximum processing date and hour and event landing date hour. These parameters, along with other metadata, discussed in the previous post, are persisted in the <strong class="nc gs">psyberg_session_f</strong> table for analysis and debugging purposes.</p><h1 id="d413" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Write Audit Publish (WAP) process</h1><p id="05e4" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">The <a class="af ny" href="https://www.dremio.com/resources/webinars/the-write-audit-publish-pattern-via-apache-iceberg/" rel="noopener ugc nofollow" target="_blank">Write Audit Publish (WAP) process</a> is a general pattern we use in our ETLs to validate writes to the uncommitted Iceberg snapshot before publishing to the target table. 
The <strong class="nc gs">LOAD.FACT_TABLE</strong> step takes <strong class="nc gs">psyberg_session_id</strong> and <strong class="nc gs">process_name</strong> as input arguments.</p><p id="b75a" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">For the stateless pattern, the processing URIs to be processed as part of the load step are identified by reading the <strong class="nc gs">psyberg_session_f</strong> table. This information is then used to filter the source table and apply the business logic to create the signup fact table. Any late-arriving signup event data is appended to the target table partitions as part of this. All these writes go into the uncommitted Iceberg snapshot managed by the WAP pattern.</p><p id="8f62" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Similarly, in the stateful pattern, the ETL step reads the <strong class="nc gs">psyberg_session_f</strong> table to identify the derived minimum and maximum date-hour range to be processed, which acts as a filter for the different input tables involved in the ETL. After applying the corresponding business logic for cancellation events, we create the cancel fact table along with columns like cancellation type (i.e., voluntary vs involuntary churn) representing the state of the canceled account. If there are any late-arriving events, Psyberg handles them automatically by providing the correct range to the data process to derive the state changes correctly.</p><h1 id="1460" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Audits</h1><p id="ccd4" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">We run different audits on the uncommitted Iceberg snapshot created as part of the job run. 
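The audit gate can be pictured with a small sketch (illustrative only; in the real pipeline these checks run against the uncommitted Iceberg snapshot, which plain lists stand in for here):

```python
def audit_and_publish(source_rows, staged_rows, target):
    """Write-Audit-Publish gate: publish staged rows only if audits pass.

    Hypothetical stand-in: `staged_rows` plays the role of the uncommitted
    Iceberg snapshot, `target` the published table.
    """
    # Audit 1: source-to-target count comparison over the processed cohort.
    if len(staged_rows) != len(source_rows):
        raise ValueError("count mismatch: refusing to publish")
    # Audit 2: no missing events - every source key must appear in staging.
    missing = {r["id"] for r in source_rows} - {r["id"] for r in staged_rows}
    if missing:
        raise ValueError(f"missing events: {missing}")
    # "Publish": commit the audited snapshot to the target table.
    target.extend(staged_rows)
    return target
```

If either audit raises, nothing reaches the target table, mirroring how a failed blocking audit prevents the uncommitted snapshot from being published.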
Leveraging Psyberg metadata, we can identify the cohort of data involved as part of the job run. This helps in pinpointing changes and applying blocking audits efficiently. Audits like source-to-target count comparison and checking for no missing events in the target Iceberg snapshot ensure data integrity and completeness. Once the audits pass successfully, the data is published to the target table.</p><h1 id="a91f" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">HWM Commit</h1><p id="0a5f" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Leveraging Psyberg metadata tables, we determine the latest timestamp associated with the Iceberg snapshot seen as part of the job run. This timestamp is used to update the high watermark table with the new high watermark so that the subsequent pipeline instance can pick up the next set of changes.</p><h1 id="3109" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Conclusion</h1><p id="f5e2" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">This exploration shows how Psyberg brings efficiency, accuracy, and timeliness to Stateless and Stateful Data Processing within the Membership and Finance data engineering team. Join us in the <a class="af ny" href="https://netflixtechblog.medium.com/260fbe366fe2" rel="noopener">next part</a> of our blog series, where we’ll discuss how it also helps automate the end-to-end catchup of different pipelines.</p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/2-diving-deeper-into-psyberg-stateless-vs-stateful-data-processing-1d273b3aaefb</link>
      <guid>https://netflixtechblog.com/2-diving-deeper-into-psyberg-stateless-vs-stateful-data-processing-1d273b3aaefb</guid>
      <pubDate>Wed, 15 Nov 2023 04:25:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[1. Streamlining Membership Data Engineering at Netflix with Psyberg]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg fw fx fy fz"><div><div class="hs ht hu hv hw"></div><p id="3928" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">By <a class="af ny" href="https://www.linkedin.com/in/abhinaya-shetty-ab871418/" rel="noopener ugc nofollow" target="_blank"><em class="nz">Abhinaya Shetty</em></a>, <a class="af ny" href="https://www.linkedin.com/in/bharath-chandra-mummadisetty-27591a88/" rel="noopener ugc nofollow" target="_blank"><em class="nz">Bharath Mummadisetty</em></a></p><p id="4666" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">At Netflix, our <strong class="nc gs">Membership and Finance Data Engineering team</strong> harnesses diverse data related to plans, pricing, membership life cycle, and revenue to fuel analytics, power various dashboards, and make data-informed decisions. Many metrics in <a class="af ny" href="https://s2.bl-1.com/h/i/dtZJ85P6/tWbBNBk" rel="noopener ugc nofollow" target="_blank"><strong class="nc gs">Netflix’s financial reports</strong></a> are powered and reconciled with efforts from our team! Given our role on this critical path, <strong class="nc gs">accuracy</strong> is paramount. In this context, managing the data, especially when it arrives late, can present a substantial challenge!</p><p id="c124" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">In this three-part blog post series, we introduce you to <strong class="nc gs"><em class="nz">Psyberg</em>, our incremental data processing framework</strong> designed to tackle such challenges! We’ll discuss batch data processing, the limitations we faced, and how Psyberg emerged as a solution. Furthermore, we’ll delve into the inner workings of Psyberg, its unique features, and how it integrates into our data pipelining workflows. 
By the end of this series, we hope you will gain an understanding of how Psyberg transformed our data processing, making our pipelines more efficient, accurate, and timely. Let’s dive in!</p><h1 id="bb43" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">The Challenge: Incremental Data Processing with Late Arriving Data</h1><p id="0a9a" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Our team’s data processing model mainly comprises <strong class="nc gs">batch pipelines</strong>, which run at different intervals ranging from hourly to multiple times a day (also known as intraday) and even daily. We expect <strong class="nc gs">complete and accurate data </strong>at the end of each run. To meet such expectations, we generally run our pipelines with a lag of a few hours to leave room for late-arriving data.</p><h1 id="0810" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">What is late-arriving data?</h1><p id="b8e8" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Late-arriving data is essentially delayed data due to system retries, network delays, batch processing schedules, system outages, delayed upstream workflows, or reconciliation in source systems.</p><h1 id="61a3" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">How does late-arriving data impact us?</h1><p id="a8e7" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">You could think of our data as a puzzle. With each new piece of data, we must fit it into the larger picture and ensure it’s accurate and complete. 
Thus, we must reprocess the missed data to ensure data completeness and accuracy.</p><h1 id="2550" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Types of late-arriving data</h1><figure class="pg ph pi pj pk pl pd pe paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe pf"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*Xw5VD9l6P04jy6vM%20640w,%20https://miro.medium.com/v2/resize:fit:720/0*Xw5VD9l6P04jy6vM%20720w,%20https://miro.medium.com/v2/resize:fit:750/0*Xw5VD9l6P04jy6vM%20750w,%20https://miro.medium.com/v2/resize:fit:786/0*Xw5VD9l6P04jy6vM%20786w,%20https://miro.medium.com/v2/resize:fit:828/0*Xw5VD9l6P04jy6vM%20828w,%20https://miro.medium.com/v2/resize:fit:1100/0*Xw5VD9l6P04jy6vM%201100w,%20https://miro.medium.com/v2/resize:fit:1400/0*Xw5VD9l6P04jy6vM%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*Xw5VD9l6P04jy6vM 640w, https://miro.medium.com/v2/resize:fit:720/0*Xw5VD9l6P04jy6vM 720w, https://miro.medium.com/v2/resize:fit:750/0*Xw5VD9l6P04jy6vM 750w, https://miro.medium.com/v2/resize:fit:786/0*Xw5VD9l6P04jy6vM 786w, https://miro.medium.com/v2/resize:fit:828/0*Xw5VD9l6P04jy6vM 828w, https://miro.medium.com/v2/resize:fit:1100/0*Xw5VD9l6P04jy6vM 1100w, https://miro.medium.com/v2/resize:fit:1400/0*Xw5VD9l6P04jy6vM 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 
700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="70c5" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Based on the structure of our upstream systems, we’ve classified late-arriving data into two categories, each named after the timestamps of the updated partition:</p></div></div><div class="pl"><div class="ab ca"><div class="md pr me ps mf pt ce pu cf pv ch bg"><figure class="pg ph pi pj pk pl px py paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe pw"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*rJmOFmCoW47SCjvHeX0CXQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*rJmOFmCoW47SCjvHeX0CXQ.png" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><h1 id="0445" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Ways to process such data</h1><p id="3847" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Our team previously employed some strategies to manage these scenarios, which often led to unnecessarily reprocessing unchanged data. Some techniques we used were:</p><p id="aa05" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">1. Using<strong class="nc gs"> fixed lookback </strong>windows to always reprocess data, assuming that most late-arriving events will occur within that window. 
However, this approach usually leads to redundant data reprocessing, thereby increasing <a class="af ny" href="https://en.wikipedia.org/wiki/Extract,_transform,_load" rel="noopener ugc nofollow" target="_blank">ETL</a> processing time and compute costs. It also becomes inefficient as the data scale increases. Imagine reprocessing the past 6 hours of data every hour!</p><p id="4e34" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">2. <strong class="nc gs">Adding alerts</strong> to flag when late-arriving data appears, blocking the pipelines, and performing a manual intervention in which we triggered backfill pipelines to handle the missed events. This approach involved minimal extra processing for the most part and, hence, was our preferred solution. However, when the late events occurred, the pain of reprocessing data and catching up on all the dependent pipelines was not worth it! We will talk about this shortly.</p><p id="1d29" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">At a high level, both of these approaches were inefficient for intraday pipelines and impacted cost, performance, accuracy, and time. We developed <strong class="nc gs">Psyberg</strong>, an incremental processing framework using <a class="af ny" href="https://iceberg.apache.org/" rel="noopener ugc nofollow" target="_blank">Iceberg</a> to handle these challenges more effectively.</p><h1 id="9bf7" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">The state of our pipelines before Psyberg</h1><p id="5397" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Before diving into the world of Psyberg, it’s crucial to take a step back and reflect on the state of the data pipelines in our team before its implementation. 
The complexities involved in these processes and the difficulties they posed led to the development of Psyberg.</p><p id="b95d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">At Netflix, our backend microservices continuously generate real-time event data that gets streamed into Kafka. These raw events are the source of various data processing workflows within our team. We ingest this diverse event data and transform it into standardized fact tables. The fact tables then feed downstream intraday pipelines that process the data hourly. The sequential load ETL shown in the diagram below depicts one such pipeline that calculates an account's state every hour.</p></div></div><div class="pl"><div class="ab ca"><div class="md pr me ps mf pt ce pu cf pv ch bg"><figure class="pg ph pi pj pk pl px py paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe pf"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*6-ZIOA5PMNz_DPW-%20640w,%20https://miro.medium.com/v2/resize:fit:720/0*6-ZIOA5PMNz_DPW-%20720w,%20https://miro.medium.com/v2/resize:fit:750/0*6-ZIOA5PMNz_DPW-%20750w,%20https://miro.medium.com/v2/resize:fit:786/0*6-ZIOA5PMNz_DPW-%20786w,%20https://miro.medium.com/v2/resize:fit:828/0*6-ZIOA5PMNz_DPW-%20828w,%20https://miro.medium.com/v2/resize:fit:1100/0*6-ZIOA5PMNz_DPW-%201100w,%20https://miro.medium.com/v2/resize:fit:2000/0*6-ZIOA5PMNz_DPW-%202000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source 
data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*6-ZIOA5PMNz_DPW- 640w, https://miro.medium.com/v2/resize:fit:720/0*6-ZIOA5PMNz_DPW- 720w, https://miro.medium.com/v2/resize:fit:750/0*6-ZIOA5PMNz_DPW- 750w, https://miro.medium.com/v2/resize:fit:786/0*6-ZIOA5PMNz_DPW- 786w, https://miro.medium.com/v2/resize:fit:828/0*6-ZIOA5PMNz_DPW- 828w, https://miro.medium.com/v2/resize:fit:1100/0*6-ZIOA5PMNz_DPW- 1100w, https://miro.medium.com/v2/resize:fit:2000/0*6-ZIOA5PMNz_DPW- 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div><figcaption class="pz fc qa pd pe qb qc be b bf z dt">Raw data for hours 3 and 6 arrive. Hour 6 data flows through the various workflows, while hour 3 triggers a late data audit alert.</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="a3de" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Let’s walk through an example to understand the complexity of this pre-Psyberg world.</p><p id="798d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">Consider a simplified version of our pipelines where we process three events: signups, plan changes, and cancels. Now imagine that some signup events from hour 3 were delayed and sent in at hour 6 instead. Our audits would detect this and alert the on-call data engineer (DE). 
The on-call DE would then face the daunting task of making things right!</p><p id="3991" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Step 1</strong>: Dive into the audit logs to identify the late-arriving data and the impacted workflows. In this case, they would discover that the late-arriving data for hour 3 must be included in the signup facts.</p><p id="c38a" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Step 2</strong>: Stop all impacted workflows and downstream jobs (such as the sequential load ETL) and patch the missed data in the fact tables. Now, the data in the signup fact is patched.</p><p id="6f5d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Step 3</strong>: Identify the number of partitions to be rerun for the sequential stateful load jobs to account for the delayed data and rerun them from the impacted date-hour. The DE would note that the data for hours 3–6 needs to be reprocessed and would retrigger four instances to be run sequentially. This step is crucial because missing signup events from hour 3 would result in us missing subsequent events for those affected accounts (e.g., a cancel event for a missed signup would have had no effect). As we capture the state of an account based on the sequence of different types of events, rerunning the sequential load ETL from hours 3 to 6 ensures the accurate representation of account states.</p><p id="fa57" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj"><strong class="nc gs">Step 4</strong>: Now that we’ve spent significant time triaging and resolving the alert, the sequential ETL workflow likely experienced a delay. As a result, we need to catch up to schedule. 
To compensate for the lost time, the DE must trigger a few additional instances until the latest hour that would have run if the data hadn’t arrived late.</p><p id="c34d" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">This entire process was challenging and required significant manual intervention from the on-call DE. Note that these are hourly jobs, so the alert could be triggered at any time of the day (or night!). Yes, these incidents were infrequent, but they were a big pain point when they occurred! Also, the on-call DE was usually not the SME for these pipelines, as the late data could have arrived in any of our upstream pipelines. To solve these problems, we came up with Psyberg!</p><h1 id="5b06" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Psyberg: The Game Changer!</h1><p id="cd82" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Psyberg automates our data loads and supports a variety of data processing needs, including intraday pipeline use cases. 
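</p><p class="pw-post-body-paragraph">To make the bookkeeping concrete, here is an illustrative sketch (hypothetical code, not Psyberg’s actual implementation) of the hour-range logic such automation replaces: given a stored high watermark and any late-arriving partitions, derive the hourly runs needed to both backfill and catch up.</p>

```python
from datetime import datetime, timedelta

def hours_to_process(high_watermark, late_partitions, now):
    """Hourly partitions to (re)run after a late-data alert.

    Starts at the earliest late-arriving partition so a stateful,
    sequential load replays events in order, and continues through
    the latest complete hour, which also covers catch-up runs.
    """
    latest_complete = now.replace(minute=0, second=0, microsecond=0)
    start = min(late_partitions + [high_watermark + timedelta(hours=1)])
    runs = []
    while start < latest_complete:
        runs.append(start)
        start += timedelta(hours=1)
    return runs

# Hour-3 signup data arrives late and is detected after the hour-6 run:
runs = hours_to_process(
    high_watermark=datetime(2023, 11, 15, 6),
    late_partitions=[datetime(2023, 11, 15, 3)],
    now=datetime(2023, 11, 15, 7, 20),
)  # four sequential reruns: hours 3, 4, 5 and 6
```

<p class="pw-post-body-paragraph">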
It leverages Iceberg metadata to facilitate the processing of both incremental and batch-based data pipelines.</p></div></div><div class="pl"><div class="ab ca"><div class="md pr me ps mf pt ce pu cf pv ch bg"><figure class="pg ph pi pj pk pl px py paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe pf"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*i3Q9OtyFGyxh0Zon" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*i3Q9OtyFGyxh0Zon 640w, https://miro.medium.com/v2/resize:fit:720/0*i3Q9OtyFGyxh0Zon 720w, https://miro.medium.com/v2/resize:fit:750/0*i3Q9OtyFGyxh0Zon 750w, https://miro.medium.com/v2/resize:fit:786/0*i3Q9OtyFGyxh0Zon 786w, https://miro.medium.com/v2/resize:fit:828/0*i3Q9OtyFGyxh0Zon 828w, https://miro.medium.com/v2/resize:fit:1100/0*i3Q9OtyFGyxh0Zon 1100w, https://miro.medium.com/v2/resize:fit:2000/0*i3Q9OtyFGyxh0Zon 2000w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="d07a" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">One of the critical features of Psyberg is its ability to detect and manage late-arriving data, no matter the partition it lands in. This feature allows data pipelines to handle late-arriving data effectively without manual intervention, ensuring higher data accuracy in our systems. <a class="af ny" href="https://iceberg.apache.org/spec/" rel="noopener ugc nofollow" target="_blank">Iceberg metadata</a> and Psyberg’s own metadata form the backbone of its efficient data processing capabilities.</p><h1 id="c7e4" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">ETL Process High Watermark</h1><p id="064c" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">This is the last recorded update timestamp for any data pipeline process. This is mainly used to identify new changes since the last update.</p><h1 id="7116" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Iceberg Metadata</h1><p id="f9ff" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">Psyberg primarily harnesses two key iceberg metadata tables — <em class="nz">snapshots and partitions</em> — to manage the workload. 
All Iceberg tables have associated metadata that provide insight into changes or updates within the data tables.</p><p id="0e9a" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The snapshots metadata table records essential metadata such as:</p><ul class=""><li id="0bac" class="na nb gr nc b nd ne nf ng nh ni nj nk nl qd nn no np qe nr ns nt qf nv nw nx qg qh qi bj">The creation time of a snapshot</li><li id="2e13" class="na nb gr nc b nd qj nf ng nh qk nj nk nl ql nn no np qm nr ns nt qn nv nw nx qg qh qi bj">The type of operation performed (append, overwrite, etc.)</li><li id="e981" class="na nb gr nc b nd qj nf ng nh qk nj nk nl ql nn no np qm nr ns nt qn nv nw nx qg qh qi bj">A summary of partitions created/updated during the generation of the Iceberg snapshot</li></ul><p id="8cd1" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">These details enable Psyberg to track different operations and identify changes made to a source table since the previous high watermark. 
For example:</p></div></div><div class="pl"><div class="ab ca"><div class="md pr me ps mf pt ce pu cf pv ch bg"><figure class="pg ph pi pj pk pl px py paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe qo"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*MgEYVLbQnpEnm4aDt3OFMQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*MgEYVLbQnpEnm4aDt3OFMQ.png" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="ce36" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The partitions metadata table is particularly interesting as it stores:</p><ul class=""><li id="d5de" class="na nb gr nc b nd ne nf ng nh ni nj nk nl qd nn no np qe nr ns nt qf nv nw nx qg qh qi bj">Information about partition keys used in the data table</li><li id="e559" class="na nb gr nc b nd qj nf ng nh qk nj nk nl ql nn no np qm nr ns nt qn nv nw nx qg qh qi bj">Column names and the range of values for each column within a specific partition</li></ul></div></div><div class="pl"><div class="ab ca"><div class="md pr me ps mf pt ce pu cf pv ch bg"><figure class="pg ph pi pj pk pl px py paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe qp"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*n5c4kslLyEDLKA1aL2bIgw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*n5c4kslLyEDLKA1aL2bIgw.png" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="5fe9" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">One unique aspect of Netflix’s internal implementation is that it provides the range of values for each column within a partition in a deserialized format. 
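</p><p class="pw-post-body-paragraph">As an illustration, with hypothetical snapshot rows rather than Netflix’s internal schema, identifying the partitions written since the last high watermark from such metadata might look like:</p>

```python
from datetime import datetime

# Hypothetical rows from an Iceberg "snapshots" metadata table: the
# commit time, the operation, and the partitions the snapshot touched.
snapshots = [
    {"committed_at": datetime(2023, 11, 15, 5, 10),
     "operation": "append", "partitions": ["date_hour=2023-11-15-05"]},
    {"committed_at": datetime(2023, 11, 15, 6, 10),
     "operation": "append",
     "partitions": ["date_hour=2023-11-15-06", "date_hour=2023-11-15-03"]},
]

def changed_partitions(snapshots, high_watermark):
    """Partitions written by any snapshot committed after the watermark."""
    changed = set()
    for snap in snapshots:
        if snap["committed_at"] > high_watermark:
            changed.update(snap["partitions"])
    return sorted(changed)

# Only the 06:10 commit is new; it also (late-)wrote the hour-3 partition.
parts = changed_partitions(snapshots, datetime(2023, 11, 15, 6))
```

<p class="pw-post-body-paragraph">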
This information helps Psyberg comprehend the timestamp ranges for both types of late-arriving data (event and processing time) without querying the actual data.</p><h1 id="be62" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Psyberg Metadata</h1><p id="8fb5" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">In addition to Iceberg metadata, Psyberg maintains its own metadata tables — the session table and the high watermark table. Both these tables are partitioned by the pipeline process name to maintain information related to each data pipeline independently.</p><p id="76a0" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The session table captures metadata specific to each pipeline run, including:</p><ul class=""><li id="68f6" class="na nb gr nc b nd ne nf ng nh ni nj nk nl qd nn no np qe nr ns nt qf nv nw nx qg qh qi bj">Process name partition to track all the runs associated with the data pipeline process</li><li id="2f97" class="na nb gr nc b nd qj nf ng nh qk nj nk nl ql nn no np qm nr ns nt qn nv nw nx qg qh qi bj">Session ID to track unique runs within the process</li><li id="e9c6" class="na nb gr nc b nd qj nf ng nh qk nj nk nl ql nn no np qm nr ns nt qn nv nw nx qg qh qi bj">Processing URIs to identify the input partitions involved in the load</li><li id="053f" class="na nb gr nc b nd qj nf ng nh qk nj nk nl ql nn no np qm nr ns nt qn nv nw nx qg qh qi bj">“from date”, “from hour”, “to date” and “to hour” for both event and processing times</li></ul><p id="96ce" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">The high watermark table stores relevant values from the session table at the end of each pipeline run:</p><ul class=""><li id="638e" class="na nb gr nc b nd ne nf ng nh ni nj nk nl qd nn no np qe nr ns nt qf nv nw 
nx qg qh qi bj">Latest and previous high water mark timestamp</li><li id="8953" class="na nb gr nc b nd qj nf ng nh qk nj nk nl ql nn no np qm nr ns nt qn nv nw nx qg qh qi bj">Metadata related to the latest run</li></ul></div></div><div class="pl"><div class="ab ca"><div class="md pr me ps mf pt ce pu cf pv ch bg"><figure class="pg ph pi pj pk pl px py paragraph-image"><div role="button" tabindex="0" class="pm pn fg po bg pp"><div class="pd pe qq"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*odWvTipThkfPS-r9B_MoKQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*odWvTipThkfPS-r9B_MoKQ.png" /></picture></div></div></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="4ebc" class="pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj">This information is vital for each pipeline run instance as it helps determine the data to be loaded, updates the high water mark after processing, and finally generates output signals to inform downstream workflows about the date-hour up to which data is complete and available. It also serves as an essential resource for debugging and creating audits on the pipeline jobs.</p><h1 id="008d" class="oa ob gr be oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow ox bj">Conclusion</h1><p id="39d2" class="pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj">In this post, we described our data architecture at a high level, along with the pain points that led to the development of Psyberg. We also went into details related to the metadata that powers Psyberg. 
If you understand the challenges faced by the on-call DE and would like to learn more about our solution, please check out the <a class="af ny" href="https://netflixtechblog.medium.com/1d273b3aaefb" rel="noopener">next installment</a> of this three-part series, where we delve deeper into the different modes of Psyberg.</p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/1-streamlining-membership-data-engineering-at-netflix-with-psyberg-f68830617dd1</link>
      <guid>https://netflixtechblog.com/1-streamlining-membership-data-engineering-at-netflix-with-psyberg-f68830617dd1</guid>
      <pubDate>Wed, 15 Nov 2023 04:24:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Detecting Speech and Music in Audio Content]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="0f44" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj"><a class="af nr" href="https://www.linkedin.com/in/iroroorife/" rel="noopener ugc nofollow" target="_blank">Iroro Orife</a>, <a class="af nr" href="https://www.linkedin.com/in/chih-wei-wu-73081689/" rel="noopener ugc nofollow" target="_blank">Chih-Wei Wu</a> and <a class="af nr" href="https://www.linkedin.com/in/yun-ning-hung/" rel="noopener ugc nofollow" target="_blank">Yun-Ning (Amy) Hung</a></p><h1 id="039b" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Introduction</h1><p id="6296" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">When you enjoy the latest season of <em class="ov">Stranger Things</em> or <em class="ov">Casa de Papel (Money Heist)</em>, have you ever wondered about the secrets to fantastic story-telling, besides the stunning visual presentation? From the violin melody accompanying a pivotal scene to the soaring orchestral arrangement and thunderous sound-effects propelling an edge-of-your-seat action sequence, the various components of the audio soundtrack combine to evoke the very essence of story-telling. 
To uncover the magic of audio soundtracks and further improve the sonic experience, we need a way to systematically examine the interaction of these components, typically categorized as <a class="af nr" href="https://www.jstor.org/stable/j.ctt16t8zf9" rel="noopener ugc nofollow" target="_blank">dialogue, music and effects</a>.</p><p id="fce4" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">In this blog post, we will introduce speech and music detection as an enabling technology for a variety of audio applications in Film &amp; TV, as well as introduce our speech and music activity detection (SMAD) system, which we recently published as a <a class="af nr" href="https://dl.acm.org/doi/abs/10.1186/s13636-022-00253-8" rel="noopener ugc nofollow" target="_blank">journal article</a> in EURASIP Journal on Audio, Speech, and Music Processing.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*W-4QTtWN_NDQt4HWr3EBBQ.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*W-4QTtWN_NDQt4HWr3EBBQ.gif" /></picture></div></div></figure><p id="a899" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Like semantic segmentation for audio, SMAD separately tracks the amount of speech and music in each frame in an audio file and is useful in <em class="ov">content understanding</em> tasks during the audio production and delivery lifecycle. The detailed temporal metadata SMAD provides about speech and music regions in a polyphonic audio mixture is a first step for structural audio segmentation, indexing and pre-processing audio for the downstream tasks that follow. 
Let’s have a look at a few applications.</p><h1 id="1411" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Practical use cases for speech &amp; music activity</h1><h2 id="ef54" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Audio dataset preparation</h2><p id="3df2" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Speech &amp; music activity is an important preprocessing step to prepare corpora for training. SMAD classifies &amp; segments long-form audio for use in large corpora, such as</p><ul class=""><li id="e06b" class="mt mu gr mv b mw mx my mz na nb nc nd ne qa ng nh ni qb nk nl nm qc no np nq qd qe qf bj">musical segments for <a class="af nr" href="https://www.ismir.net/resources/datasets/" rel="noopener ugc nofollow" target="_blank">music information retrieval tasks</a> (MIR).</li><li id="db00" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">utterances for <a class="af nr" href="https://github.com/s3prl/s3prl#downstream" rel="noopener ugc nofollow" target="_blank">speech tasks</a> like speaker diarization, emotion classification, semantic and phonetic transcription and translation.</li></ul><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*-_gP351GvZPmF9IbZVmEFg.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*-_gP351GvZPmF9IbZVmEFg.gif" /></picture></div></div><figcaption class="ql fc qm ow ox qn qo be b bf z dt">From “Audio Signal Classification” by David Gerhard</figcaption></figure><h2 id="0d2c" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Dialogue analysis &amp; processing</h2><ul class=""><li id="c97a" class="mt mu gr mv b 
mw oq my mz na or nc nd ne qp ng nh ni qq nk nl nm qr no np nq qd qe qf bj">During encoding at Netflix, speech-gated loudness is computed for every audio master track and used for loudness normalization. Speech-activity metadata is thus a central part of accurate catalog-wide loudness management and improved audio volume experience for Netflix members.</li><li id="1d26" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Similarly, algorithms for dialogue intelligibility, spoken-language-identification and speech-transcription are only applied to audio regions where there is measured speech.</li></ul><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*C5x--pVe2lu8AMWT43Je4Q.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*C5x--pVe2lu8AMWT43Je4Q.gif" /></picture></div></div></figure><h2 id="921a" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Music information retrieval</h2><ul class=""><li id="824e" class="mt mu gr mv b mw oq my mz na or nc nd ne qp ng nh ni qq nk nl nm qr no np nq qd qe qf bj">There are a few studio use cases where music activity metadata is important, including quality-control (QC) and at-scale multimedia content analysis and tagging.</li><li id="790f" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">There are also inter-domain tasks like singer-identification and song lyrics transcription, which do not fit neatly into either speech or classical MIR tasks, but are useful for annotating musical passages with lyrics in closed captions and subtitles.</li><li id="4ec5" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Conversely, where neither speech nor music activity is present, 
such audio regions are estimated to have content classified as noisy, environmental or sound-effects.</li></ul><h2 id="ca9a" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Localization &amp; Dubbing</h2><p id="afbe" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Finally, there are <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/introducing-netflix-timed-text-authoring-lineage-6fb57b72ad41">post-production tasks</a>, which take advantage of accurate speech segmentation at the spoken utterance or sentence level, ahead of translation and dub-script generation. Likewise, authoring accessibility features like <a class="af nr" href="https://en.wikipedia.org/wiki/Audio_description" rel="noopener ugc nofollow" target="_blank">Audio Description</a> (AD) involves music and speech segmentation. The AD narration is typically mixed in so as not to overlap with the primary dialogue, while music lyrics strongly tied to the plot of the story are sometimes referenced by AD creators, especially for translated AD.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div class="ow ox qs"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*3tWa-iN8pX_ZhOTskzd78Q.jpeg" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*3tWa-iN8pX_ZhOTskzd78Q.jpeg" /></picture></div><figcaption class="ql fc qm ow ox qn qo be b bf z dt">A voice actor in the studio</figcaption></figure><h1 id="9597" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Our Approach to Speech and Music Activity Detection</h1><p id="56ec" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Although the application of deep learning methods has improved audio classification systems in recent years, this data-
driven approach for SMAD requires large amounts of audio source material with audio-frame level speech and music activity labels. The collection of such fine-resolution labels is costly and labor-intensive, and audio content often cannot be publicly shared due to copyright limitations. We address the challenge from a different angle.</p><h2 id="3e7d" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Content, genre and languages</h2><p id="736f" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Instead of augmenting or synthesizing training data, we sample the large-scale data available in the Netflix catalog with noisy labels. In contrast to clean labels, which indicate precise start and end times for each speech/music region, noisy labels only provide approximate timing, which may impact SMAD classification performance. Nevertheless, noisy labels allow us to increase the scale of the dataset with minimal manual effort and potentially generalize better across different types of content.</p><p id="a57f" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Our dataset, which we introduced as TVSM (TV Speech and Music) in <a class="af nr" href="https://www.springeropen.com/epdf/10.1186/s13636-022-00253-8?sharing_token=qUE9lQ50qcQxbhy4q7WuAm_BpE1tBhCbnbw3BuzI2RPYHxmYyj04FfJD9WVAT3xVEfjU0YvWAKHjSrjS3Pk16I2vFtdRuQgSdmgaSKkf5JiXbOSb0AglyInIbQCpnL8z0kJbzIzN5s368ENFJJSbKW1C3I7fzTQEHjPKYPBd2xM%3D" rel="noopener ugc nofollow" target="_blank">our publication</a>, has a total of 1608 hours of professionally recorded and produced audio. TVSM is significantly larger than other SMAD datasets and contains both speech and music labels at the frame level. 
TVSM also contains overlapping music and speech labels, and both classes have a similar total duration.</p><p id="b75b" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Training examples were produced between 2016 and 2019, in 13 countries, with 60% of the titles originating in the USA. Content duration ranged from 10 minutes to over 1 hour, across the various genres listed below.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*gIrMnzD0LkTZl00Q4taGcA.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*gIrMnzD0LkTZl00Q4taGcA.gif" /></picture></div></div></figure><p id="28bc" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The dataset contains audio tracks in three different languages, namely English, Spanish, and Japanese. The <strong class="mv gs">language distribution</strong> is shown in the figure below. The name of the episode/TV show for each sample remains unpublished. However, each sample has both a show-ID and a season-ID to help identify the connection between the samples. 
For instance, two samples from different seasons of the same show would share the same show ID and have different season IDs.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*Yx_cwy9oHGuQcYhgvNTj-g.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*Yx_cwy9oHGuQcYhgvNTj-g.gif" /></picture></div></div></figure><h2 id="c70b" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">What constitutes music or speech?</h2><p id="33df" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">To evaluate and benchmark our dataset, we manually labeled 20 audio tracks from various TV shows that do not overlap with our training data. One of the fundamental issues encountered during the annotation of our manually-labeled TVSM-test set was the definition of music and speech. The heavy use of ambient sounds and sound effects blurs the boundaries between active music regions and non-music. Similarly, switches between conversational speech and singing voices in certain TV genres obscure where speech starts and music stops. Furthermore, must these two classes be mutually exclusive? 
To ensure label quality and consistency, and to avoid ambiguity, we converged on the following guidelines for differentiating music and speech:</p><ul class=""><li id="cd32" class="mt mu gr mv b mw mx my mz na nb nc nd ne qa ng nh ni qb nk nl nm qc no np nq qd qe qf bj">Any music that is perceivable by the annotator at a comfortable playback volume should be annotated.</li><li id="1a52" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Since sung lyrics are often included in closed-captions or subtitles, human singing voices should all be annotated as both speech and music.</li><li id="254e" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Ambient sounds or sound effects without <strong class="mv gs"><em class="ov">apparent melodic contours</em></strong>, such as a traditional phone bell, ringing, or buzzing, should not be annotated as music.</li><li id="3f34" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Filled pauses (uh, um, ah, er), backchannels (mhm, uh-huh), sighing, and screaming should not be annotated as speech.</li></ul><h2 id="52ee" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Audio format and preprocessing</h2><p id="19a0" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">All audio files were originally delivered from the post-production studios in the standard 5.1 surround format at a 48 kHz sampling rate. 
We first normalize all files to an average loudness of −27 LKFS ± 2 LU dialog-gated, then downsample to 16 kHz before creating an <a class="af nr" href="https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.775-1-199407-S!!PDF-E.pdf" rel="noopener ugc nofollow" target="_blank">ITU downmix</a>.</p><h2 id="dfe7" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Model Architecture</h2><p id="3041" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Our modeling choices take advantage of both convolutional and recurrent architectures, which are known to work well on audio sequence classification tasks, and are well supported by previous investigations. We adapted the SOTA convolutional recurrent neural network (<strong class="mv gs">CRNN</strong>) architecture to accommodate our requirements for input/output dimensionality and model complexity. The best model was a CRNN with three convolutional layers, followed by two bi-directional recurrent layers and one fully connected layer. The model has 832k trainable parameters and emits frame-level predictions for both speech and music with a temporal resolution of 5 frames per second.</p><p id="b0cd" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">For training, we leveraged our large and diverse catalog dataset with noisy labels, introduced above. With a random sampling strategy, each training sample is a 20-second segment obtained by randomly selecting an audio file and a corresponding starting timecode offset on the fly. 
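</p><p class="pw-post-body-paragraph">A minimal sketch of this on-the-fly sampling, with hypothetical file ids and durations, could be:</p>

```python
import random

def sample_segment(durations, segment_s=20, sr=16_000, rng=random):
    """Pick a random file, then a random ``segment_s`` window inside it.

    ``durations`` maps a file id to its length in seconds; the offsets
    are returned in samples at the 16 kHz working sample rate.
    """
    file_id = rng.choice(sorted(durations))
    start_s = rng.uniform(0, durations[file_id] - segment_s)
    start = int(start_s * sr)
    return file_id, start, start + segment_s * sr

rng = random.Random(0)  # seeded for reproducibility
fid, start, end = sample_segment({"ep1": 1500.0, "ep2": 3600.0}, rng=rng)
```

<p class="pw-post-body-paragraph">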
All models in our experiments were trained by minimizing <strong class="mv gs">binary cross-entropy (BCE) loss</strong>.</p><h2 id="e004" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Evaluation</h2><p id="1133" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">To understand the influence of different variables in our experimental setup, e.g. model architecture, training data, or input representation variants like log-Mel spectrogram versus per-channel energy normalization (PCEN), we set up<strong class="mv gs"> a detailed ablation study</strong>, which we encourage the reader to explore fully in our <a class="af nr" href="https://dl.acm.org/doi/abs/10.1186/s13636-022-00253-8" rel="noopener ugc nofollow" target="_blank">EURASIP journal article</a>.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*Qc8AnFKtL8XpmO1doiifQA.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*Qc8AnFKtL8XpmO1doiifQA.gif" /></picture></div></div></figure><p id="3d98" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">For each experiment, we reported the class-wise F-score and error rate with a segment size of 10 ms. The error rate is the sum of the deletion rate (false negatives) and the insertion rate (false positives).
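As a sketch (not the paper's evaluation code), the segment-level metrics just described could be computed like this for one class, given aligned 0/1 activity sequences:

```python
def segment_metrics(reference, prediction):
    """Class-wise F-score and error rate over fixed 10 ms segments.

    `reference` and `prediction` are equal-length 0/1 activity sequences
    for one class (speech or music); an illustrative helper, not the
    paper's exact evaluation code.
    """
    tp = sum(r and p for r, p in zip(reference, prediction))
    fp = sum((not r) and p for r, p in zip(reference, prediction))  # insertions
    fn = sum(r and (not p) for r, p in zip(reference, prediction))  # deletions
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n_ref = sum(reference)
    # Error rate = deletion rate + insertion rate, relative to reference activity.
    error_rate = (fn + fp) / n_ref if n_ref else 0.0
    return f_score, error_rate
```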
Since a binary decision must be reached for music and speech in order to calculate the F-score, a threshold of 0.5 was used to quantize the continuous output of the speech and music activity functions.</p><h2 id="0a93" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Results</h2><p id="05bd" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">We evaluated our models on<strong class="mv gs"> four open datasets</strong> comprising audio data from TV programs, YouTube clips, and various content such as concerts, radio broadcasts, and low-fidelity folk music. The excellent performance of our models demonstrates the importance of building a robust system that detects <strong class="mv gs">overlapping speech and music</strong> and supports our assumption that a large but noisily labeled real-world dataset can serve as a viable solution for SMAD.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*E1Jlm4oPe5VxOd-2Kf7-1A.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*E1Jlm4oPe5VxOd-2Kf7-1A.gif" /></picture></div></div></figure><h1 id="9e54" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Conclusion</h1><p id="2f00" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">At Netflix, tasks throughout the content production and delivery lifecycle are most often interested in only one part of the soundtrack. Tasks that operate on just dialogue, music, or effects are performed hundreds of times a day by teams around the globe, in dozens of different audio languages.
So investments in algorithmically assisted tools for automatic audio content understanding, like SMAD, can yield substantial productivity returns at scale while minimizing tedium.</p><h1 id="6253" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Additional Resources</h1><p id="098b" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">We have made audio features and labels available via <a class="af nr" href="https://zenodo.org/record/7025971" rel="noopener ugc nofollow" target="_blank">Zenodo</a>. There is also a <a class="af nr" href="https://github.com/biboamy/TVSM-dataset" rel="noopener ugc nofollow" target="_blank">GitHub repository</a> with the following audio tools:</p><ul class=""><li id="8f15" class="mt mu gr mv b mw mx my mz na nb nc nd ne qa ng nh ni qb nk nl nm qc no np nq qd qe qf bj">Python code for data pre-processing, including scripts for 5.1 downmixing, Mel spectrogram generation, MFCC generation, VGGish feature generation, and the PCEN implementation.</li><li id="2c37" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Python code for reproducing all experiments, including data loaders, model implementations, and training and evaluation pipelines.</li><li id="4e94" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Pre-trained models for each conducted experiment.</li><li id="9630" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">Prediction outputs for all audio in the evaluation datasets.</li></ul><p id="baae" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj"><em class="ov">Special thanks to the entire Audio Algorithms team, as well as </em><a class="af nr" href="https://www.linkedin.com/in/amirziai/" rel="noopener ugc nofollow"
target="_blank"><em class="ov">Amir Ziai</em></a><em class="ov">, </em><a class="af nr" href="https://www.linkedin.com/in/anna-pulido-61025063/" rel="noopener ugc nofollow" target="_blank"><em class="ov">Anna Pulido</em></a><em class="ov">, and </em><a class="af nr" href="https://www.linkedin.com/in/angiepollema1/" rel="noopener ugc nofollow" target="_blank"><em class="ov">Angie Pollema</em></a><em class="ov">.</em></p></div>]]></description>
      <link>https://netflixtechblog.com/detecting-speech-and-music-in-audio-content-afd64e6a5bf8</link>
      <guid>https://netflixtechblog.com/detecting-speech-and-music-in-audio-content-afd64e6a5bf8</guid>
      <pubDate>Tue, 14 Nov 2023 02:55:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[The Next Step in Personalization: Dynamic Sizzles]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="f5bc" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Authors:<a class="af nr" href="https://www.linkedin.com/in/bruce-wobbe-197395/" rel="noopener ugc nofollow" target="_blank">Bruce Wobbe</a>, <a class="af nr" href="https://www.linkedin.com/in/leticiak/" rel="noopener ugc nofollow" target="_blank">Leticia Kwok</a></p><p id="8e66" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Additional Credits:<a class="af nr" href="https://www.linkedin.com/in/sanford-holsapple-782a3a158/" rel="noopener ugc nofollow" target="_blank">Sanford Holsapple</a>, <a class="af nr" href="https://www.linkedin.com/in/eugene-lok-6465045b/" rel="noopener ugc nofollow" target="_blank">Eugene Lok</a>, <a class="af nr" href="https://www.linkedin.com/in/jeremy-kelly-526a30180/" rel="noopener ugc nofollow" target="_blank">Jeremy Kelly</a></p><h1 id="8f18" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Introduction</h1><p id="36d3" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">At Netflix, we strive to give our members an excellent personalized experience, helping them make the most successful and satisfying selections from our thousands of titles. 
We already personalize <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">artwork</a> and trailers, but we hadn’t yet personalized sizzle reels — until now.</p><p id="b650" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">A sizzle reel is a montage of video clips from different titles strung together into a seamless A/V asset that gets members excited about upcoming launches (for example, our Emmys nominations or holiday collections). Now Netflix can create a personalized sizzle reel dynamically in real time and on demand. The order of the clips and included titles are personalized per member, giving each a unique and effective experience. These new personalized reels are called <em class="ov">Dynamic Sizzles</em>.</p><p id="f344" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">In this post, we will dive into the exciting details of how we create Dynamic Sizzles with minimal human intervention, including the challenges we faced and the solutions we developed.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*OxP3aIMkQW6d-BjK%20640w,%20https://miro.medium.com/v2/resize:fit:720/0*OxP3aIMkQW6d-BjK%20720w,%20https://miro.medium.com/v2/resize:fit:750/0*OxP3aIMkQW6d-BjK%20750w,%20https://miro.medium.com/v2/resize:fit:786/0*OxP3aIMkQW6d-BjK%20786w,%20https://miro.medium.com/v2/resize:fit:828/0*OxP3aIMkQW6d-BjK%20828w,%20https://miro.medium.com/v2/resize:fit:1100/0*OxP3aIMkQW6d-BjK%201100w,%20https://miro.medium.com/v2/resize:fit:1400/0*OxP3aIMkQW6d-BjK%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and 
(max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*OxP3aIMkQW6d-BjK 640w, https://miro.medium.com/v2/resize:fit:720/0*OxP3aIMkQW6d-BjK 720w, https://miro.medium.com/v2/resize:fit:750/0*OxP3aIMkQW6d-BjK 750w, https://miro.medium.com/v2/resize:fit:786/0*OxP3aIMkQW6d-BjK 786w, https://miro.medium.com/v2/resize:fit:828/0*OxP3aIMkQW6d-BjK 828w, https://miro.medium.com/v2/resize:fit:1100/0*OxP3aIMkQW6d-BjK 1100w, https://miro.medium.com/v2/resize:fit:1400/0*OxP3aIMkQW6d-BjK 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="220c" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">An example of a Dynamic Sizzle created for Chuseok, the Korean mid-autumn harvest festival collection.</p><h1 id="5341" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Overview</h1><p id="7517" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">In the past, each sizzle reel was created manually. 
The time and cost of doing this prevented scaling and missed out on the invaluable benefit of personalization, which is a bedrock principle at Netflix. We wanted to figure out how to efficiently scale sizzle reel production while also incorporating personalization, all in an effort to yield greater engagement and enjoyment for our members.</p><p id="6817" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Enter the creation of Dynamic Sizzles. We developed a systems-based approach that uses our interactive and creative technology to programmatically stitch together multiple video clips alongside a synced audio track. The process involves compiling personalized multi-title/multi-talent promotional A/V assets on the fly from a <em class="ov">Mega Asset</em>. A Mega Asset is a large A/V asset made up of video clips from various titles, acting as a library from which the Dynamic Sizzle pulls media. These clips are then used to construct a personalized Dynamic Sizzle according to a predefined cadence.</p><p id="cc27" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">With Dynamic Sizzles, we can utilize more focused creative work from editors and generate a multitude of personalized sizzle reels efficiently and effectively, with up to 70% time and cost savings compared to a manually created reel.
This gives us the ability to create thousands, if not millions, of combinations of video clips and assets that result in optimized and personalized sizzle reel experiences for Netflix members.</p><h1 id="c2c8" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Creating the Mega Asset</h1><h2 id="1911" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Where To Begin</h2><p id="b08a" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Our first challenge was figuring out how to create the Mega Asset, as each video clip needs to be precise in its selection and positioning. A Mega Asset can contain any number of clips, and millions of unique Dynamic Sizzles can be produced from a single Mega Asset.</p><p id="2f31" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">We accomplished this by using human editors to select the clips — ensuring that they are well-defined from both a creative and technical standpoint — then laying them out in a specific known order in a timeline. We also need each clip marked with an index to its location — an extremely tedious and time consuming process for an editor. To solve this, we created an Adobe Premiere plug-in to automate the process. 
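As an illustration of the kind of structural check this marking enables, a sketch over ingested clip markers might look like the following; the tuple layout and the set of expected lengths are assumptions, not the real ingestion schema:

```python
def validate_layout(clips, expected_lengths=(160, 80, 40)):
    """Sanity-check clip markers ingested from a Mega Asset timeline.

    `clips` is a list of (title_id, start_frame, end_frame) tuples in
    timeline order; the tuple shape and length set are illustrative
    assumptions, not the real ingestion schema.
    """
    problems = []
    prev_end = 0
    for title_id, start, end in clips:
        if start < prev_end:
            problems.append(f"{title_id}: clip at frame {start} overlaps the previous clip")
        if end - start not in expected_lengths:
            problems.append(f"{title_id}: unexpected clip length of {end - start} frames")
        prev_end = end
    return problems
```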
Further verifications can also be done programmatically via ingestion of the timecode data, as we can validate the structure of the Mega Asset by looking at the timecodes.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox qa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*BGuPECM8yOcl1VyF%20640w,%20https://miro.medium.com/v2/resize:fit:720/0*BGuPECM8yOcl1VyF%20720w,%20https://miro.medium.com/v2/resize:fit:750/0*BGuPECM8yOcl1VyF%20750w,%20https://miro.medium.com/v2/resize:fit:786/0*BGuPECM8yOcl1VyF%20786w,%20https://miro.medium.com/v2/resize:fit:828/0*BGuPECM8yOcl1VyF%20828w,%20https://miro.medium.com/v2/resize:fit:1100/0*BGuPECM8yOcl1VyF%201100w,%20https://miro.medium.com/v2/resize:fit:1400/0*BGuPECM8yOcl1VyF%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*BGuPECM8yOcl1VyF 640w, https://miro.medium.com/v2/resize:fit:720/0*BGuPECM8yOcl1VyF 720w, https://miro.medium.com/v2/resize:fit:750/0*BGuPECM8yOcl1VyF 750w, https://miro.medium.com/v2/resize:fit:786/0*BGuPECM8yOcl1VyF 786w, https://miro.medium.com/v2/resize:fit:828/0*BGuPECM8yOcl1VyF 828w, https://miro.medium.com/v2/resize:fit:1100/0*BGuPECM8yOcl1VyF 1100w, https://miro.medium.com/v2/resize:fit:1400/0*BGuPECM8yOcl1VyF 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and 
(max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="acfd" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">An example of a title’s video clip layout.</p><p id="da6d" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The above layout shows how a single title’s clips are ordered in a Mega Asset, in three different lengths: 160, 80, and 40 frames. Each clip should be unique per title; however, clips from different titles may share the same length. This gives us more variety to choose from while maintaining a structured order in the layout.</p><h2 id="2b7c" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Cadence</h2><p id="8aff" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">The cadence is a predetermined collection of clip lengths that indicates when, where, and for how long a title shows within a Dynamic Sizzle. The cadence ensures that when a Dynamic Sizzle is played, it will show a balanced view of the titles chosen, while still giving more time to a member’s higher ranked titles.
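One simple way to think of a cadence is as plain data: an ordered list of (title rank, clip length in frames) slots. The slot values in this sketch are invented for illustration and do not reproduce any real cadence:

```python
# Hypothetical cadence: ordered (title_rank, clip_length_in_frames) slots,
# where rank 0 is the member's top-ranked title. Values are made up.
CADENCE = [
    (0, 80), (1, 80), (2, 40),   # segment 1
    (3, 80), (4, 40), (0, 160),  # segment 2
    (1, 40), (2, 80), (0, 80),   # segment 3
]

def screen_time_by_rank(cadence):
    """Total frames allotted per title rank under a given cadence."""
    totals = {}
    for rank, length in cadence:
        totals[rank] = totals.get(rank, 0) + length
    return totals
```

Encoding the cadence as data makes it easy to swap in personalized or randomized variants without touching the assembly logic.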
Cadence is something we can personalize or randomize, and will continue to evolve as needed.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div class="ow ox qb"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*u_Y4NqIlrgVCBPpDt2T6VQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*u_Y4NqIlrgVCBPpDt2T6VQ.png" /></picture></div><figcaption class="qc fc qd ow ox qe qf be b bf z dt">Sample Cadence</figcaption></figure><p id="7dcc" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">In the above sample cadence, Title A refers to the highest ranked title in a member’s personalized sort, Title B the second highest, and so on. The cadence is made up of 3 distinct segments with 5 chosen titles (A-E) played in sequence using various clip lengths. Each clip in the cadence refers to a different clip in the Mega Asset. For example, the 80 frame clip for title A in the first (red) segment is different from the 80 frame clip for title A in the third (purple) segment.</p><h1 id="7ac0" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Composing the Dynamic Sizzle</h1><h2 id="0ba6" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Personalization</h2><p id="b3d0" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">When a request comes in for a sizzle reel, our system determines what titles are in the Mega Asset and based on the request, a personalized list of titles is created and sorted. The top titles for a member are then used to construct the Dynamic Sizzle by leveraging the clips in the Mega Asset. 
Higher ranked titles get more weight in placement and allotted time.</p><h2 id="68fe" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Finding Timecodes</h2><p id="cf00" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">For the Dynamic Sizzle process, we have to quickly and dynamically determine the timecodes for each clip in the Mega Asset and make sure they are easily accessed at runtime. We accomplish this by utilizing Netflix’s <a class="af nr" href="https://hollow.how/" rel="noopener ugc nofollow" target="_blank">Hollow technology</a>. Hollow allows us to store timecodes for quick searches and use timecodes as a map: a key can be used to find the timecodes needed as defined by the cadence. The key can be as simple as <em class="ov">titleId-clip-1.</em></p><h2 id="22e9" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Building The Reel</h2><p id="4e75" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">The ordering of the clips is set by the predefined cadence, which dictates the final layout and makes the Dynamic Sizzle easy to build. For example, if the system knows to use title 17 within the Mega Asset, we can easily calculate the time offset for all of its clips because of the known ordering of the titles and clips within the Mega Asset.
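That offset arithmetic can be sketched as follows; the clip lengths, frame rate, and zero-based indices are illustrative assumptions rather than production values:

```python
def clip_start_seconds(title_index, clip_index,
                       clip_lengths=(160, 80, 40), fps=24.0):
    """Start offset of a clip inside the Mega Asset, in seconds.

    Assumes every title contributes the same back-to-back sequence of
    clip lengths; the lengths, frame rate, and zero-based indices are
    illustrative assumptions, not production values.
    """
    frames_per_title = sum(clip_lengths)
    start_frame = title_index * frames_per_title + sum(clip_lengths[:clip_index])
    return start_frame / fps
```

The same fixed layout is what lets a simple key such as titleId-clip-1 resolve directly to a clip's timecodes.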
This all comes together in the following way:</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox qa"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*LP9mC6GLEnKsw-Ih%20640w,%20https://miro.medium.com/v2/resize:fit:720/0*LP9mC6GLEnKsw-Ih%20720w,%20https://miro.medium.com/v2/resize:fit:750/0*LP9mC6GLEnKsw-Ih%20750w,%20https://miro.medium.com/v2/resize:fit:786/0*LP9mC6GLEnKsw-Ih%20786w,%20https://miro.medium.com/v2/resize:fit:828/0*LP9mC6GLEnKsw-Ih%20828w,%20https://miro.medium.com/v2/resize:fit:1100/0*LP9mC6GLEnKsw-Ih%201100w,%20https://miro.medium.com/v2/resize:fit:1400/0*LP9mC6GLEnKsw-Ih%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*LP9mC6GLEnKsw-Ih 640w, https://miro.medium.com/v2/resize:fit:720/0*LP9mC6GLEnKsw-Ih 720w, https://miro.medium.com/v2/resize:fit:750/0*LP9mC6GLEnKsw-Ih 750w, https://miro.medium.com/v2/resize:fit:786/0*LP9mC6GLEnKsw-Ih 786w, https://miro.medium.com/v2/resize:fit:828/0*LP9mC6GLEnKsw-Ih 828w, https://miro.medium.com/v2/resize:fit:1100/0*LP9mC6GLEnKsw-Ih 1100w, https://miro.medium.com/v2/resize:fit:1400/0*LP9mC6GLEnKsw-Ih 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 
700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div></figure><p id="75f9" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The result is a series of timecodes indicating the start and stop times for each clip. These codes appear in the order they should be played and the player uses them to construct a seamless video experience as seen in the examples below:</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div class="ow ox qg"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*ksESuKPELSPkUCwQSDZXYw.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*ksESuKPELSPkUCwQSDZXYw.gif" /></picture></div><figcaption class="qc fc qd ow ox qe qf be b bf z dt">The Beautiful Game Dynamic Sizzle</figcaption></figure><p id="28df" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">With Dynamic Sizzles, each member experiences a personalized sizzle reel.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div class="ow ox qh"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*mCPorSK246DfHuRmKaxCgQ.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*mCPorSK246DfHuRmKaxCgQ.gif" /></picture></div><figcaption class="qc fc qd ow ox qe qf be b bf z dt">Example of what 2 different profiles might see for the same sizzle</figcaption></figure><h1 id="e99c" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Playing the Dynamic Sizzle</h1><h2 id="d596" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Delivering To The Player</h2><p id="afe4" class="pw-post-body-paragraph mt mu gr mv b 
mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">The player leverages the Mega Asset by using timecodes to know where to start and stop each clip, and then seamlessly plays each one right after the other. This required a change in the API that devices normally use to get trailers. The API change was twofold. First, on the request we need the device to indicate that it can support Dynamic Sizzles. Second, on the response the timecode list needs to be sent. (Changing the API and rolling it out took time, so this all had to be implemented before Dynamic Sizzles could actually be used, tested, and productized.)</p><h2 id="0957" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Challenges With The Player</h2><p id="8428" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">There were two main challenges with the player. First, in order to support features like background music across multiple unique video segments, we needed to support asymmetrical segment streaming from discontiguous locations in the Mega Asset. This involved modifying existing schemas and adding corresponding support to the player to allow for the stitching of the video and audio together separately while still keeping the timecodes in sync. Second, we needed to optimize our streaming algorithms to account for these much shorter segments, as some of our previous assumptions were incorrect when dealing with dozens of discontiguous tiny segments in the asset.</p><h2 id="db3f" class="pl nt gr be nu pm pn dx ny po pp dz oc ne pq pr ps ni pt pu pv nm pw px py pz bj">Building Great Things Together</h2><p id="e1d3" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">We are just getting started on this journey to build truly great experiences. While the challenges may seem endless, the work is incredibly fulfilling. 
At the core of bringing these great engineering solutions to life is the direct collaboration we have with our colleagues, innovating together to solve these challenges.</p><p id="cac4" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">If you are interested in working on great technology like Dynamic Sizzles, we’d love to talk to you! We are hiring: <a class="af nr" href="https://jobs.netflix.com/" rel="noopener ugc nofollow" target="_blank">jobs.netflix.com</a></p></div>]]></description>
      <link>https://netflixtechblog.com/the-next-step-in-personalization-dynamic-sizzles-4dc4ce2011ef</link>
      <guid>https://netflixtechblog.com/the-next-step-in-personalization-dynamic-sizzles-4dc4ce2011ef</guid>
      <pubDate>Wed, 08 Nov 2023 21:56:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building In-Video Search]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="19bc" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj"><a class="af nr" href="https://www.linkedin.com/in/boris-chen-b921a214/" rel="noopener ugc nofollow" target="_blank">Boris Chen</a>, <a class="af nr" href="https://www.linkedin.com/in/benjamin-klein-usa/" rel="noopener ugc nofollow" target="_blank">Ben Klein</a>, <a class="af nr" href="https://www.linkedin.com/in/jasonge27/" rel="noopener ugc nofollow" target="_blank">Jason Ge</a>, <a class="af nr" href="https://www.linkedin.com/in/avneesh/" rel="noopener ugc nofollow" target="_blank">Avneesh Saluja</a>, <a class="af nr" href="https://www.linkedin.com/in/gurutahasildar/" rel="noopener ugc nofollow" target="_blank">Guru Tahasildar</a>, <a class="af nr" href="https://www.linkedin.com/in/abhisheks0ni/" rel="noopener ugc nofollow" target="_blank">Abhishek Soni</a>, <a class="af nr" href="https://www.linkedin.com/in/jivimberg/" rel="noopener ugc nofollow" target="_blank">Juan Vimberg</a>, <a class="af nr" href="https://www.linkedin.com/in/ellchow/" rel="noopener ugc nofollow" target="_blank">Elliot Chow</a>, <a class="af nr" href="https://www.linkedin.com/in/amirziai/" rel="noopener ugc nofollow" target="_blank">Amir Ziai</a>, <a class="af nr" href="https://www.linkedin.com/in/varun-sekhri-087a213/" rel="noopener ugc nofollow" target="_blank">Varun Sekhri</a>, <a class="af nr" href="https://www.linkedin.com/in/santiagocastroserra/" rel="noopener ugc nofollow" target="_blank">Santiago Castro</a>, <a class="af nr" href="https://www.linkedin.com/in/keilafong/" rel="noopener ugc nofollow" target="_blank">Keila Fong</a>, <a class="af nr" href="https://www.linkedin.com/in/kelli-griggs-32990125/" rel="noopener ugc nofollow" target="_blank">Kelli Griggs</a>, <a class="af nr" href="https://www.linkedin.com/in/mallia-sherzai-8a92862/" rel="noopener ugc nofollow" target="_blank">Mallia 
Sherzai</a>, <a class="af nr" href="https://www.linkedin.com/in/mayerr/" rel="noopener ugc nofollow" target="_blank">Robert Mayer</a>, <a class="af nr" href="https://www.linkedin.com/in/yaoandy/" rel="noopener ugc nofollow" target="_blank">Andy Yao</a>, <a class="af nr" href="https://www.linkedin.com/in/vi-pallavika-iyengar-144abb1b/" rel="noopener ugc nofollow" target="_blank">Vi Iyengar</a>, <a class="af nr" href="https://www.linkedin.com/in/peachpie/" rel="noopener ugc nofollow" target="_blank">Jonathan Solorzano-Hamilton</a>, <a class="af nr" href="https://www.linkedin.com/in/mhtaghavi/" rel="noopener ugc nofollow" target="_blank">Hossein Taghavi</a>, <a class="af nr" href="https://www.linkedin.com/in/ritwik-kumar/" rel="noopener ugc nofollow" target="_blank">Ritwik Kumar</a></p><h1 id="7a74" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Introduction</h1><p id="399c" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Today we’re going to take a look at the behind the scenes technology behind how Netflix creates great trailers, Instagram reels, video shorts and other promotional videos.</p><figure class="ov ow ox oy oz pa"><div class="pb jh l fg"></div></figure><p id="7f7c" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Suppose you’re trying to create the trailer for the action thriller <em class="pe">The Gray Man</em>, and you know you want to use a shot of a car exploding. 
You don’t know if that shot exists or where it is in the film, and you have to look for it by scrubbing through the whole film.</p><figure class="ov ow ox oy oz pa pf pg paragraph-image"><div role="button" tabindex="0" class="pi pj fg pk bg pl"><div class="pf pg ph"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*32RfnKGMENXaqEX8%20640w,%20https://miro.medium.com/v2/resize:fit:720/0*32RfnKGMENXaqEX8%20720w,%20https://miro.medium.com/v2/resize:fit:750/0*32RfnKGMENXaqEX8%20750w,%20https://miro.medium.com/v2/resize:fit:786/0*32RfnKGMENXaqEX8%20786w,%20https://miro.medium.com/v2/resize:fit:828/0*32RfnKGMENXaqEX8%20828w,%20https://miro.medium.com/v2/resize:fit:1100/0*32RfnKGMENXaqEX8%201100w,%20https://miro.medium.com/v2/resize:fit:1400/0*32RfnKGMENXaqEX8%201400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*32RfnKGMENXaqEX8 640w, https://miro.medium.com/v2/resize:fit:720/0*32RfnKGMENXaqEX8 720w, https://miro.medium.com/v2/resize:fit:750/0*32RfnKGMENXaqEX8 750w, https://miro.medium.com/v2/resize:fit:786/0*32RfnKGMENXaqEX8 786w, https://miro.medium.com/v2/resize:fit:828/0*32RfnKGMENXaqEX8 828w, https://miro.medium.com/v2/resize:fit:1100/0*32RfnKGMENXaqEX8 1100w, https://miro.medium.com/v2/resize:fit:1400/0*32RfnKGMENXaqEX8 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw,
(-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="po fc pp pf pg pq pr be b bf z dt">Exploding cars — <a class="af nr" href="https://www.netflix.com/title/81160697" rel="noopener ugc nofollow" target="_blank">The Gray Man</a> (2022)</figcaption></figure><p id="7229" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Or suppose it’s Christmas, and you want to create a great Instagram piece out of all the best scenes across Netflix films of people shouting “Merry Christmas”! Or suppose it’s Anya Taylor-Joy’s birthday, and you want to create a highlight reel of all her most iconic and dramatic shots.</p><p id="c2cc" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Making these comes down to searching through hundreds of thousands of movies and TV shows to find the right line of dialogue or the right visual elements (objects, scenes, emotions, actions, etc.). We have built an internal system that allows someone to perform in-video search across the entire Netflix video catalog, and we’d like to share our experience in building this system.</p><h1 id="24c7" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Building in-video search</h1><p id="ad31" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">To build such a visual search engine, we needed a machine learning system that can understand visual elements. 
Our early attempts included object detection, but we found that general labels were both too limiting and not specific enough. Every show has special objects that are important (e.g. the Demogorgon in Stranger Things) that don’t translate to other shows. The same was true for action recognition and other common image and video tasks.</p><h2 id="f020" class="ps nt gr be nu pt pu dx ny pv pw dz oc ne px py pz ni qa qb qc nm qd qe qf qg bj">The Approach</h2><p id="4c48" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">We found that contrastive learning between images and text pairs works well for our goals because these models are able to learn joint embedding spaces between the two modalities. This approach is also able to learn about objects, scenes, emotions, actions, and more in a single model. We also found that extending contrastive learning to videos and text provided a substantial improvement over frame-level models.</p><p id="e93a" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">In order to train the model on internal training data (video clips with aligned text descriptions), we implemented a scalable version on <a class="af nr" href="https://docs.ray.io/en/latest/train/train.html" rel="noopener ugc nofollow" target="_blank">Ray Train</a> and switched to a <a class="af nr" href="https://github.com/dmlc/decord" rel="noopener ugc nofollow" target="_blank">more performant video decoding library</a>. 
Lastly, the embeddings from the video encoder exhibit strong zero or few-shot performance on multiple video and content understanding tasks at Netflix and are used as a starting point in those applications.</p><p id="bdb4" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The recent success of large-scale models that jointly train image and text embeddings has enabled new use cases around multimodal retrieval. These models are trained on large amounts of image-caption pairs via in-batch contrastive learning. For a (large) batch of <code class="cw qh qi qj qk b">N</code> examples, we wish to maximize the embedding (cosine) similarity of the <code class="cw qh qi qj qk b">N</code> correct image-text pairs, while minimizing the similarity of the other <code class="cw qh qi qj qk b">N²-N</code> paired embeddings. This is done by treating the similarities as logits and minimizing the symmetric cross-entropy loss, which gives equal weighting to the two settings (treating the captions as labels to the images and vice versa).</p><p id="a928" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Consider the following two images and captions:</p><figure class="ov ow ox oy oz pa pf pg paragraph-image"><div role="button" tabindex="0" class="pi pj fg pk bg pl"><div class="pf pg ql"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*cK7_XVOisFc-il4Y7jLK9Q.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*cK7_XVOisFc-il4Y7jLK9Q.png" /></picture></div></div><figcaption class="po fc pp pf pg pq pr be b bf z dt">Images are from <a class="af nr" href="https://www.netflix.com/title/81458416" rel="noopener ugc nofollow" target="_blank">Glass Onion: A Knives Out Mystery</a> (2022)</figcaption></figure><p id="97b1" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni 
nj nk nl nm nn no np nq gk bj">Once properly trained, the embeddings for the corresponding images and text (i.e. captions) will be close to each other and farther away from unrelated pairs.</p><figure class="ov ow ox oy oz pa pf pg paragraph-image"><div role="button" tabindex="0" class="pi pj fg pk bg pl"><div class="pf pg qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*hsbG_08nSVKyZlfIsfygLQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*hsbG_08nSVKyZlfIsfygLQ.png" /></picture></div></div><figcaption class="po fc pp pf pg pq pr be b bf z dt">Typically embedding spaces are hundred/thousand dimensional.</figcaption></figure><p id="bfa7" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">At query time, the input text query can be mapped into this embedding space, and we can return the closest matching images.</p><figure class="ov ow ox oy oz pa pf pg paragraph-image"><div role="button" tabindex="0" class="pi pj fg pk bg pl"><div class="pf pg qn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*CFXVklrWl4kerhtfxuMTxg.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*CFXVklrWl4kerhtfxuMTxg.png" /></picture></div></div><figcaption class="po fc pp pf pg pq pr be b bf z dt">The query may have not existed in the training set. <a class="af nr" href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="noopener ugc nofollow" target="_blank">Cosine similarity</a> can be used as a similarity measure.</figcaption></figure><p id="de24" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">While these models are trained on image-text pairs, we have found that they are an excellent starting point to learning representations of video units like shots and scenes. 
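The query-time lookup described above (embed the text query, then rank images by cosine similarity) can be sketched with toy vectors; the embeddings and shot names below are made up for illustration, not actual model outputs:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_images(query_embedding, image_embeddings, top_k=1):
    # Rank precomputed image embeddings by similarity to the query embedding.
    ranked = sorted(
        image_embeddings.items(),
        key=lambda kv: cosine_similarity(query_embedding, kv[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# Toy 3-d embeddings standing in for the joint text/image embedding space.
images = {
    "car_explosion_shot": [0.9, 0.1, 0.0],
    "beach_sunset_shot": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of "a car exploding"
print(nearest_images(query, images))  # ['car_explosion_shot']
```

In production this brute-force scan is replaced by an approximate nearest neighbor index, but the ranking criterion is the same.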
As videos are a sequence of images (frames), additional parameters may need to be introduced to compute embeddings for these video units, although we have found that for shorter units like shots, an unparameterized aggregation like averaging (mean-pooling) can be more effective. To train these parameters as well as fine-tune the pretrained image-text model weights, we leverage in-house datasets that pair shots of varying durations with rich textual descriptions of their content. This additional adaptation step improves performance by 15–25% on video retrieval tasks (given a text prompt), depending on the starting model used and metric evaluated.</p><p id="cf04" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">On top of video retrieval, there are a wide variety of video clip classifiers within Netflix that are trained specifically to find a particular attribute (e.g. closeup shots, caution elements). Instead of training from scratch, we have found that using the shot-level embeddings can give us a significant head start, even beyond the baseline image-text models that they were built on top of.</p><p id="8fed" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Lastly, shot embeddings can also be used for video-to-video search, a particularly useful application in the context of trailer and promotional asset creation.</p><h1 id="a316" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Engineering and Infrastructure</h1><p id="02b3" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Our trained model gives us a text encoder and a video encoder. 
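The unparameterized mean-pooling aggregation mentioned above for short units like shots can be sketched as follows (toy frame embeddings, not actual encoder outputs):

```python
import math

def mean_pool(frame_embeddings):
    # Average per-frame embeddings into one shot-level embedding, then
    # L2-normalize so cosine similarity reduces to a dot product.
    dim = len(frame_embeddings[0])
    n = len(frame_embeddings)
    pooled = [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

frames = [[1.0, 0.0], [0.0, 1.0]]  # two toy 2-d frame embeddings
shot = mean_pool(frames)           # a single unit-length shot embedding
```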
Video embeddings are precomputed on the shot level, stored in our <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243">media feature store</a>, and replicated to an Elasticsearch cluster for real-time nearest neighbor queries. Our media feature management system automatically triggers the video embedding computation whenever new video assets are added, ensuring that we can search through the latest video assets.</p><p id="190b" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The embedding computation is based on a large neural network model and has to be run on GPUs for optimal throughput. However, shot segmentation from a full-length movie is CPU-intensive. To fully utilize the GPUs in the cloud environment, we first run shot segmentation in parallel on multi-core CPU machines and store the resulting shots in S3 object storage, encoded in video formats such as mp4. During GPU computation, we stream mp4 video shots from S3 directly to the GPUs using a data loader that performs prefetching and preprocessing. This approach ensures that the GPUs are efficiently utilized during inference, thereby increasing the overall throughput and cost-efficiency of our system.</p><p id="a18a" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">At query time, a user submits a text string representing what they want to search for. For visual search queries, we use the text encoder from the trained model to extract a text embedding, which is then used to perform the appropriate nearest neighbor search. 
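The prefetching data loader in the pipeline above can be sketched with stdlib threads and a bounded queue; fetch_shot here is a hypothetical placeholder for the S3 download and decode step:

```python
import queue
import threading

def prefetching_loader(shot_keys, fetch_shot, buffer_size=4):
    # Fetch and preprocess shots on a background thread so the GPU consumer
    # never waits on I/O; the bounded queue provides backpressure.
    buf = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def producer():
        for key in shot_keys:
            buf.put(fetch_shot(key))  # e.g. download the mp4 shot from S3 and decode it
        buf.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is SENTINEL:
            break
        yield item

# Usage with a fake fetcher standing in for S3 + decoding:
batches = list(prefetching_loader(["s1", "s2", "s3"], fetch_shot=lambda k: f"decoded:{k}"))
print(batches)  # ['decoded:s1', 'decoded:s2', 'decoded:s3']
```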
Users can also select a subset of shows to search over, or perform a catalog-wide search.</p><p id="c3ee" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">If you’re interested in more details, see our other post covering the <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/building-a-media-understanding-platform-for-ml-innovations-9bef9962dcb7">Media Understanding Platform</a>.</p><h1 id="5542" class="ns nt gr be nu nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op bj">Conclusion</h1><p id="cc70" class="pw-post-body-paragraph mt mu gr mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gk bj">Finding a needle in a haystack is hard. We learned from talking to video creatives who make trailers and social media videos that being able to find those needles was a key need and a big pain point. The solution we described has been fruitful, works well in practice, and is relatively simple to maintain. Our search system allows our creatives to iterate faster, try more ideas, and make more engaging videos for our viewers to enjoy.</p><p id="4375" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">We hope this post has been interesting to you. If you are interested in working on problems like this, Netflix is always <a class="af nr" href="https://jobs.netflix.com/" rel="noopener ugc nofollow" target="_blank">hiring</a> great researchers, engineers and creators.</p></div>]]></description>
      <link>https://netflixtechblog.com/building-in-video-search-936766f0017c</link>
      <guid>https://netflixtechblog.com/building-in-video-search-936766f0017c</guid>
      <pubDate>Mon, 06 Nov 2023 18:35:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Streaming SQL in Data Mesh]]></title>
      <description><![CDATA[<div class="ab ca"><div class="ch bg fw fx fy fz"><div><div class="hs ht hu hv hw"></div><p id="0bdf" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Democratizing Stream Processing @ Netflix</p><p id="38bb" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">By <a class="af nr" href="https://www.linkedin.com/in/guilhermesmi/" rel="noopener ugc nofollow" target="_blank"><em class="ns">Guil Pires</em></a><em class="ns">, </em><a class="af nr" href="https://www.linkedin.com/in/markcho/" rel="noopener ugc nofollow" target="_blank"><em class="ns">Mark Cho</em></a><em class="ns">, </em><a class="af nr" href="https://www.linkedin.com/in/liuml07/" rel="noopener ugc nofollow" target="_blank"><em class="ns">Mingliang Liu</em></a><em class="ns">, </em><a class="af nr" href="https://www.linkedin.com/in/sujayjain/" rel="noopener ugc nofollow" target="_blank"><em class="ns">Sujay Jain</em></a></p><p id="6406" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Data powers much of what we do at Netflix. On the Data Platform team, we build the infrastructure used across the company to process data at scale.</p><p id="5a59" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">In our last blog post, we introduced <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873">“Data Mesh” — A Data Movement and Processing Platform</a>. When a user wants to leverage Data Mesh to move and transform data, they start by creating a new Data Mesh pipeline. The pipeline is composed of individual “Processors” that are connected by Kafka topics. 
The Processors themselves are implemented as Flink jobs that use the DataStream API.</p><p id="9707" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Since then, we have seen many use cases (including <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-content-engineering-makes-a-federated-graph-searchable-5c0c1c7d7eaf">Netflix Graph Search</a>) adopt Data Mesh for stream processing. We were able to onboard many of these use cases by offering some commonly used Processors out of the box, such as <em class="ns">Projection</em>, <em class="ns">Filtering</em>, <em class="ns">Unioning</em>, and <em class="ns">Field Renaming</em>.</p></div></div><div class="nt"><div class="ab ca"><div class="nu nv nw nx ny nz ce oa cf ob ch bg"><figure class="of og oh oi oj nt ok ol paragraph-image"><div role="button" tabindex="0" class="om on fg oo bg op"><div class="oc od oe"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*VxZlXPDb8-d7Xf4kfSulnw.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*VxZlXPDb8-d7Xf4kfSulnw.png" /></picture></div></div><figcaption class="os fc ot oc od ou ov be b bf z dt">An example of a Data Mesh pipeline which moves and transforms data using Union, GraphQL Enrichment, and Column Rename Processor before writing to an Iceberg table.</figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="06e4" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Keeping the logic of individual Processors simple allowed them to be reusable, so we could centrally manage and operate them at scale. 
It also allowed them to be composable, so users could combine the different Processors to express the logic they needed.</p><p id="72ea" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">However, this design decision led to a different set of challenges.</p><p id="d954" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Some teams found the provided building blocks were not expressive enough. For use cases which were not solvable using existing Processors, users had to express their business logic by building a custom Processor. To do this, they had to use the low-level DataStream API from Flink and the Data Mesh SDK, which came with a steep learning curve. After it was built, they also had to operate the custom Processors themselves.</p><p id="7477" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Furthermore, many pipelines needed to be composed of multiple Processors. Since each Processor was implemented as a Flink Job connected by Kafka topics, it meant there was a relatively high runtime overhead cost for many pipelines.</p><p id="9037" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">We explored various options to solve these challenges, and eventually landed on building the Data Mesh SQL Processor that would provide additional flexibility for expressing users’ business logic.</p><p id="53bd" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The existing Data Mesh Processors have a lot of overlap with SQL. For example, filtering and projection can be expressed in SQL through <strong class="mv gs"><em class="ns">SELECT</em></strong> and <strong class="mv gs"><em class="ns">WHERE</em></strong> clauses. 
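As a toy illustration of that overlap, here is the same filter-and-project logic written once as composed Processors (plain Python record transforms, not the actual Data Mesh SDK) and once as a single SQL statement:

```python
# Two "Processors" modeled as simple record transforms.
def filtering(records, predicate):
    return [r for r in records if predicate(r)]

def projection(records, fields):
    return [{f: r[f] for f in fields} for r in records]

events = [
    {"title": "Glass Onion", "country": "US", "plays": 10},
    {"title": "The Gray Man", "country": "BR", "plays": 7},
]

# Composing Processors: in Data Mesh each hop would be a separate
# Flink job connected by a Kafka topic.
out = projection(filtering(events, lambda r: r["country"] == "US"), ["title", "plays"])

# The equivalent logic expressed as one SQL query (hypothetical table name):
sql = "SELECT title, plays FROM events WHERE country = 'US'"
print(out)  # [{'title': 'Glass Onion', 'plays': 10}]
```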
Additionally, instead of implementing business logic by composing multiple individual Processors together, users could express their logic in a single SQL query, avoiding the additional resource and latency overhead that came from multiple Flink jobs and Kafka topics. Furthermore, SQL can support User Defined Functions (UDFs) and custom connectors for <em class="ns">lookup</em> <em class="ns">joins</em>, which can be used to extend expressiveness.</p><h1 id="9f1d" class="ow ox gr be oy oz pa pb pc pd pe pf pg ph pi pj pk pl pm pn po pp pq pr ps pt bj">Data Mesh SQL Processor</h1><p id="7a38" class="pw-post-body-paragraph mt mu gr mv b mw pu my mz na pv nc nd ne pw ng nh ni px nk nl nm py no np nq gk bj">Since Data Mesh Processors are built on top of Flink, it made sense to consider using Flink SQL instead of continuing to build additional Processors for every transform operation we needed to support.</p><p id="c1e3" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The Data Mesh SQL Processor is a platform-managed, parameterized Flink Job that takes schematized sources and a Flink SQL query that will be executed against those sources. By leveraging Flink SQL within a Data Mesh Processor, we were able to support the streaming SQL functionality without changing the architecture of Data Mesh.</p><p id="3e54" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Under the hood, the Data Mesh SQL Processor is implemented using Flink’s Table API, which provides a powerful abstraction to convert between DataStreams and Dynamic Tables. Based on the sources that the processor is connected to, the SQL Processor will automatically register the upstream sources as tables within Flink’s SQL engine. 
The user’s query is then registered with the SQL engine and translated into a Flink job graph consisting of physical operators that can be executed on a Flink cluster. Unlike with the low-level DataStream API, users do not have to manually build a job graph using low-level operators, as this is all managed by Flink’s SQL engine.</p><h1 id="0357" class="ow ox gr be oy oz pa pb pc pd pe pf pg ph pi pj pk pl pm pn po pp pq pr ps pt bj">SQL Experience on Data Mesh</h1><p id="b7f0" class="pw-post-body-paragraph mt mu gr mv b mw pu my mz na pv nc nd ne pw ng nh ni px nk nl nm py no np nq gk bj">The SQL Processor enables users to fully leverage the capabilities of the Data Mesh platform. This includes features such as autoscaling, the ability to manage pipelines declaratively via Infrastructure as Code, and a rich connector ecosystem.</p><p id="fe55" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">In order to ensure a seamless user experience, we’ve enhanced the Data Mesh platform with SQL-centric features. 
These enhancements include an Interactive Query Mode, real-time query validation, and automated schema inference.</p><p id="b064" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">To understand how these features help users be more productive, let’s take a look at a typical user workflow when using the Data Mesh SQL Processor.</p><ul class=""><li id="d372" class="mt mu gr mv b mw mx my mz na nb nc nd ne pz ng nh ni qa nk nl nm qb no np nq qc qd qe bj">Users start their journey by live sampling their upstream data sources using the Interactive Query Mode.</li><li id="88d9" class="mt mu gr mv b mw qf my mz na qg nc nd ne qh ng nh ni qi nk nl nm qj no np nq qc qd qe bj">As the user iterates on their SQL query, the query validation service provides real-time feedback about the query.</li><li id="9899" class="mt mu gr mv b mw qf my mz na qg nc nd ne qh ng nh ni qi nk nl nm qj no np nq qc qd qe bj">With a valid query, users can leverage the Interactive Query Mode again to execute the query and get the live results streamed back to the UI within seconds.</li><li id="5d24" class="mt mu gr mv b mw qf my mz na qg nc nd ne qh ng nh ni qi nk nl nm qj no np nq qc qd qe bj">For more efficient schema management and evolution, the platform will automatically infer the output schema based on the fields selected by the SQL query.</li><li id="0206" class="mt mu gr mv b mw qf my mz na qg nc nd ne qh ng nh ni qi nk nl nm qj no np nq qc qd qe bj">Once the user is done editing their query, it is saved to the Data Mesh Pipeline, which will then be deployed as a long-running streaming SQL job.</li></ul></div></div><div class="nt"><div class="ab ca"><div class="nu nv nw nx ny nz ce oa cf ob ch bg"><figure class="of og oh oi oj nt ok ol paragraph-image"><div role="button" tabindex="0" class="om on fg oo bg op"><div class="oc od qk"><picture><img 
src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*qAuPr360NMTueuACAZ4LeA@2x.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*qAuPr360NMTueuACAZ4LeA@2x.png" /></picture></div></div><figcaption class="os fc ot oc od ou ov be b bf z dt"><em class="ql">Overview of the SQL Processor workflow.</em></figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><p id="4de9" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Users typically iterate on their SQL query multiple times before deploying it. Validating and analyzing queries at runtime after deployment will not only slow down their iteration, but also make it difficult to automate schema evolution in Data Mesh.</p><p id="a29e" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">To address this challenge, we have implemented a query validation service that can verify a Flink SQL query and provide a meaningful error message for violations in real time. This enables users to have prompt validation feedback while they are editing the query. We leverage Apache Flink’s internal Planner classes to parse and transform SQL queries without creating a fully-fledged streaming table environment. This makes the query service lightweight, scalable, and execution agnostic.</p><p id="73dd" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">To effectively operate thousands of use cases at the platform layer, we built opinionated guardrails to limit some functionalities of Flink SQL. We plan on gradually expanding the supported capabilities over time. We implemented the guardrails by recursively inspecting the Calcite tree constructed from user’s query. If the tree contains nodes that we currently don’t support, the query will be rejected from being deployed. 
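The recursive tree inspection behind these guardrails can be sketched as follows (a toy node structure standing in for Calcite's relational tree, not its actual classes):

```python
class Node:
    # Minimal stand-in for a relational-algebra tree node.
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

# Hypothetical whitelist of operators the platform currently supports.
SUPPORTED = {"Project", "Filter", "TableScan", "Union"}

def validate(node):
    # Recursively walk the tree and reject any unsupported operator.
    if node.kind not in SUPPORTED:
        raise ValueError(f"Unsupported operation: {node.kind}")
    for child in node.children:
        validate(child)

ok = Node("Project", [Node("Filter", [Node("TableScan")])])
validate(ok)  # passes silently

bad = Node("Project", [Node("TemporalJoin", [Node("TableScan")])])
try:
    validate(bad)
except ValueError as e:
    print(e)  # Unsupported operation: TemporalJoin
```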
Additionally, we translate Flink’s internal exceptions containing cryptic error messages into more meaningful error messages for our users. We plan on continuing our investments into improving the guardrails, as having proper guardrails helps to improve the user experience. Some ideas for the future include rules to reject expensive and suboptimal queries.</p><p id="44df" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">To help Data Mesh users iterate quickly on their business logic, we have built the Interactive Query Mode as part of the platform. Users can start live sampling their streaming data by executing a simple <strong class="mv gs"><em class="ns">SELECT * FROM &lt;table&gt;</em></strong> query. Using the Interactive Query Mode, the Data Mesh platform will execute the Flink SQL query and display the results in the UI in seconds. Since this is a Flink SQL query on streaming data, new results will continue to be delivered to the user in real time.</p><p id="2472" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Users can continue to iterate and modify their Flink SQL query, and once they’re satisfied with their query output, they can save the query as part of their stream processing pipeline.</p><p id="2e82" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">To provide this interactive experience, we maintain an always-running Flink Session Cluster that can run concurrent parameterized queries. 
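Live sampling an unbounded stream can be pictured as taking just the first few rows as they arrive (a toy sketch, not the actual Mantis-backed implementation):

```python
import itertools

def unbounded_stream():
    # Stand-in for an unbounded streaming source (e.g. a Kafka topic).
    n = 0
    while True:
        yield {"event_id": n}
        n += 1

def live_sample(stream, limit=5):
    # Interactive mode: send the first few results back to the UI
    # instead of waiting for the (never-ending) query to finish.
    return list(itertools.islice(stream, limit))

rows = live_sample(unbounded_stream(), limit=3)
print(rows)  # [{'event_id': 0}, {'event_id': 1}, {'event_id': 2}]
```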
These queries will output their data to a <a class="af nr" href="https://netflix.github.io/mantis/" rel="noopener ugc nofollow" target="_blank">Mantis</a> sink in order to stream the results back to the user’s browser.</p></div></div><div class="nt"><div class="ab ca"><div class="nu nv nw nx ny nz ce oa cf ob ch bg"><figure class="of og oh oi oj nt ok ol paragraph-image"><div role="button" tabindex="0" class="om on fg oo bg op"><div class="oc od qm"><picture><img src="https://miro.medium.com/v2/resize:fit:640/1*qpjyUAt8HOAdP3JdzV6v5g.gif" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*qpjyUAt8HOAdP3JdzV6v5g.gif" /></picture></div></div><figcaption class="os fc ot oc od ou ov be b bf z dt"><em class="ql">Interactive Query mode in action</em></figcaption></figure></div></div></div><div class="ab ca"><div class="ch bg fw fx fy fz"><h1 id="77f5" class="ow ox gr be oy oz pa pb pc pd pe pf pg ph pi pj pk pl pm pn po pp pq pr ps pt bj">Learnings from our journey</h1><p id="7fa3" class="pw-post-body-paragraph mt mu gr mv b mw pu my mz na pv nc nd ne pw ng nh ni px nk nl nm py no np nq gk bj">In hindsight, we wish we had invested in enabling Flink SQL on the Data Mesh platform much earlier. If we had the Data Mesh SQL Processor earlier, we would’ve been able to avoid spending engineering resources to build smaller building blocks such as the Union Processor, Column Rename Processor, Projection and Filtering Processor.</p><p id="cd6f" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Since we productionized the Data Mesh SQL Processor, we’ve seen excitement and quick adoption from our Data Mesh users. 
Thanks to the flexibility of Flink SQL, users have a new way to express their streaming transformation logic other than writing a custom processor using the low-level DataStream API.</p><p id="7322" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">While Flink SQL is a powerful tool, we view the Data Mesh SQL Processor as a complementary addition to our platform. It is not meant to be a replacement for custom processors and Flink jobs using the low-level DataStream API. Since SQL is a higher-level abstraction, users no longer have control over low-level Flink operators and state. This means that if state evolution is critical to the user’s business logic, then having complete control over the state can only be done through low-level abstractions like the DataStream API. Even with this limitation, we have seen that there are many new use cases that are unlocked through the Data Mesh SQL Processor.</p><p id="eacc" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Our early investment in guardrails has helped set clear expectations with our users and keep the operational burden manageable. It has allowed us to productionize queries and patterns that we are confident about supporting, while providing a framework to introduce new capabilities gradually.</p><h1 id="3674" class="ow ox gr be oy oz pa pb pc pd pe pf pg ph pi pj pk pl pm pn po pp pq pr ps pt bj">Future of SQL on Data Mesh</h1><p id="1d94" class="pw-post-body-paragraph mt mu gr mv b mw pu my mz na pv nc nd ne pw ng nh ni px nk nl nm py no np nq gk bj">While introducing the SQL Processor to the Data Mesh platform was a great step forward, we still have much more work to do in order to unlock the power of stream processing at Netflix. We’ve been working with our partner teams to prioritize and build the next set of features to extend the SQL Processor. 
These include stream enrichment using Slowly-Changing-Dimension (SCD) tables, temporal joins, and windowed aggregations.</p><p id="5c2e" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Stay tuned for more updates!</p></div></div></div>]]></description>
      <link>https://netflixtechblog.com/streaming-sql-in-data-mesh-0d83f5a00d08</link>
      <guid>https://netflixtechblog.com/streaming-sql-in-data-mesh-0d83f5a00d08</guid>
      <pubDate>Fri, 03 Nov 2023 22:48:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Kubernetes And Kernel Panics]]></title>
      <description><![CDATA[<div><div class="hs ht hu hv hw"></div><p id="8759" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">How Netflix’s Container Platform Connects Linux Kernel Panics to Kubernetes Pods</p><p id="158b" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj"><em class="nr">By Kyle Anderson</em></p><p id="53a2" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">With a recent effort to reduce customer (engineers, not end users) pain on our container platform <a class="af ns" href="https://netflixtechblog.com/tagged/titus" rel="noopener ugc nofollow" target="_blank">Titus</a>, I started investigating “orphaned” pods. These are pods that never got to finish and had to be garbage collected with no real satisfactory final status. Our Service job (think <a class="af ns" href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/" rel="noopener ugc nofollow" target="_blank">ReplicaSet</a>) owners don’t care too much, but our Batch users care a lot. Without a real return code, how can they know if it is safe to retry or not?</p><p id="e0e7" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">These orphaned pods represent real pain for our users, even if they are a small percentage of the total pods in the system. Where are they going, exactly? 
Why did they go away?</p><p id="5820" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">This blog post shows how to connect the dots from the worst case scenario (a kernel panic) through to Kubernetes (k8s) and eventually up to us operators so that we can track how and why our k8s nodes are going away.</p><h1 id="cedd" class="nt nu gr be nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq bj">Where Do Orphaned Pods Come From?</h1><p id="23e0" class="pw-post-body-paragraph mt mu gr mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gk bj">Orphaned pods get lost because the underlying k8s node object goes away. Once that happens a <a class="af ns" href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection" rel="noopener ugc nofollow" target="_blank">GC</a> process deletes the pod. On Titus we run a custom controller to store the history of Pod and Node objects, so that we can save some explanation and show it to our users. 
This failure mode looks like this in our UI:</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox oy"><picture><img src="https://miro.medium.com/v2/resize:fit:640/0*bPnudULpVKE1AKEH" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/0*bPnudULpVKE1AKEH 640w, https://miro.medium.com/v2/resize:fit:720/0*bPnudULpVKE1AKEH 720w, https://miro.medium.com/v2/resize:fit:750/0*bPnudULpVKE1AKEH 750w, https://miro.medium.com/v2/resize:fit:786/0*bPnudULpVKE1AKEH 786w, https://miro.medium.com/v2/resize:fit:828/0*bPnudULpVKE1AKEH 828w, https://miro.medium.com/v2/resize:fit:1100/0*bPnudULpVKE1AKEH 1100w, https://miro.medium.com/v2/resize:fit:1400/0*bPnudULpVKE1AKEH 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 
700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" /></picture></div></div><figcaption class="pl fc pm ow ox pn po be b bf z dt">What it looks like to our users when a k8s node and its pods disappear</figcaption></figure><p id="be89" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">This is <em class="nr">an </em>explanation, but it wasn’t very satisfying to me or to our users. <em class="nr">Why</em> was the agent lost?</p><h1 id="b4e3" class="nt nu gr be nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq bj">Where Do Lost Nodes Come From?</h1><p id="4839" class="pw-post-body-paragraph mt mu gr mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gk bj">Nodes can go away for any reason, especially in “the cloud”. When this happens, usually a k8s cloud-controller provided by the cloud vendor will detect that the actual server, in our case an EC2 Instance, has actually gone away, and will in turn delete the k8s node object. That still doesn’t really answer the question of <em class="nr">why</em>.</p><p id="67c5" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">How can we make sure that every instance that goes away has a reason, account for that reason, and bubble it up all the way to the pod? 
It all starts with an annotation:</p><pre class="oz pa pb pc pd pp pq pr bo ps ba bj">{<br />     "apiVersion": "v1",<br />     "kind": "Pod",<br />     "metadata": {<br />          "annotations": {<br />               "pod.titus.netflix.com/pod-termination-reason": "Something really bad happened!",<br />...</pre><p id="6cd5" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Just making a place to put this data is a great start. Now all we have to do is make our GC controllers aware of this annotation, and then sprinkle it into any process that could potentially make a pod or node go away unexpectedly. Adding an annotation (as opposed to patching the status) preserves the rest of the pod as-is for historical purposes. (We also add annotations for what did the terminating, and a short <code class="cw px py pz pq b">reason-code</code> for tagging)</p><p id="3ea2" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">The <code class="cw px py pz pq b">pod-termination-reason</code> annotation is useful to populate human readable messages like:</p><ul class=""><li id="2657" class="mt mu gr mv b mw mx my mz na nb nc nd ne qa ng nh ni qb nk nl nm qc no np nq qd qe qf bj">“This pod was preempted by a higher priority job ($id)”</li><li id="db59" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">“This pod had to be terminated because the underlying hardware failed ($failuretype)”</li><li id="c052" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj">“This pod had to be terminated because $user ran sudo halt on the node”</li><li id="28df" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq qd qe qf bj"><strong class="mv gs">“This pod died unexpectedly because the underlying node kernel panicked!”</strong></li></ul><p id="e870" 
class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">But wait, how are we going to annotate a pod for a node that kernel panicked?</p><h1 id="6e7f" class="nt nu gr be nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq bj">Capturing Kernel Panics</h1><p id="5206" class="pw-post-body-paragraph mt mu gr mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gk bj">When the Linux kernel panics, there is just not much you can do. But what if you could send out some sort of “with my final breath, I curse Kubernetes!” UDP packet?</p><p id="23fc" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Inspired by this <a class="af ns" href="https://research.google/pubs/pub45855/" rel="noopener ugc nofollow" target="_blank">Google Spanner paper</a>, where Spanner nodes send out a “last gasp” UDP packet to release leases &amp; locks, you too can configure your servers to do the same upon kernel panic using a stock Linux module: <code class="cw px py pz pq b"><a class="af ns" href="https://www.kernel.org/doc/Documentation/networking/netconsole.txt" rel="noopener ugc nofollow" target="_blank">netconsole</a></code>.</p><h1 id="2965" class="nt nu gr be nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq bj">Configuring Netconsole</h1><p id="50c5" class="pw-post-body-paragraph mt mu gr mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gk bj">The fact that the Linux kernel can even send out UDP packets with the string ‘kernel panic’, <em class="nr">while it is panicking</em>, is kind of amazing. This works because netconsole needs to be configured with almost the entire IP header filled out already beforehand. That is right, you have to tell Linux exactly what your source MAC, IP, and UDP Port are, as well as the destination MAC, IP, and UDP ports. 
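</p><p>As a concrete sketch (the addresses here are purely hypothetical; real values come from the node and collector at boot time), the stock module parameter packs all of those preconfigured fields into one string:</p>

```shell
# Hypothetical addresses for illustration only.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-macaddr]
modprobe netconsole netconsole=6665@10.0.0.17/eth0,6666@10.0.0.2/52:54:00:12:34:56
```

<p>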
You are practically constructing the UDP packet for the kernel. But, with that prework, when the time comes, the kernel can easily <a class="af ns" href="https://github.com/torvalds/linux/blob/94f6f0550c625fab1f373bb86a6669b45e9748b3/drivers/net/netconsole.c#L932" rel="noopener ugc nofollow" target="_blank">construct</a> the packet and get it out the (preconfigured) network interface as things come crashing down. Luckily the <code class="cw px py pz pq b"><a class="af ns" href="https://manpages.ubuntu.com/manpages/jammy/en/man8/netconsole-setup.8.html" rel="noopener ugc nofollow" target="_blank">netconsole-setup</a></code> command makes the setup pretty easy. All the configuration options can be set <a class="af ns" href="https://wiki.ubuntu.com/Kernel/Netconsole#Step_3:_Initialize_netconsole_at_boot_time" rel="noopener ugc nofollow" target="_blank">dynamically</a> as well, so that when the endpoint changes one can point to the new IP.</p><p id="18e5" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Once this is set up, kernel messages will start flowing right after <code class="cw px py pz pq b">modprobe</code>. Imagine the whole thing operating like a <code class="cw px py pz pq b">dmesg | netcat -u $destination 6666</code>, but in kernel space.</p><h1 id="0abc" class="nt nu gr be nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq bj">Netconsole “Last Gasp” Packets</h1><p id="30f9" class="pw-post-body-paragraph mt mu gr mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gk bj">With <code class="cw px py pz pq b">netconsole</code> set up, the last gasp from a crashing kernel looks like a set of UDP packets exactly like one might expect, where the data of the UDP packet is simply the text of the kernel message. 
In the case of a kernel panic, it will look something like this (one UDP packet per line):</p><pre class="oz pa pb pc pd pp pq pr bo ps ba bj">Kernel panic - not syncing: buffer overrun at 0x4ba4c73e73acce54<br />[ 8374.456345] CPU: 1 PID: 139616 Comm: insmod Kdump: loaded Tainted: G OE<br />[ 8374.458506] Hardware name: Amazon EC2 r5.2xlarge/, BIOS 1.0 10/16/2017<br />[ 8374.555629] Call Trace:<br />[ 8374.556147] &lt;TASK&gt;<br />[ 8374.556601] dump_stack_lvl+0x45/0x5b<br />[ 8374.557361] panic+0x103/0x2db<br />[ 8374.558166] ? __cond_resched+0x15/0x20<br />[ 8374.559019] ? do_init_module+0x22/0x20a<br />[ 8374.655123] ? 0xffffffffc0f56000<br />[ 8374.655810] init_module+0x11/0x1000 [kpanic]<br />[ 8374.656939] do_one_initcall+0x41/0x1e0<br />[ 8374.657724] ? __cond_resched+0x15/0x20<br />[ 8374.658505] ? kmem_cache_alloc_trace+0x3d/0x3c0<br />[ 8374.754906] do_init_module+0x4b/0x20a<br />[ 8374.755703] load_module+0x2a7a/0x3030<br />[ 8374.756557] ? __do_sys_finit_module+0xaa/0x110<br />[ 8374.757480] __do_sys_finit_module+0xaa/0x110<br />[ 8374.758537] do_syscall_64+0x3a/0xc0<br />[ 8374.759331] entry_SYSCALL_64_after_hwframe+0x62/0xcc<br />[ 8374.855671] RIP: 0033:0x7f2869e8ee69<br />...</pre><h1 id="8d4e" class="nt nu gr be nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq bj">Connecting to Kubernetes</h1><p id="bdc0" class="pw-post-body-paragraph mt mu gr mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gk bj">The last piece is to connect this to Kubernetes (k8s). 
We need a k8s controller to do the following:</p><ol class=""><li id="114a" class="mt mu gr mv b mw mx my mz na nb nc nd ne qa ng nh ni qb nk nl nm qc no np nq ql qe qf bj">Listen for netconsole UDP packets on port 6666, watching for things that look like kernel panics from nodes.</li><li id="10fd" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq ql qe qf bj">Upon kernel panic, look up the k8s node object associated with the IP address of the incoming netconsole packet.</li><li id="2e6b" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq ql qe qf bj">For that k8s node, find all the pods bound to it, annotate, then delete those pods (they are toast!).</li><li id="0d85" class="mt mu gr mv b mw qg my mz na qh nc nd ne qi ng nh ni qj nk nl nm qk no np nq ql qe qf bj">For that k8s node, annotate the node and then delete it too (it is also toast!).</li></ol><p id="577a" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Parts 1&amp;2 might look like this:</p><pre class="oz pa pb pc pd pp pq pr bo ps ba bj">for {<br />    n, addr, err := serverConn.ReadFromUDP(buf)<br />    if err != nil {<br />        klog.Errorf("Error ReadFromUDP: %s", err)<br />        continue<br />    }<br />    line := sanitizeNetConsoleBuffer(buf[0:n])<br />    if isKernelPanic(line) {<br />        panicCounter = 20<br />        go handleKernelPanicOnNode(ctx, addr, nodeInformer, podInformer, kubeClient, line)<br />    }<br />    if panicCounter &gt; 0 {<br />        klog.Infof("KernelPanic context from %s: %s", addr.IP, line)<br />        panicCounter--<br />    }<br />}</pre><p id="9ed0" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">And then parts 3&amp;4 might look like this:</p><pre class="oz pa pb pc pd pp pq pr bo ps ba bj">func handleKernelPanicOnNode(ctx context.Context, addr 
*net.UDPAddr, nodeInformer cache.SharedIndexInformer, podInformer cache.SharedIndexInformer, kubeClient kubernetes.Interface, line string) {<br />    node := getNodeFromAddr(addr.IP.String(), nodeInformer)<br />    if node == nil {<br />        klog.Errorf("Got a kernel panic from %s, but couldn't find a k8s node object for it?", addr.IP.String())<br />    } else {<br />        pods := getPodsFromNode(node, podInformer)<br />        klog.Infof("Got a kernel panic from node %s, annotating and deleting all %d pods and that node.", node.Name, len(pods))<br />        annotateAndDeletePodsWithReason(ctx, kubeClient, pods, line)<br />        err := deleteNode(ctx, kubeClient, node.Name)<br />        if err != nil {<br />            klog.Errorf("Error deleting node %s: %s", node.Name, err)<br />        } else {<br />            klog.Infof("Deleted panicked node %s", node.Name)<br />        }<br />    }<br />}</pre><p id="dec3" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">With that code in place, as soon as a kernel panic is detected, the pods and nodes immediately go away. No need to wait for any GC process. 
The annotations help document what happened to the node &amp; pod:</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg fg ph bg pi"><div class="ow ox qn"><picture><img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*cjClRuyUQ67lu2shmjCObQ.png" alt="image" /><source data-testid="og" srcset="https://miro.medium.com/v2/resize:fit:640/1*cjClRuyUQ67lu2shmjCObQ.png" /></picture></div></div><figcaption class="pl fc pm ow ox pn po be b bf z dt">A real pod lost on a real k8s node that had a real kernel panic!</figcaption></figure><h1 id="2f32" class="nt nu gr be nv nw nx ny nz oa ob oc od oe of og oh oi oj ok ol om on oo op oq bj">Conclusion</h1><p id="4bd6" class="pw-post-body-paragraph mt mu gr mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gk bj">Marking that a job failed because of a kernel panic may not be <em class="nr">that</em> satisfactory to our customers. But they can take satisfaction in knowing that we now have the required observability tools to start fixing those kernel panics!</p><p id="5c9c" class="pw-post-body-paragraph mt mu gr mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gk bj">Do you also enjoy really getting to the bottom of why things fail in your systems or think kernel panics are cool? Join us on the <a class="af ns" href="https://jobs.netflix.com/jobs/198642264" rel="noopener ugc nofollow" target="_blank">Compute Team</a> where we are building a world-class container platform for our engineers.</p></div>]]></description>
      <link>https://netflixtechblog.com/kubernetes-and-kernel-panics-ed620b9c6225</link>
      <guid>https://netflixtechblog.com/kubernetes-and-kernel-panics-ed620b9c6225</guid>
      <pubDate>Fri, 27 Oct 2023 18:05:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Zero Configuration Service Mesh with On-Demand Cluster Discovery]]></title>
      <description><![CDATA[<p id="30a6" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><em class="nr">by David Duke, James Mulcahy, Ling Yuan, Rob Gulewich</em></p><p id="665a" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">In this post we discuss Netflix’s adoption of service mesh: some history, motivations, and how we worked with Kinvolk and the Envoy community on a feature that streamlines service mesh adoption in complex microservice environments: on-demand cluster discovery.</p><p id="6cd9" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Netflix was early to the cloud, particularly for large-scale companies: we began the migration in 2008, and by 2010, <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/four-reasons-we-choose-amazons-cloud-as-our-computing-platform-4aceb692afec">Netflix streaming was fully run on AWS</a>. Today we have a wealth of tools, both OSS and commercial, all designed for cloud-native environments. In 2010, however, nearly none of it existed: the <a class="af ov" href="https://www.cncf.io/" rel="noopener ugc nofollow" target="_blank">CNCF</a> wasn’t formed until 2015! Since there were no existing solutions available, we needed to build them ourselves.</p><p id="9d53" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">For Inter-Process Communication (IPC) between services, we needed the rich feature set that a mid-tier load balancer typically provides. We also needed a solution that addressed the reality of working in the cloud: a highly dynamic environment where nodes are coming up and down, and services need to quickly react to changes and route around failures. 
To improve availability, we designed systems where components could fail separately and avoid single points of failure. These design principles led us to client-side load-balancing, and the <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/a-closer-look-at-the-christmas-eve-outage-d7b409a529ee">2012 Christmas Eve outage</a> solidified this decision even further. During these early years in the cloud, <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/netflix-shares-cloud-load-balancing-and-failover-tool-eureka-c10647ef95e5">we built Eureka</a> for Service Discovery and <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/announcing-ribbon-tying-the-netflix-mid-tier-services-together-a89346910a62">Ribbon (internally known as NIWS) for IPC</a>. Eureka solved the problem of how services discover what instances to talk to, and Ribbon provided the client-side logic for load-balancing, as well as many other resiliency features. These two technologies, alongside a host of other resiliency and chaos tools, made a massive difference: our reliability improved measurably as a result.</p><p id="9c5e" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Eureka and Ribbon presented a simple but powerful interface, which made adopting them easy. In order for a service to talk to another, it needs to know two things: the name of the destination service, and whether or not the traffic should be secure. The abstractions that Eureka provides for this are Virtual IPs (VIPs) for insecure communication, and Secure VIPs (SVIPs) for secure. A service advertises a VIP name and port to Eureka (eg: <em class="nr">myservice</em>, port <em class="nr">8080</em>), or an SVIP name and port (eg: <em class="nr">myservice-secure</em>, port 8443), or both. 
IPC clients are instantiated targeting that VIP or SVIP, and the Eureka client code handles the translation of that VIP to a set of IP and port pairs by fetching them from the Eureka server. The client can also optionally enable IPC features like retries or circuit breaking, or stick with a set of reasonable defaults.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg ff ph bg pi ow ox oy"><picture><img alt="A diagram showing an IPC client in a Java app directly communicating to hosts registered as SVIP A. Host and port information for SVIP A is fetched from Eureka by the IPC client." class="bg pj pk c" width="700" height="594" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="ed14" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">In this architecture, service to service communication no longer goes through the single point of failure of a load balancer. The downside is that Eureka is a new single point of failure as the source of truth for what hosts are registered for VIPs. However, if Eureka goes down, services can continue to communicate with each other, though their host information will become stale over time as instances for a VIP come up and down. The ability to run in a degraded but available state during an outage is still a marked improvement over completely stopping traffic flow.</p><p id="52d3" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">The above architecture has served us well over the last decade, though changing business needs and evolving industry standards have added more complexity to our IPC ecosystem in a number of ways. First, we’ve grown the number of different IPC clients. Our internal IPC traffic is now a mix of plain REST, <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2">GraphQL</a>, and <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/practical-api-design-at-netflix-part-1-using-protobuf-fieldmask-35cfdc606518">gRPC</a>. Second, we’ve moved from a Java-only environment to a Polyglot one: we now also support <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/debugging-node-js-in-production-75901bb10f2d">node.js</a>, <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/python-at-netflix-bba45dae649e">Python</a>, and a variety of OSS and off the shelf software. 
Third, we’ve continued to add more functionality to our IPC clients: features such as <a class="af ov" href="https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581" rel="noopener">adaptive concurrency limiting</a>, <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/making-the-netflix-api-more-resilient-a8ec62159c2d">circuit breaking</a>, hedging, and fault injection have become standard tools that our engineers reach for to make our system more reliable. Compared to a decade ago, we now support more features, in more languages, in more clients. Keeping feature parity between all of these implementations and ensuring that they all behave the same way is challenging: what we want is a single, well-tested implementation of all of this functionality, so we can make changes and fix bugs in one place.</p><p id="bd73" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">This is where service mesh comes in: we can centralize IPC features in a single implementation, and keep per-language clients as simple as possible: they only need to know how to talk to the local proxy. <a class="af ov" href="https://www.envoyproxy.io/" rel="noopener ugc nofollow" target="_blank">Envoy</a> is a great fit for us as the proxy: it’s a battle-tested OSS product in use at high scale in the industry, with <a class="af ov" href="https://github.com/envoyproxy/envoy/issues/7789" rel="noopener ugc nofollow" target="_blank">many critical resiliency features</a>, and <a class="af ov" href="https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/network_filters/wasm_filter.html" rel="noopener ugc nofollow" target="_blank">good extension points</a> for when we need to extend its functionality. 
The ability to <a class="af ov" href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/dynamic_configuration" rel="noopener ugc nofollow" target="_blank">configure proxies via a central control plane</a> is a killer feature: this allows us to dynamically configure client-side load balancing as if it was a central load balancer, but still avoids a load balancer as a single point of failure in the service to service request path.</p><p id="2cac" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Once we decided that moving to service mesh was the right bet to make, the next question became: how should we go about moving? We decided on a number of constraints for the migration. First: we wanted to keep the existing interface. The abstraction of specifying a VIP name plus secure serves us well, and we didn’t want to break backwards compatibility. Second: we wanted to automate the migration and to make it as seamless as possible. These two constraints meant that we needed to support the Discovery abstractions in Envoy, so that IPC clients could continue to use it under the hood. Fortunately, Envoy had <a class="af ov" href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/intro/terminology" rel="noopener ugc nofollow" target="_blank">ready to use abstractions</a> for this. VIPs could be represented as Envoy Clusters, and proxies could fetch them from our control plane using the Cluster Discovery Service (CDS). The hosts in those clusters are represented as Envoy Endpoints, and could be fetched using the Endpoint Discovery Service (EDS).</p><p id="47c2" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">We soon ran into a stumbling block to a seamless migration: Envoy requires that clusters be specified as part of the proxy’s config. 
If service A needs to talk to clusters B and C, then you need to define clusters B and C as part of A’s proxy config. This can be challenging at scale: any given service might communicate with dozens of clusters, and that set of clusters is different for every app. In addition, Netflix is always changing: we’re constantly adding new initiatives like live streaming, <a class="af ov" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba">ads</a> and games, and evolving our architecture. This means the clusters that a service communicates with will change over time. There are a number of different approaches to populating cluster config that we evaluated, given the Envoy primitives available to us:</p><ol class=""><li id="6bc6" class="mt mu gq mv b mw mx my mz na nb nc nd pl nf ng nh pm nj nk nl pn nn no np nq po pp pq bj">Get service owners to define the clusters their service needs to talk to. This option seems simple, but in practice, service owners don’t always know, or want to know, what services they talk to. Services often import libraries provided by other teams that talk to multiple other services under the hood, or communicate with other operational services like telemetry and logging. This means that service owners would need to know how these auxiliary services and libraries are implemented under the hood, and adjust config when they change.</li>
<li id="adc8" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">Auto-generate Envoy config based on a service’s call graph. This method is simple for pre-existing services, but is challenging when bringing up a new service or adding a new upstream cluster to communicate with.</li>
<li id="ca32" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">Push all clusters to every app: this option was appealing in its simplicity, but back of the napkin math quickly showed us that pushing millions of endpoints to each proxy wasn’t feasible.</li>
</ol><p id="4598" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Given our goal of a seamless adoption, each of these options had significant enough downsides that we explored another option: what if we could fetch cluster information on-demand at runtime, rather than predefining it? At the time, the service mesh effort was still being bootstrapped, with only a few engineers working on it. We approached <a class="af ov" href="https://kinvolk.io/" rel="noopener ugc nofollow" target="_blank">Kinvolk</a> to see if they could work with us and the Envoy community in implementing this feature. The result of this collaboration was <a class="af ov" href="https://github.com/envoyproxy/envoy/pull/18723" rel="noopener ugc nofollow" target="_blank">On-Demand Cluster Discovery</a> (ODCDS). With this feature, proxies could now look up cluster information the first time they attempt to connect to it, rather than predefining all of the clusters in config.</p><p id="de68" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">With this capability in place, we needed to give the proxies cluster information to look up. We had already developed a service mesh control plane that implements the Envoy XDS services. We then needed to fetch service information from Eureka in order to return to the proxies. We represent Eureka VIPs and SVIPs as separate Envoy Cluster Discovery Service (CDS) clusters (so service <em class="nr">myservice</em> may have clusters <em class="nr">myservice.vip</em> and <em class="nr">myservice.svip</em>). Individual hosts in a cluster are represented as separate Endpoint Discovery Service (EDS) endpoints. This allows us to reuse the same Eureka abstractions, and IPC clients like Ribbon can move to mesh with minimal changes. 
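</p><p>For a feel of the data involved, a control-plane CDS answer for one service might take roughly this shape; this is a hand-written sketch following the <em>myservice.vip</em> / <em>myservice.svip</em> naming convention above, with field values simplified rather than actual control-plane output:</p>

```yaml
# Illustrative sketch: two Envoy clusters for one Eureka-registered service.
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: myservice.vip            # insecure VIP
  type: EDS                      # endpoints pulled separately via EDS
  eds_cluster_config:
    eds_config:
      ads: {}
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: myservice.svip           # secure VIP; would additionally carry
  type: EDS                      # a TLS transport-socket configuration
  eds_cluster_config:
    eds_config:
      ads: {}
```

<p>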
With both the control plane and data plane changes in place, the flow works as follows:</p><ol class=""><li id="cc40" class="mt mu gq mv b mw mx my mz na nb nc nd pl nf ng nh pm nj nk nl pn nn no np nq po pp pq bj">Client request comes into Envoy</li>
<li id="ce20" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">Extract the target cluster based on the Host / :authority header (the header used here is configurable, but this is our approach). If that cluster is known already, jump to step 7</li>
<li id="8674" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">The cluster doesn’t exist, so we pause the in-flight request</li>
<li id="5782" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">Make a request to the Cluster Discovery Service (CDS) endpoint on the control plane. The control plane generates a customized CDS response based on the service’s configuration and Eureka registration information</li>
<li id="0ac2" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">Envoy gets back the cluster (CDS), which triggers a pull of the endpoints via Endpoint Discovery Service (EDS). Endpoints for the cluster are returned based on Eureka status information for that VIP or SVIP</li>
<li id="ae3c" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">Client request unpauses</li>
<li id="4527" class="mt mu gq mv b mw pr my mz na ps nc nd pl pt ng nh pm pu nk nl pn pv no np nq po pp pq bj">Envoy handles the request as normal: it picks an endpoint using a load-balancing algorithm and issues the request</li>
</ol><p id="3df6" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">This flow is completed in a few milliseconds, but only on the first request to the cluster. Afterward, Envoy behaves as if the cluster had been defined in the config. Critically, this system allows us to seamlessly migrate services to service mesh with no configuration required, satisfying one of our main adoption constraints. The abstraction we present continues to be a VIP name plus a secure flag, and we can migrate to mesh by configuring individual IPC clients to connect to the local proxy instead of the upstream app directly. We continue to use Eureka as the source of truth for VIPs and instance status, which allows us to support a heterogeneous environment in which some apps are on the mesh and some are not while we migrate. There’s an additional benefit: we can keep Envoy memory usage low by fetching data only for clusters that we’re actually communicating with.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg ff ph bg pi ow ox pw"><picture><img alt="A diagram showing an IPC client in a Java app communicating through Envoy to hosts registered as SVIP A. Cluster and endpoint information for SVIP A is fetched from the mesh control plane by Envoy. The mesh control plane fetches host information from Eureka." class="bg pj pk c" width="700" height="568" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="996c" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">There is a downside to fetching this data on-demand: this adds latency to the first request to a cluster. We have run into use-cases where services need very low-latency access on the first request, and even a few extra milliseconds is too much overhead. For these use-cases, the services need to either predefine the clusters they communicate with, or prime connections before their first request. We’ve also considered pre-pushing clusters from the control plane as proxies start up, based on historical request patterns. Overall, we feel the reduced complexity in the system justifies the downside for a small set of services.</p><p id="1ce5" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">We’re still early in our service mesh journey. Now that we’re using it in earnest, there are many more Envoy improvements that we’d love to work with the community on. The porting of our <a class="af ov" href="https://github.com/envoyproxy/envoy/issues/7789" rel="noopener ugc nofollow" target="_blank">adaptive concurrency limiting</a> implementation to Envoy was a great start — we’re looking forward to collaborating with the community on many more. We’re particularly interested in the community’s work on incremental EDS. EDS endpoints account for the largest volume of updates, and this puts undue pressure on both the control plane and Envoy.</p><p id="ee83" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">We’d like to give a big thank-you to the folks at Kinvolk for their Envoy contributions: Alban Crequy, Andrew Randall, Danielle Tal, and in particular Krzesimir Nowak for his excellent work. 
We’d also like to thank the Envoy community for their support and razor-sharp reviews: Adi Peleg, Dmitri Dolguikh, Harvey Tuch, Matt Klein, and Mark Roth. It’s been a great experience working with you all on this.</p><p id="4bd0" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">This is the first in a series of posts on our journey to service mesh, so stay tuned. If this sounds like fun, and you want to work on service mesh at scale, come work with us — <a class="af ov" href="https://jobs.netflix.com/jobs/271057970" rel="noopener ugc nofollow" target="_blank">we’re hiring</a>!</p>]]></description>
      <link>https://netflixtechblog.com/zero-configuration-service-mesh-with-on-demand-cluster-discovery-ac6483b52a51</link>
      <guid>https://netflixtechblog.com/zero-configuration-service-mesh-with-on-demand-cluster-discovery-ac6483b52a51</guid>
      <pubDate>Wed, 30 Aug 2023 01:08:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[AVA Discovery View: Surfacing Authentic Moments]]></title>
<description><![CDATA[<p id="83b2" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">By: <a class="af nr" href="https://www.linkedin.com/in/hamidshahid" rel="noopener ugc nofollow" target="_blank">Hamid Shahid</a>, <a class="af nr" href="https://www.linkedin.com/in/ljworks34/" rel="noopener ugc nofollow" target="_blank">Laura Johnson</a>, <a class="af nr" href="https://www.linkedin.com/in/tiffany-low/" rel="noopener ugc nofollow" target="_blank">Tiffany Low</a></p><p id="2f1d" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">At Netflix, we have created millions of artwork assets to represent our titles. Each asset tells a story about the title it represents. From our <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/selecting-the-best-artwork-for-videos-through-a-b-testing-f6155c4595f6">testing on promotional assets</a>, we know which of these assets have performed well and which ones haven’t. Through this, our teams have developed an intuition of what visual and thematic artwork characteristics work well for what genres of titles. A piece of promotional artwork may resonate more in certain regions, for certain genres, or for fans of particular talent. The complexity of these factors makes it difficult to determine the best creative strategy for upcoming titles.</p><p id="69d8" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Our assets are often created by selecting static image frames directly from our source videos. 
To improve this process, we decided to invest in creating a <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/building-a-media-understanding-platform-for-ml-innovations-9bef9962dcb7">Media Understanding Platform</a>, which enables us to extract meaningful insights from media that we can then surface in our creative tools. In this post, we will take a deeper look into one of these tools, AVA Discovery View.</p><p id="b7c8" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">AVA is an internal tool that surfaces still frames from video content. The tool provides an efficient way for creatives (photo editors, artwork designers, etc.) to pull moments from video content that authentically represent the title’s narrative themes, main characters, and visual characteristics. These still moments are used across Netflix for artwork (on and off the Netflix platform) and by multiple teams, including Publicity, Marketing, and Social.</p><p id="8db7" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Stills are used to merchandise &amp; publicize titles authentically, providing a diverse set of entry points to members who may watch for different reasons. For example, for our hit title “<em class="ov">Wednesday”</em>, one member may watch it because they love mysteries, while another may watch because they love coming-of-age stories or goth aesthetics. Another member may be drawn by talent. It’s a creative’s job to select frames with all these entry points in mind. Stills may be enhanced and combined to create a more polished piece of artwork or be used as is. 
For many teams and titles, stills are essential to Netflix’s promotional asset strategy.</p><p id="b14b" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Watching every moment of content to find the best frames and select them manually takes a lot of time, and this approach is often not scalable. While frames can be saved manually from the video content, AVA goes beyond providing the functionality to surface authentic frames — it suggests the best moments for creatives to use: enter AVA Discovery View.</p><p id="78ca" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">AVA’s imagery-harvesting algorithms pre-select and group relevant frames into categories like <em class="ov">Storylines &amp; Tones</em>, <em class="ov">Prominent Characters,</em> and <em class="ov">Environments</em>.</p><p id="5727" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Let’s look deeper at how different facets of a title are shown in one of Netflix’s biggest hits — “<em class="ov">Wednesday”</em>.</p><h2 id="1fc4" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Storyline / Tone</h2><p id="f384" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">The title <em class="ov">“Wednesday”</em> involves a character with supernatural abilities sleuthing to solve a mystery. The title has a dark, imaginative tone with shades of wit and dry humor. The setting is an extraordinary high school where teenagers with supernatural abilities are enrolled. 
The main character is a teenager and has relationship issues with her parents.</p><p id="4361" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">The paragraph above provides a short glimpse of the title and is similar to the briefs that our creatives have to work with. Finding authentic moments from this information to build the base of the artwork suite is not trivial and has been very time-consuming for our creatives.</p><p id="baec" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">This is where AVA Discovery View comes in and functions as a creative assistant. Using the information about the storyline and tones associated with a title, it surfaces key moments, which not only provide a nice visual summary but also a quick landscape view of the title’s main narrative themes and its visual language.</p><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv ff pw bg px pl pm pn"><picture><img alt="" class="bg py pz c" width="700" height="209" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="qa qb qc pl pm qd qe be b bf z dt">Storyline &amp; Tone suggestions</figcaption></figure><p id="522e" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Creatives can click on any storyline to see moments that best reflect that storyline and the title’s overall tone. For example, the following images illustrate how it displays moments for the “imaginative” tone.</p><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv ff pw bg px pl pm qf"><picture><img alt="" class="bg py pz c" width="700" height="544" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><h2 id="33e3" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Prominent Characters</h2><p id="3a8b" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Talent is a major draw for our titles, and our members want to see who is featured in a title to choose whether or not they want to watch that title. Getting to know the prominent characters for a title and then finding the best possible moments featuring them used to be an arduous task.</p><p id="aaa8" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">With the AVA Discovery View, all the prominent characters of the title and their best possible shots are presented to the creatives. They can see how much a character is featured in the title and find shots containing multiple characters and the best possible stills for the characters themselves.</p><h2 id="3d64" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Sensitivities</h2><p id="0b57" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">We don’t want the Netflix home screen to shock or offend audiences, so we aim to avoid artwork with violence, nudity, gore or similar attributes.</p><p id="b529" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">To help our creatives understand content sensitivities, AVA Discovery View lists moments where content contains gore, violence, intimacy, nudity, smoking, etc.</p><figure class="po pp pq pr ps pt pl pm paragraph-image"><div class="ab cm ca qg"><picture><img alt="" class="py bg pz c" width="700" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="qa qb qc pl pm qd qe be b bf z dt">Sensitive Moments</figcaption></figure><h2 id="fca4" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Environments</h2><p id="6015" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">The setting and the filming location often provide great genre cues and form the basis of great-looking artwork. Finding moments from a virtual setting in the title or the actual filming location required a visual scan of all episodes of a title. Now, AVA Discovery View shows such moments as suggestions to the creatives.</p><p id="313f" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">For example, for the title “<em class="ov">Wednesday”</em>, the creatives are presented with “Nevermore Academy” as a suggested environment.</p><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv ff pw bg px pl pm qh"><picture><img alt="" class="bg py pz c" width="700" height="286" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="qa qb qc pl pm qd qe be b bf z dt">Suggested Environment — Nevermore Academy</figcaption></figure><h2 id="077f" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Algorithm Quality</h2><p id="8f72" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">AVA Discovery View included several different algorithms at the start, and since its release, we have expanded support to additional algorithms. Each algorithm needed a process of evaluation and tuning to get great results in AVA Discovery View.</p><p id="5b89" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">For Visual Search</strong></p><ul class=""><li id="c7c1" class="mt mu gq mv b mw mx my mz na nb nc nd qi nf ng nh qj nj nk nl qk nn no np nq ql qm qn bj">We found that the model was influenced by the text present in the image. For example, stills of title credits would often get picked up and highly recommended to users. We added a filtering step so that stills containing text are excluded from search results.</li>
<li id="25a5" class="mt mu gq mv b mw qo my mz na qp nc nd qi qq ng nh qj qr nk nl qk qs no np nq ql qm qn bj">We also found that users preferred results that had a confidence threshold cutoff applied to them.</li>
</ul><p id="b96c" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">For Prominent Characters</strong></p><ul class=""><li id="a185" class="mt mu gq mv b mw mx my mz na nb nc nd qi nf ng nh qj nj nk nl qk nn no np nq ql qm qn bj">We found that our current algorithm model did not handle animated faces well. As a result, we often find that poor or no suggestions are returned for animated content.</li>
</ul><p id="a8e4" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">For Sensitive Moments</strong></p><ul class=""><li id="f1ba" class="mt mu gq mv b mw mx my mz na nb nc nd qi nf ng nh qj nj nk nl qk nn no np nq ql qm qn bj">We found that setting a high confidence threshold was helpful. The algorithm was originally developed to be sensitive to bloody scenes, and when applied to scenes of cooking and painting, it often flagged them as false positives.</li>
</ul><p id="ed2d" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">One challenge we encountered was the repetition of suggestions. Multiple suggestions from the same scene could be returned and lead to many visually similar moments. Users preferred seeing only the best frames and a diverse set of frames.</p><ul class=""><li id="3c64" class="mt mu gq mv b mw mx my mz na nb nc nd qi nf ng nh qj nj nk nl qk nn no np nq ql qm qn bj">We added a ranking step to some algorithms to flag frames that are too visually similar to higher-ranked frames. These duplicate frames are then filtered out of the suggestions list.</li>
<li id="acf3" class="mt mu gq mv b mw qo my mz na qp nc nd qi qq ng nh qj qr nk nl qk qs no np nq ql qm qn bj">However, not all algorithms can take this approach. We are exploring using scene boundary algorithms to group similar moments together as a single recommendation.</li>
</ul><h2 id="367f" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Suggestion Ranking</h2><p id="c97b" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">AVA Discovery View presents multiple levels of algorithmic suggestions, and a challenge was to guide users toward the best-performing suggestions and away from bad ones.</p><ul class=""><li id="3060" class="mt mu gq mv b mw mx my mz na nb nc nd qi nf ng nh qj nj nk nl qk nn no np nq ql qm qn bj">The suggestion categories are presented in order of relevance to our users’ workflow. We show Storyline/Tone, Prominent Characters, Environments, then Sensitivities.</li>
<li id="6969" class="mt mu gq mv b mw qo my mz na qp nc nd qi qq ng nh qj qr nk nl qk qs no np nq ql qm qn bj">Within each suggestion category, we display suggestions ranked by the number of results, breaking ties by confidence score.</li>
</ul><h2 id="756e" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Algorithm Feedback</h2><p id="8405" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">As we launched the initial set of algorithms for AVA Discovery View, our team interviewed users about their experiences. We also built mechanisms within the tool to get explicit and implicit user feedback.</p><p id="2590" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">Explicit Feedback</strong></p><ul class=""><li id="027d" class="mt mu gq mv b mw mx my mz na nb nc nd qi nf ng nh qj nj nk nl qk nn no np nq ql qm qn bj">For each algorithmic suggestion presented, users can click a thumbs up or thumbs down to give direct feedback.</li>
</ul><p id="e5dd" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">Implicit Feedback</strong></p><ul class=""><li id="d76e" class="mt mu gq mv b mw mx my mz na nb nc nd qi nf ng nh qj nj nk nl qk nn no np nq ql qm qn bj">We have tracking enabled to detect when an algorithmic suggestion has been utilized (downloaded or published for Netflix promotional purposes).</li>
<li id="c265" class="mt mu gq mv b mw qo my mz na qp nc nd qi qq ng nh qj qr nk nl qk qs no np nq ql qm qn bj">This implicit feedback is much easier to collect, although it may not work for all algorithms. For example, suggestions from Sensitivities are meant to be content watch-outs that should not be used for promotional purposes. As a result, this row does poorly on implicit feedback as we do not expect downloads or publish actions on these suggestions.</li>
</ul><p id="2f38" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">This feedback is easily accessible by our algorithm partners and used in training improved versions of the models.</p><h2 id="4090" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Intersection Queries across Multiple Algorithms</h2><p id="c784" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Several media understanding algorithms return clip or short-duration video segment suggestions. We compute the timecode intersections against a set of known high-quality frames to surface the best frame within these clips.</p><p id="90be" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">We also rely on intersection queries to help users narrow a large set of frames to a specific moment. For example, returning stills with two or more prominent characters or filtering only indoor scenes from a search query.</p><h2 id="028a" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Discovery View Plugin Architecture</h2><figure class="po pp pq pr ps pt pl pm paragraph-image"><div role="button" tabindex="0" class="pu pv ff pw bg px pl pm qt"><picture><img alt="" class="bg py pz c" width="700" height="535" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="qa qb qc pl pm qd qe be b bf z dt">Discovery View Plugin Architecture</figcaption></figure><p id="6c32" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">We built Discovery View as a pluggable feature that could quickly be extended to support more algorithms and other types of suggestions. Discovery View is available via Studio Gateway for AVA UI and other front-end applications to leverage.</p><h2 id="b4bf" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Unified Interface for Discovery</h2><p id="e703" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">All Discovery View rows implement the same interface, and it’s simple to extend it and plug it into the existing view.</p><p id="04bb" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">Scalable Categories</strong>: In the Discovery View feature, we dynamically hide categories or recommendations based on the results of algorithms. Categories can be hidden if no suggestions are found. 
On the other hand, for a large number of suggestions, only top suggestions are retrieved, and users have the ability to request more.</p><p id="10d4" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">Graceful Failure Handling</strong>: We load Discovery View suggestions independently for a responsive user experience.</p><h2 id="4a3f" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Asset Feedback MicroService</h2><figure class="po pp pq pr ps pt pl pm paragraph-image"><div class="pl pm qu"><picture><img alt="" class="bg py pz c" width="280" height="446" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="qa qb qc pl pm qd qe be b bf z dt">Asset Feedback MicroService</figcaption></figure><p id="ddc9" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">We identified that Asset Feedback is functionality that is useful elsewhere in our ecosystem, so we decided to create a separate microservice for it. The service performs the important function of collecting feedback about the quality of stills and tying it to the algorithms that generated them. This information is available both at individual and aggregated levels for our algorithm partners.</p><p id="50d7" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">AVA Discovery View relies on the Media Understanding Platform (MUP) as the main interface for algorithm suggestions. The key features of this platform are:</p><h2 id="3dd0" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Uniform Query Interface</h2><p id="2e2c" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Hosting all of the algorithms in AVA Discovery View on MUP made product integration easier, as the suggestions could be queried from each algorithm in the same way.</p><h2 id="5f42" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Rich Query Feature Set</h2><p id="9111" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">We could test different confidence thresholds per algorithm, intersect across algorithm suggestions, and order suggestions by various fields.</p><h2 id="6f1e" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Fast Algo Onboarding</h2><p id="0e3f" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Each algorithm 
took fewer than two weeks to onboard, and the platform ensured that new titles delivered to Netflix would automatically generate algorithm suggestions. Our team was able to spend more time evaluating algorithm performance and quickly iterate on AVA Discovery View.</p><p id="f4ec" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">To learn more about MUP, please see a previous blog post from our team: <a class="af nr" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/building-a-media-understanding-platform-for-ml-innovations-9bef9962dcb7">Building a Media Understanding Platform for ML Innovations</a>.</p><p id="ef1c" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Discovering authentic moments in an efficient and scalable way has a huge impact on Netflix and its creative teams. AVA has become a place to gain title insights and discover assets. It provides a concise brief on the main narratives, the visual language, and the title’s prominent characters. An AVA user can find relevant and visually stunning frames quickly and easily and leverage them as a context-gathering tool.</p><p id="a405" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">To improve AVA Discovery View, our team needs to balance the number of frames returned and the quality of the suggestions so that creatives can build more trust with the feature.</p><h2 id="f640" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Eliminating Repetition</h2><p id="3807" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">AVA Discovery View will often put the same frame into multiple categories, which results in creatives viewing and evaluating the same frame multiple times. 
How can we solve for an engaging frame being a part of multiple groupings without bloating each grouping with repetition?</p><h2 id="2fba" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Improving Frame Quality</h2><p id="e9f6" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">We’d like to only show creatives the best frames from a certain moment and work to eliminate frames that have either poor technical quality (a poor character expression) or poor editorial quality (not relevant to grouping, not relevant to narrative). Sifting through frames that aren’t up to quality standards creates user fatigue.</p><h2 id="5109" class="ow nt gq be nu ox oy dx ny oz pa dz oc ne pb pc pd ni pe pf pg nm ph pi pj pk bj">Building User Trust</h2><p id="fcab" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Creatives don’t want to wonder whether there’s something better outside an AVA Discovery View grouping or if anything is missing from these suggested frames.</p><p id="91a7" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">When looking at a particular grouping (like “<em class="ov">Wednesday”’s</em> <em class="ov">Solving a Mystery</em> or <em class="ov">Gothic</em>), creatives need to trust that it doesn’t contain any frames that don’t belong there, that these are the best quality frames, and that no better frames exist elsewhere in the content that are missing from the grouping. Suppose a creative is leveraging AVA Discovery View and doing separate manual work to improve frame quality or check for missing moments. 
In that case, AVA Discovery View hasn’t yet fully optimized the user experience.</p><p id="47b8" class="pw-post-body-paragraph mt mu gq mv b mw oq my mz na or nc nd ne os ng nh ni ot nk nl nm ou no np nq gj bj">Special thanks to <a class="af nr" href="https://www.linkedin.com/in/abhisheks0ni/" rel="noopener ugc nofollow" target="_blank">Abhishek Soni</a>, <a class="af nr" href="https://www.linkedin.com/in/amirziai/" rel="noopener ugc nofollow" target="_blank">Amir Ziai</a>, <a class="af nr" href="https://www.linkedin.com/in/andrea-johnson-01535946/" rel="noopener ugc nofollow" target="_blank">Andrew Johnson</a>, <a class="af nr" href="https://www.linkedin.com/in/ankushagrawal94/" rel="noopener ugc nofollow" target="_blank">Ankush Agrawal</a>, <a class="af nr" href="https://www.linkedin.com/in/aneeshvartakavi/" rel="noopener ugc nofollow" target="_blank">Aneesh Vartakavi</a>, <a class="af nr" href="https://www.linkedin.com/in/audra-reed-83a0007/" rel="noopener ugc nofollow" target="_blank">Audra Reed</a>, <a class="af nr" href="https://www.linkedin.com/in/briandasg/" rel="noopener ugc nofollow" target="_blank">Brianda Suarez</a>, <a class="af nr" href="https://www.linkedin.com/in/farazamiruddin/" rel="noopener ugc nofollow" target="_blank">Faraz Ahmad</a>, <a class="af nr" href="https://www.linkedin.com/in/farisito/" rel="noopener ugc nofollow" target="_blank">Faris Mustafa</a>, <a class="af nr" href="https://www.linkedin.com/in/fifi-mar%C3%A9e/" rel="noopener ugc nofollow" target="_blank">Fifi Maree</a>, <a class="af nr" href="https://www.linkedin.com/in/gurutahasildar/" rel="noopener ugc nofollow" target="_blank">Guru Tahasildar</a>, <a class="af nr" href="https://www.linkedin.com/in/gucarmo/" rel="noopener ugc nofollow" target="_blank">Gustavo Carmo</a>, <a class="af nr" href="https://www.linkedin.com/in/haleyjonesphillips/" rel="noopener ugc nofollow" target="_blank">Haley Jones Phillips</a>, <a class="af nr" href="https://www.linkedin.com/in/jananbarge/" 
rel="noopener ugc nofollow" target="_blank">Janan Barge</a>, <a class="af nr" href="https://www.linkedin.com/in/karenannwilliams/" rel="noopener ugc nofollow" target="_blank">Karen Williams</a>, <a class="af nr" href="https://www.linkedin.com/in/ljworks34/" rel="noopener ugc nofollow" target="_blank">Laura Johnson</a>, <a class="af nr" href="https://www.linkedin.com/in/maria-perkovic/" rel="noopener ugc nofollow" target="_blank">Maria Perkovic</a>, <a class="af nr" href="https://www.linkedin.com/in/meenakshijindal/" rel="noopener ugc nofollow" target="_blank">Meenakshi Jindal</a>, <a class="af nr" href="https://www.linkedin.com/in/nagendrak/" rel="noopener ugc nofollow" target="_blank">Nagendra Kamath</a>, <a class="af nr" href="https://www.linkedin.com/in/nicolapharoah/" rel="noopener ugc nofollow" target="_blank">Nicola Pharoah</a>, <a class="af nr" href="https://www.linkedin.com/in/qiang-liu-7a18b32a/" rel="noopener ugc nofollow" target="_blank">Qiang Liu</a>, <a class="af nr" href="https://www.linkedin.com/in/samuel-carvajal/" rel="noopener ugc nofollow" target="_blank">Samuel Carvajal</a>, <a class="af nr" href="https://www.linkedin.com/in/shervin-ardeshir/" rel="noopener ugc nofollow" target="_blank">Shervin Ardeshir</a>, <a class="af nr" href="https://www.linkedin.com/in/supriya-vadlamani/" rel="noopener ugc nofollow" target="_blank">Supriya Vadlamani</a>, <a class="af nr" href="https://www.linkedin.com/in/varun-sekhri-087a213/" rel="noopener ugc nofollow" target="_blank">Varun Sekhri</a>, and <a class="af nr" href="https://www.linkedin.com/in/vitalikauhanka/" rel="noopener ugc nofollow" target="_blank">Vitali Kauhanka</a> for making it all possible.</p>]]></description>
      <link>https://netflixtechblog.com/ava-discovery-view-surfacing-authentic-moments-b8cd145491cc</link>
      <guid>https://netflixtechblog.com/ava-discovery-view-surfacing-authentic-moments-b8cd145491cc</guid>
      <pubDate>Fri, 18 Aug 2023 00:07:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Curbing Connection Churn in Zuul]]></title>
      <description><![CDATA[<div class="ab ca ch bg fv fw fx fy"><p id="f311" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><em class="nr">By</em> <a class="af ns" href="https://twitter.com/agonigberg" rel="noopener ugc nofollow" target="_blank"><em class="nr">Arthur Gonigberg</em></a>, <a class="af ns" href="https://www.linkedin.com/in/argha-c" rel="noopener ugc nofollow" target="_blank"><em class="nr">Argha C</em></a></p><p id="b1e2" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">When <a class="af ns" href="https://github.com/Netflix/zuul" rel="noopener ugc nofollow" target="_blank">Zuul</a> was <a class="af ns" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/zuul-2-the-netflix-journey-to-asynchronous-non-blocking-systems-45947377fb5c">designed and developed</a>, there was an inherent assumption that connections were effectively free, given we weren’t using mutual TLS (mTLS). It’s built on top of <a class="af ns" href="https://netty.io/" rel="noopener ugc nofollow" target="_blank">Netty</a>, using event loops for non-blocking execution of requests, one loop per core. To reduce contention among event loops, we created connection pools for each, keeping them completely independent. The result is that the entire request-response cycle happens on the same thread, significantly reducing context switching.</p><p id="5f39" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">There is also a significant downside. It means that if each event loop has a connection pool that connects to every origin (our name for backend) server, there would be a multiplication of event loops by servers by Zuul instances. For example, a 16-core box connecting to an 800-server origin would have 12,800 connections. 
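The multiplication above is simple arithmetic; a quick sketch (the function name is ours, purely for illustration):

```python
def total_connections(event_loops_per_instance, origin_servers, zuul_instances=1):
    """Each event loop (one per core) keeps an independent pool holding
    a connection to every origin server, so the totals multiply."""
    return event_loops_per_instance * origin_servers * zuul_instances

# The example from the text: a 16-core box against an 800-server origin.
assert total_connections(16, 800) == 12_800
# A 100-instance Zuul cluster multiplies that again.
assert total_connections(16, 800, 100) == 1_280_000
```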
If the Zuul cluster has 100 instances, that’s 1,280,000 connections. That’s a significant amount and certainly more than is necessary relative to the traffic on most clusters.</p><p id="dea8" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">As streaming has grown over the years, these numbers multiplied with bigger Zuul and origin clusters. More acutely, if a traffic spike occurs and Zuul instances scale up, it exponentially increases connections open to origins. Although this has been a known issue for a long time, it has never been a critical pain point until we moved large streaming applications to mTLS and our Envoy-based service mesh.</p><p id="b226" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">The first step in improving connection overhead was implementing HTTP/2 (H2) multiplexing to the origins. Multiplexing allows the reuse of existing connections by creating multiple streams per connection, each able to send a request. Rather than requiring a connection for every request, we could reuse the same connection for many simultaneous requests. The more we reuse connections, the less overhead we have in establishing mTLS sessions with roundtrips, handshaking, and so on.</p><p id="1e92" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Although Zuul has had H2 proxying for some time, it never supported multiplexing. It effectively treated H2 connections as HTTP/1 (H1). For backward compatibility with existing H1 functionality, we modified the H2 connection bootstrap to create a stream and immediately release the connection back into the pool. Future requests will then be able to reuse the existing connection without creating a new one. Ideally, the connections to each origin server should converge towards 1 per event loop. 
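The bootstrap change is easier to see in miniature. Below is an illustrative, Python-flavored model of a multiplexing-aware pool (our own sketch, not Zuul’s actual Netty code): a connection is handed out for a new stream and immediately released back, so subsequent requests reuse it instead of opening another connection.

```python
class H2Pool:
    """Toy multiplexing-aware connection pool. Connections are modeled
    as active-stream counters; a real pool would track sockets."""

    def __init__(self, max_streams=100):
        self.max_streams = max_streams
        self.streams_per_conn = []  # active stream count per connection
        self.opened = 0             # how many real connections were created

    def acquire_stream(self):
        # Reuse any existing connection with stream capacity left.
        for i, streams in enumerate(self.streams_per_conn):
            if streams < self.max_streams:
                self.streams_per_conn[i] += 1
                return i
        # No capacity anywhere: open a new connection.
        self.streams_per_conn.append(1)
        self.opened += 1
        return len(self.streams_per_conn) - 1

pool = H2Pool()
conn_ids = [pool.acquire_stream() for _ in range(50)]
# 50 concurrent requests share one connection instead of opening 50.
```

Under H1 semantics each of those 50 requests would have pinned its own connection; here the pool converges toward one connection per event loop, as described above.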
It seems like a minor change, but it had to be seamlessly integrated into our existing metrics and connection bookkeeping.</p><p id="bfc2" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">The standard way to initiate H2 connections is, over TLS, via an upgrade with <a class="af ns" href="https://en.wikipedia.org/wiki/Application-Layer_Protocol_Negotiation" rel="noopener ugc nofollow" target="_blank">ALPN (Application-Layer Protocol Negotiation</a>). ALPN allows us to gracefully downgrade back to H1 if the origin doesn’t support H2, so we can broadly enable it without impacting customers. Service mesh being available on many services made testing and rolling out this feature very easy because it enables ALPN by default. It meant that no work was required by service owners who were already on service mesh and mTLS.</p><p id="b1e3" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Sadly, our plan hit a snag when we rolled out multiplexing. Although the feature was stable and functionally there was no impact, we didn’t get a reduction in overall connections. Because some origin clusters were so large, and we were connecting to them from all event loops, there wasn’t enough re-use of existing connections to trigger multiplexing. Even though we were now capable of multiplexing, we weren’t utilizing it.</p><p id="d22f" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">H2 multiplexing will improve connection spikes under load when there is a large demand for all the existing connections, but it didn’t help in steady-state. 
Partitioning the whole origin into subsets would allow us to reduce total connection counts while leveraging multiplexing to maintain existing throughput and headroom.</p><p id="805a" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">We had discussed subsetting many times over the years, but there was concern about disrupting load balancing with the algorithms available. An even distribution of traffic to origins is critical for accurate <a class="af ns" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f">canary analysis</a> and preventing hot-spotting of traffic on origin instances.</p><p id="4366" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Subsetting was also top of mind after reading a <a class="af ns" href="https://queue.acm.org/detail.cfm?id=3570937" rel="noopener ugc nofollow" target="_blank">recent ACM paper</a> published by Google. It describes an improvement on their long-standing <a class="af ns" href="https://sre.google/sre-book/load-balancing-datacenter/" rel="noopener ugc nofollow" target="_blank">Deterministic Subsetting</a> algorithm that they’ve used for many years. The Ringsteady algorithm (figure below) creates an evenly distributed ring of servers (yellow nodes) and then walks the ring to allocate them to each front-end task (blue nodes).</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg ff ph bg pi ow ox oy"><picture><img alt="" class="bg pj pk c" width="700" height="478" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pl pm pn ow ox po pp be b bf z dt"><em class="pq">The figure above is from Google’s</em> <a class="af ns" href="https://queue.acm.org/detail.cfm?id=3570937" rel="noopener ugc nofollow" target="_blank"><em class="pq">ACM paper</em></a></figcaption></figure><p id="f8b2" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">The algorithm relies on the idea of <a class="af ns" href="https://en.wikipedia.org/wiki/Low-discrepancy_sequence" rel="noopener ugc nofollow" target="_blank">low-discrepancy numeric sequences</a> to create a naturally balanced distribution ring that is more consistent than one built on a randomness-based consistent hash. The particular sequence used is a binary variant of the <a class="af ns" href="https://en.wikipedia.org/wiki/Van_der_Corput_sequence" rel="noopener ugc nofollow" target="_blank">Van der Corput sequence</a>. As long as the sequence of added servers is monotonically incrementing, for each additional server, the distribution will be evenly balanced between 0–1. Below is an example of what the binary Van der Corput sequence looks like.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div class="ow ox pr"><picture><img alt="" class="bg pj pk c" width="462" height="41" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="0235" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Another big benefit of this distribution is that it provides a consistent expansion of the ring as servers are removed and added over time, evenly spreading new nodes among the subsets. This results in the stability of subsets and no cascading churn based on origin changes over time. Each node added or removed will only affect one subset, and new nodes will be added to a different subset every time.</p><p id="1c89" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Here’s a more concrete demonstration of the sequence above, in decimal form, with each number between 0–1 assigned to 4 subsets. In this example, each subset has 0.25 of that range depicted with its own color.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg ff ph bg pi ow ox ps"><picture><img alt="" class="bg pj pk c" width="700" height="133" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="61c2" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">You can see that each new node added is balanced across subsets extremely well. If 50 nodes are added quickly, they will get distributed just as evenly. Similarly, if a large number of nodes are removed, it will affect all subsets equally.</p><p id="653a" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">The real killer feature, though, is that if a node is removed or added, it doesn’t require all the subsets to be shuffled and recomputed. Every single change will generally only create or remove one connection. This will hold for bigger changes, too, reducing almost all churn in the subsets.</p><p id="b345" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">Our approach to implement this in Zuul was to integrate with <a class="af ns" href="https://github.com/Netflix/eureka" rel="noopener ugc nofollow" target="_blank">Eureka</a> service discovery changes and feed them into a distribution ring, based on the ideas discussed above. When new origins register in Zuul, we load their instances and create a new ring, and from then on, manage it with incremental deltas. We also take the additional step of shuffling the order of nodes before adding them to the ring. This helps prevent accidental hot spotting or overlap among Zuul instances.</p><p id="a01a" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">The quirk in any load balancing algorithm from Google is that they do their <a class="af ns" href="https://sre.google/workbook/managing-load/#gslb" rel="noopener ugc nofollow" target="_blank">load balancing centrally</a>. Their centralized service creates subsets and load balances across their entire fleet, with a global view of the world. 
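The binary Van der Corput sequence described above is only a few lines of code. This is our own sketch of the idea (not the Ringsteady implementation): reflecting the binary digits of each index around the radix point places successive servers at evenly spread ring positions, so assigning ring slices to subsets stays balanced as servers are added in order.

```python
def van_der_corput(n: int) -> float:
    """Binary Van der Corput value for index n: reverse the binary digits
    around the radix point (1 -> 0.1b = 0.5, 2 -> 0.01b = 0.25, ...)."""
    x, denom = 0.0, 1.0
    while n:
        denom *= 2
        x += (n & 1) / denom
        n >>= 1
    return x

# Ring positions for the first 8 servers added, in order.
positions = [van_der_corput(i) for i in range(1, 9)]
# Illustrative subset assignment: 4 equal slices of the [0, 1) ring.
subsets = [int(p * 4) for p in positions]
# Each consecutive pair of additions lands in different subsets,
# and after 8 additions every subset holds exactly 2 servers.
```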
To use this algorithm, <strong class="mv gr">the key insight was to apply it to the event loops rather than the instances themselves</strong>. This allows us to continue having decentralized, client-side load balancing while also having the benefits of accurate subsetting. Although Zuul continues connecting to all origin servers, each event loop’s connection pool only gets a small subset of the whole. We end up with a singular, global view of the distribution that we can control on each instance — and a single sequence number that we can increment for each origin’s ring.</p><p id="f24a" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">When a request comes in, Netty assigns it to an event loop, and it remains there for the duration of the request-response lifecycle. After running the inbound filters, we determine the destination and load the connection pool for this event loop. This will pull from a mapping of loop-to-subset, giving us the limited set of nodes we’re looking for. We then load balance using a modified choice-of-2, as <a class="af ns" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/netflix-edge-load-balancing-695308b5548c">discussed before</a>. If this sounds familiar, it’s because there are no fundamental changes to how Zuul works. The only difference is that we provide a loop-bound subset of nodes to the load balancer as a starting point for its decision.</p><p id="e168" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Another insight we had was that we needed to replicate the number of subsets among the event loops. This allows us to maintain low connection counts for large and small origins. At the same time, having a reasonable subset size ensures we can continue providing good balance and resiliency features for the origin. 
Most origins require this because they are not big enough to create enough instances in each subset.</p><p id="6a76" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">However, we also don’t want to change this replication factor too often because it would cause a reshuffling of the entire ring and introduce a lot of churn. After a lot of iteration, we ended up implementing this by starting with an “ideal” subset size. We achieve this by computing the subset size that yields the ideal replication factor for a given cardinality of origin nodes. We can scale the replication factor across origins by growing our subsets until the desired subset size is achieved, especially as they scale up or down based on traffic patterns. Finally, we work backward to divide the ring into even slices based on the computed subset size.</p><p id="8950" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Our ideal subset size is roughly 25–50 nodes, so an origin with 400 nodes will have 8 subsets of 50 nodes. On a 32-core instance, we’ll have a replication factor of 4. However, that also means that between 200 and 400 nodes, we’re not shuffling the subsets at all. An example of this subset recomputation is in the rollout graphs <a class="af ns" href="https://medium.com/p/2feb273a3598#5e4d" rel="noopener">below</a>.</p><p id="6742" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">An interesting challenge here was to satisfy the dual constraints of origin nodes with a range of cardinality, and the number of event loops that hold the subsets. Our goal is to scale the subsets as we run on instances with more event loops, with a sub-linear increase in overall connections, and sufficient replication for availability guarantees. 
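One way to reproduce the worked example above is sketched below. This is our own simplification (names and the power-of-two choice are assumptions, and the production heuristic includes hysteresis to avoid reshuffling as origins resize): pick the fewest equal ring slices that keep each subset at or under the ideal size, then spread the slices over the event loops.

```python
def plan_subsets(origin_size, event_loops, ideal_size=50):
    """Illustrative subset planning: smallest power-of-two subset count
    that keeps each subset at or under the ideal size."""
    num_subsets = 1
    while origin_size / num_subsets > ideal_size:
        num_subsets *= 2
    subset_size = origin_size // num_subsets
    # How many event loops share each subset on one instance.
    replication = max(1, event_loops // num_subsets)
    return num_subsets, subset_size, replication

# The example from the text: a 400-node origin on a 32-core instance
# gives 8 subsets of 50 nodes with a replication factor of 4.
assert plan_subsets(400, 32) == (8, 50, 4)
```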
Elastically scaling the replication factor as described above helped us achieve this.</p><p id="d6f6" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">The results were outstanding. We saw improvements across all key metrics on Zuul, but most importantly, there was a significant reduction in total connection counts and churn.</p><h2 id="7e80" class="pt nu gq be nv pu pv dx nz pw px dz od ne py pz qa ni qb qc qd nm qe qf qg qh bj"><strong class="al">Total Connections</strong></h2></div><div class="ab ca ch bg fv fw fx fy"><p id="2abc" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">This graph (as well as the ones below) shows a week’s worth of data, with the typical diurnal cycle of Netflix usage. Each of the 3 colors represents our deployment regions in AWS, and the blue vertical line shows when we turned on the feature.</p><p id="5567" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj"><strong class="mv gr">Total connections at peak were significantly reduced in all 3 regions by a factor of 10x</strong>. This is a huge improvement, and it makes sense if you dig into how subsetting works. For example, a machine running 16 event loops could have 8 subsets — each subset is on 2 event loops. That means we’re dividing an origin by 8, hence an 8x improvement. As to why peak improvement goes up to 10x, it’s probably related to reduced churn (below).</p><h2 id="17f9" class="pt nu gq be nv pu pv dx nz pw px dz od ne py pz qa ni qb qc qd nm qe qf qg qh bj"><strong class="al">Churn</strong></h2></div><div class="ab ca ch bg fv fw fx fy"><p id="8bca" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">This graph is a good proxy for churn. It shows how many TCP connections Zuul is opening per second. 
You can see the before and after very clearly. Looking peak to peak, there is roughly an 8x improvement.</p><p id="0039" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">The decrease in churn is a testament to the stability of the subsets, even as origins scale up, down, and redeploy over time.</p><p id="3086" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Looking specifically at connections created in the pool, the reduction is even more impressive:</p></div><div class="ab ca ch bg fv fw fx fy"><p id="c2c1" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">The peak-to-peak reduction is massive and clearly shows how stable this distribution is. Although hard to see on the graph, the reduction went from thousands per second at peak down to about 60. There is <strong class="mv gr">effectively no churn of connections, even at peak traffic</strong>.</p><h2 id="5e4d" class="pt nu gq be nv pu pv dx nz pw px dz od ne py pz qa ni qb qc qd nm qe qf qg qh bj"><strong class="al">Load Balancing</strong></h2><p id="e5ac" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">The key constraint to subsetting is ensuring that the load balance on the backends is still consistent and evenly distributed. You’ll notice all the RPS on origin nodes grouped tightly, as expected. The thicker lines represent the subset size and the total origin size.</p></div><div class="ab ca ch bg fv fw fx fy"><p id="4422" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">In the second graph, you’ll note that we recompute the subset size (blue line) because the origin (purple line) became large enough that we could get away with less replication in the subsets. 
In this case, we went from a subset size of 100 for 400 servers (a division of 4) to 50 (a division of 8).</p><h2 id="2132" class="pt nu gq be nv pu pv dx nz pw px dz od ne py pz qa ni qb qc qd nm qe qf qg qh bj">System Metrics</h2><p id="6757" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">Given the significant reduction in connections, we saw reduced CPU utilization (~4%), heap usage (~15%), and latency (~3%) on Zuul, as well.</p><figure class="oz pa pb pc pd pe ow ox paragraph-image"><div role="button" tabindex="0" class="pf pg ff ph bg pi ow ox qx"><picture><img alt="" class="bg pj pk c" width="700" height="145" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pl pm pn ow ox po pp be b bf z dt">Zuul canary metrics</figcaption></figure><p id="005f" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">As we rolled this feature out to our largest origins — streaming playback APIs — we saw the pattern above continue, but with scale, it became more impressive. On some Zuul shards, we saw a reduction of as much as 13 million connections at peak, with almost no churn.</p><p id="4934" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Today the feature is rolled out widely. We’re serving the same amount of traffic but with tens of millions fewer connections. Despite the reduction of connections, there is no decrease in resiliency or load balancing. H2 multiplexing allows us to scale up requests separately from connections, and our subsetting algorithm ensures an even traffic balance.</p><p id="46a9" class="pw-post-body-paragraph mt mu gq mv b mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np nq gj bj">Although challenging to get right, subsetting is a worthwhile investment.</p><p id="15ab" class="pw-post-body-paragraph mt mu gq mv b mw or my mz na os nc nd ne ot ng nh ni ou nk nl nm ov no np nq gj bj">We would also like to thank <a class="af ns" href="https://twitter.com/flowblok" rel="noopener ugc nofollow" target="_blank">Peter Ward</a>, <a class="af ns" href="https://twitter.com/junyer" rel="noopener ugc nofollow" target="_blank">Paul Wankadia</a>, and <a class="af ns" href="https://www.linkedin.com/in/kavita-guliani/" rel="noopener ugc nofollow" target="_blank">Kavita Guliani</a> at Google for developing this algorithm and publishing their work for the benefit of the industry.</p></div>]]></description>
      <link>https://netflixtechblog.com/curbing-connection-churn-in-zuul-2feb273a3598</link>
      <guid>https://netflixtechblog.com/curbing-connection-churn-in-zuul-2feb273a3598</guid>
      <pubDate>Wed, 16 Aug 2023 19:55:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Detecting Scene Changes in Audiovisual Content]]></title>
      <description><![CDATA[<p id="3298" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><a class="af nq" href="https://www.linkedin.com/in/avneesh/" rel="noopener ugc nofollow" target="_blank">Avneesh Saluja</a>, <a class="af nq" href="https://www.linkedin.com/in/yaoandy/" rel="noopener ugc nofollow" target="_blank">Andy Yao</a>, <a class="af nq" href="https://www.linkedin.com/in/mhtaghavi/" rel="noopener ugc nofollow" target="_blank">Hossein Taghavi</a></p><p id="4996" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">When watching a movie or an episode of a TV show, we experience a cohesive narrative that unfolds before us, often without giving much thought to the underlying structure that makes it all possible. However, movies and episodes are not atomic units, but rather composed of smaller elements such as frames, shots, scenes, sequences, and acts. Understanding these elements and how they relate to each other is crucial for tasks such as video summarization and highlights detection, content-based video retrieval, dubbing quality assessment, and video editing. At Netflix, such workflows are performed hundreds of times a day by many teams around the world, so investing in algorithmically-assisted tooling around content understanding can reap outsized rewards.</p><p id="19ce" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">While segmentation of more granular units like frames and shot boundaries is either trivial or can primarily rely on <a class="af nq" href="https://arxiv.org/abs/2008.04838" rel="noopener ugc nofollow" target="_blank">pixel-based information</a>, higher order segmentation¹ requires a more nuanced understanding of the content, such as the narrative or emotional arcs. 
Furthermore, some cues can be better inferred from modalities other than the video, e.g. the screenplay or the audio and dialogue track. Scene boundary detection, in particular, is the task of identifying the transitions between scenes, where a scene is defined as a continuous sequence of shots that take place in the same time and location (often with a relatively static set of characters) and share a common action or theme.</p><p id="a4f0" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">In this blog post, we present two complementary approaches to scene boundary detection in audiovisual content. The first method, which can be seen as a form of <a class="af nq" href="http://ai.stanford.edu/blog/weak-supervision/" rel="noopener ugc nofollow" target="_blank">weak supervision</a>, leverages auxiliary data in the form of a screenplay by aligning screenplay text with timed text (closed captions, audio descriptions) and assigning timestamps to the screenplay’s scene headers (a.k.a. sluglines). In the second approach, we show that a relatively simple, supervised sequential model (bidirectional LSTM or GRU) that uses rich, pretrained shot-level embeddings can outperform the current state-of-the-art baselines on our internal benchmarks.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div class="ou ov ow"><picture><img alt="" class="bg pd pe c" width="600" height="338" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pf pg ph ou ov pi pj be b bf z dt">Figure 1: a scene consists of a sequence of shots.</figcaption></figure><p id="96a8" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">Screenplays are the blueprints of a movie or show. They are formatted in a specific way, with each scene beginning with a scene header, indicating attributes such as the location and time of day. This consistent formatting makes it possible to parse screenplays into a structured format. At the same time, a) changes made on the fly (directorial or actor discretion) or b) in post production and editing are rarely reflected in the screenplay, i.e. it isn’t rewritten to reflect the changes.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div class="ou ov pk"><picture><img alt="" class="bg pd pe c" width="550" height="367" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pf pg ph ou ov pi pj be b bf z dt">Figure 2: screenplay elements, from <em class="pl">The Witcher S1E1</em>.</figcaption></figure><p id="7235" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">In order to leverage this noisily aligned data source, we need to align time-stamped text (e.g. closed captions and audio descriptions) with screenplay text (dialogue and action² lines), bearing in mind a) the on-the-fly changes that might result in semantically similar but not identical line pairs and b) the possible post-shoot changes that are more significant (reordering, removing, or inserting entire scenes). To address the first challenge, we use pre trained sentence-level embeddings, e.g. from an embedding model optimized for <a class="af nq" href="https://www.sbert.net/examples/applications/paraphrase-mining/README.html" rel="noopener ugc nofollow" target="_blank">paraphrase identification</a>, to represent text in both sources. For the second challenge, we use <a class="af nq" href="https://en.wikipedia.org/wiki/Dynamic_time_warping" rel="noopener ugc nofollow" target="_blank">dynamic time warping</a> (DTW), a method for measuring the similarity between two sequences that may vary in time or speed. While DTW assumes a monotonicity condition on the alignments³ which is frequently violated in practice, it is robust enough to recover from local misalignments and the vast majority of salient events (like scene boundaries) are well-aligned.</p><p id="d92a" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">As a result of DTW, the scene headers have timestamps that can indicate possible scene boundaries in the video. 
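As a rough illustration of the DTW step, here is a generic textbook implementation (our own sketch, not Netflix’s pipeline), where `cost[i][j]` stands in for one minus the cosine similarity between screenplay line `i` and timed-text cue `j`:

```python
def dtw(cost):
    """Dynamic time warping over a pairwise cost matrix; returns the
    monotonic alignment path (i, j) with minimal accumulated cost."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack the cheapest predecessors to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
        if step == acc[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == acc[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Three screenplay lines vs. three cues, each most similar to its twin:
# the recovered path is the diagonal, i.e. a one-to-one alignment.
toy_cost = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
assert dtw(toy_cost) == [(0, 0), (1, 1), (2, 2)]
```

With real inputs, the aligned cue timestamps attached to each scene header are what yield the candidate scene boundaries described above.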
The alignments can also be used to e.g., augment audiovisual ML models with screenplay information like scene-level embeddings, or transfer labels assigned to audiovisual content to train screenplay prediction models.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div class="ou ov pk"><picture><img alt="" class="bg pd pe c" width="550" height="367" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pf pg ph ou ov pi pj be b bf z dt">Figure 3: alignments between screenplay and video via time stamped text for <em class="pl">The Witcher S1E1</em>.</figcaption></figure><p id="f72d" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">The alignment method above is a great way to get up and running with the scene change task since it combines easy-to-use pretrained embeddings with a well-known dynamic programming technique. However, it presupposes the availability of high-quality screenplays. A complementary approach (which in fact, can use the above alignments as a feature) that we present next is to train a sequence model on annotated scene change data. Certain workflows in Netflix capture this information, and that is our primary data source; publicly-released datasets are also available.</p><p id="ddd4" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">From an architectural perspective, the model is relatively simple — a bidirectional <a class="af nq" href="https://arxiv.org/abs/1412.3555" rel="noopener ugc nofollow" target="_blank">GRU</a> (biGRU) that ingests shot representations at each step and predicts if a shot is at the end of a scene.⁴ The richness in the model comes from these pretrained, multimodal shot embeddings, a preferable design choice in our setting given the difficulty in obtaining labeled scene change data and the relatively larger scale at which we can pretrain various embedding models for shots.</p><p id="a50d" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">For video embeddings, we leverage an in-house model pretrained on aligned video clips paired with text (the <a class="af nq" href="https://docs.google.com/document/d/1Wrnp_O4HsdOQTjCi7xuHyXdJn_95pOh0eLx1kNQTFLw/edit#heading=h.bab3dsmi08jm" rel="noopener ugc nofollow" 
target="_blank">aforementioned</a> “timestamped text”). For audio embeddings, we first perform <a class="af nq" href="https://research.deezer.com/projects/spleeter.html" rel="noopener ugc nofollow" target="_blank">source separation</a> to try and separate foreground (speech) from background (music, sound effects, noise), embed each separated waveform separately using <a class="af nq" href="https://arxiv.org/abs/2006.11477" rel="noopener ugc nofollow" target="_blank">wav2vec2</a>, and then concatenate the results. Both early and late-stage fusion approaches are explored; in the former (Figure 4a), the audio and video embeddings are concatenated and fed into a single biGRU, and in the latter (Figure 4b) each input modality is encoded with its own biGRU, after which the hidden states are concatenated prior to the output layer.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div class="ou ov ow"><picture><img alt="" class="bg pd pe c" width="600" height="338" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
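The two fusion variants can be sketched as follows. The toy encoder below is only a stand-in for the biGRU (purely to show where the per-shot concatenation happens, not the production model):

```python
def toy_encoder(sequence):
    """Stand-in sequence encoder: emits one 'hidden state' per step
    (here, the cumulative mean of the inputs seen so far)."""
    hidden = []
    running = [0.0] * len(sequence[0])
    for t, vec in enumerate(sequence, start=1):
        running = [r + v for r, v in zip(running, vec)]
        hidden.append([x / t for x in running])
    return hidden

def early_fusion(audio, video):
    # Concatenate modalities per shot, then encode the joint sequence once.
    joint = [a + v for a, v in zip(audio, video)]
    return toy_encoder(joint)

def late_fusion(audio, video):
    # Encode each modality with its own encoder, then concatenate
    # the hidden states per step, just before the output layer.
    ha, hv = toy_encoder(audio), toy_encoder(video)
    return [a + v for a, v in zip(ha, hv)]
```

With the real biGRUs, late fusion lets each modality keep its own temporal dynamics until the final concatenation.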
<figcaption class="pf pg ph ou ov pi pj be b bf z dt">Figure 4a: Early Fusion (concatenate embeddings at the input).</figcaption></figure><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pm pn ff po bg pp ou ov ow"><picture><img alt="" class="bg pd pe c" width="600" height="337" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pf pg ph ou ov pi pj be b bf z dt">Figure 4b: Late Fusion (concatenate prior to prediction output).</figcaption></figure><p id="70e6" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">We find:</p><ul class=""><li id="1f3b" class="ms mt gq mu b mv mw mx my mz na nb nc pq ne nf ng pr ni nj nk ps nm nn no np pt pu pv bj">Our results match and sometimes even outperform the <a class="af nq" href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Rao_A_Local-to-Global_Approach_to_Multi-Modal_Movie_Scene_Segmentation_CVPR_2020_paper.pdf" rel="noopener ugc nofollow" target="_blank">state-of-the-art</a> (benchmarked using the video modality only and on our evaluation data). We evaluate the outputs using F-1 score for the positive label, and also relax this evaluation to consider “off-by-<em class="pw">n</em>” F-1 i.e., if the model predicts scene changes within <em class="pw">n</em> shots of the ground truth. This is a more realistic measure for our use cases due to the human-in-the-loop setting that these models are deployed in.</li>
<li id="2888" class="ms mt gq mu b mv px mx my mz py nb nc pq pz nf ng pr qa nj nk ps qb nn no np pt pu pv bj">As with previous work, adding audio features improves results by 10–15%. A primary driver of variation in performance is late vs. early fusion.</li>
<li id="1f50" class="ms mt gq mu b mv px mx my mz py nb nc pq pz nf ng pr qa nj nk ps qb nn no np pt pu pv bj">Late fusion is consistently 3–7% better than early fusion. Intuitively, this result makes sense — the temporal dependencies between shots are likely modality-specific and should be encoded separately.</li>
</ul><p id="88a5" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">We have presented two complementary approaches to scene boundary detection that leverage a variety of available modalities — screenplay, audio, and video. Logically, the next steps are to a) combine these approaches and use screenplay features in a unified model and b) generalize the outputs across multiple shot-level inference tasks, e.g. shot type classification and memorable moments identification, as we hypothesize that this path would be useful for training general purpose video understanding models of longer-form content. Longer-form content also contains more complex narrative structure, and we envision this work as the first in a series of projects that aim to better integrate narrative understanding in our multimodal machine learning models.</p><p id="46b8" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><em class="pw">Special thanks to</em> <a class="af nq" href="https://www.linkedin.com/in/amirziai/" rel="noopener ugc nofollow" target="_blank"><em class="pw">Amir Ziai</em></a><em class="pw">,</em> <a class="af nq" href="https://www.linkedin.com/in/anna-pulido-61025063/" rel="noopener ugc nofollow" target="_blank"><em class="pw">Anna Pulido</em></a><em class="pw">, and</em> <a class="af nq" href="https://www.linkedin.com/in/angiepollema1/" rel="noopener ugc nofollow" target="_blank"><em class="pw">Angie Pollema</em></a><em class="pw">.</em></p>]]></description>
      <link>https://netflixtechblog.com/detecting-scene-changes-in-audiovisual-content-77a61d3eaad6</link>
      <guid>https://netflixtechblog.com/detecting-scene-changes-in-audiovisual-content-77a61d3eaad6</guid>
      <pubDate>Tue, 20 Jun 2023 18:51:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Migrating Netflix to GraphQL Safely]]></title>
      <description><![CDATA[<p id="5868" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">By <a class="af np" href="https://www.linkedin.com/in/jennifer-shin-0019a516/" rel="noopener ugc nofollow" target="_blank">Jennifer Shin</a>, <a class="af np" href="https://www.linkedin.com/in/tejas-shikhare-81027b19/" rel="noopener ugc nofollow" target="_blank">Tejas Shikhare</a>, <a class="af np" href="https://www.linkedin.com/in/willemmanuel/" rel="noopener ugc nofollow" target="_blank">Will Emmanuel</a></p><p id="00f0" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">In 2022, a major change was made to Netflix’s iOS and Android applications. We migrated Netflix’s mobile apps to GraphQL with zero downtime, which involved a total overhaul from the client to the API layer.</p><p id="f179" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Until recently, an internal API framework, <a class="af np" href="https://netflix.github.io/falcor/" rel="noopener ugc nofollow" target="_blank">Falcor</a>, powered our mobile apps. They are now backed by <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2">Federated GraphQL</a>, a distributed approach to APIs where domain teams can independently manage and own specific sections of the API.</p><p id="663a" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Doing this <strong class="mt gr">safely</strong> for 100s of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. 
The three strategies we will discuss today are <strong class="mt gr">AB Testing</strong>, <strong class="mt gr">Replay Testing,</strong> and <strong class="mt gr">Sticky Canaries</strong>.</p><p id="249a" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Before diving into these techniques, let’s briefly examine the migration plan.</p><p id="0221" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj"><strong class="mt gr">Before GraphQL: Monolithic Falcor API implemented and maintained by the API Team</strong></p><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div role="button" tabindex="0" class="pc pd ff pe bg pf ot ou ov"><picture><img alt="" class="bg pg ph c" width="700" height="336" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="dc3e" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Before moving to GraphQL, our API layer consisted of a monolithic server built with <a class="af np" href="https://netflix.github.io/falcor/" rel="noopener ugc nofollow" target="_blank">Falcor</a>. A single API team maintained both the Java implementation of the Falcor framework <em class="pi">and</em> the API Server.</p><p id="7ad5" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj"><strong class="mt gr">Created a GraphQL Shim Service on top of our existing Monolith Falcor API.</strong></p><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div role="button" tabindex="0" class="pc pd ff pe bg pf ot ou pj"><picture><img alt="" class="bg pg ph c" width="700" height="262" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="8fe5" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">By the summer of 2020, many UI engineers were ready to move to GraphQL. Instead of embarking on a full-fledged migration top to bottom, we created a GraphQL shim on top of our existing Falcor API. The GraphQL shim enabled client engineers to move quickly onto GraphQL, figure out client-side concerns like cache normalization, experiment with different GraphQL clients, and investigate client performance without being blocked by server-side migrations. To launch Phase 1 safely, we used <strong class="mt gr">AB Testing</strong>.</p><p id="fd21" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj"><strong class="mt gr">Deprecate the GraphQL Shim Service and Legacy API Monolith in favor of GraphQL services owned by the domain teams.</strong></p><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div role="button" tabindex="0" class="pc pd ff pe bg pf ot ou pk"><picture><img alt="" class="bg pg ph c" width="700" height="362" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="7f89" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We didn’t want the legacy Falcor API to linger forever, so we leaned into Federated GraphQL to power a single GraphQL API with multiple GraphQL servers.</p><p id="190d" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We could also swap out the implementation of a field from GraphQL Shim to Video API with federation directives. To launch Phase 2 safely, we used <strong class="mt gr">Replay Testing</strong> and <strong class="mt gr">Sticky Canaries</strong>.</p><p id="1b3c" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Two key factors determined our testing strategies:</p><ul class=""><li id="f773" class="mr ms gq mt b mu mv mw mx my mz na nb pl nd ne nf pm nh ni nj pn nl nm nn no po pp pq bj">Functional vs. non-functional requirements</li>
<li id="b25a" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj">Idempotency</li>
</ul><p id="c053" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">If we were testing <strong class="mt gr">functional requirements</strong> like data accuracy, and if the request was <strong class="mt gr">idempotent</strong>, we relied on <strong class="mt gr">Replay Testing</strong>. We knew we could test the same query with the same inputs and consistently expect the same results.</p><p id="1f23" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We couldn’t replay test GraphQL queries or mutations that requested non-idempotent fields.</p><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div class="ot ou pw"><picture><img alt="" class="bg pg ph c" width="646" height="765" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
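The two factors reduce to a small decision rule (an illustrative sketch, not an actual Netflix utility):

```python
def pick_strategy(functional: bool, idempotent: bool) -> str:
    """Choose a testing strategy from the two key factors: replay tests
    need deterministic (idempotent) responses to functional queries;
    everything else falls back to metrics-based testing."""
    if functional and idempotent:
        return "Replay Testing"
    # Non-functional requirements or non-idempotent fields: compare
    # aggregate behavior and metrics instead of response payloads.
    return "AB Testing / Sticky Canaries"
```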
</figure><p id="637b" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">And we definitely couldn’t replay test <strong class="mt gr">non-functional requirements</strong> like caching and logging user interaction. In such cases, we were not testing for response data but overall behavior. So, we relied on higher-level metrics-based testing: <strong class="mt gr">AB Testing</strong> and <strong class="mt gr">Sticky Canaries</strong>.</p><p id="7831" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Let’s discuss the three testing strategies in further detail.</p><p id="4404" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Netflix traditionally uses AB Testing to evaluate whether new product features resonate with customers. <strong class="mt gr">In Phase 1,</strong> we leveraged the AB testing framework to isolate a user segment into two groups totaling 1 million users. The control group’s traffic utilized the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render.</p><p id="bb3c" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We set up a client-side AB experiment that tested Falcor versus GraphQL and reported coarse-grained quality of experience metrics (<strong class="mt gr">QoE</strong>). The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. 
We spent the next few months diving into these high-level metrics and fixing issues such as cache TTLs, flawed client assumptions, etc.</p><h2 id="8905" class="px nr gq be ns py pz dx nw qa qb dz oa nc qc qd qe ng qf qg qh nk qi qj qk ql bj">Wins</h2><p id="c256" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj"><strong class="mt gr">High-Level Health Metrics:</strong> AB Testing provided the assurance we needed in our overall client-side GraphQL implementation. This helped us successfully migrate 100% of the traffic on the mobile homepage canvas to GraphQL in 6 months.</p><h2 id="1a7a" class="px nr gq be ns py pz dx nw qa qb dz oa nc qc qd qe ng qf qg qh nk qi qj qk ql bj">Gotchas</h2><p id="5d47" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj"><strong class="mt gr">Error Diagnosis:</strong> With an AB test, we could see coarse-grained metrics which pointed to potential issues, but it was challenging <strong class="mt gr">to diagnose</strong> the exact issues.</p><p id="3374" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">The next phase in the migration was to reimplement our existing Falcor API in a GraphQL-first server (Video API Service). The Falcor API had become a logic-heavy monolith with over a decade of tech debt. 
So we had to ensure that the reimplemented Video API server was bug-free and identical to the already productized Shim service.</p><p id="38d6" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We developed a Replay Testing tool to verify that <strong class="mt gr">idempotent</strong> APIs were migrated correctly from the GraphQL Shim to the Video API service.</p><p id="7ad1" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">The Replay Testing framework leverages the <em class="pi">@override</em> directive available in GraphQL Federation. This directive tells the GraphQL Gateway to route to one GraphQL server over another. Take, for instance, the following two GraphQL schemas defined by the Shim Service and the Video Service:</p><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div role="button" tabindex="0" class="pc pd ff pe bg pf ot ou qm"><picture><img alt="" class="bg pg ph c" width="700" height="151" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="3ecb" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The GraphQL Shim first defined the <em class="pi">certificationRating</em> field (things like Rated R or PG-13) in Phase 1. In Phase 2, we stood up the VideoService and defined the same <em class="pi">certificationRating</em> field marked with the <em class="pi">@override</em> directive. The presence of the identical field with the <em class="pi">@override</em> directive informed the GraphQL Gateway to route the resolution of this field to the new Video Service rather than the old Shim Service.</p><p id="3453" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The Replay Tester tool samples raw traffic streams from <a class="af np" href="https://netflix.github.io/mantis/" rel="noopener ugc nofollow" target="_blank">Mantis</a>. With these sampled events, the tool can capture a live request from production and run an <strong class="mt gr">identical</strong> GraphQL query against both the GraphQL Shim and the new Video API service. The tool then compares the results and outputs any differences in response payloads.</p><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div role="button" tabindex="0" class="pc pd ff pe bg pf ot ou qn"><picture><img alt="" class="bg pg ph c" width="700" height="464" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
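A minimal sketch of the payload-diffing step, assuming JSON-like dict/list responses flattened into path strings (a hypothetical helper, not the actual Replay Tester):

```python
def flatten(node, path=""):
    """Flatten nested JSON into {"/path/to/leaf": value} pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {path or "/": node}
    out = {}
    for key, child in items:
        out.update(flatten(child, f"{path}/{key}"))
    return out

def diff_payloads(control, experiment):
    """Return {path: (control_value, experiment_value)} for every
    leaf that differs between the two responses."""
    fc, fe = flatten(control), flatten(experiment)
    return {p: (fc.get(p), fe.get(p))
            for p in sorted(fc.keys() | fe.keys())
            if fc.get(p) != fe.get(p)}
```

Each mismatched leaf comes out as a flattened JSON node with the control value on the left and the experiment value on the right.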
</figure><p id="6e0c" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj"><strong class="mt gr">Note: We do not replay test Personally Identifiable Information. Replay Testing is used only for non-sensitive product features on the Netflix UI.</strong></p><p id="80ce" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Once the test is completed, the engineer can view the diffs displayed as a <em class="pi">flattened JSON node</em>. You can see the control value on the left side of the comma in parentheses and the experiment value on the right.</p><pre class="ow ox oy oz pa qo qp qq bo qr qs qt">/data/videos/0/tags/3/id: (81496962, null)</pre><pre class="qz qo qp qq bo qr qs qt">/data/videos/0/tags/5/displayName: (Série, value: “S\303\251rie”)</pre><p id="aeef" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We captured two diffs above: the first had missing data for an ID field in the experiment, and the second had an encoding difference. We also saw differences in localization, date precisions, and floating point accuracy. Replay Testing also gave us confidence in replicated business logic, where subscriber plans and user geographic location determined the customer’s catalog availability.</p><h2 id="8afe" class="px nr gq be ns py pz dx nw qa qb dz oa nc qc qd qe ng qf qg qh nk qi qj qk ql bj">Wins</h2><ul class=""><li id="69cd" class="mr ms gq mt b mu oo mw mx my op na nb pl oq ne nf pm or ni nj pn os nm nn no po pp pq bj"><strong class="mt gr">Confidence</strong> in parity between the two GraphQL implementations</li>
<li id="d3be" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj"><strong class="mt gr">Enabled tuning</strong> <strong class="mt gr">configs</strong> in cases where data was missing due to over-eager timeouts</li>
<li id="f271" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj"><strong class="mt gr">Tested</strong> <strong class="mt gr">business logic</strong> that required many (unknown) inputs and where correctness can be hard to eyeball</li>
</ul><h2 id="8f09" class="px nr gq be ns py pz dx nw qa qb dz oa nc qc qd qe ng qf qg qh nk qi qj qk ql bj">Gotchas</h2><ul class=""><li id="b889" class="mr ms gq mt b mu oo mw mx my op na nb pl oq ne nf pm or ni nj pn os nm nn no po pp pq bj"><strong class="mt gr">PII</strong> and non-idempotent APIs should <strong class="mt gr">not</strong> be tested using Replay Tests, and it would be valuable to have a mechanism to <strong class="mt gr"><em class="pi">prevent</em></strong> that.</li>
<li id="a289" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj"><strong class="mt gr">Manually constructed queries</strong> are only as good as the features the developer remembers to test. We ended up with untested fields simply because we forgot about them.</li>
<li id="6d9b" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj"><strong class="mt gr">Correctness:</strong> The idea of correctness can be confusing too. For example, is it more correct for an array to be empty or null, or is it just noise? Ultimately, we matched the existing behavior as much as possible because verifying the robustness of the client’s error handling was difficult.</li>
</ul><p id="7ab6" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Despite these shortcomings, Replay Testing was a key indicator that we had achieved functional correctness of <em class="pi">most</em> idempotent queries.</p><p id="4c39" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">While Replay Testing validates the functional correctness of the new GraphQL APIs, it does not provide any performance or business metric insight, such as the <strong class="mt gr">overall perceived health of user interaction</strong>. Are users clicking play at the same rates? Are things loading in time before the user loses interest? Replay Testing also cannot be used for non-idempotent API validation. We reached for a Netflix tool called the Sticky Canary to build confidence.</p><p id="2597" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">A Sticky Canary is an infrastructure experiment where customers are assigned either to a canary or baseline host for the entire duration of an experiment. All incoming traffic is allocated to an experimental or baseline host based on the customer’s device and profile, similar to a bucket hash. The experimental host deployment serves all the customers assigned to the experiment. Watch our <a class="af np" href="https://www.youtube.com/watch?v=Xbn65E-BQhA" rel="noopener ugc nofollow" target="_blank">Chaos Engineering</a> talk from AWS re:Invent to learn more about Sticky Canaries.</p><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div role="button" tabindex="0" class="pc pd ff pe bg pf ot ou ra"><picture><img alt="" class="bg pg ph c" width="700" height="308" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
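The "bucket hash" idea can be sketched as a deterministic, sticky assignment (the salt, IDs, and fraction below are invented for illustration; Netflix's allocator is more involved):

```python
import hashlib

def sticky_assignment(device_id, profile_id, canary_fraction=0.05, salt="exp-42"):
    """Deterministically map a (device, profile) pair to 'canary' or 'baseline'.
    The same pair always hashes to the same bucket, so the assignment stays
    sticky for the entire duration of the experiment."""
    key = f"{salt}:{device_id}:{profile_id}".encode()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "baseline"
```

Because the hash is keyed on device and profile rather than per-request randomness, every request from an assigned customer lands on the same cluster.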
</figure><p id="78f4" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">In the case of our GraphQL APIs, we used a Sticky Canary experiment to <strong class="mt gr">run two instances of our GraphQL gateway</strong>. The <strong class="mt gr">baseline</strong> gateway used the existing schema, which routes all traffic to the GraphQL Shim. The <strong class="mt gr">experimental</strong> gateway used the new proposed schema, which routes traffic to the latest Video API service. <a class="af np" href="https://github.com/Netflix/zuul" rel="noopener ugc nofollow" target="_blank">Zuul</a>, our primary edge gateway, assigns traffic to either cluster based on the experiment parameters.</p><p id="4b0b" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We then collect and analyze the performance of the two clusters. Some KPIs we monitor closely include:</p><ul class=""><li id="8a68" class="mr ms gq mt b mu mv mw mx my mz na nb pl nd ne nf pm nh ni nj pn nl nm nn no po pp pq bj">Median and tail latencies</li>
<li id="77c9" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj">Error rates</li>
<li id="7d23" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj">Logs</li>
<li id="43cb" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj">Resource utilization: CPU, network traffic, memory, disk</li>
<li id="f2e9" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj">Device QoE (Quality of Experience) metrics</li>
<li id="8c37" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj">Streaming health metrics</li>
</ul><figure class="ow ox oy oz pa pb ot ou paragraph-image"><div role="button" tabindex="0" class="pc pd ff pe bg pf ot ou rb"><picture><img alt="" class="bg pg ph c" width="700" height="467" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
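As a toy illustration of the first KPI, median and tail latency for the two clusters can be summarized like this (nearest-rank percentile; a sketch, not Netflix's analysis pipeline):

```python
import statistics

def percentile(samples, q):
    """Nearest-rank percentile over a list of latency samples (q in [0, 100])."""
    ordered = sorted(samples)
    rank = round(q / 100 * (len(ordered) - 1))
    return ordered[rank]

def summarize(samples_ms):
    """Median (p50) and tail (p99) latency for one cluster."""
    return {"p50": statistics.median(samples_ms), "p99": percentile(samples_ms, 99)}

def compare_latency(baseline_ms, canary_ms):
    """Side-by-side latency summary for the baseline and canary clusters."""
    return {"baseline": summarize(baseline_ms), "canary": summarize(canary_ms)}
```

Comparing p50 against p99 side by side surfaces regressions that hide in the tail even when the median looks healthy.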
</figure><p id="2d10" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We started small, with tiny customer allocations for hour-long experiments. After validating performance, we slowly built up scope. We increased the percentage of customer allocations, introduced multi-region tests, and eventually ran 12-hour or day-long experiments. Validating along the way is essential since Sticky Canaries impact live production traffic and are assigned persistently to a customer.</p><p id="f089" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">After several sticky canary experiments, we had assurance that phase 2 of the migration improved all core metrics, and we could dial up GraphQL globally with confidence.</p><h2 id="8d41" class="px nr gq be ns py pz dx nw qa qb dz oa nc qc qd qe ng qf qg qh nk qi qj qk ql bj">Wins</h2><p id="8d9a" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Sticky Canaries were essential to build confidence in our new GraphQL services.</p><ul class=""><li id="9151" class="mr ms gq mt b mu mv mw mx my mz na nb pl nd ne nf pm nh ni nj pn nl nm nn no po pp pq bj"><strong class="mt gr">Non-Idempotent APIs:</strong> These tests are compatible with mutating or non-idempotent APIs</li>
<li id="6fc2" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj"><strong class="mt gr">Business metrics:</strong> Sticky Canaries validated our core Netflix business metrics had improved after the migration</li>
<li id="baad" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj"><strong class="mt gr">System performance:</strong> Insights into latency and resource usage help us understand how scaling profiles change after migration</li>
</ul><h2 id="a4d0" class="px nr gq be ns py pz dx nw qa qb dz oa nc qc qd qe ng qf qg qh nk qi qj qk ql bj">Gotchas</h2><ul class=""><li id="1eb3" class="mr ms gq mt b mu oo mw mx my op na nb pl oq ne nf pm or ni nj pn os nm nn no po pp pq bj"><strong class="mt gr">Negative Customer Impact:</strong> Sticky Canaries can impact real users. We needed confidence in our new services before persistently routing some customers to them. This is partially mitigated by <em class="pi">real-time impact detection</em>, which will automatically cancel experiments.</li>
<li id="2c89" class="mr ms gq mt b mu pr mw mx my ps na nb pl pt ne nf pm pu ni nj pn pv nm nn no po pp pq bj"><strong class="mt gr">Short-lived:</strong> Sticky Canaries are meant for short-lived experiments. For longer-lived tests, a full-blown AB test should be used.</li>
</ul><p id="feb3" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Technology is constantly changing, and we, as engineers, spend a large part of our careers performing migrations. The question is not whether we are migrating but whether we are migrating <strong class="mt gr"><em class="pi">safely</em></strong>, with zero downtime, in a timely manner.</p><p id="c543" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">At Netflix, we have developed tools that ensure confidence in these migrations, targeted toward each specific use case being tested. We covered the three tools that we used for the GraphQL migration: <strong class="mt gr">AB Testing</strong>, <strong class="mt gr">Replay Testing</strong>, and <strong class="mt gr">Sticky Canaries</strong>.</p><p id="03f5" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">This blog post is part of our Migrating Critical Traffic series. Also, check out: Migrating Critical Traffic at Scale (<a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-1-ba1c7a1c7835">part 1</a>, <a class="af np" href="https://netflixtechblog.medium.com/migrating-critical-traffic-at-scale-with-no-downtime-part-2-4b1c8c7155c1" rel="noopener">part 2</a>) and <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba">Ensuring the Successful Launch of Ads</a>.</p>]]></description>
      <link>https://netflixtechblog.com/migrating-netflix-to-graphql-safely-8e1e4d4f1e72</link>
      <guid>https://netflixtechblog.com/migrating-netflix-to-graphql-safely-8e1e4d4f1e72</guid>
      <pubDate>Wed, 14 Jun 2023 19:59:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Migrating Critical Traffic At Scale with No Downtime — Part 2]]></title>
      <description><![CDATA[<p id="8c65" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><a class="af nq" href="https://www.linkedin.com/in/shyam-gala-5891224/" rel="noopener ugc nofollow" target="_blank">Shyam Gala</a>, <a class="af nq" href="https://www.linkedin.com/in/ivern/" rel="noopener ugc nofollow" target="_blank">Javier Fernandez-Ivern</a>, <a class="af nq" href="https://www.linkedin.com/in/rokkampratap/" rel="noopener ugc nofollow" target="_blank">Anup Rokkam Pratap</a>, <a class="af nq" href="https://www.linkedin.com/in/shahdewang/" rel="noopener ugc nofollow" target="_blank">Devang Shah</a></p><p id="cfbe" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">Picture yourself enthralled by the latest episode of your beloved Netflix series, delighting in an uninterrupted, high-definition streaming experience. Behind these perfect moments of entertainment is a complex mechanism, with numerous gears and cogs working in harmony. But what happens when this machinery needs a transformation? This is where large-scale system migrations come into play. Our <a class="af nq" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-1-ba1c7a1c7835">previous blog post</a> presented replay traffic testing — a crucial instrument in our toolkit that allows us to implement these transformations with precision and reliability.</p><p id="94fc" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><strong class="mu gr">Replay traffic testing gives us the initial foundation of validation, but as our migration process unfolds, we are met with the need for a carefully controlled migration process. A process that doesn’t just minimize risk, but also facilitates a continuous evaluation of the rollout’s impact. 
This blog post will delve into the techniques leveraged at Netflix to introduce these changes to production.</strong></p><p id="98b1" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj"><a class="af nq" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69">Canary</a> deployments are an effective mechanism for validating changes to a production backend service in a controlled and limited manner, thus mitigating the risk of unforeseen consequences that may arise due to the change. This process involves creating two new clusters for the updated service; a baseline cluster containing the current version running in production and a canary cluster containing the new version of the service. A small percentage of production traffic is redirected to the two new clusters, allowing us to monitor the new version’s performance and compare it against the current version. By collecting and analyzing key performance metrics of the service over time, we can assess the impact of the new changes and determine if they meet the availability, latency, and performance requirements.</p><p id="a799" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">Some product features require a lifecycle of requests between the customer device and a set of backend services to drive the feature. For instance, video playback functionality on Netflix involves requesting URLs for the streams from a service, calling the CDN to download the bits from the streams, requesting a license to decrypt the streams from a separate service, and sending telemetry indicating the successful start of playback to yet another service. 
By tracking metrics only at the level of the service being updated, we might miss capturing deviations in broader end-to-end system functionality.</p><p id="b80e" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><a class="af nq" href="https://www.infoq.com/presentations/sticky-canaries/" rel="noopener ugc nofollow" target="_blank">Sticky Canary</a> is an improvement to the traditional canary process that addresses this limitation. In this variation, the canary framework creates a pool of unique customer devices and then routes traffic for this pool consistently to the canary and baseline clusters for the duration of the experiment. Apart from measuring service-level metrics, the canary framework is able to keep track of broader system operational and customer metrics across the canary pool and thereby detect regressions on the entire request lifecycle flow.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov ow"><picture><img alt="" class="bg ph pi c" width="700" height="263" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt">Sticky Canary</figcaption></figure><p id="6a4d" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">It is important to note that with sticky canaries, devices in the canary pool continue to be routed to the canary throughout the experiment, potentially resulting in undesirable behavior persisting through retries on customer devices. Therefore, the canary framework is designed to monitor operational and customer KPI metrics to detect persistent deviations and terminate the canary experiment if necessary.</p><p id="1473" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><strong class="mu gr">Canaries and sticky canaries are valuable tools in the system migration process. Compared to replay testing, canaries allow us to extend the validation scope beyond the service level. They enable verification of the broader end-to-end system functionality across the request lifecycle for that functionality, giving us confidence that the migration will not cause any disruptions to the customer experience. Canaries also provide an opportunity to measure system performance under different load conditions, allowing us to identify and resolve any performance bottlenecks. They enable us to further fine-tune and configure the system, ensuring the new changes are integrated smoothly and seamlessly.</strong></p><p id="6f06" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">A/B testing is a widely recognized method for verifying hypotheses through a controlled experiment. It involves dividing a portion of the population into two or more groups, each receiving a different treatment. The results are then evaluated using specific metrics to determine whether the hypothesis is valid. 
The industry frequently employs this technique to assess hypotheses related to product evolution and user interaction. It is also <a class="af nq" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/a-b-testing-and-beyond-improving-the-netflix-streaming-experience-with-experimentation-and-data-5b0ae9295bdf">widely utilized at Netflix</a> to test changes to product behavior and customer experience.</p><p id="ee7c" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">A/B testing is also a valuable tool for assessing significant changes to backend systems. We can determine A/B test membership in either device application or backend code and selectively invoke new code paths and services. Within the context of migrations, A/B testing enables us to limit exposure to the migrated system by enabling the new path for a smaller percentage of the member base, thereby controlling the risk of unexpected behavior resulting from the new changes. A/B testing is also a key technique in migrations where the updates to the architecture involve changing device contracts as well.</p><p id="7874" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><strong class="mu gr">Canary experiments are typically conducted over periods ranging from hours to days. However, in certain instances, migration-related experiments may be required to span weeks or months to obtain a more accurate understanding of the impact on specific Quality of Experience (QoE) metrics. Additionally, in-depth analyses of particular business Key Performance Indicators (KPIs) may require longer experiments. For instance, envision a migration scenario where we enhance the playback quality, anticipating that this improvement will lead to more customers engaging with the play button. 
Assessing relevant metrics across a considerable sample size is crucial for obtaining a reliable and confident evaluation of the hypothesis. A/B frameworks work as effective tools to accommodate this next step in the confidence-building process.</strong></p><p id="a5fb" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">In addition to supporting extended durations, A/B testing frameworks offer other supplementary capabilities. This approach enables test allocation restrictions based on factors such as geography, device platforms, and device versions, while also allowing for analysis of migration metrics across similar dimensions. This ensures that the changes do not disproportionately impact specific customer segments. A/B testing also provides adaptability, permitting adjustments to allocation size throughout the experiment.</p><p id="857e" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">We might not use A/B testing for every backend migration. Instead, we use it for migrations in which changes are expected to impact device QoE or business KPIs significantly. For example, as discussed earlier, if the planned changes are expected to improve client QoE metrics, we would test the hypothesis via A/B testing.</p><p id="165a" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">After completing the various stages of validation, such as replay testing, sticky canaries, and A/B tests, we can confidently assert that the planned changes will not significantly impact SLAs (service-level agreements), device-level QoE, or business KPIs. However, it is imperative that the final rollout is regulated to ensure that any unnoticed and unexpected problems do not disrupt the customer experience. 
To this end, we have implemented traffic dialing as the last step in mitigating the risk associated with enabling the changes in production.</p><p id="f9d0" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">A dial is a software construct that enables the controlled flow of traffic within a system. This construct samples inbound requests using a distribution function and determines whether they should be routed to the new path or kept on the existing path. The decision-making process involves assessing whether the distribution function’s output aligns within the range of the predefined target percentage. The sampling is done consistently using a fixed parameter associated with the request. The target percentage is controlled via a globally scoped dynamic property that can be updated in real-time. By increasing or decreasing the target percentage, traffic flow to the new path can be regulated instantaneously.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov po"><picture><img alt="" class="bg ph pi c" width="700" height="323" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt">Dial</figcaption></figure><p id="9777" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">The selection of the actual sampling parameter depends on the specific migration requirements. A dial can be used to randomly sample all requests, which is achieved by selecting a variable parameter like a timestamp or a random number. Alternatively, in scenarios where the system path must remain constant with respect to customer devices, a constant device attribute such as deviceId is selected as the sampling parameter. Dials can be applied in several places, such as device application code, the relevant server component, or even at the API gateway for edge API systems, making them a versatile tool for managing migrations in complex systems.</p><p id="6589" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj"><strong class="mu gr">Traffic is dialed over to the new system in measured discrete steps. At every step, relevant stakeholders are informed, and key metrics are monitored, including service, device, operational, and business metrics. If we discover an unexpected issue or notice metrics trending in an undesired direction during the migration, the dial gives us the capability to quickly roll back the traffic to the old path and address the issue.</strong></p><p id="06f3" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">The dialing steps can also be scoped at the data center level if traffic is served from multiple data centers. We can start by dialing traffic in a single data center to allow for an easier side-by-side comparison of key metrics across data centers, thereby making it easier to observe any deviations in the metrics. The duration of how long we run the actual discrete dialing steps can also be adjusted. 
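The consistent sampling behind a dial can be sketched in a few lines. This is a simplified illustration rather than Netflix's actual implementation; the function name and the 0.01% bucket granularity are assumptions:

```python
import hashlib

def dial_routes_to_new_path(sampling_key: str, target_percent: float) -> bool:
    """Decide whether a request takes the new path.

    sampling_key: a fixed request attribute (e.g. deviceId) when routing
    must stay constant per device, or a random value for uniform sampling.
    target_percent: the globally scoped dynamic property (0-100) that can
    be updated in real time to dial traffic up or down.
    """
    digest = hashlib.sha256(sampling_key.encode("utf-8")).digest()
    # Map the key to a stable bucket in [0, 10000) for 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < target_percent * 100
```

Because the bucket is derived from a hash of the key rather than from stored state, the same device lands on the same side of the dial on every request, and raising `target_percent` only ever moves devices from the old path to the new one, never the reverse.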
Running the dialing steps for longer periods increases the probability of surfacing issues that may affect only a small group of members or devices and may have occurred at too low a rate to be captured during shadow traffic analysis. We can complete the final step of migrating all production traffic to the new system using a combination of gradual step-wise dialing and monitoring.</p><p id="0720" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">Stateful APIs pose unique challenges that require different strategies. While the replay testing technique discussed in the previous part of this blog series can be employed, additional measures <a class="af nq" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-1-ba1c7a1c7835">outlined earlier</a> are necessary.</p><p id="fa38" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">This alternate migration strategy has proven effective for our systems that meet certain criteria. Specifically, our data model is simple, self-contained, and immutable, with no relational aspects. Our system doesn’t require strict consistency guarantees and does not use database transactions. We adopt an ETL-based dual-write strategy that roughly follows this sequence of steps:</p><ul class=""><li id="b1c7" class="ms mt gq mu b mv mw mx my mz na nb nc pp ne nf ng pq ni nj nk pr nm nn no np ps pt pu bj"><strong class="mu gr">Initial Load through an ETL process:</strong> Data is extracted from the source data store, transformed into the new model, and written to the newer data store through an offline job. We use custom queries to verify the completeness of the migrated records.</li>
<li id="cf26" class="ms mt gq mu b mv pv mx my mz pw nb nc pp px nf ng pq py nj nk pr pz nn no np ps pt pu bj"><strong class="mu gr">Continuous migration via Dual-writes:</strong> We utilize an active-active/dual-writes strategy to migrate the bulk of the data. As a safety mechanism, we use dials (discussed previously) to control the proportion of writes that go to the new data store. To maintain state parity across both stores, we write all state-altering requests of an entity to both stores. This is achieved by selecting a sampling parameter that makes the dial sticky to the entity’s lifecycle. We incrementally turn the dial up as we gain confidence in the system while carefully monitoring its overall health. The dial also acts as a switch to turn off all writes to the new data store if necessary.</li>
<li id="75a7" class="ms mt gq mu b mv pv mx my mz pw nb nc pp px nf ng pq py nj nk pr pz nn no np ps pt pu bj"><strong class="mu gr">Continuous verification of records:</strong> When a record is read, the service reads from both data stores and verifies the functional correctness of the new record if found in both stores. One can perform this comparison live on the request path or offline based on the latency requirements of the particular use case. In the case of a live comparison, we can return records from the new datastore when the records match. This process gives us an idea of the functional correctness of the migration.</li>
<li id="ea84" class="ms mt gq mu b mv pv mx my mz pw nb nc pp px nf ng pq py nj nk pr pz nn no np ps pt pu bj"><strong class="mu gr">Evaluation of migration completeness:</strong> To verify the completeness of the migrated records, cold storage services take periodic data dumps from the two data stores, which are then compared. Gaps in the data are backfilled with an ETL process.</li>
<li id="2daa" class="ms mt gq mu b mv pv mx my mz pw nb nc pp px nf ng pq py nj nk pr pz nn no np ps pt pu bj"><strong class="mu gr">Cut-over and clean-up:</strong> Once the data is verified for correctness and completeness, dual writes and reads are disabled, any client code is cleaned up, and read/writes only occur to the new data store.</li>
</ul><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qa"><picture><img alt="" class="bg ph pi c" width="700" height="209" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt">Migrating Stateful Systems</figcaption></figure><p id="4b50" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">Clean-up of any migration-related code and configuration after the migration is crucial to ensure the system runs smoothly and efficiently and we don’t build up tech debt and complexity. Once the migration is complete and validated, all migration-related code, such as traffic dials, A/B tests, and replay traffic integrations, can be safely removed from the system. This includes cleaning up configuration changes, reverting to the original settings, and disabling any temporary components added during the migration. In addition, it is important to document the entire migration process and keep records of any issues encountered and their resolution. By performing a thorough clean-up and documentation process, future migrations can be executed more efficiently and effectively, building on the lessons learned from the previous migrations.</p><p id="8f53" class="pw-post-body-paragraph ms mt gq mu b mv op mx my mz oq nb nc nd or nf ng nh os nj nk nl ot nn no np gj bj">We have utilized a range of techniques outlined in our blog posts to conduct numerous large, medium, and small-scale migrations on the Netflix platform. Our efforts have been largely successful, with minimal to no downtime or significant issues encountered. Throughout the process, we have gained valuable insights and refined our techniques. It should be noted that not all of the techniques presented are universally applicable, as each migration presents its own unique set of circumstances. Determining the appropriate level of validation, testing, and risk mitigation requires careful consideration of several factors, including the nature of the changes, potential impacts on customer experience, engineering effort, and product priorities. 
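The dual-write and live-verification steps outlined above for stateful migrations can be sketched as follows. This is a hedged illustration only: the dict-backed stores, the `dial_percent` property, and the sticky hash are stand-ins for the real data stores and dial framework:

```python
import hashlib

def in_dial(entity_key: str, percent: float) -> bool:
    """Entity-sticky sampling: an entity's whole lifecycle lands on one side."""
    digest = hashlib.sha256(entity_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000 < percent * 100

class DualWriteStore:
    """Sketch of dual-writes with dial-gated writes and read-time verification."""

    def __init__(self, old_store: dict, new_store: dict, dial_percent: float = 0.0):
        self.old, self.new, self.dial_percent = old_store, new_store, dial_percent

    def write(self, key: str, value) -> None:
        self.old[key] = value                # the old store remains authoritative
        if in_dial(key, self.dial_percent):  # the dial gates new-store writes;
            self.new[key] = value            # setting it to 0 switches them all off

    def read(self, key: str):
        old_val, new_val = self.old.get(key), self.new.get(key)
        if new_val is not None and new_val == old_val:
            return new_val                   # records match: serve from the new store
        return old_val                       # fall back (and, in practice, log a mismatch)
```

In the real system the comparison can also run offline, and mismatch counts feed the decision to dial writes up, hold, or roll back.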
Ultimately, we aim to achieve seamless migrations without disruptions or downtime.</p><p id="b4dd" class="pw-post-body-paragraph ms mt gq mu b mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no np gj bj">In a series of forthcoming blog posts, we will explore a selection of specific use cases where the techniques highlighted in this blog series were utilized effectively. They will focus on a <a class="af nq" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba">comprehensive analysis of the Ads Tier Launch</a> and an extensive GraphQL migration for various product APIs. These posts will offer readers invaluable insights into the practical application of these methodologies in real-world situations.</p>]]></description>
      <link>https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-2-4b1c8c7155c1</link>
      <guid>https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-2-4b1c8c7155c1</guid>
      <pubDate>Tue, 13 Jun 2023 19:23:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Escrow Buddy: An open-source tool from Netflix for remediation of missing FileVault keys in MDM]]></title>
      <description><![CDATA[<div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><div class=""><h2 id="caf2" class="pw-subtitle-paragraph hn gp gq be b ho hp hq hr hs ht hu hv hw hx hy hz ia ib ic cp dt">Netflix has open-sourced Escrow Buddy, which helps Security and IT teams ensure they have valid FileVault recovery keys for all their Macs in MDM.</h2></div></div><div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><p id="0088" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">To be a client systems engineer is to take joy in small endpoint automations that make your fellow employees’ day a little better. When somebody is unable to log into their FileVault-encrypted Mac, few words are more joyful to hear than a support technician saying, “I’ve got your back. Let’s look up the recovery key.”</p><p id="7d8e" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">Securely and centrally escrowing FileVault personal recovery keys is one of many capabilities offered by Mobile Device Management (MDM). 
A configuration profile that contains the <a class="af oh" href="https://developer.apple.com/documentation/devicemanagement/fderecoverykeyescrow" rel="noopener ugc nofollow" target="_blank">FDERecoveryKeyEscrow</a> payload will cause any new recovery key generated on the device, either by initially enabling FileVault or by manually changing the recovery key, to be automatically escrowed to your MDM for later retrieval if needed.</p></div><div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><h2 id="bcd9" class="oi oj gq be ok ol om dx on oo op dz oq nu or os ot ny ou ov ow oc ox oy oz pa bj">The problem of missing FileVault keys</h2><p id="fbbd" class="pw-post-body-paragraph nl nm gq nn b ho pb np nq hr pc ns nt nu pd nw nx ny pe oa ob oc pf oe of og gj bj">However, just because you’re deploying the MDM escrow payload to your managed Macs doesn’t necessarily mean you have valid recovery keys for all of them. Recovery keys can be missing from MDM for numerous reasons:</p><ul class=""><li id="7059" class="nl nm gq nn b ho no np nq hr nr ns nt pg nv nw nx ph nz oa ob pi od oe of og pj pk pl bj">FileVault may have been enabled prior to enrollment in MDM</li>
<li id="f80f" class="nl nm gq nn b ho pm np nq hr pn ns nt pg po nw nx ph pp oa ob pi pq oe of og pj pk pl bj">The MDM escrow payload may not have been present on the Mac due to scoping issues or misconfiguration on your MDM</li>
<li id="1d48" class="nl nm gq nn b ho pm np nq hr pn ns nt pg po nw nx ph pp oa ob pi pq oe of og pj pk pl bj">The Macs may be migrating from a different MDM in which the keys are stored</li>
<li id="39cf" class="nl nm gq nn b ho pm np nq hr pn ns nt pg po nw nx ph pp oa ob pi pq oe of og pj pk pl bj">MDM database corruption or data loss events may have claimed some or all of your escrowed keys</li>
</ul><p id="71b5" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">Regardless of the cause, the effect is people who get locked out of their Macs must resort to wiping their computer and starting fresh — a productivity killer if your data is backed up, and a massive data loss event if it’s not backed up.</p></div><div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><h2 id="eeb7" class="oi oj gq be ok ol om dx on oo op dz oq nu or os ot ny ou ov ow oc ox oy oz pa bj">Less than ideal solutions</h2><p id="1b39" class="pw-post-body-paragraph nl nm gq nn b ho pb np nq hr pc ns nt nu pd nw nx ny pe oa ob oc pf oe of og gj bj">IT and security teams have approached this problem from multiple angles in the past. On a per-computer basis, a new key can be generated by disabling and re-enabling FileVault, but this leaves the computer in an unencrypted state briefly and requires multiple steps. The built-in <code class="cw pr ps pt pu b">fdesetup</code> command line tool can also be used to generate a new key, but not all users are comfortable entering Terminal commands. Plus, neither of these ideas scale to meet the needs of a fleet of Macs hundreds or thousands strong.</p><p id="1b73" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">Another approach has been to use a tool capable of displaying an onscreen text input field to the user in order to display a password prompt, and then pass the provided password as input to the <code class="cw pr ps pt pu b">fdesetup</code> tool for generating a new key. However, this requires IT and security teams to communicate in advance of the remediation campaign to affected users, in order to give them the context they need to respond to the additional password prompt. 
Even more concerning, this password prompt approach has a detrimental effect on security culture because it contributes to “consent fatigue.” Users will be more likely to approve other types of password prompt, which may inadvertently prime them to be targeted by malware or ransomware.</p><p id="c982" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">The ideal solution would be one which can be automated across your entire fleet while not requiring any additional user interaction.</p></div><div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><h2 id="8e69" class="oi oj gq be ok ol om dx on oo op dz oq nu or os ot ny ou ov ow oc ox oy oz pa bj">Crypt and its authorization plugin</h2><p id="59f5" class="pw-post-body-paragraph nl nm gq nn b ho pb np nq hr pc ns nt nu pd nw nx ny pe oa ob oc pf oe of og gj bj"><a class="af oh" href="https://developer.apple.com/documentation/security/authorization_plug-ins" rel="noopener ugc nofollow" target="_blank">macOS authorization plugins</a> provide a way to connect with Apple’s authorization services API and participate in decisions around user login. They can also facilitate automations that require information available only in the “login window” context, such as the provided username and password.</p><p id="e751" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">Relatively few authorization plugins are broadly used within the Mac admin community, but one popular example is the <a class="af oh" href="https://github.com/grahamgilbert/crypt" rel="noopener ugc nofollow" target="_blank">Crypt</a> agent. In its typical configuration the Crypt agent enforces FileVault upon login and escrows the resulting recovery key to a corresponding <a class="af oh" href="https://github.com/grahamgilbert/crypt-server" rel="noopener ugc nofollow" target="_blank">Crypt server</a>. 
The agent also enables rotation of recovery keys after use, local storage and validation of recovery keys, and other features.</p><p id="50da" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">While the Crypt agent can be deployed standalone and configured to simply regenerate a key upon next login, escrowing keys to MDM isn’t Crypt’s primary use case. Additionally, not all organizations have the time, expertise, or interest to commit to hosting a Crypt server and its accompanying database, or auditing the parts of Crypt’s codebase relating to its server capabilities.</p></div><div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><h2 id="8569" class="oi oj gq be ok ol om dx on oo op dz oq nu or os ot ny ou ov ow oc ox oy oz pa bj">Introducing Escrow Buddy</h2><p id="de47" class="pw-post-body-paragraph nl nm gq nn b ho pb np nq hr pc ns nt nu pd nw nx ny pe oa ob oc pf oe of og gj bj">Inspired by Crypt’s example, our Client Systems Engineering team created a minimal authorization plugin focused on serving the needs of organizations who escrow FileVault keys to MDM only. We call this new tool <strong class="nn gr">Escrow Buddy</strong>.</p><figure class="py pz qa qb qc qd pv pw paragraph-image"><a href="https://github.com/macadmins/escrow-buddy">
<div class="pv pw px"><picture><img alt="Escrow Buddy logo" class="bg qe qf c" width="300" height="200" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</a></figure><p id="a916" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">Escrow Buddy’s authorization plugin includes a mechanism that, when added to the macOS login authorization database, will use the logging-in user’s credentials as input to the <strong class="nn gr">fdesetup</strong> tool to automatically and seamlessly generate a new key during login. By integrating with the familiar and trusted macOS login experience, Escrow Buddy eliminates the need to display additional prompts or on-screen messages.</p><p id="00fe" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">Security and IT teams can take advantage of Escrow Buddy in three steps:</p><ol class=""><li id="5939" class="nl nm gq nn b ho no np nq hr nr ns nt pg nv nw nx ph nz oa ob pi od oe of og qg pk pl bj"><strong class="nn gr">Ensure your MDM is deploying the</strong> <a class="af oh" href="https://developer.apple.com/documentation/devicemanagement/fderecoverykeyescrow" rel="noopener ugc nofollow" target="_blank"><strong class="nn gr">FDERecoveryKeyEscrow</strong></a> <strong class="nn gr">payload</strong> to your managed Macs<strong class="nn gr">.</strong> This will ensure any newly generated FileVault key, no matter the method of generation, will be automatically escrowed to MDM.</li>
<li id="96e5" class="nl nm gq nn b ho pm np nq hr pn ns nt pg po nw nx ph pp oa ob pi pq oe of og qg pk pl bj"><strong class="nn gr">Deploy Escrow Buddy.</strong> The latest installer is available <a class="af oh" href="https://github.com/macadmins/escrow-buddy/releases/latest" rel="noopener ugc nofollow" target="_blank">here</a>, and you can choose to deploy to all your managed Macs or just the subset for which you need to escrow new keys.</li>
<li id="cd78" class="nl nm gq nn b ho pm np nq hr pn ns nt pg po nw nx ph pp oa ob pi pq oe of og qg pk pl bj">On Macs that lack a valid escrowed key, <strong class="nn gr">configure your MDM to run this command in root context</strong>:</li>
</ol><pre class="py pz qa qb qc qh pu qi bo qj qk ql">defaults write /Library/Preferences/com.netflix.Escrow-Buddy.plist GenerateNewKey -bool true</pre><p id="ec89" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">That’s it! At next startup or login, the specified Macs should generate a new key, which will be automatically escrowed to your MDM when the Mac next <a class="af oh" href="https://developer.apple.com/documentation/devicemanagement/securityinforesponse/securityinfo" rel="noopener ugc nofollow" target="_blank">responds to a SecurityInfo command</a>. (Timing varies by MDM vendor but this is often during an inventory update.)</p></div><div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><h2 id="41a5" class="oi oj gq be ok ol om dx on oo op dz oq nu or os ot ny ou ov ow oc ox oy oz pa bj">Community contribution</h2><p id="4522" class="pw-post-body-paragraph nl nm gq nn b ho pb np nq hr pc ns nt nu pd nw nx ny pe oa ob oc pf oe of og gj bj">Netflix is making Escrow Buddy’s source available via the Mac Admins Open Source organization on GitHub, the home of many other important projects in the Mac IT and security community, including <a class="af oh" href="https://github.com/macadmins/nudge" rel="noopener ugc nofollow" target="_blank">Nudge</a>, <a class="af oh" href="https://github.com/macadmins/installapplications" rel="noopener ugc nofollow" target="_blank">InstallApplications</a>, <a class="af oh" href="https://github.com/macadmins/outset" rel="noopener ugc nofollow" target="_blank">Outset</a>, and the <a class="af oh" href="https://github.com/macadmins/munki-builds" rel="noopener ugc nofollow" target="_blank">Munki signed builds</a>. 
Thousands of organizations worldwide benefit from the tools and ideas shared by the Mac admin community, and Netflix is excited that Escrow Buddy will be among them.</p><p id="6377" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">The <a class="af oh" href="https://github.com/macadmins/escrow-buddy" rel="noopener ugc nofollow" target="_blank">Escrow Buddy repository</a> leverages GitHub Actions to streamline the process of building new codesigned and notarized releases when new changes are merged into the <code class="cw pr ps pt pu b">main</code> branch. Our hope is that this will make it easy for contributors to collaborate and improve upon Escrow Buddy.</p></div><div class="gj gk gl gm gn ab ca ch bg fv fw fx fy"><h2 id="0c82" class="oi oj gq be ok ol om dx on oo op dz oq nu or os ot ny ou ov ow oc ox oy oz pa bj">A rising tide…</h2><p id="b486" class="pw-post-body-paragraph nl nm gq nn b ho pb np nq hr pc ns nt nu pd nw nx ny pe oa ob oc pf oe of og gj bj">Escrow Buddy represents our desire to elevate the industry standard around FileVault key regeneration. If your organization currently employs a password prompt workflow for this scenario, please consider trying Escrow Buddy instead. We hope you’ll find it more automatic and more supportive of security culture, and that it enables you to more often say “I’ve got your back” to your fellow employees who need a recovery key.</p><p id="14aa" class="pw-post-body-paragraph nl nm gq nn b ho no np nq hr nr ns nt nu nv nw nx ny nz oa ob oc od oe of og gj bj">— <a class="af oh" href="https://www.linkedin.com/in/reallyelliot/" rel="noopener ugc nofollow" target="_blank">Elliot Jordan</a></p></div>]]></description>
      <link>https://netflixtechblog.com/escrow-buddy-an-open-source-tool-from-netflix-for-remediation-of-missing-filevault-keys-in-mdm-815aef5107cd</link>
      <guid>https://netflixtechblog.com/escrow-buddy-an-open-source-tool-from-netflix-for-remediation-of-missing-filevault-keys-in-mdm-815aef5107cd</guid>
      <pubDate>Mon, 12 Jun 2023 18:36:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Native Frame Rate Playback]]></title>
      <description><![CDATA[<p id="2f95" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj"><em class="np">by</em> <a class="af nq" href="https://www.linkedin.com/in/akshaygarg05/" rel="noopener ugc nofollow" target="_blank"><em class="np">Akshay Garg</em></a><em class="np">,</em> <a class="af nq" href="https://www.linkedin.com/in/rquero/" rel="noopener ugc nofollow" target="_blank"><em class="np">Roger Quero</em></a></p><p id="7227" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">Maximizing immersion for our members is an important goal for the Netflix product and engineering teams to keep our members entertained and fully engaged in our content. Leveraging a good mix of mature and cutting-edge client device technologies to deliver a smooth playback experience with glitch-free in-app transitions is an important step towards achieving this goal. In this article we explain our journey towards productizing a better viewing experience for our members by utilizing features and capabilities in consumer streaming devices.</p><p id="883c" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">If you have a streaming device connected to your TV, such as a Roku Set Top Box (STB) or an Amazon FireTV Stick, you may have come across an option in the device display setting pertaining to content frame rate. Device manufacturers often call this feature “Match Content Frame Rate”, “Auto adjust display refresh rate” or something similar. 
If you’ve ever wondered what these features are and how they can improve your viewing experience, keep reading — the following sections cover the basics of this feature and explain the details of how the Netflix application uses it.</p><p id="e090" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">Netflix’s content catalog is composed of video captured and encoded in one of various frame rates ranging from 23.97 to 60 frames per second (fps). When a member chooses to watch a movie or a TV show on a <strong class="mt gr"><em class="np">source device</em></strong> (ex. Set-top box, Streaming stick, Game Console, etc…) the content is delivered and then decoded at its <strong class="mt gr"><em class="np">native frame rate</em></strong>, which is the frame rate it was captured and encoded in. After the decode step, the source device converts it to the HDMI output frame rate which was configured based on the capabilities of the HDMI input port of the connected <strong class="mt gr"><em class="np">sink device</em></strong> (TV, AVR, Monitor etc). In general, the output frame rate over HDMI is automatically set to 50fps for <a class="af nq" href="https://en.wikipedia.org/wiki/PAL" rel="noopener ugc nofollow" target="_blank">PAL</a> regions and 60fps for <a class="af nq" href="https://en.wikipedia.org/wiki/NTSC" rel="noopener ugc nofollow" target="_blank">NTSC</a> regions.</p><p id="15b2" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Netflix offers limited high frame rate content (50fps or 60fps), but the majority of our catalog and viewing hours can be attributed to members watching 23.97 to 30fps content. 
This essentially means that most of the time, our content goes through a process called <strong class="mt gr"><em class="np">frame rate conversion</em></strong> (aka FRC) on the source device which converts the content from its native frame rate to match the HDMI output frame rate by replicating frames. Figure 1 illustrates a simple FRC algorithm that converts 24fps content to 60fps.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov ow"><picture><img alt="" class="bg ph pi c" width="700" height="394" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">Figure 1 : 3:2 pulldown technique to convert 24FPS content to 60FPS</strong></figcaption></figure><p id="e8ba" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Converting the content and transmitting it over HDMI at the output frame rate sounds logical and straightforward. In fact, FRC works well when the output frame rate is an integer multiple of the native frame rate ( ex. 24→48, 25→50, 30→60, 24→120, etc…). On the other hand, FRC introduces a visual artifact called <strong class="mt gr">Judder</strong> when non-integer multiple conversion is required (ex. 24→60, 25→60, etc…), which manifests as choppy video playback as illustrated below:</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div class="ou ov po"><picture><img alt="" class="bg ph pi c" width="300" height="300" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">With Judder</strong></figcaption></figure><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div class="ou ov po"><picture><img alt="" class="bg ph pi c" width="300" height="300" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">Without Judder</strong></figcaption></figure><p id="e24a" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">It is important to note that the severity of the judder depends on the replication pattern. For this reason, judder is more prominent in PAL regions because of the process of converting 24fps content to 50fps over HDMI (see Figure 2):</p><ul class=""><li id="8a1f" class="mr ms gq mt b mu mv mw mx my mz na nb pp nd ne nf pq nh ni nj pr nl nm nn no ps pt pu bj">A total of 50 frames must be transmitted over HDMI per second</li>
<li id="9635" class="mr ms gq mt b mu pv mw mx my pw na nb pp px ne nf pq py ni nj pr pz nm nn no ps pt pu bj">The source device must replicate the original 24 frames to fill in the 26 missing frames</li>
<li id="bf2e" class="mr ms gq mt b mu pv mw mx my pw na nb pp px ne nf pq py ni nj pr pz nm nn no ps pt pu bj">50 output frames from 24 original frames are derived as follows:</li>
<li id="a4d9" class="mr ms gq mt b mu pv mw mx my pw na nb pp px ne nf pq py ni nj pr pz nm nn no ps pt pu bj">22 frames are duplicated (a total of 44 frames)</li>
<li id="6d8a" class="mr ms gq mt b mu pv mw mx my pw na nb pp px ne nf pq py ni nj pr pz nm nn no ps pt pu bj">2 frames are repeated three times (a total of 6 frames)</li>
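</ul>
<p><em>The replication arithmetic above can be sketched in a few lines of Python. This is an illustrative model only (a hypothetical <code>frc_pattern</code> helper, not Netflix or device-vendor code), and it groups the extra repeats together rather than interleaving them 3,2,3,2,… the way a real 3:2 pulldown does, but the frame totals match the figures above:</em></p>

```python
def frc_pattern(native_fps: int, output_fps: int) -> list[int]:
    """Return how many times each of the native_fps source frames is shown
    per second when converting to output_fps (simplified FRC model)."""
    base, extra = divmod(output_fps, native_fps)
    # 'extra' source frames receive one additional repeat each
    return [base + 1] * extra + [base] * (native_fps - extra)

pulldown = frc_pattern(24, 60)  # 3:2 pulldown: 12 frames shown 3x, 12 shown 2x
pal = frc_pattern(24, 50)       # 2 frames shown 3x (6), 22 shown 2x (44) = 50
```

<ul class="">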
</ul><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qa"><picture><img alt="" class="bg ph pi c" width="700" height="327" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">Figure 2: Example of a 24 to 50fps frame rate conversion algorithm</strong></figcaption></figure><p id="7015" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">As a review, judder is more pronounced when the frame replication pattern is uneven and irregular: in the scenario mentioned above, the frame replication factor varies between 2 and 3, resulting in a more prominent judder.</p><p id="c6f3" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">Now that we have a better understanding of the issue, let’s review the solutions that Netflix has invested in. Due to the fragmented nature of device capabilities in the ecosystem, we explored multiple solutions to address this issue for as many devices as possible. Each unique solution leverages existing or new source device capabilities and comes with various tradeoffs.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qb"><picture><img alt="" class="bg ph pi c" width="700" height="107" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="8969" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The first solution we explored and recently enabled leverages the capability of existing source &amp; sink devices to change the outgoing frame rate on the HDMI link. Once this feature is enabled in the system settings, devices will match the HDMI output frame rate with the content frame rate, either exactly or at an integer multiple, without user intervention.</p><p id="f39e" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">While this sounds like the perfect solution, devices that support older HDMI technologies, e.g. HDMI v&lt;2.1, can’t change the frame rate without also changing the HDMI data rate. This results in what is often referred to as an “HDMI bonk”, which causes the TV to display a blank screen momentarily. Not only is this a disruptive experience for members, but the duration of the blank screen varies depending on how fast the source and sink devices can resynchronize. Figure 3 below is an example of how this transition looks:</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qc"><picture><img alt="" class="bg ph pi c" width="700" height="350" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">Figure 3: Native frame rate experience with screen blanking</strong></figcaption></figure><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qd"><picture><img alt="" class="bg ph pi c" width="700" height="105" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="3ed4" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Improvements in the recent HDMI standards (HDMI 2.1+) now allow a source device to send the video content at its native frame rate without needing an HDMI resynchronization. This is possible through an innovative technology called <a class="af nq" href="https://www.hdmi.org/spec21sub/quickmediaswitching" rel="noopener ugc nofollow" target="_blank">Quick Media Switching</a> (QMS) which is an extension of <a class="af nq" href="https://www.hdmi.org/spec21sub/variablerefreshrate" rel="noopener ugc nofollow" target="_blank">Variable Refresh Rate</a> (VRR) targeted for content playback scenarios. QMS allows a source device to maintain a constant data rate on the HDMI link even during transmission of content with different frame rates. It does so by adjusting the amount of non-visible padding data while keeping the amount of visible video data constant. Due to the constant HDMI data rate, the HDMI transmitter and receiver don’t need to resynchronize, leading to a seamless/glitch-free transition as illustrated in Figure 4.</p><p id="eade" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">HDMI QMS is positioned to be the ideal solution to address the problem we are presenting. Unfortunately, at present, this technology is relatively new and adoption into source and sink devices will take time.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qc"><picture><img alt="" class="bg ph pi c" width="700" height="350" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">Figure 4: Native frame rate experience without screen blanking using HDMI QMS</strong></figcaption></figure><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qe"><picture><img alt="" class="bg ph pi c" width="700" height="106" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="0597" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Apart from the above HDMI-specification-dependent solutions, it is possible for an application like Netflix to manipulate the <a class="af nq" href="https://en.wikipedia.org/wiki/Presentation_timestamp" rel="noopener ugc nofollow" target="_blank">presentation time stamp</a> value of each video frame to minimize the effect of judder, i.e. the application can present video frames to the underlying source device platform at a cadence that helps the source device minimize the judder associated with FRC on the HDMI output link.</p><p id="41a1" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Let us understand this idea with the help of an example. Let’s go back to the same 24 to 50 fps FRC scenario that was covered earlier. But instead of thinking about the FRC rate per second (24 ⇒ 50 fps), let’s expand the FRC calculation time period to 3 seconds (24*3 = 72 frames ⇒ 50*3 = 150 frames). For content with a native frame rate of 24 fps, the source device needs to get 72 frames from the streaming application in a period of 3 seconds. Now, instead of sending 24 frames per second at a regular per-second cadence, for each 3-second period the Netflix application can decide to send 25 frames in each of the first 2 seconds (25 x 2 = 50) and 22 frames in the 3rd second, thereby still sending a total of 72 (50+22) frames in 3 seconds. This approach creates an even FRC in the first 2 seconds (25 frames replicated twice evenly), and in the 3rd second the source device can do a 22 to 50 fps FRC, which creates less visual judder than a 24-&gt;50 fps FRC thanks to its more even frame replication pattern.
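</p><p><em>As a back-of-the-envelope sketch (a hypothetical <code>pts_cadence</code> helper, not the actual algorithm), the 24-to-50 cadence above can be computed like this:</em></p>

```python
def pts_cadence(native_fps: int = 24, output_fps: int = 50, window_s: int = 3) -> list[int]:
    """Per-second frame counts over a window: the leading seconds carry a
    count that replicates evenly into output_fps, the last the remainder."""
    total = native_fps * window_s     # 72 source frames to present in 3 s
    even = output_fps // 2            # 25 frames/s replicate exactly 2x -> 50
    counts = [even] * (window_s - 1)  # [25, 25]
    counts.append(total - even * (window_s - 1))  # remainder: 22
    return counts

cadence = pts_cadence()  # [25, 25, 22]: still 72 frames over 3 seconds
```

<p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">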
This concept is illustrated in Figure 5 below.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qf"><picture><img alt="" class="bg ph pi c" width="700" height="387" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">Figure 5: FRC Algorithm from Solution#3 for 24 to 50 fps conversion</strong></figcaption></figure><p id="8d1d" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">NOTE: This solution was developed by <a class="af nq" href="https://www.linkedin.com/in/david-weiguo-zheng-7409724/" rel="noopener ugc nofollow" target="_blank">David Zheng</a> in the Partner Experience Technology team at Netflix. Watch out for an upcoming article going into further details of this solution.</p><p id="7e0a" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">Given the possible solutions available to use and the associated benefits and limitations, the Netflix application running on a source device adapts to use one of these approaches based on factors such as source and sink device capabilities, user preferences and the specific use case within the Netflix application. Let’s walk through each of these aspects briefly.</p><p id="8a0a" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">Every source device that integrates the Netflix application is required to let the application know if it and the connected sink device have the ability to send and receive video content at its native frame rate. 
In addition, a source device is required to inform whether it can support QMS and perform a seamless playback start of any content at its native frame rate on the connected HDMI link.</p><p id="6aa4" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">As discussed in the introduction section, the presence of a system setting like “Match Content Frame Rate” typically indicates that a source device is capable of this feature.</p><p id="b585" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">Even if a source device and the connected sink can support Native content frame rate streaming (seamless or non-seamless), a user might have selected not to do this via the source device system settings e.g. “Match Content Frame Rate” set to “Never”. Or they might have indicated a preference of doing this only when the native content frame rate play start can happen in a seamless manner e.g. “Match Content Frame Rate” set to “Seamless”.</p><p id="2882" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The Netflix application needs to know this user selection in order to honor their preference. Hence, source devices are expected to relay this user preference to the Netflix application to help with this run-time decision making.</p><p id="02d4" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">In spite of source device capability and the user preferences collectively indicating that the Native Content Frame Rate streaming should be enabled, the Netflix application can decide to disable this feature for specific member experiences. 
As an example, when the user is browsing Netflix content in the home UI, we cannot play Netflix trailers in their Native frame rate for the following reasons:</p><ul class=""><li id="07d1" class="mr ms gq mt b mu mv mw mx my mz na nb pp nd ne nf pq nh ni nj pr nl nm nn no ps pt pu bj">If using Solution #1: since Netflix trailers are encoded in varying content frame rates, switching between trailers would result in screen blanking, making UI browsing unusable.</li>
<li id="ac8e" class="mr ms gq mt b mu pv mw mx my pw na nb pp px ne nf pq py ni nj pr pz nm nn no ps pt pu bj">If using Solution #2: sending Netflix trailers in their Native frame rate would mean that the associated UI components (cursor movement, asset selection, etc.) would also be displayed at the reduced frame rate, resulting in a sluggish UI browsing experience. This is because both graphics (the Netflix application UI) and video components are blended together on the source device and go out over HDMI at the same frame rate (the native content frame rate of the trailer).</li>
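</ul>
<p><em>The run-time decision sketched in the preceding paragraphs can be expressed as a simple predicate. Every name and value below is an illustrative assumption, not the Netflix application’s actual API:</em></p>

```python
def use_native_frame_rate(device_supports_native: bool,
                          device_supports_qms: bool,
                          user_pref: str,
                          full_screen_playback: bool) -> bool:
    """Hypothetical sketch: decide whether to stream at the native frame
    rate, given device capabilities, user preference, and the use case."""
    if not device_supports_native:
        return False                # source/sink pair cannot do it at all
    if not full_screen_playback:
        return False                # e.g. trailers in the browse UI
    if user_pref == "never":
        return False                # honor the system setting
    if user_pref == "seamless":
        return device_supports_qms  # only when the switch is glitch-free
    return True                     # "always" / match-content enabled
```

<ul class="">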
</ul><p id="5e1d" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">To handle these issues we follow an approach as shown in Figure 6 below where we enable the Native Frame Rate playback experience only when the user selects a title and watches it in full screen with minimal graphical UI elements.</p><figure class="ox oy oz pa pb pc ou ov paragraph-image"><div role="button" tabindex="0" class="pd pe ff pf bg pg ou ov qg"><picture><img alt="" class="bg ph pi c" width="700" height="390" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="pj pk pl ou ov pm pn be b bf z dt"><strong class="be nt">Figure 6: Native Frame Rate usage within Netflix application</strong></figcaption></figure><p id="76bb" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">This article presented features that aim to improve the content playback experience on HDMI source devices. The breadth of available technical solutions, user selectable preferences, device capabilities and the application of each of these permutations in the context of various in-app member journeys represent a typical engineering and product decision framework at Netflix. Here at Netflix, our goal is to maximize immersion for our members through introduction of new features that will improve their viewing experience and keep them fully engaged in our content.</p><p id="b9ec" class="pw-post-body-paragraph mr ms gq mt b mu op mw mx my oq na nb nc or ne nf ng os ni nj nk ot nm nn no gj bj">We would like to acknowledge the hard work of a number of teams that came together to deliver the features being discussed in this document. These include Core UI and JS Player development, Netflix Application Software development, AV Test and Tooling (earlier <a class="af nq" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/hdmi-scaling-netflix-certification-8e9cb3ec524f">article</a> from this team), Partner Engineering and Product teams in the <a class="af nq" href="https://jobs.netflix.com/team?slug=client-and-ui-engineering" rel="noopener ugc nofollow" target="_blank">Consumer Engineering</a> organization and our data science friends in the <a class="af nq" href="https://jobs.netflix.com/team?slug=data-science-and-engineering" rel="noopener ugc nofollow" target="_blank">Data Science and Engineering</a> organization at Netflix. Diagrams in this article are courtesy of our Partner Enterprise Platform XD team.</p>]]></description>
      <link>https://netflixtechblog.com/native-frame-rate-playback-6c87836a948</link>
      <guid>https://netflixtechblog.com/native-frame-rate-playback-6c87836a948</guid>
      <pubDate>Mon, 05 Jun 2023 18:31:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Ensuring the Successful Launch of Ads on Netflix]]></title>
      <description><![CDATA[<div class="ab ca ch bg fv fw fx fy"><p id="e865" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">By <a class="af np" href="https://www.linkedin.com/in/josefernandezmn/" rel="noopener ugc nofollow" target="_blank">Jose Fernandez</a>, <a class="af np" href="https://www.linkedin.com/in/edhbarker/" rel="noopener ugc nofollow" target="_blank">Ed Barker</a>, <a class="af np" href="https://www.linkedin.com/in/hajacobs/" rel="noopener ugc nofollow" target="_blank">Hank Jacobs</a></p><p id="f9b4" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">In November 2022, we introduced a brand new tier — <a class="af np" href="https://about.netflix.com/en/news/announcing-basic-with-ads-us" rel="noopener ugc nofollow" target="_blank"><em class="ot">Basic with ads</em></a>. This tier extended existing infrastructure by adding new backend components and a new remote call to our ads partner on the playback path. As we were gearing up for launch, we wanted to ensure it would go as smoothly as possible. To do this, we devised a novel way to simulate the projected traffic weeks ahead of launch by building upon the traffic migration framework described <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-1-ba1c7a1c7835">here</a>. We used this simulation to help us surface problems of scale and validate our Ads algorithms.</p><p id="9ca5" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj"><em class="ot">Basic with ads</em> was launched worldwide on November 3rd. 
In this blog post, we’ll discuss the methods we used to ensure a successful launch, including:</p><ul class=""><li id="0871" class="mr ms gq mt b mu mv mw mx my mz na nb ou nd ne nf ov nh ni nj ow nl nm nn no ox oy oz bj">How we tested the system</li>
<li id="ab2c" class="mr ms gq mt b mu pa mw mx my pb na nb ou pc ne nf ov pd ni nj ow pe nm nn no ox oy oz bj">Netflix technologies involved</li>
<li id="aaea" class="mr ms gq mt b mu pa mw mx my pb na nb ou pc ne nf ov pd ni nj ow pe nm nn no ox oy oz bj">Best practices we developed</li>
</ul><p id="5dec" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. New content or national events may drive brief spikes, but, by and large, traffic is usually smoothly increasing or decreasing. An exception to this trend is when we redirect traffic between AWS data centers during regional evacuations, which leads to sudden spikes in traffic in multiple regions. Region evacuations can occur at any time, for a variety of reasons.</p></div><div class="ab ca ch bg fv fw fx fy"><p id="30f6" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">While evaluating options to test anticipated load and evaluate our ad selection algorithms at scale, we realized that mimicking member viewing behavior, the seasonality of our organic traffic, and abrupt regional shifts were all important requirements. Replaying real traffic and making it appear as <em class="ot">Basic with ads</em> traffic was a better solution than artificially simulating Netflix traffic. <strong class="mt gr">Replay traffic enabled us to test our new systems and algorithms at scale before launch, while also making the traffic as realistic as possible.</strong></p><p id="cc31" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">A key objective of this initiative was to ensure that our customers were not impacted. We used member viewing habits to drive the simulation, but customers did not see any ads as a result.
Achieving this goal required extensive planning and implementation of measures to isolate the replay traffic environment from the production environment.</p><p id="7a29" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Netflix’s data science team provided projections of what the <em class="ot">Basic with ads</em> subscriber count would look like a month after launch. We used this information to simulate a subscriber population through our <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15">AB testing platform</a>. When traffic matching our AB test criteria arrived at our playback services, we stored copies of those requests in a <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/stream-processing-with-mantis-78af913f51a6">Mantis stream</a>.</p><p id="fd46" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Next, we launched a Mantis job that processed all requests in the stream and replayed them in a duplicate production environment created for replay traffic. We set the services in this environment to “replay traffic” mode, which meant that they did not alter state and were programmed to treat the request as being on the ads plan, which activated the components of the ads system.</p><p id="cb47" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The replay traffic environment generated responses containing a standard playback manifest, a JSON document containing all the necessary information for a Netflix device to start playback. It also included metadata about ads, such as ad placement and impression-tracking events. 
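</p><p><em>To make the isolation concrete, here is a hypothetical sketch of how a captured production request might be rewritten for the replay environment. The host, header names, and record fields are invented for illustration; they are not Netflix’s actual request format:</em></p>

```python
def to_replay_request(captured: dict,
                      replay_host: str = "replay.playback.example") -> dict:
    """Retarget a captured request at the duplicate environment and tag it
    so services skip state changes and treat it as an ads-plan request."""
    return {
        "url": "https://" + replay_host + captured["path"],
        "headers": {
            **captured.get("headers", {}),
            "x-replay-traffic": "true",            # hypothetical: no state mutation
            "x-simulated-plan": "basic-with-ads",  # hypothetical: activate ads path
        },
        "body": captured.get("body"),
    }

req = to_replay_request({"path": "/playback/manifest",
                         "headers": {"x-request-id": "abc123"}})
```

<p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">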
We stored these responses in a <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/keystone-real-time-stream-processing-platform-a3ee651812a">Keystone stream</a> with outputs for Kafka and Elasticsearch. A Kafka consumer retrieved the playback manifests with ad metadata and simulated a device playing the content and triggering the impression-tracking events. We used Elasticsearch dashboards to analyze results.</p><p id="584e" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj"><strong class="mt gr">Ultimately, we accurately simulated the projected <em class="ot">Basic with ads</em> traffic weeks ahead of the launch date.</strong></p><figure class="po pp pq pr ps pf qh qi paragraph-image"><div role="button" tabindex="0" class="py pz ff qa bg qb qh qi qo"><picture><img alt="A diagram of the systems involved in traffic replay" class="bg qc qd c" width="700" height="579" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="qe qf qg qh qi qj qk be b bf z dt">Fig. 2: The Traffic Replay Setup</figcaption></figure><p id="f702" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">To fully replay the traffic, we first validated the idea with a small percentage of traffic. The <a class="af np" href="https://netflix.github.io/mantis/develop/querying/mql/" rel="noopener ugc nofollow" target="_blank">Mantis query language</a> allowed us to set the percentage of replay traffic to process. We informed our engineering and business partners, including customer support, about the experiment and ramped up traffic incrementally while monitoring the success and error metrics through <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c">Lumen dashboards</a>. We continued ramping up and eventually reached 100% replay. At this point we felt confident to run the replay traffic 24/7.</p><p id="90b6" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">To validate handling traffic spikes caused by regional evacuations, we utilized Netflix’s region evacuation exercises which are scheduled regularly. By coordinating with the team in charge of region evacuations and aligning with their calendar, we validated our system and third-party touchpoints at 100% replay traffic during these exercises.</p><p id="b871" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">We also constructed and checked our ad monitoring and alerting system during this period. Having representative data allowed us to be more confident in our alerting thresholds. 
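</p><p><em>A deterministic percentage gate of the kind used for the incremental ramp-up might look like the sketch below. It stands in for the Mantis query language sampling and is not the actual mechanism; hashing the request id keeps the sample stable as the percentage is raised:</em></p>

```python
import zlib

def in_replay_sample(request_id: str, replay_pct: int) -> bool:
    """Deterministic gate: a given request id always lands in (or out of)
    the sample, so a ramp from 1% to 100% only ever adds traffic."""
    return zlib.crc32(request_id.encode()) % 100 < replay_pct
```

<p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">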
The ads team also made necessary modifications to the algorithms to achieve the desired business outcomes for launch.</p><p id="12d7" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Finally, we conducted chaos experiments using the <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f">ChAP experimentation platform</a>. This allowed us to validate our fallback logic and our new systems under failure scenarios. By intentionally introducing failure into the simulation, we were able to identify points of weakness and make the necessary improvements to ensure that our ads systems were resilient and able to handle unexpected events.</p><p id="b731" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj"><strong class="mt gr">The availability of replay traffic 24/7 enabled us to refine our systems and boost our launch confidence, reducing stress levels for the team.</strong></p><p id="1ceb" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">The above summarizes three months of hard work by a tiger team consisting of representatives from various backend teams and <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb">Netflix’s centralized SRE team</a>. 
This work helped ensure <strong class="mt gr">a successful launch of the <em class="ot">Basic with ads</em></strong> tier on November 3rd.</p><p id="87d4" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">To briefly recap, here are a few of the things that we took away from this journey:</p><ul class=""><li id="9db8" class="mr ms gq mt b mu mv mw mx my mz na nb ou nd ne nf ov nh ni nj ow nl nm nn no ox oy oz bj">Accurately simulating real traffic helps build confidence in new systems and algorithms more quickly.</li>
<li id="a01c" class="mr ms gq mt b mu pa mw mx my pb na nb ou pc ne nf ov pd ni nj ow pe nm nn no ox oy oz bj">Large scale testing using representative traffic helps to uncover bugs and operational surprises.</li>
<li id="826d" class="mr ms gq mt b mu pa mw mx my pb na nb ou pc ne nf ov pd ni nj ow pe nm nn no ox oy oz bj">Replay traffic has other applications outside of load testing that can be leveraged to build new products and features at Netflix.</li>
</ul><p id="69a2" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Replay traffic at Netflix has numerous applications, one of which has proven to be a valuable tool for development and launch readiness. The Resilience team is streamlining this simulation strategy by integrating it into the <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f">ChAP experimentation platform</a>, making it accessible for all development teams without the need for extensive infrastructure setup. Keep an eye out for updates on this.</p></div>]]></description>
      <link>https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba</link>
      <guid>https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba</guid>
      <pubDate>Thu, 01 Jun 2023 21:22:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Debugging a FUSE deadlock in the Linux kernel]]></title>
      <description><![CDATA[<p id="939c" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj"><a class="af np" href="https://tycho.pizza" rel="noopener ugc nofollow" target="_blank">Tycho Andersen</a></p><p id="97e4" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">I work on the Compute team at Netflix, which is charged with managing all AWS and containerized workloads at Netflix, including autoscaling, deployment of containers, issue remediation, etc.</p><p id="adb6" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">This particular issue involved a custom internal <a class="af np" href="https://www.kernel.org/doc/html/latest/filesystems/fuse.html" rel="noopener ugc nofollow" target="_blank">FUSE filesystem</a>: <a class="af np" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/netflix-drive-a607538c3055">ndrive</a>. It had been festering for some time, but needed someone to sit down and look at it in anger. 
This blog post describes how I poked at <code class="cw nq nr ns nt b">/proc</code> to get a sense of what was going on, before posting the issue to the kernel mailing list and getting schooled on how the kernel’s wait code actually works!</p><p id="1e38" class="pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj">We had a stuck docker API call:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">goroutine 146 [select, 8817 minutes]:
net/http.(*persistConn).roundTrip(0xc000658fc0, 0xc0003fc080, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/transport.go:2610 +0x765
net/http.(*Transport).roundTrip(0xc000420140, 0xc000966200, 0x30, 0x1366f20, 0x162)
        /usr/local/go/src/net/http/transport.go:592 +0xacb
net/http.(*Transport).RoundTrip(0xc000420140, 0xc000966200, 0xc000420140, 0x0, 0x0)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc000966200, 0x161eba0, 0xc000420140, 0x0, 0x0, 0x0, 0xc00000e050, 0x3, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0, 0xc00000e050, 0x0, 0x1, 0x10000168e)
        /usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc000438480, 0xc000966200, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:585
golang.org/x/net/context/ctxhttp.Do(0x163bd48, 0xc000044090, 0xc000438480, 0xc000966100, 0x0, 0x0, 0x0)
        /go/pkg/mod/golang.org/x/net@v0.0.0-20211209124913-491a49abca63/context/ctxhttp/ctxhttp.go:27 +0x10f
github.com/docker/docker/client.(*Client).doRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc000966100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/request.go:132 +0xbe
github.com/docker/docker/client.(*Client).sendRequest(0xc0001a8200, 0x163bd48, 0xc000044090, 0x13d8643, 0x3, 0xc00079a720, 0x51, 0x0, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/request.go:122 +0x156
github.com/docker/docker/client.(*Client).get(...)
        /go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/request.go:37
github.com/docker/docker/client.(*Client).ContainerInspect(0xc0001a8200, 0x163bd48, 0xc000044090, 0xc0006a01c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/moby/moby@v0.0.0-20190408150954-50ebe4562dfc/client/container_inspect.go:18 +0x128
github.com/Netflix/titus-executor/executor/runtime/docker.(*DockerRuntime).Kill(0xc000215180, 0x163bdb8, 0xc000938600, 0x1, 0x0, 0x0)
        /var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runtime/docker/docker.go:2835 +0x310
github.com/Netflix/titus-executor/executor/runner.(*Runner).doShutdown(0xc000432dc0, 0x163bd10, 0xc000938390, 0x1, 0xc000b821e0, 0x1d, 0xc0005e4710)
        /var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:326 +0x4f4
github.com/Netflix/titus-executor/executor/runner.(*Runner).startRunner(0xc000432dc0, 0x163bdb8, 0xc00071e0c0, 0xc0a502e28c08b488, 0x24572b8, 0x1df5980)
        /var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:122 +0x391
created by github.com/Netflix/titus-executor/executor/runner.StartTaskWithRuntime
        /var/lib/buildkite-agent/builds/ip-192-168-1-90-1/netflix/titus-executor/executor/runner/runner.go:81 +0x411</pre><p id="abe6" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Here, our management engine has made an HTTP call to the Docker API’s unix socket asking it to kill a container. Our containers are configured to be killed via <code class="cw nq nr ns nt b">SIGKILL</code>. But this is strange.
<code class="cw nq nr ns nt b">kill(SIGKILL)</code> should be relatively fatal, so what is the container doing?</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">$ docker exec -it 6643cd073492 bash
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: process_linux.go:130: executing setns process caused: exit status 1: unknown</pre><p id="9510" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Hmm. Seems like it’s alive, but <code class="cw nq nr ns nt b">setns(2)</code> fails. Why would that be? If we look at the process tree via <code class="cw nq nr ns nt b">ps awwfux</code>, we see:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">\_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/6643cd073492ba9166100ed30dbe389ff1caef0dc3d35
|  \_ [docker-init]
|      \_ [ndrive] &lt;defunct&gt;</pre><p id="484b" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Ok, so the container’s init process is still alive, but it has one zombie child. What could the container’s init process possibly be doing?</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg"># cat /proc/1528591/stack
[&lt;0&gt;] do_wait+0x156/0x2f0
[&lt;0&gt;] kernel_wait4+0x8d/0x140
[&lt;0&gt;] zap_pid_ns_processes+0x104/0x180
[&lt;0&gt;] do_exit+0xa41/0xb80
[&lt;0&gt;] do_group_exit+0x3a/0xa0
[&lt;0&gt;] __x64_sys_exit_group+0x14/0x20
[&lt;0&gt;] do_syscall_64+0x37/0xb0
[&lt;0&gt;] entry_SYSCALL_64_after_hwframe+0x44/0xae</pre><p id="5c89" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">It is in the process of exiting, but it seems stuck. The only child is the ndrive process in Z (i.e. “zombie”) state, though.
Zombies are processes that have successfully exited, and are waiting to be reaped by a corresponding <code class="cw nq nr ns nt b">wait()</code> syscall from their parents. So how could the kernel be stuck waiting on a zombie?</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg"># ls /proc/1544450/task
1544450  1544574</pre><p id="5d90" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Ah ha, there are two threads in the thread group. One of them is a zombie, maybe the other one isn’t:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg"># cat /proc/1544574/stack
[&lt;0&gt;] request_wait_answer+0x12f/0x210
[&lt;0&gt;] fuse_simple_request+0x109/0x2c0
[&lt;0&gt;] fuse_flush+0x16f/0x1b0
[&lt;0&gt;] filp_close+0x27/0x70
[&lt;0&gt;] put_files_struct+0x6b/0xc0
[&lt;0&gt;] do_exit+0x360/0xb80
[&lt;0&gt;] do_group_exit+0x3a/0xa0
[&lt;0&gt;] get_signal+0x140/0x870
[&lt;0&gt;] arch_do_signal_or_restart+0xae/0x7c0
[&lt;0&gt;] exit_to_user_mode_prepare+0x10f/0x1c0
[&lt;0&gt;] syscall_exit_to_user_mode+0x26/0x40
[&lt;0&gt;] do_syscall_64+0x46/0xb0
[&lt;0&gt;] entry_SYSCALL_64_after_hwframe+0x44/0xae</pre><p id="f377" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Indeed it is not a zombie. It is trying to become one as hard as it can, but it’s blocking inside FUSE for some reason. To find out why, let’s look at some kernel code.
If we look at <code class="cw nq nr ns nt b"><a class="af np" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/pid_namespace.c?h=v5.19#n166" rel="noopener ugc nofollow" target="_blank">zap_pid_ns_processes()</a></code>, it does:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">/*
 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
 * kernel_wait4() will also block until our children traced from the
 * parent namespace are detached and become EXIT_DEAD.
 */
do {
        clear_thread_flag(TIF_SIGPENDING);
        rc = kernel_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);</pre><p id="b65e" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">which is where we are stuck, but before that, it has done:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">/* Don't allow any more processes into the pid namespace */
disable_pid_allocation(pid_ns);</pre><p id="3eed" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">which is why docker can’t <code class="cw nq nr ns nt b">setns()</code> — the <em class="pm">namespace</em> is a zombie. Ok, so we can’t <code class="cw nq nr ns nt b">setns(2)</code>, but why are we stuck in <code class="cw nq nr ns nt b">kernel_wait4()</code>?
To understand why, let’s look at what the other thread was doing in FUSE’s <code class="cw nq nr ns nt b"><a class="af np" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/dev.c?h=v5.19#n407" rel="noopener ugc nofollow" target="_blank">request_wait_answer()</a></code>:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">/*
 * Either request is already in userspace, or it was forced.
 * Wait it out.
 */
wait_event(req-&gt;waitq, test_bit(FR_FINISHED, &amp;req-&gt;flags));</pre><p id="df37" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Ok, so we’re waiting for an event (in this case, that userspace has replied to the FUSE flush request). But <code class="cw nq nr ns nt b">zap_pid_ns_processes()</code> sent a <code class="cw nq nr ns nt b">SIGKILL</code>! <code class="cw nq nr ns nt b">SIGKILL</code> should be very fatal to a process. If we look at the process, we can indeed see that there’s a pending <code class="cw nq nr ns nt b">SIGKILL</code>:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg"># grep Pnd /proc/1544574/status
SigPnd: 0000000000000000
ShdPnd: 0000000000000100</pre><p id="3a27" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Viewing process status this way, you can see <code class="cw nq nr ns nt b">0x100</code> (i.e. the 9th bit is set) under <code class="cw nq nr ns nt b">ShdPnd</code>, which is the bit for signal number 9, i.e. <code class="cw nq nr ns nt b">SIGKILL</code>. Pending signals are signals that have been generated by the kernel, but have not yet been delivered to the task. Signals are only delivered at certain times, for example when entering or leaving a syscall, or when waiting on events. If the kernel is currently doing something on behalf of the task, the signal may be pending. Signals can also be blocked by a task, so that they are never delivered.
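</p><p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Those two mask lines are just hex bitmaps. A few lines of Python (illustrative, not from the original investigation) decode them back into signal numbers:</p>

```python
def decode_sig_mask(mask_hex):
    """Decode a SigPnd/ShdPnd hex mask from /proc/PID/status.

    Bit N-1 set means signal number N is pending (bit 0 is signal 1).
    """
    mask = int(mask_hex, 16)
    pending = []
    signum = 1
    while mask:
        if mask % 2:
            pending.append(signum)
        mask //= 2
        signum += 1
    return pending

# The ShdPnd line from the stuck task: only bit 9 is set.
decode_sig_mask("0000000000000100")  # [9], i.e. SIGKILL
```

<p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">So the namespace’s <code class="cw nq nr ns nt b">SIGKILL</code> really is sitting in the shared pending set, waiting to be delivered.</p><p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">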
Blocked signals will show up in their respective pending sets as well. However, <code class="cw nq nr ns nt b">man 7 signal</code> says: “The signals <code class="cw nq nr ns nt b">SIGKILL</code> and <code class="cw nq nr ns nt b">SIGSTOP</code> cannot be caught, blocked, or ignored.” But here the kernel is telling us that we have a pending <code class="cw nq nr ns nt b">SIGKILL</code> that is somehow being ignored, even while the task is waiting!</p><p id="03f8" class="pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj">Well that is weird. The wait code (i.e. <code class="cw nq nr ns nt b">include/linux/wait.h</code>) is used everywhere in the kernel: semaphores, wait queues, completions, etc. Surely it knows to look for <code class="cw nq nr ns nt b">SIGKILL</code>s. So what does <code class="cw nq nr ns nt b">wait_event()</code> actually do? Digging through the macro expansions and wrappers, the meat of it is:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)           \
({                                                                              \
        __label__ __out;                                                        \
        struct wait_queue_entry __wq_entry;                                     \
        long __ret = ret;       /* explicit shadow */                           \
                                                                                \
        init_wait_entry(&amp;__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);        \
        for (;;) {                                                              \
                long __int = prepare_to_wait_event(&amp;wq_head, &amp;__wq_entry, state);\
                                                                                \
                if (condition)                                                  \
                        break;                                                  \
                                                                                \
                if (___wait_is_interruptible(state) &amp;&amp; __int) {                 \
                        __ret = __int;                                          \
                        goto __out;                                             \
                }                                                               \
                                                                                \
                cmd;                                                            \
        }                                                                       \
        finish_wait(&amp;wq_head, &amp;__wq_entry);                                     \
__out:  __ret;                                                                  \
})</pre><p id="2343" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">So it loops forever, doing <code class="cw nq nr ns nt b">prepare_to_wait_event()</code>, checking the condition, then checking to see if we need to interrupt. Then it does <code class="cw nq nr ns nt b">cmd</code>, which in this case is <code class="cw nq nr ns nt b">schedule()</code>, i.e. “do something else for a while”.
<code class="cw nq nr ns nt b">prepare_to_wait_event()</code> looks like:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">long prepare_to_wait_event(struct wait_queue_head *wq_head,
                           struct wait_queue_entry *wq_entry, int state)
{
        unsigned long flags;
        long ret = 0;

        spin_lock_irqsave(&amp;wq_head-&gt;lock, flags);
        if (signal_pending_state(state, current)) {
                /*
                 * Exclusive waiter must not fail if it was selected by wakeup,
                 * it should "consume" the condition we were waiting for.
                 *
                 * The caller will recheck the condition and return success if
                 * we were already woken up, we can not miss the event because
                 * wakeup locks/unlocks the same wq_head-&gt;lock.
                 *
                 * But we need to ensure that set-condition + wakeup after that
                 * can't see us, it should wake up another exclusive waiter if
                 * we fail.
                 */
                list_del_init(&amp;wq_entry-&gt;entry);
                ret = -ERESTARTSYS;
        } else {
                if (list_empty(&amp;wq_entry-&gt;entry)) {
                        if (wq_entry-&gt;flags &amp; WQ_FLAG_EXCLUSIVE)
                                __add_wait_queue_entry_tail(wq_head, wq_entry);
                        else
                                __add_wait_queue(wq_head, wq_entry);
                }
                set_current_state(state);
        }
        spin_unlock_irqrestore(&amp;wq_head-&gt;lock, flags);
        return ret;
}
EXPORT_SYMBOL(prepare_to_wait_event);</pre><p id="7e53" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">It looks like the only way we can break out of this with a non-zero exit code is if <code class="cw nq nr ns nt b">signal_pending_state()</code> is true.
Since our call site was just <code class="cw nq nr ns nt b">wait_event()</code>, we know that state here is <code class="cw nq nr ns nt b">TASK_UNINTERRUPTIBLE</code>; the definition of <code class="cw nq nr ns nt b">signal_pending_state()</code> looks like:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">static inline int signal_pending_state(unsigned int state, struct task_struct *p)
{
        if (!(state &amp; (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                return 0;
        if (!signal_pending(p))
                return 0;

        return (state &amp; TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}</pre><p id="45d6" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Our task is not interruptible, so the first if fails. Our task should have a signal pending, though, right?</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">static inline int signal_pending(struct task_struct *p)
{
        /*
         * TIF_NOTIFY_SIGNAL isn't really a signal, but it requires the same
         * behavior in terms of ensuring that we break out of wait loops
         * so that notify signal callbacks can be processed.
         */
        if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
                return 1;
        return task_sigpending(p);
}</pre><p id="d7ba" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">As the comment notes, <code class="cw nq nr ns nt b">TIF_NOTIFY_SIGNAL</code> isn’t relevant here, in spite of its name, but let’s look at <code class="cw nq nr ns nt b">task_sigpending()</code>:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">static inline int task_sigpending(struct task_struct *p)
{
        return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
}</pre><p id="962f" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Hmm. Seems like we should have that flag set, right? To figure that out, let’s look at how signal delivery works.
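</p><p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Before following the delivery path, it is worth condensing the check above into a runnable mirror. This is an illustrative Python sketch: the constants are the task-state flag values from the kernel’s <code class="cw nq nr ns nt b">include/linux/sched.h</code> (treat them as assumptions), and plain booleans stand in for the real <code class="cw nq nr ns nt b">task_struct</code> queries:</p>

```python
# Task-state flag values as in include/linux/sched.h (assumed here).
TASK_INTERRUPTIBLE = 0x0001
TASK_UNINTERRUPTIBLE = 0x0002
TASK_WAKEKILL = 0x0100
TASK_KILLABLE = TASK_WAKEKILL | TASK_UNINTERRUPTIBLE

def signal_pending_state(state, sig_pending, fatal_sig_pending):
    """Mirror of the C check; booleans replace signal_pending() and
    __fatal_signal_pending() on the task."""
    if not state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL):
        return False
    if not sig_pending:
        return False
    return bool(state & TASK_INTERRUPTIBLE) or fatal_sig_pending

# A plain wait_event() waits in TASK_UNINTERRUPTIBLE, so even a pending
# fatal SIGKILL never breaks the wait:
signal_pending_state(TASK_UNINTERRUPTIBLE, True, True)  # False
# A killable wait adds TASK_WAKEKILL, and the same signal gets through:
signal_pending_state(TASK_KILLABLE, True, True)  # True
```

<p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The first call is exactly our situation, and it shows why the pending <code class="cw nq nr ns nt b">SIGKILL</code> cannot interrupt this wait, regardless of what the delivery side does.</p><p class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">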
When we’re shutting down the pid namespace in <code class="cw nq nr ns nt b">zap_pid_ns_processes()</code>, it does:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);</pre><p id="a015" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">which eventually gets to <code class="cw nq nr ns nt b">__send_signal_locked()</code>, which has:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">pending = (type != PIDTYPE_PID) ? &amp;t-&gt;signal-&gt;shared_pending : &amp;t-&gt;pending;
...
sigaddset(&amp;pending-&gt;signal, sig);
...
complete_signal(sig, t, type);</pre><p id="a0ed" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Using <code class="cw nq nr ns nt b">PIDTYPE_MAX</code> here as the type is a little weird, but it roughly indicates “this is very privileged kernel stuff sending this signal, you should definitely deliver it”. There is a bit of an unintended consequence here, though, in that <code class="cw nq nr ns nt b">__send_signal_locked()</code> ends up sending the <code class="cw nq nr ns nt b">SIGKILL</code> to the shared set, instead of the individual task’s set.
If we look at the <code class="cw nq nr ns nt b">__fatal_signal_pending()</code> code, we see:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">static inline int __fatal_signal_pending(struct task_struct *p)
{
        return unlikely(sigismember(&amp;p-&gt;pending.signal, SIGKILL));
}</pre><p id="ba1a" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">But it turns out this is a bit of a red herring (<a class="af np" href="https://lore.kernel.org/all/YuGUyayVWDB7R89i@tycho.pizza/" rel="noopener ugc nofollow" target="_blank">although</a> <a class="af np" href="https://lore.kernel.org/all/20220728091220.GA11207@redhat.com/" rel="noopener ugc nofollow" target="_blank">it</a> <a class="af np" href="https://lore.kernel.org/all/871qu6bjp3.fsf@email.froward.int.ebiederm.org/" rel="noopener ugc nofollow" target="_blank">took</a> <a class="af np" href="https://lore.kernel.org/all/8735elhy4u.fsf@email.froward.int.ebiederm.org/" rel="noopener ugc nofollow" target="_blank">a</a> <a class="af np" href="https://lore.kernel.org/all/87pmhofr1q.fsf@email.froward.int.ebiederm.org/" rel="noopener ugc nofollow" target="_blank">while</a> for me to understand that).</p><p id="8339" class="pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj">To understand what’s really going on here, we need to look at <code class="cw nq nr ns nt b">complete_signal()</code>, since it unconditionally adds a <code class="cw nq nr ns nt b">SIGKILL</code> to the task’s pending set:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">sigaddset(&amp;t-&gt;pending.signal, SIGKILL);</pre><p id="4553" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">but why doesn’t it work?
At the top of the function we have:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">/*
 * Now find a thread we can wake up to take the signal off the queue.
 *
 * If the main thread wants the signal, it gets first crack.
 * Probably the least surprising to the average bear.
 */
if (wants_signal(sig, p))
        t = p;
else if ((type == PIDTYPE_PID) || thread_group_empty(p))
        /*
         * There is just one thread and it does not need to be woken.
         * It will dequeue unblocked signals before it runs again.
         */
        return;</pre><p id="07ad" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">but as <a class="af np" href="https://lore.kernel.org/all/877d4jbabb.fsf@email.froward.int.ebiederm.org/" rel="noopener ugc nofollow" target="_blank">Eric Biederman described</a>, basically every thread can handle a <code class="cw nq nr ns nt b">SIGKILL</code> at any time. Here’s <code class="cw nq nr ns nt b">wants_signal()</code>:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">static inline bool wants_signal(int sig, struct task_struct *p)
{
        if (sigismember(&amp;p-&gt;blocked, sig))
                return false;

        if (p-&gt;flags &amp; PF_EXITING)
                return false;

        if (sig == SIGKILL)
                return true;

        if (task_is_stopped_or_traced(p))
                return false;

        return task_curr(p) || !task_sigpending(p);
}</pre><p id="04f2" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">So… if a thread is already exiting (i.e. it has <code class="cw nq nr ns nt b">PF_EXITING</code>), it doesn’t want a signal. Consider the following sequence of events:</p><p id="f18d" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">1. a task opens a FUSE file, and doesn’t close it, then exits.
During that exit, the kernel dutifully calls <code class="cw nq nr ns nt b">do_exit()</code>, which does the following:</p><pre class="ox oy oz pa pb pc nt pd bo pe pf pg">exit_signals(tsk); /* sets PF_EXITING */</pre><p id="2fd3" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">2. <code class="cw nq nr ns nt b">do_exit()</code> continues on to <code class="cw nq nr ns nt b">exit_files(tsk);</code>, which flushes all files that are still open, resulting in the stack trace above.</p><p id="ded2" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">3. the pid namespace exits, and enters <code class="cw nq nr ns nt b">zap_pid_ns_processes()</code>, sends a <code class="cw nq nr ns nt b">SIGKILL</code> to everyone (that it expects to be fatal), and then waits for everyone to exit.</p><p id="8eb7" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">4. this kills the FUSE daemon in the pid ns so it can never respond.</p><p id="2db9" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">5. <code class="cw nq nr ns nt b">complete_signal()</code> for the FUSE task that was already exiting ignores the signal, since it has <code class="cw nq nr ns nt b">PF_EXITING</code>.</p><p id="b28c" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">6. Deadlock. Without manually aborting the FUSE connection, things will hang forever.</p><p id="16d8" class="pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj">It doesn’t really make sense to wait for flushes in this case: the task is dying, so there’s nobody to tell the return code of <code class="cw nq nr ns nt b">flush()</code> to. 
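</p><p class="pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj">The whole failure boils down to the <code class="cw nq nr ns nt b">wants_signal()</code> decision above. Here is an illustrative Python mirror (the <code class="cw nq nr ns nt b">PF_EXITING</code> value is assumed from <code class="cw nq nr ns nt b">include/linux/sched.h</code>, and the scheduler-related cases at the end are elided):</p>

```python
# PF_EXITING as in the kernel's include/linux/sched.h (assumed constant).
PF_EXITING = 0x00000004
SIGKILL = 9

def wants_signal(sig, blocked, flags):
    """Condensed mirror of the kernel's wants_signal() decision."""
    if sig in blocked:
        return False
    if flags & PF_EXITING:  # an exiting thread refuses everything,
        return False        # even SIGKILL...
    if sig == SIGKILL:      # ...which otherwise always wins
        return True
    return True  # remaining (scheduler-related) cases elided

# The FUSE-blocked thread is already in do_exit() when the namespace
# sends SIGKILL, so complete_signal() never wakes it:
wants_signal(SIGKILL, blocked=set(), flags=PF_EXITING)  # False
wants_signal(SIGKILL, blocked=set(), flags=0)           # True
```

<p class="pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj">That first <code class="cw nq nr ns nt b">False</code> is the deadlock in miniature: the only thread that could consume the <code class="cw nq nr ns nt b">SIGKILL</code> is the one the kernel declines to wake.</p><p class="pw-post-body-paragraph mr ms gq mt b mu os mw mx my ot na nb nc ou ne nf ng ov ni nj nk ow nm nn no gj bj">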
It also turns out that this bug can happen with several filesystems (anything that calls the kernel’s wait code in <code class="cw nq nr ns nt b">flush()</code>, i.e. basically anything that talks to something outside the local kernel).</p><p id="0106" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Individual filesystems will need to be patched in the meantime, for example the fix for FUSE is <a class="af np" href="https://github.com/torvalds/linux/commit/14feceeeb012faf9def7d313d37f5d4f85e6572b" rel="noopener ugc nofollow" target="_blank">here</a>, which was released on April 23 in Linux 6.3.</p><p id="ea5e" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">While this blog post addresses FUSE deadlocks, there are definitely issues in the nfs code and elsewhere, which we have not hit in production yet, but almost certainly will. You can also see it as a <a class="af np" href="https://lore.kernel.org/all/20230512225414.GE3223426@dread.disaster.area/" rel="noopener ugc nofollow" target="_blank">symptom of other filesystem bugs</a>. Something to look out for if you have a pid namespace that won’t exit.</p><p id="6e54" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">This is just a small taste of the variety of strange issues we encounter running containers at scale at Netflix. Our team is hiring, so please reach out if you also love red herrings and kernel deadlocks!</p>]]></description>
      <link>https://netflixtechblog.com/debugging-a-fuse-deadlock-in-the-linux-kernel-c75cd7989b6d</link>
      <guid>https://netflixtechblog.com/debugging-a-fuse-deadlock-in-the-linux-kernel-c75cd7989b6d</guid>
      <pubDate>Fri, 19 May 2023 21:21:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[ABAC on SpiceDB: Enabling Netflix’s Complex Identity Types]]></title>
      <description><![CDATA[<p id="cc8f" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">By <a class="af np" href="https://www.linkedin.com/in/chris-w-0a884022/" rel="noopener ugc nofollow" target="_blank">Chris Wolfe</a>, <a class="af np" href="https://www.linkedin.com/in/joseph-s-4324904/" rel="noopener ugc nofollow" target="_blank">Joey Schorr</a>, and <a class="af np" href="https://www.linkedin.com/in/vroldanbet/" rel="noopener ugc nofollow" target="_blank">Victor Roldán Betancort</a></p><p id="2c47" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">The authorization team at Netflix recently sponsored work to add Attribute Based Access Control (ABAC) support to AuthZed’s <a class="af np" href="https://github.com/authzed/spicedb" rel="noopener ugc nofollow" target="_blank">open source Google Zanzibar inspired</a> authorization system, <a class="af np" href="https://authzed.com/products/spicedb" rel="noopener ugc nofollow" target="_blank">SpiceDB</a>. Netflix required attribute support in SpiceDB to support core Netflix application identity constructs. This post discusses why Netflix wanted ABAC support in SpiceDB, how Netflix collaborated with AuthZed, the end result–<a class="af np" href="https://authzed.com/docs/reference/caveats" rel="noopener ugc nofollow" target="_blank">SpiceDB Caveats</a>, and how Netflix may leverage this new feature.</p><p id="6cbd" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Netflix is always looking for security, ergonomic, or efficiency improvements, and this extends to authorization tools. 
<a class="af np" href="https://authzed.com/blog/what-is-google-zanzibar" rel="noopener ugc nofollow" target="_blank">Google Zanzibar</a> is exciting to Netflix as it makes it easier to produce authorization decision objects and reverse indexes for resources a principal can access.</p><p id="a986" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Last year, while experimenting with Zanzibar approaches to authorization, Netflix found SpiceDB, the <a class="af np" href="https://github.com/authzed/spicedb" rel="noopener ugc nofollow" target="_blank">open source Google Zanzibar inspired permission system</a>, and built a prototype to experiment with modeling. The prototype uncovered trade-offs required to implement Attribute Based Access Control in SpiceDB, which made it poorly suited to Netflix’s core requirements for application identities.</p><p id="3906" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Netflix application identities are fundamentally attribute based: e.g. an instance of the Data Processor runs in eu-west-1 in the test environment with a public shard.</p><p id="3e11" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Authorizing these identities is done not only by application name, but by specifying specific attributes on which to match. An application owner might want to craft a policy like “Application members of the EU data processors group can access a PI decryption key”. This is one normal relationship in SpiceDB. But, they might also want to specify a policy for compliance reasons that only allows access to the PI key from data processor instances running in the EU within a sensitive shard. 
Put another way, an identity should only hold the “is member of the <code class="cw ot ou ov ow b">EU-data-processors</code> group” relationship if certain identity attributes (like region==eu) match in addition to the application name. This is a Caveated SpiceDB relationship.</p><p id="52bf" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">SpiceDB, being a Relationship Based Access Control (ReBAC) system, expected authorization checks to be performed against the existence of a specific relationship between objects. Users fit this model — they have a single user ID to describe who they are. As described above, Netflix applications do not fit this model. Their attributes are used to scope permissions to varying degrees.</p><p id="f770" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Netflix ran into significant difficulties in trying to fit their existing policy model into relations. To do so, Netflix’s design required:</p><ul class=""><li id="9fa5" class="mr ms gq mt b mu mv mw mx my mz na nb ox nd ne nf oy nh ni nj oz nl nm nn no pa pb pc bj">An event-based mechanism that could ingest information about application autoscaling groups. An autoscaling group isn’t the lowest level of granularity, but it’s relatively close to the lowest level where we’d typically see authorization policy applied.</li>
<li id="9b25" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">Ingest the attributes describing the autoscaling group and write them as separate relations. That is, for the data-processor, Netflix would need to write relations describing the region, environment, account, application name, etc.</li>
<li id="17a1" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">At authZ check time, provide the attributes for the identity to check, e.g. “can app bar in us-west-2 access this document.” SpiceDB is then responsible for figuring out which relations map back to the autoscaling group, e.g. name, environment, region, etc.</li>
<li id="119d" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">A cleanup process to prune stale relationships from the database.</li>
</ul><p id="d56c" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">What was problematic about this design? Aside from being complicated, there were a few specific things that made Netflix uncomfortable. The most salient was that <strong class="mt gr">it wasn’t resilient to an absence of relationship data: if a new autoscaling group started before its presence had been reported to SpiceDB, the group’s members would lack the permissions they needed to run</strong>. All this meant that Netflix would have to write and prune the relationship state with significant freshness requirements. This would be a significant departure from its existing policy-based system.</p><p id="9c37" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">While working through this, Netflix hopped into the SpiceDB Discord to chat about possible solutions and found an open community issue: the <a class="af np" href="https://github.com/authzed/spicedb/issues/386" rel="noopener ugc nofollow" target="_blank">caveated relationships proposal</a>.</p><p id="5a98" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">The SpiceDB community had already explored <a class="af np" href="https://github.com/authzed/spicedb/issues/158" rel="noopener ugc nofollow" target="_blank">integrating SpiceDB with Open Policy Agent (OPA)</a> and concluded it strayed too far from Zanzibar’s core promise of global horizontal scalability with strong consistency. 
With Netflix’s support, the AuthZed team pondered a Zanzibar-native approach to Attribute-Based Access Control.</p><p id="9f56" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The requirements were captured and published as the <a class="af np" href="https://github.com/authzed/spicedb/issues/386" rel="noopener ugc nofollow" target="_blank">caveated relationships proposal on GitHub</a> for feedback from the SpiceDB community. The community’s excitement and interest became apparent through comments, reactions, and conversations on the <a class="af np" href="https://authzed.com/discord" rel="noopener ugc nofollow" target="_blank">SpiceDB Discord server</a>. Clearly, Netflix wasn’t the only one facing challenges when reconciling SpiceDB with policy-based approaches, so Netflix decided to help! By sponsoring the project, Netflix was able to help AuthZed prioritize engineering effort and accelerate adding Caveats to SpiceDB.</p><h2 id="3c7d" class="pi nr gq be ns pj pk dx nw pl pm dz oa nc pn po pp ng pq pr ps nk pt pu pv pw bj">Quick Intro to SpiceDB</h2><p id="f590" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">The <a class="af np" href="https://authzed.com/docs/reference/schema-lang" rel="noopener ugc nofollow" target="_blank">SpiceDB Schema Language</a> lays out the rules for how to build, traverse, and interpret SpiceDB’s Relationship Graph to make authorization decisions. SpiceDB Relationships, e.g., <code class="cw ot ou ov ow b">document:readme writer user:emilia</code>, are stored in a datastore like CockroachDB or PostgreSQL and together represent a graph. SpiceDB walks the graph and decomposes it into subproblems. 
These subproblems are assigned through <a class="af np" href="https://authzed.com/blog/consistent-hash-load-balancing-grpc/" rel="noopener ugc nofollow" target="_blank">consistent hashing</a> and dispatched to a node in a cluster running SpiceDB. Over time, each node caches a subset of subproblems to support a distributed cache, reduce the datastore load, and achieve SpiceDB’s horizontal scalability.</p><h2 id="2013" class="pi nr gq be ns pj pk dx nw pl pm dz oa nc pn po pp ng pq pr ps nk pt pu pv pw bj">SpiceDB Caveats Design</h2><p id="0f73" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">The fundamental challenge with policies is that their input arguments can change the authorization result as understood by a centralized relationships datastore. If SpiceDB were to cache subproblems that have been “tainted” with policy variables, the likelihood those are reused for other requests would decrease and thus severely affect the cache hit rate. As you’d suspect, this would jeopardize one of the pillars of the system: its ability to scale.</p><p id="b504" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Once you accept that adding input arguments to the distributed cache isn’t efficient, you naturally gravitate toward the first question: what if you keep those inputs out of the cached subproblems? They are only known at request-time, so let’s add them as a variable in the subproblem! The cost of propagating those variables, assembling them, and executing the logic pales compared to fetching relationships from the datastore.</p><p id="5577" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">The next question was: how do you integrate the policy decisions into the relationships graph? 
The SpiceDB Schema Language’s core concepts are <a class="af np" href="https://authzed.com/docs/reference/glossary#relation" rel="noopener ugc nofollow" target="_blank">Relations</a> and <a class="af np" href="https://authzed.com/docs/reference/glossary#permission" rel="noopener ugc nofollow" target="_blank">Permissions</a>; these are how a developer defines the shape of their relationships and how to traverse them. Naturally, being a graph, it’s fitting to add policy logic at the edges or the nodes. That leaves at least two obvious options: <strong class="mt gr">policy at the Relation level, or policy at the Permission level.</strong></p><p id="d8c4" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">After iterating on both options to get a feel for the ergonomics and expressiveness, the choice was <strong class="mt gr">policy at the relation level</strong>. After all, SpiceDB is a Relationship Based Access Control (ReBAC) system. Policy at the relation level allows you to parameterize each relationship, which brought about the saying “this relationship exists, but with a Caveat!” With this approach, SpiceDB could do request-time relationship vetoing like so:</p><pre class="px py pz qa qb qc ow qd bo qe qf qg">definition human {}

caveat the_answer(received int) {
  received == 42
}

definition the_answer_to_life_the_universe_and_everything {
  relation humans: human with the_answer
  permission enlightenment = humans
}</pre><p id="244f" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Netflix and AuthZed discussed the concept of static versus dynamic Caveats as well. A developer would define static Caveat expressions in the SpiceDB Schema, while dynamic Caveats would have expressions defined at run time. 
The discussion centered around statically versus dynamically typed programming languages, but given SpiceDB’s Schema Language was designed for type safety, it seemed coherent with the overall design to continue with static Caveats. To support runtime-provided policies, the choice was to introduce expressions as arguments to a Caveat. Keeping the SpiceDB Schema easy to understand was a key driver for this decision.</p><p id="5fbd" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">For defining Caveats, the main requirement was to provide an expression language with first-class support for partially-evaluated expressions. <a class="af np" href="https://github.com/google/cel-spec" rel="noopener ugc nofollow" target="_blank">Google’s CEL</a> seemed like the obvious choice: a protobuf-native expression language that evaluates in linear time, with first-class support for partial results that can be run at the edge, and is not Turing complete. CEL expressions are type-safe, so they wouldn’t cause as many errors at runtime and can be stored in the datastore as a compiled protobuf. 
Given the near-perfect requirement match, it does make you wonder what Google’s Zanzibar has been up to since the white paper!</p><p id="4415" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">To execute the logic, SpiceDB would have to return a third response <code class="cw ot ou ov ow b">CAVEATED</code>, in addition to <code class="cw ot ou ov ow b">ALLOW</code> and <code class="cw ot ou ov ow b">DENY</code>, to signal that a result of a CheckPermission request depends on computing an unresolved chain of CEL expressions.</p><p id="6efb" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">SpiceDB Caveats needed to allow static input variables to be stored before evaluation to represent the multi-dimensional nature of Netflix application identities. Today, this is called “Caveat context,” defined by the values written in a SpiceDB Schema alongside a Relation and those provided by the client. Think of build-time variables as an expansion of a templated CEL expression, and those take precedence over request-time arguments. Here is an example:</p><pre class="px py pz qa qb qc ow qd bo qe qf qg">caveat the_answer(received int, expected int) {
  received == expected
}</pre><p id="a738" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Lastly, to deal with scenarios where there are multiple Caveated subproblems, the decision was to collect up a final CEL expression tree before evaluating it. The result of the final evaluation can be <code class="cw ot ou ov ow b">ALLOW</code>, <code class="cw ot ou ov ow b">DENY</code>, or <code class="cw ot ou ov ow b">CAVEATED</code>. Things get trickier with wildcards and SpiceDB APIs, but let’s save that for another post! 
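To make this three-valued outcome concrete, here is a toy Python sketch of the idea (an illustration only, not SpiceDB's implementation): when the caveat's context is complete, evaluation yields ALLOW or DENY; when values are missing, the result is CAVEATED along with the names of the missing variables.

```python
# Toy sketch of caveat evaluation (illustrative only, not SpiceDB internals).
# A caveat is a predicate over a context; missing context values produce a
# CAVEATED result naming the variables the client still needs to supply.

def check_caveat(required_vars, predicate, context):
    missing = [v for v in required_vars if v not in context]
    if missing:
        return "CAVEATED", missing        # unresolved: client must provide these
    return ("ALLOW" if predicate(context) else "DENY"), []

# Mirrors: caveat the_answer(received int, expected int) { received == expected }
required = ["received", "expected"]
the_answer = lambda ctx: ctx["received"] == ctx["expected"]

print(check_caveat(required, the_answer, {"received": 42, "expected": 42}))
# -> ('ALLOW', [])
print(check_caveat(required, the_answer, {"received": 7, "expected": 42}))
# -> ('DENY', [])
print(check_caveat(required, the_answer, {"expected": 42}))
# -> ('CAVEATED', ['received'])
```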
If the response is <code class="cw ot ou ov ow b">CAVEATED</code>, the client receives a list of missing variables needed to properly evaluate the expression.</p><p id="7d51" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">To sum up! The primary design decisions were:</p><ul class=""><li id="33c0" class="mr ms gq mt b mu mv mw mx my mz na nb ox nd ne nf oy nh ni nj oz nl nm nn no pa pb pc bj">Caveats defined at the Relation-level, not the Permission-level</li>
<li id="0317" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">Keep Caveats in line with SpiceDB Schema’s type-safe nature</li>
<li id="d150" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">Support well-typed values provided by the caller</li>
<li id="ee4f" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">Use Google’s CEL to define Caveat expressions</li>
<li id="eb06" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">Introduce a new result type: <code class="cw ot ou ov ow b">CAVEATED</code></li>
</ul><p id="5eb7" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj"><a class="af np" href="https://authzed.com/docs/reference/caveats" rel="noopener ugc nofollow" target="_blank">SpiceDB Caveats</a> simplify this approach by allowing Netflix to specify authorization policy as they have in the past for applications. Instead of persisting the entire state of the authorization world as relations, the system can combine relations with attributes of the identity supplied at authorization check time.</p><p id="83e8" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Now Netflix can write a Caveat similar to <code class="cw ot ou ov ow b">match_fine</code>, described below, that takes lists of expected attributes, e.g. region, account, etc. This Caveat would allow the specific application named by the relation as long as the context of the authorization check had an observed account, stack, detail, region, and extended attribute values that matched the values in their expected counterparts. This <a class="af np" href="https://play.authzed.com/s/51q8FOZ1PlzG/assertions" rel="noopener ugc nofollow" target="_blank">playground</a> has a live version of the schema, relations, etc. 
with which to experiment.</p><pre class="px py pz qa qb qc ow qd bo qe qf qg">definition app {}

caveat match_fine(
  expected_accounts list&lt;string&gt;,
  expected_regions list&lt;string&gt;,
  expected_stacks list&lt;string&gt;,
  expected_details list&lt;string&gt;,
  expected_ext_attrs map&lt;any&gt;,
  observed_account string,
  observed_region string,
  observed_stack string,
  observed_detail string,
  observed_ext_attrs map&lt;any&gt;
) {
  observed_account in expected_accounts &amp;&amp;
  observed_region in expected_regions &amp;&amp;
  observed_stack in expected_stacks &amp;&amp;
  observed_detail in expected_details &amp;&amp;
  expected_ext_attrs.isSubtreeOf(observed_ext_attrs)
}

definition movie {
  relation replicator: app with match_fine
  permission replicate = replicator
}</pre><p id="e80d" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">Using this SpiceDB Schema we can write a relation to restrict access to the replicator application. It should only be allowed to run when</p><ul class=""><li id="8526" class="mr ms gq mt b mu mv mw mx my mz na nb ox nd ne nf oy nh ni nj oz nl nm nn no pa pb pc bj">It is in the <code class="cw ot ou ov ow b">highrisk</code> or <code class="cw ot ou ov ow b">birdie</code> accounts</li>
<li id="3d50" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">AND in either <code class="cw ot ou ov ow b">us-west-1</code> or <code class="cw ot ou ov ow b">us-east-1</code></li>
<li id="4f82" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">AND it has stack <code class="cw ot ou ov ow b">bg</code></li>
<li id="4af9" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">AND it has detail <code class="cw ot ou ov ow b">casser</code></li>
<li id="3f28" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">AND its extended attributes contain the key-value pair ‘foo: bar’</li>
</ul><pre class="px py pz qa qb qc ow qd bo qe qf qg">movie:newspecial#replicator@app:mover[match_fine:{"expected_accounts":["highrisk","birdie"],"expected_regions":["us-west-1","us-east-1"],"expected_stacks":["bg"],"expected_details":["casser"],"expected_ext_attrs":{"foo":"bar"}}]</pre><p id="817c" class="pw-post-body-paragraph mr ms gq mt b mu mv mw mx my mz na nb nc nd ne nf ng nh ni nj nk nl nm nn no gj bj">With the playground we can also make assertions that can mirror the behavior we’d see from the CheckPermission API. These assertions make it clear that our caveats work as expected.</p><pre class="px py pz qa qb qc ow qd bo qe qf qg">assertTrue:
- 'movie:newspecial#replicate@app:mover with {"observed_account": "highrisk", "observed_region": "us-west-1", "observed_stack": "bg", "observed_detail": "casser", "observed_ext_attrs": {"foo": "bar"}}'
assertFalse:
- 'movie:newspecial#replicate@app:mover with {"observed_account": "lowrisk", "observed_region": "us-west-1", "observed_stack": "bg", "observed_detail": "casser", "observed_ext_attrs": {"foo": "bar"}}'
- 'movie:newspecial#replicate@app:purger with {"observed_account": "highrisk", "observed_region": "us-west-1", "observed_stack": "bg", "observed_detail": "casser", "observed_ext_attrs": {"foo": "bar"}}'</pre><p id="22f4" class="pw-post-body-paragraph mr ms gq mt b mu oo mw mx my op na nb nc oq ne nf ng or ni nj nk os nm nn no gj bj">Netflix and AuthZed are both excited about the collaboration’s outcome. Netflix has another authorization tool it can employ and SpiceDB users have another option with which to perform rich authorization checks. 
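For intuition, the match_fine caveat can be paraphrased in plain Python (a sketch of the logic only; SpiceDB actually evaluates it as CEL, and isSubtreeOf is approximated here as a recursive map-containment check):

```python
# Plain-Python paraphrase of the match_fine CEL caveat (for intuition only).
# isSubtreeOf is approximated as "every expected key/value is present in
# observed", recursing into nested maps.

def is_subtree_of(expected, observed):
    for key, value in expected.items():
        if isinstance(value, dict):
            if not isinstance(observed.get(key), dict):
                return False
            if not is_subtree_of(value, observed[key]):
                return False
        elif observed.get(key) != value:
            return False
    return True

def match_fine(expected, observed):
    return (
        observed["account"] in expected["accounts"]
        and observed["region"] in expected["regions"]
        and observed["stack"] in expected["stacks"]
        and observed["detail"] in expected["details"]
        and is_subtree_of(expected["ext_attrs"], observed["ext_attrs"])
    )

# Context from the relationship written above for movie:newspecial.
expected = {
    "accounts": ["highrisk", "birdie"],
    "regions": ["us-west-1", "us-east-1"],
    "stacks": ["bg"],
    "details": ["casser"],
    "ext_attrs": {"foo": "bar"},
}
mover = {"account": "highrisk", "region": "us-west-1", "stack": "bg",
         "detail": "casser", "ext_attrs": {"foo": "bar"}}
print(match_fine(expected, mover))  # True, like the assertTrue case
mover_lowrisk = dict(mover, account="lowrisk")
print(match_fine(expected, mover_lowrisk))  # False, like the first assertFalse case
```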
Bridging the gap between policy-based authorization and ReBAC is a powerful paradigm that is already benefiting companies looking to Zanzibar-based implementations to modernize their authorization stacks.</p><ul class=""><li id="9191" class="mr ms gq mt b mu oo mw mx my op na nb ox oq ne nf oy or ni nj oz os nm nn no pa pb pc bj"><a class="af np" href="https://www.linkedin.com/in/chris-w-0a884022/" rel="noopener ugc nofollow" target="_blank">Chris Wolfe</a></li>
<li id="5747" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://www.linkedin.com/in/joseph-s-4324904/" rel="noopener ugc nofollow" target="_blank">Joey Schorr</a></li>
<li id="0042" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://www.linkedin.com/in/vroldanbet/" rel="noopener ugc nofollow" target="_blank">Victor Roldán Betancort</a></li>
<li id="660d" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://www.linkedin.com/in/chestonlee/" rel="noopener ugc nofollow" target="_blank">Cheston Lee</a></li>
</ul><ul class=""><li id="3441" class="mr ms gq mt b mu oo mw mx my op na nb ox oq ne nf oy or ni nj oz os nm nn no pa pb pc bj"><a class="af np" href="https://authzed.com/blog/what-is-google-zanzibar" rel="noopener ugc nofollow" target="_blank">What is Google Zanzibar</a></li>
<li id="bf79" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://authzed.com/zanzibar" rel="noopener ugc nofollow" target="_blank">Annotated Google Zanzibar Paper</a></li>
<li id="89ba" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://github.com/authzed/spicedb" rel="noopener ugc nofollow" target="_blank">SpiceDB, an Open Source, Google Zanzibar Inspired Authorization System</a></li>
<li id="9465" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://github.com/google/cel-spec" rel="noopener ugc nofollow" target="_blank">Google’s CEL</a></li>
<li id="dd4e" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj">SpiceDB <a class="af np" href="https://authzed.com/docs/reference/glossary" rel="noopener ugc nofollow" target="_blank">Glossary</a></li>
<li id="8e55" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://authzed.com/blog/top-three-caveat-use-cases/" rel="noopener ugc nofollow" target="_blank">Top-3 Most Used SpiceDB Caveat Patterns (authzed.com)</a></li>
<li id="9a41" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://authzed.com/blog/check-it-out" rel="noopener ugc nofollow" target="_blank">How Permissions are Answered in SpiceDB</a></li>
<li id="968b" class="mr ms gq mt b mu pd mw mx my pe na nb ox pf ne nf oy pg ni nj oz ph nm nn no pa pb pc bj"><a class="af np" href="https://play.authzed.com/s/51q8FOZ1PlzG" rel="noopener ugc nofollow" target="_blank">Netflix Complex Identities Example Schema</a></li>
</ul>]]></description>
      <link>https://netflixtechblog.com/abac-on-spicedb-enabling-netflixs-complex-identity-types-c118f374fa89</link>
      <guid>https://netflixtechblog.com/abac-on-spicedb-enabling-netflixs-complex-identity-types-c118f374fa89</guid>
      <pubDate>Fri, 19 May 2023 14:01:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Migrating Critical Traffic At Scale with No Downtime — Part 1]]></title>
      <description><![CDATA[<p id="234e" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj"><a class="af jv" href="https://www.linkedin.com/in/shyam-gala-5891224/" rel="noopener ugc nofollow" target="_blank">Shyam Gala</a>, <a class="af jv" href="https://www.linkedin.com/in/ivern/" rel="noopener ugc nofollow" target="_blank">Javier Fernandez-Ivern</a>, <a class="af jv" href="https://www.linkedin.com/in/rokkampratap/" rel="noopener ugc nofollow" target="_blank">Anup Rokkam Pratap</a>, <a class="af jv" href="https://www.linkedin.com/in/shahdewang/" rel="noopener ugc nofollow" target="_blank">Devang Shah</a></p><p id="31c1" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Hundreds of millions of customers tune into Netflix every day, expecting an uninterrupted and immersive streaming experience. Behind the scenes, a myriad of systems and services are involved in orchestrating the product experience. These backend systems are consistently being evolved and optimized to meet and exceed customer and product expectations.</p><p id="b06e" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj"><strong class="iz ip">When undertaking system migrations, one of the main challenges is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.</strong></p><p id="fa21" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">The backend for the streaming product utilizes a highly distributed microservices architecture; hence these migrations also happen at different points of the service call graph. 
It can happen on an edge API system servicing customer devices, between the edge and mid-tier services, or from mid-tiers to data stores. Another relevant factor is that the migration could be happening on APIs that are stateless and idempotent, or it could be happening on stateful APIs.</p><p id="e0ed" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">We have categorized the tools and techniques we have used to facilitate these migrations into two high-level phases. The first phase involves validating functional correctness, scalability, and performance concerns and ensuring the new systems’ resilience before the migration. The second phase involves migrating the traffic over to the new systems in a manner that mitigates the risk of incidents while continually monitoring and confirming that we are meeting crucial metrics tracked at multiple levels. These include Quality-of-Experience (QoE) measurements at the customer device level, Service-Level-Agreements (SLAs), and business-level Key-Performance-Indicators (KPIs).</p><p id="b8a6" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj"><strong class="iz ip">This blog post will provide a detailed analysis of replay traffic testing, a versatile technique we have applied in the preliminary validation phase for multiple migration initiatives</strong>. <strong class="iz ip">In a follow-up blog post, we will focus on the second phase and look deeper at some of the tactical steps that we use to migrate the traffic over in a controlled manner.</strong></p><p id="b241" class="pw-post-body-paragraph ix iy io iz b ja ku jc jd je kv jg jh ji kw jk jl jm kx jo jp jq ky js jt ju ih bj">Replay traffic refers to production traffic that is cloned and forked over to a different path in the service call graph, allowing us to exercise new/updated systems in a manner that simulates actual production conditions. 
In this testing strategy, we execute a copy (replay) of production traffic against a system’s existing and new versions to perform relevant validations. This approach has a handful of benefits.</p><ul class=""><li id="fd7e" class="kz la io iz b ja jb je jf ji lb jm lc jq ld ju le lf lg lh bj">Replay traffic testing enables <strong class="iz ip">sandboxed testing at scale</strong> without significantly impacting production traffic or user experience.</li>
<li id="fde5" class="kz la io iz b ja li je lj ji it jm iu jq iv ju le lf lg lh bj">Utilizing cloned real traffic, we can exercise the diversity of inputs from a wide range of devices and device application software versions in production. This is particularly important for complex APIs that have many high cardinality inputs. Replay traffic provides the <strong class="iz ip">reach and coverage</strong> required to test the ability of the system to handle infrequently used input combinations and edge cases.</li>
<li id="2260" class="kz la io iz b ja li je lj ji it jm iu jq iv ju le lf lg lh bj">This technique facilitates validation on multiple fronts. It allows us to assert functional correctness and provides a mechanism to load test the system and <strong class="iz ip">tune the system and scaling parameters</strong> for optimal functioning.</li>
<li id="4138" class="kz la io iz b ja li je lj ji it jm iu jq iv ju le lf lg lh bj">By simulating a real production environment, we can <strong class="iz ip">characterize system performance</strong> over an extended period while considering the expected and unexpected traffic pattern shifts. It provides a good read on the availability and latency ranges under different production conditions.</li>
<li id="2bb0" class="kz la io iz b ja li je lj ji it jm iu jq iv ju le lf lg lh bj">Provides a platform to ensure that essential <strong class="iz ip">operational insights</strong>, metrics, logging, and alerting are in place before migration.</li>
</ul><h2 id="0a85" class="lk jx io be jy ll lm do kc ln lo dq kg ji lp lq lr jm ls lt lu jq lv lw lx ly bj">Replay Solution</h2><p id="5442" class="pw-post-body-paragraph ix iy io iz b ja ku jc jd je kv jg jh ji kw jk jl jm kx jo jp jq ky js jt ju ih bj">The replay traffic testing solution comprises two essential components.</p><ol class=""><li id="b62e" class="kz la io iz b ja jb je jf ji lb jm lc jq ld ju lz lf lg lh bj">Traffic Duplication and Correlation: The initial step requires the implementation of a mechanism to clone and fork production traffic to the newly established pathway, along with a process to record and correlate responses from the original and alternative routes.</li>
<li id="dad9" class="kz la io iz b ja li je lj ji it jm iu jq iv ju lz lf lg lh bj">Comparative Analysis and Reporting: Following traffic duplication and correlation, we need a framework to compare and analyze the responses recorded from the two paths and get a comprehensive report for the analysis.</li>
</ol><figure class="mb mc md me gu mf gi gj paragraph-image"><div role="button" tabindex="0" class="mg mh dj mi bg mj gi gj ma"><picture><img alt="" class="bg mk ml c" width="700" height="268" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mm mn gk gi gj mo mp be b bf z dl">Replay Testing Framework</figcaption></figure><p id="2a62" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">We have tried different approaches for the traffic duplication and recording step through various migrations, making improvements along the way. These include options where replay traffic generation is orchestrated on the device, on the server, and via a dedicated service. We will examine these alternatives in the upcoming sections.</p><p id="81d5" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Device Driven</p><p id="9cc0" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">In this option, the device makes a request on the production path and the replay path, then discards the response on the replay path. These requests are executed in parallel to minimize any potential delay on the production path. The selection of the replay path on the backend can be driven by the URL the device uses when making the request or by utilizing specific request parameters in routing logic at the appropriate layer of the service call graph. The device also includes a unique identifier with identical values on both paths, which is used to correlate the production and replay responses. The responses can be recorded at the most optimal location in the service call graph or by the device itself, depending on the particular migration.</p><figure class="mb mc md me gu mf gi gj paragraph-image"><div role="button" tabindex="0" class="mg mh dj mi bg mj gi gj mq"><picture><img alt="" class="bg mk ml c" width="700" height="284" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mm mn gk gi gj mo mp be b bf z dl">Device Driven Replay</figcaption></figure><p id="6e64" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">The device-driven approach’s obvious downside is that we are wasting device resources. There is also a risk of impact on device QoE, especially on low-resource devices. Adding forking logic and complexity to the device code can create dependencies on device application release cycles that generally run at a slower cadence than service release cycles, leading to bottlenecks in the migration. Moreover, allowing the device to execute untested server-side code paths can inadvertently expose an attack surface area for potential misuse.</p><p id="975b" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Server Driven</p><p id="a28f" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">To address the concerns of the device-driven approach, the other option we have used is to handle the replay concerns entirely on the backend. The replay traffic is cloned and forked in the appropriate service upstream of the migrated service. The upstream service calls the existing and new replacement services concurrently to minimize any latency increase on the production path. The upstream service records the responses on the two paths along with an identifier with a common value that is used to correlate the responses. This recording operation is also done asynchronously to minimize any impact on the latency on the production path.</p><figure class="mb mc md me gu mf gi gj paragraph-image"><div role="button" tabindex="0" class="mg mh dj mi bg mj gi gj mq"><picture><img alt="" class="bg mk ml c" width="700" height="285" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mm mn gk gi gj mo mp be b bf z dl">Server Driven Replay</figcaption></figure><p id="9349" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">The server-driven approach’s benefit is that the entire complexity of replay logic is encapsulated on the backend, and there is no wastage of device resources. Also, since this logic resides on the server side, we can iterate on any required changes faster. However, we are still inserting the replay-related logic alongside the production code that is handling business logic, which can result in unnecessary coupling and complexity. There is also an increased risk that bugs in the replay logic have the potential to impact production code and metrics.</p><p id="9863" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Dedicated Service</p><p id="df05" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">The latest approach that we have used is to completely isolate all components of replay traffic into a separate dedicated service. In this approach, we record the requests and responses for the service that needs to be updated or replaced to an offline event stream asynchronously. Quite often, this logging of requests and responses is already happening for operational insights. Subsequently, we use <a class="af jv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/stream-processing-with-mantis-78af913f51a6">Mantis</a>, a distributed stream processor, to capture these requests and responses and replay the requests against the new service or cluster while making any required adjustments to the requests. 
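Conceptually, this capture-adjust-replay step looks like the following (a simplified Python stand-in for the stream-processing job; the event shape, the `send` callable, and the header name are illustrative assumptions, not the actual Mantis implementation):

```python
# Minimal sketch of the dedicated replay job. The event fields, the
# `send` callable, and the "x-replay" header are hypothetical.
import uuid

def adjust_request(event, replay_host):
    """Rewrite a recorded production request so it targets the new service."""
    replayed = dict(event["request"])
    replayed["host"] = replay_host  # point at the replacement service/cluster
    replayed.setdefault("headers", {})["x-replay"] = "true"  # mark as shadow traffic
    return replayed

def replay_event(event, replay_host, send):
    """Replay one recorded request and emit a correlated response-pair record."""
    correlation_id = event.get("correlation_id") or str(uuid.uuid4())
    replay_response = send(adjust_request(event, replay_host))
    return {
        "correlation_id": correlation_id,
        "production": event["response"],  # recorded earlier on the production path
        "replay": replay_response,        # captured now from the new service
    }
```

The correlated record is what gets written out for offline analysis.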
After replaying the requests, this dedicated service also records the responses from the production and replay paths for offline analysis.</p><figure class="mb mc md me gu mf gi gj paragraph-image"><div role="button" tabindex="0" class="mg mh dj mi bg mj gi gj mr"><picture><img alt="" class="bg mk ml c" width="700" height="405" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mm mn gk gi gj mo mp be b bf z dl">Dedicated Replay Service</figcaption></figure><p id="5af7" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">This approach centralizes the replay logic in an isolated, dedicated code base. Apart from not consuming device resources and not impacting device QoE, this approach also reduces any coupling between production business logic and replay traffic logic on the backend. It also decouples any updates on the replay framework away from the device and service release cycles.</p><h2 id="ea7d" class="lk jx io be jy ll lm do kc ln lo dq kg ji lp lq lr jm ls lt lu jq lv lw lx ly bj">Analyzing Replay Traffic</h2><p id="dab9" class="pw-post-body-paragraph ix iy io iz b ja ku jc jd je kv jg jh ji kw jk jl jm kx jo jp jq ky js jt ju ih bj">Once we have run replay traffic and recorded a statistically significant volume of responses, we are ready for the comparative analysis and reporting component of replay traffic testing. Given the scale of the data being generated using replay traffic, we record the responses from the two sides to a cost-effective cold storage facility using technology like <a class="af jv" href="https://iceberg.apache.org/" rel="noopener ugc nofollow" target="_blank">Apache Iceberg</a>. We can then create offline distributed batch processing jobs to correlate &amp; compare the responses across the production and replay paths and generate detailed reports on the analysis.</p><p id="a912" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Normalization</p><p id="91df" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Depending on the nature of the system being migrated, the responses might need some preprocessing before being compared. For example, if some fields in the responses are timestamps, those will differ. 
Similarly, if there are unsorted lists in the responses, it might be best to sort them before comparing. In certain migration scenarios, there may be intentional alterations to the response generated by the updated service or component. For instance, a field that was a list in the original path is represented as key-value pairs in the new path. In such cases, we can apply specific transformations to the response on the replay path to simulate the expected changes. Based on the system and the associated responses, there might be other specific normalizations that we might apply to the response before we compare the responses.</p><p id="cfbc" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Comparison</p><p id="8713" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">After normalizing, we diff the responses on the two sides and check whether we have matching or mismatching responses. The batch job creates a high-level summary that captures some key comparison metrics. These include the total number of responses on both sides, the count of responses joined by the correlation identifier, matches and mismatches. The summary also records the number of passing/ failing responses on each path. This summary provides an excellent high-level view of the analysis and the overall match rate across the production and replay paths. Additionally, for mismatches, we record the normalized and unnormalized responses from both sides to another big data table along with other relevant parameters, such as the diff. We use this additional logging to debug and identify the root cause of issues driving the mismatches. 
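Stripped to its essentials, the normalize-and-compare logic described above might look like this (the flat-dictionary response shape and the field names are illustrative assumptions; the real analysis runs as distributed batch jobs over data in cold storage):

```python
# Sketch of normalize-then-compare. Response shape and field names are
# illustrative; real responses need migration-specific normalizations.
def normalize(resp, volatile_fields=("timestamp",), sort_fields=("streams",)):
    """Drop fields expected to differ and sort unordered lists before diffing."""
    out = {k: v for k, v in resp.items() if k not in volatile_fields}
    for field in sort_fields:
        if field in out:
            out[field] = sorted(out[field])
    return out

def compare(prod, replay):
    """Join two response maps keyed by correlation id and summarize matches."""
    summary = {"prod_total": len(prod), "replay_total": len(replay),
               "joined": 0, "matches": 0, "mismatches": []}
    for cid in sorted(prod.keys() & replay.keys()):
        summary["joined"] += 1
        a, b = normalize(prod[cid]), normalize(replay[cid])
        if a == b:
            summary["matches"] += 1
        else:
            # record which normalized fields diverged, for root-cause debugging
            diff = {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}
            summary["mismatches"].append({"id": cid, "diff": sorted(diff)})
    return summary
```

The summary mirrors the high-level report: totals on each side, joined count, matches, and per-mismatch diffs.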
Once we discover and address those issues, we can use the replay testing process iteratively to bring down the mismatch percentage to an acceptable number.</p><p id="07a6" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Lineage</p><p id="b006" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">When comparing responses, a common source of noise arises from the utilization of non-deterministic or non-idempotent dependency data for generating responses on the production and replay pathways. For instance, envision a response payload that delivers media streams for a playback session. The service responsible for generating this payload consults a metadata service that provides all available streams for the given title. Various factors can lead to the addition or removal of streams, such as identifying issues with a specific stream, incorporating support for a new language, or introducing a new encode. Consequently, there is a potential for discrepancies in the sets of streams used to determine payloads on the production and replay paths, resulting in divergent responses.</p><p id="667c" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">A comprehensive summary of data versions or checksums for all dependencies involved in generating a response, referred to as a lineage, is compiled to address this challenge. Discrepancies can be identified and discarded by comparing the lineage of both production and replay responses in the automated jobs analyzing the responses. 
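A minimal sketch of that lineage filter, assuming each recorded response carries a map of dependency name to data version or checksum (the dependency names and record shape here are illustrative):

```python
# Sketch of lineage-based filtering. The "lineage" map of dependency
# name -> data version/checksum is an assumed record shape.
def comparable(prod_record, replay_record):
    """A pair is comparable only if both sides saw identical dependency data."""
    return prod_record["lineage"] == replay_record["lineage"]

def filter_pairs(pairs):
    """Keep comparable pairs; count discarded ones as noise, not mismatches."""
    kept = [(p, r) for p, r in pairs if comparable(p, r)]
    discarded = len(pairs) - len(kept)
    return kept, discarded
```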
This approach mitigates the impact of noise and ensures accurate and reliable comparisons between production and replay responses.</p><h2 id="f724" class="lk jx io be jy ll lm do kc ln lo dq kg ji lp lq lr jm ls lt lu jq lv lw lx ly bj">Comparing Live Traffic</h2><p id="b24b" class="pw-post-body-paragraph ix iy io iz b ja ku jc jd je kv jg jh ji kw jk jl jm kx jo jp jq ky js jt ju ih bj">An alternative method to recording responses and performing the comparison offline is to perform a live comparison. In this approach, we do the forking of the replay traffic on the upstream service as described in the `Server Driven` section. The service that forks and clones the replay traffic directly compares the responses on the production and replay path and records relevant metrics. This option is feasible if the response payload isn’t very complex, such that the comparison doesn’t significantly increase latencies or if the services being migrated are not on the critical path. Logging is selective to cases where the old and new responses do not match.</p><figure class="mb mc md me gu mf gi gj paragraph-image"><div role="button" tabindex="0" class="mg mh dj mi bg mj gi gj ms"><picture><img alt="" class="bg mk ml c" width="700" height="169" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mm mn gk gi gj mo mp be b bf z dl">Replay Traffic Analysis</figcaption></figure><h2 id="a90f" class="lk jx io be jy ll lm do kc ln lo dq kg ji lp lq lr jm ls lt lu jq lv lw lx ly bj">Load Testing</h2><p id="89c2" class="pw-post-body-paragraph ix iy io iz b ja ku jc jd je kv jg jh ji kw jk jl jm kx jo jp jq ky js jt ju ih bj">Besides functional testing, replay traffic allows us to stress test the updated system components. We can regulate the load on the replay path by controlling the amount of traffic being replayed and the new service’s horizontal and vertical scale factors. This approach allows us to evaluate the performance of the new services under different traffic conditions. We can see how the availability, latency, and other system performance metrics, such as CPU consumption, memory consumption, garbage collection rate, etc, change as the load factor changes. Load testing the system using this technique allows us to identify performance hotspots using actual production traffic profiles. It helps expose memory leaks, deadlocks, caching issues, and other system issues. It enables the tuning of thread pools, connection pools, connection timeouts, and other configuration parameters. Further, it helps in the determination of reasonable scaling policies and estimates for the associated cost and the broader cost/risk tradeoff.</p><h2 id="2dc4" class="lk jx io be jy ll lm do kc ln lo dq kg ji lp lq lr jm ls lt lu jq lv lw lx ly bj">Stateful Systems</h2><p id="e519" class="pw-post-body-paragraph ix iy io iz b ja ku jc jd je kv jg jh ji kw jk jl jm kx jo jp jq ky js jt ju ih bj">We have extensively utilized replay testing to build confidence in migrations involving stateless and idempotent systems. Replay testing can also validate migrations involving stateful systems, although additional measures must be taken. 
The production and replay paths must have distinct and isolated data stores that are in identical states before enabling the replay of traffic. Additionally, all different request types that drive the state machine must be replayed. In the recording step, apart from the responses, we also want to capture the state associated with that specific response. Correspondingly in the analysis phase, we want to compare both the response and the related state in the state machine. Given the overall complexity of using replay testing with stateful systems, we have employed other techniques in such scenarios. We will look at one of them in the follow-up blog post in this series.</p><p id="26ee" class="pw-post-body-paragraph ix iy io iz b ja ku jc jd je kv jg jh ji kw jk jl jm kx jo jp jq ky js jt ju ih bj">We have adopted replay traffic testing at Netflix for numerous migration projects. A recent example involved leveraging replay testing to validate an extensive re-architecture of the edge APIs that drive the playback component of our product. Another instance included migrating a mid-tier service from REST to gRPC. In both cases, replay testing facilitated comprehensive functional testing, load testing, and system tuning at scale using real production traffic. This approach enabled us to identify elusive issues and rapidly build confidence in these substantial redesigns.</p><p id="de42" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Upon concluding replay testing, we are ready to start introducing these changes in production. In an upcoming blog post, we will look at some of the techniques we use to roll out significant changes to the system to production in a gradual risk-controlled way while building confidence via metrics at different levels.</p>]]></description>
      <link>https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-1-ba1c7a1c7835</link>
      <guid>https://netflixtechblog.com/migrating-critical-traffic-at-scale-with-no-downtime-part-1-ba1c7a1c7835</guid>
      <pubDate>Thu, 04 May 2023 23:32:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Improved Alerting with Atlas Streaming Eval]]></title>
      <description><![CDATA[<p id="df80" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj"><a class="af jv" href="https://www.linkedin.com/in/ruchir-jha-9a861616/" rel="noopener ugc nofollow" target="_blank">Ruchir Jha</a>, <a class="af jv" href="https://www.linkedin.com/in/brharrington/" rel="noopener ugc nofollow" target="_blank">Brian Harrington</a>, <a class="af jv" href="https://www.linkedin.com/in/yingwu-zhao-62037418/" rel="noopener ugc nofollow" target="_blank">Yingwu Zhao</a></p><p id="cd38" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">TL;DR</p><ul class=""><li id="e583" class="jw jx io iz b ja jb je jf ji jy jm jz jq ka ju kb kc kd ke bj">Streaming alert evaluation scales much better than the traditional approach of polling time-series databases.</li>
<li id="26a2" class="jw jx io iz b ja kf je kg ji it jm iu jq iv ju kb kc kd ke bj">It allows us to overcome the high-dimensionality/cardinality limitations of the time-series database.</li>
<li id="d38f" class="jw jx io iz b ja kf je kg ji it jm iu jq iv ju kb kc kd ke bj">It opens the door to supporting more exciting use cases.</li>
</ul><figure class="ki kj kk kl gu km gi gj paragraph-image"><div role="button" tabindex="0" class="kn ko dj kp bg kq gi gj kh"><picture><img alt="" class="bg kr ks c" width="700" height="328" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="5990" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Engineers want their alerting system to be realtime, reliable, and actionable. While actionability is subjective and may vary by use-case, reliability is non-negotiable. In other words, false positives are bad but false negatives are the absolute worst!</p><p id="ee02" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">A few years ago, we were paged by our SRE team due to our Metrics Alerting System falling behind — critical application health alerts reached engineers 45 minutes late! As we investigated the alerting delay, we found that the number of configured alerts had recently increased dramatically, by 5 times! The alerting system queried <a class="af jv" href="https://github.com/Netflix/atlas" rel="noopener ugc nofollow" target="_blank">Atlas</a>, our time series database on a cron for each configured alert query, and was seeing an elevated throttle rate and excessive retries with backoffs. This, in turn, increased the time between two consecutive checks for an alert, causing a global slowdown for all alerts. On further investigation, we discovered that one user had programmatically created tens of thousands of new alerts. This user represented a platform team at Netflix, and their goal was to build alerting automation for their users.</p><p id="b72f" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">While we were able to put out the immediate fire by disabling the newly created alerts, this incident raised some critical concerns around the scalability of our alerting system. 
We also heard from other platform teams at Netflix who wanted to build similar automation for their users who, given our state at the time, wouldn’t have been able to do so without impacting Mean Time To Detect (MTTD) for all others. Rather, we were looking at an order of magnitude increase in the number of alert queries just over the next 6 months!</p><p id="8dbd" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Since querying Atlas was the bottleneck, our first instinct was to scale it up to meet the increased alert query demand; however, we soon realized that would increase Atlas cost prohibitively. Atlas is an in-memory time-series database that ingests multiple billions of time-series per day and retains the last two weeks of data. It is already one of the largest services at Netflix both in size and cost. While Atlas <a class="af jv" href="https://netflix.github.io/atlas-docs/overview/" rel="noopener ugc nofollow" target="_blank">is architected</a> around compute &amp; storage separation, and we could theoretically just scale the query layer to meet the increased query demand, every query, regardless of its type, has a data component that needs to be pushed down to the storage layer. To serve the increasing number of push down queries, the in-memory storage layer would need to scale up as well, and it became clear that this would push the already expensive storage costs far higher. Moreover, common database optimizations like caching recently queried data don’t really work for alerting queries because, generally speaking, the last received datapoint is required for correctness. 
Take for example, this alert query that checks if errors as a % of total RPS exceeds a threshold of 50% for 4 out of the last 5 minutes:</p><pre class="ki kj kk kl gu kt ku kv bo kw kx ky">name,errors,:eq,:sum,name,rps,:eq,:sum,:div,100,:mul,50,:gt,5,:rolling-count,4,:gt,</pre><p id="a552" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Say if the datapoint received for the last time interval leads to a positive evaluation for this query, relying on stale/cached data would either increase MTTD or result in the perception of a false negative, at least until the missing data is fetched and evaluated. It became clear to us that we needed to solve the scalability problem with a fundamentally different approach. Hence, we started down the path of alert evaluation via real-time <a class="af jv" href="https://github.com/Netflix/atlas/tree/main/atlas-eval" rel="noopener ugc nofollow" target="_blank">streaming metrics</a>.</p><p id="302d" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj"><strong class="iz ip">High Level Architecture</strong></p><p id="9685" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">The idea, at a high level, was to avoid the need to query the Atlas database almost entirely and transition most alert queries to streaming evaluation.</p><figure class="ki kj kk kl gu km gi gj paragraph-image"><div role="button" tabindex="0" class="kn ko dj kp bg kq gi gj lf"><picture><img alt="" class="bg kr ks c" width="700" height="153" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="425b" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Alert queries are submitted either via our Alerting UI or by API clients, which are then saved to a custom config database that supports streaming config updates (full snapshot + update notifications). The Alerting Service receives these config updates and hashes every new or updated alert query for evaluation to one of its nodes by leveraging <a class="af jv" href="https://netflix.github.io/edda/rest-api/#apiv2group" rel="noopener ugc nofollow" target="_blank">Edda Slots</a>. The node responsible for evaluating a query, starts by breaking it down into a set of “data expressions” and with them subscribes to an upstream “broker” service. Data expressions define what data needs to be sourced in order to evaluate a query. For the example query listed above, the data expressions are name,errors,:eq,:sum and name,rps,:eq,:sum. The broker service acts as a subscription manager that maps a data expression to a set of subscriptions. In addition, it also maintains a Query Index of all active data expressions which is consulted to discern if an incoming datapoint is of interest to an active subscriber. The internals here are outside the scope of this blog post.</p><p id="572a" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Next, the Alerting service (via the <a class="af jv" href="https://github.com/Netflix/atlas/tree/main/atlas-eval" rel="noopener ugc nofollow" target="_blank">atlas-eval</a> library) maps the received data points for a data expression to the alert query that needs them. For alert queries that resolve to more than one data expression, we align the incoming data points for each one of those data expressions on the same time boundary before emitting the accumulated values to the final eval step. 
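As a toy illustration of this alignment and the final eval for the example query, the logic can be sketched as follows (an in-memory stand-in, not the actual atlas-eval implementation):

```python
# Toy evaluator for: errors/rps * 100 > 50 for 4 of the last 5 intervals.
# An in-memory illustration only; atlas-eval is distributed and streaming.
from collections import defaultdict, deque

class RatioAlert:
    def __init__(self, threshold=50.0, window=5, required=4):
        self.threshold, self.required = threshold, required
        self.pending = defaultdict(dict)      # time boundary -> {data expr: value}
        self.breaches = deque(maxlen=window)  # rolling window of breach flags

    def on_datapoint(self, timestamp, expr, value):
        """Accumulate datapoints; evaluate once both data expressions align."""
        slot = self.pending[timestamp]
        slot[expr] = value
        if {"errors", "rps"} <= slot.keys():
            del self.pending[timestamp]
            ratio = slot["errors"] / slot["rps"] * 100  # final eval: the ratio
            self.breaches.append(ratio > self.threshold)
            # rolling-count of breaching intervals, compared to the bound
            return sum(self.breaches) >= self.required
        return None  # still waiting on the other data expression
```

Each call returns `None` until the boundary is complete, then the alert state for that interval.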
For the example above, the final eval step would be responsible for computing the ratio and maintaining the <a class="af jv" href="https://netflix.github.io/atlas-docs/asl/ref/rolling-count/" rel="noopener ugc nofollow" target="_blank">rolling-count</a>, which is keeping track of the number of intervals in which the ratio crossed the threshold as shown below:</p><figure class="ki kj kk kl gu km gi gj paragraph-image"><div role="button" tabindex="0" class="kn ko dj kp bg kq gi gj lg"><picture><img alt="" class="bg kr ks c" width="700" height="252" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="9e47" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">The atlas-eval library supports streaming evaluation for most if not all <a class="af jv" href="https://github.com/Netflix/atlas/wiki/Reference-query" rel="noopener ugc nofollow" target="_blank">Query</a>, <a class="af jv" href="https://github.com/Netflix/atlas/wiki/Reference-data" rel="noopener ugc nofollow" target="_blank">Data</a>, <a class="af jv" href="https://github.com/Netflix/atlas/wiki/Reference-math" rel="noopener ugc nofollow" target="_blank">Math</a> and <a class="af jv" href="https://github.com/Netflix/atlas/wiki/Reference-stateful" rel="noopener ugc nofollow" target="_blank">Stateful</a> operators supported by Atlas today. Certain operators such as <a class="af jv" href="https://netflix.github.io/atlas-docs/asl/ref/offset/" rel="noopener ugc nofollow" target="_blank">offset</a>, <a class="af jv" href="https://netflix.github.io/atlas-docs/asl/ref/integral/" rel="noopener ugc nofollow" target="_blank">integral</a>, <a class="af jv" href="https://netflix.github.io/atlas-docs/asl/ref/des/" rel="noopener ugc nofollow" target="_blank">des</a> are not supported on the streaming path.</p><p id="c712" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj"><strong class="iz ip">OK, Results?</strong></p><p id="45a2" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">First and foremost, we have successfully alleviated our initial scalability problem with the polling based architecture. Today, we run 20X the number of queries we used to run a few years ago, with ease and at a fraction of what it would have cost to scale up the Atlas storage layer to serve the same volume. 
Multiple platform teams at Netflix programmatically generate and maintain alerts on behalf of their users without having to worry about impacting other users of the system. We are able to maintain strong SLAs around Mean Time To Detect (MTTD) regardless of the number of alerts being evaluated by the system.</p><p id="b313" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Additionally, streaming evaluation allowed us to relax restrictions around high cardinality that our users were previously running into — alert queries that were rejected by Atlas Backend before due to cardinality constraints are now getting checked correctly on the streaming path. In addition, we are able to use Atlas Streaming to monitor and alert on some very high cardinality use-cases, such as metrics derived from free-form log data.</p><p id="e09e" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Finally, we switched <a class="af jv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba">Telltale</a>, our holistic application health monitoring system, from polling a metrics cache to using realtime Atlas Streaming. The fundamental idea behind Telltale is to detect anomalies on SLI metrics (for example, latency, error rates, etc). When such anomalies are detected, Telltale is able to compute correlations with similar metrics emitted from either upstream or downstream services. In addition, it also computes correlations between SLI metrics and custom metrics like the log derived metrics mentioned above. This has proven to be valuable towards reducing Mean Time to Recover (MTTR). 
For example, we are able to now correlate increased error rates with increased rate of specific exceptions occurring in logs and even point to an exemplar stacktrace, as shown below:</p><figure class="ki kj kk kl gu km gi gj paragraph-image"><div role="button" tabindex="0" class="kn ko dj kp bg kq gi gj lh"><picture><img alt="" class="bg kr ks c" width="700" height="525" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="249d" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Our logs pipeline fingerprints every log message and attaches a (very high cardinality) fingerprint tag to a log events counter that is then emitted to Atlas Streaming. Telltale consumes this metric in a streaming fashion to identify fingerprints that correlate with anomalies seen in SLI metrics. Once an anomaly is found, we query the logs backend with the fingerprint hash to obtain the exemplar stacktrace. What’s more is we are now able to identify correlated anomalies (and exceptions) occurring in services that may be N hops away from the affected service. A system like Telltale becomes more effective as more services are onboarded (and for that matter the full service graph), because otherwise it becomes difficult to root cause the problem, especially in a microservices-based architecture. A few years ago, as noted in this <a class="af jv" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba">blog</a>, only about a hundred services were using Telltale; thanks to Atlas Streaming we have now managed to onboard thousands of other services at Netflix.</p><p id="110e" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Finally, we realized that once you remove limits on the number of monitored queries, and start supporting much higher metric dimensionality/cardinality without impacting the cost/performance profile of the system, it opens doors to many exciting new possibilities. For example, to make alerts more actionable, we may now be able to compute correlations between SLI anomalies and custom metrics with high cardinality dimensions, for example an alert on elevated HTTP error rates may be able to point to impacted customer cohorts, by linking to precisely correlated exemplars. 
This would help developers with reproducibility.</p><p id="9d68" class="pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj">Transitioning to the streaming path has been a long journey for us. One of the challenges was the difficulty of debugging scenarios where the streaming path didn’t agree with what was returned by querying the Atlas database. This is especially true when either the data is not available in Atlas or the query is not supported because of (say) cardinality constraints. This is one of the reasons it has taken us years to get here. That said, early signs indicate that the streaming paradigm may help with tackling a cardinal problem in observability — effective correlation between the metrics &amp; events verticals (logs, and potentially traces in the future), and we are excited to explore the opportunities that this presents for Observability in general.</p>]]></description>
      <link>https://netflixtechblog.com/improved-alerting-with-atlas-streaming-eval-e691c60dc61e</link>
      <guid>https://netflixtechblog.com/improved-alerting-with-atlas-streaming-eval-e691c60dc61e</guid>
      <pubDate>Thu, 27 Apr 2023 22:52:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Building a Media Understanding Platform for ML Innovations]]></title>
      <description><![CDATA[<p id="9dc9" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">By <a class="ae kl" href="https://www.linkedin.com/in/gurutahasildar/" rel="noopener ugc nofollow" target="_blank">Guru Tahasildar</a>, <a class="ae kl" href="https://www.linkedin.com/in/amirziai/" rel="noopener ugc nofollow" target="_blank">Amir Ziai</a>, <a class="ae kl" href="https://www.linkedin.com/in/peachpie/" rel="noopener ugc nofollow" target="_blank">Jonathan Solórzano-Hamilton</a>, <a class="ae kl" href="https://www.linkedin.com/in/kelli-griggs-32990125/" rel="noopener ugc nofollow" target="_blank">Kelli Griggs</a>, <a class="ae kl" href="https://www.linkedin.com/in/vi-pallavika-iyengar-144abb1b/" rel="noopener ugc nofollow" target="_blank">Vi Iyengar</a></p><p id="aca5" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Netflix leverages machine learning to create the best media for our members. Earlier we shared the details of one of these <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/match-cutting-at-netflix-finding-cuts-with-smooth-visual-transitions-31c3fc14ae59">algorithms</a>, introduced how our platform team is evolving the <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243">media-specific machine learning ecosystem</a>, and discussed how data from these algorithms gets stored in our <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scalable-annotation-service-marken-f5ba9266d428">annotation service</a>.</p><p id="8605" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Much of the ML literature focuses on model training, evaluation, and scoring. 
In this post, we will explore an understudied aspect of the ML lifecycle: integration of model outputs into applications.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi lp"><picture><img alt="" class="bf lz ma c" width="700" height="394" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk">An example of using Machine Learning to find shots of Eleven in <a class="ae kl" href="https://www.netflix.com/title/80057281" rel="noopener ugc nofollow" target="_blank">Stranger Things</a> and surfacing the results in a studio application for the consumption of Netflix video editors.</figcaption></figure><p id="c435" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Specifically, we will dive into the architecture that powers search capabilities for studio applications at Netflix. We discuss specific problems that we have solved using Machine Learning (ML) algorithms, review different pain points that we addressed, and provide a technical overview of our new platform.</p><p id="c663" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">At Netflix, we aim to bring joy to our members by providing them with the opportunity to experience outstanding content. There are two components to this experience. First, we must provide the content that will bring them joy. Second, we must make it effortless and intuitive to choose from our library. We must quickly surface the most stand-out highlights from the titles available on our service in the form of images and videos in the member experience.</p><p id="acf4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">These multimedia assets, or “supplemental” assets, don’t just come into existence. Artists and video editors must create them. We build creator tooling to enable these colleagues to focus their time and energy on creativity. Unfortunately, much of their energy goes into labor-intensive pre-work.
A key opportunity is to automate these mundane tasks.</p><h2 id="5214" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Use case #1: Dialogue search</h2><p id="987d" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Dialogue is a central aspect of storytelling. One of the best ways to tell an engaging story is through the mouths of the characters. Punchy or memorable lines are a prime target for trailer editors. The manual method for identifying such lines is a watchdown (aka breakdown).</p><p id="d5c7" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">An editor watches the title start-to-finish, transcribes memorable words and phrases with a timecode, and retrieves the snippet later if the quote is needed. An editor can choose to do this quickly and only jot down the most memorable moments, but will have to rewatch the content if they miss something they need later. Or, they can do it thoroughly and transcribe the entire piece of content ahead of time. In the words of one of our editors:</p><blockquote class="mr ms mt">
<p id="d946" class="jn jo mu jp b jq jr js jt ju jv jw jx mv jz ka kb mw kd ke kf mx kh ki kj kk ij bi">Watchdowns / breakdown are very repetitive and waste countless hours of creative time!</p>
</blockquote><p id="ef99" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Scrubbing through hours of footage (or dozens of hours if working on a series) to find a single line of dialogue is profoundly tedious. In some cases, editors need to search across many shows, and doing it manually is not feasible. But what if scrubbing and transcribing dialogue were not needed at all?</p><p id="0498" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Ideally, we want to enable dialogue search that supports the following features:</p><ul class=""><li id="c077" class="my mz iq jp b jq jr ju jv jy na kc nb kg nc kk nd ne nf ng bi">Search across one title, a subset of titles (e.g. all dramas), or the entire catalog</li>
<li id="3f12" class="my mz iq jp b jq nh ju ni jy nj kc nk kg nl kk nd ne nf ng bi">Search by character or talent</li>
<li id="64b9" class="my mz iq jp b jq nh ju ni jy nj kc nk kg nl kk nd ne nf ng bi">Multilingual search</li>
</ul><h2 id="f643" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Use case #2: Visual search</h2><p id="faee" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">A picture is worth a thousand words. Visual storytelling can help make complex stories easier to understand, and as a result, deliver a more impactful message.</p><p id="9199" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Artists and video editors routinely need specific visual elements to include in artworks and trailers. They may scrub for frames, shots, or scenes of specific characters, locations, objects, events (e.g. a car chase scene in an action movie), or attributes (e.g. a close-up shot). What if we could enable users to find visual elements using natural language?</p><p id="13a1" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Here is an example of the desired output when the user searches for “red race car” across the entire content library.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi nm"><picture><img alt="Screenshot from an internal application where user is shown thumbnail preview of “red race car” results from different titles." class="bf lz ma c" width="700" height="441" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk">User searching for “red race car”</figcaption></figure><h2 id="2130" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Use case #3: Reverse shot search</h2><p id="e7e2" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Natural-language visual search offers editors a powerful tool. But what if they already have a shot in mind, and they want to find something that just <em class="mu">looks</em> similar? For instance, let’s say that an editor has found a visually stunning shot of a plate of food from <a class="ae kl" href="https://www.netflix.com/title/80007945" rel="noopener ugc nofollow" target="_blank">Chef’s Table</a>, and she’s interested in finding similar shots across the entire show.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi nn"><picture><img alt="Input image on left side of food on a decorative plate and output images on right side of different food items that look similar to input image." class="bf lz ma c" width="700" height="525" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk">User provides a sample image to find other similar images</figcaption></figure><h2 id="b94c" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Approach #1: on-demand batch processing</h2><p id="a9e8" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Our first approach to surface these innovations was a tool to trigger these algorithms on-demand and on a per-show basis. We implemented a batch processing system for users to submit their requests and wait for the system to generate the output. Processing took several hours to complete. Some ML algorithms are computationally intensive. Many of the samples provided had a significant number of frames to process. A typical 1 hour video could contain over 80,000 frames!</p><p id="08d3" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">After waiting for processing, users downloaded the generated algo outputs for offline consumption. This limited pilot system greatly reduced the time spent by our users to manually analyze the content. Here is a visualization of this flow.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi no"><picture><img alt="Sequence diagram showing how different entities interact with each other in case of batch processing system." class="bf lz ma c" width="700" height="522" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk">On-demand batch processing system flow</figcaption></figure><h2 id="a3b2" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Approach #2: enabling online request with pre-computation</h2><p id="207d" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">After the success of this approach, we decided to add online support for a couple of algorithms. For the first time, users were able to discover matches across the entire catalog, oftentimes finding moments they never even knew existed. They didn’t need any time-consuming local setup, and there were no delays since the data was already pre-computed.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi nm"><picture><img alt="Sequence diagram showing how different entities interact with each other for online interactive system." class="bf lz ma c" width="700" height="397" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk">Interactive system with pre-computed data flow</figcaption></figure><p id="4546" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The following quote exemplifies the positive reception by our users:</p><blockquote class="mr ms mt">
<p id="3216" class="jn jo mu jp b jq jr js jt ju jv jw jx mv jz ka kb mw kd ke kf mx kh ki kj kk ij bi">“We wanted to find all the shots of the dining room in a show. In seconds, we had what normally would have taken 1–2 people hours/a full day to do, look through all the shots of the dining room from all 10 episodes of the show. Incredible!” <a class="ae kl" href="https://www.linkedin.com/in/dawn-ec/" rel="noopener ugc nofollow" target="_blank">Dawn Chenette</a>, Design Lead</p>
</blockquote><p id="bb46" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">This approach had several benefits for product engineering. It allowed us to transparently update the algo data without users knowing about it. It also provided insights into query patterns and algorithms that were gaining traction among users. In addition, we were able to perform a handful of A/B tests to validate or invalidate our hypotheses for tuning the search experience.</p><p id="c1d5" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Our early efforts to deliver ML insights to creative professionals proved valuable. At the same time, we experienced growing engineering pains that limited our ability to scale.</p><p id="89d1" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Maintaining disparate systems posed a challenge. They were first built by different teams on different stacks, so maintenance was expensive. Whenever ML researchers finished a new algorithm, they had to integrate it separately into each system. We were near the breaking point with just two systems and a handful of algorithms. We knew this would only worsen as we expanded to more use cases and more researchers.</p><p id="9e50" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The online application unlocked interactivity for our users and validated our direction. However, it was not scaling well. Adding new algos and onboarding new use cases was still time-consuming and required the effort of too many engineers. These investments in one-to-one integrations were volatile, with implementation timelines varying from a few weeks to several months.
Due to the bespoke nature of the implementation, we lacked catalog wide searches for all available ML sources.</p><p id="d8d6" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">In summary, this model was a tightly-coupled application-to-data architecture, where machine learning algos were mixed with the backend and UI/UX software code stack. To address the variance in the implementation timelines we needed to standardize how different algorithms were integrated — starting from how they were executed to making the data available to all consumers consistently. As we developed more media understanding algos and wanted to expand to additional use cases, we needed to invest in system architecture redesign to enable researchers and engineers from different teams to innovate independently and collaboratively. Media Search Platform (MSP) is the initiative to address these requirements.</p><p id="4558" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Although we were just getting started with <em class="mu">media-search</em>, search itself is not new to Netflix. We have a mature and robust search and recommendation functionality exposed to millions of our subscribers. We knew we could leverage learnings from our colleagues who are responsible for building and innovating in this space. 
In keeping with our “<a class="ae kl" href="https://jobs.netflix.com/culture" rel="noopener ugc nofollow" target="_blank">highly aligned, loosely coupled</a>” culture, we wanted to enable engineers to onboard and improve algos quickly and independently, while making it easy for Studio and product applications to integrate with the media understanding algo capabilities.</p><p id="1d12" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Making the platform modular, pluggable and configurable was key to our success. This approach allowed us to keep the distributed ownership of the platform. It simultaneously enabled different specialized teams to contribute relevant components of the platform. We used services already available for other use cases and extended their capabilities to support new requirements.</p><p id="0b86" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Next, we will discuss the system architecture and describe how different modules interact with each other in the end-to-end flow.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi np"><picture><img alt="Architecture diagram showing different sub-modules involved in the system." class="bf lz ma c" width="700" height="843" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk">System Architecture</figcaption></figure><p id="8bb4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Netflix engineers strive to iterate rapidly and prefer the “MVP” (minimum viable product) approach to receive early feedback and minimize the upfront investment costs. Thus, we didn’t build all the modules completely. We scoped the pilot implementation to ensure immediate functionalities were unblocked. At the same time, we kept the design open enough to allow future extensibility. We will highlight a few examples below as we discuss each component separately.</p><h2 id="cdf2" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Interfaces - API &amp; Query</h2><p id="de61" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Starting at the top of the diagram, the platform allows apps to interact with it using either gRPC or GraphQL interfaces. Having diversity in the interfaces is essential to meet app developers where they are. At Netflix, gRPC is predominantly used in backend-to-backend communication. With active GraphQL tooling provided by our developer productivity teams, GraphQL has become a de facto choice for UI-to-backend integration. You can find more about what the team has built and how it is getting used in <a class="ae kl" href="https://netflixtechblog.com/tagged/graphql" rel="noopener ugc nofollow" target="_blank">these blog posts</a>.
In particular, we have been relying on <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/open-sourcing-the-netflix-domain-graph-service-framework-graphql-for-spring-boot-92b9dcecda18">Domain Graph Service</a> Framework for this project.</p><p id="ce1a" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">During the query schema design, we accounted for future use cases and ensured that the schema would allow future extensions. We aimed to keep the schema generic enough that it hides implementation details of the actual search systems used to execute the query. Additionally, it is intuitive and easy to understand, yet feature-rich, so that it can be used to express complex queries. Users have the flexibility to perform multimodal search, with the input being a simple text term, an image, or a short video. As discussed earlier, search could be performed against the entire Netflix catalog, or it could be limited to specific titles. Users may prefer results that are organized in some way, such as grouped by movie or sorted by timestamp. When there are a large number of matches, we allow users to paginate the results (with a configurable page size) instead of fetching all or a fixed number of results.</p><h2 id="bec4" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Search Gateway</h2><p id="47e2" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">The client-generated input query is first given to the Query processing system. Since most of our users perform targeted queries, such as a search for the dialogue “friends don’t lie” (from the above example), today this stage performs lightweight processing and provides a hook to integrate A/B testing.
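As a rough sketch of the query shape and pagination behavior described above, the following models a multimodal, scoped, paginated search request. The class and field names are hypothetical, not the platform's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical model of the search query: multimodal input, optional
# title scoping, grouping/sorting preferences, and pagination.
@dataclass
class SearchQuery:
    text: Optional[str] = None             # simple text term
    image_url: Optional[str] = None        # or an image / short video reference
    title_ids: Optional[List[str]] = None  # None means the entire catalog
    group_by: Optional[str] = None         # e.g. "movie"
    sort_by: str = "timestamp"
    page_size: int = 25                    # configurable page size
    page: int = 0

def paginate(results: list, query: SearchQuery) -> list:
    """Return one page of results instead of all matches."""
    start = query.page * query.page_size
    return results[start:start + query.page_size]
```

A query such as `SearchQuery(text="red race car", page_size=10, page=1)` would then yield only the second page of matches.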
In the future we plan to evolve it into a “query understanding system” to support free-form searches, reduce the burden on users, and simplify client-side query generation.</p><p id="9af5" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The query processing stage modifies queries to match the target data set. This includes “embedding” transformation and translation. For queries against embedding-based data sources, it transforms the input, such as text or an image, into the corresponding vector representation. Each data source or algorithm could use a different encoding technique, so this stage ensures that the corresponding encoding is also applied to the provided query. One reason we need different encoding techniques per algorithm is that an image, which has a single frame, is processed differently from a video, which contains a sequence of frames.</p><p id="e4ee" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">With global expansion, we have users for whom English is not a primary language. All of the text-based models in the platform are trained on English, so we translate non-English text to English. Although the translation is not always perfect, it has worked well in our case and has expanded the eligible user base for our tool to non-English speakers.</p><p id="f96b" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Once the query is transformed and ready for execution, we delegate search execution to one or more of the searcher systems. First, we need to determine which query should be routed to which system. This is handled by the Query router and Searcher-proxy module. For the initial implementation we have relied on a single searcher for executing all the queries.
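The translation and per-algorithm encoding steps above can be sketched as follows. The encoder registry and the translation stub are illustrative stand-ins, not Netflix's actual models or services:

```python
from typing import Callable, Dict, List

# Hypothetical registry: each algorithm registers its own encoder, so the
# query-processing stage can apply the encoding that matches the
# algorithm's embedding space.
ENCODERS: Dict[str, Callable[[str], List[float]]] = {}

def register_encoder(algo: str, fn: Callable[[str], List[float]]) -> None:
    ENCODERS[algo] = fn

def translate_to_english(text: str) -> str:
    # Placeholder: a real system would call a translation service here.
    return text

def process_query(text: str, algo: str) -> List[float]:
    """Translate the query, then encode it for the target algorithm."""
    text = translate_to_english(text)
    encode = ENCODERS[algo]   # per-algorithm encoding technique
    return encode(text)
```

Registering a second encoder for a different algorithm would route the same text query into a different vector space, which is exactly why the stage must know the target data source.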
Our extensible approach meant the platform could support additional searchers, which have already been used to prototype new algorithms and experiments.</p><p id="feb4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">A search may intersect or aggregate the data from multiple algorithms, so this layer can fan out a single query into multiple search executions. We have implemented a “searcher-proxy” inside this layer for each supported searcher. Each proxy is responsible for mapping the input query to the one expected by the corresponding searcher. It then consumes the raw response from the searcher before handing it over to the Results post-processor component.</p><p id="82b8" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The Results post-processor works on the results returned by one or more searchers. It can rank results by applying custom scoring or populate search recommendations based on other similar searches. Another functionality we are evaluating with this layer is to dynamically create different <em class="mu">views</em> from the same underlying data.</p><p id="3e22" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">For ease of coordination and maintenance, we abstracted the query processing and response handling in a module called the Search Gateway.</p><h2 id="7688" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Searchers</h2><p id="3964" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">As mentioned above, query execution is handled by the searcher system.
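The fan-out, proxy, and post-processing flow described above could look roughly like this. The searcher stubs and function names are hypothetical, not the platform's actual interfaces:

```python
from typing import Callable, List, Tuple

# Each result is a (match id, score) pair; each proxy maps the query
# into its searcher's format and returns that searcher's raw results.
Result = Tuple[str, float]

def post_process(results: List[Result]) -> List[Result]:
    # A custom-scoring hook could go here; for now, rank by score.
    return sorted(results, key=lambda r: r[1], reverse=True)

def fan_out(query: str,
            proxies: List[Callable[[str], List[Result]]]) -> List[Result]:
    """Fan a single query out to every searcher-proxy, then merge."""
    raw: List[Result] = []
    for proxy in proxies:
        raw.extend(proxy(query))   # one query, many search executions
    return post_process(raw)
```

The gateway-style separation keeps each proxy responsible only for its own searcher's request/response format, while ranking stays in one place.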
The primary searcher used in the current implementation is <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scalable-annotation-service-marken-f5ba9266d428"><em class="mu">Marken</em>, a scalable annotation service</a> built at Netflix. It supports different categories of searches, including full-text and embedding-vector-based similarity searches. It can store and retrieve temporal (timestamp) as well as spatial (coordinates) data. This service leverages Cassandra and Elasticsearch for data storage and retrieval. When onboarding embedding vector data, we performed extensive benchmarking to evaluate the available datastores. One takeaway here is that even if a datastore specializes in a particular query pattern, we decided not to introduce it, for ease of maintainability and consistency.</p><p id="7e1e" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">We have identified a handful of common schema types and standardized how data from different algorithms is stored. Each algorithm still has the flexibility to define a custom schema type. We are actively innovating in this space and recently added the capability to intersect data from different algorithms. This is going to unlock creative ways in which data from multiple algorithms can be superimposed to quickly get to the desired results.</p><h2 id="d33d" class="mf kn iq bd ko mg mh dn ks mi mj dp kw jy mk ml la kc mm mn le kg mo mp li mq bi">Algo Execution &amp; Ingestion</h2><p id="107b" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">So far we have focused on how the data is queried, but there is equally complex machinery powering algorithm execution and the generation of the data. This is handled by our dedicated ML Platform team.
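As a toy illustration of the embedding-vector similarity search such a searcher supports, here is an in-memory cosine-similarity lookup. This is only a sketch of the technique; the real service is backed by Cassandra and Elasticsearch, not a Python dict:

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn(query: List[float],
        index: Dict[str, List[float]],
        k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most similar stored vectors to the query vector."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]
```

A query embedding produced by the query-processing stage would be matched against stored shot embeddings this way, returning the closest shots first.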
The team specializes in building a suite of <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243">media-specific machine learning</a> tooling. This covers how algorithms are executed at scale and how the generated data (feature values) is organized for efficient retrieval and model training. It facilitates seamless and consistent access to all the media, such as audio, video, image and text, making it easy for ML researchers to use them in their algos.</p><p id="f45c" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">For this project we developed a custom sink that indexes the generated data into Marken according to a predefined schema. Special care is taken when the data is backfilled for the first time so as to avoid overwhelming the system with huge amounts of writes.</p><p id="1aee" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Last but not least, our UI team has built a configurable, extensible library to simplify integrating this platform with end-user applications. The configurable UI makes it easy to customize query generation and response handling as per the needs of individual applications and algorithms. Future work involves building native widgets to minimize the UI work even further.</p><p id="7ec7" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">The media understanding platform serves as an abstraction layer between machine learning algos and various applications and features. The platform has already allowed us to seamlessly integrate search and discovery capabilities in several applications. We believe future work in maturing different parts will unlock value for more use cases and applications.
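The backfill concern mentioned above, writing in bounded batches so a first-time backfill does not overwhelm the index, can be sketched as follows. The writer callback is a stand-in for the actual sink, and the batch size is an invented illustrative value:

```python
from typing import Callable, Iterable, List

def backfill(docs: Iterable[dict],
             write_batch: Callable[[List[dict]], None],
             batch_size: int = 500) -> int:
    """Index documents in batches of at most batch_size; return the count.

    Bounding each write keeps a large first-time backfill from flooding
    the target store with one huge burst of writes.
    """
    batch: List[dict] = []
    total = 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            write_batch(batch)
            total += len(batch)
            batch = []
    if batch:                 # flush the final partial batch
        write_batch(batch)
        total += len(batch)
    return total
```

A production sink would also pace or throttle the batches; the chunking above is only the simplest form of that protection.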
We hope this post has offered insights into how we approached its evolution. We will continue to share our work in this space, so stay tuned.</p><p id="6780" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Do these types of challenges interest you? If yes, we’re always looking for <a class="ae kl" href="https://jobs.netflix.com/search?q=software%20engineer" rel="noopener ugc nofollow" target="_blank">engineers</a> and <a class="ae kl" href="https://jobs.netflix.com/search?q=%22machine%20learning%22" rel="noopener ugc nofollow" target="_blank">machine learning practitioners</a> to join us.</p><p id="afd4" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Special thanks to <a class="ae kl" href="https://www.linkedin.com/in/vinodvarmauddaraju/" rel="noopener ugc nofollow" target="_blank">Vinod Uddaraju</a>, <a class="ae kl" href="https://www.linkedin.com/in/fernando-amat-6110931/" rel="noopener ugc nofollow" target="_blank">Fernando Amat Gil</a>, <a class="ae kl" href="https://www.linkedin.com/in/benjamin-klein-usa/" rel="noopener ugc nofollow" target="_blank">Ben Klein</a>, <a class="ae kl" href="https://www.linkedin.com/in/meenakshijindal/" rel="noopener ugc nofollow" target="_blank">Meenakshi Jindal</a>, <a class="ae kl" href="https://www.linkedin.com/in/varun-sekhri-087a213/" rel="noopener ugc nofollow" target="_blank">Varun Sekhri</a>, <a class="ae kl" href="https://www.linkedin.com/in/burakbacioglu/" rel="noopener ugc nofollow" target="_blank">Burak Bacioglu</a>, <a class="ae kl" href="https://www.linkedin.com/in/boris-chen-b921a214/" rel="noopener ugc nofollow" target="_blank">Boris Chen</a>, <a class="ae kl" href="https://www.linkedin.com/in/jasonge27/" rel="noopener ugc nofollow" target="_blank">Jason Ge</a>, <a class="ae kl" href="https://www.linkedin.com/in/tiffany-low/" rel="noopener ugc nofollow" target="_blank">Tiffany 
Low</a>, <a class="ae kl" href="https://www.linkedin.com/in/vitalikauhanka/" rel="noopener ugc nofollow" target="_blank">Vitali Kauhanka</a>, <a class="ae kl" href="https://www.linkedin.com/in/supriya-vadlamani/" rel="noopener ugc nofollow" target="_blank">Supriya Vadlamani</a>, <a class="ae kl" href="https://www.linkedin.com/in/abhisheks0ni/" rel="noopener ugc nofollow" target="_blank">Abhishek Soni</a>, <a class="ae kl" href="https://www.linkedin.com/in/gucarmo/" rel="noopener ugc nofollow" target="_blank">Gustavo Carmo</a>, <a class="ae kl" href="https://www.linkedin.com/in/ellchow/" rel="noopener ugc nofollow" target="_blank">Elliot Chow</a>, <a class="ae kl" href="https://www.linkedin.com/in/prasannapadmanabhan/" rel="noopener ugc nofollow" target="_blank">Prasanna Padmanabhan</a>, <a class="ae kl" href="https://www.linkedin.com/in/akshay-naresh-modi/" rel="noopener ugc nofollow" target="_blank">Akshay Modi</a>, <a class="ae kl" href="https://www.linkedin.com/in/nagendrak/" rel="noopener ugc nofollow" target="_blank">Nagendra Kamath</a>, <a class="ae kl" href="https://www.linkedin.com/in/wenbingbai/" rel="noopener ugc nofollow" target="_blank">Wenbing Bai</a>, <a class="ae kl" href="https://www.linkedin.com/in/jacksondecampos/" rel="noopener ugc nofollow" target="_blank">Jackson de Campos</a>, <a class="ae kl" href="https://www.linkedin.com/in/jivimberg/" rel="noopener ugc nofollow" target="_blank">Juan Vimberg</a>, <a class="ae kl" href="https://www.linkedin.com/in/patrickstrawderman/" rel="noopener ugc nofollow" target="_blank">Patrick Strawderman</a>, <a class="ae kl" href="https://www.linkedin.com/in/dawn-ec/" rel="noopener ugc nofollow" target="_blank">Dawn Chenette</a>, <a class="ae kl" href="https://www.linkedin.com/in/yuchen-xie-788a3818/" rel="noopener ugc nofollow" target="_blank">Yuchen Xie</a>, <a class="ae kl" href="https://www.linkedin.com/in/yaoandy/" rel="noopener ugc nofollow" target="_blank">Andy Yao</a>, and <a class="ae kl" 
href="https://www.linkedin.com/in/chen-zheng-a70434/" rel="noopener ugc nofollow" target="_blank">Chen Zheng</a> for designing, developing, and contributing to different parts of the platform.</p>]]></description>
      <link>https://netflixtechblog.com/building-a-media-understanding-platform-for-ml-innovations-9bef9962dcb7</link>
      <guid>https://netflixtechblog.com/building-a-media-understanding-platform-for-ml-innovations-9bef9962dcb7</guid>
      <pubDate>Tue, 14 Mar 2023 16:48:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Elasticsearch Indexing Strategy in Asset Management Platform (AMP)]]></title>
      <description><![CDATA[<p id="6999" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">By <a class="ae kl" href="https://www.linkedin.com/in/burakbacioglu/" rel="noopener ugc nofollow" target="_blank">Burak Bacioglu</a>, <a class="ae kl" href="https://www.linkedin.com/in/meenakshijindal/" rel="noopener ugc nofollow" target="_blank">Meenakshi Jindal</a></p><p id="b65d" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">At Netflix, all of our digital media assets (images, videos, text, etc.) are stored in secure storage layers. We built an asset management platform (AMP), codenamed <strong class="jp ir">Amsterdam</strong>, in order to easily organize and manage the metadata, schema, relations and permissions of these assets. It is also responsible for asset discovery, validation, sharing, and for triggering workflows.</p><p id="9009" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The Amsterdam service utilizes various solutions such as <a class="ae kl" href="https://cassandra.apache.org/" rel="noopener ugc nofollow" target="_blank">Cassandra</a>, <a class="ae kl" href="https://kafka.apache.org/" rel="noopener ugc nofollow" target="_blank">Kafka</a>, <a class="ae kl" href="https://zookeeper.apache.org/" rel="noopener ugc nofollow" target="_blank">Zookeeper</a>, <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/ephemeral-volatile-caching-in-the-cloud-8eba7b124589">EvCache</a>, etc.
In this blog, we will focus on how we utilize <a class="ae kl" href="https://www.elastic.co" rel="noopener ugc nofollow" target="_blank">Elasticsearch</a> for indexing and searching the assets.</p><p id="6ae1" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Amsterdam is built on top of three storage layers.</p><p id="9eae" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The first layer, <strong class="jp ir">Cassandra</strong>, is the source of truth for us. It consists of close to a hundred tables (column families), the majority of which are reverse indices that help query the assets in a more optimized way.</p><p id="61a4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The second layer is <strong class="jp ir">Elasticsearch</strong>, which is used to discover assets based on user queries. This is the layer we’d like to focus on in this blog: more specifically, how we index and query over 7TB of data in a read-heavy and continuously growing environment while keeping our Elasticsearch cluster healthy.</p><p id="3ea0" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">And finally, we have an Apache <strong class="jp ir">Iceberg</strong> layer which stores assets in a denormalized fashion to help answer heavy queries for analytics use cases.</p><p id="c2b1" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Elasticsearch is one of the most widely adopted distributed, open-source search and analytics engines for all types of data, including textual, numerical, geospatial, structured or unstructured data. It provides simple APIs for creating indices and indexing or searching documents, which makes it easy to integrate. 
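</p><p><em>As a rough sketch of what such an integration involves (the index name, fields, and request shape below are illustrative, not the exact bodies our service sends), a create-index call boils down to a PUT request carrying settings and mappings:</em></p><pre>
```python
import json

# Illustrative settings and mappings for an asset index, in the shape
# Elasticsearch's create-index API expects. The 6 shards / 2 replicas
# mirror the fixed per-index values discussed later in this post.
ASSET_INDEX_BODY = {
    "settings": {"number_of_shards": 6, "number_of_replicas": 2},
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "created_at": {"type": "date"},
            "size_bytes": {"type": "long"},
        }
    },
}

def create_index_request(index_name: str) -> tuple:
    """Build the (method, path, body) triple a client would send."""
    return ("PUT", "/" + index_name, json.dumps(ASSET_INDEX_BODY))

method, path, body = create_index_request("assets-20230101000000")
```
</pre><p>Any of the official clients wraps this same REST call; the hypothetical index name here anticipates the time-based naming scheme described later in the post.</p><p>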
Whether you use an in-house deployment or a hosted solution, you can quickly stand up an Elasticsearch cluster and start integrating it into your application using one of the provided clients for your programming language (Elasticsearch supports a rich set of languages: Java, Python, .NET, Ruby, Perl, etc.).</p><p id="667a" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">One of the first decisions when integrating with Elasticsearch is designing the indices, their settings and mappings. <strong class="jp ir">Settings</strong> include index-specific properties like number of shards, analyzers, etc. <strong class="jp ir">Mapping</strong> is used to define how documents and their fields are supposed to be stored and indexed. You define the data types for each field, or use dynamic mapping for unknown fields. You can find more information on settings and mappings on the Elasticsearch <a class="ae kl" href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html" rel="noopener ugc nofollow" target="_blank">website</a>.</p><p id="4ac0" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Most applications in content and studio engineering at Netflix deal with assets, such as videos, images, text, etc. These applications are built on a microservices architecture, and the Asset Management Platform provides asset management to those dozens of services for various asset types. Each asset type is defined in a centralized schema registry service responsible for storing asset type taxonomies and relationships. Therefore, it initially seemed natural to create a different index for each asset type. When creating index mappings in Elasticsearch, one has to define the data type for each field. 
Since different asset types could potentially have fields with the same name but with different data types, having a separate index for each type would prevent such type collisions. Therefore we initially created around a dozen indices, one per asset type, with field mappings based on the asset type schema. As we onboarded new applications to our platform, we kept creating new indices for the new asset types. We have a schema management microservice that stores the taxonomy of each asset type; it programmatically created a new index whenever a new asset type was created in this service. All the assets of a specific type use the index defined for that asset type to create or update the asset document.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi lp"><picture><img alt="" class="bf lz ma c" width="700" height="334" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk"><strong class="bd ko">Fig 1. Indices based on Asset Types</strong></figcaption></figure><p id="0b15" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">As Netflix now produces significantly more originals than it did when we started this project a few years ago, not only did the number of assets grow dramatically, but the number of asset types also grew from dozens to several thousand. Hence the number of Elasticsearch indices (one per asset type) as well as the asset document indexing and searching RPS (requests per second) grew over time. Although this indexing strategy worked smoothly for a while, challenges started coming up, and over time we noticed performance issues: CPU spikes, long-running queries, and instances going yellow/red in status.</p><p id="79d1" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Usually the first thing to try is to scale the Elasticsearch cluster horizontally by increasing the number of nodes, or vertically by upgrading instance types. We tried both, and in many cases it helps, but sometimes it is only a short-term fix and the performance problems come back after a while, as they did for us. That is when it is time to dig deeper to understand the root cause.</p><p id="3f16" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">It was time to take a step back and reevaluate our ES data indexing and sharding strategy. Each index was assigned a fixed number of 6 shards and 2 replicas (defined in the template of the index). With the increase in the number of asset types, we ended up having approximately 900 indices (thus 16200 shards). Some of these indices had millions of documents, whereas many of them were very small with only thousands of documents. 
We found the root cause of the CPU spikes was unbalanced shard sizes. Elasticsearch nodes storing those large shards became hot spots, and queries hitting those instances were timing out or running very slowly due to busy threads.</p><p id="d17a" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">We changed our indexing strategy and decided to create indices based on time buckets, rather than asset types. What this means is that assets created between t1 and t2 would go to the T1 bucket, assets created between t2 and t3 would go to the T2 bucket, and so on. So instead of persisting assets based on their asset types, we would use their ids (and thus their creation time, because the asset id is a time-based UUID generated at asset creation) to determine which time bucket the document should be persisted to. Elasticsearch <a class="ae kl" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html#shard-size-recommendation" rel="noopener ugc nofollow" target="_blank">recommends</a> each shard be under 65GB (AWS <a class="ae kl" href="https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/sizing-domains.html" rel="noopener ugc nofollow" target="_blank">recommends</a> them to be under 50GB), so we could create time-based indices where each index holds somewhere between 16GB and 20GB of data, giving some buffer for data growth. Existing assets can be redistributed appropriately to these pre-created indices, and new assets would always go to the current index. Once the size of the current index exceeds a certain threshold (16GB), we would create a new index for the next bucket (minute/hour/day) and start indexing new assets into it. 
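</p><p><em>A minimal sketch of the bucket-selection logic (the index names and dates below are made up; the real service derives the creation time from the asset’s time-based UUID and caches the index list, as described later):</em></p><pre>
```python
import bisect
from datetime import datetime

# Sorted time-bucket index names in yyyyMMddHHmmss form, standing in
# for the cached index list; each entry is a bucket's start time.
INDEX_BUCKETS = ["20230101000000", "20230115083000", "20230201120000"]

def index_for(creation_time: datetime) -> str:
    """Pick the bucket whose start time is the latest one not after the asset."""
    key = creation_time.strftime("%Y%m%d%H%M%S")
    # bisect_right finds the first bucket starting after the asset; step back one.
    pos = bisect.bisect_right(INDEX_BUCKETS, key) - 1
    if pos < 0:
        raise ValueError("asset predates the first index bucket")
    return INDEX_BUCKETS[pos]
```
</pre><p>For example, an asset created on 2023-01-20 lands in the "20230115083000" bucket. Because the name ordering matches the time ordering, the lookup is a single binary search over the cached list.</p><p>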
We created an <a class="ae kl" href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/indices-templates.html" rel="noopener ugc nofollow" target="_blank">index template</a> in Elasticsearch so that the new indices always use the same settings and mappings stored in the template.</p><p id="0cd2" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">We chose to index all versions of an asset in the same bucket, the one that keeps the first version. Therefore, even though new assets can never be persisted to an old index (due to our time-based id generation logic, they always go to the latest/current index), existing assets can be updated, causing additional documents for those new asset versions to be created in those older indices. Therefore we chose a lower threshold for the rollover so that older shards would still be well under 50GB even after those updates.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi mf"><picture><img alt="" class="bf lz ma c" width="700" height="313" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk"><strong class="bd ko">Fig 2. Indices based on Time Buckets</strong></figcaption></figure><p id="865f" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">For searching purposes, we have a single read alias that points to all indices created. When performing a query, we always execute it on the alias. This ensures that no matter where documents are, all documents matching the query will be returned. For indexing/updating documents, though, we cannot use an alias; we use the exact index name to perform index operations.</p><p id="f6f3" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">To avoid querying ES for the list of indices on every indexing request, we keep the list of indices in a distributed cache. We refresh this cache whenever a new index is created for the next time bucket, so that new assets will be indexed appropriately. For every asset indexing request, we look at the cache to determine the corresponding time bucket index for the asset. The cache stores all time-based indices in sorted order (for simplicity we named our indices based on their starting time in the format yyyyMMddHHmmss) so that we can easily determine exactly which index should be used for asset indexing based on the asset creation time. Without the time bucket strategy, the same asset could have been indexed into multiple indices, because the Elasticsearch doc id is unique per index, not per cluster. Or we would have to perform two API calls: first to identify the specific index, and then to perform the asset update/delete operation on that specific index.</p><p id="8cf7" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">It is still possible to exceed 50GB in those older indices if millions of updates occur within that time bucket index. 
To address this issue, we added an API that splits an old index into two programmatically. To split a given bucket T1 (which stores all assets between t1 and t2) into two, we choose a time t1.5 between t1 and t2, create a new bucket T1_5, and reindex all assets created between t1.5 and t2 from T1 into this new bucket. While the reindexing is happening, queries/reads are still answered by T1, and any new document created (via asset updates) whose timestamp falls between t1.5 and t2 is dual-written into T1 and T1_5. Finally, once the reindexing is complete, we enable reads from T1_5, stop the dual write, and delete the reindexed documents from T1.</p><p id="40fe" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">In fact, Elasticsearch provides an index rollover feature to handle the growing index problem: <a class="ae kl" href="https://www.elastic.co/guide/en/elasticsearch/reference/6.0/indices-rollover-index.html" rel="noopener ugc nofollow" target="_blank">https://www.elastic.co/guide/en/elasticsearch/reference/6.0/indices-rollover-index.html</a>. With this feature, a new index is created when the current index size hits a threshold, and a write alias is switched to point to the newly created index, so all future index calls go to the new index. However, this would create a problem for our update flow use case, because we would have to query multiple indices to determine which index contains a particular document so that we can update it appropriately. Because calls to Elasticsearch may not be sequential (an asset a1 created at T1 can be indexed after another asset a2 created at T2, where T2&gt;T1), the older asset a1 can end up in the newer index while the newer asset a2 is persisted in the old index. 
In our current implementation, however, by simply looking at the asset id (and asset creation time), we can easily find out which index to go to and it is always deterministic.</p><p id="729c" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">One thing to mention is, Elasticsearch has a default limit of 1000 fields per index. If we index all types to a single index, wouldn’t we easily exceed this number? And what about the data type collisions we mentioned above? Having a single index for all data types could potentially cause collisions when two asset types define different data types for the same field. We also changed our mapping strategy to overcome these issues. Instead of creating a separate Elasticsearch field for each metadata field defined in an asset type, we created a single <a class="ae kl" href="https://www.elastic.co/guide/en/elasticsearch/reference/6.0/nested.html" rel="noopener ugc nofollow" target="_blank">nested</a> type with a mandatory field called `key`, which represents the name of the field on the asset type, and a handful of data-type specific fields, such as: `string_value`, `long_value`, `date_value`, etc. We would populate the corresponding data-type specific field based on the actual data type of the value. Below you can see a part of the index mapping defined in our template, and an example from a document (asset) which has four metadata fields:</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi mg"><picture><img alt="" class="bf lz ma c" width="700" height="803" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk"><strong class="bd ko">Fig 3. Snippet of the index mapping</strong></figcaption></figure><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi mh"><picture><img alt="" class="bf lz ma c" width="700" height="419" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk"><strong class="bd ko">Fig 4. Snippet of nested metadata field on a stored document</strong></figcaption></figure><p id="7819" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">As you see above, all asset properties go under the same nested field `<strong class="jp ir">metadata</strong>` with a mandatory `<strong class="jp ir">key</strong>` field, and the corresponding data-type specific field. This ensures that no matter how many asset types or properties are indexed, we would always have a fixed number of fields defined in the mapping. When searching for these fields, instead of querying for a single value (cameraId == 42323243), we perform a nested query where we query for both key and the value (key == cameraId AND long_value == 42323243). For more information on nested queries, please refer to this <a class="ae kl" href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-nested-query.html" rel="noopener ugc nofollow" target="_blank">link</a>.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi mh"><picture><img alt="" class="bf lz ma c" width="700" height="167" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk"><strong class="bd ko">Fig 5. Search/Indexing RPS</strong></figcaption></figure><p id="48df" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">After these changes, the indices we created are now balanced in terms of data size. CPU utilization is down from an average of 70% to 10%. In addition, we were able to reduce the refresh interval on these indices from our earlier setting of 30 seconds to 1 second in order to support use cases like read-after-write, which enables users to search for and retrieve a document within a second of its creation.</p><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi mh"><picture><img alt="" class="bf lz ma c" width="700" height="321" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk"><strong class="bd ko">Fig 6. CPU Spike with Old indexing strategy</strong></figcaption></figure><figure class="lq lr ls lt gt lu gh gi paragraph-image"><div role="button" tabindex="0" class="lv lw di lx bf ly gh gi mh"><picture><img alt="" class="bf lz ma c" width="700" height="264" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mb mc gj gh gi md me bd b be z dk"><strong class="bd ko">Fig 7. CPU Usage with New indexing strategy</strong></figcaption></figure><p id="ec79" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">We had to do a one-time migration of the existing documents to the new indices. Thankfully we already have a framework in place that can query all assets from Cassandra and index them in Elasticsearch. Since full table scans are not generally recommended on large Cassandra tables (due to potential timeouts), our Cassandra schema contains several reverse indices that help us query all data efficiently. We also utilize Kafka to process these assets asynchronously without impacting our real-time traffic. This infrastructure is used not only to index assets into Elasticsearch, but also to perform administrative operations on all or some assets, such as bulk updating assets, scanning/fixing problems on them, etc. Since this blog focused only on Elasticsearch indexing, we plan to cover this infrastructure in a later post.</p>]]></description>
      <link>https://netflixtechblog.com/elasticsearch-indexing-strategy-in-asset-management-platform-amp-99332231e541</link>
      <guid>https://netflixtechblog.com/elasticsearch-indexing-strategy-in-asset-management-platform-amp-99332231e541</guid>
      <pubDate>Fri, 10 Mar 2023 23:59:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Data Reprocessing Pipeline in Asset Management Platform @Netflix]]></title>
      <description><![CDATA[<p id="8b09" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">By <a class="ae kl" href="https://www.linkedin.com/in/meenakshijindal/" rel="noopener ugc nofollow" target="_blank">Meenakshi Jindal</a></p><p id="75b0" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">At Netflix, we built the asset management platform (AMP) as a centralized service to organize, store and discover the digital media assets created during movie production. Studio applications use this service to store their media assets, which then go through an asset cycle of schema validation, versioning, access control, sharing, and triggering of configured workflows like inspection, proxy generation, etc. The platform has evolved from supporting studio applications to also serving data science and machine-learning applications that discover asset metadata and build various data facts.</p><p id="cd3f" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">During this evolution, we quite often receive requests to update existing asset metadata or to add new metadata for newly added features. This pattern grows over time as we need to access and update existing asset metadata. Hence we built a data pipeline that can extract existing asset metadata and process it for each new use case. This framework allowed us to evolve and adapt the application to the inevitable, unpredictable changes requested by our platform clients without any downtime: production asset operations are performed in parallel with older data reprocessing. 
Some of the common supported data reprocessing use cases are listed below.</p><ul class=""><li id="cbc1" class="lp lq iq jp b jq lk ju ll jy lr kc ls kg lt kk lu lv lw lx bi">Real-Time APIs (backed by the Cassandra database) for asset metadata access don’t fit the analytics use cases of data science or machine learning teams. We built a data pipeline to persist asset data in <a class="ae kl" href="https://iceberg.apache.org/" rel="noopener ugc nofollow" target="_blank">Iceberg</a> in parallel with the Cassandra and Elasticsearch databases. But to build data facts, we need the complete data set in Iceberg, not just the new data. Hence the existing asset data was read and copied to the Iceberg tables without any production downtime.</li>
<li id="1162" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">The asset versioning scheme evolved to support major and minor versions of asset metadata and relation updates. Supporting this feature required a significant update to the data table design (including new tables and updates to existing table columns). Existing data was updated to be backward compatible without impacting the existing running production traffic.</li>
<li id="f34b" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">An Elasticsearch version upgrade included backward-incompatible changes, so all asset data was read from the primary source of truth and reindexed into new indices.</li>
<li id="86b2" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">The data sharding strategy in Elasticsearch was updated to provide lower search latency (as described in this <a class="ae kl" href="https://medium.com/@netflixtechblog/elasticsearch-indexing-strategy-in-asset-management-platform-amp-99332231e541" rel="noopener">blog</a> post).</li>
<li id="c73a" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">Design of new Cassandra reverse indices to support different sets of queries.</li>
<li id="1832" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">Automated workflows are configured for media assets (like inspection), and these workflows needed to be triggered for old existing assets too.</li>
<li id="6aa3" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">The asset schema evolved, which required reindexing all asset data in Elasticsearch to support search/stats queries on the new fields.</li>
<li id="bc45" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">Bulk deletion of assets related to titles whose license has expired.</li>
<li id="66e3" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">Updating or adding metadata to existing assets because of regressions in a client application or within the service itself.</li>
</ul><figure class="me mf mg mh gt mi gh gi paragraph-image"><div role="button" tabindex="0" class="mj mk di ml bf mm gh gi md"><picture><img alt="" class="bf mn mo c" width="700" height="404" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mp mq gj gh gi mr ms bd b be z dk">Figure 1. Data Reprocessing Pipeline Flow</figcaption></figure><p id="6ec8" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Cassandra is the primary data store of the asset management service. With a SQL datastore, it is easy to access existing data with pagination regardless of the data size, but there is no such built-in pagination concept in NoSQL datastores like Cassandra. Newer versions of Cassandra provide some features to support pagination, like <a class="ae kl" href="https://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/PagingState.html" rel="noopener ugc nofollow" target="_blank">pagingstate</a> and <a class="ae kl" href="https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlshCopy.html" rel="noopener ugc nofollow" target="_blank">COPY</a>, but each of them has limitations. To avoid depending on data store features, we designed our data tables such that the data can be read with pagination in a performant way.</p><p id="fecf" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Mainly we read asset data either by asset schema type or by time bucket based on asset creation time. Sharding data solely by asset type could have created wide rows, since some types like VIDEO have many more assets than others like TEXT. Hence, we used both asset types and time buckets based on asset creation date for data sharding across the Cassandra nodes. 
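</p><p><em>The paged read this design enables can be sketched in plain Python (the in-memory list stands in for one (asset type, time bucket) partition; real reads would issue the equivalent CQL with a LIMIT clause):</em></p><pre>
```python
# Stand-in for reading one (asset_type, time_bucket) partition page by page,
# using a sortable asset id as the paging cursor (no server-side paging state).
def fetch_page(ids, last_id, limit):
    """Return the next page of ids strictly smaller than last_id, descending."""
    candidates = sorted(ids, reverse=True)
    if last_id is not None:
        candidates = [i for i in candidates if i < last_id]
    return candidates[:limit]

def paginate(ids, limit):
    """Yield successive pages until the partition is exhausted."""
    last = None
    while True:
        page = fetch_page(ids, last, limit)
        if not page:
            return
        yield page
        last = page[-1]
```
</pre><p>The only requirement is that the primary id be sortable; the cursor is simply the last id of the previous page.</p><p>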
Following is an example of the tables’ primary and clustering keys:</p><figure class="me mf mg mh gt mi gh gi paragraph-image"><div role="button" tabindex="0" class="mj mk di ml bf mm gh gi mt"><picture><img alt="" class="bf mn mo c" width="700" height="298" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mp mq gj gh gi mr ms bd b be z dk">Figure 2. Cassandra Table Design</figcaption></figure><p id="d8fc" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Based on the asset type, the time buckets, which depend on the creation time of the assets, are fetched first. Then, using the time buckets and asset types, a list of asset ids in those buckets is fetched. The asset id is defined as a Cassandra <a class="ae kl" href="https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/uuid_type_r.html" rel="noopener ugc nofollow" target="_blank">Timeuuid</a> data type. We use Timeuuids for asset ids because they can be sorted and then used to support pagination. Any sortable id can be used as the table primary key to support pagination. Based on the page size, e.g. N, the first N rows are fetched from the table. The next page is fetched from the table with limit N and asset id &lt; the last asset id fetched.</p><figure class="me mf mg mh gt mi gh gi paragraph-image"><div role="button" tabindex="0" class="mj mk di ml bf mm gh gi mu"><picture><img alt="" class="bf mn mo c" width="700" height="62" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mp mq gj gh gi mr ms bd b be z dk">Figure 3. Cassandra Data Fetch Query</figcaption></figure><p id="ef7d" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Data layers can be designed around different business-specific entities, which can then be used to read the data by those buckets. But the primary id of the table should be sortable to support pagination.</p><p id="3cb2" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Sometimes we have to reprocess only a specific set of assets, based on some field in the payload. We could use Cassandra to read assets based on time or asset type and then further filter for those assets that satisfy the user’s criteria. Instead, we use Elasticsearch to search for those assets, which is more performant.</p><p id="4280" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">After reading the asset ids in one of these ways, an event is created per asset id to be processed synchronously or asynchronously based on the use case. For asynchronous processing, events are sent to Apache Kafka topics to be processed.</p><p id="d1db" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">The data processor is designed to process the data differently based on the use case. Hence, different processors are defined, and they can be extended based on evolving requirements. Data can be processed synchronously or asynchronously.</p><p id="9843" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Synchronous Flow</strong>: Depending on the event type, the specific processor can be directly invoked on the filtered data. 
Generally, this flow is used for small datasets.</p><p id="c4a8" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Asynchronous Flow</strong>: The data processor consumes the data events sent by the data extractor. An <a class="ae kl" href="https://kafka.apache.org/" rel="noopener ugc nofollow" target="_blank">Apache Kafka</a> topic is configured as the message broker. Depending on the use case, we have to control the number of events processed per time unit; e.g., to reindex all the data in Elasticsearch because of a template change, it is preferable to reindex at a certain RPS to avoid any impact on the running production workflow. Async processing has the benefit of controlling the flow of event processing via the Kafka consumer count or the thread pool size on each consumer. Event processing can also be stopped at any time, by disabling the consumers, in case the production flow is impacted by this parallel data processing. For fast processing of the events, we use different <a class="ae kl" href="https://kafka.apache.org/documentation/#consumerconfigs_max.poll.records" rel="noopener ugc nofollow" target="_blank">settings</a> for the Kafka consumers and the Java executor thread pool. We poll records in bulk from Kafka topics and process them asynchronously with multiple threads. Depending on the processor type, events can be processed at high scale with the right consumer poll size and thread pool settings.</p><p id="8ec4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Each of the use cases mentioned above looks different, but they all need the same reprocessing flow to extract the old data to be processed. 
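</p><p><em>A simplified sketch of that asynchronous flow (the processor interface and names here are hypothetical; a real consumer would poll these batches from a Kafka topic rather than receive them as a list):</em></p><pre>
```python
from concurrent.futures import ThreadPoolExecutor

class Processor:
    """Hypothetical base class: each reprocessing use case implements process()."""
    def process(self, event):
        raise NotImplementedError

class ReindexProcessor(Processor):
    def process(self, event):
        # e.g. read the asset from the source of truth and reindex it
        return "reindexed:" + event["asset_id"]

def consume_batch(events, processor, workers=4):
    """Process one polled batch with a bounded thread pool. Batch size and
    pool size are the knobs that throttle reprocessing throughput."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(processor.process, events))
```
</pre><p>Slowing or halting reprocessing then amounts to shrinking the consumer count, the poll size, or the pool size, without touching the production flow.</p><p>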
Many applications design data pipelines only for processing new data; setting up such a pipeline for existing data as well means that new features can be handled by simply implementing a new processor. This pipeline can be deliberately triggered at any time with the appropriate data filters and data processor type (which defines the actual action to be performed).</p><p id="264b" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi">Errors are part of software development, but with this framework they have to be handled especially carefully, because bulk data reprocessing runs in parallel with production traffic. We have set up data extractor and processor clusters separate from the main production cluster to process the older asset data, so that live asset operations in production are not impacted. Such clusters may have different thread pool configurations for reading and writing data from the database, different logging levels, and different connection configurations for external dependencies.</p><figure class="me mf mg mh gt mi gh gi paragraph-image"><div role="button" tabindex="0" class="mj mk di ml bf mm gh gi mv"><picture><img alt="" class="bf mn mo c" width="700" height="275" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mp mq gj gh gi mr ms bd b be z dk">Figure 4: Processing clusters</figcaption></figure><p id="c849" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Data processors are designed to continue processing events even when some of them fail, for example because of unexpected payloads in old data. If an event fails processing, the Kafka consumer acknowledges it as processed and, after some retries, sends it to a different queue; otherwise the consumer would keep trying to process the same message and block the processing of other events in the topic. We reprocess the data in the dead letter queue after fixing the root cause of the issue, and we collect failure metrics to be checked and fixed later. We have set up alerts and continuously monitor the production traffic, which can be impacted by the bulk reprocessing of old data. If any impact is noticed, we should be able to slow down or stop the data reprocessing at any time; with separate data processor clusters, this is easily done by reducing the number of instances processing the events, or by scaling the cluster down to zero instances when a complete halt is needed.</p><ul class=""><li id="d576" class="lp lq iq jp b jq lk ju ll jy lr kc ls kg lt kk lu lv lw lx bi">Depending on the existing data size and the use case, processing may impact the production flow, so identify the optimal event processing limits and configure the consumer threads accordingly.</li>
<li id="a437" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">If the data processor calls any external services, check the processing limits of those services, because bulk data processing may create unexpected traffic to them and cause scalability or availability issues.</li>
<li id="5673" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">Backend processing may take anywhere from seconds to minutes. Update the Kafka consumer timeout settings accordingly; otherwise, a different consumer may try to process the same event again after the processing timeout.</li>
<li id="4b98" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">Verify the data processor module with a small data set first, before triggering processing of the complete data set.</li>
<li id="4ed3" class="lp lq iq jp b jq ly ju lz jy ma kc mb kg mc kk lu lv lw lx bi">Collect success and error processing metrics, because old data may have edge cases that are not handled correctly in the processors. We use the Netflix <a class="ae kl" href="https://github.com/Netflix/atlas" rel="noopener ugc nofollow" target="_blank">Atlas</a> framework to collect and monitor such metrics.</li>
</ul><p id="2ff9" class="pw-post-body-paragraph jn jo iq jp b jq lk js jt ju ll jw jx jy lm ka kb kc ln ke kf kg lo ki kj kk ij bi"><a class="ae kl" href="mailto:burakb@netflix.com" rel="noopener ugc nofollow" target="_blank">Burak Bacioglu</a> and other members of the Asset Management platform team have contributed to the design and development of this data reprocessing pipeline.</p>]]></description>
      <link>https://netflixtechblog.com/data-reprocessing-pipeline-in-asset-management-platform-netflix-46fe225c35c9</link>
      <guid>https://netflixtechblog.com/data-reprocessing-pipeline-in-asset-management-platform-netflix-46fe225c35c9</guid>
      <pubDate>Fri, 10 Mar 2023 23:58:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[NTS: Reliable Device Testing at Scale]]></title>
      <description><![CDATA[<p id="16ef" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">By <a class="ae kl" href="https://www.linkedin.com/in/benson-ma-86338917/" rel="noopener ugc nofollow" target="_blank"><em class="km">Benson Ma</em></a><em class="km">,</em> <a class="ae kl" href="https://www.linkedin.com/in/zzzimmerman/" rel="noopener ugc nofollow" target="_blank"><em class="km">ZZ Zimmerman</em></a><em class="km">, with contributions from</em> <a class="ae kl" href="https://www.linkedin.com/in/alahuja/" rel="noopener ugc nofollow" target="_blank"><em class="km">Alok Ahuja</em></a><em class="km">,</em> <a class="ae kl" href="https://www.linkedin.com/in/shravanheroor/" rel="noopener ugc nofollow" target="_blank"><em class="km">Shravan Heroor</em></a><em class="km">,</em> <a class="ae kl" href="https://www.linkedin.com/in/mnkras/" rel="noopener ugc nofollow" target="_blank"><em class="km">Michael Krasnow</em></a><em class="km">,</em> <a class="ae kl" href="https://www.linkedin.com/in/todorminchev/" rel="noopener ugc nofollow" target="_blank"><em class="km">Todor Minchev</em></a><em class="km">, and Inder Singh</em></p><p id="db49" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">At Netflix, we test hundreds of different device types every day, ranging from streaming sticks to smart TVs, to ensure that new version releases of the Netflix SDK continue to provide the exceptional Netflix experience that our customers expect. We also collaborate with our Partners to integrate the Netflix SDK onto their upcoming new devices, such as TVs and set-top boxes. 
This program, known as <a class="ae kl" href="https://www.linkedin.com/pulse/scaling-netflix-quality-validation-streaming-devices-matt-duddles/" rel="noopener ugc nofollow" target="_blank">Partner Certification</a>, is particularly important for the business because device expansion historically has been crucial for new Netflix subscription acquisitions. The <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/nts-real-time-streaming-for-test-automation-7cb000e933a1">Netflix Test Studio</a> (NTS) platform was created to support Netflix SDK testing and Partner Certification by providing a consistent automation solution for both Netflix and Partner developers to deploy and execute tests on “Netflix Ready” devices.</p><p id="b175" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Over the years, both Netflix SDK testing and Partner Certification have gradually transitioned upstream towards a <a class="ae kl" href="https://www.browserstack.com/guide/what-is-shift-left-testing" rel="noopener ugc nofollow" target="_blank">shift-left testing strategy</a>. This requires the automation infrastructure to support large-scale CI, which NTS was not originally designed for. 
NTS 2.0 addresses this limitation of NTS: it takes the learnings from NTS 1.0 and re-architects the system into a platform that significantly improves reliable device testing at scale while maintaining the NTS user experience.</p><h2 id="4984" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">The Test Workflow in NTS</h2><p id="814f" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">We first describe the device testing workflow in NTS at a high level.</p><p id="8de5" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Tests:</strong> Netflix device tests are defined as scripts that run against the Netflix application. Test authors at Netflix write the tests and register them into the system along with information that specifies the hardware and software requirements for the test to run correctly, since tests are written to exercise device- and Netflix SDK-specific features, which can vary.</p><p id="53c5" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">One feature that is unique to NTS as an automation system is the support for user interactions in device tests, i.e. tests that require user input or action in the middle of execution. For example, a test might ask the user to turn the volume button up, play an audio clip, then ask the user to either confirm the volume increase or fail the assertion. 
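How a test executor might await such a prompt can be sketched as follows; this is a hypothetical API for illustration, not the actual NTS implementation:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a semi-manual test step: the test executor blocks
 * on a prompt until the user confirms or fails the assertion from the UI,
 * with a timeout so an abandoned prompt cannot hang the execution forever.
 */
public class PromptStep {
    public enum Response { CONFIRMED, FAILED }

    private final CompletableFuture<Response> response = new CompletableFuture<>();

    /** Invoked by the UI layer when the user clicks a prompt button. */
    public void submitResponse(Response r) {
        response.complete(r);
    }

    /** Invoked by the test executor; true only on an explicit confirmation. */
    public boolean awaitConfirmation(long timeout, TimeUnit unit) {
        try {
            return response.get(timeout, unit) == Response.CONFIRMED;
        } catch (Exception e) {
            return false; // timeout or interruption counts as a failed assertion
        }
    }

    public static void main(String[] args) {
        PromptStep step = new PromptStep();
        // Simulate: the UI asks "did the volume increase?"; the user confirms.
        step.submitResponse(Response.CONFIRMED);
        System.out.println(step.awaitConfirmation(1, TimeUnit.SECONDS)); // true
    }
}
```

The point of the sketch is the shape of the interaction: the executor and the UI meet at a shared completion point, and the test only proceeds once the user has answered.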
While most tests are fully automated, these semi-manual tests are often valuable in the device certification process, because they help us verify the integration of the Netflix SDK with the Partner device’s firmware, which we have no control over, and thus cannot automate.</p><p id="8ac8" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Target:</strong> In both the Netflix SDK and Partner testing use cases, the test targets are generally production devices, meaning they may not necessarily provide ssh / root access. As such, operations on devices by the automation system may only be reliably carried out through established device communication protocols such as <a class="ae kl" href="http://www.dial-multiscreen.org/dial-protocol-specification" rel="noopener ugc nofollow" target="_blank">DIAL</a> or <a class="ae kl" href="https://developer.android.com/studio/command-line/adb" rel="noopener ugc nofollow" target="_blank">ADB</a>, instead of through hardware-specific debugging tools that the Partners use.</p><p id="456e" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Environment:</strong> The test targets are located both internally at Netflix and inside the Partner networks. To normalize the diversity of networking environments across both the Netflix and Partner networks and create a consistent and controllable computing environment on which users can run certification testing on their devices, Netflix provides a customized embedded computer to Partners called the Reference Automation Environment (RAE). 
The devices are in turn connected to the RAE, which provides access to the testing services provided by NTS.</p><p id="f2e7" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Device Onboarding:</strong> Before a user can execute tests, they must make their device known to NTS and associate it with their Netflix Partner account in a process called device onboarding. The user achieves this by connecting the device to the RAE in a plug-and-play fashion. The RAE collects the device properties and publishes this information to NTS. The user then goes to the UI to claim the newly-visible device so that its ownership is associated with their account.</p><p id="0eb4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Device and Test Selection:</strong> To run tests, the user first selects from the browser-based web UI (the “NTS UI”) a target device from the list of devices under their ownership (Figure 1).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mc"><picture><img alt="" class="bf mm mn c" width="700" height="483" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 1: Device selection in the NTS UI.</figcaption></figure><p id="6fe9" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">After a device has been selected, the user is presented with all tests that are applicable to the device being developed (Figure 2). The user then selects the subset of tests they are interested in running, and submits them for execution by NTS.</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi ms"><picture><img alt="" class="bf mm mn c" width="700" height="443" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 2: Test selection in the NTS UI.</figcaption></figure><p id="f2e5" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Tests can be executed as a single test run or as part of a batch run. In the latter case, additional execution options are available, such as the option to run multiple iterations of the same test or re-run tests on failure (Figure 3).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mt"><picture><img alt="" class="bf mm mn c" width="700" height="490" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 3: Batch run options in the NTS UI.</figcaption></figure><p id="64fb" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Execution:</strong> Once the tests are launched, the user will get a view of the tests being run, with a live update of their progress (Figure 4).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi ms"><picture><img alt="" class="bf mm mn c" width="700" height="440" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 4: The NTS UI batch execution view.</figcaption></figure><p id="2825" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">If the test is a manual test, prompts will appear in the UI at certain points during the test execution (Figure 5). The user follows the instructions in the prompt and clicks on the prompt buttons to notify the test to continue.</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mu"><picture><img alt="" class="bf mm mn c" width="700" height="259" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 5: An example confirmation prompt in the NTS UI.</figcaption></figure><h2 id="7c30" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Defining the Stakeholders</h2><p id="d6a3" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">To better define the business and system requirements for NTS, we must first identify who the stakeholders are and what their roles are in the business. For the purposes of this discussion, the major stakeholders in NTS are the following:</p><p id="f900" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">System Users:</strong> The system users are the Partners (system integrators) and the Partner Engineers that work with them. They select the certification targets, run tests, and analyze the results.</p><p id="83d9" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Authors:</strong> The test authors write the test cases that are to be run against the certification targets (devices). 
They are generally a subset of the system users, and are familiar or involved with the development of the Netflix SDK and UI.</p><p id="a5ad" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">System Developers:</strong> The system developers are responsible for developing the NTS platform and its components, adding new features, fixing bugs, maintaining uptime, and evolving the system architecture over time.</p><h2 id="cf32" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">From the Use Cases to System Requirements</h2><p id="802a" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">With the business workflows and stakeholders defined, we can articulate a set of high level system requirements / design guidelines that NTS should in theory follow:</p><p id="e18b" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Scheduling Non-requirement:</strong> The devices that are used in NTS form a pool of heterogeneous resources that have a diverse range of hardware constraints. However, NTS is built around the use case where users come in with a specific resource or pool of similar resources in mind and are searching for a subset of compatible tests to run on the target resource(s). This contrasts with test automation systems where users come in with a set of diverse tests, and are searching for compatible resources on which to run the tests. Resource sharing is possible, but it is expected to be manually coordinated between the users because the business workflows that use NTS often involve physical ownership of the device anyway. 
For these reasons, advanced resource scheduling is not a user requirement of this system.</p><p id="e81c" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Execution Component:</strong> Similar to other workflow automation systems, running tests in NTS involves performing tasks external to the target. These include controlling the target device, keeping track of the device state / connectivity, setting up test accounts for the test execution, collecting device logs, publishing test updates, validating test input parameters, and uploading test results, just to name a few. Thus, there needs to be a well-defined test execution stack that sits outside of the device under test to coordinate all these operations.</p><p id="d87c" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Proper State Management:</strong> Test execution statuses need to be accurately tracked, so that multiple users can follow what is happening while the test is running. Furthermore, certain tests require user interactions via prompts, which necessitates the system keeping track of messages being passed back and forth from the UI to the device. These two use cases call for a well-defined data model for representing test executions, as well as a system that provides consistent and reliable test execution state management.</p><p id="fdf2" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Higher Level Execution Semantics:</strong> As noted from the business workflow description, users may want to run tests in batches, run multiple iterations of a test case, retry failing tests up to a given number of times, cancel tests individually or at the batch level, and be notified on the completion of a batch execution. 
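A minimal sketch of such higher-level semantics layered on a single-test abstraction (illustrative only; `runWithRetries` and `runBatch` are invented names, and a boolean supplier stands in for "execute this test once"):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Illustrative sketch: with the single test execution as the unit of
 * abstraction, batch runs, retries, and completion notification become
 * thin layers composed on top of it.
 */
public class BatchRunner {
    /** A Supplier<Boolean> stands in for one test execution: pass or fail. */
    public static boolean runWithRetries(Supplier<Boolean> test, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (test.get()) return true; // passed; no more retries needed
        }
        return false;                    // exhausted all attempts
    }

    public static Map<String, Boolean> runBatch(Map<String, Supplier<Boolean>> tests,
                                                int maxRetries) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        tests.forEach((name, test) -> results.put(name, runWithRetries(test, maxRetries)));
        return results;                  // batch complete: notification point
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Map<String, Supplier<Boolean>> batch = new LinkedHashMap<>();
        batch.put("flaky-test", () -> ++calls[0] >= 3); // fails twice, then passes
        batch.put("always-fails", () -> false);
        System.out.println(runBatch(batch, 2));
        // {flaky-test=true, always-fails=false}
    }
}
```

The design choice mirrored here is that retries and batches never reach inside a test execution; they only compose completed executions.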
Given that the execution of a single test case is already complex as is, these user features call for encapsulating the single test execution as the unit of abstraction, which we can then use to define higher-level execution semantics that support said features in a consistent manner.</p><p id="23cf" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Automated Supervision:</strong> Running tests on prototype hardware inherently comes with reliability issues, not to mention that it takes place in a network environment which we do not necessarily control. At any point during a test execution, the target device can run into any number of errors stemming from the target device itself, the test execution stack, or the network environment. When this happens, users should not be left without test execution updates or with incomplete test results. As such, multiple levels of supervision need to be built into the test system, so that test executions are always cleaned up in a reliable manner.</p><p id="5985" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Orchestration Component:</strong> The requirements for proper state management, higher level execution semantics, and automated supervision call for a well-defined test orchestration stack that handles these three aspects in a consistent manner. 
To clearly delineate the responsibilities of test orchestration from those of test execution, the test orchestration stack should be separate from and sit on top of the test execution component abstraction (Figure 6).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mv"><picture><img alt="" class="bf mm mn c" width="700" height="997" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 6: The workflow cases in NTS.</figcaption></figure><p id="73e5" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">System Scalability:</strong> Scalability in NTS has a different meaning for each of the system’s stakeholders. For the users, scalability implies the ability to always run and interact with tests, no matter the scale (notwithstanding genuine device unavailability). For the test authors, scalability implies the ease of defining, extending, and debugging certification test cases. For the system developers, scalability implies the employment of distributed system design patterns and practices that scale up the development and maintenance velocities required to meet the needs of the users.</p><p id="b82c" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Adherence to the Paved Path:</strong> At Netflix, we emphasize building out solutions that use paved-path tooling as much as possible (see posts <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/how-we-build-code-at-netflix-c5d9bd727f15">here</a> and <a class="ae kl" href="https://www.slideshare.net/diannemarsh/the-paved-road-at-netflix" rel="noopener ugc nofollow" target="_blank">here</a>). 
JVM and Kafka support are the most relevant components of the paved-path tooling for this article.</p><p id="3bda" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">With the system requirements properly articulated, let us do a high-level walkthrough of NTS 1.0 as implemented and examine some of its shortcomings with respect to meeting those requirements.</p><h2 id="21b7" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Test Execution Stack</h2><p id="c0d6" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">In NTS 1.0, the test execution stack is partitioned into two components to address two orthogonal concerns: maintaining the test environment and running the actual tests. The RAE serves as the foundation for addressing the first concern. On the RAE sits the first component of the test execution stack, the device agent. The device agent is a monolithic daemon running on the RAE that manages the physical connections to the devices under test (DUTs), and provides an RPC API abstraction over physical device management and control.</p><p id="cf5d" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Complementing the device agent is the test harness, which manages the actual test execution. The test harness accepts HTTP requests to run a single test case, upon which it spins off a test executor instance to drive and manage the test case’s execution through RPC calls to the device agent managing the target device (see the <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/nts-real-time-streaming-for-test-automation-7cb000e933a1">NTS 1.0 blog post</a> for details). 
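The request-to-executor flow just described might be sketched like this (a toy model with invented names; an in-memory queue stands in for the Kafka topic):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Heavily simplified sketch: a request to run a single test spins off an
 * executor, which drives the test and publishes lifecycle updates to a
 * message bus that other services consume from.
 */
public class TestHarnessSketch {
    public static List<String> runTest(String testId, BlockingQueue<String> bus) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(() -> {
                bus.add(testId + ":STARTED");
                // ... RPC calls to the device agent would happen here ...
                bus.add(testId + ":PASSED");
            }).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            bus.add(testId + ":ERRORED"); // never leave a run without a terminal update
        } finally {
            executor.shutdown();
        }
        List<String> published = new ArrayList<>();
        bus.drainTo(published);           // downstream services consume these
        return published;
    }

    public static void main(String[] args) {
        System.out.println(runTest("t-42", new LinkedBlockingQueue<>()));
        // [t-42:STARTED, t-42:PASSED]
    }
}
```

Note that because the executor only talks to the device through an agent abstraction, none of this logic is device-specific, which is what lets the harness run as a cloud application.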
Throughout the lifecycle of the test execution, the test harness publishes test updates to a message bus (<a class="ae kl" href="https://kafka.apache.org/" rel="noopener ugc nofollow" target="_blank">Kafka</a> in this case) that other services consume from.</p><p id="82ed" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Because the device agent provides a hardware abstraction layer for device control, the business logic for executing tests that resides in the test harness, from invoking device commands to publishing test results, is device-independent. This provides freedom for the component to be developed and deployed as a cloud-native application, so that it can enjoy the benefits of the cloud application model, e.g. write once run everywhere, automatic scalability, etc. Together, the device agent and the test harness form what is called the <strong class="jp ir">Hybrid Execution Context (HEC)</strong>, i.e. the test execution is co-managed by a cloud and edge software stack (Figure 7).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mw"><picture><img alt="" class="bf mm mn c" width="700" height="390" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 7: The test execution stack (Hybrid Execution Context) in NTS 1.0.</figcaption></figure><p id="78cb" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Because the test harness contains all the common test execution business logic, it effectively acts as an “SDK” that device tests can be written on top of. Consequently, test case definitions are packaged as a common software library that the test harness imports on startup, and are executed as library methods called by the test executors in the test harness. This development model complements the write once run everywhere development model of test harness, since improvements to the test harness generally translate to test case execution improvements without any changes made to the test definitions themselves.</p><p id="e589" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">As noted earlier, executing a single test case against a device consists of many operations involved in the setup, runtime, and teardown of the test. Accordingly, the responsibility for each of the operations was divided between the device agent and test harness along device-specific and non-device-specific lines. While this seemed reasonable in theory, oftentimes there were operations that could not be clearly delegated to one or the other component. 
For example, since relevant logs are emitted by software both inside and outside of the device during a test, test log collection becomes a responsibility of both the device agent and the test harness.</p><h2 id="9953" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Presentation Layer</h2><p id="1c9c" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">While the test harness publishes test events that eventually make their way into the test results store, the test executors, and thus the intermediate test execution states, are ephemeral and localized to the individual test harness instances that spawned them. Consequently, a middleware service called the test dispatcher sits in between the users and the test harness to handle the complexity of test executor “discovery” (see the <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/nts-real-time-streaming-for-test-automation-7cb000e933a1">NTS 1.0 blog post</a> for details). In addition to proxying test run requests coming from the users to the test harness, the test dispatcher most importantly serves materialized views of the intermediate test execution states to the users, building them up by ingesting the test events published by the test harness (Figure 8).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="291" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 8: The presentation layer in NTS 1.0.</figcaption></figure><p id="8413" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">This presentation layer offered by the test dispatcher is more accurately described as a console abstraction over the test execution, since users rely on this service not just to follow the latest updates to a test execution, but also to interact with the tests that require user interaction. Consequently, bidirectionality is a requirement for the communications protocol shared between the test dispatcher service and the user interface, and as such, the <a class="ae kl" href="https://www.rfc-editor.org/rfc/rfc6455" rel="noopener ugc nofollow" target="_blank">WebSocket protocol</a> was adopted due to its relative simplicity of implementation for both the test dispatcher and the user interface (web browsers in this case). When a test executes, users open a WebSocket session with the test dispatcher through the UI, and materialized test updates flow to the UI through this session as they are consumed by the service. Likewise, test prompt responses / cancellation requests flow from the UI back to the test dispatcher via the same session, and the test dispatcher forwards the message to the appropriate test executor instance in the test harness.</p><h2 id="3f37" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Batch Execution Stack</h2><p id="41de" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">In NTS 1.0, the unit of abstraction for running tests is the single test case execution, and both the test execution stack and the presentation layer were designed and implemented with this in mind. 
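The materialized-view ingestion described in the previous section amounts to folding the consumed event stream into per-execution state. A minimal sketch, with an invented event shape rather than the real NTS schema:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch: a dispatcher-style service materializes intermediate
 * test execution state by folding the stream of test events it consumes,
 * keeping only the latest status per execution.
 */
public class MaterializedView {
    /** Events look like "executionId:STATUS", e.g. "e1:RUNNING". */
    public static Map<String, String> materialize(List<String> events) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String event : events) {
            String[] parts = event.split(":", 2);
            latest.put(parts[0], parts[1]); // the last event wins per execution
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
            "e1:QUEUED", "e2:QUEUED", "e1:RUNNING", "e1:PASSED", "e2:RUNNING");
        System.out.println(materialize(stream)); // {e1=PASSED, e2=RUNNING}
    }
}
```

The view is disposable by construction: it can always be rebuilt by replaying the event stream, which is what makes the executors' ephemeral state tolerable.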
The construct of a batch run containing multiple tests was introduced only later in the evolution of NTS, being motivated by a set of related user-demanded features: the ability to run and associate multiple tests together, the ability to retry tests on failure, and the ability to be notified when a group of tests completes. To address the business logic of managing batch runs, a batch executor was developed, separate from both the test harness and dispatcher services (Figure 9).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="365" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 9: The batch execution stack in NTS 1.0.</figcaption></figure><p id="d95a" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Similar to the test dispatcher service, the batch execution service proxies batch run requests coming from the users, and is ultimately responsible for dispatching the individual test runs in the batch through the test harness. However, the batch execution service maintains its own data model of the test execution that is separate from and thus incompatible with that materialized by the test dispatcher service. This is a necessary difference considering the unit of abstraction for running tests using the batch execution service is the batch run.</p><h2 id="0f05" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Examining the Shortcomings of NTS 1.0</h2><p id="3289" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">Having described the major system components at a high level, we can now analyze some of the shortcomings of the system in detail:</p><p id="e2c2" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Inconsistent Execution Semantics:</strong> Because batch runs were introduced as an afterthought, the semantics of batch executions in relation to those of the individual test executions were never fully clarified in implementation. In addition, the presence of both the test dispatcher and batch executor created a bifurcation in test executions management, where neither service alone satisfied the users’ needs. For example, a single test that is kicked off as part of a batch run through the batch executor must be canceled through the test dispatcher service. 
However, cancellation is only possible if the test is in a running state, since the test dispatcher has no information about tests prior to their execution. Behaviors such as this often resulted in the system appearing inconsistent and unintuitive to the users, while presenting a knowledge overhead for the system developers.</p><p id="31d3" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Execution Scalability and Reliability:</strong> The test execution stack suffered two technical issues that hampered its reliability and ability to scale. The first is in the partitioning of the test execution stack into two distinct components. While this division had emerged naturally from the setup of the business workflow, the device agent and test harness are fundamentally two pieces of a common stack separated by a control plane, i.e. the network. The conditions of the network at the Partner sites are known to be inconsistent and sometimes unreliable, as there might be traffic congestion, low bandwidth, or unique firewall rules in place. Furthermore, RPC communications between the device agent and test harness are not direct, but go through a few more system components (e.g. gateway services). For these reasons, test executions in practice often suffer from a host of stability, reliability, and latency issues, most of which we cannot take action upon.</p><p id="9917" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The second technical issue is in the implementation of the test executors hosted by the test harness. When a test case is run, a full thread is spawned off to manage its execution, and all intermediate test execution state is stored in thread-local memory. 
Given that much of the test execution lifecycle is involved with making blocking RPC calls, this choice of implementation in practice limits the number of tests that can effectively be run and managed per test harness instance. Moreover, the decision to maintain intermediate test execution state only in thread-local memory renders the test harness fragile, as all test executors running on a given test harness instance will be lost along with their data if the instance goes down. Operational issues stemming from the brittle implementation of the test executors and from the partitioning of the test execution stack frequently exacerbate each other, leading to situations where test executions are slow, unreliable, and prone to infrastructure errors.</p><p id="c3b4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Presentation Layer Scalability:</strong> In theory, the dispatcher service’s WebSocket server can scale up user sessions to the maximum number of HTTP connections allowed by the service and host configuration. However, the service was designed to be stateless so as to reduce the codebase size and complexity. This meant that the dispatcher service had to initialize a new Kafka consumer, read from the beginning of the target partition, filter for the relevant test updates, and build the intermediate test execution state on the fly each time a user opened a new WebSocket session with the service. 
This was a slow and resource-intensive process, which limited the scalability of the dispatcher service as an interactive test execution console for users in practice.</p><p id="ce73" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Test Authoring Scalability:</strong> Because the common test execution business logic was bundled with the test harness as a de facto SDK, test authors had to be familiar with the test harness stack in order to define new test cases. For the test authors, this presented a huge learning curve, since they had to learn a large codebase written in a programming language and toolchain completely different from those used for the Netflix SDK and UI. Since only the test harness maintainers could effectively contribute test case definitions and improvements, this became a bottleneck for development velocity.</p><p id="0042" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Unreliable State Management:</strong> Each of the three core services has a different policy with respect to test execution state management. In the test harness, state is held in thread-local memory, while in the test dispatcher, it is built on the fly by reading from Kafka with each new console session. In the batch executor, on the other hand, intermediate test execution states are ignored entirely and only test results are stored. Because there is no persistence story with regard to intermediate test execution state, and because there is no data model to represent test execution states consistently across the three services, it becomes very difficult to coordinate and track test executions. 
For example, two WebSocket sessions to the same test execution are generally not reproducible if user interactions such as prompt responses are involved, since each session has its own materialization of the test execution state. Without the ability to properly model and track test executions, supervision of test executions is consequently non-existent.</p><p id="3bf9" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">The evolution of NTS can best be described as that of an emergent system architecture, with many features added over time to fulfill the users’ ever-increasing needs. It became apparent that this model brought forth various shortcomings that prevented it from satisfying the system requirements laid out earlier. We now discuss the high-level architectural changes we have made with NTS 2.0, which was built with an intentional design approach to address the system requirements of the business problem.</p><h2 id="2eea" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Decoupling Test Definitions</h2><p id="4a84" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">In NTS 2.0, tests are defined as scripts against the Netflix SDK that execute on the device itself, as opposed to library code that is dependent on and executes in the test harness. These test definitions are hosted on a separate service where they can be accessed by the Netflix SDK on devices located in the Partner networks (Figure 10).</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div class="ab gu cl my"><picture><img alt="" class="mm bf mn c" width="700" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 10: Decoupling the test definitions from the test execution stack in NTS 2.0.</figcaption></figure><p id="4aeb" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">This change brings several distinct benefits to the system. The first is that the new setup is more aligned with device certification, where ultimately we are testing the integration of the Netflix SDK with the target device’s firmware. The second is that we are able to consolidate instrumentation and logging onto a single stack, which simplifies the debugging process for the developers. In addition, by having tests be defined using the same programming language and toolchain used to develop the Netflix UI, the learning curve for writing and maintaining tests is significantly reduced for the test authors. Finally, this setup strongly decouples test definitions from the rest of the test execution infrastructure, allowing for the two to be developed separately in parallel with improved velocity.</p><h2 id="0050" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Defining the Job Execution Model</h2><p id="1f16" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">A proper job execution model with concise semantics has been defined in NTS 2.0 to address the inconsistent semantics between single test and batch executions (Figure 11). The model is summarized as follows:</p><ul class=""><li id="e1e5" class="mz na iq jp b jq jr ju jv jy nb kc nc kg nd kk ne nf ng nh bi">The base unit of test execution is the batch. A batch consists of one or more test cases to be run sequentially on the target device.</li>
<li id="6de4" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">The base unit of test orchestration is the job. A job is a template containing a list of test cases to be run, configurations for test retries and job notifications, and information on the target device.</li>
<li id="4de2" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">All test run requests create a job template, from which batches are instantiated for execution. This includes single test run requests.</li>
<li id="7a62" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">Upon batch completion, a new batch may be instantiated from the source job, but containing only the subset of the test cases that failed earlier. Whether or not this occurs depends on the source job’s test retries configuration.</li>
<li id="a152" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">A job is considered finished when its instantiated batches and subsequent retries have completed. Notifications may then be sent out according to the job’s configuration.</li>
<li id="8f07" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">Cancellations apply at either the single test execution level or the batch execution level. A job is considered canceled when its current batch instantiation is canceled.</li>
</ul><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi nn"><picture><img alt="" class="bf mm mn c" width="700" height="780" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
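The job execution model above can be sketched in a few lines of illustrative Python. The names and retry mechanics are simplified stand-ins for the real NTS 2.0 data model: a job is a template, batches are instantiated from it, and a retry batch contains only the previously failed subset.

```python
# Sketch of the NTS 2.0 job execution model: a job is a template from
# which batches are instantiated; upon batch completion, a retry batch
# containing only the failed subset may be instantiated, per the job's
# retry configuration. All names here are illustrative.

class Job:
    def __init__(self, test_cases, max_retries=1):
        self.test_cases = list(test_cases)
        self.max_retries = max_retries
        self.batches = []               # history of instantiated batches
        self.failed = list(test_cases)

    def _run_batch(self, run_test):
        """Instantiate a batch from the currently-failing subset and run it."""
        batch = list(self.failed)
        self.batches.append(batch)
        self.failed = [t for t in batch if not run_test(t)]

    def run(self, run_test):
        self._run_batch(run_test)             # initial batch: all test cases
        retries = 0
        while self.failed and retries < self.max_retries:
            self._run_batch(run_test)         # retry batch: failed subset only
            retries += 1
        return "FAILED" if self.failed else "PASSED"

# A flaky test "b" that fails on the first attempt and passes on retry.
attempts = {"b": 0}
def run_test(name):
    if name == "b":
        attempts["b"] += 1
        return attempts["b"] > 1
    return True

job = Job(["a", "b", "c"], max_retries=1)
result = job.run(run_test)
print(result, job.batches)  # PASSED [['a', 'b', 'c'], ['b']]
```

Even a single test run request fits this shape: it is just a job template with one test case and an initial batch of one.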
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 11: The job execution model in NTS 2.0.</figcaption></figure><p id="f62c" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The newly-defined job execution model thoroughly clarifies the semantics of single test and batch executions while remaining consistent with all existing use cases of the system, and has informed the re-architecting of both the test execution and orchestration components, which we will discuss in the next few sections.</p><h2 id="db81" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Replacement of the Control Plane</h2><p id="7132" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">In NTS 1.0, the device agent at the edge and the test harness in the cloud communicate to each other via RPC calls proxied by intermediate gateway services. As noted in great detail earlier, this setup brought many stability, reliability, and latency issues that were observed in test executions. With NTS 2.0, this point-to-point-based control plane is replaced with a message bus-based control plane that is built on MQTT and Kafka (Figure 12).</p><p id="7f9a" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">MQTT is an <a class="ae kl" href="https://mqtt.org/" rel="noopener ugc nofollow" target="_blank">OASIS standard messaging protocol</a> for the Internet of Things (IoT) and was designed as a highly lightweight yet reliable publish/subscribe messaging transport that is ideal for connecting remote devices with a small code footprint and minimal network bandwidth. MQTT clients connect to the MQTT broker and send messages prefixed with a topic. 
The broker is responsible for receiving all messages, filtering them, determining who is subscribed to which topic, and sending the messages to the subscribed clients accordingly. The key features that make MQTT highly appealing to us are its support for request retries, fault tolerance, hierarchical topics, client authentication and authorization, per-topic ACLs, and bi-directional request/response message patterns, all of which are crucial for the business use cases around NTS.</p><p id="afe0" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Since the paved-path solution at Netflix supports Kafka, a bridge is established between the two protocols to allow cloud-side services to communicate with the control plane (Figure 12). Through the bridge, MQTT messages are converted directly to Kafka records, where the record key is set to be the MQTT topic that the message was assigned to. We take advantage of this construction by having test execution updates published on MQTT contain the test_id in the topic. This forces all updates for a given test execution to effectively appear on the same Kafka partition with a well-defined message order for consumption by NTS component cloud services.</p><p id="b689" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The introduction of the new control plane has enabled communications between different NTS components to be carried out in a consistent, scalable, and reliable manner, regardless of where the components were located. One example of its use is described in our earlier blog post about <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/towards-a-reliable-device-management-platform-4f86230ca623">reliable devices management</a>. 
The new control plane sets the foundations for the evolution of the test execution stack in NTS 2.0, which we discuss next.</p><h2 id="8c81" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Migration from a Hybrid to Local Execution Context</h2><p id="3668" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">The test execution component is completely migrated over from the cloud to the edge in NTS 2.0. This includes functionality from the batch execution stack in NTS 1.0, since batch executions are the new base unit of test execution. The migration immediately addresses the long-standing problems of network reliability and latency in test executions, since the entire test execution stack now sits together in the same isolated environment, the RAE, instead of being partitioned by a control plane.</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="370" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
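The topic-keyed bridging between MQTT and Kafka described earlier can be sketched as follows. The topic layout is hypothetical, and a generic hash stands in for Kafka's actual partitioner; the point is only that a key derived from a topic embedding the test_id maps every update for one test execution to a single partition, preserving order.

```python
# Sketch of the MQTT-to-Kafka bridge: the Kafka record key is the MQTT
# topic, and because the topic embeds the test_id, all updates for one
# test execution hash to the same partition and so arrive in a
# well-defined order. Topic layout and hash are illustrative, not the
# production partitioner.
import hashlib

NUM_PARTITIONS = 12

def kafka_partition(record_key, num_partitions=NUM_PARTITIONS):
    """Deterministically map a record key (the MQTT topic) to a partition."""
    digest = hashlib.sha256(record_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

topic = "nts/test_updates/test-1234"   # hypothetical topic embedding the test_id
partitions = {kafka_partition(topic) for _ in range(5)}
print(len(partitions))  # 1 -- every update for this test lands on one partition
```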
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 12: The test execution stack (Local Execution Context) and the control plane in NTS 2.0.</figcaption></figure><p id="f720" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">During the migration, the test harness and the device agent components were modularized, as each aspect of test execution management — device state management, device communications protocol management, batch executions management, log collection, etc. — was moved into a dedicated system service running on the RAE that communicated with the other components via the new control plane (Figure 12). Together with the new control plane, these new local modules form what is called the <strong class="jp ir">Local Execution Context (LEC)</strong>. By consolidating test execution management onto the edge and thus in close proximity to the device, the LEC becomes largely immune to the many network-related scalability, reliability, and stability issues that the HEC model frequently encounters. Alongside the decoupling of test definitions from the test harness, the LEC has significantly reduced the complexity of the test execution stack, and has paved the way for its development to be parallelized and thus scalable.</p><h2 id="8ec4" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Proper State Modeling with Event Sourcing</h2><p id="461c" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">Test orchestration covers many aspects: support for the established job execution model (kicking off and running jobs), consistent state management for test executions, reconciliation of user interaction events with test execution state, and overall job execution supervision. 
These functions were divided amongst the three core services in NTS 1.0, but without a consistent model of the intermediate execution states that they can rely upon for coordination, test orchestration as defined by the system requirements could not be reliably achieved. With NTS 2.0, a unified data schema for test execution updates is defined according to the job execution model, with the data itself persisted in storage as an append-only log. In this state management model, all updates for a given test execution, including user interaction events, are stored as a totally-ordered sequence of immutable records ordered by time and grouped by the <code class="fe no np nq nr b">test_id</code>. The append-only property here is a very powerful feature, because it gives us the ability to materialize a test execution state at <strong class="jp ir">any</strong> intermediate point in time simply by replaying the append-only log for the test execution from the beginning up until the given timestamp. Because the records are immutable, state materializations are always fully reproducible.</p><p id="093a" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Since the test execution stack continuously publishes test updates to the control plane, state management at the test orchestration layer simply becomes a matter of ingesting and storing these updates in the correct order in accordance with the <a class="ae kl" href="https://microservices.io/patterns/data/event-sourcing.html" rel="noopener ugc nofollow" target="_blank">Event Sourcing Pattern</a>. 
For this, we turn to the solution provided by <a class="ae kl" href="https://doc.akka.io/docs/alpakka-kafka/current/" rel="noopener ugc nofollow" target="_blank">Alpakka-Kafka</a>, whose adoption we have <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/towards-a-reliable-device-management-platform-4f86230ca623">previously pioneered</a> in the implementation of our devices management platform (Figure 13). To summarize here, we chose Alpakka-Kafka as the basis of the test updates ingestion infrastructure because it fulfilled the following technical requirements: support for per-partition in-order processing of events, back-pressure support, fault tolerance, integration with the paved-path tooling, and long-term maintainability. Ingested updates are subsequently persisted into a log store backed by <a class="ae kl" href="https://www.cockroachlabs.com/" rel="noopener ugc nofollow" target="_blank">CockroachDB</a>. CockroachDB was chosen as the backing store because it is designed to be horizontally scalable and it offers the SQL capabilities needed for working with the job execution data model.</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="207" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
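The replay property of the append-only log described above can be sketched directly. The record fields below are illustrative, not the actual NTS 2.0 schema; the key idea is that any intermediate state is recoverable by replaying immutable, time-ordered records up to a timestamp.

```python
# Sketch: materializing a test execution state at any point in time by
# replaying an append-only log of immutable, totally-ordered records.
# Record fields ("kind", "ts", etc.) are illustrative only.

def materialize_at(log, test_id, until_ts):
    """Replay records for test_id from the start up to until_ts (inclusive)."""
    state = {"status": "PENDING", "prompts": []}
    for record in log:                  # log is totally ordered by time
        if record["test_id"] != test_id or record["ts"] > until_ts:
            continue
        if record["kind"] == "status":
            state["status"] = record["value"]
        elif record["kind"] == "prompt_response":
            state["prompts"].append(record["value"])
    return state

log = [
    {"test_id": "t1", "ts": 1, "kind": "status", "value": "RUNNING"},
    {"test_id": "t1", "ts": 2, "kind": "prompt_response", "value": "yes"},
    {"test_id": "t1", "ts": 3, "kind": "status", "value": "PASSED"},
]
print(materialize_at(log, "t1", until_ts=2)["status"])  # RUNNING
print(materialize_at(log, "t1", until_ts=3)["status"])  # PASSED
```

Because the records are immutable, repeating either call always yields the same state, which is exactly the reproducibility property the text describes.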
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 13: The event sourcing pipeline in NTS 2.0, powered by Alpakka-Kafka.</figcaption></figure><p id="5e7e" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">With proper event sourcing in place and the test execution stack fully migrated over to the LEC, the remaining functionality in the three core services is consolidated into a single dedicated service in NTS 2.0, effectively replacing and improving upon the former three in all areas where test orchestration is concerned. The scalable state management solution provided by this test orchestration service becomes the foundation for scalable presentation and job supervision in NTS 2.0, which we discuss next.</p><h2 id="f1a6" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Scaling Up the Presentation Layer</h2><p id="e7ca" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">The new test orchestration service serves the presentation layer, which, as with NTS 1.0, provides a test execution console abstraction implemented using WebSocket sessions. However, for the console abstraction to be truly reliable and functional, it needs to fulfill several requirements. The first and foremost is that console sessions must be fully reproducible, i.e. two users interacting with the same test execution should observe the exact same behavior. This was an area that was particularly problematic in NTS 1.0. The second is that console sessions must scale up with the number of concurrent users in practice, i.e. sessions should not be resource-intensive. The third is that communications between the session console and the user should be minimal and efficient, i.e. new test execution updates should be delivered to the user only once. 
This requirement implies the need for maintaining session-local memory to keep track of delivered updates. Finally, the test orchestration service itself needs to be able to intervene in console sessions, e.g. send session liveness updates to the users on an interval schedule or notify the users of session termination if the service instance hosting the session is shutting down.</p><p id="8475" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">To handle all of these requirements in a consistent yet scalable manner, we turn to the Actor Model for inspiration. The Actor Model is a concurrency model in which actors are the universal primitive of concurrent computation. Actors send messages to each other, and in response to incoming messages, they can perform operations, create more actors, send out other messages, and change their future behavior. Actors also maintain and modify their own private state, but they can only affect each other’s states indirectly through messaging. In-depth discussions of the Actor Model and its many applications can be found <a class="ae kl" href="https://getakka.net/articles/intro/what-are-actors.html" rel="noopener ugc nofollow" target="_blank">here</a> and <a class="ae kl" href="https://www.oreilly.com/library/view/applied-akka-patterns/9781491934876/ch01.html" rel="noopener ugc nofollow" target="_blank">here</a>.</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="484" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
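The Actor Model just described can be reduced to a minimal sketch: an actor owns private state and a mailbox, and reacts to messages one at a time. Here a hypothetical console-session actor tracks which update sequence numbers it has already delivered, so each update reaches the user only once. This is plain illustrative Python, not the JVM actor system the service actually uses.

```python
# Minimal sketch of the Actor Model applied to a test execution console
# session: private state (delivered-update bookkeeping), a mailbox, and
# one-at-a-time message processing. Names are hypothetical.
from collections import deque

class ConsoleSessionActor:
    def __init__(self):
        self.mailbox = deque()
        self.delivered = set()   # session-local memory of delivered updates
        self.sent = []           # stand-in for writes to the WebSocket

    def tell(self, message):
        """Enqueue a message; actors only affect each other via messaging."""
        self.mailbox.append(message)

    def process(self):
        """Drain the mailbox, reacting to each message in order."""
        while self.mailbox:
            msg = self.mailbox.popleft()
            if msg["type"] == "update" and msg["seq"] not in self.delivered:
                self.delivered.add(msg["seq"])
                self.sent.append(msg["body"])

session = ConsoleSessionActor()
session.tell({"type": "update", "seq": 1, "body": "RUNNING"})
session.tell({"type": "update", "seq": 1, "body": "RUNNING"})  # duplicate poll
session.tell({"type": "update", "seq": 2, "body": "PASSED"})
session.process()
print(session.sent)  # ['RUNNING', 'PASSED']
```

The duplicate message is dropped by the session-local bookkeeping, which is the "delivered only once" requirement from the preceding paragraph.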
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 14: The presentation layer in NTS 2.0.</figcaption></figure><p id="e323" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The Actor Model naturally fits the mental model of the test execution console, since the console is fundamentally a standalone entity that reacts to messages (e.g. test updates, service-level notifications, and user interaction events) and maintains internal state. Accordingly, we modeled test execution sessions as such using <a class="ae kl" href="https://doc.akka.io/docs/akka/current/typed/index.html" rel="noopener ugc nofollow" target="_blank">Akka Typed</a>, a well-known and highly-maintained actor system implementation for the JVM (Figure 14). Console sessions are instantiated when a WebSocket connection is opened by the user to the service, and upon launch, the console begins fetching new test updates for the given <code class="fe no np nq nr b">test_id</code> from the data store. Updates are delivered to the user over the WebSocket connection and saved to session-local memory as a record to keep track of what has already been delivered, while user interaction events are forwarded back to the LEC via the control plane. The polling process is repeated on a cron schedule (every 2 seconds) that is registered to the actor system’s scheduler during console instantiation, and the polling’s data query pattern is designed to be aligned with the service’s state management model.</p><h2 id="8fdc" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Putting in Job Supervision</h2><p id="204e" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">Because NTS is a distributed system whose components communicate asynchronously and interact with prototype embedded devices, faults frequently occur throughout the stack. 
These faults range from device loops and crashes to the RAE being temporarily disconnected from the network, and generally result in missing test updates and/or incomplete test results if left unchecked. Such undefined behavior was a frequent occurrence in NTS 1.0 that impeded the reliability of the presentation layer as an accurate view of test executions. In NTS 2.0, multiple levels of supervision are present across the system to address this class of issues. Supervision is carried out through checks that are scheduled throughout the job execution lifecycle in reaction to the job’s progress. These checks include:</p><ul class=""><li id="e243" class="mz na iq jp b jq jr ju jv jy nb kc nc kg nd kk ne nf ng nh bi">Handling response timeouts for requests sent from the test orchestration service to the LEC.</li>
<li id="259d" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">Handling test “liveness”, i.e. ensuring that updates are continuously present until the test execution reaches a terminal state.</li>
<li id="16be" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">Handling test execution timeouts.</li>
<li id="5254" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">Handling batch execution timeouts.</li>
</ul><p id="7937" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">When these faults occur, the checks will discover them and automatically clean up the faulting test execution, e.g. marking test results as invalid, releasing the target device from reservation, etc. While some checks exist in the LEC stack, job-level supervision facilities mainly reside in the test orchestration service, whose log store can be reliably used for monitoring test execution runs.</p><h2 id="2e93" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">System Behavioral Reliability</h2><p id="c394" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">The importance of understanding the business problem space and cementing this understanding through proper conceptual modeling cannot be overstated. Many of the perceived reliability issues in NTS 1.0 can be attributed to undefined behavior or missing features. Such issues are inevitable in the absence of conceptual modeling and, by extension, of strongly codified expectations of system behavior. With NTS 2.0, we properly defined from the very beginning the job execution model, the data schema for test execution updates according to the model, and the state management model for test execution states (i.e. the append-only log model). We then implemented various system-level features that are built upon these formalisms, such as event-sourcing of test updates, reproducible test execution console sessions, and job supervision. 
It is this development approach, along with the implementation choices made along the way, that empowers us to achieve behavioral reliability across the NTS system in accordance with the business requirements.</p><h2 id="e197" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">System Scalability</h2><p id="5c4f" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">We can examine how each component in NTS 2.0 addresses the scalability issues that are present in its predecessor:</p><p id="85dd" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">LEC Stack:</strong> With the consolidation of the test execution stack fully onto the RAE, the challenge of scaling up test executions is now broken down into two separate problems:</p><ul class=""><li id="6970" class="mz na iq jp b jq jr ju jv jy nb kc nc kg nd kk ne nf ng nh bi">Whether or not the LEC stack can support executing as many tests simultaneously as the maximum number of devices that can be connected to the RAE.</li>
<li id="f43b" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ne nf ng nh bi">Whether or not the communications between the edge and the cloud can scale with the number of RAEs in the system.</li>
</ul><p id="d196" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">The first problem is naturally resolved by hardware-imposed limitations on the number of connected devices, as the RAE is an embedded appliance. The second refers to the scalability of the NTS control plane, which we will discuss next.</p><p id="5d22" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Control Plane:</strong> With the replacement of the point-to-point RPC-based control plane with a message bus-based control plane, system faults stemming from Partner networks have become a rare occurrence and RAE-edge communications have become scalable. For the MQTT side of the control plane, we used <a class="ae kl" href="https://www.hivemq.com/" rel="noopener ugc nofollow" target="_blank">HiveMQ</a> as the cloud MQTT broker. We chose HiveMQ because it met all of our business use case requirements in terms of performance and stability (see our <a class="ae kl" href="https://www.hivemq.com/case-studies/netflix/" rel="noopener ugc nofollow" target="_blank">adoption report</a> for details), and came with the MQTT-Kafka bridging support that we needed.</p><p id="0b6e" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Event Sourcing Infrastructure:</strong> The event-sourcing solution provided by Alpakka-Kafka and CockroachDB has already been demonstrated to be very performant, scalable, and fault tolerant in our earlier work on <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/towards-a-reliable-device-management-platform-4f86230ca623">reliable devices management</a>.</p><p id="9c6d" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp 
ir">Presentation Layer:</strong> The current implementation of the test execution console abstraction using actors removed the practical scaling limits of the previous implementation. The real advantage of this implementation model is that we can achieve meaningful concurrency and performance without having to worry about the low-level details of thread pool management and lock-based synchronization. Notably, systems built on Akka Typed have been shown to support roughly 2.5 million actors per GB of heap and relay actor messages at a throughput of nearly 50 million messages per second.</p><p id="ad3e" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">To be thorough, we performed basic load tests on the presentation layer using the <a class="ae kl" href="https://gatling.io/" rel="noopener ugc nofollow" target="_blank">Gatling</a> load-testing framework to verify its scalability. The simulated test scenario per request is as follows:</p><ol class=""><li id="e483" class="mz na iq jp b jq jr ju jv jy nb kc nc kg nd kk ns nf ng nh bi">Open a test execution console session (i.e. WebSocket connection) in the test orchestration service.</li>
<li id="655d" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ns nf ng nh bi">Wait for 2 to 3 minutes (randomized), during which the session polls the data store at 2-second intervals for test updates.</li>
<li id="b643" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ns nf ng nh bi">Close the session.</li>
</ol><p id="16c4" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">This scenario is comparable to the typical NTS user workflow that involves the presentation layer. The load test plan is as follows:</p><ol class=""><li id="477e" class="mz na iq jp b jq jr ju jv jy nb kc nc kg nd kk ns nf ng nh bi">Burst ramp-up requests to 1000 over 5 seconds.</li>
<li id="d610" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ns nf ng nh bi">Add 80 new requests per second for 10 minutes.</li>
<li id="651e" class="mz na iq jp b jq ni ju nj jy nk kc nl kg nm kk ns nf ng nh bi">Wait for all requests to complete.</li>
</ol><p id="71e2" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">We observed that, in load tests of a single client machine (2.4 GHz, 8-Core, 32 GB RAM) running against a small cluster of 3 AWS <code class="fe no np nq nr b">m4.xlarge</code> instances, we were able to peg the client at over 10,900 simultaneous live WebSocket connections before the client’s limits were reached (Figure 15). On the server side, neither CPU nor memory utilization appeared significantly impacted for the duration of the tests, and the database connection pool was able to handle the query load from all the data store polling (Figures 16–18). We can conclude from these load test results that scalability of the presentation layer has been achieved with the new implementation.</p><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="264" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 15: WebSocket sessions and handshake response time percentiles over time during the load testing.</figcaption></figure><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="333" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 16: CPU usage over time during the load testing.</figcaption></figure><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="333" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 17: Available memory over time during the load testing.</figcaption></figure><figure class="md me mf mg gt mh gh gi paragraph-image"><div role="button" tabindex="0" class="mi mj di mk bf ml gh gi mx"><picture><img alt="" class="bf mm mn c" width="700" height="294" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mo mp gj gh gi mq mr bd b be z dk">Figure 18: Database requests per second over time during the load testing.</figcaption></figure><p id="9f0d" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi"><strong class="jp ir">Job Supervision:</strong> While the actual business logic may be complex, job supervision itself is a very lightweight process, as checks are reactively scheduled in response to events across the job execution cycle. In implementation, checks are scheduled through the Akka scheduler and run using actors, which have been shown above to scale very well.</p><h2 id="8cf4" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Development Velocity</h2><p id="2d50" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">The design decisions we have made with NTS 2.0 have simplified the NTS architecture and, in the process, made the platform run tests noticeably faster, as there are far fewer moving parts to work with. Whereas it used to take roughly 60 seconds to run through a “Hello, World” device test from setup to teardown, now it takes less than 5 seconds. 
This has translated into increased development velocity for our users, who can now iterate on their test authoring and device integration/certification work much more frequently.</p><p id="4178" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">In NTS 2.0, we have added multiple layers of observability across the stack using paved-path tools, from <a class="ae kl" href="https://logback.qos.ch/manual/mdc.html" rel="noopener ugc nofollow" target="_blank">contextual logging</a> to <a class="ae kl" href="https://github.com/Netflix/spectator" rel="noopener ugc nofollow" target="_blank">metrics</a> to <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304">distributed tracing</a>. Some of these capabilities were not available in NTS 1.0 because the component services were built prior to the introduction of paved-path tooling at Netflix. Combined with the simplification of the NTS architecture, this has increased development velocity for the system maintainers by an order of magnitude: user-reported issues can now generally be tracked down and fixed the same day they are reported.</p><h2 id="e637" class="lq ko iq bd kp lr ls dn kt lt lu dp kx jy lv lw lb kc lx ly lf kg lz ma lj mb bi">Cost Reduction</h2><p id="38a9" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">Though our discussion of NTS 1.0 focused on the three core services, in reality there are many auxiliary services in between that coordinate different aspects of a test execution, such as proxying RPC requests from cloud to edge, collecting test results, etc. 
Over the course of building NTS 2.0, we have deprecated a total of 10 microservices whose roles have been either obsolesced by the new architecture or consolidated into the LEC and test orchestration service. In addition, our work has paved the way for the eventual deprecation of 5 additional services and the evolution of several others. The consolidation of component services along with the increase in development and maintenance velocity brought about by NTS 2.0 has significantly reduced the business costs of maintaining the NTS platform, in terms of both compute and developer resources.</p><p id="e4ec" class="pw-post-body-paragraph jn jo iq jp b jq ll js jt ju lm jw jx jy ln ka kb kc lo ke kf kg lp ki kj kk ij bi">Systems design is a process of discovery and can be difficult to get right on the first iteration. Many design decisions need to be considered in light of the business requirements, which evolve over time. In addition, design decisions must be regularly revisited and guided by implementation experience and customer feedback in a process of value-driven development, while avoiding the pitfalls of an emergent model of system evolution. Our in-field experience with NTS 1.0 has thoroughly informed the evolution of NTS into a device testing solution that better satisfies the business workflows and requirements we have while scaling up developer productivity in building out and maintaining this solution.</p><p id="15c5" class="pw-post-body-paragraph jn jo iq jp b jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj kk ij bi">Though we have brought in large changes with NTS 2.0 that addressed the systemic shortcomings of its predecessor, the improvements discussed here are focused on only a few components of the overall NTS platform. 
We have previously discussed <a class="ae kl" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/towards-a-reliable-device-management-platform-4f86230ca623">reliable devices management</a>, which is another large focus domain. The overall reliability of the NTS platform rests on significant work made in many other key areas, including devices onboarding, the MQTT-Kafka transport, authentication and authorization, test results management, and system observability, which we plan to discuss in detail in future blog posts. In the meantime, thanks to this work, we expect NTS to continue to scale with increasing workloads and diversity of workflows over time according to the needs of our stakeholders.</p>]]></description>
      <link>https://netflixtechblog.com/nts-reliable-device-testing-at-scale-43139ae05382</link>
      <guid>https://netflixtechblog.com/nts-reliable-device-testing-at-scale-43139ae05382</guid>
      <pubDate>Thu, 09 Mar 2023 18:38:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Data ingestion pipeline with Operation Management]]></title>
<description><![CDATA[<p id="bfc8" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">At Netflix, many Media Algorithm teams work hand in hand with content creators and editors to promote and recommend content to users in the best possible way. Several of these algorithms aim to improve manual workflows so that we can show each user a personalized promotional image, trailer, or show.</p><p id="7e2a" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">These media-focused machine learning algorithms, along with other teams, generate a lot of data from media files, which, as described in our <a class="ae ke" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scalable-annotation-service-marken-f5ba9266d428">previous blog</a>, is stored as annotations in Marken. We designed a unique concept called Annotation Operations, which allows teams to create data pipelines and easily write annotations without worrying about how their data is accessed by different applications.</p><figure class="lz ma mb mc gs md gg gh paragraph-image"><div role="button" tabindex="0" class="me mf di mg bf mh gg gh ly"><picture><img alt="" class="bf mi mj c" width="700" height="412" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mk ml gi gg gh mm mn bd b be z dk">Annotation Operations</figcaption></figure><p id="ce2b" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Let’s pick an example use case of identifying objects (like trees, cars, etc.) in a video file, as described in the picture above.</p><ul class=""><li id="b4a6" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">During the first run of the algorithm, it identified 500 objects in a particular video file. These 500 objects were stored as annotations of a specific schema type, let’s say Objects, in Marken.</li>
<li id="ba2a" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">The Algorithm team then improved their algorithm. When we re-ran the algorithm on the same video file, it created 600 annotations of schema type Objects and stored them in our service.</li>
</ul><p id="a4a5" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Notice that we cannot update the annotations from previous runs because we don’t know how many annotations a new algorithm run will result in. It is also very expensive for us to keep track of which annotations need to be updated.</p><p id="7944" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">The goal is that when a consumer searches for annotations of type Objects for a given video file, the following should happen.</p><ul class=""><li id="56f2" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">Before Algo run 1, a search should not find anything.</li>
<li id="d02a" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">After the completion of Algo run 1, the query should find the first set of 500 annotations.</li>
<li id="7e98" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">While Algo run 2 is creating its set of 600 annotations, client searches should still return the older 500 annotations.</li>
<li id="a025" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">When all of the 600 annotations are successfully created, they should replace the older set of 500.</li>
<li id="0774" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">So now when clients search for annotations of type Objects, they should get the 600 annotations.</li>
</ul><p id="115b" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Does this remind you of something? This seems very similar (though not exactly the same) to a distributed transaction.</p><p id="5f5f" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Typically, an algorithm run can have 2k-5k annotations. There are many possible naive solutions to this problem, for example:</p><ul class=""><li id="ff66" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">Write different runs in different databases. This is obviously very expensive.</li>
<li id="d4dc" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">Write algo runs into files. But we cannot search or serve low-latency retrievals from files.</li>
<li id="2528" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">Etc.</li>
</ul><p id="c131" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Instead our challenge was to implement this feature on top of Cassandra and ElasticSearch databases because that’s what Marken uses. The solution which we present in this blog is not limited to annotations and can be used for any other domain which uses ES and Cassandra as well.</p><p id="2bee" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">Marken’s architecture diagram is as follows. We refer the reader to our previous <a class="ae ke" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scalable-annotation-service-marken-f5ba9266d428">blog article</a> for details. We use Cassandra as a source of truth where we store the annotations while we index annotations in ElasticSearch to provide rich search functionalities.</p><figure class="lz ma mb mc gs md gg gh paragraph-image"><div role="button" tabindex="0" class="me mf di mg bf mh gg gh nc"><picture><img alt="" class="bf mi mj c" width="700" height="321" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mk ml gi gg gh mm mn bd b be z dk">Marken Architecture</figcaption></figure><p id="7884" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Our goal was to help teams at Netflix create data pipelines without thinking about how that data is made available to readers or client teams. Similarly, client teams don’t have to worry about when or how the data is written. This is what we call decoupling producer flows from clients of the data.</p><p id="3c8c" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">The lifecycle of a movie goes through many creative stages. We have many temporary files which are delivered before we get to the final file of the movie. Similarly, a movie has many different languages, and each of those languages can have different files delivered. Teams generally want to run algorithms and create annotations using all those media files.</p><p id="a280" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Since algorithms can be run on different permutations of how the media files are created and delivered, we can characterize an algorithm run as follows:</p><ul class=""><li id="a168" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">Annotation Schema Type — identifies the schema for the annotation generated by the Algorithm.</li>
<li id="b560" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">Annotation Schema Version — identifies the schema version of the annotation generated by the Algorithm.</li>
<li id="1909" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">PivotId — a unique string identifier which identifies the file or method which is used to generate the annotations. This could be the SHA hash of the file or simply the movie Identifier number.</li>
</ul><p id="abb8" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Given the above, we can describe the data model for an annotation operation as follows.</p><pre class="lz ma mb mc gs nd ne nf bn ng nh bi">{
  "annotationOperationKeys": [
    {
      "annotationType": "string",        ❶
      "annotationTypeVersion": "integer",
      "pivotId": "string",
      "operationNumber": "integer"       ❷
    }
  ],
  "id": "UUID",
  "operationStatus": "STARTED",          ❸
  "isActive": true                       ❹
}</pre><ol class=""><li id="80f7" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls nn mu mv mw bi">We already explained AnnotationType, AnnotationTypeVersion and PivotId above.</li>
<li id="8348" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls nn mu mv mw bi">OperationNumber is an auto-incremented number for each new operation.</li>
<li id="4fdf" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls nn mu mv mw bi">OperationStatus — An operation goes through three phases: Started, Finished, and Canceled.</li>
<li id="cf9a" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls nn mu mv mw bi">IsActive — Whether an operation and its associated annotations are active and searchable.</li>
</ol><p id="0cd1" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">As you can see from the data model, the producer of an annotation has to choose an AnnotationOperationKey, which lets them define how they want to UPSERT annotations in an AnnotationOperation. Inside AnnotationOperationKey, the important field is pivotId and how it is generated.</p><p id="5a25" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">Our source of truth for all objects in Marken is Cassandra. To store Annotation Operations, we have the following main tables.</p><ul class=""><li id="63ba" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">AnnotationOperationById — It stores the AnnotationOperations.</li>
<li id="adeb" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">AnnotationIdByAnnotationOperationId — It stores the IDs of all annotations in an operation.</li>
</ul><p id="6e75" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Since Cassandra is NoSQL, we have more tables which help us create reverse indices and run admin jobs so that we can scan all annotation operations whenever there is a need.</p><p id="024e" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">Each annotation in Marken is also indexed in Elasticsearch for powering various searches. To record the relationship between annotation and operation, we also index two fields:</p><ul class=""><li id="7c7c" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">annotationOperationId — The ID of the operation to which this annotation belongs.</li>
<li id="e81d" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">isAnnotationOperationActive — Whether the operation is in an ACTIVE state.</li>
</ul><p id="af87" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">We provide three APIs to our users. In the following sections, we describe the APIs and the state management done within them.</p><h2 id="cf5c" class="no kg ip bd kh np nq dn kl nr ns dp kp lg nt nu kr lk nv nw kt lo nx ny kv nz bi"><strong class="ak">StartAnnotationOperation</strong></h2><p id="bcb7" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">When this API is called, we store the operation with its OperationKey (a tuple of annotationType, annotationTypeVersion, and pivotId) in our database. This new operation is marked as being in the STARTED state. We store all OperationIDs that are in the STARTED state in a distributed cache (EVCache) for fast access during searches.</p><figure class="lz ma mb mc gs md gg gh paragraph-image"><div role="button" tabindex="0" class="me mf di mg bf mh gg gh oa"><picture><img alt="" class="bf mi mj c" width="700" height="67" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mk ml gi gg gh mm mn bd b be z dk">StartAnnotationOperation</figcaption></figure><h2 id="567b" class="no kg ip bd kh np nq dn kl nr ns dp kp lg nt nu kr lk nv nw kt lo nx ny kv nz bi"><strong class="ak">UpsertAnnotationsInOperation</strong></h2><p id="e202" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">Users call this API to upsert the annotations in an Operation. They pass annotations along with the OperationID. We store the annotations and also record the relationship between the annotation IDs and the Operation ID in Cassandra. During this phase, operations are in the isAnnotationOperationActive = ACTIVE and operationStatus = STARTED state.</p><p id="0430" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">Note that a typical operation run can create 2k to 5k annotations. Clients can call this API from many different machines or threads for fast upserts.</p><figure class="lz ma mb mc gs md gg gh paragraph-image"><div role="button" tabindex="0" class="me mf di mg bf mh gg gh ob"><picture><img alt="" class="bf mi mj c" width="700" height="96" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mk ml gi gg gh mm mn bd b be z dk">UpsertAnnotationsInOperation</figcaption></figure><h2 id="07da" class="no kg ip bd kh np nq dn kl nr ns dp kp lg nt nu kr lk nv nw kt lo nx ny kv nz bi"><strong class="ak">FinishAnnotationOperation</strong></h2><p id="be9f" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">Once the annotations have been created in an operation, clients call FinishAnnotationOperation, which does the following:</p><ul class=""><li id="d969" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">Marks the current operation (let’s say with ID2) as operationStatus = FINISHED and isAnnotationOperationActive=ACTIVE.</li>
<li id="9c84" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">We remove ID2 from the cache (EVCache) since it is no longer in the STARTED state.</li>
<li id="b25c" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">Any previous operation (let’s say with ID1) which was ACTIVE is now marked isAnnotationOperationActive=FALSE in Cassandra.</li>
<li id="b738" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">Finally, we call the updateByQuery API in Elasticsearch. This API finds all Elasticsearch documents with annotationOperationId=ID1 and marks them isAnnotationOperationActive=FALSE.</li>
</ul><figure class="lz ma mb mc gs md gg gh paragraph-image"><div role="button" tabindex="0" class="me mf di mg bf mh gg gh oc"><picture><img alt="" class="bf mi mj c" width="700" height="156" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="mk ml gi gg gh mm mn bd b be z dk">FinishAnnotationOperation</figcaption></figure><h2 id="9380" class="no kg ip bd kh np nq dn kl nr ns dp kp lg nt nu kr lk nv nw kt lo nx ny kv nz bi"><strong class="ak">Search API</strong></h2><p id="84fb" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">This is the key part for our readers. When a client calls our search API we must exclude</p><ul class=""><li id="30a2" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls mt mu mv mw bi">any annotations which are from isAnnotationOperationActive=FALSE operations or</li>
<li id="6128" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls mt mu mv mw bi">any annotations whose operations are currently in the STARTED state. We do that by excluding both from all queries in our system.</li>
</ul><p id="08e9" class="pw-post-body-paragraph kx ky ip kz b la lt jq lc ld lu jt lf lg lv li lj lk lw lm ln lo lx lq lr ls ii bi">To achieve the above:</p><ol class=""><li id="0478" class="mo mp ip kz b la lt ld lu lg mq lk mr lo ms ls nn mu mv mw bi">We add a filter in our ES query to exclude documents where isAnnotationOperationActive is FALSE.</li>
<li id="4b0f" class="mo mp ip kz b la mx ld my lg mz lk na lo nb ls nn mu mv mw bi">We query EVCache to find all operations that are in the STARTED state. Then we exclude all annotations whose annotationOperationId is found in the cache. Using EVCache allows us to keep our search latencies low (most of our queries take less than 100ms).</li>
</ol><p id="ff54" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">Cassandra is our source of truth, so if an error happens, we fail the client call. However, once we commit to Cassandra, we must handle Elasticsearch errors. In our experience, all such errors have happened when the Elasticsearch database was having some issue. For that case, we created retry logic for updateByQuery calls to Elasticsearch. If a call fails, we push a message to SQS so we can retry in an automated fashion after some interval.</p><p id="ba79" class="pw-post-body-paragraph kx ky ip kz b la lb jq lc ld le jt lf lg lh li lj lk ll lm ln lo lp lq lr ls ii bi">In the near term, we want to provide a single high-level API that our clients can call instead of calling three APIs. For example, they could store the annotations in blob storage like S3 and give us a link to the file as part of the single API.</p>]]></description>
      <link>https://netflixtechblog.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8</link>
      <guid>https://netflixtechblog.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8</guid>
      <pubDate>Tue, 07 Mar 2023 21:39:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Scaling Media Machine Learning at Netflix]]></title>
      <description><![CDATA[<div class="ii ij ik il im"><p id="e8bd" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">By <a class="ae kk" href="https://www.linkedin.com/in/gucarmo/" rel="noopener ugc nofollow" target="_blank">Gustavo Carmo</a>, <a class="ae kk" href="https://www.linkedin.com/in/ellchow/" rel="noopener ugc nofollow" target="_blank">Elliot Chow</a>, <a class="ae kk" href="https://www.linkedin.com/in/nagendrak" rel="noopener ugc nofollow" target="_blank">Nagendra Kamath</a>, <a class="ae kk" href="https://www.linkedin.com/in/akshay-naresh-modi" rel="noopener ugc nofollow" target="_blank">Akshay Modi</a>, <a class="ae kk" href="https://www.linkedin.com/in/jasonge27" rel="noopener ugc nofollow" target="_blank">Jason Ge</a>, <a class="ae kk" href="https://www.linkedin.com/in/wenbingbai" rel="noopener ugc nofollow" target="_blank">Wenbing Bai</a>, <a class="ae kk" href="https://www.linkedin.com/in/jacksondecampos" rel="noopener ugc nofollow" target="_blank">Jackson de Campos</a>, <a class="ae kk" href="https://www.linkedin.com/in/lingyi-liu-4b866016/" rel="noopener ugc nofollow" target="_blank">Lingyi Liu</a>, <a class="ae kk" href="https://www.linkedin.com/in/pabloadelgado" rel="noopener ugc nofollow" target="_blank">Pablo Delgado</a>, <a class="ae kk" href="https://www.linkedin.com/in/meenakshijindal" rel="noopener ugc nofollow" target="_blank">Meenakshi Jindal</a>, <a class="ae kk" href="https://www.linkedin.com/in/boris-chen-b921a214/" rel="noopener ugc nofollow" target="_blank">Boris Chen</a>, <a class="ae kk" href="https://www.linkedin.com/in/vi-pallavika-iyengar-144abb1b/" rel="noopener ugc nofollow" target="_blank">Vi Iyengar</a>, <a class="ae kk" href="https://www.linkedin.com/in/kelli-griggs-32990125/" rel="noopener ugc nofollow" target="_blank">Kelli Griggs</a>, <a class="ae kk" href="https://linkedin.com/in/amirziai" rel="noopener ugc nofollow" target="_blank">Amir Ziai</a>, <a 
class="ae kk" href="https://www.linkedin.com/in/prasannapadmanabhan" rel="noopener ugc nofollow" target="_blank">Prasanna Padmanabhan</a>, and <a class="ae kk" href="https://www.linkedin.com/in/mhtaghavi/" rel="noopener ugc nofollow" target="_blank">Hossein Taghavi</a></p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div role="button" tabindex="0" class="kr ks di kt bf ku gg gh kl"><picture><img alt="" class="bf kv kw c" width="700" height="371" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="kx ky gi gg gh kz la bd b be z dk">Figure 1 - Media Machine Learning Infrastructure</figcaption></figure><p id="d0dd" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">In 2007, Netflix started offering streaming alongside its DVD shipping services. As the catalog grew and users adopted streaming, so did the opportunities for creating and improving our recommendations. With a catalog spanning thousands of shows and a diverse member base spanning millions of accounts, recommending the right show to our members is crucial.</p><p id="682f" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Why should members care about any particular show that we recommend? Trailers and artworks provide a glimpse of what to expect in that show. We have been leveraging machine learning (ML) models to <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">personalize artwork</a> and to help our <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/new-series-creating-media-with-machine-learning-5067ac110bcd">creatives create promotional content</a> efficiently.</p><p id="df14" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Our goal in building a media-focused ML infrastructure is to reduce the time from ideation to productization for our media ML practitioners. We accomplish this by paving the path to:</p><ul class=""><li id="f8d5" class="me mf ip jo b jp jq jt ju jx mg kb mh kf mi kj mj mk ml mm bi"><strong class="jo iq">Accessing</strong> and processing <strong class="jo iq">media data</strong> (e.g. video, image, audio, and text)</li>
<li id="7e33" class="me mf ip jo b jp mn jt mo jx mp kb mq kf mr kj mj mk ml mm bi"><strong class="jo iq">Training</strong> large-scale models efficiently</li>
<li id="285a" class="me mf ip jo b jp mn jt mo jx mp kb mq kf mr kj mj mk ml mm bi"><strong class="jo iq">Productizing</strong> models in a self-serve fashion in order to execute on existing and newly arriving assets</li>
<li id="457b" class="me mf ip jo b jp mn jt mo jx mp kb mq kf mr kj mj mk ml mm bi"><strong class="jo iq">Storing</strong> and <strong class="jo iq">serving</strong> model outputs for consumption in promotional content creation</li>
</ul><p id="a8bb" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In this post, we will describe some of the challenges of applying machine learning to media assets, and the infrastructure components that we have built to address them. We will then present a case study of using these components in order to optimize, scale, and solidify an existing pipeline. Finally, we’ll conclude with a brief discussion of the opportunities on the horizon.</p></div><div class="ii ij ik il im"><p id="e4c6" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">In this section, we highlight some of the unique challenges faced by media ML practitioners, along with the infrastructure components that we have devised to address them.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh ne"><picture><img alt="" class="bf kv kw c" width="95" height="69" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><h2 id="c576" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi"><em class="nr">Media Access: Jasper</em></h2><p id="d440" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">In the early days of media ML efforts, it was very hard for researchers to access media data. Even after gaining access, one needed to deal with the challenges of heterogeneity across different assets in terms of decoding performance, size, metadata, and general formatting.</p><p id="b199" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">To streamline this process, we standardized media assets with pre-processing steps that create and store dedicated quality-controlled derivatives with associated snapshotted metadata. In addition, we provide a unified library that enables ML practitioners to seamlessly access video, audio, image, and various text-based assets.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nt"><picture><img alt="" class="bf kv kw c" width="96" height="72" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><h2 id="ab34" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi"><strong class="ak"><em class="nr">Media Feature Storage: Amber Storage</em></strong></h2><p id="839c" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">Media feature computation tends to be expensive and time-consuming. Many ML practitioners independently computed identical features against the same asset in their ML pipelines.</p><p id="e90a" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">To reduce costs and promote reuse, we have built a feature store in order to memoize features/embeddings tied to media entities. This feature store is equipped with a data replication system that enables copying data to different storage solutions depending on the required access patterns.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nt"><picture><img alt="" class="bf kv kw c" width="96" height="72" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
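The memoization idea behind the feature store can be sketched in a few lines of Python. This is an illustrative toy, not Amber's actual API: the `FeatureStore` class and its method names are hypothetical, and a real system would persist to durable, replicated storage rather than an in-process dict.

```python
import hashlib

class FeatureStore:
    """Toy memoizing feature store keyed by (entity, algorithm, version).

    Hypothetical sketch for illustration -- not Amber's real interface.
    """
    def __init__(self):
        self._cache = {}  # a real store would use durable, replicated storage

    def _key(self, entity_id, algo, version):
        return hashlib.sha256(f"{entity_id}:{algo}:{version}".encode()).hexdigest()

    def get_or_compute(self, entity_id, algo, version, compute_fn):
        key = self._key(entity_id, algo, version)
        if key not in self._cache:  # expensive feature is computed only once
            self._cache[key] = compute_fn(entity_id)
        return self._cache[key]

store = FeatureStore()
calls = []

def embed(entity_id):
    calls.append(entity_id)  # track how often we actually compute
    return [0.1, 0.2, 0.3]

a = store.get_or_compute("title-42", "clip-embedding", "v1", embed)
b = store.get_or_compute("title-42", "clip-embedding", "v1", embed)
assert a == b and len(calls) == 1  # second request is served from the store
```

Keying on an algorithm version means a new model version recomputes its features instead of serving stale values.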
</figure><h2 id="3e83" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi"><strong class="ak"><em class="nr">Compute Triggering and Orchestration: Amber Orchestration</em></strong></h2><p id="12a3" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">Productized models must run over newly arriving assets for scoring. In order to satisfy this requirement, ML practitioners had to develop bespoke triggering and orchestration components per pipeline. Over time, these bespoke components became the source of many downstream errors and were difficult to maintain.</p><p id="b229" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Amber is a suite of multiple infrastructure components that offers triggering capabilities to initiate the computation of algorithms with recursive dependency resolution.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div role="button" tabindex="0" class="kr ks di kt bf ku gg gh nu"><picture><img alt="" class="bf kv kw c" width="84" height="93" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
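As a sketch of what recursive dependency resolution means in practice, the toy resolver below runs features in dependency order. The feature names and the `resolve` function are invented for illustration; Amber's real components add triggering, retries, and persistence on top of this basic idea.

```python
def resolve(feature, deps, computed=None):
    """Return an execution order in which every feature runs after its
    dependencies (a depth-first topological sort; illustrative only)."""
    computed = [] if computed is None else computed
    for dep in deps.get(feature, []):
        if dep not in computed:
            resolve(dep, deps, computed)
    if feature not in computed:
        computed.append(feature)
    return computed

# Hypothetical dependency chain for a match-cutting-style pipeline:
deps = {
    "match_cut_scores": ["embeddings"],
    "embeddings": ["deduped_shots"],
    "deduped_shots": ["shot_boundaries"],
}
order = resolve("match_cut_scores", deps)
assert order == ["shot_boundaries", "deduped_shots", "embeddings", "match_cut_scores"]
```

With this shape, triggering only the leaf feature when a new asset lands is enough: everything upstream is computed (or reused) automatically.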
</figure><h2 id="f7f1" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi"><strong class="ak"><em class="nr">Training Performance</em></strong></h2><p id="4f59" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">Media model training poses multiple system challenges in storage, network, and GPUs. We have developed a large-scale GPU training cluster based on <a class="ae kk" href="https://www.ray.io/" rel="noopener ugc nofollow" target="_blank">Ray</a>, which supports multi-GPU / multi-node distributed training. We precompute the datasets, offload the preprocessing to CPU instances, optimize model operators within the framework, and utilize a high-performance file system to resolve the data loading bottleneck, increasing the entire training system throughput 3–5 times.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nv"><picture><img alt="" class="bf kv kw c" width="72" height="53" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><h2 id="a6e4" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi"><strong class="ak"><em class="nr">Serving and Searching</em></strong></h2><p id="da48" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">Media feature values can be optionally synchronized to other systems depending on necessary query patterns. One of these systems is <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/scalable-annotation-service-marken-f5ba9266d428">Marken</a>, a scalable service used to persist feature values as annotations, which are versioned and strongly typed constructs associated with Netflix media entities such as videos and artwork.</p><p id="7924" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">This service provides a user-friendly query DSL for applications to perform search operations over these annotations with specific filtering and grouping. Marken provides unique search capabilities on temporal and spatial data by time frames or region coordinates, as well as vector searches that are able to scale up to the entire catalog.</p><p id="ce08" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">ML practitioners interact with this infrastructure mostly using Python, but there is a plethora of tools and platforms being used in the systems behind the scenes. 
These include, but are not limited to, <a class="ae kk" href="https://conductor.netflix.com/" rel="noopener ugc nofollow" target="_blank">Conductor</a>, <a class="ae kk" href="https://www.youtube.com/watch?v=V2E1PdboYLk" rel="noopener ugc nofollow" target="_blank">Dagobah</a>, <a class="ae kk" href="https://metaflow.org/" rel="noopener ugc nofollow" target="_blank">Metaflow</a>, <a class="ae kk" href="https://netflix.github.io/titus/" rel="noopener ugc nofollow" target="_blank">Titus</a>, <a class="ae kk" href="https://github.com/Netflix/iceberg" rel="noopener ugc nofollow" target="_blank">Iceberg</a>, Trino, Cassandra, Elasticsearch, Spark, Ray, <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/mezzfs-mounting-object-storage-in-netflixs-media-processing-platform-cda01c446ba">MezzFS</a>, S3, <a class="ae kk" href="https://www.infoq.com/presentations/netflix-drive/" rel="noopener ugc nofollow" target="_blank">Baggins</a>, <a class="ae kk" href="https://aws.amazon.com/fsx/" rel="noopener ugc nofollow" target="_blank">FSx</a>, and Java/Scala-based applications with Spring Boot.</p></div><div class="ii ij ik il im"><p id="ccc8" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">The <em class="ns">Media Machine Learning Infrastructure</em> is powering various scenarios across Netflix, and some of them are described <a class="ae kk" href="https://netflixtechblog.medium.com/new-series-creating-media-with-machine-learning-5067ac110bcd" rel="noopener">here</a>. 
In this section, we showcase the use of this infrastructure through the case study of <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/match-cutting-at-netflix-finding-cuts-with-smooth-visual-transitions-31c3fc14ae59"><em class="ns">Match Cutting</em></a>.</p><h2 id="802c" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi">Background</h2><p id="72b7" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi"><em class="ns">Match Cutting</em> is a video editing technique. It’s a transition between two <a class="ae kk" href="https://en.wikipedia.org/wiki/Shot_(filmmaking)#:~:text=In%20filmmaking%20and%20video%20production,express%20emotion%2C%20ideas%20and%20movement." rel="noopener ugc nofollow" target="_blank">shots</a> that uses similar visual framing, composition, or action to fluidly bring the viewer from one scene to the next. It is a powerful visual storytelling tool used to create a connection between two scenes.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nw"><picture><img alt="" class="bf kv kw c" width="600" height="320" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="kx ky gi gg gh kz la bd b be z dk">Figure 2 - a series of frame match cuts from <a class="ae kk" href="https://www.netflix.com/title/81231974" rel="noopener ugc nofollow" target="_blank">Wednesday</a>.</figcaption></figure><p id="b193" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/match-cutting-at-netflix-finding-cuts-with-smooth-visual-transitions-31c3fc14ae59">an earlier post</a>, we described how we’ve used machine learning to find candidate pairs. In this post, we will focus on the engineering and infrastructure challenges of delivering this feature.</p><h2 id="e080" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi">Where we started</h2><p id="524d" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">Initially, we built <em class="ns">Match Cutting</em> to find matches across a single title (i.e. either a movie or an episode within a show). An average title has 2k shots, which means that we need to enumerate and process ~2M pairs.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div role="button" tabindex="0" class="kr ks di kt bf ku gg gh kl"><picture><img alt="" class="bf kv kw c" width="700" height="732" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="kx ky gi gg gh kz la bd b be z dk">Figure 3 - The original Match Cutting pipeline before leveraging media ML infrastructure components.</figcaption></figure><p id="f9d5" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">This entire process was encapsulated in a single <a class="ae kk" href="https://metaflow.org/" rel="noopener ugc nofollow" target="_blank">Metaflow</a> flow. Each step was mapped to a Metaflow <a class="ae kk" href="https://docs.metaflow.org/metaflow/basics#what-should-be-a-step" rel="noopener ugc nofollow" target="_blank">step</a>, which allowed us to control the amount of resources used per step.</p><p id="b0dc" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Step 1</strong></p><p id="cb47" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">We download a video file and produce shot boundary metadata. An example of this data is provided below:</p><pre class="km kn ko kp gs nx ny nz bn oa ob bi">SB = {0: [0, 20], 1: [20, 30], 2: [30, 85], …}</pre><p id="5c91" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Each key in the <code class="fd oh oi oj ny b">SB</code> dictionary is a shot index and each value represents the frame range corresponding to that shot index. For example, for the shot with index <code class="fd oh oi oj ny b">1</code> (the second shot), the value captures the shot frame range <code class="fd oh oi oj ny b">[20, 30]</code>, where <code class="fd oh oi oj ny b">20</code> is the start frame and <code class="fd oh oi oj ny b">29</code> is the end frame (i.e. 
the end of the range is exclusive while the start is inclusive).</p><p id="eb05" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Using this data, we then materialized individual clip files (e.g. <code class="fd oh oi oj ny b">clip0.mp4</code>, <code class="fd oh oi oj ny b">clip1.mp4</code>, etc) corresponding to each shot so that they can be processed in Step 2.</p><p id="65c6" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Step 2</strong></p><p id="8b45" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">This step works with the individual files produced in <em class="ns">Step 1</em> and the list of shot boundaries. We first extract a representation (aka embedding) of each file using a video encoder (i.e. an algorithm that converts a video to a fixed-size vector) and use that embedding to identify and remove duplicate shots.</p><p id="b103" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In the following example <code class="fd oh oi oj ny b">SB_deduped</code> is the result of deduplicating <code class="fd oh oi oj ny b">SB</code>:</p><pre class="km kn ko kp gs nx ny nz bn oa ob bi"># the second shot (index 1) was removed and so was clip1.mp4
SB_deduped = {0: [0, 20], 2: [30, 85], …}</pre><p id="736b" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><code class="fd oh oi oj ny b">SB_deduped</code> along with the surviving files is passed along to <em class="ns">Step 3</em>.</p><p id="1bf0" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Step 3</strong></p><p id="5245" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc 
kd ke kf kg kh ki kj ii bi">We compute another representation per shot, depending on the flavor of match cutting.</p><p id="7e25" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Step 4</strong></p><p id="e99a" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">We enumerate all pairs and compute a score for each pair of representations. These scores are stored along with the shot metadata:</p><pre class="km kn ko kp gs nx ny nz bn oa ob bi">[
  # shots with indices 12 and 729 have a high matching score
  {shot1: 12, shot2: 729, score: 0.96},
  # shots with indices 58 and 419 have a low matching score
  {shot1: 58, shot2: 419, score: 0.02},
  …
]</pre><p id="0358" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Step 5</strong></p><p id="10a5" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Finally, we sort the results by score in descending order and surface the top-<code class="fd oh oi oj ny b">K</code> pairs, where <code class="fd oh oi oj ny b">K</code> is a parameter.</p><h2 id="3e41" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi">The problems we faced</h2><p id="d55a" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">This pattern works well for a single flavor of match cutting and finding matches within the same title. 
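</p><p class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">The per-title pattern of Steps 4 and 5 boils down to scoring every shot pair and keeping the best. Here is a compact sketch; cosine similarity stands in for the real scoring model, and the toy embeddings are invented for illustration:</p>

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_matches(reps, k):
    """reps: {shot_index: embedding}. Return the k highest-scoring shot pairs."""
    scored = [
        {"shot1": i, "shot2": j, "score": cosine(reps[i], reps[j])}
        for i, j in itertools.combinations(sorted(reps), 2)  # Step 4: all pairs
    ]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:k]  # Step 5

# Toy embeddings for three deduplicated shots
reps = {0: [1.0, 0.0], 2: [0.9, 0.1], 5: [0.0, 1.0]}
best = top_k_matches(reps, k=1)
assert (best[0]["shot1"], best[0]["shot2"]) == (0, 2)  # the near-parallel pair wins
```

<p class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">This enumerate-then-sort shape is exactly what becomes intractable as the number of shots grows.</p><p class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">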
As we started venturing beyond single-title and added more flavors, we quickly faced a few problems.</p><p id="aef4" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Lack of standardization</strong></p><p id="106c" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">The representations we extract in <em class="ns">Steps 2</em> and <em class="ns">Step 3</em> are sensitive to the characteristics of the input video files. In some cases such as instance segmentation, the output representation in <em class="ns">Step 3</em> is a function of the dimensions of the input file.</p><p id="1902" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Not having a standardized input file format (e.g. same encoding recipes and dimensions) created matching quality issues when representations across titles with different input files needed to be processed together (e.g. multi-title match cutting).</p><p id="4edb" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Wasteful repeated computations</strong></p><p id="27e7" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Segmentation at the shot level is a common task used across many media ML pipelines. Also, deduplicating similar shots is a common step that a subset of those pipelines shares.</p><p id="b7a1" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">We realized that memoizing these computations not only reduces waste but also allows for congruence between algo pipelines that share the same preprocessing step. 
In other words, having a single source of truth for shot boundaries helps us guarantee additional properties for the data generated downstream. As a concrete example, knowing that algo <code class="fd oh oi oj ny b">A</code> and algo <code class="fd oh oi oj ny b">B</code> both used the same shot boundary detection step, we know that shot index <code class="fd oh oi oj ny b">i</code> has identical frame ranges in both. Without this knowledge, we would have to check whether this is actually true.</p><p id="189b" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Gaps in media-focused pipeline triggering and orchestration</strong></p><p id="35cf" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Our stakeholders (i.e. video editors using match cutting) need to start working on titles as soon as the video files land. Therefore, we built a mechanism to trigger the computation upon the landing of new video files. This triggering logic turned out to present two issues:</p><ol class=""><li id="d0e9" class="me mf ip jo b jp jq jt ju jx mg kb mh kf mi kj ok mk ml mm bi">Lack of standardization meant that the computation was sometimes re-triggered for the same video file due to changes in metadata, without any content change.</li>
<li id="9230" class="me mf ip jo b jp mn jt mo jx mp kb mq kf mr kj ok mk ml mm bi">Many pipelines independently developed similar bespoke components for triggering computation, which created inconsistencies.</li>
</ol><p id="9ebe" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Additionally, decomposing the pipeline into modular pieces and orchestrating computation with dependency semantics did not map to existing workflow orchestrators such as <a class="ae kk" href="https://conductor.netflix.com/" rel="noopener ugc nofollow" target="_blank">Conductor</a> and <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/meson-workflow-orchestration-for-netflix-recommendations-fc932625c1d9">Meson</a> out of the box. The media machine learning domain needed some level of coupling between media asset metadata, media access, feature storage, feature compute, and feature compute triggering, so that new algorithms could be easily plugged in under predefined standards.</p><p id="b1e8" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">This is where <em class="ns">Amber</em> comes in, offering a <em class="ns">Media Machine Learning Feature Development and Productization Suite</em>, gluing all aspects of shipping algorithms while permitting the interdependency and composability of multiple smaller parts required to devise a complex system.</p><p id="e15d" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Each part is in itself an algorithm, which we call an <em class="ns">Amber Feature</em>, with its own scope of computation, storage, and triggering. 
Using dependency semantics, an <em class="ns">Amber Feature</em> can be plugged into other <em class="ns">Amber Features</em>, allowing for the composition of a complex mesh of interrelated algorithms.</p><p id="8e05" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Match Cutting across titles</strong></p><p id="0af7" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><em class="ns">Step 4</em> entails a computation that is quadratic in the number of shots. For instance, matching across a series with 10 episodes with an average of 2K shots per episode translates into 200M comparisons. Matching across 1,000 files (across multiple shows) requires approximately 2 trillion comparisons.</p><p id="1feb" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Momentarily setting aside the sheer number of computations required, editors may be interested in considering any subset of shows for matching. The naive approach is to pre-compute all possible subsets of shows. 
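</p><p class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">The pairwise blow-up is easy to verify with a little arithmetic (assuming, as above, roughly 2K shots per file):</p>

```python
def num_pairs(n):
    """Unordered pairs among n shots: n choose 2."""
    return n * (n - 1) // 2

shots_per_file = 2_000  # ~2K shots in an average title
assert num_pairs(shots_per_file) == 1_999_000                   # ~2M, one title
assert num_pairs(10 * shots_per_file) == 199_990_000            # ~200M, 10 episodes
assert num_pairs(1_000 * shots_per_file) == 1_999_999_000_000   # ~2 trillion, 1,000 files
```

<p class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">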
Even assuming that we only have 1,000 video files, this means that we have to pre-compute 2¹⁰⁰⁰ subsets, which is more than the <a class="ae kk" href="https://en.wikipedia.org/wiki/Observable_universe#Matter_content%E2%80%94number_of_atoms" rel="noopener ugc nofollow" target="_blank">number of atoms in the observable universe</a>!</p><p id="fd63" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Ideally, we want to use an approach that avoids both issues.</p><h2 id="912d" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi"><strong class="ak">Where we landed</strong></h2><p id="2092" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">The <em class="ns">Media Machine Learning Infrastructure</em> provided many of the building blocks required for overcoming these hurdles.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh ne"><picture><img alt="" class="bf kv kw c" width="95" height="69" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="1ace" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Standardized video encodes</strong></p><p id="97a5" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">The entire Netflix catalog is pre-processed and stored for reuse in machine learning scenarios. <em class="ns">Match Cutting</em> benefits from this standardization as it relies on homogeneity across videos for proper matching.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nt"><picture><img alt="" class="bf kv kw c" width="96" height="72" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="c275" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Shot segmentation and deduplication reuse</strong></p><p id="bf38" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Videos are matched at the shot level. Since breaking videos into shots is a very common task across many algorithms, the infrastructure team provides this canonical feature that can be used as a dependency for other algorithms. With this, we were able to reuse memoized feature values, saving on compute costs and guaranteeing coherence of shot segments across algos.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nt"><picture><img alt="" class="bf kv kw c" width="96" height="72" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="7f95" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Orchestrating embedding computations</strong></p><p id="12aa" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">We have used <em class="ns">Amber</em>’s feature dependency semantics to tie the computation of embeddings to shot deduplication. Leveraging <em class="ns">Amber</em>’s triggering, we automatically initiate scoring for new videos as soon as the standardized video encodes are ready. <em class="ns">Amber</em> handles the computation in the dependency chain recursively.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nt"><picture><img alt="" class="bf kv kw c" width="96" height="72" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="9361" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Feature value storage</strong></p><p id="b346" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">We store embeddings in <em class="ns">Amber</em>, which guarantees immutability, versioning, auditing, and various metrics on top of the feature values. This also allows other algorithms to be built on top of the <em class="ns">Match Cutting</em> output as well as all the intermediate embeddings.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nt"><picture><img alt="" class="bf kv kw c" width="96" height="72" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="f1a9" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Compute pairs and sink to Marken</strong></p><p id="5d29" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">We have also used <em class="ns">Amber’s</em> synchronization mechanisms to replicate data from the main feature value copies to <em class="ns">Marken</em>, which is used for serving.</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div class="gg gh nt"><picture><img alt="" class="bf kv kw c" width="96" height="72" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="e019" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Media Search Platform</strong></p><p id="66ff" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Used to serve high-scoring pairs to video editors in internal applications via <em class="ns">Marken</em>.</p><p id="1146" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">The following figure depicts the new pipeline using the above-mentioned components:</p><figure class="km kn ko kp gs kq gg gh paragraph-image"><div role="button" tabindex="0" class="kr ks di kt bf ku gg gh kl"><picture><img alt="" class="bf kv kw c" width="700" height="315" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="kx ky gi gg gh kz la bd b be z dk">Figure 4 - Match cutting pipeline built using media ML infrastructure components. Interactions between algorithms are expressed as a feature mesh, and each Amber Feature encapsulates triggering and compute.</figcaption></figure></div><div class="ii ij ik il im"><p id="aa44" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">The intersection of media and ML holds numerous prospects for innovation and impact. We examined some of the unique challenges that media ML practitioners face and presented some of our early efforts in building a platform that accommodates the scaling of ML solutions.</p><p id="d843" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In addition to the promotional media use cases we discussed, we are extending the infrastructure to facilitate a growing set of use cases. Here are just a few examples:</p><ul class=""><li id="ed02" class="me mf ip jo b jp jq jt ju jx mg kb mh kf mi kj mj mk ml mm bi">ML-based VFX tooling</li>
<li id="2f81" class="me mf ip jo b jp mn jt mo jx mp kb mq kf mr kj mj mk ml mm bi">Improving recommendations using a suite of content understanding models</li>
<li id="ef25" class="me mf ip jo b jp mn jt mo jx mp kb mq kf mr kj mj mk ml mm bi">Enriching content understanding ML and creative tooling by leveraging personalization signals and insights</li>
</ul><p id="a1ee" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In future posts, we’ll dive deeper into the solutions built for each of the components we’ve briefly described in this post.</p><p id="06b0" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">If you’re interested in media ML, we’re always looking for <a class="ae kk" href="https://jobs.netflix.com/search?q=engineer" rel="noopener ugc nofollow" target="_blank">engineers</a> and <a class="ae kk" href="https://jobs.netflix.com/search?q=%22machine%20learning%22" rel="noopener ugc nofollow" target="_blank">ML researchers and practitioners</a> to join us!</p><h2 id="f8b2" class="nf lc ip bd ld ng nh dn lh ni nj dp ll jx nk nl lp kb nm nn lt kf no np lx nq bi"><strong class="ak">Acknowledgments</strong></h2><p id="0d66" class="pw-post-body-paragraph jm jn ip jo b jp lz jr js jt ma jv jw jx mb jz ka kb mc kd ke kf md kh ki kj ii bi">Special thanks to <a class="ae kk" href="https://www.linkedin.com/in/benjamin-klein-usa/" rel="noopener ugc nofollow" target="_blank">Ben Klein</a>, <a class="ae kk" href="https://www.linkedin.com/in/fernando-amat-6110931/" rel="noopener ugc nofollow" target="_blank">Fernando Amat Gil</a>, <a class="ae kk" href="https://www.linkedin.com/in/varun-sekhri-087a213/" rel="noopener ugc nofollow" target="_blank">Varun Sekhri</a>, <a class="ae kk" href="https://www.linkedin.com/in/gurutahasildar/" rel="noopener ugc nofollow" target="_blank">Guru Tahasildar</a>, and <a class="ae kk" href="https://www.linkedin.com/in/burakbacioglu/" rel="noopener ugc nofollow" target="_blank">Burak Bacioglu</a> for contributing to ideas, designs, and discussions.</p></div>]]></description>
      <link>https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243</link>
      <guid>https://netflixtechblog.com/scaling-media-machine-learning-at-netflix-f19b400243</guid>
      <pubDate>Mon, 13 Feb 2023 18:59:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Discovering Creative Insights in Promotional Artwork]]></title>
      <description><![CDATA[<p id="fdd3" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">By <a class="ae kk" href="https://www.linkedin.com/in/tsmgrace/" rel="noopener ugc nofollow" target="_blank">Grace Tang</a>, <a class="ae kk" href="https://www.linkedin.com/in/aneeshvartakavi/" rel="noopener ugc nofollow" target="_blank">Aneesh Vartakavi</a>, <a class="ae kk" href="https://www.linkedin.com/in/jbagdonaite/" rel="noopener ugc nofollow" target="_blank">Julija Bagdonaite</a> and <a class="ae kk" href="https://www.linkedin.com/in/cristinasegalin/" rel="noopener ugc nofollow" target="_blank">Cristina Segalin</a></p><p id="cae6" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">When members are shown a title on Netflix, the displayed artwork, trailers, and synopses are personalized. That means members see the assets that are most likely to help them make an informed choice. These assets are a critical source of information for the member to make a decision to watch, or not watch, a title. The stories on Netflix are multidimensional and there are many ways that a single story could appeal to different members. We want to show members the images, trailers, and synopses that are most helpful to them for making a watch decision.</p><p id="1ac0" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In a <a class="ae kk" rel="noopener ugc nofollow" target="_blank" href="https://netflixtechblog.com/artwork-personalization-c589f074ad76">previous blog post</a> we explained how our artwork personalization algorithm can pick the best image for each member, but how do we create a good set of images to choose from? 
What data would you like to have if you were designing an asset suite?</p><p id="5b7e" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In this blog post, we talk about two approaches to create effective artwork. Broadly, they are:</p><ol class=""><li id="1575" class="kl km ip jo b jp jq jt ju jx kn kb ko kf kp kj kq kr ks kt bi">The top-down approach, where we preemptively identify image properties to investigate, informed by our initial beliefs.</li>
<li id="ebbe" class="kl km ip jo b jp ku jt kv jx kw kb kx kf ky kj kq kr ks kt bi">The bottom-up approach, where we let the data naturally surface important trends.</li>
</ol><p id="af33" class="pw-post-body-paragraph jm jn ip jo b jp lx jr js jt ly jv jw jx lz jz ka kb ma kd ke kf mb kh ki kj ii bi">Great promotional media helps viewers discover titles they’ll love. In addition to helping members quickly find titles already aligned with their tastes, it also helps members discover new content. We want to make artwork that is compelling and personally relevant, but we also want to represent the title authentically. We don’t want to make clickbait.</p><p id="8d3e" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Here’s an example: <a class="ae kk" href="https://www.netflix.com/title/81043665" rel="noopener ugc nofollow" target="_blank"><em class="mc">Purple Hearts</em></a> is a film about an aspiring singer-songwriter who commits to a marriage of convenience with a soon-to-deploy Marine. This title has storylines that might appeal both to fans of romance and to fans of military and war themes. This is reflected in our artwork suite for this title.</p><figure class="me mf mg mh gs mi gg gh paragraph-image"><div class="gg gh md"><picture><img alt="" class="bf mj mk c" width="664" height="309" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="ml mm gi gg gh mn mo bd b be z dk">Images for the title “Purple Hearts”</figcaption></figure><p id="9960" class="pw-post-body-paragraph jm jn ip jo b jp lx jr js jt ly jv jw jx lz jz ka kb ma kd ke kf mb kh ki kj ii bi">To create suites that are relevant, attractive, and authentic, we’ve relied on creative strategists and designers with intimate knowledge of the titles to recommend and create the right art for upcoming titles. To supplement their domain expertise, we’ve built a suite of tools to help them look for trends. By inspecting past asset performance from thousands of titles that have already been launched on Netflix, we achieve a beautiful intersection of art &amp; science. However, there are some downsides to this approach: it is tedious to manually scrub through this large collection of data, and looking for trends this way could be subjective and vulnerable to confirmation bias.</p><p id="d464" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Creators often have years of experience and expert knowledge on what makes a good piece of art. However, it is still useful to test our assumptions, especially in the context of the specific canvases we use on the Netflix product. For example, certain art styles that are effective in traditional media like movie posters might not translate well to the Netflix UI in your living room. Compared to a movie poster or physical billboard, Netflix artwork on TV screens and mobile phones has very different sizes, aspect ratios, and amounts of attention paid to it.
As a consequence, we need to conduct research into the effectiveness of artwork on our unique user interfaces instead of extrapolating from established design principles.</p><p id="5a5e" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Given these challenges, we develop data-driven recommendations and surface them to creators in an actionable, user-friendly way. These insights complement their extensive domain expertise to help them create more effective asset suites. We do this in two ways: a top-down approach that can find known features that have worked well in the past, and a bottom-up approach that surfaces groups of images with no prior knowledge or assumptions.</p><p id="e62c" class="pw-post-body-paragraph jm jn ip jo b jp lx jr js jt ly jv jw jx lz jz ka kb ma kd ke kf mb kh ki kj ii bi">In our top-down approach, we describe an image using attributes and find features that make images successful. We collaborate with experts to identify a large set of features based on their prior knowledge and experience, and model them using Computer Vision and Machine Learning techniques. These features range from low-level features like color and texture to higher-level features like the number of faces, composition, and facial expressions.</p><figure class="me mf mg mh gs mi gg gh paragraph-image"><div class="gg gh mp"><picture><img alt="" class="bf mj mk c" width="300" height="420" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="ml mm gi gg gh mn mo bd b be z dk">Examples of the features we might capture for this image include: number of people (two), where they’re facing (facing each other), emotion (neutral to positive), saturation (low), objects present (military uniform)</figcaption></figure><p id="7f8e" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">We can use pre-trained models/APIs to create some of these features, like face detection and object labeling. We also build internal datasets and models for features where pre-trained models are not sufficient. For example, common Computer Vision models can tell us that an image contains two people facing each other with happy facial expressions — are they friends, or in a romantic relationship? We have built human-in-the-loop tools to help experts train ML models rapidly and efficiently, enabling them to build custom models for subjective and complex attributes.</p><p id="642d" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Once we describe an image with features, we employ various <a class="ae kk" href="https://netflixtechblog.medium.com/causal-machine-learning-for-creative-insights-4b0ce22a8a96" rel="noopener">predictive and causal methods</a> to extract insights about which features are most important for effective artwork, which we then leverage when creating artwork for upcoming titles. One example insight: looking across the catalog, we found that single-person portraits tend to perform better than images featuring more than one person.</p><figure class="me mf mg mh gs mi gg gh paragraph-image"><div role="button" tabindex="0" class="mr ms di mt bf mu gg gh mq"><picture><img alt="" class="bf mj mk c" width="700" height="242" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="ml mm gi gg gh mn mo bd b be z dk">Single Character Portraits</figcaption></figure><p id="6997" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Bottom-up approach</strong></p><p id="ebc8" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">The top-down approach can deliver clear, actionable insights supported by data, but these insights are limited to the features we are able to identify beforehand and model computationally. We balance this using a bottom-up approach where we do not make any prior guesses, and let the data surface patterns and features. In practice, we surface clusters of similar images and have our creative experts derive insights, patterns, and inspiration from these groups.</p><p id="a7d9" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">One such method we use for image clustering is leveraging large pre-trained convolutional neural networks to model image similarity. Features from the early layers often model low-level similarity like colors, edges, textures, and shapes, while features from the final layers group images depending on the task (e.g., similar objects if the model is trained for object detection). We can then use an unsupervised clustering algorithm (like k-means) to find clusters within these images.</p><p id="4ab8" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Using our example title above, one of the characters in <em class="mc">Purple Hearts</em> is in the Marines.
Looking at clusters of images from similar titles, we see a cluster containing imagery commonly associated with military and war, featuring characters in military uniform.</p><figure class="me mf mg mh gs mi gg gh paragraph-image"><div role="button" tabindex="0" class="mr ms di mt bf mu gg gh mv"><picture><img alt="" class="bf mj mk c" width="700" height="459" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="ml mm gi gg gh mn mo bd b be z dk">An example cluster of imagery related to military and war.</figcaption></figure><p id="0b5b" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Sampling some images from the cluster above, we see many examples of soldiers or officers in uniform, some holding weapons, with serious facial expressions, looking off camera. A creator could find this pattern of images within the cluster, confirm that the pattern has worked well in the past using performance data, and use this as inspiration to create final artwork.</p><figure class="me mf mg mh gs mi gg gh paragraph-image"><div role="button" tabindex="0" class="mr ms di mt bf mu gg gh mw"><picture><img alt="" class="bf mj mk c" width="700" height="365" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="ml mm gi gg gh mn mo bd b be z dk">A creator can draw inspiration from images in the cluster to the left, and use this to create effective artwork for new titles, such as the image for Purple Hearts on the right.</figcaption></figure><p id="8738" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Similarly, the title has a romance storyline, so we find a cluster of images that show romance. From such a cluster, a creator could infer that close physical proximity and body language convey romance, and use this as inspiration to create the artwork below.</p><figure class="me mf mg mh gs mi gg gh paragraph-image"><div role="button" tabindex="0" class="mr ms di mt bf mu gg gh mx"><picture><img alt="" class="bf mj mk c" width="700" height="363" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
</figure><p id="8629" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">On the flip side, creatives can also use these clusters to learn what <em class="mc">not</em> to do. For example, here are more images from the same military and war cluster shown above. If, hypothetically speaking, they were presented with historical evidence that these kinds of images didn’t perform well for a given canvas, a creative strategist could infer that highly saturated silhouettes don’t work as well in this context, confirm it with a test to establish a causal relationship, and decide not to use that style for their title.</p><figure class="me mf mg mh gs mi gg gh paragraph-image"><div role="button" tabindex="0" class="mr ms di mt bf mu gg gh my"><picture><img alt="" class="bf mj mk c" width="700" height="433" role="presentation" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" /></picture></div>
<figcaption class="ml mm gi gg gh mn mo bd b be z dk">A creator can also spot patterns that didn’t work in the past, and avoid using them for future titles.</figcaption></figure><p id="fdd2" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Member clustering</strong></p><p id="6e6a" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Another complementary technique is member clustering, where we group members based on their preferences. We can group them by viewing behavior, or leverage our image personalization algorithm to find groups of members that positively responded to the same image asset.</p><p id="b85a" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">As an example, let’s say we are able to cluster Netflix members into two broad clusters — one that likes romance, and another that enjoys action. We can look at how these two groups of members responded to a title after its release. We might find that 80% of viewers of <em class="mc">Purple Hearts</em> belong to the romance cluster, while 20% belong to the action cluster. Furthermore, we might find that a representative romance fan (e.g., the cluster centroid) responds most positively to images featuring the star couple in an embrace. Meanwhile, viewers in the action cluster respond most strongly to images featuring a soldier on the battlefield. As we observe these patterns across many titles, we can learn to predict which user clusters might be interested in similar upcoming titles, and we can also learn which themes might resonate with these user clusters.
Insights like these can guide artwork creation strategy for future titles.</p><p id="0f15" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Conclusion</strong></p><p id="266c" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Our goal is to empower creatives with data-driven insights to create better artwork. Top-down and bottom-up methods approach this goal from different angles and provide insights with different tradeoffs.</p><p id="ff0a" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">Top-down features have the benefit of being clearly explainable and testable. On the other hand, it is relatively difficult to model the effects of interactions and combinations of features. It is also challenging to capture complex image features, requiring custom models. For example, there are many visually distinct ways to convey a theme of “love”: heart emojis, two people holding hands, or people gazing into each other’s eyes, and so on. Another challenge with top-down approaches is that our lower-level features could miss the true underlying trend. For example, we might detect that the colors green and blue are effective features for nature documentaries, but what is really driving effectiveness may be the portrayal of natural settings like forests or oceans.</p><p id="842d" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">In contrast, bottom-up methods model complex high-level features and their combinations, but their insights are less explainable and more subjective. Two users may look at the same cluster of images and extract different insights.
However, bottom-up methods are valuable because they can surface unexpected patterns, providing inspiration and leaving room for creative exploration and interpretation without being prescriptive.</p><p id="dddf" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">The two approaches are complementary. Unsupervised clusters can give rise to observable trends that we can then use to create new testable top-down hypotheses. Conversely, top-down labels can be used to describe unsupervised clusters to expose common themes within clusters that we might not have spotted at first glance. Our users synthesize information from both sources to design better artwork.</p><p id="e9dd" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">There are many other important considerations that our current models don’t account for. For example, there are factors outside of the image itself that might affect its effectiveness, like how popular a celebrity is locally, cultural differences in aesthetic preferences or how certain themes are portrayed, what device a member is using at the time and so on. As our member base becomes increasingly global and diverse, these are factors we need to account for in order to create an inclusive and personalized experience.</p><p id="334f" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi"><strong class="jo iq">Acknowledgements</strong></p><p id="016f" class="pw-post-body-paragraph jm jn ip jo b jp jq jr js jt ju jv jw jx jy jz ka kb kc kd ke kf kg kh ki kj ii bi">This work would not have been possible without our cross-functional partners in the creative innovation space. We would like to specifically thank Ben Klein and Amir Ziai for helping to build the technology we describe here.</p>]]></description>
      <link>https://netflixtechblog.com/discovering-creative-insights-in-promotional-artwork-295e4d788db5</link>
      <guid>https://netflixtechblog.com/discovering-creative-insights-in-promotional-artwork-295e4d788db5</guid>
      <pubDate>Mon, 30 Jan 2023 17:16:00 +0100</pubDate>
    </item>
  </channel>
</rss>
