<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:webfeeds="http://webfeeds.org/rss/1.0" version="2.0">
  <channel>
    <atom:link href="http://pubsubhubbub.appspot.com/" rel="hub"/>
    <atom:link href="https://f43.me/facebook-code.xml" rel="self" type="application/rss+xml"/>
    <title>Facebook Code</title>
    <description>Meet the engineers who code Facebook</description>
    <link>http://code.facebook.com</link>
    <webfeeds:icon>https://s2.googleusercontent.com/s2/favicons?alt=feed&amp;domain=code.facebook.com</webfeeds:icon>
    <generator>f43.me</generator>
    <lastBuildDate>Fri, 13 Mar 2026 06:40:08 +0100</lastBuildDate>
    <item>
      <title><![CDATA[How Advanced Browsing Protection Works in Messenger]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing the technical details behind how Advanced Browsing Protection (ABP) in Messenger protects the privacy of the links clicked on within chats while still warning people about malicious links.</li>
<li class="c1" aria-level="1">We hope that this post has helped to illuminate some of the engineering challenges and infrastructure components involved in providing this feature for our users.</li>
</ul><p>While <a href="https://engineering.fb.com/2023/12/06/security/building-end-to-end-security-for-messenger/" target="_blank" rel="noopener">end-to-end encryption (E2EE) on Messenger</a> ensures that direct messages and calls are protected, Messenger’s Safe Browsing feature safeguards against malicious links within end-to-end encrypted messages and calls on the app. If you’re sent an unsafe link for some reason – maybe it’s sent by someone you don’t know or by a friend whose account has been compromised – Safe Browsing warns you that the link points to an unsafe website that may try to steal passwords or other personal information from you.</p>
<p>In its standard setting, Safe Browsing uses on-device models to analyze malicious links shared in chats. But we’ve extended this further with an advanced setting called Advanced Browsing Protection (ABP) that leverages a continually updated watchlist of millions more potentially malicious websites.</p>
<p>To build ABP, we had to leverage a series of intricate infrastructure components and a complex system of cryptographic primitives, all working together with the goal of protecting user privacy in Messenger.</p>
<h2>Private Information Retrieval – The Starting Point for ABP</h2>
<p>ABP closely mirrors the setting for a cryptographic primitive known as private information retrieval (PIR). In the classical PIR setting, a client queries a server (that holds a database) to learn whether or not the subject of the query is a member of that database. This protocol aims for the server to learn as little information as possible (ideally no information) about the client’s query.</p>
<p>In a theoretical setting, the server could send the entire database to the client, allowing the client to perform subsequent query lookups on its own, without needing to involve the server anymore. However, the database used by ABP needs to be updated frequently, and is too large to reasonably be sent down to the client. Furthermore, revealing the entire database to the client could inadvertently aid attackers attempting to circumvent the system.</p>
<p>Other work has suggested that <a href="https://www.usenix.org/system/files/sec19-thomas.pdf" target="_blank" rel="noopener">this approach can be improved upon by using an oblivious pseudorandom function (OPRF)</a> and dividing the database into multiple shards (or “buckets”) so that the linear-time operation is performed over a fraction of the database.</p>
<p>This existing approach was the starting point for our implementation of ABP, but there were two issues we needed to address in order to adapt it to our setting.</p>
<ol><li class="c1" aria-level="1">An OPRF works well for queries that are exact matches into the database. However, URL-matching queries are not exact matches, as we will describe in more detail, shortly.</li>
<li class="c1" aria-level="1">Sharding also means that the client still needs to tell the server which bucket to look into. This inherently introduces a tradeoff between the privacy of the system and its efficiency/bandwidth: The less granular the buckets, the less efficient the protocol becomes, but also the less information is leaked from the client’s query to the server.</li>
</ol><p>There are also other approaches, namely cryptographic constructions, which improve this tradeoff by employing lattice-based techniques to reduce the amount of sharding needed. However, at the time of writing, these did not appear to be practical enough to completely eliminate the need for sharding at our scale. This could be a promising future direction for the system, though, and for industrial applications of PIR in general.</p>
<h2>How ABP Handles Prefix Queries for URLs</h2>
<p>The server’s database entries consist of URL domains with (and without) paths, which do not always correspond to exact link matches. For instance, if an entry for “example.com” existed in our database, and the client submits a query in the form, “example.com/a/b/index.html” this should be reported to the client as a match, even though the link contents do not match exactly.</p>
<p>Instead, what we need is a privacy-preserving “URL-matching” scheme between the client’s query and each of the database entries. Subdomains are also a consideration here, but we’ve omitted them for the simplicity of this example.</p>
<p>One simple approach we considered to address these prefix queries was to run a series of parallel queries for PIR, one for each path prefix of the URL. So, in our running example of the client query being “example.com/a/b/index.html” the client would create PIR queries for:</p>
<ul><li class="c1" aria-level="1">example.com</li>
<li class="c1" aria-level="1">example.com/a</li>
<li class="c1" aria-level="1">example.com/a/b</li>
<li class="c1" aria-level="1">example.com/a/b/index.html</li>
</ul><p>Functionally, this would satisfy prefix matching, but there is a privacy issue with this approach: Each of these path prefix queries leaks extra information about the client’s actual URL. If the PIR scheme we use does not leak any information to the server, then this might be acceptable, but if the server learns <em>B</em> bits of the client query, then in this scheme the server learns <em>P</em> * <em>B</em> bits, where <em>P</em> is the number of path prefixes in the URL. For extremely long URLs, this might even be enough to uniquely identify a plaintext link!</p>
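<p>As a concrete illustration, the naive approach amounts to expanding a URL into its path prefixes and issuing one PIR query per prefix. Below is a minimal sketch; the <code>path_prefixes</code> helper is hypothetical and not part of the actual client:</p>

```python
def path_prefixes(url: str) -> list[str]:
    """Expand a URL into its domain plus every path prefix.

    'example.com/a/b/index.html' ->
      ['example.com', 'example.com/a', 'example.com/a/b',
       'example.com/a/b/index.html']
    """
    domain, _, path = url.partition("/")
    prefixes = [domain]
    current = domain
    for segment in (path.split("/") if path else []):
        current = f"{current}/{segment}"
        prefixes.append(current)
    return prefixes
```

<p>In the naive scheme, each of these prefixes would become its own PIR query, which is exactly where the extra leakage comes from.</p>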
<p>In order to reduce the leakage to the server, we can instead have the server group together links that share the same domain. This way, the client can again request just one bucket (the bucket corresponding to the URL’s domain), then check <em>all</em> the prefix URL path components for membership in that one bucket.</p>
<p>This would indeed address the privacy issue so that the server only learns <em>B</em> bits. But it also creates a new efficiency problem: Bucket sizes can become unbalanced. We create buckets by hashing URLs. If we were to hash full URLs, we could expect bucket sizes to be approximately uniform because each blocklist entry is mapped to a bucket pseudorandomly. When we hash only domains, that’s no longer the case. If many blocklist entries share the same domain, they all end up in the same bucket. </p>
<p>It turns out that in practice many blocklisted URLs <em>do</em> share domains. For example, consider link shortening services: These services might host many, many URLs (both malicious and benign) that all share the same domain. If many links share the same domain and, hence, belong in the same bucket, then the size of the bucket might be too large to be able to return to the client. And since we apply padding to buckets, the response size would be equal to the maximum across all buckets! </p>
<h2>Pre-processing Rulesets</h2>
<p>To address this problem, we have the server perform a pre-processing step in which it attempts to balance buckets by generating a “ruleset”: a set of operations to process and hash a given URL. The server computes this ruleset and shares it with clients ahead of time so that the client can apply the same set of rules at lookup time.</p>
<p>Here’s an example of a ruleset containing three rules:</p>
<table class="c3" border="1" style="width: 762px;"><tbody><tr><td class="c2"><strong>Hash Prefix</strong></td>
<td class="c2"><strong># of Path Segments</strong></td>
</tr><tr><td class="c2">08bd4dd11758b503</td>
<td class="c2">2</td>
</tr><tr><td class="c2">fe891588d205cf7f</td>
<td class="c2">1</td>
</tr><tr><td class="c2">c078e5ff2e262830</td>
<td class="c2">4</td>
</tr></tbody></table><p>
Each row is a rule that maps an 8-byte hash prefix to a certain number of path segments to append to the running URL query. Using our example of the link “example.com/a/b/index.html,” the client starts by computing a short hash of the domain: Hash(“example.com”). Let’s say that it matches one of the hashes in the ruleset, 08bd4dd11758b503. Then the client is instructed to recompute the hash after appending two path segments, meaning that the client computes the new hash as Hash(“example.com/a/b”) and again checks to see if the ruleset contains an entry for the new hash. The client repeats these steps until the hash prefix does not exist in the ruleset, at which point it stops and outputs the first two bytes of that hash prefix as a bucket identifier.</p>
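<p>The client-side lookup loop can be sketched as follows. The hash function (truncated SHA-256), hex encoding, and ruleset representation here are illustrative assumptions, not the production choices:</p>

```python
import hashlib

def hash_prefix(s: str) -> str:
    # Illustrative choice: first 8 bytes of SHA-256, hex-encoded; the
    # actual production hash is not specified here.
    return hashlib.sha256(s.encode()).hexdigest()[:16]

def bucket_id(url: str, ruleset: dict[str, int]) -> str:
    """Apply the ruleset to a URL and return its 2-byte bucket identifier."""
    domain, _, path = url.partition("/")
    segments = path.split("/") if path else []
    current, used = domain, 0
    h = hash_prefix(current)
    # Keep appending path segments while the current hash has a rule.
    while h in ruleset and used < len(segments):
        used += ruleset[h]
        current = "/".join([domain] + segments[:used])
        h = hash_prefix(current)
    # The first two bytes (4 hex chars) of the final hash prefix
    # identify the bucket.
    return h[:4]
```
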
<p>The server generates the ruleset in an iterative process. The server starts with the assumption that each URL is hashed only by its domain and computes the initial buckets. It then identifies the largest bucket and finds the most common domain in that bucket. Then, it breaks up that bucket by adding a rule to append one or more additional URL segments for that domain. This process is repeated until all buckets are below an acceptable threshold.</p>
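<p>A simplified sketch of that server-side balancing loop follows. The hash function and ruleset representation are illustrative assumptions, the sketch always appends exactly one segment per rule, and a real implementation needs additional safeguards (for example, for entries with no path segments left to append):</p>

```python
import hashlib
from collections import Counter

def hp(s: str) -> str:
    # Illustrative hash: first 8 bytes of SHA-256, hex-encoded.
    return hashlib.sha256(s.encode()).hexdigest()[:16]

def mapped_key(url: str, ruleset: dict[str, int]) -> str:
    """Re-apply the ruleset to find the string a URL currently hashes under."""
    domain, _, path = url.partition("/")
    segments = path.split("/") if path else []
    current, used = domain, 0
    while hp(current) in ruleset and used < len(segments):
        used += ruleset[hp(current)]
        current = "/".join([domain] + segments[:used])
    return current

def build_ruleset(blocklist: list[str], max_bucket: int) -> dict[str, int]:
    """Iteratively split the largest bucket until all are small enough."""
    ruleset: dict[str, int] = {}
    while True:
        buckets = Counter(hp(mapped_key(u, ruleset))[:4] for u in blocklist)
        biggest, size = buckets.most_common(1)[0]
        if size <= max_bucket:
            return ruleset
        # Find the most common key in the oversized bucket and add a rule
        # telling clients to append one more path segment after it.
        keys = Counter(
            mapped_key(u, ruleset)
            for u in blocklist
            if hp(mapped_key(u, ruleset))[:4] == biggest
        )
        ruleset[hp(keys.most_common(1)[0][0])] = 1
```

<p>For example, if a link shortener’s domain dominates one bucket, a rule is added for that domain’s hash, and its entries are redistributed by their first path segment on the next pass.</p>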
<p>Because of the way the ruleset is generated, any URL that has a blocked prefix is guaranteed to hash to the bucket containing that entry. This invariant holds so long as the blocklist doesn’t contain redundant entries (e.g., one entry for “example.com” and another for “example.com/a”) and as long as the hash function used for ruleset mapping doesn’t produce any collisions among blocklist entries.</p>
<p>At lookup time, the client uses the same ruleset to compute the URL’s bucket identifier. The client sends the bucket identifier to the server alongside an OPRF-blinded element for each path segment of the query link. The server responds with the bucket contents and the OPRF-blinded responses. Finally, the client unblinds the OPRF output and checks for an exact match of any of the OPRF outputs in the bucket contents. If a match is found, then the URL is flagged.</p>
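<p>The blind/evaluate/unblind round trip follows the standard multiplicative DH-OPRF pattern. Here is a toy sketch of the algebra; the group parameters and the hash-to-group construction are deliberately tiny, insecure simplifications for illustration only, and a production system would use a standard prime-order group:</p>

```python
import hashlib
import secrets

# Toy prime-order group: safe prime P = 2*Q + 1, with G generating the
# order-Q subgroup. NOT secure; for illustration only.
P, Q, G = 2039, 1019, 4

def hash_to_group(x: str) -> int:
    # Simplified hash-to-group mapping; insecure (the discrete log is
    # known to everyone), demo only.
    e = int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % Q
    return pow(G, e or 1, P)

def oprf(x: str, k: int) -> int:
    # F(k, x) = H(x)^k; what the server stores for each blocklist entry.
    return pow(hash_to_group(x), k, P)

def client_blind(x: str) -> tuple[int, int]:
    r = secrets.randbelow(Q - 1) + 1           # random scalar in [1, Q-1]
    return r, pow(hash_to_group(x), r, P)      # send H(x)^r to the server

def server_evaluate(blinded: int, k: int) -> int:
    return pow(blinded, k, P)                  # (H(x)^r)^k = H(x)^(r*k)

def client_unblind(evaluated: int, r: int) -> int:
    r_inv = pow(r, -1, Q)                      # r^-1 modulo the group order
    return pow(evaluated, r_inv, P)            # recover H(x)^k
```

<p>The essential property is that the server only ever sees the blinded element H(x)<sup>r</sup>, a uniformly random group element that reveals nothing about the query, while the client recovers exactly the value the server stored for matching blocklist entries.</p>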
<p>Note that in order to hide the number of path segments of the query link from the server, we must appropriately pad up to a fixed maximum number of elements in order to prevent the length of the request from revealing information about the link. Likewise, we must also pad the bucket contents so that all buckets are of the same length, so that the length of the server response doesn’t reveal information about the client’s link.</p>
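<p>The padding logic itself is simple; a sketch (the dummy-element convention is an assumption for illustration):</p>

```python
def pad_to_fixed_length(elements: list[bytes], max_len: int,
                        dummy: bytes) -> list[bytes]:
    """Pad a list up to a fixed length so that message size leaks
    nothing about the true element count (e.g., path segments)."""
    if len(elements) > max_len:
        raise ValueError("query exceeds the fixed maximum")
    return elements + [dummy] * (max_len - len(elements))
```
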
<h2>Safeguarding Client Queries</h2>
<p>Now, in the description of this protocol so far, the client still sends a bucket identifier (computed from the URL) to the server in order to be able to efficiently process the query. We can use additional mechanisms to further reduce the bits of information that a hypothetically adversarial server could glean from the client’s query, which we will cover in the following sections.</p>
<h3>Confidential Computing</h3>
<p>In order to limit the exposure of these hash prefixes to Meta’s servers, we leverage <a href="https://www.amd.com/en/developer/sev.html" target="_blank" rel="noopener">AMD’s SEV-SNP technology</a> to provide a confidential virtual machine (CVM) in which the server-side code processes these hash prefixes. At a high level, the CVM provides an environment for us to run application code that we can generate attestation reports for. It also allows us to bootstrap a secure channel from a client to the CVM after the client establishes “trust” by verifying these attestation reports.</p>
<p>An attestation report contains:</p>
<ul><li class="c1" aria-level="1">A container manifest containing hash digests of the CVM’s launch configuration and packages, which essentially acts as a commitment to the application logic running on the CVM.</li>
<li class="c1" aria-level="1">A public key generated on CVM startup, corresponding to a private key that remains secured within the TEE.</li>
<li class="c1" aria-level="1">A certificate chain, with its root certificate established by AMD’s Key Distribution Service.</li>
<li class="c1" aria-level="1">A signature from the <a href="https://developers.cloudflare.com/key-transparency/">transparency log witness</a>, which provides a uniqueness guarantee that mitigates server-side equivocation.</li>
</ul><p>Upon receiving this report, the client verifies all of the certificates/signatures and then uses the embedded public key to establish a secure channel with the CVM. This secure channel is used by the client to transmit the bucket identifier to the CVM, which then uses the corresponding decryption key to decrypt the client’s request to obtain the plaintext bucket identifier.</p>
<p>Last year, we posted about our usage of AMD SEV-SNP for providing a trusted execution environment for <a href="https://engineering.fb.com/2025/04/29/security/whatsapp-private-processing-ai-tools/" target="_blank" rel="noopener">WhatsApp Private Processing</a>, and many of the details behind the hardware setup here are similar.</p>
<p>One aspect still missing from this verification procedure is the release of these artifacts so that external security researchers can validate them. We aim to provide a platform for hosting these artifacts in the near future.</p>
<h3>Oblivious RAM</h3>
<p>While the hardware guarantees provided by AMD SEV-SNP do allow us to reduce the exposure of these hash prefixes and send them through an encrypted channel, they are not sufficient by themselves to fully hide these hash prefixes from an observer that obtains administrative privileges of the host system to monitor memory accesses over time. Although the memory pages are encrypted through AMD’s Secure Nested Paging (SNP) technology, the patterns of access themselves must also be kept private.</p>
<p>A straightforward way to address this would be to load the database into the machine’s memory on startup and, upon every client request, ensure that each one of the B buckets in the database is retrieved from memory, even though only one bucket is actually included in the server’s response. While this is fairly wasteful from a purely computational perspective (the B-1 accesses don’t actually factor into the response), the server can avoid directly leaking the bucket index being fetched to an adversary that can observe its memory access patterns when serving client requests.</p>
<p>For a really large database, these B-1 accesses can end up being a bottleneck on the overall runtime of the server. There are two methods we leverage to optimize this performance overhead without compromising on privacy:</p>
<ol><li class="c1" aria-level="1">Since our database is (at the time of writing) not overwhelmingly large, we can fit multiple disparate copies of the same database into memory on a single machine. Incoming client requests are assigned one of these copies based on availability, since the linear scan is inherently sequential in nature.</li>
<li class="c1" aria-level="1">We can improve on the number of accesses asymptotically, from linear to sublinear, by relying on an algorithm called <a href="https://eprint.iacr.org/2013/280.pdf">Path ORAM</a>.</li>
</ol><p>The exact details of how Path ORAM works in our setting are beyond the scope of this post, but you can find more information about this in <a href="https://github.com/facebook/oram">our open-source library for Path ORAM</a>.</p>
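<p>For intuition, the baseline touch-every-bucket scan (before the Path ORAM optimization) can be sketched like this. Real constant-time code would be written in a lower-level language, since Python offers no timing guarantees:</p>

```python
def oblivious_fetch(buckets: list[bytes], target: int) -> bytes:
    """Return buckets[target] while reading every bucket, so the memory
    access pattern is independent of which bucket was requested.

    Assumes all buckets are padded to the same length.
    """
    size = len(buckets[0])
    result = bytearray(size)
    for i, bucket in enumerate(buckets):
        # Branchless select: mask is 0xFF for the target bucket and 0x00
        # otherwise, so every byte of every bucket is read and combined.
        mask = -(i == target) & 0xFF
        for j in range(size):
            result[j] |= bucket[j] & mask
    return bytes(result)
```
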
<h3>Using Oblivious HTTP</h3>
<p>To further strengthen the privacy guarantees of ABP, we leverage a third-party proxy and the <a href="https://www.ietf.org/rfc/rfc9458.html" target="_blank" rel="noopener">Oblivious HTTP (OHTTP) protocol</a> to de-identify client requests. The third-party proxy sits in between the client and server, processing encrypted client requests by stripping out identifying information from them and forwarding these de-identified requests to the server, which, in turn, is able to decrypt the request payload. This makes it more difficult for the server to be able to observe identifiers (such as the client’s IP address).</p>
<h2>The ABP Request Lifecycle</h2>
<p>The overall lifecycle of ABP for a request works as follows:</p>
<p>Pre-processing/background phase:</p>
<ol><li class="c1" aria-level="1">On a periodic basis, the server pulls in the latest updates to the URL database, iteratively computing a ruleset that balances the database entries into similarly-sized buckets. </li>
<li class="c1" aria-level="1">These buckets are then loaded onto a TEE using ORAM. </li>
<li class="c1" aria-level="1">The TEE generates a keypair, and the public key is embedded in an attestation report, generated by AMD SEV-SNP hardware. </li>
<li class="c1" aria-level="1">The attestation report and the current ruleset for the database are provided to the client upon request (through a third-party proxy).</li>
<li class="c1" aria-level="1">The client verifies the signatures contained in the attestation report, and locally stores a copy of the public key and database ruleset.</li>
</ol><p>And then, on each client request corresponding to a link click:</p>
<ol><li class="c1" aria-level="1">The client, upon clicking a link in an E2EE chat, calculates the bucket identifier for the link by applying the rules of the “ruleset” to the URL. </li>
<li class="c1" aria-level="1">This bucket identifier is encrypted for the specific CVM instance using its public key. </li>
<li class="c1" aria-level="1">The client also computes a series of OPRF requests (blinded group elements), one for each path segment of the URL (padded). </li>
<li class="c1" aria-level="1">The encrypted bucket identifier and the OPRF requests are sent through a third-party proxy to the server, together with a client public key used to establish a secure channel.</li>
<li class="c1" aria-level="1">The server computes the server-side evaluation of the OPRF requests to produce OPRF responses.</li>
<li class="c1" aria-level="1">The server then decrypts the bucket identifier, uses ORAM to look up the corresponding bucket contents, and returns the OPRF responses and bucket contents to the client, encrypted under the client’s public key.</li>
<li class="c1" aria-level="1">The client then decrypts the server’s response, and uses the bucket contents along with the OPRF responses to complete the OPRF evaluation and determine if a match was found. If a match was found, then the client displays a warning about the query link.</li>
</ol>]]></description>
      <link>https://engineering.fb.com/2026/03/09/security/how-advanced-browsing-protection-works-in-messenger/</link>
      <guid>https://engineering.fb.com/2026/03/09/security/how-advanced-browsing-protection-works-in-messenger/</guid>
      <pubDate>Mon, 09 Mar 2026 17:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[FFmpeg at Meta: Media Processing at Scale]]></title>
      <description><![CDATA[<p>FFmpeg is truly a multi-tool for media processing. As an industry-standard tool it supports a wide variety of audio and video codecs and container formats. It can also orchestrate complex chains of filters for media editing and manipulation. For the people who use our apps, FFmpeg plays an important role in enabling new video experiences and improving the reliability of existing ones.</p>
<p>Meta executes ffmpeg (the main CLI application) and ffprobe (a utility for obtaining media file properties) binaries tens of billions of times a day, introducing unique challenges when dealing with media files. FFmpeg can easily perform transcoding and editing on individual files, but our workflows have additional requirements to meet our needs. For many years we had to rely on our own internally developed fork of FFmpeg to provide features that have only recently been added to FFmpeg, such as threaded multi-lane encoding and real-time quality metric computation.</p>
<p>Over time, our internal fork came to diverge significantly from the upstream version of FFmpeg. At the same time, new versions of FFmpeg brought support for new codecs and file formats, and reliability improvements, all of which allowed us to ingest more diverse video content from users without disruptions. This necessitated that we support both recent open-source versions of FFmpeg alongside our internal fork. Not only did this create a gradually divergent feature set, it also created challenges around safely rebasing our internal changes to avoid regressions.</p>
<p>As our internal fork became increasingly outdated, we collaborated with FFmpeg developers, FFlabs, and VideoLAN to develop features in FFmpeg that allowed us to fully deprecate our internal fork and rely exclusively on the upstream version for our use cases. Using upstreamed patches and refactorings we’ve been able to fill two important gaps that we had previously relied on our internal fork to fill: threaded, multi-lane transcoding and real-time quality metrics.  </p>
<h2>Building More Efficient Multi-Lane Transcoding for VOD and Livestreaming</h2>
<figure id="attachment_23676" aria-describedby="caption-attachment-23676" class="wp-caption alignnone c1"><img class="size-full wp-image-23676" src="https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg" alt="" width="1920" height="1080" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg 1920w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-2.jpg?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23676" class="wp-caption-text">A video transcoding pipeline producing multiple outputs at different resolutions.</figcaption></figure><p>When a user uploads a video through one of our apps, we generate a set of encodings to support Dynamic Adaptive Streaming over HTTP (DASH) playback. DASH playback allows the app’s video player to dynamically choose an encoding based on signals such as network conditions. These encodings can differ in resolution, codec, framerate, and visual quality level but they are created from the same source encoding, and the player can seamlessly switch between them in real time.</p>
<p>In a very simple system separate FFmpeg command lines can generate the encodings for each lane one-by-one in serial. This could be optimized by running each command in parallel, but this quickly becomes inefficient due to the duplicate work done by each process.</p>
<p>To work around this, multiple outputs could be generated within a single FFmpeg command line, decoding the frames of a video once and sending them to each output’s encoder instance. This eliminates a lot of overhead by deduplicating the video decoding and the per-process startup cost incurred by each command line. Given that we process over 1 billion video uploads daily, each requiring multiple FFmpeg executions, reductions in per-process compute usage yield significant efficiency gains.</p>
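<p>As a rough illustration, a single multi-output invocation might be assembled like this. The codec and scaling flags are placeholders, not Meta’s actual encoding settings:</p>

```python
def multi_lane_command(src: str, lanes: list[tuple[str, str]]) -> list[str]:
    """Build one ffmpeg invocation that decodes `src` a single time and
    encodes one output per (resolution, filename) lane.

    The codec and scaling flags are illustrative placeholders only.
    """
    cmd = ["ffmpeg", "-i", src]
    for resolution, out in lanes:
        cmd += [
            "-map", "0:v", "-map", "0:a?",  # reuse the decoded streams
            "-s", resolution,               # scale this lane's output
            "-c:v", "libx264",              # per-lane encoder instance
            out,
        ]
    return cmd

# Two lanes from one decode: a 720p and a 360p encoding.
cmd = multi_lane_command(
    "input.mp4",
    [("1280x720", "out_720.mp4"), ("640x360", "out_360.mp4")],
)
```

<p>Each lane contributes its own <code>-map</code> and encoder arguments, but the input is demuxed and decoded only once.</p>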
<p>Our internal FFmpeg fork provided an additional optimization to this: parallelized video encoding. While individual video encoders are often internally multi-threaded, previous FFmpeg versions executed each encoder in serial for a given frame when multiple encoders were in use. By running all encoder instances in parallel, better parallelism can be obtained overall.</p>
<p>Thanks to contributions from FFmpeg developers, including those at FFlabs and VideoLAN, more efficient threading was implemented starting with FFmpeg 6.0, with the finishing touches landing in 8.0. This was directly influenced by the design of our internal fork and was one of the main features we had relied on it to provide. This development led to the <a href="https://x.com/FFmpeg/status/1731288541395587411" target="_blank" rel="noopener">most complex refactoring of FFmpeg in decades</a> and has enabled more efficient encodings for all FFmpeg users.</p>
<p>To fully migrate off of our internal fork we needed one more feature implemented upstream: real-time quality metrics.</p>
<h2>Enabling Real-Time Quality Metrics While Transcoding for Livestreams</h2>
<p><img class="alignnone size-full wp-image-23675" src="https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg" alt="" width="1920" height="1080" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg 1920w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2026/02/FFmpeg-at-Meta-image-1.jpg?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Visual quality metrics, which give a numeric representation of the perceived visual quality of media, can be used to quantify the quality loss incurred from compression. These metrics are categorized as reference or no-reference metrics, where the former compares a <em>reference</em> encoding to some other <em>distorted</em> encoding.</p>
<p>FFmpeg can compute various visual quality metrics such as PSNR, SSIM, and VMAF using two existing encodings in a separate command line after encoding has finished. This is okay for offline or VOD use cases, but not for livestreaming where we might want to compute quality metrics in real time.</p>
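<p>The separate-command-line approach for offline metrics looks roughly like this, using FFmpeg’s <code>psnr</code> filter; the file names are placeholders:</p>

```python
def psnr_command(reference: str, distorted: str) -> list[str]:
    """Compare a distorted encoding against its reference with FFmpeg's
    psnr filter, discarding the decoded frames (-f null)."""
    return [
        "ffmpeg",
        "-i", distorted,    # first input: the distorted encoding
        "-i", reference,    # second input: the reference encoding
        "-lavfi", "psnr",   # compute PSNR between the two inputs
        "-f", "null", "-",  # no output file; statistics go to the log
    ]
```

<p>SSIM and VMAF work analogously via the <code>ssim</code> and <code>libvmaf</code> filters. For livestreams, however, waiting for a second pass like this is not an option.</p>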
<p>To do this, we need to insert a video decoder after each video encoder used by each output lane. These provide bitmaps for each frame in the video <em>after</em> compression has been applied so that we can compare against the frames <em>before</em> compression. In the end, we can produce a quality metric for each encoded lane in real time using a single FFmpeg command line.</p>
<p>Thanks to “in-loop” decoding, which was enabled by FFmpeg developers including those from FFlabs and VideoLAN, beginning with FFmpeg 7.0, we no longer have to rely on our internal FFmpeg fork for this capability.</p>
<h2>We Upstream When It Will Have the Most Community Impact</h2>
<p>Things like real-time quality metrics while transcoding and more efficient threading can bring efficiency gains to a variety of FFmpeg-based pipelines both in and outside of Meta, and we strive to enable these developments upstream to benefit the FFmpeg community and wider industry. However, there are some patches we’ve developed internally that don’t make sense to contribute upstream. These are highly specific to our infrastructure and don’t generalize well.</p>
<p>FFmpeg supports hardware-accelerated decoding, encoding, and filtering with devices such as NVIDIA’s NVDEC and NVENC, AMD’s Unified Video Decoder (UVD), and Intel’s Quick Sync Video (QSV). Each device is supported through an implementation of standard APIs in FFmpeg, allowing for easier integration and minimizing the need for device-specific command line flags. We’ve added support for the <a href="https://ai.meta.com/blog/meta-scalable-video-processor-MSVP/" target="_blank" rel="noopener">Meta Scalable Video Processor (MSVP)</a>, our custom ASIC for video transcoding, through these same APIs, enabling the use of common tooling across different hardware platforms with minimal platform-specific quirks.</p>
<p>As MSVP is only used within Meta’s own infrastructure, it would create a challenge for FFmpeg developers to support it without access to the hardware for testing and validation. In this case, it makes sense to keep patches like this internal since they wouldn’t provide benefit externally. We’ve taken on the responsibility of rebasing our internal patches onto more recent FFmpeg versions over time, utilizing extensive validation to ensure robustness and correctness during upgrades.</p>
<h2>Our Continued Commitment to FFmpeg</h2>
<p>With more efficient multi-lane encoding and real-time quality metrics, we were able to fully deprecate our internal FFmpeg fork for all VOD and livestreaming pipelines. And thanks to standardized hardware APIs in FFmpeg, we’ve been able to support our MSVP ASIC alongside software-based pipelines with minimal friction.</p>
<p>FFmpeg has withstood the test of time with over 25 years of active development. Developments that improve resource utilization, add support for new codecs and features, and increase reliability enable robust support for a wider range of media. For people on our platforms, this means enabling new experiences and improving the reliability of existing ones. We plan to continue investing in FFmpeg in partnership with open source developers, bringing benefits to Meta, the wider industry, and people who use our products.</p>
<h2>Acknowledgments</h2>
<p><em>We would like to acknowledge contributions from the open source community, our partners in FFlabs and VideoLAN, and many Meta engineers, including Max Bykov, Jordi Cenzano Ferret, Tim Harris, Colleen Henry, Mark Shwartzman, Haixia Shi, Cosmin Stejerean, Hassene Tmar, and Victor Loh.</em></p>]]></description>
      <link>https://engineering.fb.com/2026/03/02/video-engineering/ffmpeg-at-meta-media-processing-at-scale/</link>
      <guid>https://engineering.fb.com/2026/03/02/video-engineering/ffmpeg-at-meta-media-processing-at-scale/</guid>
      <pubDate>Mon, 02 Mar 2026 21:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Meta recognizes the long-term benefits of jemalloc, a high-performance memory allocator, in its software infrastructure.</li>
<li class="c1" aria-level="1">We are renewing focus on jemalloc, aiming to reduce maintenance needs and modernize the codebase while continuing to evolve the allocator to adapt to the latest hardware and workloads.</li>
<li class="c1" aria-level="1">We are committed to continuing jemalloc development with the open source community and welcome contributions and collaborations.</li>
</ul><p>Building a software system is a lot like building a skyscraper: The product everyone sees is the top, but the part that keeps it from falling over is the foundation buried in the dirt and the scaffolding hidden from sight.</p>
<p><a href="https://github.com/jemalloc/jemalloc">jemalloc</a>, the high performance memory allocator, has consistently been a highly-leveraged component within our software stack, adapting over time to changes in underlying hardware and upper-layer software. Alongside the Linux kernel and the compilers, it has delivered long-term benefits to Meta, contributing to a reliable and performant infrastructure. </p>
<h2>Listening, Reflecting, and Changing</h2>
<p>High leverage comes with high stakes. On the spectrum of practical versus principled engineering practice, foundational software components like jemalloc need the highest rigor. With the leverage jemalloc provides, however, it can be tempting to realize some short-term benefit. It requires strong self-discipline as an organization to resist that temptation and adhere to the core engineering principles. </p>
<p>In recent years, there has been a gradual shift away from the core engineering principles that have long guided jemalloc’s development. While some decisions delivered immediate benefits, the resulting technical debt eventually slowed progress.</p>
<p>We took the community’s feedback to heart. In the spirit of collaboration, we have reflected deeply on our stewardship and its impact on jemalloc’s long-term health. We’ve met with some members of the community, including the project’s founder, <a href="https://jasone.github.io/">Jason Evans</a>, to share our introspection and how we are changing our approach. We’ve started an effort to remove technical debt and rebuild a long-term roadmap for jemalloc. </p>
<h2>A New Chapter for jemalloc</h2>
<p>As a result of these conversations with the community, the original <a href="https://github.com/jemalloc/jemalloc">jemalloc open source repository</a> has been unarchived. We are grateful for the opportunity to continue as stewards of the project. Meta is renewing its focus on jemalloc, aiming to reduce maintenance needs and modernize the codebase while continuing to evolve the allocator to adapt to the latest and emerging hardware and workloads.</p>
<p>Looking ahead, our current plan for jemalloc focuses on several key areas of improvement:</p>
<ul><li class="c1" aria-level="1"><strong>Technical Debt Reduction</strong>: We are focusing on cleaning up technical debt, refactoring, and enhancing jemalloc to ensure it remains efficient, reliable and easy to use for all users.</li>
<li class="c1" aria-level="1"><strong>Huge-Page Allocator:</strong> We will continue to improve jemalloc’s huge-page allocator (HPA) to better utilize transparent hugepages (THP) for improved CPU efficiency.</li>
<li class="c1" aria-level="1"><strong>Memory Efficiency:</strong> We plan to deliver improvements to packing, caching, and purging mechanisms for optimized memory efficiency.</li>
<li class="c1" aria-level="1"><strong>AArch64 Optimizations:</strong> We will make sure jemalloc has good out-of-the-box performance for the AArch64 (ARM64) platform.</li>
</ul><p>We know that trust is earned through action. Our hope is that, over time, our renewed commitment will be evident in the health and progress of jemalloc. We invite the community to join us in this new chapter — share your feedback and help shape jemalloc’s future. We look forward to collaborating with the community to drive jemalloc forward.</p>]]></description>
      <link>https://engineering.fb.com/2026/03/02/data-infrastructure/investing-in-infrastructure-metas-renewed-commitment-to-jemalloc/</link>
      <guid>https://engineering.fb.com/2026/03/02/data-infrastructure/investing-in-infrastructure-metas-renewed-commitment-to-jemalloc/</guid>
      <pubDate>Mon, 02 Mar 2026 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[RCCLX: Innovating GPU communications on AMD platforms]]></title>
      <description><![CDATA[<p>We are open-sourcing the initial version of <a href="https://github.com/meta-pytorch/torchcomms/tree/main/comms/rcclx/develop" target="_blank" rel="noopener">RCCLX</a> – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with <a href="https://pytorch.org/blog/torchcomms/" target="_blank" rel="noopener">Torchcomms</a> and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend.</p>
<p>Communication patterns for AI models are constantly evolving, as are hardware capabilities. We want to iterate quickly on collectives, transports, and novel features on AMD platforms. Earlier, we developed and open-sourced <a href="https://arxiv.org/pdf/2510.20171" target="_blank" rel="noopener">CTran</a>, a custom transport library, on the NVIDIA platform. With RCCLX, we have brought CTran to AMD platforms, enabling AllToAllvDynamic, a GPU-resident collective. While not all CTran features are currently integrated into the open source RCCLX library, we’re aiming to have them available in the coming months. </p>
<p>In this post, we highlight two new features – Direct Data Access (DDA) and Low Precision Collectives. These features provide significant performance improvements on AMD platforms and we are excited to share this with the community. </p>
<h2>Direct Data Access (DDA) – Lightweight Intra-node Collectives</h2>
<p>Large language model inference operates through two distinct computational stages, each with fundamentally different performance characteristics: </p>
<ul><li class="c1" aria-level="1"><strong>The prefill stage</strong> processes the input prompt, which can span thousands of tokens, to generate a key-value (KV) cache for each transformer layer of the model. This stage is compute-bound because the attention mechanism scales quadratically with sequence length, making it highly demanding on GPU computational resources.</li>
<li class="c1" aria-level="1"><strong>The decoding stage</strong> then utilizes and incrementally updates the KV cache to generate tokens one by one. Unlike prefill, decoding is memory-bound, as the I/O time of reading memory dominates attention time, with model weights and the KV cache occupying the majority of memory.</li>
</ul><p><a href="https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/" target="_blank" rel="noopener">Tensor parallelism</a> enables models to be distributed across multiple GPUs by sharding individual layers into smaller, independent blocks that execute on different devices. However, one important challenge is the AllReduce communication operation can contribute up to 30% of end-to-end (E2E) latency. To address this bottleneck, Meta developed two DDA algorithms. </p>
<ul><li class="c1" aria-level="1"><strong>The DDA flat algorithm</strong> improves allreduce latency for small message sizes by allowing each rank to directly load memory from other ranks and perform local reduce operations, reducing latency from O(n) to O(1) at the cost of increasing the total data exchanged from O(n) to O(n²).</li>
<li class="c1" aria-level="1"><strong>The DDA tree algorithm</strong> breaks the allreduce into two phases (reduce-scatter and all-gather) and uses direct data access in each step, moving the same amount of data as the ring algorithm but reducing latency to a constant factor for slightly larger message sizes.</li>
</ul><p><img class="alignnone size-full wp-image-23620" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-1.png" alt="" width="1442" height="780" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-1.png 1442w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-1.png?resize=916,495 916w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-1.png?resize=768,415 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-1.png?resize=1024,554 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-1.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-1.png?resize=192,104 192w" sizes="(max-width: 992px) 100vw, 62vw" /><img class="alignnone size-full wp-image-23624" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-.png" alt="" width="1052" height="710" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-.png 1052w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-.png?resize=916,618 916w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-.png?resize=768,518 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-.png?resize=1024,691 1024w, 
https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-.png?resize=96,65 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-.png?resize=192,130 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
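<p>To make the flat algorithm concrete, here is a minimal sketch in plain Python (a simulation of the access pattern, not RCCLX code): every rank directly loads each peer’s buffer and reduces locally, so all ranks finish in a constant number of steps at the cost of O(n²) total data movement.</p>

```python
# Illustrative simulation of the DDA "flat" allreduce pattern
# (plain Python, not RCCLX code): every rank directly loads each
# peer's buffer and performs the reduction locally in one step.

def flat_allreduce(rank_buffers):
    """rank_buffers: one list of values per rank; returns per-rank results."""
    n_ranks = len(rank_buffers)
    results = []
    for rank in range(n_ranks):
        # Each rank reads all peers' memory directly (O(n) loads per rank,
        # O(n^2) total traffic) and reduces locally -- O(1) latency steps.
        reduced = [0] * len(rank_buffers[0])
        for peer in range(n_ranks):
            for i, value in enumerate(rank_buffers[peer]):
                reduced[i] += value
        results.append(reduced)
    return results

buffers = [[1, 2], [10, 20], [100, 200], [1000, 2000]]
print(flat_allreduce(buffers)[0])  # every rank ends with [1111, 2222]
```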
<p>The performance improvements of DDA over baseline communication libraries are substantial, particularly on AMD hardware. With AMD MI300X GPUs, DDA outperforms the RCCL baseline by 10-50% for decode (small message sizes) and yields a 10-30% speedup for prefill. These improvements resulted in an approximately 10% reduction in time-to-incremental-token (TTIT), directly enhancing the user experience during the critical decoding phase.</p>
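<p>To see how a communication-level speedup translates into end-to-end latency, a quick back-of-envelope calculation helps (illustrative numbers only, using the up-to-30% AllReduce share noted earlier):</p>

```python
# Worked example with illustrative numbers: if AllReduce accounts for 30%
# of end-to-end latency and a faster algorithm makes that portion 2x
# faster, how much does overall latency improve?

def e2e_speedup(comm_fraction, comm_speedup):
    """Amdahl-style estimate: only comm_fraction of the time gets faster."""
    new_latency = (1 - comm_fraction) + comm_fraction / comm_speedup
    return 1 / new_latency

print(round(e2e_speedup(0.30, 2.0), 2))  # 1.18 -- about an 18% E2E gain
```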
<h2>Low-precision Collectives</h2>
<p>Low-precision (LP) collectives are a set of distributed communication algorithms — AllReduce, AllGather, AlltoAll, and ReduceScatter — optimized for AMD Instinct MI300/MI350 GPUs to accelerate AI training and inference workloads. These collectives support both FP32 and BF16 data types, leveraging FP8 quantization for up to 4:1 compression, which significantly reduces communication overhead and improves scalability and resource utilization for large message sizes (≥16MB). </p>
<p>The algorithms use parallel peer-to-peer (P2P) mesh communication, fully exploiting AMD’s Infinity Fabric for high bandwidth and low latency, while compute steps are performed in high precision (FP32) to maintain numerical stability. Precision loss is primarily dictated by the number of quantization operations — typically one or two per data type in each collective — and whether the data can be adequately represented within the FP8 range. </p>
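<p>The quantize-for-transport, reduce-in-full-precision idea can be sketched in plain Python (an illustration of the scheme, not the RCCLX kernels; real FP8 formats differ from the simple scale-and-round shown here):</p>

```python
# Illustrative sketch of the low-precision collective idea: payloads are
# quantized to 8 bits with a per-tensor scale before the exchange, then
# dequantized and reduced in full precision. Not the RCCLX kernels.

def quantize_8bit(values):
    """Map floats onto signed 8-bit integers plus one full-precision scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale  # 4:1 smaller than FP32 on the wire

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

def lp_allreduce(rank_tensors):
    # Each rank ships a quantized copy; the reduction runs in full precision.
    payloads = [quantize_8bit(t) for t in rank_tensors]
    totals = [0.0] * len(rank_tensors[0])
    for quantized, scale in payloads:
        for i, value in enumerate(dequantize(quantized, scale)):
            totals[i] += value
    return [list(totals) for _ in rank_tensors]

tensors = [[float(rank + 1)] * 4 for rank in range(4)]
print([round(v, 3) for v in lp_allreduce(tensors)[0]])  # [10.0, 10.0, 10.0, 10.0]
```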
<p>By dynamically enabling LP collectives, users can selectively activate these optimizations in E2E scenarios that benefit most from the performance gains. Based on internal experiments, we have observed significant speedups for FP32 and notable improvements for BF16; it’s important to note that these collectives have been tuned for single-node deployments at this time. </p>
<p>Reducing precision can impact numeric accuracy, so we tested for this and found the results acceptable for our workloads. This flexible approach allows teams to maximize throughput while maintaining acceptable numerical accuracy, and it is now fully integrated and available in RCCLX for AMD platforms — simply set the environment variable RCCL_LOW_PRECISION_ENABLE=1 to get started.</p>
<figure id="attachment_23619" aria-describedby="caption-attachment-23619" class="wp-caption alignnone c2"><img class="size-full wp-image-23619" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-3.png" alt="" width="1218" height="750" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-3.png 1218w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-3.png?resize=916,564 916w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-3.png?resize=768,473 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-3.png?resize=1024,631 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-3.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-3.png?resize=192,118 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23619" class="wp-caption-text">MI300 – Float LP AllReduce speedup.</figcaption></figure><figure id="attachment_23621" aria-describedby="caption-attachment-23621" class="wp-caption alignnone c3"><img class="size-full wp-image-23621" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-4.png" alt="" width="1214" height="746" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-4.png 1214w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-4.png?resize=916,563 916w, 
https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-4.png?resize=768,472 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-4.png?resize=1024,629 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-4.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-4.png?resize=192,118 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23621" class="wp-caption-text">MI300 – Float LP AllGather speedup.</figcaption></figure><figure id="attachment_23623" aria-describedby="caption-attachment-23623" class="wp-caption alignnone c4"><img class="size-full wp-image-23623" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-5.png" alt="" width="1216" height="742" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-5.png 1216w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-5.png?resize=916,559 916w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-5.png?resize=768,469 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-5.png?resize=1024,625 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-5.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-5.png?resize=192,117 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23623" 
class="wp-caption-text">MI300 – Float LP AllToAll speedup.</figcaption></figure><figure id="attachment_23618" aria-describedby="caption-attachment-23618" class="wp-caption alignnone c5"><img class="size-full wp-image-23618" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-6.png" alt="" width="1206" height="748" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-6.png 1206w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-6.png?resize=916,568 916w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-6.png?resize=768,476 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-6.png?resize=1024,635 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-6.png?resize=96,60 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-6.png?resize=192,119 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23618" class="wp-caption-text">MI300 – Float LP ReduceScatter speedup.</figcaption></figure><p>We are observing the following results from E2E inference workload evaluations when selectively enabling LP collectives:</p>
<ul><li>A delta of approximately 0.3% on GSM8K evaluation runs.</li>
<li>~9–10% decrease in latency.</li>
<li>~7% increase in throughput.</li>
</ul><p>The throughput measurements shown in the graphs were obtained using param-bench rccl-tests. For the MI300, the tests were run on RCCLX built with ROCm 6.4, and for the MI350, on RCCLX built with ROCm 7.0. Each test included 10 warmup iterations followed by 100 measurement iterations. The reported results represent the average throughput across the measurement iterations.</p>
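<p>The warmup-then-measure methodology can be sketched as follows (a simplified stand-in; the published numbers come from param-bench rccl-tests, and the measured operation here is a placeholder for a real collective call):</p>

```python
import time

# Sketch of the warmup-then-measure methodology described above:
# discard warmup iterations, then report average throughput over the
# measurement iterations. Not the param-bench rccl-tests harness.

def benchmark(op, nbytes, warmup=10, iters=100):
    for _ in range(warmup):           # warm caches, connections, JIT, etc.
        op()
    start = time.perf_counter()
    for _ in range(iters):
        op()
    elapsed = time.perf_counter() - start
    avg_seconds = elapsed / iters
    return nbytes / avg_seconds / 1e9  # average throughput in GB/s

# Stand-in workload; on real hardware `op` would launch a collective.
gbps = benchmark(lambda: sum(range(10000)), nbytes=64 * 1024 * 1024)
print(f"{gbps:.1f} GB/s")
```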
<h2>Easy adaptation of AI models</h2>
<p>RCCLX is integrated with the Torchcomms API as a custom <a href="https://github.com/meta-pytorch/torchcomms/tree/main/comms/torchcomms/rcclx">backend</a>. We aim for this backend to have feature parity with our NCCLX backend (for NVIDIA platforms). Torchcomms gives users a single communication API across platforms: users do not need to change the APIs they’re familiar with to port their applications to AMD or other platforms, even when using the novel features provided by CTran. </p>
<p><img class="alignnone size-full wp-image-23626" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-7.jpg" alt="" width="1070" height="348" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-7.jpg 1070w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-7.jpg?resize=916,298 916w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-7.jpg?resize=768,250 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-7.jpg?resize=1024,333 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-7.jpg?resize=96,31 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-7.jpg?resize=192,62 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<figure id="attachment_23625" aria-describedby="caption-attachment-23625" class="wp-caption alignnone c6"><img class="size-full wp-image-23625" src="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-8.jpg" alt="" width="1072" height="342" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-8.jpg 1072w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-8.jpg?resize=916,292 916w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-8.jpg?resize=768,245 768w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-8.jpg?resize=1024,327 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-8.jpg?resize=96,31 96w, https://engineering.fb.com/wp-content/uploads/2026/02/rrcclx-innovating-gpu-communications-amd-platforms-meta-image-8.jpg?resize=192,61 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23625" class="wp-caption-text">A comparison of application code written for NCCLX (top) versus RCCLX (bottom).</figcaption></figure><h2>RCCLX Quick Start Guide</h2>
<p>Install Torchcomms with RCCLX backend by following <a href="https://github.com/meta-pytorch/torchcomms/tree/main" target="_blank" rel="noopener">the installation instructions in the Torchcomms repo</a>.</p>
<pre class="line-numbers"><code class="language-none">import torch
import torchcomms

# Eagerly initialize a communicator using the MASTER_PORT/MASTER_ADDR/RANK/WORLD_SIZE
# environment variables provided by torchrun.
# This communicator is bound to a single device.
comm = torchcomms.new_comm("rcclx", torch.device("hip"), name="my_comm")
print(f"I am rank {comm.get_rank()} of {comm.get_size()}!")
t = torch.full((10, 20), fill_value=comm.get_rank(), dtype=torch.float)
# Run an allreduce on the current stream.
comm.allreduce(t, torchcomms.ReduceOp.SUM, async_op=False)
</code></pre>
<h2>Acknowledgements</h2>
<p><em>We extend our gratitude to the AMD RCCL team for their ongoing collaboration. We also want to recognize the many current and former Meta employees whose contributions were vital in developing torchcomms and torchcomms-backends for production-scale training and inference. In particular, we would like to give special thanks to Dingming Wu, Qiye Tan, Cen Zhao, Yan Cui, Zhe Qu, Ahmed Khan, Ajit Mathews, CQ Tang, Srinivas Vaidyanathan, Harish Kumar Chandrappa, Peng Chen, Shashi Gandham, and Omar Baldonado.</em></p>]]></description>
      <link>https://engineering.fb.com/2026/02/24/data-center-engineering/rrcclx-innovating-gpu-communications-amd-platforms-meta/</link>
      <guid>https://engineering.fb.com/2026/02/24/data-center-engineering/rrcclx-innovating-gpu-communications-amd-platforms-meta/</guid>
      <pubDate>Tue, 24 Feb 2026 22:30:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It]]></title>
      <description><![CDATA[<h2>WHAT IT IS</h2>
<p>The rise of agentic software development means code is being written, reviewed, and shipped faster than ever before across the entire industry. It also means that testing frameworks need to evolve for this rapidly changing landscape. Faster development demands faster testing that can catch bugs as they land in a codebase, without requiring regular updates and maintenance.</p>
<p>Just-in‑Time Tests (JiTTests) are a fundamentally novel approach to testing where tests are automatically generated by large language models (LLMs) on the fly to catch bugs – even ones that traditional testing might not catch – just in time, before the code lands in production.</p>
<p><a href="https://arxiv.org/pdf/2601.22832" target="_blank" rel="noopener">A Catching JiTTest</a> focuses specifically on finding regressions introduced by a code change. This type of testing reimagines <a href="https://arxiv.org/abs/2109.04086">decades</a> of <a href="https://web.eecs.umich.edu/~weimerw/2022-481F/readings/mutation-testing.pdf" target="_blank" rel="noopener">software testing theory and practice</a>. While traditional testing relies on static test suites, manual authoring, and ongoing maintenance, Catching JiTTests require no test maintenance and no test code review, meaning engineers can focus their expertise on real bugs, not false positives. Catching JiTTests use sophisticated techniques to maximize test signal value and minimize false positive drag, targeting test signals where they matter most: on serious failures.</p>
<h2>HOW TESTING TRADITIONALLY WORKS</h2>
<p>Under the traditional paradigm, tests are manually built as new code lands in a codebase and continually executed, requiring regular updates and maintenance. The engineers building these tests face the challenge of checking the behavior not only of the current code but of all possible future changes. Inherent uncertainty about future changes results in tests that either catch nothing or, when they do fire, raise false positives. Agentic development dramatically increases the pace of code change, straining test development and scaling the cost of false positives and test maintenance to the breaking point. </p>
<h2>HOW CATCHING JITTESTS WORK</h2>
<p>Broadly, JiTTests are bespoke tests, tailored to a specific code change, that give engineers simple, actionable feedback about unexpected behavior changes without the need to read or write test code. LLMs can generate JiTTests automatically the moment a pull request is submitted. And since the JiTTest itself is LLM-generated, it can often infer the plausible intention of a code change and simulate possible faults that may result from it.</p>
<p>With an understanding of intent, Catching JiTTests can significantly drive down instances of false positives.</p>
<p>Here are the key steps of the Catching JiTTest process:</p>
<ol><li class="c1" aria-level="1">New code lands in the codebase.</li>
<li class="c1" aria-level="1">The system infers the intention of the code change.</li>
<li class="c1" aria-level="1">It creates <a href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/" target="_blank" rel="noopener">mutants</a> (code versions with faults deliberately inserted) to simulate what could go wrong.</li>
<li class="c1" aria-level="1">It generates and runs tests to catch those faults.</li>
<li class="c1" aria-level="1">Ensembles of rule-based and LLM-based assessors focus the signal on true positive failures.</li>
<li class="c1" aria-level="1">Engineers receive clear, relevant reports about unexpected changes right when it matters most.</li>
</ol><h2>WHY IT MATTERS</h2>
<p>Catching JiTTests are designed for the world of AI-powered agentic software development and accelerate testing by focusing on serious unexpected bugs. With them, engineers no longer have to spend time writing, reviewing, and maintaining complex test code. Catching JiTTests, by design, kill many of the issues with traditional testing in one stroke:</p>
<ul><li class="c1" aria-level="1">They are generated on-the-fly for each code change and do not reside in the codebase, eliminating ongoing maintenance costs and shifting effort from humans to machines.</li>
<li class="c1" aria-level="1">They are tailored to each change, making them more robust and less prone to breaking due to intended updates.</li>
<li class="c1" aria-level="1">They automatically adapt as the code changes.</li>
<li class="c1" aria-level="1">They only require human review when a bug is actually caught.</li>
</ul><p>This all amounts to an important shift in testing infrastructure where the focus moves from generic code quality to whether a test actually finds faults in a specific change without raising a false positive. It helps improve testing overall while also allowing it to keep up with the pace of agentic coding.</p>
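<p>The six-step process above can be sketched as a mutate-then-test loop. The snippet below is a hypothetical illustration with the LLM stages stubbed out; all function names are invented for this example and are not Meta’s implementation:</p>

```python
# Hypothetical sketch of the Catching JiTTest loop described above.
# The LLM calls are stubbed with hand-written stand-ins; all names here
# are illustrative.

def make_mutants(behavior):
    """Stand-in for LLM mutant generation: deliberately faulty variants."""
    mutant = dict(behavior)
    mutant["discount"] = lambda price: price * 1.1  # inverted-discount fault
    return [mutant]

def generate_tests():
    """Stand-in for LLM test generation from the change's inferred intent."""
    return [lambda impl: impl["discount"](100.0) < 100.0]

def catching_jittest(behavior):
    """Keep only tests that pass on the real change but fail on a mutant."""
    signals = []
    for test in generate_tests():
        for mutant in make_mutants(behavior):
            if test(behavior) and not test(mutant):
                signals.append(test)  # the test demonstrably catches faults
    return signals

change = {"discount": lambda price: price * 0.9}  # the code under review
print(len(catching_jittest(change)))  # 1 surviving, fault-catching test
```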
<h2>READ THE PAPER</h2>
<p><a href="https://arxiv.org/pdf/2601.22832">Just-in-Time Catching Test Generation at Meta</a></p>]]></description>
      <link>https://engineering.fb.com/2026/02/11/developer-tools/the-death-of-traditional-testing-agentic-development-jit-testing-revival/</link>
      <guid>https://engineering.fb.com/2026/02/11/developer-tools/the-death-of-traditional-testing-agentic-development-jit-testing-revival/</guid>
      <pubDate>Wed, 11 Feb 2026 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing details of the role backend aggregation (BAG) plays in building Meta’s gigawatt-scale AI clusters like <a href="https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/" target="_blank" rel="noopener">Prometheus</a>.</li>
<li class="c1" aria-level="1">BAG allows us to seamlessly connect thousands of GPUs across multiple data centers and regions.</li>
<li class="c1" aria-level="1">Our BAG implementation is connecting two different network fabrics – <a href="https://engineering.fb.com/2025/10/20/data-center-engineering/disaggregated-scheduled-fabric-scaling-metas-ai-journey/" target="_blank" rel="noopener">Disaggregated Scheduled Fabric (DSF)</a> and <a href="https://engineering.fb.com/2025/10/13/data-infrastructure/ocp-summit-2025-the-open-future-of-networking-hardware-for-ai/#nsf" target="_blank" rel="noopener">Non-Scheduled Fabric (NSF)</a>.</li>
</ul><p>Once it’s complete, our AI cluster, <a href="https://www.threads.com/@zuck/post/DMF6uUgx9f9/video-were-actually-building-several-multi-gw-clusters-were-calling-the-first-one-prom?hl=en">Prometheus</a>, will deliver 1 gigawatt of capacity to enhance and enable new and existing AI experiences across Meta products. Prometheus’ infrastructure will span several data center buildings in a single larger region, interconnecting tens of thousands of GPUs.</p>
<p>A key piece of scaling and connecting this infrastructure is backend aggregation (BAG), which we use to seamlessly connect GPUs and data centers with robust, high-capacity networking. By leveraging modular hardware, advanced routing, and resilient topologies, BAG ensures both performance and reliability at unprecedented scale.</p>
<p>As our AI clusters continue to grow, we expect BAG to play an important role in meeting future demands and driving innovation across Meta’s global network.</p>
<h2>What Is Backend Aggregation?</h2>
<p>BAG is a centralized Ethernet-based super spine network layer that primarily functions to interconnect multiple spine layer fabrics across various data centers and regions within large clusters. Within Prometheus, for example, the BAG layer serves as the aggregation point between regional networks and Meta’s backbone, enabling the creation of mega AI clusters. BAG is designed to support immense bandwidth needs, with inter-BAG capacities reaching the petabit range (e.g., 16-48 Pbps per region pair).</p>
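<p>To put the petabit range in context, here is a back-of-envelope link count (assuming, hypothetically, 800 Gbps per optical link; the post does not specify link speeds):</p>

```python
# Back-of-envelope link count for the inter-BAG capacities quoted above,
# assuming a hypothetical 800 Gbps per optical link (not from the post).

def links_needed(capacity_pbps, link_gbps=800):
    return int(capacity_pbps * 1e6 / link_gbps)  # 1 Pbps = 1e6 Gbps

print(links_needed(16))  # 20000 links for 16 Pbps between a region pair
print(links_needed(48))  # 60000 links for 48 Pbps between a region pair
```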
<figure id="attachment_23636" aria-describedby="caption-attachment-23636" class="wp-caption alignnone c2"><img class="size-full wp-image-23636" src="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png" alt="" width="1999" height="970" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png?resize=916,444 916w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png?resize=768,373 768w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png?resize=1024,497 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png?resize=1536,745 1536w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png?resize=96,47 96w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-1.png?resize=192,93 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23636" class="wp-caption-text">We use backend aggregation (BAG) to interconnect data center regions to share compute and other resources into large clusters.</figcaption></figure><h2>How BAG Is Helping Us Build Gigawatt-Scale AI Clusters </h2>
<p>To address the challenge of interconnecting tens of thousands of GPUs, we’re deploying distributed BAG layers regionally.</p>
<h3>How We Interconnect BAG Layers</h3>
<p>BAG layers are strategically distributed across regions to serve subsets of L2 fabrics, adhering to distance, buffer, and latency constraints. Inter-BAG connectivity utilizes either a planar (direct match) or spread connection topology, chosen based on site size and fiber availability.</p>
<ul><li class="c1" aria-level="1"><strong>Planar topology</strong> connects BAG switches one-to-one between regions along the same plane, offering simplified management but concentrating potential failure domains.</li>
<li class="c1" aria-level="1"><strong>Spread connection topology</strong> distributes links across multiple BAG switches/planes, enhancing path diversity and resilience.</li>
</ul><figure id="attachment_23638" aria-describedby="caption-attachment-23638" class="wp-caption alignnone c3"><img class="size-full wp-image-23638" src="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-2.png" alt="" width="1480" height="1180" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-2.png 1480w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-2.png?resize=916,730 916w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-2.png?resize=768,612 768w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-2.png?resize=1024,816 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-2.png?resize=96,77 96w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-2.png?resize=192,153 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23638" class="wp-caption-text">An example of an inter-BAG network topology.</figcaption></figure><h3>How a BAG Layer Connects to L2 Fabrics</h3>
<p>So far, we’ve discussed how the BAG layers are interconnected. Now let’s see how a BAG layer connects downstream to L2 fabrics.</p>
<p>We’ve used two main fabric technologies, <a href="https://engineering.fb.com/2025/10/20/data-center-engineering/disaggregated-scheduled-fabric-scaling-metas-ai-journey/" target="_blank" rel="noopener">Disaggregated Scheduled Fabric (DSF)</a> and <a href="https://engineering.fb.com/2025/10/13/data-infrastructure/ocp-summit-2025-the-open-future-of-networking-hardware-for-ai/#nsf" target="_blank" rel="noopener">Non-Scheduled Fabric (NSF)</a>, to build L2 networks.</p>
<p>Below is an example of DSF L2 zones across five data center buildings connected to the BAG layer via a special backend edge pod in each building. </p>
<figure id="attachment_23637" aria-describedby="caption-attachment-23637" class="wp-caption alignnone c4"><img class="size-full wp-image-23637" src="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png" alt="" width="1842" height="952" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png 1842w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png?resize=916,473 916w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png?resize=768,397 768w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png?resize=1024,529 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png?resize=1536,794 1536w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png?resize=96,50 96w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-3.png?resize=192,99 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23637" class="wp-caption-text">A BAG inter-building connection for DSF fabric across five data centers.</figcaption></figure><p>Below is an example of NSF L2 connected to BAG planes. Each BAG plane connects to matching Spine Training Switches (STSWs) from all spine planes. Effective oversubscription is 4.98:1.  </p>
<figure id="attachment_23635" aria-describedby="caption-attachment-23635" class="wp-caption alignnone c5"><img class="size-full wp-image-23635" src="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png" alt="" width="1600" height="870" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png 1600w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png?resize=916,498 916w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png?resize=768,418 768w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png?resize=1024,557 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png?resize=1536,835 1536w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Prometheus-Backend-Aggregation-BAG-image-4.png?resize=192,104 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23635" class="wp-caption-text">A BAG inter-building connection for NSF fabric.</figcaption></figure><p>Careful management of oversubscription ratios assists in balancing scale and performance. Typical oversubscription from L2 to BAG is around 4.5:1, while BAG-to-BAG oversubscription varies based on regional requirements and link capacity.</p>
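<p>To make the oversubscription ratios above concrete, here is a small illustrative calculation. The port counts are invented for the example and are not Meta’s actual configuration:</p>

```python
# Illustrative only: oversubscription at a network layer is downstream
# (host-facing) capacity divided by upstream (uplink) capacity. The article
# cites roughly 4.5:1 from L2 to BAG; the port counts below are made up.

def oversubscription(downstream_gbps: float, upstream_gbps: float) -> float:
    """Ratio of offered downstream traffic to available uplink capacity."""
    return downstream_gbps / upstream_gbps

# Hypothetical switch: 72 x 800G ports down toward GPUs, 16 x 800G up to BAG.
ratio = oversubscription(72 * 800, 16 * 800)
print(f"{ratio:.1f}:1")  # -> 4.5:1
```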
<h3>Hardware and Routing </h3>
<p>Meta’s implementation of BAG uses a modular chassis equipped with Jericho3 (J3) ASIC line cards, each providing up to 432x800G ports for high-capacity, scalable, and resilient interconnect. The central hub BAG employs a larger chassis to accommodate numerous spokes and long-distance links with varied cable lengths for optimized buffer utilization.</p>
<p>Routing within BAG uses eBGP with link bandwidth attributes, enabling Unequal Cost Multipath (UCMP) for efficient load balancing and robust failure handling. BAG-to-BAG connections are secured with MACsec, aligning with network security requirements.</p>
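<p>The idea behind bandwidth-weighted UCMP can be sketched in a few lines. This is a simplified model, not Meta’s implementation: next hops are weighted by advertised link bandwidth (as with the BGP link-bandwidth extended community), and a hash of the flow key keeps all packets of one flow on one path:</p>

```python
# Minimal UCMP sketch: a next hop with more advertised bandwidth attracts
# proportionally more flows. Hop names and capacities are hypothetical.

import hashlib
from bisect import bisect_right
from itertools import accumulate

def ucmp_pick(next_hops: dict[str, int], flow_key: str) -> str:
    """next_hops maps hop name -> advertised link bandwidth (e.g., Gbps)."""
    hops = list(next_hops)
    cumulative = list(accumulate(next_hops[h] for h in hops))
    # Deterministic per-flow hash so a flow never reorders across paths.
    h = int.from_bytes(hashlib.sha256(flow_key.encode()).digest()[:8], "big")
    point = h % cumulative[-1]
    return hops[bisect_right(cumulative, point)]

# Hypothetical: a full 800G link and a degraded 400G link toward two planes.
paths = {"bag-plane-1": 800, "bag-plane-2": 400}
counts = {p: 0 for p in paths}
for i in range(30000):
    counts[ucmp_pick(paths, f"flow-{i}")] += 1
# bag-plane-1 should carry roughly two-thirds of the flows.
```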
<h3>Designing the Network for Resilience</h3>
<p>The network design meticulously details port striping, IP addressing schemes, and comprehensive failure domain analysis to ensure high availability and minimize the impact of failures. Failure modes are analyzed at the BAG, data hall, and power distribution levels. We also employ various strategies to mitigate blackholing risks, including draining affected BAG planes and conditional route aggregation.</p>
<h3>Considerations for Long Cable Distances</h3>
<p>An important advantage of BAG’s distributed architecture is that it keeps the distance from the L2 edge small, which matters for shallow-buffer NSF switches. Longer BAG-to-BAG cable distances dictate that we use deep-buffer switches for the BAG role, providing a large headroom buffer to support lossless congestion control protocols like PFC.</p>
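<p>A rough bandwidth-delay calculation shows why long links demand deep buffers: a lossless protocol like PFC must absorb at least the data already in flight on the wire after a pause frame is signaled. The link speeds and distances below are assumptions for the example, not Meta’s figures:</p>

```python
# Back-of-envelope PFC headroom: bytes in flight during one round trip over
# fiber (~5 microseconds per km each way). Illustrative numbers only.

def pfc_headroom_bytes(link_gbps: float, cable_km: float) -> float:
    """Minimum buffer to absorb in-flight data after a PFC pause."""
    rtt_s = 2 * cable_km * 5e-6          # propagation delay, both directions
    return link_gbps * 1e9 / 8 * rtt_s   # bytes on the wire during the RTT

short = pfc_headroom_bytes(800, 0.1)   # 100 m inside a building
far = pfc_headroom_bytes(800, 10)      # 10 km between buildings
print(f"{short / 1e6:.1f} MB vs {far / 1e6:.1f} MB per 800G port")
```

A hundredfold increase in cable length means a hundredfold increase in required headroom per port, which shallow-buffer ASICs cannot supply.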
<h2>Building Prometheus and Beyond</h2>
<p>As a technology, BAG is playing an important role in Meta’s next generation of AI infrastructure. By centralizing the interconnection of regional networks, BAG helps enable the gigawatt-scale Prometheus cluster, ensuring seamless, high-capacity networking across tens of thousands of GPUs. This thoughtful design, leveraging modular hardware and resilient topologies, positions BAG to not only meet the demands of Prometheus but also to drive the future innovation and scalability of Meta’s global AI network for years to come.</p>]]></description>
      <link>https://engineering.fb.com/2026/02/09/data-center-engineering/building-prometheus-how-backend-aggregation-enables-gigawatt-scale-ai-clusters/</link>
      <guid>https://engineering.fb.com/2026/02/09/data-center-engineering/building-prometheus-how-backend-aggregation-enables-gigawatt-scale-ai-clusters/</guid>
      <pubDate>Mon, 09 Feb 2026 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[No Display? No Problem: Cross-Device Passkey Authentication for XR Devices]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing a novel approach to enabling cross-device passkey authentication for devices with inaccessible displays (like XR devices).</li>
<li class="c1" aria-level="1">Our approach bypasses the use of QR codes and enables cross-device authentication without the need for an on-device display, while still complying with all trust and proximity requirements.</li>
<li class="c1" aria-level="1">This approach builds on work done by the FIDO Alliance and we hope it will open the door to bring secure, passwordless authentication to a whole new ecosystem of devices and platforms.</li>
</ul><p>Passkeys are a significant leap forward in authentication, offering a phishing-resistant, cryptographically secure alternative to traditional passwords. The standard cross-device passkey flow, where someone registers or authenticates on a desktop device by approving the action on a nearby mobile device, typically begins with a QR code scanned by the phone’s camera. But how can we facilitate this flow for XR devices with a head-mounted display or no screen at all, or for other devices with an <em>inaccessible display</em>, like smart home hubs and industrial sensors?</p>
<p>We’ve taken a novel approach to adapting the WebAuthn passkey flow and <a href="https://fidoalliance.org/specs/fido-v2.0-id-20180227/fido-client-to-authenticator-protocol-v2.0-id-20180227.html">FIDO’s CTAP hybrid protocol</a> for this unique class of devices that either lack a screen entirely or whose screen is not easily accessible to another device’s camera. Our implementation is now broadly available on Meta Quest devices powered by Meta Horizon OS. We hope this approach can deliver robust security built on the strength of existing passkey frameworks, without sacrificing usability, for users of a variety of other screenless IoT devices, consumer electronics, and industrial hardware.</p>
<p><img class="alignnone size-full wp-image-23612" src="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-Quest-cross-device-passkey.gif" alt="" width="960" height="664" /></p>
<h2>The Challenge: No Screen, No QR Code</h2>
<p>The <a href="http://fidoalliance.org/specs/fido-v2.2-ps-20250714/fido-client-to-authenticator-protocol-v2.2-ps-20250714.html#sctn-hybrid" target="_blank" rel="noopener">standard cross-device flow</a> relies on two primary mechanisms:</p>
<ol><li class="c1" aria-level="1"><strong>QR code scanning:</strong> The relying party displays a QR code on the desktop/inaccessible device, which the mobile authenticator scans to establish a secure link.</li>
<li class="c1" aria-level="1"><strong>Bluetooth/NFC proximity:</strong> The devices use local communication protocols to discover each other and initiate the secure exchange.</li>
</ol><p>For devices with no display, the QR code method is impossible. Proximity-based discovery is feasible, but initiating the user verification step and confirming the intent without any on-device visual feedback can introduce security and usability risks. People need clear assurance that they are approving the correct transaction on the correct device.</p>
<h2>Our Solution: Using a Companion App for Secure Message Transport</h2>
<p>Scanning a QR code sends the authenticator device a command to initiate a hybrid (cross-device) login flow, along with a nonce that identifies the unauthenticated device client. But if a user has a companion application – like the Meta Horizon app – that uses the same account as the device, we can use that application to pass this same request to the authenticator OS and execute it using general link/intent execution.</p>
<p>We made the flow easy to navigate by using in-app notifications to show users when a login request has been initiated, take them directly into the application, and immediately execute the login request.</p>
<p>For simplicity, we opted to begin the hybrid flow as soon as the application is opened, since the user must already have taken some action (clicking the notification or opening the app) to trigger it, and hybrid implementations on iOS and Android include an additional user verification step.</p>
<p>Here’s how this plays out on a Meta Quest with the Meta Horizon mobile app:</p>
<p><img class="alignnone size-full wp-image-23601" src="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png" alt="" width="1999" height="726" srcset="https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png 1999w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png?resize=916,333 916w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png?resize=768,279 768w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png?resize=1024,372 1024w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png?resize=1536,558 1536w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2026/02/Meta-XR-passkeys-flow.png?resize=192,70 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h3>1. The Hybrid Flow Message Is Generated</h3>
<p>When a passkey login is initiated on the Meta Quest, the headset’s browser locally constructs the same payload that would have been embedded in a QR Code – including a fresh ECDH public key, a session-specific secret, and routing information used later in the handshake. Instead of rendering this information into an image (QR code), the browser encodes it into a FIDO URL (the standard mechanism defined for hybrid transport) that instructs the mobile device to begin the passkey authentication flow.</p>
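<p>Conceptually, the browser is packing the same fields a QR code would carry into a URL. The sketch below is a simplified illustration of that idea only: the real CTAP 2.2 hybrid transport encodes a CBOR map into a digit-based payload, so the base64 encoding and field names here are stand-ins, not the spec format:</p>

```python
# Illustrative FIDO URL construction: bundle the handshake fields that would
# normally live in the QR code. NOT the actual CTAP hybrid encoding.

import base64
import json
import secrets

def make_fido_url(ecdh_public_key: bytes, routing_id: bytes) -> str:
    payload = {
        # Fresh ECDH public key for the tunnel handshake.
        "pk": base64.urlsafe_b64encode(ecdh_public_key).decode(),
        # Session-specific secret, generated per login attempt.
        "secret": base64.urlsafe_b64encode(secrets.token_bytes(16)).decode(),
        # Routing information used later in the handshake.
        "routing": base64.urlsafe_b64encode(routing_id).decode(),
    }
    blob = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    return f"FIDO:/{blob}"

url = make_fido_url(b"\x04" + secrets.token_bytes(64), b"tunnel-hint")
```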
<h3>2. The Message Is Sent to the Companion App</h3>
<p>After the FIDO URL is generated, the headset requires a secure and deterministic method for transferring it to the user’s phone. Because the device cannot present a QR code, the system leverages the Meta Horizon app’s authenticated push channel to deliver the FIDO URL directly to the mobile device. When the user selects the passkey option in the login dialog, the headset encodes the FIDO URL as structured data within a GraphQL-based push notification. </p>
<p>The Meta Horizon app, signed in with the same account as the headset, receives this payload and validates the delivery context to ensure it is routed to the correct user. </p>
<h3>3. The Application Sends a Notification of the Login Request</h3>
<p>After the FIDO URL is delivered to the mobile device, the platform’s push service surfaces it as a standard iOS or Android notification indicating that a login request is pending. When the user taps the notification, the operating system routes the deep link to the Meta Horizon app. The app then opens the FIDO URL using the system URL launcher and invokes the operating system passkey interface.</p>
<p>For users who have notifications disabled, launching the Meta Horizon app directly will also trigger a query to the backend for any pending passkey requests associated with the user’s account. If a valid request exists (requests expire after five minutes), the app automatically initiates the same passkey flow by opening the FIDO URL.</p>
<p>Once the FIDO URL is opened, the mobile device begins the hybrid transport sequence, including broadcasting the BLE advertisement, establishing the encrypted tunnel, and producing the passkey assertion. In this flow, the system notification and the app launch path both serve as user consent surfaces and entry points into the standard hybrid transport workflow.</p>
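<p>The five-minute validity window for pending requests amounts to a simple timestamp check. This is a sketch with illustrative names, not the actual backend logic:</p>

```python
# Sketch of the expiry check: only act on pending passkey requests created
# within the last five minutes. Function and constant names are illustrative.

from datetime import datetime, timedelta, timezone

REQUEST_TTL = timedelta(minutes=5)

def pending_request_is_valid(created_at: datetime, now: datetime) -> bool:
    """True if the pending login request has not yet expired."""
    return now - created_at <= REQUEST_TTL

now = datetime.now(timezone.utc)
fresh = pending_request_is_valid(now - timedelta(minutes=3), now)   # True
stale = pending_request_is_valid(now - timedelta(minutes=6), now)   # False
```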
<h3>4. The App Executes the Hybrid Command</h3>
<p>Once the user approves the action on their mobile device, the secure channel is established as per WebAuthn standards. The main difference is the challenge exchange timing:</p>
<ol><li class="c1" aria-level="1">The inaccessible device generates the standard WebAuthn challenge and waits.</li>
<li class="c1" aria-level="1">The mobile authenticator initiates the secure BLE/NFC connection.</li>
<li class="c1" aria-level="1">The challenge is transmitted over this secure channel.</li>
<li class="c1" aria-level="1">Upon UV success, the mobile device uses the relevant key material to generate the AuthenticatorAssertionResponse or AuthenticatorAttestationResponse.</li>
<li class="c1" aria-level="1">The response is sent back to the inaccessible device.</li>
</ol><p>The inaccessible device then acts as the conduit, forwarding the response to the relying party server to complete the transaction, exactly as a standard display-equipped device would.</p>
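<p>The ordering in steps 1-5 can be modeled in a toy script. The real flow uses ECDH key agreement and genuine WebAuthn assertions; here, stdlib HMAC-based key derivation and an HMAC tag stand in for that cryptography so the challenge-exchange timing is easy to follow:</p>

```python
# Toy model of the challenge exchange. hmac/hashlib stand in for the real
# ECDH + WebAuthn crypto; this illustrates ordering, not the actual protocol.

import hashlib
import hmac
import os

shared_secret = os.urandom(32)   # stand-in for the BLE/ECDH handshake output

def derive_key(secret: bytes, info: bytes) -> bytes:
    """HKDF-style derivation of a purpose-specific session key."""
    return hmac.new(secret, info, hashlib.sha256).digest()

# 1. The inaccessible device generates the challenge and waits.
challenge = os.urandom(32)

# 2-3. The mobile authenticator connects; the challenge crosses the channel.
session_key = derive_key(shared_secret, b"session")

# 4. After user verification succeeds, the phone produces the response.
response = hmac.new(session_key, challenge, hashlib.sha256).digest()

# 5. The inaccessible device forwards it; the relying party verifies it
#    against the same derived key.
expected = hmac.new(derive_key(shared_secret, b"session"), challenge,
                    hashlib.sha256).digest()
assert hmac.compare_digest(response, expected)
```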
<h2>Impact and Future Direction</h2>
<p>This novel implementation successfully bypasses the need for an on-device display in the cross-device flow and still complies with the proximity and other trust challenges that exist today for cross-device passkey login. We hope that our solution paves the way for secure, passwordless authentication across a wider range of different platforms and ecosystems, moving passkeys beyond just mobile and desktop environments and into the burgeoning world of wearable and IoT devices. </p>
<p>We are proud to build on top of, and collaborate on, the excellent work already done in this area by our peers in the FIDO Alliance and the mobile operating systems committed to building a robust and interoperable ecosystem for secure and easy login.</p>]]></description>
      <link>https://engineering.fb.com/2026/02/04/security/cross-device-passkey-authentication-for-xr-devices-meta-quest/</link>
      <guid>https://engineering.fb.com/2026/02/04/security/cross-device-passkey-authentication-for-xr-devices-meta-quest/</guid>
      <pubDate>Wed, 04 Feb 2026 23:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Rust at Scale: An Added Layer of Security for WhatsApp]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">WhatsApp has adopted and rolled out a new layer of security for users – built with Rust – as part of its effort to harden defenses against malware threats.</li>
<li class="c1" aria-level="1">WhatsApp’s experience creating and distributing our media consistency library in Rust to billions of devices and browsers proves Rust is production ready at a global scale.</li>
</ul><h2>Our Media Handling Strategy</h2>
<p>WhatsApp provides default end-to-end encryption for over 3 billion people to message securely every day. Online security is an adversarial space, and to continue ensuring users can keep messaging securely, we’re constantly adapting and evolving our strategy against cybersecurity threats – all while supporting the WhatsApp infrastructure to help people connect. </p>
<p>For example, WhatsApp, like many other applications, allows users to share media and other types of documents. WhatsApp helps protect users by warning about dangerous attachments like APKs, yet rare and sophisticated malware could be hidden within a seemingly benign file like an image or video. These maliciously crafted files might target unpatched vulnerabilities in the operating system, libraries distributed by the operating system, or the application itself.</p>
<p>To help protect against such potential threats, WhatsApp is increasingly using the Rust programming language, including in our media sharing functionality. Rust is a memory safe language offering numerous security benefits. We believe that this is the largest global rollout of any library written in Rust.</p>
<p>To help explain why and how we rolled this out, we should first look back at a key OS-level vulnerability that sent an important signal to WhatsApp around hardening media-sharing defenses.</p>
<h2>2015 Android Vulnerability: A Wake-up Call for Media File Protections</h2>
<p>In 2015, Android devices, and the applications that ran on them, became vulnerable to the “<a href="https://www.cisa.gov/news-events/alerts/2015/07/28/stagefright-android-vulnerability" target="_blank" rel="noopener">Stagefright” vulnerability</a>. The bug lay in the processing of media files by operating system-provided libraries, so WhatsApp and other applications could not patch the underlying vulnerability. Because it could often take months for people to update to the latest version of their software, we set out to find solutions that would keep WhatsApp users safe, even in the event of an operating system vulnerability. </p>
<p>At that time, we realized that a cross-platform C++ library already developed by WhatsApp to send and consistently format MP4 files (called “wamedia”) could be modified to detect files which do not adhere to the MP4 standard and might trigger bugs in a vulnerable OS library on the receiver side – hence putting a target’s security at risk. We rolled out this check and were able to protect WhatsApp users from the Stagefright vulnerability much more rapidly than by depending on users to update the OS itself.</p>
<p>But because media checks run automatically on download and process untrusted inputs, we identified early on that wamedia was a prime candidate for using a memory safe language. </p>
<p><img class="alignnone size-full wp-image-23552" src="https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg" alt="" width="2100" height="1250" srcset="https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg 2100w, https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg?resize=916,545 916w, https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg?resize=768,457 768w, https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg?resize=1024,610 1024w, https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg?resize=1536,914 1536w, https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg?resize=2048,1219 2048w, https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg?resize=96,57 96w, https://engineering.fb.com/wp-content/uploads/2026/01/Rust-at-scale_inline.jpg?resize=192,114 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>Our Solution: Rust at Scale</h2>
<p>Rather than an incremental rewrite, we developed the Rust version of wamedia in parallel with the original C++ version. We used differential fuzzing and extensive integration and unit tests to ensure compatibility between the two implementations.</p>
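<p>A differential-testing harness in this spirit is straightforward to sketch: feed the same inputs to both implementations and flag any divergence. The two parser functions below are tiny stand-ins for the C++ and Rust wamedia front ends, not their real logic:</p>

```python
# Minimal differential fuzzing sketch: both parsers must agree on every
# input. parse_reference/parse_rewrite are stand-ins for the two wamedia
# implementations being compared.

import os

def parse_reference(data: bytes) -> tuple[bool, int]:
    """Stand-in for the original parser: (is_valid, bytes_consumed)."""
    return (data[:4] == b"ftyp", len(data))

def parse_rewrite(data: bytes) -> tuple[bool, int]:
    """Stand-in for the rewritten parser; must match the reference exactly."""
    return (data.startswith(b"ftyp"), len(data))

def differential_fuzz(iterations: int = 1000) -> None:
    for _ in range(iterations):
        # Mix valid-looking and random inputs to exercise both branches.
        prefix = b"ftyp" if os.urandom(1)[0] < 128 else b""
        data = prefix + os.urandom(8)
        a, b = parse_reference(data), parse_rewrite(data)
        assert a == b, f"divergence on input {data!r}: {a} != {b}"

differential_fuzz()
```

In practice the inputs would come from a coverage-guided fuzzer rather than plain random bytes, and the comparison would cover the full normalized output, not just a validity bit.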
<p>Two major hurdles were the initial binary size increase from bringing in the Rust standard library, and the build system support required for the diverse platforms WhatsApp supports. WhatsApp made a long-term bet to build that support. In the end, we replaced 160,000 lines of C++ (excluding tests) with 90,000 lines of Rust (including tests). The Rust version showed performance and runtime memory usage advantages over the C++ implementation. Given this success, Rust was fully rolled out to all WhatsApp users and many platforms: Android, iOS, Mac, Web, Wearables, and more. With this positive evidence in hand, memory safe languages will play an ever-increasing part in WhatsApp’s overall approach to application and user security.</p>
<p>Over time, we’ve added more checks for non-conformant structures within certain file types to help protect downstream libraries from parser differential exploit attempts. Additionally, we check higher risk file types, even if structurally conformant, for risk indicators. For instance, PDFs are often a vehicle for malware, and more specifically, the presence of embedded files and scripting elements within a PDF further raise risks. We also detect when one file type masquerades as another, through a spoofed extension or MIME type. Finally, we uniformly flag known dangerous file types, such as executables or applications, for special handling in the application UX. Altogether, we call this ensemble of checks “Kaleidoscope.” This system protects people on WhatsApp from potentially malicious unofficial clients and attachments. Although format checks will not stop every attack, this layer of defense helps mitigate many of them.</p>
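<p>One of those checks, detecting when a file type masquerades as another, can be illustrated by comparing the claimed extension against magic bytes at the start of the file. The signature table here is a small illustrative subset, not Kaleidoscope’s actual rules:</p>

```python
# Sketch of file-type masquerade detection: sniff the real type from magic
# bytes and compare it with the extension the sender claims. Illustrative
# subset of signatures only.

MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",   # also the container for APK/DOCX files
}

def sniff(data: bytes):
    """Return the sniffed type name, or None if no signature matches."""
    for magic, kind in MAGIC.items():
        if data.startswith(magic):
            return kind
    return None

def is_masquerading(filename: str, data: bytes) -> bool:
    ext = filename.rsplit(".", 1)[-1].lower()
    sniffed = sniff(data)
    return sniffed is not None and sniffed != ext

assert is_masquerading("photo.jpg", b"%PDF-1.7 rest-of-file")   # PDF as JPEG
assert not is_masquerading("report.pdf", b"%PDF-1.7 rest-of-file")
```

A production check would also handle equivalent extensions (jpg/jpeg), MIME-type claims, and formats whose signatures are not at offset zero.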
<p>Each month, these libraries are distributed to billions of phones, laptops, desktops, watches, and browsers running on multiple operating systems for people on WhatsApp, Messenger, and Instagram. To our knowledge, this is the largest deployment of Rust code to such a diverse set of end-user platforms and products. Our experience speaks to the production-readiness and unique value proposition of Rust on the client side.</p>
<h2>How Rust Fits Into WhatsApp’s Approach to App Security</h2>
<p>This is just one example of WhatsApp’s many investments in security. It’s why we built default end-to-end encryption for personal messages and calls, offer <a href="https://blog.whatsapp.com/end-to-end-encrypted-backups-on-whatsapp" target="_blank" rel="noopener">end-to-end encrypted</a> backups, and use <a href="https://tech.facebook.com/engineering/2023/4/strengthening-whatsapp-end-to-end-encryption-key-transparency/" target="_blank" rel="noopener">key transparency technology</a> to verify a secure connection, provide additional <a href="https://engineering.fb.com/2023/11/08/security/whatsapp-calls-enhancing-security/" target="_blank" rel="noopener">calling protections</a>, and more.</p>
<p>WhatsApp has a strong track record of being loud when we find issues and working to hold bad actors accountable. For example, WhatsApp <a href="https://www.whatsapp.com/security/advisories" target="_blank" rel="noopener">reports CVEs</a> for important issues we find in our applications, even if we do not find evidence of exploitation. We do this to give people on WhatsApp the best chance of protecting themselves by seeing a security advisory and updating quickly.</p>
<p>To ensure application security, we first must identify and quantify the sources of risk. We do this through internal and external audits like <a href="https://research.nccgroup.com/2021/10/27/public-report-whatsapp-end-to-end-encrypted-backups-security-assessment/" target="_blank" rel="noopener">NCC Group’s public assessment</a> of WhatsApp’s end-to-end encrypted backups, fuzzing, <a href="https://engineering.fb.com/2021/10/20/security/static-analysis-award/" target="_blank" rel="noopener">static analysis</a>, supply chain management, and automated attack surface analysis. We also recently expanded our <a href="https://bugbounty.meta.com/" target="_blank" rel="noopener">Bug Bounty program</a> to introduce the <a href="https://bugbounty.meta.com/blog/15th-anniversary-2025/" target="_blank" rel="noopener">WhatsApp Research Proxy</a> – a tool that makes research into WhatsApp’s network protocol more effective.</p>
<p>Next, we reduce the identified risk. Like many others in the industry, we found that the majority of the high severity vulnerabilities we published were due to memory safety issues in code written in the C and C++ programming languages. To combat this we invest in three parallel strategies:</p>
<ol><li>Design the product to minimize unnecessary attack surface exposure.</li>
<li>Invest in security assurance for the remaining C and C++ code.</li>
<li>Default to memory safe languages, not C and C++, for new code.</li>
</ol><p>WhatsApp has added protections like CFI, hardened memory allocators, safer buffer handling APIs, and more. C and C++ developers have specialized security training, development guidelines, and automated security analysis on their changes. We also have strict SLAs for fixing issues uncovered by the risk identification process.</p>
<h2>Accelerating Rust Adoption to Enhance Security</h2>
<p>Rust enabled WhatsApp’s security team to develop a secure, high performance, cross-platform library to ensure media shared on the platform is consistent and safe across devices. This is an important step forward in adding additional security behind the scenes for users and part of our ongoing defense-in-depth approach. Security teams at WhatsApp and Meta are highlighting opportunities for high impact adoption of Rust to interested teams, and we anticipate accelerating adoption of Rust over the coming years.</p>]]></description>
      <link>https://engineering.fb.com/2026/01/27/security/rust-at-scale-security-whatsapp/</link>
      <guid>https://engineering.fb.com/2026/01/27/security/rust-at-scale-security-whatsapp/</guid>
      <pubDate>Tue, 27 Jan 2026 16:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Adapting the Facebook Reels RecSys AI Model Based on User Feedback]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’ve improved personalized video recommendations on Facebook Reels by moving beyond metrics such as likes and watch time and directly leveraging user feedback. </li>
<li class="c1" aria-level="1">Our new User True Interest Survey (UTIS) model now helps surface more niche, high-quality content and boosts engagement, retention, and satisfaction.</li>
<li class="c1" aria-level="1">We’re doubling down on personalization, tackling challenges like sparse user data and bias, and exploring advanced AI to make recommendations even smarter and more diverse.</li>
<li class="c1" aria-level="1">Our paper, “<a href="https://dl.acm.org/doi/10.1145/3705328.3748119" target="_blank" rel="noopener">Improve the Personalization of Large-Scale Ranking Systems by Integrating User Survey Feedback</a>” shares full details on this work. </li>
</ul><p>Delivering personalized video recommendations is central to user satisfaction and long-term engagement on large-scale social platforms. At Facebook Reels, we’ve been working to close the gap between what engagement signals suggest and what people actually want to see by focusing on “interest matching” – ensuring that the content people see truly aligns with their unique preferences. By combining large-scale user surveys with recent advances in machine learning, we are now able to better understand and model what people genuinely care about, which has led to significant improvements in both recommendation quality and overall user satisfaction.</p>
<h2>Why True Interest Matters</h2>
<p>Traditional recommendation systems often rely on engagement signals – such as likes, shares, and watch time – or heuristics to infer user interests. However, these signals can be noisy and may not fully capture the nuances of what people actually care about or want to see. Models trained only on these signals tend to recommend content that has high short-term user value measured by watch time and engagement but doesn’t capture true interests that are important for long-term utility of the product. To bridge this gap, we needed a more direct way to measure user perception of content relevance. Our research shows that effective interest matching goes beyond simple topic alignment; it also encompasses factors like audio, production style, mood, and motivation. By accurately capturing these dimensions, we can deliver recommendations that feel more relevant and personalized, encouraging people to return to the app more frequently.</p>
<figure id="attachment_23392" aria-describedby="caption-attachment-23392" class="wp-caption alignnone c2"><img class="wp-image-23392" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image1.png" alt="" width="600" height="560" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image1.png 1254w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image1.png?resize=916,855 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image1.png?resize=768,717 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image1.png?resize=1024,955 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image1.png?resize=96,90 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image1.png?resize=192,179 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23392" class="wp-caption-text">Recommendation systems are typically optimized based on user interactions on the product, such as watch time, likes, shares, etc. However, by incorporating user perception feedback – like interest match and novelty – we can significantly improve relevance, quality, and the overall ecosystem.</figcaption></figure><h2>How We Measured User Perception</h2>
<p>To validate our approach, we launched large-scale, randomized surveys within the video feed, asking users, “How well does this video match your interests?” These surveys were deployed across Facebook Reels and other video surfaces, enabling us to collect thousands of in-context responses from users every day. The results revealed that previous interest heuristics only achieved a <a href="https://dl.acm.org/doi/abs/10.1145/3705328.3748119" target="_blank" rel="noopener">48.3% precision in identifying true interests</a>, highlighting the need for a more robust measurement framework. </p>
<p>By weighting responses to correct for sampling and nonresponse bias, we built a comprehensive dataset that accurately reflects real user preferences – moving beyond implicit engagement signals to leverage direct, real-time user feedback.</p>
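The bias correction described above can be sketched with simple inverse-probability weighting. This is a hypothetical, simplified illustration, not Meta's actual pipeline: the field names, probabilities, and cohorts below are invented for the example.

```python
# Illustrative sketch: correcting survey responses for sampling and
# nonresponse bias via inverse-probability weighting. Each response is
# up-weighted by 1 / (P(sampled) * P(responded)), so under-represented
# cohorts count proportionally more in the aggregate.

def weighted_interest_rate(responses):
    """responses: list of dicts with hypothetical fields:
    'matched' (bool), 'p_sampled' and 'p_responded' (floats in (0, 1])."""
    num = den = 0.0
    for r in responses:
        w = 1.0 / (r["p_sampled"] * r["p_responded"])
        num += w * (1.0 if r["matched"] else 0.0)
        den += w
    return num / den

responses = [
    # A heavy-user cohort: sampled and responding often, mostly matched.
    {"matched": True,  "p_sampled": 0.10, "p_responded": 0.80},
    {"matched": True,  "p_sampled": 0.10, "p_responded": 0.80},
    # A light-user cohort: rarely sampled, rarely responds, not matched.
    {"matched": False, "p_sampled": 0.02, "p_responded": 0.20},
]

print(round(weighted_interest_rate(responses), 3))  # prints 0.091
```

Without the weights, the naive match rate here would be 2/3; after weighting, the rarely-surveyed cohort dominates, illustrating how correction changes the estimate.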
<p><img class="aligncenter wp-image-23429" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png" alt="" width="277" height="600" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png 924w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png?resize=423,916 423w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png?resize=768,1662 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png?resize=473,1024 473w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png?resize=710,1536 710w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png?resize=96,208 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image-updated.png?resize=192,415 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>Framework: User True Interest Survey (UTIS) Model</h2>
<p>Each day, a random sample of viewing sessions on the platform is selected to display a single-question survey asking, “To what extent does this video match your interests?” on a 1-5 scale. The survey gathers real-time feedback from users about the content they have just viewed.</p>
<p>The main candidate ranking model used by the platform is a large multi-task, multi-label model. We trained a lightweight UTIS <strong><em>alignment model layer</em></strong> on the collected user survey responses, using the main model’s existing predictions as input features. The survey responses used for training were binarized to simplify modeling and reduce variance in the responses. In addition, new features were engineered to capture user behavior, content attributes, and interest signals, with the objective function optimized to predict the extent of users’ interest match.</p>
<p>The UTIS model outputs the probability that a user is satisfied with a video, and is designed to be interpretable, allowing us to understand the factors contributing to users’ interest matching experience.</p>
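To make the idea concrete, here is a simplified, assumed sketch of such an alignment layer: a tiny logistic model trained on binarized survey labels (rating of 4 or 5 mapped to 1), taking existing main-model predictions as features. The feature names (`p_watch`, `p_like`), toy data, and training loop are invented for illustration; the post describes the layer only at a high level.

```python
# Hypothetical sketch of a UTIS-style "alignment layer": logistic
# regression over the main ranking model's existing predictions,
# trained on binarized survey ratings.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_alignment_layer(features, ratings, lr=0.5, epochs=200):
    """features: per-impression vectors of existing model predictions
    (e.g., p_watch, p_like); ratings: 1-5 survey answers."""
    labels = [1.0 if r >= 4 else 0.0 for r in ratings]  # binarize to denoise
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Probability that the user perceives the video as matching their interests."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data: [p_watch, p_like] from the main model, plus survey ratings.
feats = [[0.9, 0.8], [0.8, 0.7], [0.3, 0.1], [0.2, 0.2]]
rates = [5, 4, 2, 1]
w, b = train_alignment_layer(feats, rates)
print(predict(w, b, [0.85, 0.75]) > predict(w, b, [0.25, 0.15]))  # True
```

Because the layer consumes the main model's outputs rather than raw features, it stays small and interpretable, which matches the interpretability goal described above.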
<figure id="attachment_23393" aria-describedby="caption-attachment-23393" class="wp-caption alignnone c2"><img class="wp-image-23393" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image2.png" alt="" width="600" height="602" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image2.png 1234w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image2.png?resize=913,916 913w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image2.png?resize=768,770 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image2.png?resize=1021,1024 1021w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image2.png?resize=96,96 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Facebook-Reels-Adapting-Recsys-image2.png?resize=192,193 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23393" class="wp-caption-text">User perception feedback collected through surveys is extremely sparse, but it can be generalized in large-scale recommendation systems using our novel “Perception Layer” model architecture, which uses existing event predictions as additional features.</figcaption></figure><h2>Integrating the UTIS Model in the Main Ranking System</h2>
<p>We have experimented with and deployed several use cases of the UTIS model in our ranking funnel, all of which delivered improvements in our tier 0 user retention metrics:</p>
<ol><li class="c1" aria-level="1"><strong>Late Stage Ranking (LSR)</strong>: UTIS is deployed in parallel to the LSR model, providing an additional input feature into the final value formula. This allows fine-tuning of the final ranking stage to incorporate true interests while balancing other concerns.</li>
<li class="c1" aria-level="1"><strong>Early Stage Ranking (Retrieval)</strong>: UTIS is used to reconstruct users’ true interest profiles by aggregating survey data to predict affinity for any given user-video pair, allowing us to re-rank the user interest profile and source more candidates relevant to users’ true interests. In addition, large sequence-based user-to-item retrieval models are aligned using knowledge-distillation objectives, with UTIS predictions from LSR serving as labels.</li>
</ol><p>The UTIS model score is now one of the inputs to our ranking system. Videos predicted to be of high interest receive a modest boost, while those with low predicted interest are demoted. This approach has led to:</p>
<ul><li class="c1" aria-level="1">Increased delivery of high-quality, niche content. </li>
<li class="c1" aria-level="1">A reduction in low-quality, generic, popularity-based recommendations.</li>
<li class="c1" aria-level="1">Improvements in like, share, and follow rates.</li>
<li class="c1" aria-level="1">Improved user engagement and retention metrics.</li>
</ul><p>Since launching this approach, we’ve observed robust offline and online performance:</p>
<ol><li class="c1" aria-level="1"><strong>Offline Performance:</strong> <a href="https://dl.acm.org/doi/abs/10.1145/3705328.3748119" target="_blank" rel="noopener">The UTIS model delivered an improvement in accuracy and reliability over the heuristic rule baseline</a>. Accuracy increased from 59.5% to 71.5%, precision improved from 48.3% to 63.2%, and recall increased from 45.4% to 66.1%. These gains demonstrate the model’s ability to help in accurately identifying users’ interest preferences.</li>
<li class="c1" aria-level="1"><strong>Online Performance:</strong> Large-scale A/B testing with over 10 million users confirmed these improvements in real-world settings. <a href="https://dl.acm.org/doi/abs/10.1145/3705328.3748119" target="_blank" rel="noopener">The UTIS model consistently outperformed the baseline, driving higher user engagement and retention</a>. Notably, we saw a +5.4% increase in high survey ratings, a -6.84% reduction in low survey ratings, a +5.2% boost in total user engagement, and a -0.34% decrease in integrity violations. These results highlight the model’s effectiveness in improving user experience and matching users with relevant interests.</li>
</ol><h2>Future Work for Interest Recommendations</h2>
<p>By integrating survey-based measurement with machine learning, we are creating a more engaging and personalized experience – delivering content on Facebook Reels that feels truly tailored to each user and encourages repeat visits. While survey-driven modeling has already improved our recommendations, there remain important opportunities for improvement, such as better serving users with sparse engagement histories, reducing bias in survey sampling and delivery, further personalizing recommendations for diverse user cohorts and improving the diversity of recommendations. To address these challenges and continue advancing relevance and quality, we are also exploring advanced modeling techniques, including large language models and more granular user representations.</p>
<h2>Read the Paper</h2>
<p><a class="meta-btn" href="https://dl.acm.org/doi/10.1145/3705328.3748119?ref=engineeringatmeta">Improve the Personalization of Large-Scale Ranking Systems by Integrating User Survey Feedback</a></p>]]></description>
      <link>https://engineering.fb.com/2026/01/14/ml-applications/adapting-the-facebook-reels-recsys-ai-model-based-on-user-feedback/</link>
      <guid>https://engineering.fb.com/2026/01/14/ml-applications/adapting-the-facebook-reels-recsys-ai-model-based-on-user-feedback/</guid>
      <pubDate>Wed, 14 Jan 2026 21:51:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[CSS at Scale With StyleX]]></title>
      <description><![CDATA[<div class="x1lliihq x13xjzxd x14beivq x14z9mp x1lziwak x14l7nz5 xzboxd6 x6xjc82 x13xao1 x6z2cds xscc34e">
<p>Build a large enough website with a large enough codebase, and you’ll eventually find that CSS presents challenges at scale. It’s no different at Meta, which is why we open-sourced StyleX, a solution for CSS at scale. StyleX combines the ergonomics of CSS-in-JS with the performance of static CSS. It allows atomic styling of components while deduplicating definitions to reduce bundle size and exposes a simple API for developers.</p>
<p>StyleX has become the standard at companies like Figma and Snowflake. Here at Meta, <a href="https://engineering.fb.com/2025/11/11/web/stylex-a-styling-library-for-css-at-scale/" target="_blank" rel="noopener">it’s the standard styling system</a> across Facebook, Instagram, WhatsApp, Messenger, and Threads.</p>
<p>On this episode of the Meta Tech Podcast, meet Melissa, a software engineer at Meta and one of StyleX’s maintainers.  <a href="https://www.threads.com/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> talks to her about all things StyleX—its origins, how open source has been a force multiplier for the project, and what it’s like interacting with large companies across the industry as they’ve adopted StyleX.</p>
</div>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/39659430/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/051rjeBqtSMZALoJ02jpTN?ref=engineeringatmeta" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/gb/podcast/css-at-scale-with-stylex/id1370910331?i=1000744322338" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pocketcasts.com/podcast/meta-tech-podcast/c4ede3e0-1fbf-0136-c266-7d73a919276a/css-at-scale-with-stylex/27b5439a-ab94-45a7-b89c-30f2540d9df5?ref=engineeringatmeta" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2026/01/12/web/css-at-scale-with-stylex/</link>
      <guid>https://engineering.fb.com/2026/01/12/web/css-at-scale-with-stylex/</guid>
      <pubDate>Mon, 12 Jan 2026 19:34:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption]]></title>
      <description><![CDATA[<p>The 2025 Typed Python Survey, conducted by contributors from JetBrains, Meta, and the broader Python typing community, offers a comprehensive look at the current state of Python’s type system and developer tooling. With 1,241 responses (a 15% increase from last year), the survey captures the evolving sentiment, challenges, and opportunities around Python typing in the open-source ecosystem. In this blog we’ll cover a summary of the key findings and trends from this year’s results<strong>.</strong></p>
<h2>Who Responded?</h2>
<p>The survey was initially distributed on official social media accounts by the survey creators, and subsequently shared organically across further platforms including Reddit, email newsletters, Mastodon, LinkedIn, Discord, and Twitter. When respondents were asked which platform they heard about the survey from, <strong>Reddit emerged as the most effective channel</strong>, but significant engagement also came from email newsletters and Mastodon, reflecting the diverse spaces where Python developers connect and share knowledge.</p>
<p><strong>The respondent pool was predominantly composed of developers experienced with Python and typing</strong>. Nearly half reported over a decade of Python experience, and another third had between five and 10 years. While there was representation from newcomers, the majority of participants brought substantial expertise to their responses. Experience with type hints was similarly robust, with most respondents having used them for several years and only a small minority indicating no experience with typing.</p>
<h2>Typing Adoption and Attitudes</h2>
<p>The survey results reveal that Python’s type hinting system has become a core part of development for most engineers. An impressive 86% of respondents report that they “always” or “often” use type hints in their Python code, a figure that remains consistent with <a href="https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/" target="_blank" rel="noopener">last year’s Typed Python survey</a>. </p>
<p>For the first time this year the survey also asked participants to indicate how many years of experience they have with Python and with Python typing. We found that adoption of typing is similar across all experience levels, but there are some interesting nuances:</p>
<ul><li class="c1" aria-level="1">Developers with 5–10 years of Python experience are the most enthusiastic adopters, with <strong>93%</strong> reporting regularly using type hints.</li>
<li class="c1" aria-level="1">Among the most junior developers (0–2 years of experience), adoption is slightly lower at <strong>83%</strong>. A possible reason is the learning curve for newcomers, which was repeatedly mentioned in responses to later survey questions.</li>
<li class="c1" aria-level="1">For senior developers (10+ years of experience), adoption was the lowest of all cohorts, with only <strong>80%</strong> reporting using them always or often. Reasons for this drop are unclear: it could reflect more experienced Python developers having grown accustomed to writing Python without type hints before they were supported, or it may be that they are more likely to work on large or legacy codebases that are difficult to migrate.</li>
</ul><figure id="attachment_23493" aria-describedby="caption-attachment-23493" class="wp-caption alignnone c2"><img class="size-full wp-image-23493" src="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png" alt="" width="2428" height="936" srcset="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png 2428w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png?resize=916,353 916w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png?resize=768,296 768w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png?resize=1024,395 1024w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png?resize=1536,592 1536w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png?resize=2048,790 2048w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-1.png?resize=192,74 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23493" class="wp-caption-text">Percent of respondents who use types “often” or “always,” segmented by years of Python experience.</figcaption></figure><p>Overall, the data shows that type hints are widely embraced by the Python community, with strong support from engineers at all experience levels. However, we should note there may be some selection bias at play here, as it’s possible developers who are more familiar with types and use them more often are also more likely to be interested in taking a survey about it.</p>
<h2>Why Developers Love Python Typing</h2>
<p>When asked what developers loved about the Python type system there were some mixed reactions, with a number of responses just stating, “nothing” (note this was an optional question). This indicates the presence of some strong negative opinions towards the type system among a minority of Python users. The majority of responses were positive, with the following themes emerging prominently:</p>
<ul><li class="c1" aria-level="1"><strong>Optionality and Gradual Adoption</strong>: The optional nature of the type system and the ability to adopt it incrementally into existing projects are highly valued, allowing flexibility in development.</li>
<li class="c1" aria-level="1"><strong>Improved Readability and Documentation</strong>: Type hints serve as in-code documentation, making code clearer and easier to read, understand, and reason about for both the author and other developers, especially in larger codebases.</li>
<li class="c1" aria-level="1"><strong>Enhanced Tooling and IDE Support</strong>: The type system significantly improves IDE features like autocomplete/IntelliSense, jump-to-definition, and inline type hints, leading to a better developer experience.</li>
<li class="c1" aria-level="1"><strong>Bug Prevention and Code Correctness</strong>: It helps catch errors and subtle bugs earlier during development or refactoring, increasing confidence and leading to more robust and reliable code.</li>
<li class="c1" aria-level="1"><strong>Flexibility and Features</strong>: Respondents appreciate the flexibility, expressiveness, and powerful features of the system, including protocols, generics (especially the new syntax), and the ability to inspect annotations at runtime for use with libraries like Pydantic/FastAPI.</li>
</ul><figure id="attachment_23488" aria-describedby="caption-attachment-23488" class="wp-caption alignnone c3"><img class="size-full wp-image-23488" src="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png" alt="" width="1280" height="720" srcset="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png 1280w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-2.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23488" class="wp-caption-text">Sample of responses to the question, “What do you love about Python Typing?”</figcaption></figure><h2>Challenges and Pain Points</h2>
<p>In addition to assessing positive sentiment towards Python typing, we also asked respondents what challenges and pain points they face. With over 800 responses to the question, “What is the hardest part about using the Python type system?” the following themes were identified:</p>
<ul><li class="c1" aria-level="1"><strong>Third-Party Library/Framework Support:</strong> Many respondents cited the difficulty of integrating types with untyped, incomplete, or incorrect type annotations in third-party libraries (e.g., NumPy, Pandas, Django).</li>
<li class="c1" aria-level="1"><strong>Complexity of Advanced Features:</strong> Advanced concepts such as <strong>generics</strong>, TypeVar (including co/contravariance), <strong>callables/decorators</strong>, and <strong>complex/nested types</strong> were frequently mentioned as difficult to understand or express.</li>
<li class="c1" aria-level="1"><strong>Tooling and Ecosystem Fragmentation:</strong> The ecosystem is seen as chaotic, with inconsistencies between different type checkers (like Mypy and Pyright), slow performance of tools like Mypy, and a desire for an official, built-in type checker.</li>
<li class="c1" aria-level="1"><strong>Lack of Enforcement and Runtime Guarantees:</strong> The fact that typing is <strong>optional</strong> and is not enforced at runtime or by the Python interpreter makes it harder to convince others to use it, enforce its consistent use, and fully trust the type hints.</li>
<li class="c1" aria-level="1"><strong>Verbosity and Code Readability:</strong> The necessary type hints, especially for complex structures, can be verbose, make the code less readable, and feel non-Pythonic.</li>
<li class="c1" aria-level="1"><strong>Dealing with Legacy/Dynamic Code:</strong> It is hard to integrate typing into old, untyped codebases, particularly when they use dynamic Python features that do not play well with static typing.</li>
<li class="c1" aria-level="1"><strong>Type System Limitations and Evolution:</strong> The type system is perceived as incomplete or less expressive than languages like TypeScript, and its rapid evolution means syntax and best practices are constantly changing.</li>
</ul><h2>Most Requested Features</h2>
<p>A little less than half of respondents had suggestions for what they thought was missing from the Python type system, with the most commonly requested features being:</p>
<ul><li class="c1" aria-level="1"><strong>Missing Features From TypeScript and Other Languages:</strong> Many respondents requested features inspired by TypeScript, such as <strong>Intersection types</strong> (like the &amp; operator), <strong>Mapped and Conditional types</strong>, <strong>Utility types</strong> (like Pick, Omit, keyof, and typeof), and better <strong>Structural typing</strong> for dictionaries/dicts (e.g., more flexible TypedDict or anonymous types).</li>
<li class="c1" aria-level="1"><strong>Runtime Type Enforcement and Performance:</strong> A significant number of developers desire <strong>optional runtime type enforcement</strong> or guarantees, as well as <strong>performance optimizations</strong> (JIT/AOT compilation) based on the type hints provided.</li>
<li class="c1" aria-level="1"><strong>Better Generics and Algebraic Data Types (ADTs):</strong> Requests include features like <strong>higher-kinded types (HKT)</strong>, improved support for <strong>TypeVarTuple</strong> (e.g., bounds and unpacking), better <strong>generics</strong> implementation, and official support for <strong>algebraic data types</strong> (e.g., Result, Option, or Rust-like enums/sum types).</li>
<li class="c1" aria-level="1"><strong>Improved Tooling, Consistency, and Syntax:</strong> Developers asked for an <strong>official/built-in type checker</strong> that is fast and consistent, a less <strong>verbose syntax</strong> for common patterns like nullable types (? instead of | None) and callables, and <strong>better support/documentation</strong> for complex types (like nested dicts, NumPy/Pandas arrays).</li>
<li class="c1" aria-level="1"><strong>Handling of Complex/Dynamic Patterns:</strong> Specific missing capabilities include better support for typing <strong>function wrappers/decorators</strong> (e.g., using ParamSpec effectively), being able to type <strong>dynamic attributes</strong> (like those added by Django/ORMs), and improved type <strong>narrowing</strong> and <strong>control flow analysis</strong>.</li>
</ul><h2>Tooling Trends</h2>
<p>The developer tooling landscape for Python typing continues to evolve, with both established and emerging tools shaping how engineers work.</p>
<p>Mypy remains the most widely used type checker, with 58% of respondents reporting using it. While this represents a slight dip from 61% in <a href="https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/" target="_blank" rel="noopener">last year’s survey</a>, Mypy still holds a dominant position in the ecosystem. At the same time, new Rust-based type checkers like Pyrefly, Ty, and Zuban are quickly gaining traction, now used by over 20% of survey participants collectively.</p>
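To make concrete what these checkers add, here is a small standalone example of the kind of mistake static analysis catches before the code ever runs. The file name and the exact error wording below are illustrative, not output copied from any specific tool.

```python
# A minimal sketch of static type checking: annotate a function, then run a
# checker such as mypy or Pyright over the file (e.g., `mypy example.py`,
# where "example.py" is a hypothetical file name).

def average_watch_time(durations: list[float]) -> float:
    """Mean of a non-empty list of durations in seconds."""
    return sum(durations) / len(durations)

print(average_watch_time([12.0, 30.0, 18.0]))  # prints 20.0

# A type checker would reject the call below without running the code,
# with an error along the lines of:
#   average_watch_time("12, 30, 18")
#   error: argument has incompatible type "str"; expected "list[float]"
```

At runtime the annotations are not enforced (a pain point respondents raised below), which is exactly why the checker's ahead-of-time report is valuable.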
<figure id="attachment_23487" aria-describedby="caption-attachment-23487" class="wp-caption alignnone c4"><img class="size-full wp-image-23487" src="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png" alt="" width="1999" height="1125" srcset="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-3.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23487" class="wp-caption-text">The top six most popular answers to the question, “What type checking tools do your projects use (select all that apply)?”</figcaption></figure><p>When it comes to development environments, VS Code leads the pack as the most popular IDE among Python developers, followed by PyCharm and (Neo)vim/vim. The use of type checking tools within IDEs also mirrors the popularity of the IDEs themselves, with VS Code’s default (Pylance/Pyright) and PyCharm’s built-in support being the first and third most popular options respectively.</p>
<h2>How Developers Learn and Get Help</h2>
<p>When it comes to learning about Python typing and getting help, developers rely on a mix of official resources, community-driven content, and AI-powered tools, a similar learning landscape to what we saw in <a href="https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/">last year’s survey</a>.</p>
<figure id="attachment_23490" aria-describedby="caption-attachment-23490" class="wp-caption alignnone c4"><img class="size-full wp-image-23490" src="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png" alt="" width="1999" height="1125" srcset="https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/12/Python-typing-survey-2025-image-4.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23490" class="wp-caption-text">Top six responses to the question, “How do you learn Python typing (select all that apply)?”</figcaption></figure><p>Official documentation remains the go-to resource for most developers. The majority of respondents reported learning about Python typing through the official docs, <strong>with 865 citing it as their primary source for learning and 891 turning to it for help</strong>. Python’s dedicated typing documentation and type checker-specific docs are also heavily used, showing that well-maintained, authoritative resources are still highly valued.</p>
<p>Blog posts have climbed in popularity, now ranking as the second most common way developers learn about typing, up from third place last year. Online tutorials, code reviews, and YouTube videos also play a significant role.</p>
<p>Community platforms are gaining traction as sources for updates and new features. Reddit, in particular, has become a key channel for discovering new developments in the type system, jumping from fifth to third place as a source for news. Email newsletters, podcasts, and Mastodon are also on the rise.</p>
<p>Large language models (LLMs) are now a notable part of the help-seeking landscape. Over 400 respondents reported using LLM chat tools, and nearly 300 use in-editor LLM suggestions when working with Python typing. </p>
<h2><strong>Opportunities and Next Steps</strong></h2>
<p>The 2025 Python Typing Survey highlights the Python community’s sustained adoption of typing features and tools to support their usage. It also points to clear opportunities for continued growth and improvement, including:</p>
<ul><li class="c1" aria-level="1"><strong>Increasing library coverage</strong>: One of the most consistent requests from the community is for broader and deeper type annotation coverage in popular libraries. Expanding type hints across widely used packages will make static typing more practical and valuable for everyone.</li>
<li class="c1" aria-level="1"><strong>Improving documentation</strong>: While official documentation remains the top resource, there’s a strong appetite for more discoverable and accessible learning materials. Leveraging channels like newsletters, blog posts, and Reddit can help surface new features, best practices, and real-world examples to a wider audience.</li>
<li class="c1" aria-level="1"><strong>Clarifying tooling differences</strong>: The growing variety of type checkers and tools is a sign of a healthy ecosystem, but it can also reflect a lack of consensus/standardisation and can be confusing for users. There’s an opportunity to drive more consistency between tools or provide clearer guidance on their differences and best-fit use cases.</li>
</ul><p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">website</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, <a href="https://bsky.app/profile/metaopensource.bsky.social" target="_blank" rel="noopener">Bluesky</a> and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>
<h2>Acknowledgements</h2>
<p><em>This survey ran from 29th Aug to 16th Sept 2025 and received 1,241 responses in total.</em></p>
<p><em>Thanks to everyone who participated! The Python typing ecosystem continues to evolve, and your feedback helps shape its future.</em></p>
<p><em>Also, special thanks to the JetBrains PyCharm team for providing the graphics used in this piece.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/12/22/developer-tools/python-typing-survey-2025-code-quality-flexibility-typing-adoption/</link>
      <guid>https://engineering.fb.com/2025/12/22/developer-tools/python-typing-survey-2025-code-quality-flexibility-typing-adoption/</guid>
      <pubDate>Mon, 22 Dec 2025 15:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[DrP: Meta’s Root Cause Analysis Platform at Scale]]></title>
<description><![CDATA[<p>Incident investigation can be a daunting task in today’s digital landscape, where large-scale systems comprise numerous interconnected components and dependencies.</p>
<p><a href="https://arxiv.org/abs/2512.04250">DrP</a> is a root cause analysis (RCA) platform designed by Meta to programmatically automate the investigation process, significantly reducing the mean time to resolve (MTTR) for incidents and alleviating on-call toil.</p>
<p>Today, DrP is used by over 300 teams at Meta, running 50,000 analyses daily, and has been effective in reducing MTTR by 20-80%.</p>
<p>By understanding DrP and its capabilities, we can unlock new possibilities for efficient incident resolution and improved system reliability.</p>
<h2>What It Is</h2>
<p>DrP is an end-to-end platform that automates the investigation process for large-scale systems. It addresses the inefficiencies of manual investigations, which often rely on outdated playbooks and ad-hoc scripts. These traditional methods can lead to prolonged downtimes and increased on-call toil as engineers spend countless hours triaging and debugging incidents.</p>
<p>DrP offers a comprehensive solution by providing an expressive and flexible SDK to author investigation playbooks, known as analyzers. These analyzers are executed by a scalable backend system, which integrates seamlessly with mainstream workflows such as alerts and incident management tools. Additionally, DrP includes a post-processing system to automate actions based on investigation results, such as mitigation steps.</p>
<p><img class="alignnone size-full wp-image-23498" src="https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png" alt="" width="1999" height="1125" srcset="https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-1.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>DrP’s key components include: </p>
<ol><li class="c1" aria-level="1"><strong>Expressive SDK</strong>: The DrP SDK allows engineers to codify investigation workflows into analyzers. It provides a rich set of helper libraries and machine learning (ML) algorithms for data access and problem isolation analysis, such as anomaly detection, event isolation, time series correlation and dimension analysis.</li>
<li class="c1" aria-level="1"><strong>Scalable backend</strong>: The backend system executes the analyzers, providing both multi-tenant and isolated execution environments. It ensures that analyzers can be run at scale, handling thousands of automated analyses per day.</li>
<li class="c1" aria-level="1"><strong>Integration with workflows</strong>: DrP integrates with alerting and incident management tools, allowing for the auto-triggering of analyzers on incidents. This integration ensures that investigation results are immediately available to on-call engineers.</li>
<li class="c1" aria-level="1"><strong>Post-processing system</strong>: After an investigation, the post-processing system can take automated actions based on the analysis results. For example, it can create tasks or pull requests to mitigate issues identified during the investigation.</li>
</ol><h2>How It Works </h2>
<h3>Authoring Workflow</h3>
<p><img class="alignnone size-full wp-image-23499" src="https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg" alt="" width="1999" height="1125" srcset="https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg 1999w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-DrP-image-2.jpg?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>The process of creating automated playbooks, or analyzers, begins with the DrP SDK. Engineers enumerate the investigation steps, listing inputs and potential paths to isolate problem areas. The SDK provides APIs and libraries to codify these workflows, allowing engineers to capture all required input parameters and context in a type-safe manner.</p>
<ol><li class="c1" aria-level="1"><strong>Enumerate investigation steps</strong>: Engineers start by listing the steps required to investigate an incident, including inputs and potential paths to isolate the problem.</li>
<li class="c1" aria-level="1"><strong>Bootstrap code</strong>: The DrP SDK provides bootstrap code to create a template analyzer with pre-populated boilerplate code. Engineers extend this code to capture all necessary input parameters and context.</li>
<li class="c1" aria-level="1"><strong>Data access and analysis</strong>: The SDK includes libraries for data access and analysis, such as dimension analysis and time series correlation. Engineers use these libraries to code the main investigation decision tree into the analyzer.</li>
<li class="c1" aria-level="1"><strong>Analyzer chaining</strong>: For dependent service analysis, the SDK’s APIs allow for seamless chaining of analyzers, passing context and obtaining outputs.</li>
<li class="c1" aria-level="1"><strong>Output and post-processing</strong>: The output method captures findings from the analysis, using special data structures for both text and machine-readable formats. Post-processing methods automate actions based on analyzer findings.</li>
</ol><p>Once created, analyzers are tested and sent for code review. DrP offers automated backtesting integrated into code review tools, ensuring high-quality analyzers before deployment.</p>
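<p>To make the authoring workflow concrete, here is a minimal sketch of the kind of decision logic an analyzer might codify. All names here (LatencyRegressionAnalyzer, Finding, run) and the toy dimension analysis are illustrative assumptions, not the actual DrP SDK, which provides much richer data-access and ML helper libraries:</p>
<pre class="line-numbers"><code class="language-java">import java.util.Map;

// Hypothetical sketch of a DrP-style analyzer; the class, record, and method
// names are illustrative, not the real DrP SDK.
class LatencyRegressionAnalyzer {
    // A Finding pairs a human-readable summary with a machine-readable cause,
    // mirroring the "text and machine-readable formats" described above.
    record Finding(String summary, String suspectDimension) {}

    // Toy "dimension analysis": flag the dimension whose latency delta
    // (current minus baseline) grew the most as the suspected problem area.
    static Finding run(Map<String, Double> baselineMs, Map<String, Double> currentMs) {
        String suspect = null;
        double worstDelta = 0.0;
        for (Map.Entry<String, Double> entry : currentMs.entrySet()) {
            double delta = entry.getValue() - baselineMs.getOrDefault(entry.getKey(), 0.0);
            if (delta > worstDelta) {
                worstDelta = delta;
                suspect = entry.getKey();
            }
        }
        if (suspect == null) {
            return new Finding("No regression isolated", null);
        }
        return new Finding(
            String.format("Dimension '%s' regressed by %.0f ms", suspect, worstDelta),
            suspect);
    }
}
</code></pre>
<p>A production analyzer would fetch the baseline and current series through the SDK’s data-access libraries and could chain into analyzers for dependent services, but the shape is the same: typed inputs, an isolation step, and a structured finding consumable by both humans and post-processing.</p>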
<h3>Consumption Workflow</h3>
<p>In production, analyzers integrate with tools like UI, CLI, alerts, and incident management systems. Analyzers can automatically trigger upon alert activation, providing immediate results to on-call engineers and improving response times. The DrP backend manages a queue for requests and a worker pool for secure execution, with results returning asynchronously.</p>
<ol><li class="c1" aria-level="1"><strong>Integration with alerts</strong>: DrP is integrated with alerting systems, allowing analyzers to trigger automatically when an alert is activated. This provides immediate analysis results to on-call engineers.</li>
<li class="c1" aria-level="1"><strong>Execution and monitoring</strong>: The backend system manages a queue for analyzer requests and a worker pool for execution. It monitors execution, ensuring that analyzers run securely and efficiently.</li>
<li class="c1" aria-level="1"><strong>Post-processing and insights</strong>: A separate post-processing system handles analysis results, annotating alerts with findings. The DrP Insights system periodically analyzes outputs to identify and rank top alert causes, aiding teams in prioritizing reliability improvements.</li>
</ol><h2>Why It Matters</h2>
<h3>Reducing MTTR</h3>
<p>DrP has demonstrated significant improvements in reducing MTTR across various teams and use cases. By automating manual investigations, DrP enables faster triage and mitigation of incidents, leading to quicker system recovery and improved availability.</p>
<ol><li class="c1" aria-level="1"><strong>Efficiency</strong>: Automated investigations reduce the time engineers spend on manual triage, allowing them to focus on more complex tasks. This efficiency translates to faster incident resolution and reduced downtime.</li>
<li class="c1" aria-level="1"><strong>Consistency</strong>: By codifying investigation workflows into analyzers, DrP ensures consistent and repeatable investigations. This consistency reduces the likelihood of errors and improves the reliability of incident resolution.</li>
<li class="c1" aria-level="1"><strong>Scalability</strong>: DrP can handle thousands of automated analyses per day, making it suitable for large-scale systems with complex dependencies. Its scalability ensures that it can support the needs of growing organizations.</li>
</ol><h3>Enhancing On-Call Productivity</h3>
<p>The automation provided by DrP reduces the on-call effort during investigations, saving engineering hours and reducing on-call fatigue. By automating repetitive and time-consuming steps, DrP allows engineers to focus on more complex tasks, improving overall productivity.</p>
<h3>Scalability and Adoption</h3>
<p>DrP has been successfully deployed at scale at Meta, covering over 300 teams and 2,000 analyzers, executing 50,000 automated analyses per day. Its integration into mainstream workflows, such as alerting systems, has facilitated widespread adoption and demonstrated its value in real-world scenarios.</p>
<ol><li class="c1" aria-level="1"><strong>Widespread adoption</strong>: DrP has been adopted by hundreds of teams across various domains, demonstrating its versatility and effectiveness in addressing diverse investigation needs.</li>
<li class="c1" aria-level="1"><strong>Proven impact</strong>: DrP has been in production for over five years, with proven results in reducing MTTR and improving on-call productivity. Its impact is evident in the positive feedback received from users and the significant improvements in incident resolution times.</li>
<li class="c1" aria-level="1"><strong>Continuous improvement</strong>: DrP is continuously evolving, with ongoing enhancements to its ML algorithms, SDK, backend system, and integrations. This commitment to continuous improvement ensures that DrP remains a cutting-edge solution for incident investigations, while its growing adoption across teams enables existing workflows and analyzers to be reused by others, compounding the shared knowledge base and making it increasingly valuable across the organization.</li>
</ol><h2>What’s Next</h2>
<p>Looking ahead, DrP aims to evolve into an AI-native platform, playing a central role in advancing Meta’s broader AI4Ops vision and enabling more powerful and automated investigations. This transformation will deliver more accurate and insightful analysis results, while also simplifying the user experience through streamlined ML algorithms, SDKs, UI, and integrations that make authoring and executing analyzers effortless.</p>
<h2>Read the Paper</h2>
<p><a href="https://arxiv.org/abs/2512.04250">DrP: Meta’s Efficient Investigations Platform at Scale</a></p>
<h2>Acknowledgements</h2>
<p><em>We wish to thank contributors to this effort across many teams throughout Meta.</em></p>
<p><em>Team – Eduardo Hernandez, Jimmy Wang, Akash Jothi, Kshitiz Bhattarai, Shreya Shah, Neeru Sharma, Alex He, Juan-Pablo E, Oswaldo R, Vamsi Kunchaparthi, Daniel An, Rakesh Vanga, Ankit Agarwal, Narayanan Sankaran, Vlad Tsvang, Khushbu Thakur, Srikanth Kamath, Chris Davis, Rohit JV, Ohad Yahalom, Bao Nguyen, Viraaj Navelkar, Arturo Lira, Nikolay Laptev, Sean Lee, Yulin Chen</em></p>
<p><em>Leadership – Sanjay Sundarajan, John Ehrhardt, Ruben Badaro, Nitin Gupta, Victoria Dudin, Benjamin Renard, Gautam Shanbhag, Barak Yagour, Aparna Ramani</em></p>]]></description>
      <link>https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/</link>
      <guid>https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/</guid>
      <pubDate>Fri, 19 Dec 2025 18:35:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How We Built Meta Ray-Ban Display: From Zero to Polish]]></title>
      <description><![CDATA[<p>We’re going behind the scenes of the <a href="https://about.fb.com/news/2025/09/meta-ray-ban-display-ai-glasses-emg-wristband/?ref=engineeringatmeta">Meta Ray-Ban Display</a>, Meta’s most advanced AI glasses yet. In a previous episode we met the team behind the Meta Neural Band, the EMG wristband packaged with the Ray-Ban Display. Now we’re delving into the glasses themselves.</p>
<p>Kenan and Emanuel, from Meta’s Wearables org, join <a href="https://www.threads.com/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> on the Meta Tech Podcast to talk about the unique challenges of designing game-changing wearable technology, from the novel display technology to emerging UI patterns for display glasses.</p>
<p>You’ll also learn what particle physics and hardware design have in common and how to celebrate even the incremental wins in a fast-moving culture.</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/39382865/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/1xMZDrCSW74orGphqGLFE5?ref=engineeringatmeta" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/gb/podcast/from-zero-to-polish-building-meta-ray-ban-display/id1370910331?i=1000741025569" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/3dhpd4np?ref=engineeringatmeta" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/12/17/virtual-reality/meta-ray-ban-display-from-zero-to-polish/</link>
      <guid>https://engineering.fb.com/2025/12/17/virtual-reality/meta-ray-ban-display-from-zero-to-polish/</guid>
      <pubDate>Wed, 17 Dec 2025 15:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Meta’s secure-by-default frameworks wrap potentially unsafe OS and third-party functions, making security the default while preserving developer speed and usability.</li>
<li class="c1" aria-level="1">These frameworks are designed to closely mirror existing APIs, rely on public and stable interfaces, and maximize developer adoption by minimizing friction and complexity.</li>
<li class="c1" aria-level="1">Generative AI and automation accelerate the adoption of secure frameworks at scale, enabling consistent security enforcement and efficient migration across Meta’s vast codebase.</li>
</ul><p>Sometimes functions within operating systems or provided by third parties come with a risk of misuse that could compromise security. To mitigate this, we wrap or replace these functions using our own secure-by-default frameworks. These frameworks play an important role in helping our security and software engineers maintain and improve the security of our codebases while maintaining developer speed.</p>
<p>But implementing these frameworks comes with practical challenges, like design tradeoffs. Building a secure framework on top of Android APIs, for example, requires a thoughtful balance between security, usability, and maintainability.</p>
<p>With the emergence of AI-driven tools and automation, we can scale the adoption of these frameworks across Meta’s large codebase. AI can assist in identifying insecure usage patterns, suggesting or automatically applying secure framework replacements, and continuously monitoring compliance. This not only accelerates migration but also ensures consistent security enforcement at scale.</p>
<p>Together, these strategies empower our development teams to ship well-secured software efficiently, safeguarding user data and trust while maintaining high developer productivity across Meta’s vast ecosystem.</p>
<h2>How We Design Secure-by-Default Frameworks at Meta</h2>
<p>Designing secure-by-default frameworks for use by a large number of developers shipping vastly different features across multiple apps is an interesting challenge. There are a lot of competing concerns such as discoverability, usability, maintainability, performance, and security benefits. </p>
<p>Practically speaking, developers only have a finite amount of time to code each day. The goal of our frameworks is to improve product security while being largely invisible and friction-free to avoid slowing developers down unnecessarily. This means that we have to correctly balance all those competing concerns discussed above. If we strike the wrong balance, some developers could avoid using our frameworks, which could reduce our ability to prevent security vulnerabilities. </p>
<p>For example, if we design a framework that improves product security in one area but introduces three new concepts and requires developers to provide five additional pieces of information per call site, some app developers may try to find a way around using them. Conversely, if we provide these same frameworks that are trivially easy to use, but they consume noticeable amounts of CPU and RAM, some app developers may, again, seek ways around using them, albeit for different reasons.</p>
<p>These examples might seem a bit obvious, but they are taken from real experiences over the last 10+ years developing ~15 secure-by-default frameworks targeting Android and iOS. Over that time, we’ve established some best practices for designing and implementing these new frameworks.</p>
<p>To the maximum extent possible, an effective framework should embody the following principles: </p>
<ul><li class="c1" aria-level="1"><strong>The secure framework API should resemble the existing API.</strong> This reduces the cognitive burden on framework users, forces security framework developers to minimize the complexity of the changes, and makes it easier to perform automated code conversion from the insecure to secure API usage.</li>
<li class="c1" aria-level="1"><strong>The framework should itself be built on public and stable APIs</strong>. APIs from OS vendors and third parties change all the time, especially the non-public ones. Even if access to those APIs is technically allowed in some cases, building on top of private APIs is a recipe for constant fire drills (best case) and dead-end investment in frameworks that simply can’t work with newer versions of operating systems and libraries (worst case).</li>
<li class="c1" aria-level="1"><strong>The framework should cover the maximum number of application users, not security use cases</strong>. There shouldn’t be one security framework that covers all security issues, and not every security issue is general enough to deserve its own framework. However, each security framework should be usable across all apps and OS versions for a particular platform. Small libraries are faster to build and deploy, and easier to maintain and explain to app developers.</li>
</ul><p>Now that we’ve looked at the design philosophy behind our frameworks, let’s look at one of our most widely used Android security frameworks, SecureLinkLauncher.</p>
<h2>SecureLinkLauncher: Preventing Android Intent Hijacking</h2>
<p>SecureLinkLauncher (SLL) is one of our widely-used secure frameworks. SLL is designed to prevent sensitive data from spilling through the <a href="https://developer.android.com/guide/components/intents-filters" target="_blank" rel="noopener">Android intents system</a>. It exemplifies our approach to secure-by-default frameworks by wrapping native Android intent launching methods with scope verification and security checks, preventing common vulnerabilities such as intent hijacking without sacrificing developer velocity or familiarity.</p>
<p>The system consists of intent senders and intent receivers; SLL targets intent senders.</p>
<p>SLL offers a semantic API that closely mirrors the familiar Android Context API for launching intents, including methods like startActivity() and startActivityForResult(). Instead of invoking the potentially insecure Android API directly, such as context.startActivity(intent), developers use SecureLinkLauncher with a similar method-call pattern, for example, SecureLinkLauncher.launchInternalActivity(intent, context). Internally, SecureLinkLauncher delegates to the stable Android startActivity() API, ensuring that all intent launches are securely verified and protected by the framework.</p>
<pre class="line-numbers"><code class="language-java">public void launchInternalActivity(Intent intent, Context context) {
   // Verify that the target activity is internal (same package)
   if (!isInternalActivity(intent, context)) {
       throw new SecurityException("Target activity is not internal");
   }
   // Delegate to Android's startActivity to launch the intent
   context.startActivity(intent);
}
</code></pre>
<p>Similarly, instead of calling context.startActivityForResult(intent, code) directly, developers use SecureLinkLauncher.launchInternalActivityForResult(intent, code, context). SecureLinkLauncher (SLL) wraps Android’s startActivity() and related methods, enforcing scope verification before delegating to the native Android API. This approach provides security by default while preserving the familiar Android intent-launching semantics.</p>
<p>One of the most common ways that data is spilled through intents is incorrect targeting of the intent. As an example, the following intent doesn’t target a specific package. This means it can be received by any app with a matching &lt;intent-filter&gt;. While the developer’s intention might be that their intent ends up in the Facebook app based on the URL, the reality is that any app, including a malicious application, could add an &lt;intent-filter&gt; that handles that URL and receive the intent.</p>
<pre class="line-numbers"><code class="language-java">// Implicit intent: no target package, so any app whose intent filter
// matches the URL can receive it.
Intent intent = new Intent(Intent.ACTION_VIEW, Uri.parse(FBLinks.PREFIX + "profile"));
intent.putExtra(SECRET_INFO, user_id);
startActivity(intent);
// startActivity can’t ensure who the receiver of the intent will be</code></pre>
<p>In the example below, SLL ensures that the intent is directed to one of the family apps, as specified by the developer’s scope for implicit intents. Without SLL, these intents can resolve to both family and non-family apps, potentially exposing SECRET_INFO to third-party or malicious apps on the user’s device. By enforcing this scope, SLL can prevent such information leaks.</p>
<pre class="line-numbers"><code class="language-java">SecureLinkLauncher.launchFamilyActivity(intent, context); 
// launchFamilyActivity would make sure intent goes to the meta family apps</code></pre>
<p>In a typical Android environment, two scopes – internal and external – might seem sufficient for handling intents within the same app and between different apps. However, Meta’s ecosystem is unique, comprising multiple apps such as Facebook, Instagram, Messenger, WhatsApp, and their variants (e.g., WhatsApp Business). The complexity of inter-process communication between these apps demands more nuanced control over intent scoping. To address this need, SLL provides a more fine-grained approach to intent scoping, offering scopes that cater to specific use cases:</p>
<ul><li class="c1" aria-level="1"><strong>Family scope</strong>: Enables secure communication between Meta-owned apps, ensuring that intents are only sent from one Meta app to another.</li>
<li class="c1" aria-level="1"><strong>Same-key scope</strong>: Restricts intent sending to Meta apps signed with the same key (not all Meta apps are signed by the same key), providing an additional layer of security and trust.</li>
<li class="c1" aria-level="1"><strong>Internal scope</strong>: Restricts intent sending within the app itself.</li>
<li class="c1" aria-level="1"><strong>Third-party scope</strong>: Allows intents to be sent to third-party apps, while preventing them from being handled by Meta’s own apps.</li>
</ul><p>By leveraging these scopes, developers can ensure that sensitive data is shared securely and intentionally within the Meta ecosystem, while also protecting against unintended or malicious access. SLL’s fine-grained intent scoping capabilities, which are built upon the secure-by-default framework principles discussed above, empower developers to build more robust and secure applications that meet the unique demands of Meta’s complex ecosystem.</p>
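<p>Conceptually, every SLL launch reduces to a policy question: given the developer-declared scope and the package the intent resolves to, is the launch allowed? The sketch below captures that idea in plain Java. The class name, method, and hard-coded package lists are illustrative assumptions; the real framework derives this information from the platform (for example, from package signing certificates) rather than from static lists:</p>
<pre class="line-numbers"><code class="language-java">import java.util.Set;

// Hypothetical sketch of SLL-style scope checking. The scope names mirror
// the post, but everything else here is an illustrative assumption.
class ScopePolicy {
    enum Scope { INTERNAL, FAMILY, SAME_KEY, THIRD_PARTY }

    // Illustrative stand-ins for information SLL would obtain at runtime.
    static final Set<String> FAMILY_PACKAGES =
        Set.of("com.facebook.katana", "com.instagram.android", "com.whatsapp");
    static final Set<String> SAME_KEY_PACKAGES =
        Set.of("com.facebook.katana", "com.instagram.android");

    // May an intent that resolves to targetPackage be launched from
    // ownPackage under the given scope?
    static boolean isAllowed(Scope scope, String ownPackage, String targetPackage) {
        switch (scope) {
            case INTERNAL:
                return ownPackage.equals(targetPackage);   // same app only
            case FAMILY:
                return FAMILY_PACKAGES.contains(targetPackage);
            case SAME_KEY:
                return SAME_KEY_PACKAGES.contains(targetPackage);
            case THIRD_PARTY:
                return !FAMILY_PACKAGES.contains(targetPackage);
            default:
                return false;
        }
    }
}
</code></pre>
<p>A launch whose resolved target fails the check for its declared scope would be blocked (or rerouted) before delegating to startActivity(), which is what turns a silent data spill into an enforced policy decision.</p>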
<h2>Leveraging Generative AI To Deploy Secure-by-Default Frameworks at Scale</h2>
<p>Adopting these frameworks in a large codebase is non-trivial. The main complexity is choosing the correct scope, as that choice relies on information that is not readily available at existing call sites. While one could imagine a deterministic analysis attempting to infer the scope based on dataflows, that would be a large undertaking. Furthermore, it would likely have some precision-scalability trade-off. </p>
<p>Instead, we explored using Generative AI for this case. AI can read the surrounding code and attempt to infer the scope based on variable names and comments surrounding the call site. While this approach isn’t always perfect, it doesn’t need to be. It just needs to provide good enough guesses, such that code owners can one-click accept suggested patches. </p>
<p>If the patches are correct in most cases, this is a big timesaver that enables efficient adoption of the framework. This complements our <a href="https://engineering.fb.com/2025/04/29/ai-research/autopatchbench-benchmark-ai-powered-security-fixes/" target="_blank" rel="noopener">recent work on AutoPatchBench</a>, a benchmark designed to evaluate AI-powered patch generators that leverage large language models (LLMs) to automatically recommend and apply security patches. Secure-by-default frameworks are a great example of the kinds of code modifications that an automatic patching system can apply to improve the security of a code base.</p>
<p>We’ve built a framework leveraging Llama as the core technology, which takes locations in the codebase that we want to migrate and suggests patches for code owners to accept:</p>
<p><img class="alignnone size-full wp-image-23462" src="https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png" alt="" width="1999" height="750" srcset="https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png?resize=916,344 916w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png?resize=768,288 768w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png?resize=1024,384 1024w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png?resize=1536,576 1536w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png?resize=96,36 96w, https://engineering.fb.com/wp-content/uploads/2025/12/Meta-Secure-By-Default-frameworks.png?resize=192,72 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h3>Prompt Creation</h3>
<p>The AI workflow starts with a call site we want to migrate, including its file path and line number. The location is used to extract a code snippet from the code base: we open the file containing the call site, copy 10-20 lines before and after it, and paste this into the prompt template, which gives general instructions on how to perform the migration. This description is very similar to what would be written as an onboarding guide to the framework for human engineers.</p>
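<p>As a rough illustration of this step, the sketch below extracts a context window around a call site and splices it into migration instructions. PromptBuilder, the window size, and the instruction text are hypothetical stand-ins for the real prompt template:</p>
<pre class="line-numbers"><code class="language-java">import java.util.List;

// Hypothetical sketch of the snippet-extraction step; not Meta's actual
// prompt template or tooling.
class PromptBuilder {
    // Keep `window` lines of context on each side of the call site
    // (0-indexed line number) and splice the snippet into the instructions.
    static String build(List<String> fileLines, int callSiteLine, int window) {
        int start = Math.max(0, callSiteLine - window);
        int end = Math.min(fileLines.size(), callSiteLine + window + 1);
        String snippet = String.join("\n", fileLines.subList(start, end));
        return "Migrate the intent launch below to the secure framework,\n"
            + "choosing the scope implied by the surrounding code:\n\n"
            + snippet;
    }
}
</code></pre>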
<h3>Generative AI</h3>
<p>The prompt is then provided to a Llama model (llama4-maverick-17b-128e-instruct). The model is asked to output two things: the modified code snippet, where the call site has been migrated, and, optionally, some actions (like adding an import to the top of a file). Actions work around a limitation of this approach: all other code changes are local to the snippet. They let the model make limited, deterministic changes outside the snippet, which is useful for adding imports or dependencies that are rarely local to the code snippet but are necessary for the code to compile. The modified snippet is then inserted back into the code base and any actions are applied.</p>
<h3>Validation</h3>
<p>Finally, we perform a series of validations on the code base. We run all of these with and without the AI changes and only report the difference:</p>
<ul><li class="c1" aria-level="1">Lints: We run the linters again to confirm the lint issue was fixed and no new lint errors were introduced by the changes.</li>
<li class="c1" aria-level="1">Compiling: We compile and run tests covering the targeted file. This is not intended to catch all bugs (we rely on continuous integration for that), but to give the AI early feedback on its changes (such as compile errors).</li>
<li class="c1" aria-level="1">Formatting: The code is formatted to avoid formatting issues. We do not feed the formatting errors back to the AI.</li>
</ul><p>If any errors arise during the validation, their error messages are included in the prompt (along with the “fixed” code snippet) and the AI is asked to try again. We repeat this loop five times and give up if no successful fix is created. If the validation succeeds, we submit a patch for human review.</p>
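<p>The retry loop described above can be sketched as follows, with the model call and the lint/compile/format checks abstracted as functions. All names here are illustrative, not the production system:</p>
<pre class="line-numbers"><code class="language-java">import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the validate-and-retry loop. `generate` stands in
// for the model call (prompt -> candidate patch) and `validate` for the
// lint/compile/format checks (patch -> error messages, empty when clean).
class FixLoop {
    static Optional<String> repair(
            String prompt,
            Function<String, String> generate,
            Function<String, List<String>> validate,
            int maxAttempts) {
        String currentPrompt = prompt;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String patch = generate.apply(currentPrompt);
            List<String> errors = validate.apply(patch);
            if (errors.isEmpty()) {
                return Optional.of(patch); // validated: submit for human review
            }
            // Feed the failing patch and its error messages back into the prompt.
            currentPrompt = prompt + "\nPrevious attempt:\n" + patch
                + "\nErrors:\n" + String.join("\n", errors);
        }
        return Optional.empty(); // give up after maxAttempts failed fixes
    }
}
</code></pre>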
<h2>Thoughtful Framework Design Meets Intelligent Automation</h2>
<p>By adhering to core design principles – providing an API that closely resembles existing OS patterns, relying solely on public and stable OS APIs, and designing frameworks that cover broad user bases rather than niche use cases – developers can create robust, secure-by-default features that integrate seamlessly into existing codebases. These same design principles help us leverage AI to smoothly adopt frameworks at scale. While there are still challenges around the accuracy of generated code – for example, the AI choosing the incorrect scope or using incorrect syntax – the internal feedback-loop design allows the LLM to automatically move past easily solvable problems without human intervention, increasing scalability and reducing developer frustration.</p>
<p>Internally, this project helped prove that AI could be impactful for adopting security frameworks across a diverse codebase in a way that is minimally disruptive to our developers. There are now a variety of projects tackling similar problems across a variety of codebases and languages – including C and C++ – using diverse models and validation techniques. We expect this trend to continue and accelerate in 2026 as developers become more comfortable with state-of-the-art AI tools and the quality of code that they are capable of producing.</p>
<p>As our codebase grows and security threats become more sophisticated, the combination of thoughtful framework design and intelligent automation will be essential to protecting user data and maintaining trust at scale.</p>]]></description>
      <link>https://engineering.fb.com/2025/12/15/android/how-ai-transforming-secure-by-default-mobile-frameworks-adoption/</link>
      <guid>https://engineering.fb.com/2025/12/15/android/how-ai-transforming-secure-by-default-mobile-frameworks-adoption/</guid>
      <pubDate>Mon, 15 Dec 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. </li>
<li class="c1" aria-level="1">Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. </li>
<li class="c1" aria-level="1">Zoomer has delivered training-time reductions and significant QPS improvements, making it the de facto tool for AI performance optimization across Meta’s entire AI infrastructure.</li>
</ul><p>At the scale that Meta’s AI infrastructure operates, poor performance debugging can lead to massive energy inefficiency, increased operational costs, and suboptimal hardware utilization across hundreds of thousands of GPUs. The fundamental challenge is achieving maximum computational efficiency while minimizing waste. Every percentage point of utilization improvement translates to significant capacity gains that can be redirected to innovation and growth.</p>
<p>Zoomer is Meta’s automated, one-stop-shop platform for performance profiling, debugging, analysis, and optimization of AI training and inference workloads. Since its inception, Zoomer has become the de facto tool across Meta for GPU workload optimization, generating tens of thousands of profiling reports daily for teams across all of our apps. </p>
<h2>Why Debugging Performance Matters</h2>
<p><a href="https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/" target="_blank" rel="noopener">Our AI infrastructure</a> supports <a href="https://engineering.fb.com/2024/06/12/production-engineering/maintaining-large-scale-ai-capacity-meta/" target="_blank" rel="noopener">large-scale and advanced workloads across a global fleet of GPU clusters, continually evolving to meet the growing scale and complexity of generative AI</a>.</p>
<p>At the training level it supports a diverse range of workloads, including powering models for <a href="https://engineering.fb.com/2025/11/10/ml-applications/metas-generative-ads-model-gem-the-central-brain-accelerating-ads-recommendation-ai-innovation/" target="_blank" rel="noopener">ads ranking</a>, <a href="https://engineering.fb.com/2025/05/21/production-engineering/journey-to-1000-models-scaling-instagrams-recommendation-system/" target="_blank" rel="noopener">content recommendations</a>, and <a href="https://engineering.fb.com/2025/05/20/web/metas-full-stack-hhvm-optimizations-for-genai/" target="_blank" rel="noopener">GenAI features</a>.  </p>
<p>At the inference level, we serve hundreds of trillions of AI model executions per day.</p>
<p>Operating at this scale means putting a high priority on eliminating GPU underutilization. Training inefficiencies delay model iterations and product launches, while inference bottlenecks limit our ability to serve user requests at scale. Removing resource waste and accelerating workflows helps us train larger models more efficiently, serve more users, and reduce our environmental footprint.</p>
<h2>AI Performance Optimization Using Zoomer</h2>
<p>Zoomer is an automated debugging and optimization platform that works across all of our AI model types (ads recommendations, GenAI, computer vision, etc.) and both training and inference paradigms, providing deep performance insights that enable energy savings, workflow acceleration, and efficiency gains.  </p>
<p>Zoomer’s architecture consists of three essential layers that work together to deliver comprehensive AI performance insights: </p>
<h3>Infrastructure and Platform Layer</h3>
<p>The foundation provides the enterprise-grade scalability and reliability needed to profile workloads across Meta’s massive infrastructure. This includes distributed storage systems using <a href="https://www.youtube.com/watch?v=tddb-zbmnTo">Manifold</a> (Meta’s blob storage platform) for trace data, fault-tolerant processing pipelines that handle huge trace files, and low-latency data collection with automatic profiling triggers across thousands of hosts simultaneously. The platform maintains high availability and scale through redundant processing workers and can handle huge numbers of profiling requests during peak usage periods.</p>
<h3>Analytics and Insights Engine</h3>
<p>The core intelligence layer delivers deep analytical capabilities through multiple specialized analyzers. This includes: GPU trace analysis via Kineto integration and NVIDIA DCGM, CPU profiling through <a href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/" target="_blank" rel="noopener">StrobeLight</a> integration, host-level metrics analysis via <a href="https://developers.facebook.com/blog/post/2022/11/16/dynolog-open-source-system-observability/" target="_blank" rel="noopener">dyno telemetry</a>, communication pattern analysis for distributed training, straggler detection across distributed ranks, memory allocation profiling (including GPU memory snooping), request/response profiling for inference workloads, and much more. The engine automatically detects performance anti-patterns and also provides actionable recommendations.</p>
<h3>Visualization and User Interface Layer</h3>
<p>The presentation layer transforms complex performance data into intuitive, actionable insights. This includes interactive timeline visualizations showing GPU activity across thousands of ranks, multi-iteration analysis for long-running training workloads, drill-down dashboards with percentile analysis across devices, trace data visualization integrated with Perfetto for kernel-level inspection, heat map visualizations for identifying outliers across GPU deployments, and automated insight summaries that highlight critical bottlenecks and optimization opportunities.</p>
<figure id="attachment_23438" aria-describedby="caption-attachment-23438" class="wp-caption alignnone c2"><img class="size-full wp-image-23438" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png" alt="" width="1928" height="1508" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png 1928w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png?resize=916,716 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png?resize=768,601 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png?resize=1024,801 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png?resize=1536,1201 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png?resize=96,75 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Zoomer-architecture.png?resize=192,150 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23438" class="wp-caption-text">The three essential layers of Zoomer’s architecture.</figcaption></figure><h2>How Zoomer Profiling Works: From Trigger to Insights</h2>
<p>Understanding how Zoomer conducts a complete performance analysis provides insight into its sophisticated approach to AI workload optimization.</p>
<h3>Profiling Trigger Mechanisms</h3>
<p>Zoomer operates through both automatic and on-demand profiling strategies tailored to different workload types. For training workloads, which involve multiple iterations and can run for days or weeks, Zoomer automatically triggers profiling around iteration 550-555 to capture stable-state performance while avoiding startup noise. For inference workloads, profiling can be triggered on-demand for immediate debugging or through integration with automated load testing and benchmarking systems for continuous monitoring.</p>
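<p>The automatic trigger for training workloads reduces to a simple predicate over the iteration counter. The sketch below is illustrative only; the function name is not Zoomer’s API, and only the 550-555 stable-state window comes from the description above.</p>

```python
# Minimal sketch of an iteration-window profiling trigger, using the
# stable-state window (iterations 550-555) described above. The function
# name is illustrative, not Zoomer's actual API.

PROFILE_START, PROFILE_END = 550, 555

def should_profile(iteration: int) -> bool:
    """Profile only inside the stable-state window, skipping startup noise."""
    return PROFILE_START <= iteration <= PROFILE_END

# In a training loop, the trigger fires for exactly six iterations:
profiled = [i for i in range(1000) if should_profile(i)]
```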
<h3>Comprehensive Data Capture</h3>
<p>During each profiling session, Zoomer simultaneously collects multiple data streams to build a holistic performance picture: </p>
<ul><li class="c1" aria-level="1"><strong>GPU Performance Metrics</strong>: SM utilization, GPU memory utilization, GPU busy time, memory bandwidth, Tensor Core utilization, power consumption, and clock frequencies via DCGM integration.</li>
<li class="c1" aria-level="1"><strong>Detailed Execution Traces</strong>: Kernel-level GPU operations, memory transfers, CUDA API calls, and communication collectives via <a href="https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html" target="_blank" rel="noopener">PyTorch Profiler</a> and <a href="https://github.com/pytorch/kineto" target="_blank" rel="noopener">Kineto</a>.</li>
<li class="c1" aria-level="1"><strong>Host-Level Performance Data</strong>: CPU utilization, memory usage, network I/O, storage access patterns, and system-level bottlenecks via dyno telemetry.</li>
<li class="c1" aria-level="1"><strong>Application-Level Annotations</strong>: Training iterations, forward/backward passes, optimizer steps, data loading phases, and custom user annotations.</li>
<li class="c1" aria-level="1"><strong>Inference-Specific Data</strong>: Rate of inference requests, server latency, active requests, GPU memory allocation patterns, request latency breakdowns via Strobelight’s Crochet profiler, serving parameter analysis, and thrift request-level profiling.</li>
<li class="c1" aria-level="1"><strong>Communication Analysis</strong>: NCCL collective operations, inter-node communication patterns, and network utilization for distributed workloads.</li>
</ul><h3>Distributed Analysis Pipeline</h3>
<p>Raw profiling data flows through sophisticated processing systems that deliver multiple types of automated analysis including:</p>
<ul><li class="c1" aria-level="1"><strong>Straggler Detection</strong>: Identifies slow ranks in distributed training through comparative analysis of execution timelines and communication patterns.</li>
<li class="c1" aria-level="1"><strong>Bottleneck Analysis</strong>: Automatically detects CPU-bound, GPU-bound, memory-bound, or communication-bound performance issues.</li>
<li class="c1" aria-level="1"><strong>Critical Path Analysis</strong>: Systematically identifies the longest execution paths to focus optimization efforts on highest-impact opportunities.</li>
<li class="c1" aria-level="1"><strong>Anti-Pattern Detection</strong>: Rule-based systems that identify common efficiency issues and generate specific recommendations.</li>
<li class="c1" aria-level="1"><strong>Parallelism Analysis</strong>: Deep understanding of tensor, pipeline, data, and expert parallelism interactions for large-scale distributed training.</li>
<li class="c1" aria-level="1"><strong>Memory Analysis</strong>: Comprehensive analysis of GPU memory usage patterns, allocation tracking, and leak detection.</li>
<li class="c1" aria-level="1"><strong>Load Imbalance Analysis</strong>: Detects workload distribution issues across distributed ranks and recommends optimizations.</li>
</ul><h3>Multi-Format Output Generation</h3>
<p>Results are presented through multiple interfaces tailored to different user needs: interactive timeline visualizations showing activity across all ranks and hosts, comprehensive metrics dashboards with drill-down capabilities and percentile analysis, trace viewers integrated with Perfetto for detailed kernel inspection, automated insights summaries highlighting key bottlenecks and recommendations, and actionable notebooks that users can clone to rerun jobs with suggested optimizations.</p>
<h3>Specialized Workload Support</h3>
<p>For massive distributed training of specialized workloads, like GenAI, Zoomer contains a purpose-built platform supporting LLM workloads that offers specialized capabilities including GPU efficiency heat maps and N-dimensional parallelism visualization. For inference, specialized analysis currently covers single-GPU models and will soon expand to massive distributed inference across thousands of servers.</p>
<h2>A Glimpse Into Advanced Zoomer Capabilities</h2>
<p>Zoomer offers an extensive suite of advanced capabilities designed for different AI workload types and scales. While a comprehensive overview of all features would require multiple blog posts, here’s a glimpse at some of the most compelling capabilities that demonstrate Zoomer’s depth:</p>
<p><strong>Training Powerhouse Features</strong>:</p>
<ul><li class="c1" aria-level="1"><strong>Straggler Analysis</strong>: Helps identify ranks in distributed training jobs that are significantly slower than others, causing overall job delays due to synchronization bottlenecks. Zoomer provides information that helps diagnose root causes like sharding imbalance or hardware issues.</li>
<li class="c1" aria-level="1"><strong>Critical Path Analysis</strong>: Identification of the longest execution paths in PyTorch applications, enabling accurate performance improvement projections. </li>
<li class="c1" aria-level="1"><strong>Advanced Trace Manipulation</strong>: Sophisticated tools for compression, filtering, combination, and segmentation of massive trace files (2GB+ per rank), enabling analysis of previously impossible-to-process large-scale training jobs.</li>
</ul><p><strong>Inference Excellence Features</strong>:</p>
<ul><li class="c1" aria-level="1"><strong>Single-Click QPS Optimization</strong>: A workflow that identifies bottlenecks and triggers automated load tests with one click, reducing optimization time while delivering QPS improvements of +2% to +50% across different models, depending on model characteristics. </li>
<li class="c1" aria-level="1"><strong>Request-Level Deep Dive</strong>: Integration with Crochet profiler provides <a href="https://engineering.fb.com/2014/02/20/open-source/under-the-hood-building-and-open-sourcing-fbthrift/">Thrift</a> request-level analysis, enabling identification of queue time bottlenecks and serving inefficiencies that traditional metrics miss.</li>
<li class="c1" aria-level="1"><strong>Realtime Memory Profiling</strong>: GPU memory allocation tracking, providing live insights into memory leaks, allocation patterns, and optimization opportunities.</li>
</ul><p><strong>GenAI Specialized Features</strong>:</p>
<ul><li class="c1" aria-level="1"><strong>LLM Zoomer for Scale</strong>: A purpose-built platform supporting 100k+ GPU workloads with N-dimensional parallelism visualization, GPU efficiency heat maps across thousands of devices, and specialized analysis for tensor, pipeline, data, and expert parallelism interactions.</li>
<li class="c1" aria-level="1"><strong>Post-Training Workflow Support</strong>: Enhanced capabilities for GenAI post-training tasks including SFT, DPO, and ARPG workflows with generator and trainer profiling separation.</li>
</ul><p><strong>Universal Intelligence Features</strong>:</p>
<ul><li class="c1" aria-level="1"><strong>Holistic Trace Analysis (HTA)</strong>: Advanced framework for diagnosing distributed training bottlenecks across communication overhead, workload imbalance, and kernel inefficiencies, with automatic load balancing recommendations.</li>
<li class="c1" aria-level="1"><strong>Zoomer Actionable Recommendations Engine (Zoomer AR)</strong>: Automated detection of efficiency anti-patterns with machine learning-driven recommendation systems that generate auto-fix diffs, optimization notebooks, and one-click job re-launches with suggested improvements.</li>
<li class="c1" aria-level="1"><strong>Multi-Hardware Profiling</strong>: Native support across NVIDIA GPUs, AMD MI300X, <a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/">MTIA</a>, and CPU-only workloads with consistent analysis and optimization recommendations regardless of hardware platform.</li>
</ul><h2>Zoomer’s Optimization Impact: From Debugging to Energy Efficiency</h2>
<p>Performance debugging with Zoomer creates a cascading effect that transforms low-level optimizations into massive efficiency gains. </p>
<p>The optimization pathway flows from identifying bottlenecks → improving key metrics → accelerating workflows → reducing resource consumption → saving energy and costs.</p>
<h3>Zoomer’s Training Optimization Pipeline</h3>
<p>Zoomer’s training analysis identifies bottlenecks in GPU utilization, memory bandwidth, and communication patterns. </p>
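<p>One of the comparative-timing analyses described earlier, straggler detection, can be sketched as an outlier test over per-rank iteration times. The median-plus-MAD threshold below is an illustrative heuristic, not Zoomer’s actual rule.</p>

```python
# Hedged sketch of straggler detection via comparative timing across ranks.
# The threshold heuristic (median + k * MAD) is an assumption for
# illustration, not Zoomer's actual detection rule.

from statistics import median

def find_stragglers(iter_times: list[float], k: float = 3.0) -> list[int]:
    """Return ranks whose iteration time is an outlier versus the fleet."""
    med = median(iter_times)
    # Median absolute deviation; avoid a zero threshold on uniform fleets.
    mad = median(abs(t - med) for t in iter_times) or 1e-9
    return [rank for rank, t in enumerate(iter_times) if t > med + k * mad]
```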
<p><strong>Example of Training Efficiency Wins: </strong></p>
<ul><li class="c1" aria-level="1"><strong>Algorithmic Optimizations</strong>: We delivered <strong>power savings</strong> through systematic efficiency improvements across the training fleet, fixing reliability issues for low-efficiency jobs.</li>
<li class="c1" aria-level="1"><strong>Training Time Reduction Success</strong>: In 2024, we observed a 75% training-time reduction for Ads relevance models, leading to a 78% reduction in power consumption.</li>
<li class="c1" aria-level="1"><strong>Memory Optimizations</strong>: One-line code changes, addressing inefficient memory copies identified by Zoomer, delivered <strong>20% QPS improvements</strong> with minimal engineering effort. </li>
</ul><h3>Zoomer’s Inference Optimization Pipeline</h3>
<p>Inference debugging focuses on latency reduction, throughput optimization, and serving efficiency. Zoomer identifies opportunities in kernel execution, memory access patterns, and serving parameter tuning to maximize requests per GPU.</p>
<p><strong>Inference Efficiency Wins:</strong></p>
<ul><li class="c1" aria-level="1"><strong>GPU and CPU Serving Parameter Improvements</strong>: Automated GPU and CPU bottleneck identification and parameter tuning, leading to a 10% to 45% reduction in power consumption.</li>
<li class="c1" aria-level="1"><strong>QPS Optimization</strong>: GPU trace analysis used to boost serving QPS and optimize serving capacity.</li>
</ul><h3>Zoomer’s GenAI and Large-Scale Impact</h3>
<p>For massive distributed workloads, even small optimizations compound dramatically. <strong>32k GPU benchmark optimizations</strong> achieved 30% speedups through broadcast issue resolution, while <strong>64k GPU configurations</strong> delivered 25% speedups in just one day of optimization.</p>
<h2>The Future of AI Performance Debugging</h2>
<p>As AI workloads expand in size and complexity, Zoomer is advancing to meet new challenges focused on several innovation fronts: broadening unified performance insights across heterogeneous hardware (including MTIA and next-gen accelerators), building advanced analyzers for proactive optimization, enabling inference performance tuning through serving parameter optimization, and democratizing optimization with automated, intuitive tools for all engineers. As Meta’s AI infrastructure continues its rapid growth, Zoomer plays an important role in helping us innovate efficiently and sustainably.</p>
      <link>https://engineering.fb.com/2025/11/21/data-infrastructure/zoomer-powering-ai-performance-meta-intelligent-debugging-optimization/</link>
      <guid>https://engineering.fb.com/2025/11/21/data-infrastructure/zoomer-powering-ai-performance-meta-intelligent-debugging-optimization/</guid>
      <pubDate>Fri, 21 Nov 2025 22:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Key Transparency Comes to Messenger]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re excited to share another advancement in the security of your conversations on Messenger: the launch of key transparency verification for end-to-end encrypted chats. </li>
<li class="c1" aria-level="1">This new feature enables an additional level of assurance that only you — and the people you’re communicating with — can see or listen to what is sent, and that no one else, not even Meta, can do so.</li>
</ul><p><a href="https://www.facebook.com/help/messenger-app/1084673321594605" target="_blank" rel="noopener">End-to-end encryption on Messenger</a> already ensures that the content of your direct messages and calls are protected from the moment they leave your device to the moment they reach the receiver’s device. As part of our end-to-end encrypted chat platform, we believe it’s also important that anyone can verify that the public keys (used by the sender’s device for encrypting each message) belong to the intended recipients and haven’t been tampered with.</p>
<p>This launch builds upon the valuable work and experiences shared by others in the industry. <a href="https://engineering.fb.com/2023/04/13/security/whatsapp-key-transparency/" target="_blank" rel="noopener">WhatsApp’s implementation of key transparency</a> in 2023 demonstrated the feasibility of this technology for large-scale encrypted messaging. We’ve extended these pioneering efforts in our Messenger implementation to deliver a robust and reliable solution with similar security properties.</p>
<h2>What Is Key Transparency?</h2>
<p>Key transparency provides messaging users with a verifiable and auditable record of public keys. It allows them to confirm that their conversations are indeed encrypted with the correct keys for their contacts, and that these keys haven’t been maliciously swapped by a compromised server. This means you can be more confident that your messages are only accessible to the people you intend to communicate with.</p>
<p><img class="alignnone size-full wp-image-23383" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png" alt="" width="1999" height="1646" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png?resize=916,754 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png?resize=768,632 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png?resize=1024,843 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png?resize=1536,1265 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png?resize=96,79 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Key-Transparency-Messenger.png?resize=192,158 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>You can already <a href="https://www.facebook.com/help/messenger-app/147596532316790/Check+your+keys+for+end-to-end+encrypted+chats+on+Messenger" target="_blank" rel="noopener">check your keys for end-to-end encrypted chats on Messenger</a>, but this can be cumbersome for people who have logged in to Messenger on multiple devices, each of which has its own key. Moreover, these keys change when new devices are added or are re-registered, which necessitates another check of the key every time this happens. </p>
<p>To address this, we’ve added a new security feature, based on key transparency, that allows users to verify these keys without having to compare them manually with their contacts. Of course, anyone who wishes to continue manually verifying their keys is free to do so.</p>
<h2>How We’re Handling Messenger Keys at Scale</h2>
<p>Our key transparency implementation leverages the <a href="https://github.com/facebook/akd" target="_blank" rel="noopener">Auditable Key Directory (AKD) library</a>, mirroring the system already in place for WhatsApp. This system allows Meta to securely distribute and verify users’ public keys. To further enhance the security of this process, we use <a href="https://developers.cloudflare.com/key-transparency/" target="_blank" rel="noopener">Cloudflare’s key transparency auditor</a> to provide an additional layer of verification, ensuring that the distribution of keys is transparent and verifiable by anyone. Cloudflare’s auditor maintains a live log of the latest entries on the <a href="https://dash.key-transparency.cloudflare.com/" target="_blank" rel="noopener">Key Transparency Dashboard</a>, for both the WhatsApp and Messenger directories.</p>
<p>Implementing key transparency on the scale of Messenger presented unique engineering challenges. One significant factor was the sheer volume and frequency of key updates. Messenger indexes keys for each and every device someone has logged in on, which means that a single user often has multiple, frequently-changing keys associated with their account.</p>
<p>This increased complexity leads to a much higher frequency of key updates being sequenced into our key transparency directory. Currently, we’re observing an epoch frequency of approximately 2 minutes per publish, with hundreds of thousands of new keys added in each epoch. Since we began indexing, our database has already grown to billions of key entries. We’ve implemented a number of advancements in our infrastructure and libraries to help manage this massive and constantly growing dataset, while ensuring high availability and real-time verification:</p>
<p>We improved the algorithmic efficiency of the existing key lookup and verification operations in the AKD library by optimizing for smaller proof sizes, even as the number of updates (versions) for a single key grows. Previously, these proofs grew linearly with the height of the transparency tree, which was still difficult to manage given the number of nodes in the tree.</p>
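<p>The proofs in question are Merkle-style audit paths, whose size tracks tree height because verification folds one sibling hash per level. The toy sketch below illustrates only that mechanism; AKD’s real construction is a VRF-based sparse Merkle tree and differs substantially.</p>

```python
# Simplified sketch of verifying a Merkle audit path against a root hash.
# AKD's real proofs use a VRF-based sparse Merkle construction; this toy
# version only illustrates why proof size tracks tree height.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(leaf: bytes, path: list[tuple[bytes, str]], root: bytes) -> bool:
    """Fold sibling hashes up the tree; the path holds one sibling per level."""
    node = h(leaf)
    for sibling, side in path:  # side: which side the sibling sits on
        if side == "left":
            node = h(sibling + node)
        else:
            node = h(node + sibling)
    return node == root
```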
<p>We also updated our existing infrastructure to be more resilient to temporary outages and improved the process for recovering from long delays in key sequencing. These improvements were adapted from lessons learned from running WhatsApp’s key transparency log for the past two years.</p>
<p>With key transparency now live on Messenger, users will have the ability to automatically verify the authenticity of their contacts’ encryption keys for one-on-one chats. This represents another step forward in our ongoing investment in providing a secure and private service.</p>
<p>Stay tuned for more updates as we continue to enhance the security and privacy of end-to-end encryption in Messenger.</p>]]></description>
      <link>https://engineering.fb.com/2025/11/20/security/key-transparency-comes-to-messenger/</link>
      <guid>https://engineering.fb.com/2025/11/20/security/key-transparency-comes-to-messenger/</guid>
      <pubDate>Thu, 20 Nov 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation]]></title>
      <description><![CDATA[<ul><li class="xdj266r x14z9mp xat24cr x1lziwak">We’ve released <strong>Ax 1.0</strong>, an open-source platform that uses machine learning to automatically guide complex, resource-intensive experimentation.</li>
<li class="xdj266r x14z9mp xat24cr x1lziwak">Ax is used at scale across Meta to improve AI models, tune production infrastructure, and accelerate advances in ML and even hardware design.</li>
<li class="xdj266r x14z9mp xat24cr x1lziwak">Our accompanying paper, “<a class="c1" href="https://openreview.net/forum?id=U1f6wHtG1g&amp;ref=engineeringatmeta" target="_blank" rel="noopener">Ax: A Platform for Adaptive Experimentation</a>” explains Ax’s architecture, methodology, and how it compares to other state-of-the-art black-box optimization libraries.</li>
</ul><p>How can researchers effectively understand and optimize AI models or systems that have a vast number of possible configurations? This is a challenge that is particularly prevalent in domains characterized by complex, interacting systems, such as modern AI development and deployment. Optimizing under these settings demands experimentation, and efficiency is of the utmost importance when evaluating a single configuration is extremely resource- and/or time-intensive.</p>
<p>Adaptive experimentation offers a solution to this problem by actively proposing new configurations for sequential evaluation, leveraging insights gained from previous evaluations.</p>
<p>This year, we released version 1.0 of Ax, an open source adaptive experimentation platform that leverages machine learning to guide and automate the experimentation process. Ax employs Bayesian optimization to enable researchers and developers to conduct efficient experiments, identifying optimal configurations to optimize their systems and processes.</p>
<p>In conjunction with this major release, we published a paper titled, “<a href="https://openreview.net/forum?id=U1f6wHtG1g&amp;ref=engineeringatmeta" target="_blank" rel="noopener">Ax: A Platform for Adaptive Experimentation</a>” that explores Ax’s core architecture, provides a deeper explanation of the methodology powering the optimization, and compares Ax’s performance against other black-box optimization libraries.</p>
<p>Ax has been successfully applied across various disciplines at Meta, including:</p>
<ul><li class="c2" aria-level="1">Traditional machine learning tasks, such as hyperparameter optimization and architecture search.</li>
<li class="c2" aria-level="1">Addressing key challenges in GenAI, including discovering optimal data mixtures for training AI models.</li>
<li class="c2" aria-level="1">Tuning infrastructure or compiler flags in production settings.</li>
<li class="c2" aria-level="1">Optimizing design parameters in physical engineering tasks, such as designing AR/VR devices.</li>
</ul><p>By utilizing Ax, developers can employ state-of-the-art methodology to conduct complex experiments, ultimately gaining a deeper understanding and optimizing their underlying systems.</p>
<h2>How to Get Started With Ax</h2>
<p>To start using Ax to efficiently tune parameters in complex systems, install the latest version of the library via `pip install ax-platform` and visit <a href="https://ax.dev/" target="_blank" rel="noopener">the Ax website</a> for a quickstart guide, tutorials, and deep dives on the methods that Ax uses under the hood.</p>
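<p>At its core, adaptive experimentation follows an ask-tell loop: propose a configuration, evaluate it, record the result, repeat. The toy sketch below shows that pattern with random search standing in for Ax’s Bayesian optimization; Ax’s actual client API differs, so consult the quickstart for real usage.</p>

```python
# Toy ask-tell loop illustrating the adaptive-experimentation pattern that
# Ax automates. Random search stands in for Ax's Bayesian optimization;
# the objective function here is a hypothetical stand-in for an expensive
# experiment.

import random

def evaluate(x: float) -> float:
    """Stand-in for an expensive experiment: minimize (x - 2)^2."""
    return (x - 2.0) ** 2

def optimize(n_trials: int = 50, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    best_x, best_loss = 0.0, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(-5.0, 5.0)   # "ask": propose a configuration
        loss = evaluate(x)           # run the (expensive) evaluation
        if loss < best_loss:         # "tell": record the observed result
            best_x, best_loss = x, loss
    return best_x, best_loss
```

In Ax, the "ask" step is driven by a surrogate model fit to all previous results rather than uniform sampling, which is what makes each evaluation count when trials are resource-intensive.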
<h2>Ax Is for Real World Experimentation</h2>
<p>Adaptive experiments are incredibly useful, but can be challenging to run. Not only do these experiments require the use of sophisticated machine learning methods to drive the optimization, they also demand specialized infrastructure for managing experiment state, automating orchestration, providing useful analysis and diagnostics, and more. Additionally, the goals of any given experiment are often more complex than simply improving a single metric. In practice, experimentation is usually a careful balance between multiple objective metrics subject to multiple constraints and guardrails.</p>
<p>We built Ax to empower users to easily configure and run these dynamic experiments using state-of-the-art techniques, and to provide a robust and mature platform for researchers to integrate cutting-edge methods directly into production systems.</p>
<h2>Ax for Understanding</h2>
<p>In addition to finding optimal configurations efficiently, Ax is a powerful tool for understanding the underlying system being optimized. Ax provides a suite of analyses (plots, tables, etc.) that help its users understand how the optimization is progressing over time, examine tradeoffs between different metrics via a <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto frontier</a>, visualize the effect of one or two parameters across the input space, and explain how much each input parameter contributes to the results (via sensitivity analysis).</p>
<p>These tools allow experimenters to walk away with both an optimal configuration to deploy to production and a deeper understanding of their system, which can inform decisions moving forward.</p>
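<p>As a concrete illustration of the Pareto frontier mentioned above: it is simply the set of non-dominated configurations. The sketch below is purely didactic (Ax computes frontiers from modeled outcomes, not a raw point sweep), with made-up example numbers:</p>

```python
def pareto_front(points):
    """Return the non-dominated subset of (quality, cost) points, where
    quality is maximized and cost is minimized. A point is dominated if
    some other point is at least as good on both axes and strictly
    better on at least one."""
    front = []
    for i, (q_i, c_i) in enumerate(points):
        dominated = any(
            q_j >= q_i and c_j <= c_i and (q_j > q_i or c_j < c_i)
            for j, (q_j, c_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((q_i, c_i))
    return front

# E.g., hypothetical model configurations measured as (accuracy, resource usage):
configs = [(0.90, 10.0), (0.85, 12.0), (0.80, 5.0), (0.70, 4.0)]
# (0.85, 12.0) is dominated by (0.90, 10.0); the other three are genuine
# tradeoffs between accuracy and cost, so they form the frontier.
```

<p>Every point on the frontier represents a defensible choice; which one to deploy depends on how much resource usage an extra point of accuracy is worth.</p>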
<p><img class="alignnone size-full wp-image-23323" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png" alt="" width="1674" height="1094" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png 1674w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png?resize=916,599 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png?resize=768,502 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png?resize=1024,669 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png?resize=1536,1004 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png?resize=96,63 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image1.png?resize=192,125 192w" sizes="(max-width: 992px) 100vw, 62vw" /><img class="alignnone size-full wp-image-23327" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png" alt="" width="1674" height="1094" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png 1674w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png?resize=916,599 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png?resize=768,502 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png?resize=1024,669 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png?resize=1536,1004 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png?resize=96,63 96w, 
https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-2.png?resize=192,125 192w" sizes="(max-width: 992px) 100vw, 62vw" /><img class="alignnone size-full wp-image-23328" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png" alt="" width="1674" height="1094" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png 1674w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png?resize=916,599 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png?resize=768,502 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png?resize=1024,669 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png?resize=1536,1004 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png?resize=96,63 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-3.png?resize=192,125 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><img class="alignnone size-full wp-image-23324" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png" alt="" width="1674" height="1094" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png 1674w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png?resize=916,599 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png?resize=768,502 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png?resize=1024,669 1024w, 
https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png?resize=1536,1004 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png?resize=96,63 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-4.png?resize=192,125 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>How Ax Works</h2>
<p>By default Ax uses Bayesian optimization, an effective adaptive experimentation method that excels at balancing <strong>exploration</strong> – learning how new configurations perform – and <strong>exploitation</strong> – refining configurations previously observed to be good. Ax relies on <a href="https://botorch.org/" target="_blank" rel="noopener">BoTorch</a> for its implementation of Bayesian optimization components.</p>
<p><img class="alignnone size-full wp-image-23325" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png" alt="" width="1920" height="1120" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png?resize=916,534 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png?resize=768,448 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png?resize=1024,597 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png?resize=1536,896 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-5.png?resize=192,112 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Bayesian optimization is an iterative approach to solving the global optimization problem <img src="https://s0.wp.com/latex.php?latex=argmax_%7Bx+%5Cin+X%7D+f%28x%29&amp;bg=ffffff&amp;fg=000&amp;s=0&amp;c=20201002" alt="argmax_{x \in X} f(x)" class="latex" />, which assumes no information about the form of the function f. In practice, this means optimizing systems by evaluating some candidate configurations <img src="https://s0.wp.com/latex.php?latex=x+%5Cin+X&amp;bg=ffffff&amp;fg=000&amp;s=0&amp;c=20201002" alt="x \in X" class="latex" /> (i.e., trying some configurations out and measuring their effect), building a surrogate model using this data, using that surrogate to identify the most promising configuration to evaluate next, and repeating until an optimal solution has been found or the experimental budget is exhausted.</p>
<p>Under typical settings, Ax uses a Gaussian process (GP) as the surrogate model during the Bayesian optimization loop: a flexible model that can make predictions while quantifying uncertainty, and one that is especially effective with very few data points. Ax then uses an acquisition function from a family called expected improvement (EI) to suggest the next candidate configurations to evaluate, capturing the expected value of any new configuration compared to the best previously evaluated configuration.</p>
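<p>One common analytic form of EI can be written in a few lines of plain Python. This is a didactic sketch of the formula only, not Ax's implementation; Ax and BoTorch additionally handle observation noise, batches of candidates, and numerical stability:</p>

```python
import math

def expected_improvement(mu: float, sigma: float, f_best: float) -> float:
    """Analytic expected improvement for maximization:
    EI = (mu - f_best) * Phi(z) + sigma * phi(z), with z = (mu - f_best) / sigma,
    where mu and sigma are the GP's posterior mean and standard deviation at a
    candidate point, and f_best is the best objective value observed so far."""
    if sigma <= 0.0:  # no uncertainty: improvement only if the mean beats f_best
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal cdf
    return (mu - f_best) * cdf + sigma * pdf
```

<p>A candidate predicted to match the incumbent but with high uncertainty still has positive EI, while a confidently poor candidate scores near zero; that asymmetry is exactly the exploration/exploitation balance described above.</p>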
<p>The following animation shows this loop with a GP modeling the goal metric plotted above in blue and EI plotted below in black; the highest value of EI informs the next value of x to evaluate. Once the new value of x has been evaluated, the GP is re-fit with the new data point and we calculate the next EI value.</p>
<p><img class="alignnone wp-image-23326" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Adaptive-Experimentation-Ax-image-6.gif?w=916" alt="" width="600" height="340" /></p>
<p>This 1-dimensional example can be expanded to many input and output dimensions, allowing Ax to optimize problems with many (potentially hundreds of) tunable parameters and outcomes. In fact, higher-dimensional settings, in which covering the search space becomes exponentially more costly, are where the surrogate-based approach really shines compared to other approaches.</p>
<p>You can read more about Bayesian optimization on the Ax website’s <a href="https://ax.dev/docs/next/intro-to-bo" target="_blank" rel="noopener">Introduction to Bayesian Optimization page</a>.</p>
<h2>How We Use Ax at Meta</h2>
<p>Ax has been deployed at scale at Meta to solve some of the company’s most challenging optimization problems. Thousands of developers at Meta use Ax for tasks like hyperparameter optimization and architecture search for AI models, tuning parameters for online recommender and ranking models, infrastructure optimizations, and simulation optimization for AR and VR hardware design.</p>
<p>These experiments optimize nuanced goals and leverage sophisticated algorithms. For instance, we’ve used multi-objective optimization to simultaneously improve a machine learning model’s accuracy while minimizing its resource usage. When researchers were tasked with shrinking natural language models to fit on the first generation of Ray-Ban Stories, <a href="https://research.facebook.com/blog/2021/7/optimizing-model-accuracy-and-latency-using-bayesian-multi-objective-neural-architecture-search/" target="_blank" rel="noopener">they used Ax</a> to search for models that optimally traded off size and performance. Additionally, Meta engineers use constrained optimization techniques for <a href="https://engineering.fb.com/2023/08/09/ml-applications/scaling-instagram-explore-recommendations-system/" target="_blank" rel="noopener">tuning recommender systems</a> to optimize key metrics while avoiding regressions in others.</p>
<p>Recently, Ax was used to design <a href="https://engineering.fb.com/2025/07/16/data-center-engineering/ai-make-lower-carbon-faster-curing-concrete/" target="_blank" rel="noopener">new faster curing, low carbon concrete mixes</a> that were deployed at one of our data center construction sites. These new mixes are playing an important role in advancing our <a href="https://sustainability.fb.com/wp-content/uploads/2023/07/Meta-2023-Path-to-Net-Zero.pdf" target="_blank" rel="noopener">goal of net zero emissions in 2030</a>.</p>
<p>We see problems across every domain where the ultimate quality of a system depends on parameters whose interactions are too complex to reason about without experimentation, and where experimentation has a meaningful cost. Ax addresses these challenges by employing a data-driven approach that adapts experiments as they unfold, enabling us to solve these problems efficiently and effectively.</p>
<h2>The Future of Ax</h2>
<p>We are always working to improve Ax by building new features for representing innovative experiment designs, adding exciting new optimization methods, and creating integrations for using Ax with external platforms. <a href="https://github.com/facebook/Ax/">Ax is proud to be open source</a> (MIT license), and we invite both the practitioner and research communities to contribute to the project, whether through improved surrogate models or acquisition functions, extensions built for individual research applications that may benefit the larger community, or simply bug fixes and improvements to the core capabilities. Please reach out to the team via <a href="https://github.com/facebook/Ax/issues">GitHub Issues</a>.</p>
<h2>Read the Paper</h2>
<p><a href="https://openreview.net/forum?id=U1f6wHtG1g&amp;ref=engineeringatmeta" target="_blank" rel="noopener">Ax: A Platform for Adaptive Experimentation</a></p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">website</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, <a href="https://bsky.app/profile/metaopensource.bsky.social" target="_blank" rel="noopener">Bluesky</a> and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>
<h2>Acknowledgements</h2>
<p><em>Ax was created by Meta’s Adaptive Experimentation team: Sebastian Ament, Eytan Bakshy, Max Balandat, Bernie Beckerman, Sait Cakmak, Cesar Cardoso, Ethan Che, Sam Daulton, David Eriksson, Mia Garrard, Matthew Grange, Carl Hvarfner, Paschal Igusti, Lena Kashtelyan, Cristian Lara, Ben Letham, Andy Lin, Jerry Lin, Jihao Andreas Lin, Samuel Müller, Miles Olson, Eric Onofrey, Shruti Patel, Elizabeth Santorella, Sunny Shen, Louis Tiao, and Kaiwen Wu.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/11/18/open-source/efficient-optimization-ax-open-platform-adaptive-experimentation/</link>
      <guid>https://engineering.fb.com/2025/11/18/open-source/efficient-optimization-ax-open-platform-adaptive-experimentation/</guid>
      <pubDate>Tue, 18 Nov 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Announcing the Completion of the Core 2Africa System: Building the Future of Connectivity Together]]></title>
      <description><![CDATA[<h2>Connecting Africa and the World</h2>
<p><strong>We’re excited to share the completion of the core 2Africa infrastructure,</strong> <strong>the world’s longest open access subsea cable system.</strong> 2Africa sets a new standard for global connectivity. This project is the result of years of collaboration, innovation, and a shared vision to connect communities, accelerate economic growth, and enable transformative digital experiences across Africa and beyond.</p>
<div class="wp-video c1"><a href="https://engineering.fb.com/wp-content/uploads/2025/11/2Africa_Impact_Captions_1920x1080_Stereo_MPEG-4-1.mp4">https://engineering.fb.com/wp-content/uploads/2025/11/2Africa_Impact_Captions_1920x1080_Stereo_MPEG-4-1.mp4</a></div>
<h2>Unprecedented Scale and Reach</h2>
<p><strong>2Africa is the first cable to connect East and West Africa in a continuous system and link Africa to the Middle East, South Asia, and Europe.</strong> With a current reach of 33 countries and still counting, we’re enabling connectivity for 3 billion people across Africa, Europe, and Asia – more than 30% of the world’s population. This scale is unprecedented, and we are proud to have partnered with stakeholders across the ecosystem to deliver it.</p>
<figure id="attachment_23340" aria-describedby="caption-attachment-23340" class="wp-caption alignnone c2"><img class="size-full wp-image-23340" src="https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png" alt="" width="1920" height="1080" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-2025_1117-Map.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23340" class="wp-caption-text">The 2Africa Subsea Cable reaches 3 continents and lands in 33 countries, connecting over 3 billion people.</figcaption></figure><h2>Building 2Africa: Partnership, Scale, and Open Access</h2>
<p>Africa’s digital future depends on robust, scalable infrastructure built in partnership with local communities and stakeholders. As demand for high-speed internet grows, a consortium of global partners led by Meta, including <a href="https://bayobab.africa/" target="_blank" rel="noopener">Bayobab</a> (MTN Group), <a href="https://center3.com/" target="_blank" rel="noopener">center3</a> (stc), <a href="https://www.chinamobileltd.com/en/global/home.php" target="_blank" rel="noopener">CMI,</a> <a href="https://www.orange.com/en" target="_blank" rel="noopener">Orange</a>, <a href="https://www.te.eg/wps/portal/te/Personal" target="_blank" rel="noopener">Telecom Egypt</a>, <a href="https://www.vodafone.com/" target="_blank" rel="noopener">Vodafone Group</a>, and <a href="https://wiocc.net/" target="_blank" rel="noopener">WIOCC</a>, came together to design and invest in what would become the world’s longest open access subsea cable system. With the <a href="https://engineering.fb.com/2021/09/28/connectivity/2africa-pearls/">Pearls extension</a> scheduled to go live in 2026, 2Africa’s complete system length of 45,000 kilometers is longer than the equivalent of the Earth’s circumference. </p>
<p>Realizing this vision required close collaboration across both private and public sectors. We managed the project and facilitated engagement with local partners for cable landing, construction, and regulatory processes. The deployment spanned 50 jurisdictions and nearly six years of work, relying on the active engagement of regulators and policymakers to navigate requirements and keep progress on track.</p>
<p>The consortium’s shared goal is to develop an open, inclusive network that fosters competition, supports innovation, and unlocks new opportunities for millions. This open-access model ensures that multiple service providers can leverage the infrastructure, accelerating digital transformation and AI adoption across the region. New partners including Bharti Airtel and MainOne (an Equinix Company) collaborated on specific segments and data center integration, further expanding the cable’s impact and reach.</p>
<h2>Engineering Innovation and Overcoming Challenges</h2>
<p>Building 2Africa required us to push the boundaries of what’s possible in subsea infrastructure. We deployed advanced <a href="https://www.asn.com/sdm/" target="_blank" rel="noopener">spatial division multiplexing (SDM) technology</a>, supporting up to 16 fiber pairs per cable. This is <strong>double the capacity of older systems.</strong> It is the <strong>first 16-fiber-pair subsea cable to fully connect Africa</strong>. We incorporated undersea optical wavelength switching, enabling flexible bandwidth management and supporting evolving demands for AI, cloud, and high-bandwidth applications.</p>
<p><img class="alignnone size-full wp-image-23338" src="https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png" alt="" width="1920" height="1080" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Image-01.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>We increased 2Africa’s burial depth by 50% over previous systems and carefully routed the cable to avoid seabed hazards such as seamounts at <a href="https://www.nature.com/articles/s43247-022-00482-x" target="_blank" rel="noopener">hot brine pools</a>, improving resilience and network availability. The system features two independent trunk powering architectures across its West, East, and Mediterranean segments, optimizing capacity and providing additional resiliency against electrical faults. Our branching unit switching capability allowed us to optimize for trunk capacity and reliability by utilizing routes much further offshore from hazards such as the <a href="https://www.bbc.co.uk/news/science-environment-57382529" target="_blank" rel="noopener">Congo Canyon turbidity currents</a>, while efficiently serving branches to West African nations. To further ensure the integrity and reach of the cable, we engineered compatible crossing solutions for over 60 oil and gas pipelines. </p>
<p><img class="alignnone size-full wp-image-23347" src="https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png" alt="" width="1920" height="1142" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png?resize=916,545 916w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png?resize=768,457 768w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png?resize=1024,609 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png?resize=1536,914 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png?resize=96,57 96w, https://engineering.fb.com/wp-content/uploads/2025/11/2Africa-Blog-Post-Visual-02-updated.png?resize=192,114 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Over the course of construction, we deployed 35 offshore vessels, amounting to nearly 32 years of vessel operations, while dedicated shore-end operations required even more inshore vessels, locally mobilized for cable pulling, guarding, security, and dive support. In remote locations, we imported and mobilized specialist equipment such as dive decompression chambers and shore-end burial tooling to locally operated vessels.</p>
<h2>Economic Impact and Community Transformation</h2>
<p>2Africa is delivering a step change in international bandwidth for Africa, with technical capacity that far exceeds previous systems. For example, on the West segment, stretching from England to South Africa, and landing in countries such as Senegal, Ghana, Cote d’Ivoire, Nigeria, Gabon, the Republic of Congo, DRC, and Angola, the cable supports 21 terabits per second (Tbps) per fiber pair, with 8 fiber pairs on the trunk. This results in a total trunk capacity of up to 180 Tbps. </p>
<h3>But what does 180 Tbps mean for people?</h3>
<p>To put that in perspective:</p>
<ul><li class="c3" aria-level="1">180 Tbps is enough to stream over 36 million HD movies simultaneously (assuming 5 megabits per second (Mbps) per stream).</li>
<li class="c3" aria-level="1">For an individual, this means the potential to download 15,000 full-length Nollywood films (each about 1.5 GB) per second, or enable students to access a remote university’s full library in a minute.</li>
<li class="c3" aria-level="1">For a city like Lagos, it means millions of people can video call, stream, and work online at the same time – without experiencing slowdowns or congestion.</li>
</ul><p>This massive capacity ensures a near-limitless supply of international internet bandwidth, allowing internet service providers (ISPs) and mobile network operators (MNOs) to secure capacity at much lower wholesale prices. This fosters market competition, provides redundancy, and supports modern digital infrastructure including cloud services, data centers, and 5G deployment. </p>
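<p>The back-of-the-envelope numbers above are easy to verify; the stream bitrate and film size are the same assumptions stated in the list:</p>

```python
capacity_bps = 180 * 10**12          # 180 Tbps trunk capacity, in bits per second

# Simultaneous HD streams at 5 Mbps each:
streams = capacity_bps // (5 * 10**6)           # 36,000,000 streams

# 1.5 GB films downloadable per second (1.5 GB * 8 bits/byte = 12 * 10**9 bits):
film_bits = 12 * 10**9
films_per_second = capacity_bps // film_bits    # 15,000 films per second
```
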
<p>The impact is profound: 2Africa is expected to contribute up to <a href="https://www.rti.org/publication/economic-impact-2africa/fulltext.pdf">36.9 billion US dollars</a> to Africa’s GDP within just the first two to three years of operation. The cable’s arrival will boost job creation, entrepreneurship, and innovation hubs in connected regions. Evidence from previous cable landings shows that fast internet access increases employment rates, improves productivity, and supports shifts toward higher-skill occupations. </p>
<p>Meta’s vision is to empower African entrepreneurs, creators, and businesses to innovate and collaborate. By partnering with policymakers, regulators, and stakeholders, we advance Africa’s digital transformation and support its position as an emerging major player in the global digital economy.</p>
<h2>Building Connections, Empowering Progress</h2>
<p>The completion of 2Africa is a defining moment for Africa’s digital future. By leading the design, funding, and deployment of the world’s longest subsea cable system to date, we are building infrastructure that will drive economic growth and connect billions of people, laying the foundation for the next generation of digital experiences. This subsea cable will enable faster, more reliable internet and support AI-driven services.</p>
<p>2Africa is part of Meta’s mission to build the future of human connection, opening more pathways for communities across Africa to help shape and play a critical role in the next chapter of the global digital economy. </p>]]></description>
      <link>https://engineering.fb.com/2025/11/17/connectivity/core-2africa-system-completion-future-connectivity/</link>
      <guid>https://engineering.fb.com/2025/11/17/connectivity/core-2africa-system-completion-future-connectivity/</guid>
      <pubDate>Tue, 18 Nov 2025 07:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Enhancing HDR on Instagram for iOS With Dolby Vision]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing how we’ve enabled Dolby Vision and ambient viewing environment (amve) on the Instagram iOS app to enhance the video viewing experience.</li>
<li class="c1" aria-level="1">HDR videos created on iPhones contain unique Dolby Vision and amve metadata that we needed to support end-to-end.</li>
<li class="c1" aria-level="1">Instagram for iOS is now the first Meta app to support Dolby Vision video, with support coming to more of Meta’s apps in the future.</li>
</ul><p>Every iPhone-produced HDR video encoding includes two additional pieces of metadata that help ensure the picture is consistent between different displays and viewing conditions:</p>
<ul><li class="c1" aria-level="1">Ambient viewing environment (amve), which provides the characteristics of the nominal ambient viewing environment for displaying associated video content. This information enables the final device to adjust the rendering of the video if the actual ambient viewing conditions differ from those for which it was encoded.</li>
<li class="c1" aria-level="1">Dolby Vision, which enhances color, brightness, and contrast to better match the video to the capabilities of the display.</li>
</ul><p>While the Instagram and Facebook iOS apps <a href="https://engineering.fb.com/2023/07/17/video-engineering/hdr-video-reels-meta/" target="_blank" rel="noopener">have supported high dynamic range (HDR) video</a> since 2022, our initial rollout of HDR didn’t support Dolby Vision or amve delivery and playback. Our derived encodings were done with <a href="https://www.ffmpeg.org/">FFmpeg</a>, which has traditionally lacked support for Dolby Vision and amve. Because our tooling was discarding this metadata, pictures were not entirely representative of the way they were meant to be viewed – something that was particularly noticeable at low screen brightness levels.</p>
<p>Now, after hearing feedback from people using our iOS apps, we’ve worked with our partners to preserve the iOS-produced amve and Dolby Vision metadata end to end, significantly enhancing the HDR viewing experience on iOS devices.</p>
<h2>How Meta Processes Video </h2>
<p>It may first be helpful to give some background on the <a href="https://engineering.fb.com/2021/04/05/video-engineering/how-facebook-encodes-your-videos/" target="_blank" rel="noopener">lifecycle of a video at Meta</a>. </p>
<p>The majority of videos uploaded through our apps go through three main stages:</p>
<h3>1. Client Processing </h3>
<p>In the client processing stage, the creator’s device flattens their composition into a single video file at a size appropriate for upload. For HDR videos produced by iOS devices this means encoding with HEVC using the Main 10 profile. This is the stage in which amve and Dolby Vision metadata are produced, added to the encoded bitstream, and uploaded to Meta’s servers.</p>
<h3>2. Server Processing</h3>
<p>In the server processing stage, our transcoding system generates different versions of the video for different consumers. As playback occurs across a variety of devices with different capabilities, we need to produce the video in a format which will be optimal for each device. In the scope of HDR uploads, this means producing an SDR version for devices that don’t support HDR, a VP9 version to satisfy the majority of players, and (for our most popular videos) an <a href="https://engineering.fb.com/2023/02/21/video-engineering/av1-codec-facebook-instagram-reels/">AV1 version</a> with the highest quality at the lowest file size.</p>
<p>Each of these versions is produced at a different bitrate (essentially, file size) to ensure that consumers with varying network conditions are all able to play the video without waiting for a large download to complete (the tradeoff is that lower bitrates have lower quality). All of our derived encodings are created with FFmpeg, which historically lacked support for amve and Dolby Vision. This is the stage where metadata was getting dropped.</p>
<h3>3. Consumption</h3>
<p>In the consumption stage, the viewer’s device picks the version that will play back smoothly (without stalls), decodes it frame by frame, and draws each frame onto the screen. In the context of iOS, all HDR playback is done using Apple’s AVSampleBufferDisplayLayer (AVSBDL). This is the class that consumes amve and Dolby Vision metadata along with each decoded frame.</p>
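<p>As a simplified sketch of that selection step (illustrative only: real players adapt continuously using buffer state and throughput estimates, and the rungs below are made-up numbers, not Meta's actual bitrate ladder):</p>

```python
def pick_version(ladder_kbps, estimated_bandwidth_kbps, headroom=0.8):
    """Pick the highest-bitrate version that fits within a fraction
    (`headroom`) of the estimated bandwidth, falling back to the
    lowest rung so playback can always start."""
    budget = estimated_bandwidth_kbps * headroom
    affordable = [b for b in ladder_kbps if b <= budget]
    return max(affordable) if affordable else min(ladder_kbps)

ladder = [300, 700, 1500, 3000, 6000]  # hypothetical encodings, in kbps
pick_version(ladder, 4000)   # -> 3000: highest rung within 80% of 4000 kbps
pick_version(ladder, 250)    # -> 300: below every rung, so take the lowest
```

<p>The headroom factor captures the tradeoff noted above: choosing a rung below the measured bandwidth trades some quality for a lower risk of stalls.</p>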
<h2>How We Added Support for amve</h2>
<p>When we first set off to support amve in 2022, we noticed something interesting. As we operate on a decoupled architecture of lower-level components rather than a typical high-level AVPlayer setup, we were able to inspect an intact video encoding and get a look at the amve metadata in between the decoder and AVSBDL. We observed that every frame of every video seemed to have exactly the same metadata. This allowed us to hold ourselves over with a quick fix and hardcode these values directly into our player pipeline.</p>
<p>This was not a great situation to be in. Even though the value seemed to be static, there was nothing enforcing this. A new iPhone or iOS version might produce different values, and then we’d be using the wrong ones. amve is also not a concept on Android, which meant that viewing an Android-produced HDR encoding on an iPhone would result in an image that was not technically accurate.</p>
<p>In 2024, we worked with the community to land amve support in FFmpeg. We also built in some logging, which showed that our two-year-old assertion that the values never change still stood. But if they ever do, we will be properly set up for it. </p>
<h2>Enabling Dolby Vision</h2>
<p>Dolby Vision was not as straightforward as amve to adopt.</p>
<p><strong>Challenge #1: The extant specification was for carriage of metadata within an HEVC bitstream. We don’t deliver HEVC.</strong></p>
<p>iPhone-produced HDR uses Dolby Vision profile 8.4, where 8 indicates a profile using HEVC (the video codec) and .4 means cross-compatible with HLG (the standard for HDR video that players without Dolby Vision support would adhere to). </p>
<p>In order to deliver Dolby Vision metadata we needed to carry it within a codec that we do deliver. Fortunately, Dolby has created Profile 10 for carriage of Dolby Vision within AV1. As VP9 does not offer a facility for carriage of additional metadata, there is no support for Dolby Vision at this time, but we are interested in exploring alternate delivery mechanisms.</p>
<p>However, Dolby Vision Profiles 10 and 8 were not properly supported by our existing video processing tools, including FFmpeg and <a href="https://github.com/shaka-project/shaka-packager" target="_blank" rel="noopener">Shaka</a> packager. Based on the specifications from Dolby, we collaborated with the FFmpeg developers to fully implement support for Dolby Vision Profile 8 and Profile 10. In particular, we enabled support within FFmpeg to transcode HEVC with Profile 8.4 into AV1 with Profile 10.4 using both the libaom and libsvtav1 encoders, and made fixes to other parts of the stack, including dav1d decoder and Shaka packager, to properly support Dolby Vision metadata.</p>
<p><strong>Challenge #2: Getting Dolby Vision into AVSampleBufferDisplayLayer</strong></p>
<p>When you feed AVSBDL an encoded bitstream in a supported format, e.g., HEVC from an iPhone camera, Dolby Vision just works for free. But we feed buffers that we decode independently, as we need to be able to decode formats that Apple does not offer out of the box (AV1 on devices before the iPhone 15 Pro, for example). Given this setup, it’s only fair that we’d have to extract Dolby Vision independently as well.</p>
<p>Following the newly-minted specification for carriage of Profile 10 within an AV1 bitstream from Dolby, we implemented manual extraction of Dolby Vision metadata, packaged it into the same format that AVSBDL expected, and we were in business.</p>
<p>To prove that our setup was working as expected, we set up a series of identical Instagram posts with and without Dolby Vision metadata. Our partners at Dolby measured the brightness of each of these posts using a display color analyzer, at varying levels of screen brightness.</p>
<p>They captured the following:</p>
<figure id="attachment_23298" aria-describedby="caption-attachment-23298" class="wp-caption alignnone c2"><img class="size-full wp-image-23298" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg" alt="" width="1920" height="1193" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg 1920w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg?resize=916,569 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg?resize=768,477 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg?resize=1024,636 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg?resize=1536,954 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg?resize=96,60 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Dolby-Vision-iOS-Instagram_image-1.jpg?resize=192,119 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23298" class="wp-caption-text">Screen brightness settings versus image brightness, with and without Dolby Vision.</figcaption></figure><p>In this chart, the X-axis represents the screen brightness setting and the Y-axis represents the observed image brightness. The results demonstrate that with Dolby Vision metadata present, the brightness of the content much more closely follows the brightness setting of the screen.</p>
<p>It worked! But we were not done yet.</p>
<h2>Testing Our Dolby Vision Implementation</h2>
<p>At Meta, we A/B test new features before shipping them to ensure they are performing as we expect. How do we A/B test metadata embedded within a video bitstream? We produce an additional version of every video containing the new metadata and deliver it to a randomly distributed test population, while the randomly distributed control population continues receiving the existing experience. At our scale, we can assert that roughly equal populations will watch each flavor of every video.</p>
<p>For each flavor, we collected statistics such as how long the video was watched, how long it took to load, what type of connection it was watched on, and whether any errors were encountered during playback. Then we analyzed in aggregate to see how the flavor with metadata compared to the flavor without.</p>
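<p>In spirit, the aggregate comparison boils down to grouping playback sessions by flavor and computing per-flavor means. The sketch below is purely illustrative: the record fields and numbers are hypothetical stand-ins, not Meta’s actual logging schema.</p>
<pre class="line-numbers"><code class="language-javascript">// Hypothetical playback-session records (illustrative fields and values).
const sessions = [
  { flavor: 'metadata', watchTimeMs: 9000, loadTimeMs: 300 },
  { flavor: 'metadata', watchTimeMs: 7000, loadTimeMs: 500 },
  { flavor: 'control', watchTimeMs: 8000, loadTimeMs: 250 },
  { flavor: 'control', watchTimeMs: 6000, loadTimeMs: 350 },
];

// Group sessions by flavor, then compute mean watch and load times per group.
function aggregate(records) {
  const groups = {};
  for (const r of records) {
    let g = groups[r.flavor];
    if (!g) {
      g = groups[r.flavor] = { n: 0, watch: 0, load: 0 };
    }
    g.n += 1;
    g.watch += r.watchTimeMs;
    g.load += r.loadTimeMs;
  }
  const means = {};
  for (const [flavor, g] of Object.entries(groups)) {
    means[flavor] = { meanWatchMs: g.watch / g.n, meanLoadMs: g.load / g.n };
  }
  return means;
}

const means = aggregate(sessions);
// means.metadata.meanWatchMs === 8000, means.control.meanLoadMs === 300
</code></pre>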
<p>We hypothesized that if the metadata works as expected, videos with the new metadata would receive more watch time. But when we ran our initial test on Instagram Reels in 2024 we found that, on average, videos with Dolby Vision metadata were actually watched less than their standard counterparts.</p>
<p>How could this be possible? Isn’t Dolby Vision supposed to improve the image?</p>
<h3>Our First A/B Test With Dolby Vision Metadata </h3>
<p>Our data indicated that people were watching less Dolby Vision video because the videos were taking too long to load and people were just moving on to the next Reel in their feed.</p>
<p>There was a reasonable cause for the longer load times: The new metadata added on the order of 100 kbps to every video on average. That may sound negligible, but our encodings are highly optimized for all kinds of diverse viewing conditions. Every bit counts in some situations, and a 100-kbps overhead was enough to regress engagement at the margins.</p>
<p>The answer to this was a compressed metadata format. The team at Dolby offered another specification which would lower the metadata overhead by a factor of four, to 25 kbps on average.</p>
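<p>Back-of-the-envelope, a constant bitrate overhead translates to bytes on the wire as bitrate times duration divided by eight. The 60-second duration below is illustrative:</p>
<pre class="line-numbers"><code class="language-javascript">// Extra bytes = (overhead in kilobits/s * 1000 * duration in s) / 8 bits per byte.
function overheadBytes(kbps, seconds) {
  return (kbps * 1000 * seconds) / 8;
}

overheadBytes(100, 60); // 750000 bytes (~750 KB) with the uncompressed metadata
overheadBytes(25, 60);  // 187500 bytes (~187 KB) with the compressed format
</code></pre>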
<p>Would it be enough? We had to run another test to find out. But there was more work to be done first.</p>
<p>We needed to implement support for Dolby Vision metadata compression (and decompression while we’re at it) in FFmpeg using a bitstream filter. Also, while the uncompressed format was something we could extract from the bitstream and hand off to Apple, the compressed format was not something that was supported by Apple out of the box. We had to implement client-side decompression on our own.</p>
<p>About 2000 lines of code later, we were ready.</p>
<h3>Our Successful A/B Test </h3>
<p>This time, we found that consumers viewing with Dolby Vision metadata were spending more time in the app. We attribute this to people spending more time watching HDR videos in lower-light environments, when their screens are set to lower brightness levels and the HDR videos with proper metadata are less taxing on the eyes.</p>
<p>Because including Dolby Vision metadata had a tangibly positive outcome, we were able to make the case for shipping it across Instagram for iOS, making it our first app to take advantage of Dolby Vision. As of June 2025, all of our delivered AV1 encodings derived from iPhone-produced HDR include Dolby Vision metadata.</p>
<h2>The Future of Dolby Vision Across Meta</h2>
<p>The final challenge in the scope of this post is that Dolby Vision is not widely supported within the web ecosystem across different browsers and displays. Thus, we cannot accurately show the difference that it makes on this page, and hope you will experience it on Instagram on iPhone for yourself. Support for Dolby Vision and amve is now part of our encoding recipes, so it’s ready to deploy to other platforms, and we’re currently working on extending support to Facebook Reels.</p>
<p>In collaboration with Dolby, we’ve solved the perceptible problem of HDR metadata loss, and we worked with the FFmpeg developers to implement support and make it readily available for the community to take advantage of.</p>
<p>This is just the beginning. We look forward to expanding Dolby Vision to other Meta apps and their corresponding operating systems.</p>
<h2>Acknowledgements</h2>
<p><em>We’d like to thank Haixia Shi, the team at Dolby, and Niklas Haas from FFmpeg for their work supporting this effort.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/11/17/ios/enhancing-hdr-on-instagram-for-ios-with-dolby-vision/</link>
      <guid>https://engineering.fb.com/2025/11/17/ios/enhancing-hdr-on-instagram-for-ios-with-dolby-vision/</guid>
      <pubDate>Mon, 17 Nov 2025 18:30:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Open Source Is Good for the Environment]]></title>
      <description><![CDATA[<p>Most people have heard of open-source software. But have you heard about open hardware? And did you know open source can have a positive impact on the environment?</p>
<p>On this episode of the Meta Tech Podcast, <a href="https://www.threads.com/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> sits down with Dharmesh and Lisa to talk about all things open hardware, and Meta’s biggest announcements from the <a href="https://engineering.fb.com/2025/10/13/data-infrastructure/ocp-summit-2025-the-open-future-of-networking-hardware-for-ai/" target="_blank" rel="noopener">2025 Open Compute Project (OCP) Summit</a> – including a new open methodology for <a href="https://engineering.fb.com/2025/10/14/data-center-engineering/how-meta-is-leveraging-ai-to-improve-the-quality-of-scope-3-emission-estimates-for-it-hardware/" target="_blank" rel="noopener">leveraging AI to understand Scope 3 emissions</a>.</p>
<p>Learn about the history of OCP and its growth into an organization with more than 400 companies contributing to it. You’ll also hear how AI and open hardware are helping Meta push to achieve <a href="https://sustainability.atmeta.com/climate/" target="_blank" rel="noopener">net zero emissions in 2030</a>, including how <a href="https://engineering.fb.com/2025/07/16/data-center-engineering/ai-make-lower-carbon-faster-curing-concrete/" target="_blank" rel="noopener">AI is being used to develop new concrete mixes for data center construction</a>.</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/39036615/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/72stSBgoohgixM4t4eapmZ?ref=engineeringatmeta" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/gb/podcast/lowering-emissions-with-the-open-compute-project/id1370910331?i=1000736776664?ref=engineeringatmeta" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/3dhpd4np?ref=engineeringatmeta" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/11/14/production-engineering/open-source-is-good-for-the-environment/</link>
      <guid>https://engineering.fb.com/2025/11/14/production-engineering/open-source-is-good-for-the-environment/</guid>
      <pubDate>Fri, 14 Nov 2025 21:54:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[StyleX: A Styling Library for CSS at Scale]]></title>
      <description><![CDATA[<p><a href="https://stylexjs.com/" target="_blank" rel="noopener">StyleX</a> is Meta’s styling system for large-scale applications. It combines the ergonomics of CSS-in-JS with the performance of static CSS, generating collision-free atomic CSS while allowing for expressive, type-safe style authoring. <a href="https://github.com/facebook/stylex" target="_blank" rel="noopener">StyleX was open sourced</a> at the end of 2023 and has since become the standard styling system across Meta products like Facebook, Instagram, WhatsApp, Messenger, and Threads, as well as external companies like Figma and Snowflake.</p>
<p>At its core, StyleX is a compiler that extracts styles at build time and generates a static stylesheet. But it’s also a philosophy: a framework for authoring, sharing, and maintaining CSS at scale. StyleX makes styling intuitive for everyday engineers by imposing constraints that encourage predictability, enable composition, and scale effortlessly across teams and codebases.</p>
<h2>How Do We Build CSS at Scale?</h2>
<p>To understand the purpose of StyleX, let’s look at the history of CSS at Meta. Serving CSS at Meta’s scale resulted in collisions across bundles, difficulties managing dependencies between stylesheets, and challenges reconciling competing rules that frequently led to specificity wars. Engineers resorted to complex selectors and !important tags, making styles brittle and hard to maintain. Large, monolithic CSS bundles meant browsers were downloading hundreds of kilobytes of unused rules on every page load, slowing rendering and interaction. </p>
<p>To address these issues, Facebook built cx, a CSS-modules-like system that linked local CSS to JavaScript. cx resolved issues with namespace collisions and dependency management but remained limited to static styles defined in separate files.</p>
<pre class="line-numbers"><code class="language-javascript">// ComponentName.css; class uses ComponentName/namespace syntax
.ComponentName/header { margin-top: 10px }
// ComponentName.js 
&lt;div className={cx('ComponentName/header')} /&gt;</code></pre>
<p>When we <a href="https://engineering.fb.com/2020/05/08/web/facebook-redesign/" target="_blank" rel="noopener">rebuilt Facebook.com</a> from the ground up, we had the opportunity to build something better. Around this time, the <a href="https://speakerdeck.com/vjeux/react-css-in-js" target="_blank" rel="noopener">CSS-in-JS</a> movement was gaining momentum. Developers increasingly wanted to colocate styles with component code, write dynamic styles based on runtime state, and leverage JavaScript paradigms like import graphs, module scoping, and type systems. But early CSS-in-JS systems relied on runtime injection: dynamically generating &lt;style&gt; tags and mutating the DOM during render, patterns that introduced measurable performance overhead.</p>
<pre class="line-numbers"><code class="language-javascript">import * as stylex from '@stylexjs/stylex';
const styles = stylex.create({
  foo: { margin: 10 }
});
function MyComponent({}) {
  return &lt;div {...stylex.props(styles.foo)}/&gt;
}
</code></pre>
<p>We built on the lessons of this movement and made a system that is CSS-in-JS only in form, with styles compiling to static CSS. StyleX soon replaced our precursor cx system and transformed the way we approached styling. With StyleX, styles were now defined in JavaScript, enabling composition, conditional logic, and build-time compilation. Atomic classes reduced CSS size by 80% and made styling maintainable across a rapidly scaling codebase.</p>
<p>Today, StyleX is the default styling system at Meta, powering everything from product surfaces to component libraries. Engineers use it to build interfaces that are expressive, reusable, and performant.</p>
<h2>Into the Compiler</h2>
<p>The power of StyleX lies in its abstraction. We automatically handle CSS specificity, variable generation, and static compilation to generate predictable, collision-free atomic CSS. This avoids the maintenance overhead of hand-authored CSS styles, allowing users to focus on style authoring. </p>
<p>StyleX lives in a monorepo composed of several integrated packages. The core engine is a <a href="https://babel.dev/" target="_blank" rel="noopener">Babel</a> plugin that runs a transform across a project and returns the extracted CSS. At a high level, <a href="https://github.com/facebook/stylex/tree/main/packages/%40stylexjs/babel-plugin" target="_blank" rel="noopener">the compiler</a> traverses a set of files, extracts CSS metadata from style objects, and converts style declarations to atomic CSS classes. The collected metadata is then run through several processes: value normalization, at-rules wrapping, and legacy polyfills. Finally, the CSS rules are sorted and outputted into a static sheet.</p>
<p><img class="alignnone size-full wp-image-23259" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png" alt="" width="1999" height="955" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png?resize=916,438 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png?resize=768,367 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png?resize=1024,489 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png?resize=1536,734 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png?resize=96,46 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image1.png?resize=192,92 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Let’s explore what happens behind the scenes through the <a href="https://stylexjs.com/docs/learn/thinking-in-stylex/" target="_blank" rel="noopener">values of StyleX</a>. </p>
<h3>Scalability</h3>
<p>At the heart of StyleX is its static compilation into <a href="https://compiledcssinjs.com/docs/atomic-css" target="_blank" rel="noopener">atomic CSS</a>. Styles are converted to classes containing a single style declaration for reuse across a codebase so CSS size plateaus as the application grows. Whenever possible, styles are compiled away and cached per file, so the system can analyze all reachable styles, deduplicate shared declarations, and emit only what’s needed at runtime.</p>
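<p>The plateau effect can be sketched with a toy generator: one class per unique property:value pair, reused everywhere it appears. The counter-based naming below is a simplified stand-in for the real compiler’s hashing:</p>
<pre class="line-numbers"><code class="language-javascript">// Toy atomic-CSS generator. Real StyleX hashes each declaration;
// a counter is enough here to show the deduplication.
const rules = new Map();

function atomicClass(property, value) {
  const key = property + ':' + value;
  if (!rules.has(key)) {
    const className = 'x' + rules.size;
    rules.set(key, { className, css: '.' + className + ' { ' + property + ': ' + value + ' }' });
  }
  return rules.get(key).className;
}

// Two components using margin: 10px share one rule; only new pairs grow the sheet.
const a = atomicClass('margin', '10px'); // 'x0'
const b = atomicClass('margin', '10px'); // 'x0' again, no new CSS emitted
const c = atomicClass('color', 'red');   // 'x1'
// rules.size === 2
</code></pre>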
<p>The core API surface is intentionally lightweight:</p>
<ul><li class="c1" aria-level="1">stylex.create() is used to define style objects. Objects are stripped away at build time and converted to atomic CSS. Each property: value pair is hashed and outputted as a CSS class. This API is designed for cacheability and only allows for statically resolvable values.</li>
<li class="c1" aria-level="1">stylex.props() handles merging and deduping of style objects. Each call is transpiled to an object containing a space-separated className string corresponding to each atomic style, and a style prop for dynamic styles. When styles are local to the module, we compile at build time; when styles are used across module boundaries, we defer to a tiny runtime merge.</li>
</ul><pre class="line-numbers"><code class="language-javascript">import * as stylex from '@stylexjs/stylex';
const styles = stylex.create({
  foo: { margin: 10 },
  bar: { margin: 10, color: 'red' }
});
function MyComponent({style}) {
  return (
    &lt;&gt;
     &lt;div {...stylex.props(styles.foo)}/&gt; 
     &lt;div {...stylex.props(styles.bar)}/&gt; 
     &lt;div {...stylex.props(style)}/&gt; 
    &lt;/&gt;
  )
}
</code></pre>
<p>In each JavaScript file, API calls are replaced with the class names from the generated CSS and local styles are stripped away. The above component compiles to something like this:</p>
<pre class="line-numbers"><code class="language-javascript">import * as stylex from '@stylexjs/stylex';
function MyComponent({style}) {
  return (
    &lt;&gt;
     &lt;div className="m-10" /&gt;
     &lt;div className="c-red m-10" /&gt; 
     &lt;div {...stylex.props(style)}/&gt; 
    &lt;/&gt;
  )
}
</code></pre>
<p>After the transform is run across files, we process the collected metadata, generate LTR/RTL variants, resolve constants, and order CSS rules by priority. The output is a string that can be emitted as a static stylesheet and post-processed by any of our bundlers.</p>
<pre class="line-numbers"><code class="language-css">.m-10 { margin: 10px }
.c-red { color: red }
</code></pre>
<h3>Expressiveness</h3>
<p>StyleX enforces constraints as design principles instead of limitations. We disallow conflict-prone patterns like <em>styling at a distance</em> and enforce patterns that are statically resolvable. Within these boundaries, however, StyleX remains expressive. We’ve designed for maximum flexibility within these constraints through the following <a href="https://stylexjs.com/docs/api/">APIs</a>.</p>
<h4>Shareable Values</h4>
<p>stylex.create() is designed for per-file cacheability: all CSS metadata must be derived solely from the JavaScript defined within that file. We use an extended version of Babel’s evaluate function to resolve values. The compiler never needs to read the contents of imported modules to generate the stylesheet. </p>
<p>To enable reusable values across files, we provide APIs like stylex.defineVars() and stylex.defineConsts(). These functions generate deterministic hashes based on variable name and import path that remain consistent across modules. This allows us to resolve variables anywhere they’re imported without traversing the file that declares them. At build time, shared constants are fully inlined, while shared variables become global CSS custom properties that can be referenced across components.</p>
<pre class="line-numbers"><code class="language-javascript">// varsFile.stylex.js
const varColors = stylex.defineVars({primary: "#eee"})
// constsFile.stylex.js
const constColors = stylex.defineConsts({primary: "#fff"})
// Component.react.js
import {varColors} from 'varsFile.stylex.js' 
import {constColors} from 'constsFile.stylex.js' 
const styles = stylex.create({
  foo: {color: varColors.primary}, // → .x { color: var(--hash('varsFile.varColors.primary')) }
  bar: {color: constColors.primary}, // → hash('constsFile.constColors.primary') → #fff
});
</code></pre>
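<p>The key property of these hashes is determinism: the same (import path + export name + key) string yields the same name in every module, so no cross-file traversal is needed. As a sketch, any deterministic string hash would do; the djb2-style function below is an illustrative stand-in for StyleX’s actual hash:</p>
<pre class="line-numbers"><code class="language-javascript">// Illustrative stand-in hash: same input string, same output, in every file.
function hashName(str) {
  let h = 5381;
  for (const ch of str) {
    h = (h * 33 + ch.charCodeAt(0)) % 4294967296; // keep h within 32 bits
  }
  return h.toString(36);
}

// The declaring module and every importer independently derive the same name.
const cssVar = '--x' + hashName('varsFile.stylex.js:varColors.primary');
</code></pre>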
<h4>Styling at a Distance</h4>
<p>As mentioned, StyleX is a system for styling components. Elements are styled using classnames. Global and complex selectors are disallowed to avoid styling at a distance: rules that affect elements indirectly from elsewhere in the DOM. Global baseline rules like element selectors or CSS resets must be defined in a separate stylesheet. This is to minimize indirect styling and promote <a href="https://stylexjs.com/docs/learn/thinking-in-stylex/#encapsulation" target="_blank" rel="noopener">encapsulation</a> of styles.</p>
<pre class="line-numbers"><code class="language-css">/* Unsafe: styles leak to child elements rather than being explicitly applied */
.csuifyiu:hover &gt; div { ... }
/* Safe: styles are scoped to a specific element based on observed state */
div:hover &gt; .ksghfhjsfg { ... } </code></pre>
<p>However, we do allow <em>observing from a distance</em> using the stylex.when APIs, which provide a suite of relational selectors to style a component based on the state of its ancestors, descendants, or siblings. Observed elements must be marked with stylex.defaultMarker(), ensuring styles remain directly applied while supporting contextual behavior.</p>
<pre class="line-numbers"><code class="language-javascript">const styles = stylex.create({
    foo: {
      backgroundColor: {
        default: 'blue',
        [stylex.when.ancestor(':hover')]: 'red',
      },
    },
});
&lt;div {...stylex.props(stylex.defaultMarker())}&gt;
  &lt;div {...stylex.props(styles.foo)}&gt; Some Content &lt;/div&gt;
&lt;/div&gt;
</code></pre>
<h4>Preserving CSS Features</h4>
<p>StyleX preserves most of the CSS feature set (media queries, pseudoclasses, keyframes, and more) through static transforms at build time. Wherever possible, we mirror native CSS behavior so styling feels expansive and familiar.</p>
<p>While StyleX is built around static compilation, it also supports dynamic styles. When a value isn’t known at build time, the compiler emits a CSS variable reference, and the runtime writes it inline through the style prop. </p>
<pre class="line-numbers"><code class="language-javascript">const styles = stylex.create({
    // Height is unknown until runtime
    foo: (height) =&gt; ({
      height,
    }),
});
// → className: "d-height" (.d-height { height: var(--height) }), style: { '--height': height }
&lt;div {...stylex.props(styles.foo(height))}/&gt; 
</code></pre>
<p>Theming APIs like stylex.defineVars() and stylex.createTheme() allow users to create and mutate shareable design tokens. defineVars() creates a variable grouping, and createTheme() allows users to create variants by redefining variable groups at a higher specificity.</p>
<pre class="line-numbers"><code class="language-css">/* const spacing = stylex.defineVars({sm: 2px, md: 4px, lg: 8px}) */
:root, .sp-group{--sp-sm:2px;--sp-md:4px;--sp-lg:8px;}
/* const desktopSpacing = stylex.createTheme(spacing, {sm: 5px, md: 10px, lg: 20px}) */
.sp-dktp.sp-dktp, .sp-dktp.sp-dktp:root{--sp-sm:5px;--sp-md:10px;--sp-lg:20px;}
</code></pre>
<p>stylex.defineConsts() allows users to define shareable constants and media queries without overloading browser memory with CSS variables. During compilation, StyleX gathers metadata across all defineConsts() calls, generates placeholder hashes in create() calls, and inlines the constant values directly into the generated stylesheet.</p>
<p>Finally, APIs like stylex.keyframes() and stylex.viewTransitionClass() support animations by generating @keyframes and ::view-transition-* rules.</p>
<h3>Predictability</h3>
<p>StyleX is a system for styling components. We discourage global styling in favor of applying localized classnames on elements directly. Our design is centered around predictable style merging: The last style always wins! You can think of the stylex.props function as a deterministic merge of style objects: given stylex.props(styles.foo, styles.bar), bar always overrides foo. This makes it easy to share and combine styles predictably across files.</p>
<p>CSS specificity follows a hierarchy where selectors are assigned different priorities. The calculation is based on a three-column value of IDs, classes, and types, commonly written as (ID, Class, Type). Because StyleX is entirely class-based, resolving conflicts between style objects means determining which class names to apply and enforcing priorities between them. </p>
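<p>For intuition, the three columns can be counted mechanically. This simplified counter ignores pseudo-classes, attribute selectors, and non-descendant combinators, so it is far from the full algorithm:</p>
<pre class="line-numbers"><code class="language-javascript">// Simplified specificity counter: returns [IDs, classes, types].
// Ignores pseudo-classes, attribute selectors, :not(), and most combinators.
function specificity(selector) {
  const ids = (selector.match(/#[\w-]+/g) || []).length;
  const classes = (selector.match(/\.[\w-]+/g) || []).length;
  const types = (selector.match(/(^|\s)[a-zA-Z][\w-]*/g) || []).length;
  return [ids, classes, types];
}

specificity('div');        // [0, 0, 1]
specificity('.m-10');      // [0, 1, 0]
specificity('#app .m-10'); // [1, 1, 0]
</code></pre>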
<pre class="line-numbers"><code class="language-javascript">const styles = stylex.create({
  foo: { color: 'red', margin: 0 },
  bar: { color: 'black', marginTop: 10 }
});
function MyComponent() {
  // becomes &lt;div className="c-black m-0 mt-10" /&gt; 
  return &lt;div {...stylex.props(styles.foo, styles.bar)} /&gt; 
}
</code></pre>
<p>During merge, repeated properties across style objects are deduplicated so that only the last value is applied. As a result, each class name in the DOM node corresponds to a single property. In the above example, the color: red class is dropped during merge so color: black takes precedence. But resolving overlaps between shorthands and constituent longhands is more complex. </p>
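<p>Setting shorthands aside for a moment, the per-property, last-wins merge can be sketched as follows (the compiled objects and class names are illustrative):</p>
<pre class="line-numbers"><code class="language-javascript">// Last-wins merge: each compiled style object maps CSS properties to atomic
// class names; later objects override earlier ones property by property.
function mergeStyles(...styleObjects) {
  const byProperty = {};
  for (const styles of styleObjects) {
    if (styles) { // falsy values are skipped, allowing conditional styles
      Object.assign(byProperty, styles);
    }
  }
  return { className: Object.values(byProperty).join(' ') };
}

// Illustrative compiled forms of styles.foo and styles.bar from above:
const foo = { color: 'c-red', margin: 'm-0' };
const bar = { color: 'c-black', marginTop: 'mt-10' };
mergeStyles(foo, bar); // { className: 'c-black m-0 mt-10' }
</code></pre>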
<p>Consider the following HTML:</p>
<pre class="line-numbers"><code class="language-html">&lt;style&gt;
.margin-top-10 { margin-top: 10px }
.margin-10 { margin: 10px }
&lt;/style&gt;
&lt;div class="margin-10 margin-top-10" /&gt;
</code></pre>
<p>When multiple classes are applied to a div, the resulting styling depends solely on the specificity and source order of the rules in the stylesheet, not on the order of the classes in the class attribute. Without additional handling, margin completely overrides margin-top here!</p>
<p>Throw pseudoclasses and media queries in the mix and things become even more complex:</p>
<pre class="line-numbers"><code class="language-javascript">[
   [ "m-0", { "css": ".m-0 {margin: 0}" }, 3000 ],
   [ "mt-10", { "css": ".mt-10 {margin-top: 10px}" }, 4000 ],
   [ "mt-10-mq", { "css": "@media (...) {.mt-10-mq {margin-top: 10px} }" }, 4200 ],
   [ "mt-10-mq-hover", { "css": "@media (...) {.mt-10-mq-hover:hover {margin-top: 10px} }" }, 4320 ],
]
</code></pre>
<p>To handle this ambiguity, we compute a numerical priority alongside each CSS rule. We use these priorities alongside a user-configured styleResolution to determine the specificity of each class selector using the @layer at-rule or equivalent polyfill.</p>
<p>The enforced ordering looks something like this: </p>
<p><img class="alignnone size-full wp-image-23258" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image2.png" alt="" width="1122" height="368" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image2.png 1122w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image2.png?resize=916,300 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image2.png?resize=768,252 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image2.png?resize=1024,336 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image2.png?resize=96,31 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-StyleX-image2.png?resize=192,63 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>The result? Longhands and shorthands merge predictably, :active states override :hover states, media queries override default behavior, and user-authored order is respected when possible. This behind-the-scenes specificity handling allows developers to combine and reuse styles without manually resolving conflicts.</p>
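<p>Mechanically, emitting the sheet in that order is just a sort over the collected (className, css, priority) entries, so higher-priority rules land later in the stylesheet and win the cascade. A minimal sketch, reusing the illustrative priority values from earlier:</p>
<pre class="line-numbers"><code class="language-javascript">// Sort collected rules by ascending numeric priority before emitting:
// when specificity is otherwise equal, rules later in the sheet win.
function emitStylesheet(collected) {
  return collected
    .slice() // leave the collection order untouched
    .sort(function (a, b) { return a.priority - b.priority; })
    .map(function (rule) { return rule.css; })
    .join('\n');
}

// The shorthand margin gets a lower priority than the longhand margin-top,
// so the longhand is emitted later and overrides it.
const collected = [
  { className: 'mt-10', css: '.mt-10 {margin-top: 10px}', priority: 4000 },
  { className: 'm-0', css: '.m-0 {margin: 0}', priority: 3000 },
];
emitStylesheet(collected);
// '.m-0 {margin: 0}\n.mt-10 {margin-top: 10px}'
</code></pre>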
<h2>Looking Forward</h2>
<p>StyleX is maintained by a team of CSS enthusiasts who aim to make styling accessible to everyone. Beyond the compiler, the monorepo includes an <a href="https://github.com/facebook/stylex/tree/main/packages/%40stylexjs/eslint-plugin" target="_blank" rel="noopener">ESLint plugin</a> for style validation, a <a href="https://github.com/facebook/stylex/tree/main/packages/%40stylexjs/cli">CLI</a> for easy stylesheet generation, a <a href="https://github.com/facebook/stylex/blob/main/packages/%40stylexjs/postcss-plugin/README.md" target="_blank" rel="noopener">PostCSS plugin</a> for post-processing, and an experimental <a href="https://github.com/facebook/stylex/tree/main/packages/style-value-parser" target="_blank" rel="noopener">CSS parser</a>. </p>
<p>The open source community has been critical in shaping the direction of StyleX. With the help of thousands of contributors, the ecosystem includes a community-built <a href="https://venerable-melomakarona-255f96.netlify.app/" target="_blank" rel="noopener">playground</a>, <a href="https://marketplace.visualstudio.com/items?itemName=yash-singh.stylex" target="_blank" rel="noopener">VS Code extensions</a>, an <a href="https://github.com/Dwlad90/stylex-swc-plugin" target="_blank" rel="noopener">SWC compiler</a>, <a href="https://www.npmjs.com/package/vite-plugin-stylex" target="_blank" rel="noopener">multiple</a> <a href="https://github.com/sukkaw/stylex-webpack" target="_blank" rel="noopener">bundler</a> <a href="https://www.npmjs.com/package/unplugin-stylex" target="_blank" rel="noopener">integrations</a>, and <a href="https://stylexjs.com/docs/learn/ecosystem/" target="_blank" rel="noopener">more</a>!</p>
<p>We’re always exploring new ways to make StyleX the styling system for the modern web. Our work is an ongoing dialogue between the needs of the community and the values that guide our design. Roadmap highlights include an API for shareable functions, LLM-ready context files, support for inline styles, developer extensions, strict compiler validation, logical styles utilities, and an official unplugin for bundler integrations. Our goal is to continue to evolve alongside the browser and keep imagining what styling on the web can be.</p>
<p>Happy style authoring! We make StyleX for you.</p>
<h2>Learn More</h2>
<p>To hear the latest on StyleX, check out the <a href="https://stylexjs.com/" target="_blank" rel="noopener">StyleX website</a>, <a href="https://github.com/facebook/stylex" target="_blank" rel="noopener">GitHub</a>, <a href="https://bsky.app/profile/stylexjs.bsky.social" target="_blank" rel="noopener">Bluesky</a> and <a href="https://x.com/stylexjs" target="_blank" rel="noopener">X</a>. </p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">website</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, <a href="https://bsky.app/profile/metaopensource.bsky.social" target="_blank" rel="noopener">Bluesky</a> and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>
<h2>Acknowledgements</h2>
<p><em>Special thanks to past maintainers Naman Goel and Sebastian McKenzie; contributors Frank Yan, Jerry Su, Ankit Sardesai, Joel Austin, Daniel Neiter, Nicolas Gallagher, Vincent Riemer, Ezzudin Alkotob, Andrey Sukhachev, Nitish Mehrotra, Nadiaa D., Prakshal Jain, JC Pérez Chávez, Samantha Zhan, Anay Bhakat; advisors Christopher Chedeau, Chris Callahan, Richard Hansen, Robert Maratos, Andrew Imm, Tim Yung, Eli White; Modern CSS leads; the Web Platform org; the open source community; and the lineage of systems like React Native and Linaria that continue to inspire our work.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/11/11/web/stylex-a-styling-library-for-css-at-scale/</link>
      <guid>https://engineering.fb.com/2025/11/11/web/stylex-a-styling-library-for-css-at-scale/</guid>
      <pubDate>Tue, 11 Nov 2025 18:58:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing details about Meta’s Generative Ads Recommendation Model (GEM), a new foundation model that delivers increased ad performance and advertiser ROI by enhancing other ads recommendation models’ ability to serve relevant ads.</li>
<li class="c1" aria-level="1">GEM’s novel architecture allows it to scale with an increasing number of parameters while consistently generating more precise predictions efficiently.</li>
<li class="c1" aria-level="1">GEM propagates its learnings across the entire ads model fleet through a suite of post-training techniques, enabling a paradigm shift in Meta’s Ads Recommendation system.</li>
<li class="c1" aria-level="1">GEM leverages enhanced training scalability that efficiently utilizes thousands of GPUs for building and iterating an LLM-scale ads foundation model.</li>
<li class="c1" aria-level="1">GEM is already driving significant increases in ad conversions across Instagram and Facebook.</li>
</ul><p>Meta has been at the forefront of harnessing AI across our products and services to drive business value for advertisers. Leveraging advanced techniques to personalize ads for people and maximize the performance of each ad impression is an integral part of how we develop our Ads Recommendation system. </p>
<p>The <a href="https://www.facebook.com/business/news/ai-innovation-in-metas-ads-ranking-driving-advertiser-performance" target="_blank" rel="noopener">Generative Ads Recommendation Model (GEM)</a> is Meta’s most advanced ads foundation model, built on an LLM-inspired paradigm and trained across thousands of GPUs. It is the largest foundation model for recommendation systems (RecSys) in the industry, trained at the scale of large language models. GEM introduces architectural innovations that unlock efficient scaling laws, delivering performance gains that scale cost-effectively with data and compute. Training breakthroughs such as multi-dimensional parallelism, custom GPU kernels, and memory optimizations make it feasible to train GEM at its scale. Post-training, GEM applies advanced knowledge transfer techniques to amplify the performance of downstream models across the entire ads stack, delivering more relevant and personalized ad experiences aligned with people’s preferences. Since its launch across Facebook and Instagram <a href="https://www.facebook.com/business/news/ai-innovation-in-metas-ads-ranking-driving-advertiser-performance" target="_blank" rel="noopener">earlier this year</a>, GEM has delivered a 5% increase in ad conversions on Instagram and a 3% increase on Facebook Feed in Q2.</p>
<p>In Q3, we made improvements to GEM’s model architecture that doubled the performance benefit we get from adding a given amount of data and compute. This will enable us to continue scaling up the amount of training capacity we use on GEM at an attractive ROI.</p>
<h2>Introducing GEM</h2>
<p>GEM represents a significant advancement in RecSys through three key innovations: model scaling with advanced architecture, post-training techniques for knowledge transfer, and enhanced training infrastructure to support scalability. These innovations efficiently boost ad performance, enable effective knowledge sharing across the ad model fleet, and optimize the use of thousands of GPUs for training. GEM has driven a paradigm shift in ads RecSys, transforming ad performance across the funnel — awareness, engagement, and conversion — through joint optimization of both user and advertiser objectives.</p>
<p>Building a large foundation model for Meta’s ads RecSys requires addressing several key challenges:</p>
<ul><li class="c1" aria-level="1"><strong>Handling a large, dynamic feature space across all of Meta’s apps:</strong> Every day, billions of user-ad interactions occur across our platforms, but meaningful signals — such as clicks and conversions — are very sparse. GEM must learn from this vast but imbalanced data, recognizing meaningful patterns and generalizing across diverse users and behaviors.</li>
<li class="c1" aria-level="1"><strong>Processing a diverse array of data</strong>: GEM must learn from a diverse array of ads data — including advertiser goals, creative formats, measurement signals, and user behaviors across multiple delivery channels. This heterogeneity adds significant modeling complexity, requiring GEM to unify multimodal, multi-source inputs and capture nuanced interactions to power other ads recommendation models.</li>
<li class="c1" aria-level="1"><strong>Training efficiently:</strong> Training and scaling a large foundation model demands thousands of GPUs, along with advanced parallelism and system-level optimizations to ensure efficient hardware utilization. </li>
</ul><p>GEM overcomes these challenges through:</p>
<ul><li class="c1" aria-level="1"> A scalable model architecture that is now <strong>4x more efficient at driving ad performance gains</strong> for a given amount of data and compute than our original ads recommendation ranking models. </li>
<li class="c1" aria-level="1">A new framework that <strong>improves knowledge transfer effectiveness,</strong> <strong>achieving 2x the effectiveness</strong> of standard knowledge distillation.</li>
<li class="c1" aria-level="1">A new training stack that delivers a <strong>23x increase in effective training FLOPS with a 1.43x increase in model FLOPS utilization (MFU)</strong> using <strong>16x more GPUs</strong>. </li>
</ul><h2>Building and Scaling GEM’s Architecture</h2>
<p>GEM is trained on ad content and user engagement data from both ads and organic interactions. From this data, we derive features that we categorize into two groups: sequence features (such as activity history) and non-sequence features (such as user and ad attributes — e.g., age, location, ad format, and creative representation). Customized attention mechanisms are applied to each group independently, while also enabling cross-feature learning. This design improves accuracy and scales both the depth and breadth of each attention block, delivering 4× the efficiency of our previous generation of models.</p>
<p><img class="alignnone wp-image-23243 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png" alt="" width="1996" height="886" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png 1996w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png?resize=916,407 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png?resize=768,341 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png?resize=1024,455 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png?resize=1536,682 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png?resize=96,43 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-1-e1762534706646.png?resize=192,85 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
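<p>To make the two-path design concrete, the sketch below applies plain scaled dot-product self-attention to each feature group independently and then concatenates the results for downstream cross-feature layers. This is a minimal, pure-Python illustration with identity projections and made-up feature values; GEM’s actual attention blocks are far more sophisticated:</p>

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(seq):
    # scaled dot-product self-attention with identity projections:
    # each position attends over all positions of its own feature group
    d = len(seq[0])
    out = []
    for q in seq:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in seq])
        out.append([sum(w * v[i] for w, v in zip(scores, seq))
                    for i in range(d)])
    return out

# attention is applied to each group independently, then the outputs
# are concatenated for cross-feature learning (values are invented)
sequence_feats = [[1.0, 0.0], [0.0, 1.0]]  # e.g., activity history
nonseq_feats = [[0.5, 0.5]]                # e.g., user/ad attributes
fused = attention(sequence_feats) + attention(nonseq_feats)
```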
<h3>Non-Sequence Feature Interaction Modeling</h3>
<p>Understanding how user attributes interact with ad characteristics is crucial for accurate recommendations. GEM enhances the <a href="https://arxiv.org/abs/2403.02545" target="_blank" rel="noopener">Wukong architecture</a> to use stackable factorization machines with cross-layer attention connections, allowing the model to learn which feature combinations matter most. Each Wukong block can scale vertically (for deeper interactions) and horizontally (for broader feature coverage), enabling the discovery of increasingly complex user-ad patterns.</p>
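<p>For reference, the second-order factorization machine term that such blocks build upon scores every feature pair through the inner product of their embeddings, and can be computed in O(nk) per example using the standard identity ½[(Σᵢxᵢvᵢ)² − Σᵢ(xᵢvᵢ)²] applied per embedding dimension. A minimal sketch of that classic building block (not the Wukong implementation itself):</p>

```python
def fm_interaction(xs, embeddings):
    # Second-order FM term: sum over pairs i<j of <v_i, v_j> * x_i * x_j,
    # computed with the standard O(n*k) identity instead of O(n^2) pairs.
    k = len(embeddings[0])
    total = 0.0
    for f in range(k):
        s = sum(x * v[f] for x, v in zip(xs, embeddings))
        sq = sum((x * v[f]) ** 2 for x, v in zip(xs, embeddings))
        total += 0.5 * (s * s - sq)
    return total
```

Stacking such interaction blocks, as Wukong does, lets a model learn progressively higher-order feature combinations.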
<h3>Offline Sequence Feature Modeling</h3>
<p>User behavior sequences — spanning long sequences of ad / content clicks, views, and interactions — contain rich signals about preferences and intent, yet traditional architectures struggle to process such long sequences efficiently. GEM overcomes this challenge with a pyramid-parallel structure, stacking multiple parallel interaction modules in a pyramid formation to capture complex user-ad relationships at scale. The new scalable offline feature infrastructure processes sequences of up to thousands of events with minimal storage cost, so GEM can learn from a much longer history of user organic and ad interactions. By modeling these extended user behavior sequences, GEM can more effectively uncover patterns and relationships, resulting in a deeper and more accurate understanding of the user’s purchase journey.</p>
<h3>Cross-Feature Learning</h3>
<p>Existing approaches compress user behavior sequences into compact vectors for downstream tasks, which risks losing critical engagement signals. GEM takes a different approach that preserves full sequence information while enabling efficient cross-feature learning. Our design, <a href="https://arxiv.org/pdf/2411.09852" target="_blank" rel="noopener">InterFormer</a>, employs parallel summarization with an interleaving structure that alternates between sequence learning (e.g., a <a href="https://engineering.fb.com/2024/11/19/data-infrastructure/sequence-learning-personalized-ads-recommendations/" target="_blank" rel="noopener">custom transformer architecture</a>) and cross-feature interaction layers. This allows the model to progressively refine its sequence understanding while maintaining access to the complete user journey. This design facilitates efficient interaction learning while preserving the structural integrity of user sequence data — enabling GEM to scale to higher layer counts without losing critical behavioral signals.</p>
<h3>Multi-Domain Learning With Domain-Specific Optimization</h3>
<p>Traditional ad recommendation systems struggle to balance learning across a broad product ecosystem — treating surfaces either in isolation (thus missing valuable cross-platform insights) or identically (ignoring platform-specific behaviors). Different Meta surfaces like Facebook, Instagram, and Business Messaging each have unique user behaviors and interaction patterns. GEM solves this through learning from cross-surface user interactions while ensuring predictions remain tailored to each surface’s unique characteristics. For example, this enables GEM to use insights from Instagram video ad engagement to improve Facebook Feed ad predictions, while also optimizing each domain’s predictions for its specific objective (such as clicks or conversions).</p>
<h2>Maximizing Transfer Efficiency With Post-Training Techniques</h2>
<p>GEM only delivers impact if its knowledge can be efficiently transferred to hundreds of user-facing vertical models (VMs). To translate the performance of the GEM foundation model (FM) into measurable gains for user-facing VMs, we employ both direct and hierarchical knowledge transfer strategies. </p>
<p>Direct transfer enables GEM to transfer knowledge to major VMs within the same data spaces where GEM was trained. Hierarchical transfer distills knowledge from GEM into domain-specific FMs, which then teach VMs, driving broad improvements across ad models. Together, these approaches use a suite of techniques, including knowledge distillation, representation learning, and parameter sharing to maximize transfer efficiency across the entire ad model space, achieving 2x the effectiveness of <a href="https://arxiv.org/abs/1503.02531" target="_blank" rel="noopener">standard knowledge distillation</a>.</p>
<p><img class="alignnone size-full wp-image-23244" src="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png" alt="" width="1999" height="1020" srcset="https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png?resize=916,467 916w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png?resize=768,392 768w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png?resize=1024,523 1024w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png?resize=1536,784 1536w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png?resize=96,49 96w, https://engineering.fb.com/wp-content/uploads/2025/11/Meta-Generative-Ads-Model-GEM-image-2.png?resize=192,98 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
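<p>As a baseline for comparison, the standard knowledge distillation referenced above trains the student to match the teacher’s temperature-softened output distribution. A minimal sketch of that classic loss (GEM’s transfer framework layers further techniques on top of this):</p>

```python
import math

def softmax_T(logits, T):
    # temperature-scaled softmax: higher T softens the distribution
    m = max(logits)
    es = [math.exp((x - m) / T) for x in logits]
    s = sum(es)
    return [e / s for e in es]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) at temperature T, scaled by T^2 as in
    # classic knowledge distillation, so gradients stay comparable
    p = softmax_T(teacher_logits, T)
    q = softmax_T(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student already matches the teacher and grows as their softened distributions diverge.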
<h3>Knowledge Distillation</h3>
<p>In Meta’s ads system, VMs often suffer from stale supervision caused by delays in FM training and evaluation as well as domain mismatches between GEM or FM predictions and the VMs’ surface-specific objectives. These outdated or misaligned signals between the VMs (students) and GEM (the teacher) can degrade the accuracy and adaptability of student models over time.</p>
<p>To address this, we use a <a href="https://arxiv.org/pdf/2502.17494" target="_blank" rel="noopener">Student Adapter</a> during training, a lightweight component that refines the teacher’s outputs using the most recent ground-truth data. It learns a transformation that better aligns teacher predictions with observed outcomes, ensuring that student models receive more up-to-date and domain-relevant supervision throughout training.</p>
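<p>To convey the mechanism, the toy sketch below fits a scalar affine correction to the teacher’s scores using recent ground-truth labels; the real Student Adapter is a learned network component, and the function and data here are purely hypothetical:</p>

```python
def fit_adapter(teacher_scores, labels, lr=0.1, steps=200):
    # Hypothetical lightweight adapter: a scalar affine correction
    # y_hat = a * teacher_score + b, fitted by gradient descent on
    # recent ground truth to realign stale teacher predictions.
    a, b = 1.0, 0.0
    n = len(labels)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(teacher_scores, labels):
            err = (a * s + b) - y
            grad_a += 2 * err * s / n
            grad_b += 2 * err / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b
```

Students are then supervised with the corrected outputs a·score + b rather than the raw, possibly stale teacher predictions.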
<h3>Representation Learning</h3>
<p>Representation learning is the process by which models automatically derive meaningful, compact features from raw data, enabling more effective downstream tasks like ad click prediction. It complements knowledge distillation by generating semantically aligned features that support efficient knowledge transfer from teacher to student models. With this approach, GEM can improve FM-to-VM transfer efficiency without adding inference overhead.</p>
<h3>Parameter Sharing</h3>
<p>Parameter sharing is a technique in which multiple models or components reuse the same set of parameters to reduce redundancy, improve efficiency, and facilitate knowledge transfer.</p>
<p>In our context, parameter sharing enables efficient knowledge reuse by allowing VMs to selectively incorporate components from FMs. This lets smaller, latency-sensitive VMs leverage the rich representations and pre-learned patterns of FMs without incurring their full computational cost.</p>
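<p>Mechanically, the idea reduces to sharing references to parameter tables instead of copying weights. A deliberately minimal sketch with invented class and feature names:</p>

```python
class FoundationModel:
    def __init__(self):
        # large embedding table learned once by the foundation model
        # (toy values; real tables hold billions of parameters)
        self.embeddings = {"user_42": [0.1, 0.9], "ad_7": [0.4, 0.2]}

class VerticalModel:
    def __init__(self, shared_embeddings):
        # parameter sharing: reuse the FM's table by reference, so the
        # latency-sensitive VM pays no extra memory for these parameters
        self.embeddings = shared_embeddings

fm = FoundationModel()
vm = VerticalModel(fm.embeddings)
assert vm.embeddings is fm.embeddings  # same object, zero duplication
```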
<h2>How GEM Was Trained</h2>
<p>GEM operates at a scale typically seen only in modern LLMs, and training it required a complete overhaul of our training recipes. The re-engineered training stack delivers a 23x increase in effective training FLOPs using 16x more GPUs while also improving efficiency: MFU, a key measure of hardware efficiency, increased by 1.43x, reflecting better use of GPU resources. Increasing both throughput and efficiency is essential for training foundation models at this scale.</p>
<p>To support massive model sizes and multimodal workloads, we employ strategies such as multi-dimensional parallelism, custom GPU kernels, and model-system co-design. These techniques enable near-linear scaling across thousands of GPUs, improving compute throughput, memory usage, and overall hardware efficiency. </p>
<h3>Distributed Training</h3>
<p>Training large models, like GEM, requires carefully orchestrated parallelism strategies across both dense and sparse components. For the dense parts of the model, techniques like Hybrid Sharded Distributed Parallel (HSDP) optimize memory usage and reduce communication costs, enabling efficient distribution of dense parameters across thousands of GPUs. In contrast, the sparse components — primarily large embedding tables used for user and item features — employ a <a href="https://pytorch.org/blog/scaling-recommendation-2d-sparse-parallelism/" target="_blank" rel="noopener">two-dimensional approach using <strong>data parallelism</strong> and <strong>model parallelism</strong></a>, optimized for synchronization efficiency and memory locality.</p>
<h3>System-Level Optimizations for GPU Throughput</h3>
<p>Beyond parallelism, we implemented a suite of techniques to saturate GPU compute throughput and reduce training bottlenecks:</p>
<ul><li class="c1" aria-level="1">A custom in-house GPU kernel designed for variable-length (jagged) user sequences and computation fusion, leveraging the latest GPU hardware features and optimization techniques.</li>
<li class="c1" aria-level="1">Graph-level compilation in PyTorch 2.0 that automates key optimizations, including activation checkpointing for memory savings and operator fusion for improved execution efficiency.</li>
<li class="c1" aria-level="1">Memory compression techniques such as FP8 quantization for activations and unified embedding formats to reduce memory footprint.</li>
<li class="c1" aria-level="1">GPU communication collectives, built via NCCLX (Meta’s fork of NVIDIA’s NCCL), that operate without utilizing Streaming Multiprocessor (SM) resources, eliminating contention between communication and compute workloads and improving overlap and GPU utilization.</li>
</ul><h3>Reducing Training Overhead and Job Startup Time</h3>
<p>To improve training agility and minimize GPU idleness, we optimized effective training time (ETT) — the proportion of training time spent processing new data. We reduced job startup time by 5x by optimizing trainer initialization, data-reader setup, checkpointing, and PyTorch 2.0 compilation; notably, caching strategies alone cut PyTorch 2.0 compilation time by 7x. </p>
<h3>Maximizing GPU Efficiency Across the Development Lifecycle </h3>
<p>GPU efficiency is optimized across all stages of the model lifecycle — from early experimentation to large-scale training and post-training. In the exploration phase, we accelerate iteration using lightweight model variants at a much lower cost compared to full-sized models. These variants support over half of all experiments, enabling faster idea validation with minimal resource overhead. During the post-training stage, the model runs forward passes to generate knowledge, including labels and embeddings, for downstream models. Unlike in large language models, we also perform continuous online training to refresh the FMs. We enhance traffic sharing between training and post-training knowledge generation, as well as between the foundation model and downstream models, to reduce computational demand. Additionally, GPU efficiency optimization has been applied across all stages to improve end-to-end system throughput. </p>
<h2>The Future of Foundation Models for Ads Recommendations</h2>
<p>The future of ads recommendation systems will be defined by a deeper understanding of people’s preferences and intent, making every interaction feel personal. For advertisers, this translates into one-to-one connections at scale, driving stronger engagement and outcomes.</p>
<p>Looking ahead, GEM will learn from Meta’s entire ecosystem, including user interactions with organic and ads content across modalities such as text, images, audio, and video. These learnings will be extended to cover all major surfaces across Facebook and Instagram. This stronger multimodal foundation helps GEM capture the nuances behind clicks, conversions, and long-term value, paving the way for a unified engagement model that can intelligently rank both organic content and ads, delivering maximum value for people and advertisers.</p>
<p>We will continue to scale GEM and train on even larger clusters by advancing its architecture and training recipes on the latest AI hardware, enabling it to learn efficiently from more data across diverse modalities and deliver precise predictions. We will also evolve GEM to reason with inference-time scaling to optimize compute allocation, power intent-centric user journeys, and enable agentic, insight-driven advertiser automation that drives higher ROAS.</p>
<h2>Acknowledgements</h2>
<p><em>We would like to thank Yasmine Badr, John Bocharov, Shuo Chang, Laming Chen, Wenlin Chen, Wentao Duan, Xiaorui Gan, Shuo Gu, Mengyue Hang, Yuxi Hu, Yuzhen Huang, Shali Jiang, Santanu Kolay, Zhijing Li, Boyang Liu, Rocky Liu, Xi Liu, Liang Luo, GP Musumeci, Sandeep Pandey, Richard Qiu, Jason Rudy, Vibha Sinha, Matt Steiner, Musharaf Sultan, Chonglin Sun, Viral Vimawala, Ernest Wang, Xiaozhen Xia, Jackie (Jiaqi) Xu, Fan Yang, Xin Zhang, Buyun Zhang, Zhengyu Zhang, Qinghai Zhou, Song Zhou, Zhehui Zhou, Rich Zhu and the entire team behind the development and productionization of the largest foundation model in Meta’s ads recommendation system.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/11/10/ml-applications/metas-generative-ads-model-gem-the-central-brain-accelerating-ads-recommendation-ai-innovation/</link>
      <guid>https://engineering.fb.com/2025/11/10/ml-applications/metas-generative-ads-model-gem-the-central-brain-accelerating-ads-recommendation-ai-innovation/</guid>
      <pubDate>Mon, 10 Nov 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Video Invisible Watermarking at Scale]]></title>
      <description><![CDATA[<ul><li>At Meta, we use invisible watermarking for a variety of content provenance use cases on our platforms.</li>
<li>Invisible watermarking serves a number of use cases, including detecting AI-generated videos, verifying who posted a video first, and identifying the source and tools used to create a video.</li>
<li>We’re sharing how we overcame the challenges of scaling invisible watermarking, including how we built a CPU-based solution that offers comparable performance to GPUs, but with better operational efficiency.</li>
</ul><p>Invisible watermarking is a powerful media-processing technique that allows us to embed a signal into media in a way that’s imperceptible to humans but detectable by software. This technology offers a robust solution for content provenance tagging (an indication of where the content came from), enabling the identification and tracking of content to support various use cases. At its core, invisible watermarking works by subtly modifying pixel values in images, waveforms in audio, or text tokens generated by large language models (LLMs) to embed a small amount of data. The design of watermarking systems adds necessary redundancy; this ensures the embedded identification remains persistent through transcodes and editing, unlike metadata tags that can be lost.</p>
<p>Bringing an invisible watermarking solution to production at scale presents many challenges. In this blog post, we’ll discuss how we overcame challenges with deployment environments, bitrate increases, and visual quality regressions to adapt to real-world use cases.</p>
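<p>As a concrete (and deliberately simplified) illustration of the principle, the toy sketch below embeds one bit by adding a keyed pseudo-random ± pattern to pixel values and recovers it by correlating against the same pattern. This is a classic spread-spectrum construction for intuition only, not Meta’s production scheme:</p>

```python
import random

def embed(pixels, bit, key, strength=8):
    # Toy spread-spectrum watermark (illustrative only): add a keyed
    # pseudo-random +/-strength pattern whose sign encodes one bit,
    # spread redundantly across every pixel of the frame.
    rng = random.Random(key)
    pattern = [rng.choice([-1, 1]) for _ in pixels]
    sign = 1 if bit else -1
    return [p + sign * strength * c for p, c in zip(pixels, pattern)]

def detect(pixels, key):
    # Correlate against the same keyed pattern; the correlation's sign
    # recovers the embedded bit, and the redundancy across thousands of
    # pixels is what makes the signal survive moderate modification.
    rng = random.Random(key)
    pattern = [rng.choice([-1, 1]) for _ in pixels]
    mean = sum(pixels) / len(pixels)
    corr = sum((p - mean) * c for p, c in zip(pixels, pattern))
    return corr > 0

rng = random.Random(0)
frame = [rng.randint(0, 255) for _ in range(4096)]  # stand-in for one frame
marked = embed(frame, bit=1, key="provenance-key")
```

Production systems replace the fixed pattern with learned ML models so the mark survives geometric edits and re-encoding, but the embed-then-correlate shape of the problem is the same.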
<div class="jetpack-video-wrapper"><iframe title="Invisible Watermarking: Content Provenance for Videos at Scale | Wes Castro, Meta" width="1778" height="1000" src="https://www.youtube.com/embed/DOwt8ptgMAk?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Some Helpful Definitions</h2>
<p>Digital watermarking, steganography, and invisible watermarking are related concepts, but it’s important to understand their differences:</p>
<table border="1"><tbody><tr><td><strong>Feature</strong></td>
<td><strong>Digital Watermarking</strong></td>
<td><strong>Steganography</strong></td>
<td><strong>Invisible Watermarking</strong></td>
</tr><tr><td>Purpose</td>
<td>Content attribution, protection, provenance</td>
<td>Secret communication</td>
<td>Content attribution, protection, provenance</td>
</tr><tr><td>Visibility</td>
<td>Visible or invisible</td>
<td>Invisible</td>
<td>Invisible</td>
</tr><tr><td>Robustness against content modifications</td>
<td>Medium to high</td>
<td>Usually low</td>
<td>High (survives edits)</td>
</tr><tr><td>Payload / Message Capacity</td>
<td>Medium (varies)</td>
<td>Varies</td>
<td>Medium (e.g., &gt;64 bits)</td>
</tr><tr><td>Computational Cost</td>
<td>Low (visible) to high (invisible)</td>
<td>Varies</td>
<td>High (advanced ML models)</td>
</tr></tbody></table><h2>The Need for Robust Content Tagging</h2>
<p>In today’s digital landscape, where content is constantly shared, remixed, and even AI-generated, important questions arise:</p>
<h3>Who Published the Video First?</h3>
<p>In the photos in Figure 1, you can see two different user names, but there’s no visual indicator of who uploaded this image first. Invisible watermarking can help identify the first time a video was uploaded.</p>
<figure id="attachment_23206" aria-describedby="caption-attachment-23206" class="wp-caption alignnone c1"><img class="wp-image-23206" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-1a.png" alt="" width="700" height="701" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-1a.png 1078w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-1a.png?resize=914,916 914w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-1a.png?resize=768,769 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-1a.png?resize=1022,1024 1022w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-1a.png?resize=96,96 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-1a.png?resize=192,192 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23206" class="wp-caption-text">Figure 1: Screenshots of what appear to be the same video uploaded by two users on Instagram Reels.</figcaption></figure><h3>Is It Even a Real Image?</h3>
<p>With the rise of highly realistic generative AI (GenAI) videos, distinguishing between real and AI-generated content is increasingly challenging. Invisible watermarking can be used to infer whether content such as that in Figure 2 is AI-generated.</p>
<figure id="attachment_23203" aria-describedby="caption-attachment-23203" class="wp-caption alignnone c2"><img class="size-full wp-image-23203" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg" alt="" width="1999" height="1091" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg?resize=916,500 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg?resize=768,419 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg?resize=1024,559 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg?resize=1536,838 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-2.jpg?resize=192,105 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23203" class="wp-caption-text">Figure 2: An AI-generated image.</figcaption></figure><h3>What Camera Was Used? </h3>
<p>When encountering a compelling image or video like the one in Figure 3, people often wonder about the source and tools used for creation. An invisible watermark can carry this information directly.</p>
<figure id="attachment_23204" aria-describedby="caption-attachment-23204" class="wp-caption alignnone c3"><img class="wp-image-23204" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-3.png" alt="" width="434" height="700" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-3.png 600w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-3.png?resize=568,916 568w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-3.png?resize=96,155 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Invisible-Watermarking-Figure-3.png?resize=192,309 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23204" class="wp-caption-text">Figure 3: A screenshot from a video captured with Ray-Ban Meta glasses.</figcaption></figure><p>Traditional methods such as visual watermarks (which can be distracting) or metadata tags (which can be lost if a video is edited or re-encoded) do not address these challenges adequately and robustly. Due to its persistence and imperceptibility, invisible watermarking presents a superior alternative.</p>
<h2>The Scaling Journey: From GPUs to CPUs</h2>
<p>Earlier digital watermarking research (starting in the 1990s) employed digital signal-processing techniques (such as DCT and DWT) to modify an image’s spectral properties to hide imperceptible information. Although these methods proved highly effective for static images and were considered a “solved problem,” they are not adequately robust against the various types of geometric transformations and filtering we see in social media and other real-world applications. </p>
<p>Today’s state-of-the-art solutions (such as <a href="https://www.aidemos.meta.com/videoseal/" target="_blank" rel="noopener">VideoSeal</a>) use machine-learning (ML) techniques that provide significantly improved robustness against the types of edits seen on social media. However, applying these solutions to video (i.e., watermarking frame by frame) can be prohibitively expensive computationally without the necessary inference optimizations. </p>
<p>GPUs may seem an obvious solution for deploying ML-based video watermarking. However, most GPU hardware is specialized for the training and inference of large-scale models (such as LLMs and diffusion models) and has partial or no support for video transcoding (compression and decompression). Enabling invisible watermarking for videos has therefore posed unique challenges for our existing video-processing software (FFmpeg) and hardware stack: GPUs that lack video-transcoding capabilities, and custom video-processing accelerators that lack efficient ML-model inference capabilities.</p>
<h2>GPU Optimization Attempts and the Shift to CPUs</h2>
<p>Our embedding architecture uses FFmpeg with a custom filter to compute and apply invisible watermark masks to the videos. The filter acts as a reusable block that can be added easily to existing video processing pipelines. Migrating to a more optimal inference service for warmed-up models would mean sacrificing the flexibility of our FFmpeg filter, so for our application that was not an option.</p>
<p>Profiling our invisible watermarking filter revealed low GPU utilization. We implemented frame batching and threading in the filter, but these efforts yielded no significant improvements to latency or utilization. GPUs with hardware video encoders and decoders can more easily reach high throughput, but the GPUs available for our service lack video encoders, requiring frames to be sent back to the CPU for encoding. Here a software video encoder can end up being a major bottleneck for pipelines using low-complexity ML models on a GPU.</p>
<p>Specifically, we encountered three primary bottlenecks:</p>
<ul><li class="c4" aria-level="1"><strong>Data transfer overhead:</strong> Transferring high-resolution input video frames back and forth between CPUs and multiple GPUs posed challenges to thread and memory optimizations, yielding suboptimal GPU utilization. </li>
<li class="c4" aria-level="1"><strong>Inference latency:</strong> Processing multiple invisible watermarking requests across multiple GPUs in parallel on the same host led to a dramatic increase in inference latency.</li>
<li class="c4" aria-level="1"><strong>Model loading time:</strong> Despite the model’s small size, loading the model consumed a significant portion of the total processing time. Relying on FFmpeg prevented us from using warmed-up, pre-loaded models on the GPUs.</li>
</ul><p>Recognizing these limitations, we began investigating CPU-only inference. The embedder’s neural network architecture is more favorable to GPUs, and initial benchmarks showed that end-to-end (E2E) performance was more than two times slower on CPUs. By adjusting threading parameters for the encoder, decoder, and PyTorch, and optimizing sampling parameters used by the invisible watermarking filter, we saw significant improvements. </p>
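The kind of threading tuning described above can be sketched roughly as follows. This is purely illustrative: the function names, the default core split, and the FFmpeg invocation are hypothetical stand-ins (our custom watermarking filter is elided), but the `-threads` flag is standard FFmpeg and bounding PyTorch's intra-op pool alongside it is the general idea.

```python
def thread_budget(total_cores: int, model_share: float = 0.5) -> tuple[int, int]:
    """Split a host's cores between ML inference and the FFmpeg codecs.

    model_share is a tuning knob (hypothetical default): the fraction of
    cores reserved for the watermark model's intra-op thread pool (set
    via torch.set_num_threads); the rest go to the software
    encoder/decoder so neither pool starves the other.
    """
    model_threads = max(1, int(total_cores * model_share))
    codec_threads = max(1, total_cores - model_threads)
    return model_threads, codec_threads

def ffmpeg_cmd(src: str, dst: str, codec_threads: int) -> list[str]:
    """Build one FFmpeg invocation with explicit thread caps (to be run
    via subprocess). '-threads' before '-i' bounds the decoder pool;
    '-threads' before the output bounds the encoder pool. The custom
    watermarking filter itself is not public, so it is elided here.
    """
    return ["ffmpeg", "-y", "-threads", str(codec_threads), "-i", src,
            "-threads", str(codec_threads), dst]
```

On a 16-core host with the default split, `thread_budget(16)` reserves 8 threads for inference and 8 for the codecs; sweeping `model_share` per workload is the tuning step.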
<p>Ultimately, with properly tuned threading and embedding parameters, the E2E latency for running invisible watermarking on a CPU in a single process was within 5% of GPU performance. Crucially, we could run multiple FFmpeg processes in parallel on CPUs without increased latency. This breakthrough allowed us to calculate the capacity needed and achieve a more operationally efficient solution compared to a GPU-based solution.</p>
<p>To validate our CPU solution’s scalability in a distributed system, we conducted comprehensive load tests. Given a pool of CPU workers, we generated test traffic at increasing request rates to identify the peak performance point before per-request latency began to rise. For comparison, we used the same parameters with GPU inference on a pool of GPU workers with similar capabilities. The results confirmed that our CPU solution could perform at scale, comparable to our local test findings. This achievement allowed us to provision the required capacity with greater operational efficiency compared to a GPU-based approach.</p>
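The ramp methodology behind such a load test can be sketched as below. This is a simplified, hypothetical harness (not our production tooling): `handle_request` stands in for one watermarking job, each step fires one second's worth of requests concurrently, and the test stops at the first rate whose mean per-request latency exceeds the budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def find_peak_rate(handle_request, rates, latency_budget_s):
    """Return the highest request rate (req/s) whose mean per-request
    latency stays within budget, ramping through `rates` in order.

    Sketch of the load-test idea: per-request latency is flat until the
    worker pool saturates, then rises; the knee is the peak rate.
    """
    def timed_call(_):
        t0 = time.perf_counter()
        handle_request()
        return time.perf_counter() - t0

    best = None
    for rate in sorted(rates):
        with ThreadPoolExecutor(max_workers=rate) as pool:
            latencies = list(pool.map(timed_call, range(rate)))
        if sum(latencies) / len(latencies) <= latency_budget_s:
            best = rate  # still within budget; keep ramping
        else:
            break  # latency began to rise past budget
    return best
```

The same harness, pointed at a pool of GPU workers with identical parameters, gives the apples-to-apples comparison described above.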
<h2>Optimization Considerations and Trade-offs</h2>
<p>Deploying invisible watermarking at scale presented several optimization challenges, primarily involving trade-offs between four metrics:</p>
<ul><li class="c4" aria-level="1"><strong>Latency:</strong> The speed at which the watermarking process occurs</li>
<li class="c4" aria-level="1"><strong>Watermark detection bit-accuracy:</strong> The accuracy of detecting embedded watermarks</li>
<li class="c4" aria-level="1"><strong>Visual quality:</strong> Ensuring the embedded watermark is imperceptible to the human eye</li>
<li class="c4" aria-level="1"><strong>Compression efficiency (measured by</strong> <a href="https://ottverse.com/what-is-bd-rate-bd-psnr-calculation-interpretation/" target="_blank" rel="noopener"><strong>BD-Rate</strong></a><strong>)</strong>: Ensuring the embedded watermark does not significantly increase bitrate</li>
</ul><p>Optimizing for one metric may negatively impact others. For example, a stronger watermark for higher bit accuracy might lead to visible artifacts and increased bitrate. We can’t create a solution that is perfectly optimal for all four metrics simultaneously, so we have to balance them against one another.</p>
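One simple way to reason about these trade-offs is to collapse the four metrics into a single weighted score and compare candidate configurations. The sketch below is purely illustrative; the weights, signs, and normalizations are hypothetical, not our production tuning.

```python
def tradeoff_score(latency_s, bit_accuracy, visual_quality, bd_rate_pct,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Scalarize the four competing metrics for comparing configurations.

    bit_accuracy and visual_quality are benefits in [0, 1] (higher is
    better); latency and the BD-Rate regression (in percent) are costs
    (lower is better). Equal weights are an arbitrary starting point.
    """
    w_lat, w_acc, w_vq, w_bd = weights
    return (w_acc * bit_accuracy
            + w_vq * visual_quality
            - w_lat * latency_s
            - w_bd * bd_rate_pct / 100.0)
```

Under equal weights, a configuration with slightly lower bit accuracy but much better visual quality and BD-Rate can outscore a maximally strong watermark, which mirrors the trade-off described above.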
<h2>Managing BD-Rate Impact</h2>
<p>Invisible watermarking, while imperceptible, introduces increased entropy, which can lead to a higher bitrate for video encoders. Our initial implementation showed a BD-Rate regression of around 20%, meaning users would need more bandwidth to watch a watermarked video. To mitigate this, we devised a novel frame-selection method for watermarking so that the BD-Rate impact is largely reduced while increasing visual quality and minimally impacting watermark bit detection accuracy.</p>
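To illustrate the intuition behind frame selection (not our actual method, which is novel and more involved): marking only a strided subset of frames bounds the extra entropy handed to the encoder, while keeping enough marked frames for reliable detection. The stride and floor values below are hypothetical.

```python
def select_frames(num_frames, stride=4, min_marked=8):
    """Choose which frame indices receive the watermark mask.

    Marking every `stride`-th frame limits the bitrate (BD-Rate) cost of
    the added entropy; a floor of `min_marked` marked frames preserves
    detection on short clips by falling back to marking everything.
    """
    marked = list(range(0, num_frames, stride))
    if len(marked) < min_marked:
        marked = list(range(num_frames))
    return marked
```

For a 100-frame clip this marks 25 frames instead of 100, so the encoder only pays the watermark's entropy cost on a quarter of the frames.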
<h2>Addressing Regressions in Visual Quality </h2>
<p>We need to ensure the “invisible” watermark remains truly invisible. We initially observed noticeable visual artifacts despite high-quality metric scores (VMAF and SSIM).</p>
<p>We addressed the visual-quality evaluations by implementing a custom post-processing technique and iterating through different embedding settings through crowdsourced manual inspections. This subjective evaluation was crucial for unblocking us, as traditional visual quality metrics proved insufficient for detecting the type of artifact an invisible watermark can at times introduce. As we tuned the algorithm for human invisibility, we closely monitored the bit accuracy impact to achieve an optimal balance between visual quality and detection accuracy.</p>
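A minimal sketch of how such crowdsourced judgments might be aggregated per clip is shown below; the vote floor and pass ratio are hypothetical thresholds, not our actual evaluation criteria.

```python
def passes_invisibility_bar(votes, min_votes=5, pass_ratio=0.8):
    """Aggregate crowdsourced 'did you see an artifact?' judgments.

    votes: list of booleans, True meaning a rater saw an artifact.
    A clip passes only if enough raters weighed in and a large enough
    fraction of them saw nothing.
    """
    if len(votes) < min_votes:
        return False  # not enough signal to declare the clip clean
    clean_fraction = votes.count(False) / len(votes)
    return clean_fraction >= pass_ratio
```

Clips that fail this bar go back into the tuning loop (adjust embedding settings, re-render, re-rate), with bit accuracy monitored at each step as described above.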
<h2>Learnings and the Road Ahead</h2>
<p>Our journey to deploy a scalable, invisible watermarking solution provided valuable insights:</p>
<p><strong>With proper optimizations, CPU-only pipelines can reach performance comparable to GPU pipelines for specific use cases at much lower cost.</strong> Contrary to our initial assumptions, with the right optimizations CPUs offered us a more operationally efficient and scalable solution for our invisible-watermarking system. While GPUs are still faster for the invisible watermark model inference, we were able to use optimizations to bring down the overall compute and latency with the CPU fleet.</p>
<p><strong>Traditional video quality scores are insufficient for invisible watermarking:</strong> We learned that metrics like VMAF and SSIM do not fully capture the perceptual quality issues introduced by invisible watermarking, necessitating manual inspection. More research is needed to develop a metric to programmatically detect the visual-quality loss incurred by invisible watermarking.</p>
<p><strong>The quality bar for production use is high:</strong> Watermarking techniques may not directly apply to real-world use cases due to the impact on BD-Rate and downstream video compression. We needed to expand upon the literature to keep BD-Rate impacts low while maintaining excellent bit accuracy for detection.</p>
<p>We successfully shipped a scalable watermarking solution with excellent latency, visual quality, detection bit accuracy, and a minimal BD-Rate impact.</p>
<p>As our North Star goal, we aim to continue to improve the precision and copy-detection recall with invisible watermark detection. This will involve further tuning of model parameters, pre- and post-processing steps, and video encoder settings. Ultimately, we envision invisible watermarking as a lightweight “filter block” that can be seamlessly integrated into a wide range of video use cases without product-specific tweaks, providing minimal impact on the user experience while offering robust content provenance.</p>]]></description>
      <link>https://engineering.fb.com/2025/11/04/video-engineering/video-invisible-watermarking-at-scale/</link>
      <guid>https://engineering.fb.com/2025/11/04/video-engineering/video-invisible-watermarking-at-scale/</guid>
      <pubDate>Tue, 04 Nov 2025 19:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Scaling Privacy Infrastructure for GenAI Product Innovation]]></title>
      <description><![CDATA[<p>How does Meta empower its product teams to harness GenAI’s power responsibly? In this post, we delve into how Meta addresses the challenges of safeguarding data in the GenAI era by scaling its <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/">Privacy Aware Infrastructure (PAI)</a>, with a particular focus on Meta’s AI glasses as an example GenAI use case. </p>
<ul><li class="c1" aria-level="1">We’ll describe in detail the technology behind <a href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">data lineage</a>, explain how modern infrastructure supports privacy at scale, and discuss how these advances accelerate product innovation while keeping privacy at the core.</li>
<li class="c1" aria-level="1">AI glasses are only one of the latest examples of how generative AI (GenAI) has been driving a range of new product experiences across all our platforms at Meta.</li>
<li class="c1" aria-level="1">While GenAI enables new features like <a href="https://about.fb.com/news/2025/10/improving-your-recommendations-apps-ai-meta/">hyper-personalized recommendations</a> and <a href="https://about.fb.com/news/2025/04/introducing-meta-ai-app-new-way-access-ai-assistant/">responsive real-time assistants</a>, it is also reinforcing the importance of earning and maintaining user trust by maintaining and protecting user data.</li>
</ul><p>As AI products like our AI glasses ingest, process, and generate increasingly rich data, they also introduce new opportunities for embedding privacy into those processes. Our vision to empower our product teams to responsibly harness the power of GenAI is a bold one: We scale our <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/">Privacy Aware Infrastructure (PAI)</a> as a foundational backbone of AI innovation.</p>
<p>By empowering product teams with lineage insights and automated <a href="https://engineering.fb.com/2025/07/23/security/policy-zones-meta-purpose-limitation-batch-processing-systems/">privacy controls</a>, we accelerate GenAI product innovation while upholding user trust and privacy as foundational principles.</p>
<h2>The Key Privacy Challenges of GenAI</h2>
<p>We’ve encountered three primary challenges to ensuring privacy for GenAI:</p>
<ul><li class="c1" aria-level="1"><strong>Technological evolution and explosive data growth</strong>: The emergence of GenAI has introduced novel data types and dramatically increased data volumes, presenting new complexities in data observability and management.</li>
<li class="c1" aria-level="1"><strong>Shifting requirements landscape:</strong> Advancements in technology continually generate new privacy and compliance requirements. Our ability to remain competitive and innovative hinges on how swiftly we can adapt to these evolving demands.</li>
<li class="c1" aria-level="1"><strong>Accelerated innovation cycles:</strong> GenAI-powered features drive faster product development, necessitating infrastructure that can scale rapidly and enforce privacy controls automatically.</li>
</ul><p>Meta’s AI glasses integrate wearable technology with GenAI to deliver real-time information, personalized assistance, and creative capabilities—all contextualized to the wearer’s surroundings.</p>
<ul><li class="c1" aria-level="1"><strong>Real-time scene understanding</strong>: Meta’s AI glasses leverage advanced cameras and sensors to interpret your surroundings, enabling you to ask questions like, “What is that building?” or “Can you read this sign to me?” and receive instant, relevant answers. </li>
<li class="c1" aria-level="1"><strong>Contextual overlays</strong>: GenAI models deliver dynamic overlays and summaries, offering guidance and information tailored to your current location or activity for a more personalized experience.</li>
<li class="c1" aria-level="1"><strong>Natural and intuitive interactions</strong>: Innovative input methods such as the <a href="https://www.meta.com/emerging-tech/emg-wearable-technology/">Meta Neural Band</a> and advanced output technologies, like those featured in the <a href="https://about.fb.com/news/2025/09/meta-ray-ban-display-ai-glasses-emg-wristband/">Meta Ray-Ban Display</a> glasses, enable seamless and intuitive interactions and low-latency, full-duplex conversations that go beyond simple commands.</li>
</ul><p>Forward-looking use cases like these highlight the intricate data flows enabled by GenAI: continuous sensor inputs, real-time processing both on-device and in the cloud, and a dynamic feedback loop to the user. They also speak to our key challenges and underscore the need for robust, adaptable systems that prioritize privacy as GenAI continues to transform our products and data ecosystem.</p>
<p>At Meta, we tackle these challenges with integrated privacy via a scalable infrastructure that is deeply embedded from the ground up during product development.</p>
<p>For example, Figure 1 outlines how we use our PAI technologies to track and protect user interactions with the Meta AI app that happen through our AI glasses.</p>
<figure id="attachment_23217" aria-describedby="caption-attachment-23217" class="wp-caption alignnone c2"><img class="size-full wp-image-23217" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png" alt="" width="1999" height="728" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png?resize=916,334 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png?resize=768,280 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png?resize=1024,373 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png?resize=1536,559 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-1.png?resize=192,70 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23217" class="wp-caption-text">Figure 1: System context for AI glasses interactions with the Meta AI app.</figcaption></figure><h2>An Overview of Meta’s Privacy-Aware Infrastructure</h2>
<p>Meta’s PAI sits at the heart of our privacy strategy. PAI is a suite of infrastructure services, APIs, and monitoring systems designed to integrate privacy into every aspect of product development. </p>
<p>To address the challenges listed in the section above, PAI includes:</p>
<ul><li class="c1" aria-level="1"><strong>Enhanced observability:</strong> Automated data detection through advanced scanning and tagging identifies relevant data at the point of ingestion. This is further strengthened by data-lineage tracking, which maintains a real-time map of data origins, propagation paths, and usage—providing comprehensive visibility into how data flows across systems.</li>
<li class="c1" aria-level="1"><strong>Efficient privacy controls:</strong> Policy-enforcement APIs to programmatically enforce privacy constraints at the data storage, processing, and access layers. Policy automation that embeds regional and global requirements into automated checks and workflow constraints.</li>
<li class="c1" aria-level="1"><strong>Scalability:</strong> Supports thousands of microservices and product teams across Meta’s vast ecosystem.</li>
</ul><p>PAI empowers engineers to innovate while automatically ensuring policy adherence and safety. Figure 2 summarizes this lifecycle and highlights the Discover stage we focus on below.</p>
<figure id="attachment_23215" aria-describedby="caption-attachment-23215" class="wp-caption alignnone c2"><img class="size-full wp-image-23215" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png" alt="" width="1999" height="676" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png?resize=916,310 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png?resize=768,260 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png?resize=1024,346 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png?resize=1536,519 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png?resize=96,32 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-2.png?resize=192,65 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23215" class="wp-caption-text">Figure 2: Key privacy workflows.</figcaption></figure><h2>A Deep Dive Into the “Discover” Stage of Our PAI</h2>
<p>One of PAI’s most transformative technologies is our approach to data lineage at scale. Our data lineage system continuously tracks and maps data flows across the entire infrastructure. While we discussed the technical foundations in our prior blog post on <a href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">how Meta discovers data flows via lineage at scale</a>, here we’ll explore a new perspective — highlighting how we’ve adapted our lineage capabilities to meet the unique challenges of GenAI’s rapidly evolving environment.</p>
<p>Meta’s vast scale and diverse ecosystem of systems present significant challenges for observing data lineage. Our lineage solution must operate across millions of data and code assets, spanning hundreds of platforms and a wide array of programming languages. </p>
<p>Let’s take a look at how this works.</p>
<h3>Collect Cross-Stack Lineage for Interaction Data</h3>
<p>To maintain the privacy requirements for the data under consideration — for example, for user-interaction data from the scenario above with our AI glasses — we need a <em>complete map</em> of its movement. This traceability is what cross-stack lineage provides, as illustrated in Figure 3:</p>
<ul><li class="c1" aria-level="1"><strong>[A] Within web</strong>: We capture data flows as the interaction data enters Meta’s web servers and further downstream between web components with privacy probes, so we know exactly what is collected and how it’s processed.</li>
<li class="c1" aria-level="1"><strong>[B] Web -&gt; logger -&gt; warehouse</strong>: When the web processing persists data, lineage tracks the logger that writes to the data-warehouse tables. Then, when the data is batch-processed downstream, we parse logger configs, SQL queries, and processing logs to extract data lineage.</li>
<li class="c1" aria-level="1"><strong>[C] Web &lt;&gt; inference</strong>: For large language model (LLM) calls, we collect lineage signals at service/RPC boundaries; for example, which model checkpoints are invoked, what are the inputs, and what are the responses returned to the app.</li>
<li class="c1" aria-level="1"><strong>[D] Warehouse -&gt; training</strong>: Finally, lineage links warehouse tables into training jobs and the checkpoints they produce. This boundary is where we can enforce and demonstrate privacy requirements regarding the purposes that are allowed.</li>
</ul><figure id="attachment_23216" aria-describedby="caption-attachment-23216" class="wp-caption alignnone c3"><img class="size-full wp-image-23216" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png" alt="" width="1924" height="808" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png 1924w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png?resize=916,385 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png?resize=768,323 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png?resize=1024,430 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png?resize=1536,645 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png?resize=96,40 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-3.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23216" class="wp-caption-text">Figure 3: Cross-stack lineage.</figcaption></figure><p>PAI collects these lineage signals crossing all stacks, including web probes, logger, batch-processing lineage, RPC lineage, and training manifests. Together they form an end-to-end graph for interaction data. Figure 4 shows this graph. With this visibility, we can reason about privacy in concrete terms: We know exactly which systems are involved and which ones aren’t. That clarity is what enables us to enforce data flow at boundaries and prove policy adherence.</p>
<figure id="attachment_23214" aria-describedby="caption-attachment-23214" class="wp-caption alignnone c4"><img class="size-full wp-image-23214" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-4.png" alt="" width="1260" height="946" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-4.png 1260w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-4.png?resize=916,688 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-4.png?resize=768,577 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-4.png?resize=1024,769 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-4.png?resize=96,72 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-4.png?resize=192,144 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23214" class="wp-caption-text">Figure 4: End-to-end lineage for AI-glasses interaction.</figcaption></figure><h3>Building Comprehensive Lineage Observability</h3>
<p>A sound lineage-observability system must catch all actual data flows or I/O operations comprehensively when data is processed. To achieve that we:</p>
<ul><li class="c1" aria-level="1"><strong>C</strong><strong>apture and link all read operations to the write operation</strong>: When we write a data asset, we ensure that we log all relevant write operations with the same correlation key as the one used for the read operation. We perform this logging for both SQL and non-SQL queries, as well as when I/O operations occur in a distributed manner.</li>
<li class="c1" aria-level="1"><strong>Create a common privacy library to log data flow information</strong>: Our privacy library (PrivacyLib) is designed to initialize and propagate privacy policies, offer a generic abstraction for diverse operations (e.g. reads, writes, remote calls), and standardize extensions such as logging. Figure 5 illustrates how PrivacyLib is being used to link reads and writes across systems.</li>
<li class="c1" aria-level="1"><strong>Place library integration points in all involved data systems at Meta</strong>: We have integrated the library into all relevant data systems, implemented it in various programming languages, and ensured comprehensive coverage of I/O operations.</li>
</ul><figure id="attachment_23219" aria-describedby="caption-attachment-23219" class="wp-caption alignnone c2"><img class="size-full wp-image-23219" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png" alt="" width="1999" height="667" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png?resize=916,306 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png?resize=768,256 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png?resize=1024,342 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png?resize=1536,513 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png?resize=96,32 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-5.png?resize=192,64 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23219" class="wp-caption-text">Figure 5: Building lineage observability via PrivacyLib.</figcaption></figure><h3>From Lineage to Proof for AI Glasses</h3>
<p>Data lineage tells us which systems process AI-glasses-interaction data. Based on that, we can protect the data in the following manner:</p>
<ul><li class="c1" aria-level="1">We use lineage to guide the placement of <a href="https://engineering.fb.com/2025/07/23/security/policy-zones-meta-purpose-limitation-batch-processing-systems/">Policy Zones</a>, protecting interaction data. </li>
<li class="c1" aria-level="1">We start the training job for a model using a data asset in this zone only if all training-data assets are permitted for this purpose; otherwise, we remediate it.</li>
<li class="c1" aria-level="1">Finally, our verifiers watch these edges over time, so that any new or changed data-processing jobs are identified early during feature development.</li>
</ul><p>As shown in Figure 6, this set of workflows is how we transform lineage into protection: place Policy Zones, block boundary crossings, and continuously prove it.</p>
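The training-job gate in the second bullet can be modeled minimally as below. The asset names and the permission map are hypothetical; it simply expresses the rule that a job starts only if every input asset is permitted for the job's purpose.

```python
def can_start_training(training_assets, zone_permitted, purpose):
    """Gate a training job on its inputs' Policy Zone permissions.

    zone_permitted maps asset name -> set of purposes it may be used
    for. Returns (ok, blocked): ok is True only if no asset blocks the
    job; blocked lists the assets needing remediation otherwise.
    """
    blocked = [asset for asset in training_assets
               if purpose not in zone_permitted.get(asset, set())]
    return (len(blocked) == 0, blocked)
```

In practice the blocked list is what drives remediation: either the offending asset is removed from the training set, or its permitted purposes are re-evaluated before the job is allowed to run.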
<figure id="attachment_23218" aria-describedby="caption-attachment-23218" class="wp-caption alignnone c5"><img class="size-full wp-image-23218" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-6.png" alt="" width="1200" height="1148" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-6.png 1200w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-6.png?resize=916,876 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-6.png?resize=768,735 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-6.png?resize=1024,980 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-6.png?resize=96,92 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-PAI-for-GenAI-Figure-6.png?resize=192,184 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23218" class="wp-caption-text">Figure 6: From lineage to proof via Policy Zones.</figcaption></figure><h2>Zooming out: Privacy-Safe Data Across the GenAI Lifecycle</h2>
<p>Scaling privacy from early prototypes to global rollouts requires infrastructure that adapts across products, regions, and evolving AI capabilities. PAI’s <a href="https://engineering.fb.com/2025/04/28/security/how-meta-understands-data-at-scale/">data understanding</a>, <a href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">data flow lineage</a>, and <a href="https://engineering.fb.com/2025/07/23/security/policy-zones-meta-purpose-limitation-batch-processing-systems/">policy enforcement</a> facilitate safe and conformant data flows. This infrastructure enables Meta to launch products such as our AI glasses confidently at a global scale, providing users with rich, personalized experiences powered by GenAI, while ensuring transparent and verifiable privacy guarantees.</p>
<h2>Key Takeaways: How PAI Scales GenAI Privacy</h2>
<p>Meta’s approach to privacy is straightforward: scale the infrastructure, not just the rules. By embedding PAI technologies including data lineage into the stack, Meta empowers engineers to deliver the next wave of GenAI products safely, quickly, and globally. </p>
<ul><li class="c1" aria-level="1">Lightning-fast advancements from GenAI and its powered products bring new privacy and policy challenges that require rapid development of privacy-aware infrastructure.</li>
<li class="c1" aria-level="1">Privacy Aware Infrastructure (PAI) provides reusable workflows (Understand -&gt; Discover -&gt; Enforce -&gt; Demonstrate) that scale privacy enforcement for GenAI products as well.</li>
<li class="c1" aria-level="1">Scalable data lineage technology facilitates privacy controls by giving us auditable, real-time insight into every data flow.</li>
<li class="c1" aria-level="1">Automatic guardrails and instant development feedback help product teams move faster and safer, with lower friction..</li>
</ul><h2>As GenAI Evolves, So Does Privacy</h2>
<p>Scaling privacy for GenAI is an ongoing journey. As AI capabilities advance, so do the complexity and expectations around privacy protection. Meta’s PAI is evolving in step—integrating smarter lineage analysis and increasingly developer-friendly tools to meet these new demands.</p>
<p>As GenAI ushers in the next era of digital experiences, our focus on privacy remains strong. By scaling privacy infrastructure as a product enabler, not a barrier, Meta is laying the groundwork for responsible AI-product innovation.</p>
<p>Interested in learning more? Follow <a href="https://www.facebook.com/Engineering">the Engineering at Meta blog on Facebook</a> and stay engaged in the evolving dialogue on infrastructure for responsible innovation.</p>
<h2>Acknowledgements</h2>
<p><em>The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in developing privacy infrastructure over the years. In particular, we would like to extend special thanks to (in last name alphabetical order):</em> <em>Taha Bekir Eren, Abhishek Binwal, Sergey Doroshenko, Rajkishan Gunasekaran, Ranjit Gupta, Jason Hendrickson, Kendall Hopkins, Aleksandar Ilic, Gabriela Jacques da Silva, Anuja Jaiswal, Joel Krebs, Vasileios Lakafosis, Tim LaRose, Yang Liu, Rishab Mangla, Komal Mangtani, Diana Marsala, Sushaant Mujoo, Andrew Nechayev, Alex Ponomarenko, Peter Prelich, Ramnath Krishna Prasad, Benjamin Renard, Hannes Roth, Christy Sauper, David Taieb, Vitalii Tsybulnyk, Pieter Viljoen, Lucas Waye, Yizhou Yan, Danlei Yang, Hanzhi Zhang, and Adrian Zgorzalek. </em></p>
<p><em>We would also like to express our gratitude to all reviewers of this post, including (in last name alphabetical order):</em> <em>Albert Abdrashitov, Jennifer Billock, Jordan Fieulleteau, Ahmed Fouad, Angie Galloway, Xenia Habekoss, Kati London, Koosh Orandi, Brianna O’Steen, Zef RosnBrick, Tobias Speckbacher, and Emil Vazquez.</em></p>
<p><em>We would like to especially thank Jonathan Bergeron for overseeing the effort and providing all of the guidance and valuable feedback, and Supriya Anand and Chloe Lu for pulling required support together to make this blog post happen.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/10/23/security/scaling-privacy-infrastructure-for-genai-product-innovation/</link>
      <guid>https://engineering.fb.com/2025/10/23/security/scaling-privacy-infrastructure-for-genai-product-innovation/</guid>
      <pubDate>Thu, 23 Oct 2025 10:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Disaggregated Scheduled Fabric: Scaling Meta’s AI Journey]]></title>
<description><![CDATA[<ul><li class="c1" aria-level="1">Disaggregated Scheduled Fabric (DSF) is Meta’s next-generation network fabric technology for AI training networks that addresses the challenges of existing Clos-based networks.</li>
<li class="c1" aria-level="1">We’re sharing the challenges and innovations surrounding DSF and discussing future directions, including the creation of mega clusters through DSF and non-DSF region interconnectivity, as well as the exploration of alternative switching technologies.</li>
</ul><p><a href="https://engineering.fb.com/2025/10/13/data-infrastructure/ocp-summit-2025-the-open-future-of-networking-hardware-for-ai/">Disaggregated Scheduled Fabric (DSF)</a> is Meta’s next-generation network fabric. The GenAI boom has created a surge in demand for high-performance, low-latency, and lossless AI networks to support training AI models at large scale. DSF helps us build scalable AI networks by breaking the physical limits of the traditional monolithic chassis-switch architecture. By disaggregating line cards and fabric cards into distinct, interconnected hardware devices, the DSF network creates a distributed system that offers scalability and performance for AI networks. </p>
<p>DSF is a VOQ-based system powered by the open <a href="https://github.com/opencomputeproject/SAI">OCP-SAI</a> standard and <a href="https://engineering.fb.com/2018/09/04/data-infrastructure/research-in-brief-building-switch-software-at-scale-and-in-the-open/">FBOSS</a> with a modular architecture designed to optimize load balancing and congestion control, ensuring high performance for both intra and inter-cluster traffic. </p>
<p>With DSF we’ve already been able to build increasingly larger clusters that interconnect thousands of GPUs in a data center region. </p>
<div class="jetpack-video-wrapper"><iframe title="Scaling AI Network with DSF - Live from SCC" width="1778" height="1000" src="https://www.youtube.com/embed/aiThXaT48Y8?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Background: Our Challenges With Traditional IP Fabric</h2>
<p>While running training jobs over traditional IP fabric, we faced several challenges. These problems were specific to training applications that use remote direct memory access (RDMA) technology, which uses the UDP protocol to exchange data. </p>
<p>We encountered these three types of problems:</p>
<ul><li class="c1" aria-level="1"><strong>Elephant flows</strong>: AI workloads tend to have long-duration, heavy-traffic flows that can congest the fabric links they hash onto and create head-of-line blocking. </li>
<li class="c1" aria-level="1"><strong>Low entropy:</strong> Depending on the number of GPUs involved in the collective operations, the number of IP flows could be lower, which results in inefficient hashing and, possibly, in congestion, despite the availability of adequate capacity in the fabric.</li>
<li class="c1" aria-level="1"><strong>Suboptimal fabric utilization:</strong> We have observed that, as a combined effect, there is a large skew in the bandwidth utilization of fabric links. This matters because it determines how much we must overprovision the fabric to support good pacing and maintain steady performance in the event of failures.</li>
</ul><p>We tried several solutions to handle these issues, but each presented challenges. For example, we created Border Gateway Protocol (BGP) policies such that when traffic is received from accelerators via leaf switches, it is pinned to a specific uplink, depending on its destination. This alleviated the problem of low entropy in steady state but didn’t handle failure scenarios where the fallback was equal-cost multipath (ECMP) routing.</p>
<p>We also tried load-aware ECMP schemes that could handle fat flows and low entropy, but they were difficult to tune and created out-of-order packets, which is detrimental to RDMA communication.</p>
<p>We also created a traffic-engineering solution that would pre-compute the flow pattern depending on the models used and configure the leaf switches before the job starts. This could handle fat flows and low entropy but grew too complex as network size increased. And due to its centralized nature, this set-up was slow to react to failures.</p>
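<p>To make the low-entropy problem concrete, here is a minimal, hypothetical simulation (not Meta’s code) of static hash-based ECMP: a handful of elephant flows hashed onto eight equal-cost uplinks. With so few flows, collisions routinely leave some links carrying multiple elephants while others sit idle.</p>

```python
import hashlib

def ecmp_pick(flow_key, n_links):
    # Static ECMP: hash the flow key and pick one of n_links uplinks.
    digest = hashlib.sha256(repr(flow_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

def link_load(flows, n_links):
    # Sum the traffic each uplink carries after hashing every flow.
    load = [0] * n_links
    for key, gbytes in flows:
        load[ecmp_pick(key, n_links)] += gbytes
    return load

# Eight 100-GB elephant flows (illustrative src/dst/port keys) over
# eight uplinks: low entropy means hash collisions skew the load.
flows = [((f"10.0.0.{i}", f"10.0.1.{i}", 4791), 100) for i in range(8)]
print(link_load(flows, 8))
```

With thousands of short flows the law of large numbers evens this out; with a handful of long-lived RDMA flows it does not, which is exactly the skew described above.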
<h2>A Primer on Disaggregated Scheduled Fabric </h2>
<p>The idea behind DSF stems from the aforementioned characteristics of AI training workloads, particularly their tendency to generate “elephant flows” (extraordinarily large, continuous data streams) and “low entropy” traffic patterns that exhibit limited variation in flow and result in hash collisions and sub-optimal load distribution across network paths. The fundamental innovation of DSF lies in its two-domain architecture, which separates the network into the Ethernet domain, where servers and traditional networking protocols operate, and the “fabric” domain, where packets are broken into cells, sprayed across the fabric, and reassembled in hardware before being delivered back to the Ethernet domain.</p>
<p>DSF is built on two components: interface nodes (INs), also referred to as rack distributed switches (RDSWs), and fabric nodes (FNs), known as fabric distributed switches (FDSWs). INs serve as the network-facing components that handle external connectivity and routing functions, and that interface with the broader data center infrastructure. FNs operate as internal switching elements dedicated to high-speed traffic distribution across the fabric without requiring Layer 3 routing capabilities. </p>
<p>To the external network infrastructure, this distributed collection of INs and FNs appears as a single, unified switch, with the total number of external ports equivalent to the aggregate of all external ports across all INs, effectively creating a virtual chassis switch that scales far beyond the physical limitations of traditional designs. The control plane that orchestrates this distributed system is built upon Meta’s <a href="https://engineering.fb.com/2018/09/04/data-infrastructure/research-in-brief-building-switch-software-at-scale-and-in-the-open/">FBOSS</a>, an open-source network operating system that supports the multi-ASIC control requirements of disaggregated fabrics. Its communication with the FBOSS State Database (FSDB) enables real-time state synchronization across nodes.</p>
<p>DSF achieves traffic management by packet spraying and a credit-based congestion-control algorithm. Unlike conventional Ethernet fabrics that rely on hash-based approaches, DSF utilizes packet spraying that distributes traffic across all available paths through the fabric. This is enabled by the hardware’s ability to reassemble packet cells at the interface nodes within the fabric domain while ensuring in-order delivery to end hosts. </p>
<p>This packet-spraying capability is orchestrated through a credit-based allocation scheme where ingress INs dynamically request credit tokens from egress INs, allowing the system to make real-time decisions based on current path availability, congestion levels, and bandwidth utilization. Virtual output queuing (VOQ) helps ensure lossless delivery throughout this process: incoming packets are directed to virtual output queues targeting specific destination ports and service classes, and each queue is scheduled independently for transmission, providing fine-grained traffic management that accommodates the requirements of AI workloads and their communication patterns.</p>
<p>This approach allows DSF to achieve near-optimal load balancing across all available network paths, effectively utilizing the full bandwidth capacity of the fabric. It provides the flexibility to handle mixed traffic patterns and adapt to dynamic network conditions without requiring manual reconfiguration or traffic engineering.</p>
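<p>The credit loop described above can be illustrated with a toy Python model (a sketch under simplifying assumptions, not the actual hardware scheduler): ingress VOQs request credits from an egress port, and the egress grants them no faster than its drain rate, so it can never be oversubscribed.</p>

```python
from collections import deque

class EgressPort:
    """Egress IN port: grants credits no faster than its drain rate."""
    def __init__(self, credits_per_tick):
        self.credits_per_tick = credits_per_tick
        self.pending = deque()                 # VOQs waiting for a credit

    def request(self, voq):
        self.pending.append(voq)

    def tick(self):
        # Grant at most credits_per_tick credits this tick.
        for _ in range(min(self.credits_per_tick, len(self.pending))):
            self.pending.popleft().granted += 1

class IngressVOQ:
    """Per-destination virtual output queue on an ingress IN."""
    def __init__(self, egress):
        self.egress = egress
        self.queue = deque()
        self.granted = 0
        self.sent = 0

    def enqueue(self, pkt):
        self.queue.append(pkt)
        self.egress.request(self)              # ask the egress for a credit

    def tick(self):
        # Transmit (spray as cells) only what granted credits allow.
        while self.granted and self.queue:
            self.queue.popleft()
            self.granted -= 1
            self.sent += 1

# Two ingress VOQs contend for one egress that drains 1 packet per tick.
eg = EgressPort(credits_per_tick=1)
a, b = IngressVOQ(eg), IngressVOQ(eg)
for _ in range(3):
    a.enqueue("pkt")
    b.enqueue("pkt")
for _ in range(6):
    eg.tick()
    a.tick()
    b.tick()
print(a.sent, b.sent)  # 3 3 -- the egress was never oversubscribed
```

Because packets leave ingress only against a granted credit, congestion shows up as queueing at the (deep-buffered) ingress VOQs rather than as loss inside the fabric.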
<h2>DSF Fabric for GenAI Applications</h2>
<h3>DSF Fabric (GenAI) </h3>
<p>Using the DSF technology, we built a massive cluster that interconnects thousands of GPUs within a data center region. Figure 1 illustrates the network topology of a single AI zone that is a building block for the larger cluster.</p>
<figure id="attachment_23151" aria-describedby="caption-attachment-23151" class="wp-caption alignnone c2"><img class="size-full wp-image-23151" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png" alt="" width="1968" height="1654" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png 1968w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png?resize=916,770 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png?resize=768,645 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png?resize=1024,861 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png?resize=1536,1291 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png?resize=96,81 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-1.png?resize=192,161 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23151" class="wp-caption-text">Figure 1: A building block of a single DSF L1 zone.</figcaption></figure><p>An AI zone contains multiple scaling units, shown in Figure 1 as “SUx.” A scaling unit is a grouping of GPU racks connected to the RDSWs within it. All the RDSWs within the AI zone are connected via a common layer of FDSWs. RDSWs are powered by deep-buffer Jericho3-AI chips, while FDSWs use Ramon3 chips. FBOSS is the network operating system for all the roles in this topology. We are using 2x400G FR4 optics for RDSW-FDSW connections.</p>
<p>The GPU to RDSW connections are rail optimized, which benefits hierarchical collectives like allreduce and allgather, both of which are latency sensitive.</p>
<p>To support high GPU scale in a single AI zone, we create two identical network planes. This is called a DSF L1 zone and is a building block for larger GenAI clusters, as we will see in the next section.</p>
<h3>DSF Dual-Stage Fabric (GenAI)</h3>
<p>As depicted in Figure 2 (below) we interconnected 4x DSF L1 zones through a second stage of spine DSF switches (SDSWs). SDSWs use the same hardware as FDSWs and aggregate DSF L1 zones, enabling them to act as a single DSF fabric. This is a non-blocking topology providing an interconnected GPU scale of 18K x 800G GPUs. </p>
<figure id="attachment_23152" aria-describedby="caption-attachment-23152" class="wp-caption alignnone c3"><img class="size-full wp-image-23152" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png" alt="" width="1999" height="1002" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png?resize=916,459 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png?resize=768,385 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png?resize=1024,513 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png?resize=1536,770 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png?resize=96,48 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-2.png?resize=192,96 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23152" class="wp-caption-text">Figure 2: A DSF L2 zone with a second stage of SDSW interconnecting four L1 zones.</figcaption></figure><p>All RDSWs in this topology maintain fully meshed FSDB sessions to exchange information such as IPv6 neighbor states. An additional feature, Input Balanced Mode, is enabled over this fabric to balance reachability information across the layers so that, in case of failures, congestion is avoided over the fabric and spine layers. This feature is explained in a separate section below. We call this topology the DSF L2 zone.</p>
<h3>DSF Region (GenAI)</h3>
<p>To achieve a larger interconnected GPU scale, we connected 5x DSF L2 zones via an L3 super-spine layer. (See Figure 3 below.) We did this by using a special edge point of delivery (PoD) in each of the buildings. Edge PoDs consist of 40 FDSWs and 128 edge DSF switches (EDSWs). From a hardware point of view, an EDSW is the same as an RDSW, but it differs in its function of providing connectivity to the L3 super spine.</p>
<p>Each EDSW connects to four super-spine devices using 4x800G links to each, provisioning a total of 2k x 800G ports per edge PoD. </p>
<p>Because of the way training models are sharded, we don’t expect much traffic to transit the L3 super-spine layer; hence, an oversubscription of 4.5:1 is sufficient.</p>
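<p>A quick back-of-the-envelope check of the edge-PoD figures quoted above (reading “4x800G links” as four links to each of the four super-spine devices is our assumption):</p>

```python
# Sanity-check the edge-PoD uplink count from the numbers in the text.
# Assumption: "4x800G links" means 4 links per super-spine device.
edsws_per_pod = 128
superspine_devices_per_edsw = 4
links_per_superspine_device = 4          # 4x800G

uplinks_per_edsw = superspine_devices_per_edsw * links_per_superspine_device
uplinks_per_pod = edsws_per_pod * uplinks_per_edsw
print(uplinks_per_pod)  # 2048 -- the "2k x 800G ports per edge PoD"
```
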
<p>This creates an L3 interconnect, which means we need to exchange the routing information. We created iBGP sessions with EDSW and all RDSWs within the building, with BGP add-path enabled such that RDSWs learn aggregates via all 2k next-hops.  </p>
<p>eBGP is used between EDSW and the L3 super spine, and only aggregates are exchanged over BGP peerings. </p>
<figure id="attachment_23153" aria-describedby="caption-attachment-23153" class="wp-caption alignnone c3"><img class="size-full wp-image-23153" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png" alt="" width="1999" height="1144" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png?resize=916,524 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png?resize=768,440 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png?resize=1024,586 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png?resize=1536,879 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png?resize=96,55 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-3.png?resize=192,110 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23153" class="wp-caption-text">Figure 3: An L3 super spine connecting five DSF L2 zones.</figcaption></figure><p>Because an L3 super spine is used, some of the earlier problems, including low entropy and fat flows, tend to reappear; however, because there is much less traffic at this network tier, they are far less pronounced.</p>
<h2>Input Balanced Mode</h2>
<p>Input Balanced Mode is a critical feature that supports balanced traffic throughout the network in the face of remote link failures. The feature avoids severe congestion on the fabric and spine layer of the DSF network.</p>
<h3>Mechanism</h3>
<p>The purpose of Input Balanced Mode is to ensure that every DSF device has <em>equal or less input BW compared to output BW</em>. No oversubscription should occur in the network, even in the case of remote link failure. Devices experiencing link failure propagate the reduced reachability information across the cluster, notifying other devices to send proportionally less traffic to the affected device. </p>
<figure id="attachment_23154" aria-describedby="caption-attachment-23154" class="wp-caption alignnone c3"><img class="size-full wp-image-23154" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png" alt="" width="1999" height="786" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png?resize=916,360 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png?resize=768,302 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png?resize=1024,403 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png?resize=1536,604 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-4.png?resize=192,75 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23154" class="wp-caption-text">Figure 4: A mini-scale DSF network with two clusters connected by two SDSWs.</figcaption></figure><p><em>Note: For clarity, in Figure 4, FDSW/SDSW are simplified to only show one virtual device. The above graph will be used to illustrate two different link failures and mechanisms.</em></p>
<h4>RDSW&lt;-&gt;FDSW Link Failure</h4>
<p>In the case of an RDSW&lt;-&gt;FDSW link failure, the RDSW loses connectivity to the FDSW, and with it both input and output capacity on the link. The FDSW likewise loses connectivity to the RDSW and stops advertising reachability toward it. In Figure 5 (below), FDSW1 in Cluster X loses its connection to RDSW3, so it stops advertising that reachability to SDSW0 and SDSW1.</p>
<figure id="attachment_23155" aria-describedby="caption-attachment-23155" class="wp-caption alignnone c3"><img class="size-full wp-image-23155" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png" alt="" width="1999" height="764" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png?resize=916,350 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png?resize=768,294 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png?resize=1024,391 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png?resize=1536,587 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-5.png?resize=192,73 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23155" class="wp-caption-text">Figure 5: Link failure in Cluster X and propagation towards SDSW.</figcaption></figure><p>From SDSW0’s perspective, it receives no reachability to RDSW3 from FDSW1 in Cluster X, but still has reachability to RDSW3 through FDSW0. (See Figure 6.) Toward destination RDSW3 in Cluster X, the input capacity of 4 (FDSW0 and FDSW1 from Cluster X-1) is greater than the output capacity of 2 (FDSW0 in Cluster X). To avoid oversubscription, SDSW0 will pick two input links and stop advertising reachability toward RDSW3 in Cluster X. The same sequence will also take place in SDSW1.</p>
<figure id="attachment_23156" aria-describedby="caption-attachment-23156" class="wp-caption alignnone c3"><img class="size-full wp-image-23156" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png" alt="" width="1999" height="776" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png?resize=916,356 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png?resize=768,298 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png?resize=1024,398 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png?resize=1536,596 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-6.png?resize=192,75 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23156" class="wp-caption-text">Figure 6: Input Balanced Mode kicks in and stops advertising reachability to FDSWs in Cluster X-1.</figcaption></figure><p>The link selection for balanced input mode should be randomized. As shown in Figure 7 (below), for simplicity’s sake, assume SDSW0 stops advertising reachability to FDSW0, and SDSW1 stops advertising reachability to FDSW1. Both FDSW0 and FDSW1 have an input capacity of 4 but an output capacity of 2, hence randomly selecting two links on each device to not advertise reachability.</p>
<figure id="attachment_23157" aria-describedby="caption-attachment-23157" class="wp-caption alignnone c3"><img class="size-full wp-image-23157" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png" alt="" width="1999" height="770" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png?resize=916,353 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png?resize=768,296 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png?resize=1024,394 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png?resize=1536,592 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-7.png?resize=192,74 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23157" class="wp-caption-text">Figure 7: FDSWs in Cluster X-1 stop advertising reachability to RDSWs.</figcaption></figure><p>Assume FDSW0 randomly selects links to RDSW0 and RDSW1, while FDSW1 randomly selects links to RDSW2 and RDSW3. This completes the propagation of link failure, resulting in RDSWs in Cluster X-1 having 50% capacity to forward traffic toward RDSW3 in Cluster X.</p>
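<p>The withdrawal logic in this walkthrough can be sketched in a few lines of Python (a simplified model, not FBOSS code: one unit of capacity per link, and the link names are hypothetical): randomly chosen input advertisements are dropped until input capacity no longer exceeds output capacity toward the destination.</p>

```python
import random

def balance_input(input_links, output_capacity, rng=random.Random(0)):
    """Withdraw reachability on randomly chosen input links until the
    remaining input capacity no longer exceeds the output capacity
    toward a destination (one unit of capacity per link)."""
    active = list(input_links)
    withdrawn = []
    while len(active) > output_capacity:
        victim = rng.choice(active)
        active.remove(victim)
        withdrawn.append(victim)   # stop advertising reachability here
    return active, withdrawn

# SDSW0 after the RDSW3<->FDSW1 failure in Figure 6: four input links
# from Cluster X-1, but only two output links left toward RDSW3.
active, withdrawn = balance_input(
    ["X-1/FDSW0-a", "X-1/FDSW0-b", "X-1/FDSW1-a", "X-1/FDSW1-b"], 2)
print(len(active), len(withdrawn))  # 2 2
```

The random victim selection mirrors the randomized link selection described above, which spreads the withdrawn capacity evenly across upstream devices instead of concentrating it on one.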
<h4>FDSW&lt;-&gt;SDSW Link Failure</h4>
<p>Upon FDSW&lt;-&gt;SDSW link failure, there are two directions to propagate the reduced capacity: 1) on FDSW, reduce input capacity from RDSW, and 2) on SDSW, reduce input capacity from FDSWs in other clusters. (See Figure 8.)</p>
<figure id="attachment_23158" aria-describedby="caption-attachment-23158" class="wp-caption alignnone c3"><img class="size-full wp-image-23158" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png" alt="" width="1999" height="789" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png?resize=916,362 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png?resize=768,303 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png?resize=1024,404 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png?resize=1536,606 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-8.png?resize=192,76 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23158" class="wp-caption-text">Figure 8: Link Failure between SDSW1 and FDSW1 in Cluster X.</figcaption></figure><h5>FDSW Propagation</h5>
<p>Consider the traffic egressing out of Cluster X through FDSW1 (see Figure 9): From FDSW1’s perspective, input capacity is 4 (from RDSW0-RDSW3) while output capacity is reduced to 3 due to the link failure. To balance input capacity, FDSW1 will randomly pick one FDSW&lt;-&gt;RDSW link and stop advertising reachability to ALL destinations outside of the cluster.</p>
<figure id="attachment_23159" aria-describedby="caption-attachment-23159" class="wp-caption alignnone c3"><img class="size-full wp-image-23159" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png" alt="" width="1999" height="768" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png?resize=916,352 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png?resize=768,295 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png?resize=1024,393 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png?resize=1536,590 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-9.png?resize=192,74 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23159" class="wp-caption-text">Figure 9: FDSW1 in Cluster X stops advertising reachability to RDSW2.</figcaption></figure><p>Assume Cluster X FDSW1 randomly picks the link to RDSW2. It will stop advertising reachability to all RDSWs in Cluster X-1. Note that the same link can still be utilized for intra-cluster traffic, as it has full reachability to RDSWs in Cluster X.</p>
<h4>SDSW Propagation</h4>
<p>Consider traffic ingressing into Cluster X through SDSW1 (see Figure 10): From SDSW1’s perspective, input capacity is 4 (from FDSW0 and FDSW1 in Cluster X-1), while due to the link failure, output capacity is 3. SDSW1 will randomly pick one link towards Cluster X-1 and stop advertising reachability to all RDSWs in Cluster X.</p>
<figure id="attachment_23160" aria-describedby="caption-attachment-23160" class="wp-caption alignnone c3"><img class="size-full wp-image-23160" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png" alt="" width="1999" height="791" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png?resize=916,362 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png?resize=768,304 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png?resize=1024,405 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png?resize=1536,608 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-10.png?resize=192,76 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23160" class="wp-caption-text">Figure 10: SDSW1 stops advertising reachability to FDSW0 in Cluster X-1.</figcaption></figure><p>A similar calculation will take place on FDSW0 in Cluster X-1, resulting in Cluster X-1 FDSW0 randomly picking one link and stopping advertising reachability to all RDSWs in Cluster X. (See Figure 11 below) This completes the propagation, leading to RDSW1 in Cluster X-1 losing one link to forward traffic toward Cluster X.</p>
<figure id="attachment_23161" aria-describedby="caption-attachment-23161" class="wp-caption alignnone c3"><img class="size-full wp-image-23161" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png" alt="" width="1999" height="788" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png?resize=916,361 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png?resize=768,303 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png?resize=1024,404 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png?resize=1536,605 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-11.png?resize=192,76 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23161" class="wp-caption-text">Figure 11: Input Balanced Mode propagation from FDSW0 to RDSW1 in Cluster X-1.</figcaption></figure><h3>FDSW&lt;-&gt;SDSW and RDSW&lt;-&gt;FDSW Link Failure</h3>
<p>Figure 12 illustrates another example of link failures occurring in between FDSW &lt;-&gt; SDSW, as well as RDSW &lt;-&gt; FDSW. The reduced reachability will propagate and then converge in both directions.</p>
<ol><li class="c1" aria-level="1">FDSW&lt;-&gt;SDSW link failure.</li>
<li class="c1" aria-level="1">RDSW&lt;-&gt;FDSW link failure.</li>
</ol><figure id="attachment_23162" aria-describedby="caption-attachment-23162" class="wp-caption alignnone c3"><img class="size-full wp-image-23162" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png" alt="" width="1999" height="785" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png?resize=916,360 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png?resize=768,302 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png?resize=1024,402 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png?resize=1536,603 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-12.png?resize=192,75 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23162" class="wp-caption-text">Figure 12: Link failures in both FDSW&lt;-&gt;RDSW and SDSW&lt;-&gt;FDSW.</figcaption></figure><h4>FDSW Propagation for FDSW&lt;-&gt;SDSW Link Failure </h4>
<p>Similar to the FDSW propagation above, FDSW1 in Cluster X will randomly pick one connected RDSW and advertise no reachability to devices towards Cluster X-1. (See Figure 13 below.) </p>
<figure id="attachment_23163" aria-describedby="caption-attachment-23163" class="wp-caption alignnone c3"><img class="size-full wp-image-23163" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png" alt="" width="1999" height="763" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png?resize=916,350 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png?resize=768,293 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png?resize=1024,391 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png?resize=1536,586 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-13.png?resize=192,73 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23163" class="wp-caption-text">Figure 13: FDSW1 in Cluster X stops advertising reachability to RDSW2.</figcaption></figure><h4>SDSW propagation for FDSW&lt;-&gt;SDSW Link Failure</h4>
<p>Similar to the SDSW propagation above, SDSW1 will randomly pick one link towards cluster X-1 and propagate no reachability to Cluster X. Imagine SDSW1 picks one of the links connecting FDSW0 in cluster X-1.</p>
<figure id="attachment_23164" aria-describedby="caption-attachment-23164" class="wp-caption alignnone c3"><img class="size-full wp-image-23164" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png" alt="" width="1999" height="756" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png?resize=916,346 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png?resize=768,290 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png?resize=1024,387 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png?resize=1536,581 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png?resize=96,36 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-14.png?resize=192,73 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23164" class="wp-caption-text">Figure 14: SDSW1 stops advertising reachability to FDSW0 in Cluster X-1.</figcaption></figure><p>Note in Figure 14 that FDSW0 in Cluster X-1 already has one link failure connecting RDSW0. The input and output capacity towards Cluster X is already balanced on FDSW0, thus finishing propagation in this direction.</p>
<h4>FDSW Propagation for RDSW&lt;-&gt;FDSW Link Failure </h4>
<p>As FDSW0 in Cluster X-1 loses connectivity to RDSW0, it will stop advertising reachability to SDSW0 and SDSW1 on both of the links. (See Figure 15.)</p>
<figure id="attachment_23165" aria-describedby="caption-attachment-23165" class="wp-caption alignnone c3"><img class="size-full wp-image-23165" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png" alt="" width="1999" height="778" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png?resize=916,357 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png?resize=768,299 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png?resize=1024,399 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png?resize=1536,598 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-15.png?resize=192,75 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23165" class="wp-caption-text">Figure 15: Link failure in Cluster X-1 and propagation towards SDSW.</figcaption></figure><p>SDSW0 will randomly pick two links to stop advertising reachability to RDSW0 in Cluster X-1 (in the example in Figure 16 it picks one link in FDSW0 and one in FDSW1). On SDSW1, however, it already has one link failure connecting FDSW1 in Cluster X. Therefore, only one more link needs to be selected to propagate the reduced reachability (in the example it picks the other link towards FDSW1).  </p>
<figure id="attachment_23166" aria-describedby="caption-attachment-23166" class="wp-caption alignnone c3"><img class="size-full wp-image-23166" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png" alt="" width="1999" height="761" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png?resize=916,349 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png?resize=768,292 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png?resize=1024,390 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png?resize=1536,585 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-16.png?resize=192,73 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23166" class="wp-caption-text">Figure 16: Input Balanced Mode kicks in and stops advertising reachability to FDSWs in Cluster X.</figcaption></figure><p>From Cluster X FDSW1’s perspective, the output capacity towards RDSW0 in Cluster X-1 is 1 (two links with no reachability, and one link failure). Therefore, to balance input it should select three links to stop advertising reachability towards RDSW0 in Cluster X-1. Note that the link FDSW1&lt;-&gt;RDSW2 already has no reachability towards Cluster X-1 due to 1.1 propagation above. Hence, it will pick two more links (RDSW0 and RDSW1 in Figure 17) to not advertise reachability.</p>
<p>For Cluster X FDSW0, it will randomly pick one downlink (RDSW0 in Figure 17) to not advertise reachability to RDSW0 in Cluster X-1. </p>
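<p>The balancing arithmetic in this walkthrough can be captured in a small sketch. This is a toy model of the selection rule, not Meta's implementation; the function and parameter names are our own, and the downlink count is inferred from the figures:</p>

```python
def additional_withdrawals(uplinks_total, uplinks_unavailable,
                           downlinks_total, downlinks_withdrawn):
    """Toy model of input balancing: how many more downlinks must stop
    advertising reachability so input capacity matches output capacity."""
    # Output capacity = uplinks still advertising reachability.
    output_capacity = uplinks_total - uplinks_unavailable
    # Input capacity should shrink to match, so withdraw the difference,
    # minus any downlinks that have already stopped advertising.
    target = max(downlinks_total - output_capacity, 0)
    return max(target - downlinks_withdrawn, 0)
```

<p>Assuming FDSW1 has four uplinks (one failed, two withdrawn) and four downlinks (one already advertising no reachability, via RDSW2), the sketch yields the two additional downlinks picked in the walkthrough.</p>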
<figure id="attachment_23167" aria-describedby="caption-attachment-23167" class="wp-caption alignnone c3"><img class="size-full wp-image-23167" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png" alt="" width="1999" height="845" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png?resize=916,387 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png?resize=768,325 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png?resize=1024,433 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png?resize=1536,649 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png?resize=96,41 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-DSF-Figure-17.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23167" class="wp-caption-text">Figure 17: The composition effect of both link failures.</figcaption></figure><h2>Future Work With DSF </h2>
<ul><li class="c1" aria-level="1">We are interconnecting multiple regions to create <a href="https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/">mega clusters that will provide interconnectivity of GPUs with different regions</a> that are tens of kilometers apart. </li>
<li class="c1" aria-level="1">This will create an interesting challenge of addressing heterogeneity between different GPU types and fabric involving different regions.</li>
<li class="c1" aria-level="1">We are also working on a new technology called Hyperports, which will combine multiple 800G ports at the ASIC level to act as a single physical port. This will reduce the effect of fat flows on IP interconnects.</li>
</ul><p>In addition, DSF is a smart fabric that <em>inherently</em> supports a wide range of GPUs/NICs. We are expanding our deployments to include a wider variety of GPU/NIC models.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/20/data-center-engineering/disaggregated-scheduled-fabric-scaling-metas-ai-journey/</link>
      <guid>https://engineering.fb.com/2025/10/20/data-center-engineering/disaggregated-scheduled-fabric-scaling-metas-ai-journey/</guid>
      <pubDate>Mon, 20 Oct 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Scaling LLM Inference: Innovations in Tensor Parallelism, Context Parallelism, and Expert Parallelism]]></title>
      <description><![CDATA[<ul><li>At Meta, we are constantly pushing the boundaries of LLM inference systems to power applications such as the Meta AI App.</li>
<li>We’re sharing how we developed and implemented advanced parallelism techniques to optimize key performance metrics related to resource efficiency, throughput, and latency.</li>
</ul><p>The rapid evolution of large language models (LLMs) has ushered in a new era of AI-powered applications, from conversational agents to advanced content generation. However, deploying these massive models at scale for real-time inference presents significant challenges, particularly in achieving high throughput, low latency, and better resource efficiency. </p>
<p>Our primary goal is to optimize key performance metrics:</p>
<ul><li class="c1" aria-level="1"><strong>Resource efficiency:</strong> Maximizing GPU utilization to improve operational efficiency.</li>
<li class="c1" aria-level="1"><strong>Throughput (queries/s):</strong> Serving more users by processing a higher volume of requests.</li>
<li class="c1" aria-level="1"><strong>Latency:</strong> Minimizing response times for a seamless user experience. This includes:
<ul><li class="c1" aria-level="2"><strong>Time-to-first-token (TTFT) for prefill:</strong> The time it takes for the first part of the response to appear, ideally under 350ms.</li>
<li class="c1" aria-level="2"><strong>Time-to-incremental-token (TTIT) for decoding:</strong> The latency between subsequent words, targeting less than 25ms.</li>
</ul></li>
</ul><p>These metrics highlight the distinct computational demands of LLM inference: Prefill is compute-intensive, while decoding is memory bandwidth-intensive. To address these challenges and enable the deployment of large models, we have developed and implemented advanced parallelism techniques.</p>
<div class="jetpack-video-wrapper"><iframe title="Inference Deployments and Comms Implication - Live from SCC" width="1778" height="1000" src="https://www.youtube.com/embed/sTgwtGxZJx4?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>The Two Stages of LLM Inference</h2>
<p>A typical LLM generative-inference task unfolds in two stages:</p>
<ol><li class="c1" aria-level="1"><strong>Prefill stage:</strong> This stage processes the input prompt (which can be thousands of tokens long) to generate a key-value (KV) cache for each transformer layer of the LLM. Prefill is <strong>compute-bound,</strong> because the attention mechanism scales quadratically with sequence length.</li>
<li class="c1" aria-level="1"><strong>Decoding Stage:</strong> This stage utilizes and incrementally updates the KV cache to generate tokens (words) one by one. Decoding is <strong>memory-bound</strong>, as the I/O time of reading memory dominates attention time, with model weights and the KV cache occupying the majority of memory.</li>
</ol><h2>Addressing Bottlenecks With Parallelism</h2>
<p>To scale LLM inference effectively, especially for handling long contexts and massive models, we employ three main types of inference parallelism:</p>
<p><strong>1. Tensor parallelism</strong> (TP), which makes it possible to fit large models across multiple GPUs and achieve throughput that a single device cannot provide. It involves sharding individual layers of the model, such as attention blocks and multi-layer perceptron (MLP) layers, into smaller, independent blocks that can be executed on different devices.</p>
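<p>As a minimal illustration of this sharding (a single-process NumPy toy model, not Meta's code; the shapes and names are our own), an MLP block can be split column-wise on its first weight and row-wise on its second, so each device computes a partial result that an allreduce then sums:</p>

```python
import numpy as np

def tp_mlp(x, w1, w2, tp_degree=2):
    # Shard w1 by columns and w2 by rows across tp_degree "devices."
    w1_shards = np.split(w1, tp_degree, axis=1)
    w2_shards = np.split(w2, tp_degree, axis=0)
    # Each device applies its shard independently (ReLU is elementwise,
    # so it commutes with the column split).
    partials = [np.maximum(x @ a, 0.0) @ b for a, b in zip(w1_shards, w2_shards)]
    # The sum over partial results stands in for the allreduce.
    return sum(partials)
```

<p>The sum of partial products equals the unsharded relu(x @ w1) @ w2 exactly, which is why the only cross-device communication the layer needs is the final allreduce.</p>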
<p>A challenge in tensor parallelism is the “allreduce” communication operation, which can contribute up to 30% of end-to-end latency. To mitigate this, we developed <strong>direct data access (DDA)</strong> algorithms:</p>
<ul><li class="c1" aria-level="1"><strong>DDA flat algorithm:</strong> Improves small message-size allreduce latency by allowing each rank to directly load memory from other ranks and perform local reduce operations. This reduces latency from O(n) to O(1) by increasing the total amount of data exchanged from O(n) to O(n^2).</li>
<li class="c1" aria-level="1"><strong>DDA tree algorithm:</strong> Breaks the allreduce into two phases (reduce-scatter and all-gather) and uses direct data access in each step. This moves the same amount of data as the ring algorithm but reduces latency to a constant factor, making it suitable for slightly larger message sizes.</li>
</ul><p>Our DDA solutions demonstrate significant speedups against baselines such as NCCL (NVIDIA Collective Communications Library) and RCCL (ROCm Communication Collectives Library for AMD GPUs). For instance, with AMD MI300X, we achieved overall performance parity with Nvidia H100, with DDA outperforming RCCL baseline by 10-50% for decode (small message sizes) and yielding 10-30% speedup for prefill, resulting in approximately 10% reduction in TTIT.</p>
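<p>The structure of the flat algorithm can be sketched in a few lines. This is a single-process simulation; real DDA operates over GPU peer memory, and the names here are illustrative:</p>

```python
import numpy as np

def dda_flat_allreduce(rank_buffers):
    # Every rank directly "loads" each peer's buffer and reduces locally,
    # so there is a single communication step (O(1) latency steps) at the
    # cost of O(n^2) total data movement across ranks.
    return [np.sum(rank_buffers, axis=0) for _ in rank_buffers]
```

<p>After the call, every rank holds the same fully reduced buffer, which is the allreduce postcondition the tree and ring variants also satisfy, just with different latency/bandwidth tradeoffs.</p>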
<p><strong>2. Context parallelism</strong> (CP), which facilitates managing and processing extremely long contexts, such as the <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/" target="_blank" rel="noopener">1M/10M token capabilities introduced with Llama 4</a>. Long-context inference presents unique challenges:</p>
<ul><li class="c1" aria-level="1"><strong>Compute:</strong> Dense attention FLOPs scale quadratically with context length, leading to attention-compute dominating.</li>
<li class="c1" aria-level="1"><strong>Memory:</strong> The KV cache grows linearly with context.</li>
<li class="c1" aria-level="1"><strong>Communication:</strong> Communication latency increases when parallelizing across multiple hosts.</li>
</ul><p>We have implemented <a href="https://arxiv.org/pdf/2411.01783" target="_blank" rel="noopener">two variants of context parallelism</a> in the attention module, often referred to as “ring attention”:</p>
<ul><li class="c1" aria-level="1"><strong>Pass-KV:</strong> In this approach, input tokens are split across multiple CP ranks. Each rank calculates its portion of query, key, and value tensors. Then, key and value tensors are exchanged between ranks to enable attention interactions across the full context.</li>
<li class="c1" aria-level="1"><strong>Pass-Q:</strong> Similar to Pass-KV, but query tensors are exchanged between ranks.</li>
</ul><p>Our context parallelism optimizations, combined with a fast-attention kernel, have enabled remarkable performance for long-context capabilities. We achieved less than one minute for one million tokens on a single H100 host and less than one minute for 10 million tokens using distributed inference across multiple H100 hosts (e.g., 32 H100 hosts). With Llama 3 405B, we demonstrated near-linear scaling, achieving 128K token prefill in 3.8 seconds with CP over 16 nodes, and 1M-token prefill in 77 seconds.</p>
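<p>The Pass-KV idea can be modeled in-process with NumPy. This is a toy sketch; a real ring-attention implementation overlaps K/V communication with computation and uses an online softmax rather than gathering all shards:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pass_kv_attention(q_shards, k_shards, v_shards):
    # Each "rank" owns one shard of Q, K, and V. K/V shards are rotated
    # around the ring; here we simply gather them in ring order. Because
    # K and V rows are permuted together, the result matches full attention.
    n = len(q_shards)
    outputs = []
    for r in range(n):
        ks = np.concatenate([k_shards[(r + i) % n] for i in range(n)])
        vs = np.concatenate([v_shards[(r + i) % n] for i in range(n)])
        scores = q_shards[r] @ ks.T / np.sqrt(ks.shape[-1])
        outputs.append(softmax(scores) @ vs)
    return outputs
```

<p>Concatenating each rank's output recovers exactly the attention output of the unsharded sequence, while no rank ever needed to hold more than its query shard plus the in-flight K/V shards.</p>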
<p><strong>3. Expert parallelism (EP)</strong>, which helps with scaling mixture-of-experts (MoE) models, where a large number of “experts” (neural network modules) make it impossible to fit the entire model onto a single host. In EP-based inference, we utilize a two-shot, all-to-all communication pattern to exchange tokens between data parallelism and expert parallelism ranks based on routing.</p>
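<p>The dispatch/combine structure of the two-shot pattern can be modeled in a single process. This is a toy sketch under our own naming; the expert functions are placeholders, and a real implementation exchanges the permuted tokens via all-to-all collectives between ranks:</p>

```python
import numpy as np

def moe_two_shot(tokens, expert_ids, experts):
    # Shot 1 (dispatch): permute tokens so each expert's inputs are
    # contiguous, emulating the all-to-all from DP ranks to EP ranks.
    order = np.argsort(expert_ids, kind="stable")
    grouped = tokens[order]
    counts = np.bincount(expert_ids, minlength=len(experts))
    outputs = np.empty_like(grouped)
    start = 0
    for expert_fn, count in zip(experts, counts):
        outputs[start:start + count] = expert_fn(grouped[start:start + count])
        start += count
    # Shot 2 (combine): the inverse permutation (the second all-to-all)
    # returns each expert output to its original token position.
    combined = np.empty_like(outputs)
    combined[order] = outputs
    return combined
```

<p>The routing decision (expert_ids) fixes both exchanges: the forward permutation groups tokens by owning expert, and its inverse restores the original token order for the next layer.</p>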
<p>The all-to-all communication can contribute 10-30% to end-to-end latency, especially for decode messages (100KB to 2MB). To optimize this, we are exploring solutions including:</p>
<ul><li class="c1" aria-level="1"><strong>Dynamic all-to-all:</strong> Sending sub-chunks of data to remote neighbors.</li>
<li class="c1" aria-level="1"><strong>Persistent all-to-all:</strong> Addressing slowdowns primarily caused by memory-handle exchange, network-load balancing, and CPU overhead.</li>
</ul><h2>Looking Ahead: Disaggregated Inference and Future Challenges</h2>
<p>To further optimize LLM inference, we are moving towards <strong>N-D parallelism</strong> (CP, PP, EP, TP across nodes, with separate DP) and <strong>disaggregating prefill and decoding tiers</strong>. This allows for better resource balancing and the potential to use heterogeneous hardware, where compute-heavy hardware is used for prefill and memory bandwidth-heavy hardware for decoding. This multi-dimensional parallelism can help unblock the serving and evaluation of colossal models.</p>
<p>Future challenges in this space include:</p>
<ul><li class="c1" aria-level="1"><strong>Cloud fabric design:</strong> Optimizing the underlying cloud infrastructure for LLM workloads.</li>
<li class="c1" aria-level="1"><strong>Communication going to kernel (fused kernel):</strong> Integrating communication operations directly into computational kernels for greater efficiency.</li>
<li class="c1" aria-level="1"><strong>Device-initiated kernel:</strong> Enabling devices to initiate operations directly, reducing CPU overhead.</li>
</ul><p>These advancements in parallelization and system-level improvements have helped enable the next generation of AI applications and push the boundaries of what LLMs can achieve. We are committed to continuous innovation to ensure efficient and scalable LLM inference for millions of users worldwide.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/</link>
      <guid>https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/</guid>
      <pubDate>Fri, 17 Oct 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Branching in a Sapling Monorepo]]></title>
      <description><![CDATA[<p><a href="https://sapling-scm.com/" target="_blank" rel="noopener">Sapling</a> is a scalable, user-friendly, and open-source source control system that powers Meta’s monorepo. As discussed at the GitMerge 2024 conference <a href="https://github.com/git/git-merge/blob/main/breakouts/branching-in-a-monorepo.md" target="_blank" rel="noopener">session on branching</a>, designing and implementing branching workflows for large monorepos is a challenging problem with multiple tradeoffs between scalability and the developer experience.</p>
<p>After the conference, we designed, implemented, and open sourced our monorepo branching solution in Sapling. While the code is already open source, in this article we share learnings on:</p>
<ul><li class="c1" aria-level="1">How we resolved scalability and developer experience tradeoffs in the design and implementation.</li>
<li class="c1" aria-level="1">What problems it solved.</li>
<li class="c1" aria-level="1">What feedback we received from other developers at Meta.</li>
</ul><p>The key technical insight is that two workflows — <strong>non-mergeable full-repo branching</strong> and <strong>mergeable directory branching</strong> — solved all of the branching-related problems for a large and diverse set of products built at Meta.</p>
<p>We hope that the Sapling open source code and the learnings shared in this article will benefit the wider industry and open source communities.</p>
<h2>How Source Control Is Handled at Meta </h2>
<p>At Meta, our engineering teams work within a large monorepo with a single main branch. This approach enables unified dependency management, large-scale refactoring, easier collaboration, and code reuse across projects. However, this approach introduces challenges for teams that must manage multiple versions of their code.</p>
<p>In multi-repo setups, teams can rely on repository branches to manage different versions.  Source control gives them tools, like cherry-pick and merge, that let them manage the differences between the versions.</p>
<p>In the monorepo, however, repository branches do not work as well for this. Branches affect the whole repository, so creating a branch means unrelated projects and dependencies will remain frozen, and quickly become stale.</p>
<p>In this article we refer to whole repository branching as <strong>full-repo branching</strong>. What we learned is that for workflows that do not require merging back to the main branch (e.g., product releases where the branch ceases to exist after the release completes and the development moves back to the main branch) full-repo branching is a good solution. In Sapling, this workflow is well supported with the sl bookmark family of commands.</p>
<p>However, for product development workflows where merging back to the main branch is required, we learned that full-repo branching is not a scalable approach. This is because full-repo merges create merge commits with multiple parents, making the commit graph <em>wide</em> (high branching factor) and <em>non-linear</em>. In large monorepos, this creates performance problems for operations like sl log and sl blame. Maintaining a linear commit graph, where most commits have a single parent, is crucial for keeping these operations fast for all monorepo users, not just those utilizing branches.</p>
<p>The core limitation is that full-repo branches are all-or-nothing. If you need to patch a legacy version, or maintain a custom variant for a particular project, you cannot create a branch for the part that you own. Branching forks everything.</p>
<p>A common pattern when attempting to solve this problem was for teams to make multiple copies of their code. However, by doing this they lose a lot of the standard developer tools for managing their branches. This resulted in duplicated effort and error-prone copying of patches between directories.</p>
<h2>Directory Branching: Sapling’s Monorepo Branching Solution</h2>
<p>To solve these challenges, we have introduced a new set of source control tools in Sapling that can be used to implement a new kind of branching: <strong>directory branching</strong>. This bridges the gap between using multiple repository branches and maintaining copies of code as separate directories.</p>
<p>With these tools, you are able to treat directories in the monorepo much like traditional repository branches. You create branches by copying the code, maintain them by cherry-picking and merging changes between directories as if they were branches, and view the history of each directory in the context of the copies and merges that were made.</p>
<p>Crucially, while directory branches support merging between directories, at the level of the monorepo’s commit graph, they appear as linear commits. This resolves the scalability challenge with the repo-level merge commits and still provides merging workflows at the directory level.</p>
<h2>How Directory Branching Is Implemented in Sapling </h2>
<p>Directory branching in Sapling is implemented using a series of operations centered around the sl subtree command.</p>
<p>To branch a directory, you use the sl subtree copy command to copy a directory (or file), either at the current version or from any historical version, to a new location in the repository. Sapling records metadata in the commit that tracks the source directory, source revision, and copy relationship, which allows us to recover the complete history of all files in the new branch. If the code you want to branch is not in the monorepo yet, you can use sl subtree import to create a directory branch of an external repository branch.</p>
<p>Once you have a directory branch, you can use sl subtree graft and sl subtree merge to cherry-pick or merge changes between directory branches. These operations use the stored copy/merge metadata to reconstruct the relationship between directories, enabling Sapling to perform three-way merges between directory branches. The merge algorithm finds the common ancestor of the two directory branches (using the copy metadata) and performs a standard three-way merge, just as it would for regular repository merges, but scoped to the specific directory content.</p>
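<p>The three-way merge step can be illustrated with a toy line-level model. This is our own simplification, assuming equal-length files with line-aligned edits; Sapling's actual merge machinery is far more general:</p>

```python
def three_way_merge(base, ours, theirs):
    # For each line, keep the side that changed relative to the common
    # ancestor; if both sides changed it differently, mark a conflict.
    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o == t:           # both sides agree (or neither changed)
            merged.append(o)
        elif t == b:         # only "ours" changed
            merged.append(o)
        elif o == b:         # only "theirs" changed
            merged.append(t)
        else:                # both changed differently: conflict
            merged.append(("CONFLICT", o, t))
    return merged
```

<p>The copy metadata's role is to supply the base version: without a recorded common ancestor, every differing line would look like a conflict.</p>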
<h2>The Build System and Wider Developer Tooling Integration</h2>
<p>An advantage of this approach is that the latest versions of all directory branches are visible at the same time.  This means continuous integration (CI) can test against multiple branches with a single checkout, and you can be confident that there are no hidden old branches that are unexpectedly still in use.</p>
<p>At Meta we use <a href="https://buck2.build/" target="_blank" rel="noopener">Buck2</a> as our build system. When a component depends on another component that uses directory branching, we use Buck <a href="https://buck2.build/docs/concepts/modifiers/" target="_blank" rel="noopener">config modifiers</a> (i.e., buck build with the -m flag) to allow us to select which branch is being used.</p>
<p>One downside of directory branching is that code searches can return multiple hits, one for each of the branches. It is relevant that the searched-for code appears in multiple places; however, it can be difficult to look through the results from multiple branches if they are mingled together. Code search systems capable of ranking results can resolve this issue.</p>
<h2>User Feedback on Directory Branching</h2>
<p>The introduction of directory branching has been a success, with a large and diverse set of engineering teams within Meta adopting it to manage multiple versions of code. Some teams have also found it useful to temporarily freeze the majority of the monorepo for development stability by remaining on an old commit and using directory branching to merge in changes for specific projects, effectively combining both full-repo branching and directory branching workflows.</p>
<p>We observed the following <strong>three common themes</strong> of valid reasons for adopting directory branching:</p>
<p>1.)  <strong>When CI is prohibitively expensive or changes could cause major disruptions</strong>. Some teams at Meta used directory branches to effectively separate development and production versions of the code, giving them more control over when their code changes are deployed to production.</p>
<p>2.) <strong>Experimental changes</strong> where a large number of developers are collaborating over several months, but the changes have the potential of disrupting the production version. At the same time, the collaboration scale is large enough that using a very large stack of diffs to simulate a branch is not practical.</p>
<p>3.) Unblocking <strong>migrations from Git</strong>. Even if the ultimate goal is to have only one or a few versions in the Sapling monorepo, during the migrations we need an equivalent to Git branches so that the migration can complete and consolidation can take place within the monorepo. It is not always possible to consolidate all branches in Git before migrating to monorepo.</p>
<p>It is worth noting that having a single version of code remains the default assumption for the monorepo. However, if any of the three reasons above apply, directory branching can be used as a solution, providing branching workflows without sacrificing the benefits of a monorepo.</p>
<h2>Future Work With Directory Branching</h2>
<p>We are also planning to leverage directory branching for better integration of Git repositories into the Sapling monorepo. More specifically, we are developing a lightweight repository migration mechanism. Instead of making the irreversible decision to commit all of the Git repository commits into the monorepo history, we create a <em>soft link</em> to an external repository, from which Sapling can load the Git history on the fly when the user requests it. This lowers the barrier to entry for Git repositories and is useful for integrations before committing to a full-history migration. This will be provided as an option to the sl subtree import command when working with external Git repositories.</p>
<p>Stay tuned—we will publish a separate article on this topic once we have enough learnings to share.</p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">website</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, <a href="https://bsky.app/profile/metaopensource.bsky.social" target="_blank" rel="noopener">Bluesky</a> and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>
<h2>Acknowledgements</h2>
<p><em>Multiple people at Meta’s Source Control, Developer Experience and Open Source organisations contributed to the design and implementation of directory branching in Sapling. We would like to thank: Chris Cooper, George Giorgidze, Mark Juggurnauth-Thomas, Jon Janzen, Pingchuan Liu, Muir Manders, Mark Mendoza, Jun Wu, and Zhaolong Zhu.</em></p>
<p><em>We are also grateful to the</em> <a href="https://git-scm.com/" target="_blank" rel="noopener"><em>Git</em></a><em>,</em> <a href="https://www.mercurial-scm.org/" target="_blank" rel="noopener"><em>Mercurial</em></a><em>, and</em> <a href="https://jj-vcs.github.io/" target="_blank" rel="noopener"><em>Jujutsu</em></a> <em>open source communities for their</em> <a href="https://github.com/git/git-merge/blob/main/breakouts/branching-in-a-monorepo.md" target="_blank" rel="noopener"><em>branching-related discussions</em></a> <em>at the GitMerge 2024 conference in Berlin. We hope that the Sapling open source code and the learnings shared in this article will benefit all source control systems.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/10/16/developer-tools/branching-in-a-sapling-monorepo/</link>
      <guid>https://engineering.fb.com/2025/10/16/developer-tools/branching-in-a-sapling-monorepo/</guid>
      <pubDate>Thu, 16 Oct 2025 19:10:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[10X Backbone: How Meta Is Scaling Backbone Connectivity for AI]]></title>
      <description><![CDATA[<ul><li>We’re sharing details on our journey to scale Meta’s Backbone network to support the increasing demands of new and existing AI workloads.</li>
<li>We’ve developed new technologies and designs to address our 10x scaling needs and are applying some of these same principles to help <a href="https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/" target="_blank" rel="noopener">extend our AI clusters</a> between multiple data centers.</li>
</ul><div class="jetpack-video-wrapper"><iframe title="10x Backbone: Scaling Backbone Connectivity to Serve AI Demands - Live from SCC" width="1778" height="1000" src="https://www.youtube.com/embed/hiKNpZ_kEEU?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<p>Meta’s Backbone network is composed of a set of interconnected routing platforms and provides WAN (wide area network) connectivity among network locations. Meta has architected Backbone in two different networks: Classic Backbone (CBB) and Express Backbone (EBB). They differ in some fundamental ways. </p>
<p>CBB is used to achieve global reach from data centers (DCs) to our points of presence (POPs) where we connect with external carriers. CBB is flexible: It can shrink or grow to support a diverse set of geographies and accommodate a broad range of connectivity requirements. It uses traditional IP/MPLS-TE (Internet Protocol/Multiprotocol Label Switching/Traffic Engineering) technologies.</p>
<p>EBB, in contrast, is built to provide scalable DC-to-DC interconnection. EBB is less flexible, having a sizable minimum installation. It runs a heavily customised stack of software, such as the Open/R routing protocol, and an in-house traffic-engineering stack with <em>onbox</em> agents and a centralized controller.</p>
<p>While we see growth in both networks, it’s EBB that presents the most challenging scalability problems.</p>
<p>In the rest of this post, we will focus on EBB and describe how we addressed its growth and the resulting challenges.</p>
<p><img class="alignnone size-full wp-image-23028" src="https://engineering.fb.com/wp-content/uploads/2025/10/image5.png" alt="" width="1806" height="916" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/image5.png 1806w, https://engineering.fb.com/wp-content/uploads/2025/10/image5.png?resize=916,465 916w, https://engineering.fb.com/wp-content/uploads/2025/10/image5.png?resize=768,390 768w, https://engineering.fb.com/wp-content/uploads/2025/10/image5.png?resize=1024,519 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/image5.png?resize=1536,779 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/image5.png?resize=96,49 96w, https://engineering.fb.com/wp-content/uploads/2025/10/image5.png?resize=192,97 192w" sizes="(max-width: 992px) 100vw, 62vw" /><em>Figure 1: Traffic growth in Meta’s Backbone network</em></p>
<p>The EBB network first started serving traffic around 2015. Figure 1 shows the growth since then of EBB DC-to-DC traffic flows versus CBB DC-to-POP traffic flows.</p>
<p>Prior to 2015, CBB was used for both DC-to-DC and DC-to-POP traffic. Figure 2 represents some of the EBB adoption and technology milestones.</p>
<p><img class="alignnone size-full wp-image-23029" src="https://engineering.fb.com/wp-content/uploads/2025/10/image3.png" alt="" width="1798" height="974" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/image3.png 1798w, https://engineering.fb.com/wp-content/uploads/2025/10/image3.png?resize=916,496 916w, https://engineering.fb.com/wp-content/uploads/2025/10/image3.png?resize=768,416 768w, https://engineering.fb.com/wp-content/uploads/2025/10/image3.png?resize=1024,555 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/image3.png?resize=1536,832 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/image3.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/10/image3.png?resize=192,104 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><em>Figure 2: EBB origins and growth</em></p>
<p>Interconnecting DC locations at the necessary scale requires a significant amount of fiber, in both quantity and distance. The existing DCs continue to grow in footprint and capacity due to the addition of more powerful servers and, where possible, the addition of new buildings at existing locations.</p>
<p>Connecting DCs reliably and repeatedly at high capacity to the rest of the network can be challenging, especially due to the speed at which new DCs are being built. While the network has some input into the site-selection process, there are many influencing factors beyond ease of connectivity that determine how new data center locations are chosen.</p>
<h2>10X Backbone</h2>
<p>10X Backbone is the evolution of EBB in terms of scale, topology, and technology. Below are the three techniques used to scale to 10X Backbone.</p>
<h3>DC Metro Architecture</h3>
<p>Historically, building long-haul fibers to new DC locations has been painful, especially when these long-haul fibers need to extend hundreds of miles.</p>
<p>Our first technique to scale up to 10X Backbone was to pre-build some of the components of DC metro architecture. By pre-building them, we could more quickly provide connectivity to new DCs.</p>
<p>First, we built two rings of fiber to provide scalable capacity in the metro, and we connected long-haul fibers to the rings. Next, we built two POPs to provide connectivity toward remote sites. Last, we connected DCs to the rings, and therefore increased or enabled capacity between the DC and POPs. (See Figure 3.)</p>
<p>DC metro architecture has several advantages:</p>
<ul><li class="c1" aria-level="1">Simplified build-out of DC connectivity and WAN topology</li>
<li class="c1" aria-level="1">A standardized scalable physical design</li>
<li class="c1" aria-level="1">Separate metro and long-haul networks </li>
</ul><p><img class="size-full wp-image-23030 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2025/10/image6.png" alt="" width="1999" height="1158" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/image6.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/image6.png?resize=916,531 916w, https://engineering.fb.com/wp-content/uploads/2025/10/image6.png?resize=768,445 768w, https://engineering.fb.com/wp-content/uploads/2025/10/image6.png?resize=1024,593 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/image6.png?resize=1536,890 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/image6.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/10/image6.png?resize=192,111 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><em>Figure 3:</em> <em>DC metro architecture</em></p>
<h3>IP Platform Scaling</h3>
<p>The second technique we use for 10X Backbone is IP platform scaling, which has two flavors: scaling up and scaling out.</p>
<p>Scaling up, as illustrated in Figure 4, relies heavily on vendor technology and has primarily two forms: </p>
<ul><li class="c1" aria-level="1">Larger chassis: a 12-slot chassis provides 50% more capacity than an 8-slot chassis. However, a larger chassis introduces another set of important considerations:
<ul><li class="c1" aria-level="2">More challenging mechanical and thermal designs</li>
<li class="c1" aria-level="2">Higher power and space requirements, and higher power density per rack</li>
<li class="c1" aria-level="2">A higher number of ASICs (application-specific integrated circuits) and the implications for control-plane programming across them</li>
<li class="c1" aria-level="2">More challenging infrastructure design with regard to higher interface and cabling count</li>
<li class="c1" aria-level="2">Increased network operating system (NOS) complexity to support higher interface scale</li>
<li class="c1" aria-level="2">Simpler NPI (new product introduction) when keeping the same ASIC/line-card technology</li>
</ul></li>
<li class="c1" aria-level="1">Faster interfaces. By leveraging modern ASICs and line cards, we can double the capacity when we move from 400G to 800G platforms. Important considerations arising from this technique:
<ul><li class="c1" aria-level="2">More challenging thermal designs</li>
<li class="c1" aria-level="2">Higher power requirements and power density per rack</li>
<li class="c1" aria-level="2">Complex NPI introduced by a new ASIC and forwarding pipeline</li>
<li class="c1" aria-level="2">More challenging infrastructure design with regard to higher interface and cabling count</li>
<li class="c1" aria-level="2">Increased network OS complexity to support potentially higher interface scale</li>
<li class="c1" aria-level="2">Support for 800G-ZR+ transceivers (a set of pluggables that support extended reach)</li>
</ul></li>
</ul><p><img class="size-full wp-image-23031 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2025/10/image7.png" alt="" width="934" height="476" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/image7.png 934w, https://engineering.fb.com/wp-content/uploads/2025/10/image7.png?resize=916,467 916w, https://engineering.fb.com/wp-content/uploads/2025/10/image7.png?resize=768,391 768w, https://engineering.fb.com/wp-content/uploads/2025/10/image7.png?resize=96,49 96w, https://engineering.fb.com/wp-content/uploads/2025/10/image7.png?resize=192,98 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><em>Figure 4: EBB techniques to scale up</em></p>
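<p>The scale-up arithmetic above can be sketched as follows. This is an illustrative calculation only; the port counts and per-slot figures are hypothetical, not Meta’s actual platforms.</p>

```python
# Illustrative sketch of the two scale-up levers described above: larger
# chassis (more slots) and faster interfaces (400G -> 800G ports).
# Port counts per slot are hypothetical numbers for the example.

def chassis_capacity_gbps(slots: int, ports_per_slot: int, gbps_per_port: int) -> int:
    """Total forwarding capacity of a chassis in Gb/s."""
    return slots * ports_per_slot * gbps_per_port

# Larger chassis: 12 slots vs. 8 slots at the same line-card speed
base = chassis_capacity_gbps(slots=8, ports_per_slot=36, gbps_per_port=400)
bigger = chassis_capacity_gbps(slots=12, ports_per_slot=36, gbps_per_port=400)
print(bigger / base)  # 1.5 -> 50% more capacity

# Faster interfaces: the same 8-slot chassis moved from 400G to 800G ports
faster = chassis_capacity_gbps(slots=8, ports_per_slot=36, gbps_per_port=800)
print(faster / base)  # 2.0 -> double the capacity
```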
<p>In contrast to the dependency on vendors/industry in scaling up, scaling out (illustrated in Figure 5) is more under our control and has historically taken two flavors in EBB:</p>
<ul><li class="c1" aria-level="1">Adding more Backbone planes. Going from four to eight planes results in doubling capacity <em>globally</em>; however, this technique has the following considerations:
<ul><li class="c1" aria-level="2">Implementation is quite disruptive and requires a lot of planning, specifically when it comes to fiber restriping (this needs to be coordinated in many locations simultaneously)</li>
<li class="c1" aria-level="2">Higher power and space requirements globally, but power density per rack remains the same</li>
<li class="c1" aria-level="2">Routing support for planes with uneven capacity can be complex</li>
<li class="c1" aria-level="2">Additional capacity on interconnects might be needed for compatibility with the final Backbone design</li>
<li class="c1" aria-level="2">Doesn’t require introducing new technology</li>
</ul></li>
<li class="c1" aria-level="1">Adding multiple devices per plane. This technique is more sophisticated and allows us to scale capacity only in a chosen location. Considerations include:
<ul><li class="c1" aria-level="2">Implementation is quite disruptive for the target site and requires a moderate amount of planning to execute</li>
<li class="c1" aria-level="2">Higher power and space requirements in the target location, but power density per rack remains the same</li>
<li class="c1" aria-level="2">Interconnect with other layers might be more challenging: full mesh needs to be extended to Nx devices</li>
<li class="c1" aria-level="2">Introduces new failure modes: Device failure can impact some but not all of the Backbone in that plane/site</li>
<li class="c1" aria-level="2">Network operations can become more complex due to new failure modes and the handling of sets of devices (software upgrades, maintenance, etc.)</li>
<li class="c1" aria-level="2">Doesn’t require introducing new technology</li>
</ul></li>
</ul><p><img class="size-full wp-image-23032 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2025/10/image1.png" alt="" width="924" height="494" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/image1.png 924w, https://engineering.fb.com/wp-content/uploads/2025/10/image1.png?resize=916,490 916w, https://engineering.fb.com/wp-content/uploads/2025/10/image1.png?resize=768,411 768w, https://engineering.fb.com/wp-content/uploads/2025/10/image1.png?resize=96,51 96w, https://engineering.fb.com/wp-content/uploads/2025/10/image1.png?resize=192,103 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><em>Figure 5: EBB techniques to scale out</em></p>
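<p>The scale-out arithmetic above can be sketched in a few lines. All figures here are illustrative, not Meta’s actual plane counts or capacities.</p>

```python
# Sketch of the two scale-out levers described above: adding Backbone
# planes grows capacity globally, while adding devices per plane grows
# one site and extends the full mesh toward adjacent layers.

def global_capacity_tbps(planes: int, plane_capacity_tbps: float) -> float:
    """Capacity grows linearly with the number of Backbone planes."""
    return planes * plane_capacity_tbps

# Going from four to eight planes doubles capacity globally
assert global_capacity_tbps(8, 100) == 2 * global_capacity_tbps(4, 100)

def mesh_links(devices_per_plane: int, adjacent_layer_devices: int) -> int:
    """Full-mesh interconnect: every device in the plane links to every
    device in the adjacent layer, so links scale with both counts."""
    return devices_per_plane * adjacent_layer_devices

# Doubling devices per plane at one site doubles the interconnect fan-out
assert mesh_links(2, 4) == 2 * mesh_links(1, 4)
```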
<p>Scaling up and scaling out are not mutually exclusive, and in our 10X Backbone journey we have used them both.</p>
<h3>IP and Optical Integration</h3>
<p>The third technique to scale to 10X Backbone is IP and optical integration. By leveraging ZR technology, we are changing the power footprint per terabit in the network.</p>
<p>Prior to ZR:</p>
<ul><li class="c1" aria-level="1">We had many transponders per router. Each of the transponders consumed up to 2kW for 4.8-6.4Tb of capacity.</li>
<li class="c1" aria-level="1">There was a clear demarcation between IP and optical layers. This permitted work to occur at either layer with simple coordination.</li>
</ul><p>With ZR: </p>
<ul><li class="c1" aria-level="1">We no longer need transponders; this functionality is now in the plugs in the router. By removing transponders, we recover large amounts of space and power.</li>
<li class="c1" aria-level="1">Each of the plugs consumes 10-15W of incremental power.</li>
<li class="c1" aria-level="1">As a result of ZR plugs being installed in the routers, the split between IP and optical functions is not as clear as before.</li>
</ul><p><img class="alignnone size-full wp-image-23034" src="https://engineering.fb.com/wp-content/uploads/2025/10/image2.png" alt="" width="1999" height="1080" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/image2.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/image2.png?resize=916,495 916w, https://engineering.fb.com/wp-content/uploads/2025/10/image2.png?resize=768,415 768w, https://engineering.fb.com/wp-content/uploads/2025/10/image2.png?resize=1024,553 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/image2.png?resize=1536,830 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/image2.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/10/image2.png?resize=192,104 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><em>Figure 6: Network topology before and after ZR introduction</em></p>
<p>In summary, the use of ZR transceivers increases the power consumption in the router, which is offset by the considerable power savings from removing standalone transponders. In aggregate, we use 80 to 90% less power.</p>
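<p>A back-of-the-envelope power-per-terabit comparison, using only the figures quoted above, looks like this. It is a simplified calculation that ignores router-side overheads, which is why it lands a little above the 80 to 90% aggregate savings reported in the text.</p>

```python
# Power per terabit, before and after ZR, using the figures in the text.
# Simplified: ignores router-side overheads and other fixed costs.

# Pre-ZR: a standalone transponder draws up to 2 kW for 4.8-6.4 Tb
transponder_w_per_tb_best = 2000 / 6.4   # 312.5 W/Tb
transponder_w_per_tb_worst = 2000 / 4.8  # ~416.7 W/Tb

# With ZR: each 800G plug adds 10-15 W of incremental power
zr_w_per_tb_best = 10 / 0.8   # 12.5 W/Tb
zr_w_per_tb_worst = 15 / 0.8  # 18.75 W/Tb

# Even comparing ZR's worst case against the transponder's best case:
savings = 1 - zr_w_per_tb_worst / transponder_w_per_tb_best
print(f"{savings:.0%}")  # 94%
```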
<p>Using ZR technology has introduced important high-level changes:</p>
<ul><li class="c1" aria-level="1">Cost and power efficiency: 
<ul><li class="c1" aria-level="2">The same Backbone capacity can be deployed in a smaller S&amp;P envelope</li>
<li class="c1" aria-level="2">Rack allocation between optical and IP devices goes from 90/10 (prior to ZR) to 60/40 (with ZR)</li>
<li class="c1" aria-level="2">Prior to ZR, we could land 1x fiber pairs/rack; with ZR, since we don’t use standalone transponders, we can land 4x fiber pairs/rack</li>
</ul></li>
<li class="c1" aria-level="1">Simplifies network deployments; installing a set of pluggables instead of standalone transponders makes network deployments easier and more predictable</li>
<li class="c1" aria-level="1">Uses fewer active devices and therefore simplifies network operations</li>
<li class="c1" aria-level="1">Enables interoperability and vendor diversity</li>
<li class="c1" aria-level="1">Optical channel terminates in IP devices, and the demarcation of optical and IP is more complex than in non-ZR scenarios</li>
<li class="c1" aria-level="1">Telemetry and collections on the state of the optical channel is bound to IP devices, causing additional CPU consumption</li>
</ul><p>By leveraging DC metro architecture, IP platform scaling, and IP/Optical integration, we transformed EBB from the experimental network of 2016 to a large-scale Backbone that supports all DC&lt;&gt;DC traffic at Meta. </p>
<h2>AI Backbone</h2>
<p>Over the last 18 months, we’ve seen increasing interest in growing the megawatt footprint in support of building larger GPU clusters. The requirements have grown beyond what can fit in an existing data center campus, even considering undeveloped land or land adjacent to existing locations. Because cluster performance is currently impacted by latency between endpoints, we began to search for suitable expansion locations within bounded geographical proximity, expanding outwards until we achieve the target scale for a region. </p>
<p>As we identify sites of interest, we work with our fiber-sourcing team to determine the timing and feasibility to connect to existing locations at a very high scale as well as the most appropriate technology to utilize. In most cases, construction work is needed to place additional fiber in the ground, due to the significant quantities required.</p>
<p>We came up with three solutions based on the necessary reach:</p>
<ol><li class="c1" aria-level="1"><strong>FR plugs</strong>: A solution that addresses buildings in the 3-kilometer range. (Note: We make some different assumptions about loss/connector count to permit this distance versus the standard specification, which states 2 kilometers.)</li>
<li class="c1" aria-level="1"><strong>LR plugs</strong>: Increasing the distance to a 10-kilometer range by using longer reach optics.</li>
<li class="c1" aria-level="1"><strong>ZR plugs + Optical DWDM</strong> (dense wavelength division multiplexing) technology: To go beyond 10-kilometer range, we need active optical components to multiplex and amplify the signals to get the desired reach. Multiplexing reduces the fiber count by a factor of 64 versus FR/LR.</li>
</ol><p>For longer reach connectivity, a more complex solution is required. We use a relatively tried-and-tested design incorporating optical-protection switching, albeit using the latest generation <a href="https://ieeexplore.ieee.org/abstract/document/11029361">C+L-Band 800G ZR technology</a>.</p>
<p>Today’s requirements are at the lower end of the distance capabilities, and the initial deployments do not require any of the intermediate amplification sites that come into play beyond 150 kilometers. This is fortunate, as these sites would be quite large given the number of fiber pairs to be amplified (meaning additional lead times for construction, planning permits, etc.).</p>
<p>Protection switching introduces some additional operational challenges to how we run the network, as we require external tooling/monitoring to determine whether the underlying connectivity for an IP circuit is in a protected or unprotected state. The primary reason to use optical protection is to reduce the number of ports we consume on the IP platforms, versus providing protection at the IP layer with additional capacity.</p>
<p>With this design, each fiber pair can carry 64x 800G (51.2T). To achieve the overall capacity needed between a given site-pair, we just scale this horizontally.</p>
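<p>The per-fiber-pair capacity and the horizontal scaling described above reduce to simple arithmetic. The 500 Tb/s target below is a hypothetical figure chosen for illustration.</p>

```python
import math

# Each fiber pair carries 64 x 800G DWDM channels, as described above.
CHANNELS_PER_PAIR = 64
GBPS_PER_CHANNEL = 800
TBPS_PER_PAIR = CHANNELS_PER_PAIR * GBPS_PER_CHANNEL / 1000  # 51.2 Tb/s

def fiber_pairs_needed(target_tbps: float) -> int:
    """Horizontal scaling: add fiber pairs until the site-pair target is met."""
    return math.ceil(target_tbps / TBPS_PER_PAIR)

print(TBPS_PER_PAIR)            # 51.2
print(fiber_pairs_needed(500))  # 10 pairs for a hypothetical 500 Tb/s site-pair
```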
<p><img class="alignnone size-full wp-image-23033" src="https://engineering.fb.com/wp-content/uploads/2025/10/image4.png" alt="" width="1999" height="970" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/image4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/image4.png?resize=916,444 916w, https://engineering.fb.com/wp-content/uploads/2025/10/image4.png?resize=768,373 768w, https://engineering.fb.com/wp-content/uploads/2025/10/image4.png?resize=1024,497 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/image4.png?resize=1536,745 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/image4.png?resize=96,47 96w, https://engineering.fb.com/wp-content/uploads/2025/10/image4.png?resize=192,93 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p><em>Figure 7: AI Backbone topology</em></p>
<p>The above diagram underscores the scale of these interconnects. Right now, a single AI Backbone site-pair is twice the size of the global backbone that we’ve been building for the last 10 years.</p>
<p>This presents many interesting challenges in how we deploy and provision this capacity. We’ll be putting a lot of time and effort into streamlining the sheer volume of this equipment and these connections as we complete the physical build-out of the fiber.</p>
<h2>What We Learned and What the Future Holds</h2>
<p>Scaling EBB has been a wild journey over the last eight or nine years, and it is a story of unexpected acceleration: Our scalability plans, originally targeting 2028, had to be pulled forward to 2024.</p>
<p>These are our key learnings:</p>
<ul><li class="c1" aria-level="1">10X Backbone is possible because of the innovation in scaling up and scaling out.</li>
<li class="c1" aria-level="1">Pre-building scalable metro designs enables a faster response to network growth.</li>
<li class="c1" aria-level="1">IP/optical integration reduces the number of active devices and the space and power footprint, and enables further scaling.</li>
<li class="c1" aria-level="1">Re-using 10X Backbone technology enables the build of AI Backbone.</li>
</ul><p>Meta is planning to build city-size DCs, and our Backbone has to evolve and scale.</p>
<ul><li class="c1" aria-level="1">We see leaf-and-spine architecture as the next step to scale out our platforms. This architecture provides the needed scale with fewer disruptive scaling steps.</li>
<li class="c1" aria-level="1">We will execute on the initial plan for AI Backbone, iterate as we go to more sites, and mature our operations. Throughout this process, we’ll come to understand AI intricacies as they develop through our optical network.</li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/10/16/data-center-engineering/10x-backbone-how-meta-is-scaling-backbone-connectivity-for-ai/</link>
      <guid>https://engineering.fb.com/2025/10/16/data-center-engineering/10x-backbone-how-meta-is-scaling-backbone-connectivity-for-ai/</guid>
      <pubDate>Thu, 16 Oct 2025 18:30:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How Meta Is Leveraging AI To Improve the Quality of Scope 3 Emission Estimates for IT Hardware]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">As we focus on our goal of achieving net zero emissions in 2030, we also aim to create a common taxonomy for the entire industry to measure carbon emissions.</li>
<li class="c1" aria-level="1">We’re sharing details on a new methodology we presented at the <a href="https://www.youtube.com/watch?v=rTfPdI31VIE" target="_blank" rel="noopener">2025 OCP regional EMEA summit</a> that leverages AI to improve our understanding of our IT hardware’s Scope 3 emissions.</li>
<li class="c1" aria-level="1">We are collaborating with the OCP PCR workstream to open source this methodology for the wider industry. This collaboration will be introduced at the 2025 OCP Global Summit.</li>
</ul><p>As Meta focuses on <a href="https://sustainability.atmeta.com/climate/" target="_blank" rel="noopener">achieving net zero emissions in 2030</a>, understanding the carbon footprint of server hardware is crucial for making informed decisions about sustainable sourcing and design. However, calculating the precise carbon footprint is challenging due to complex supply chains and limited data from suppliers. IT hardware used in our data centers is a significant source of emissions, and the embodied carbon associated with the manufacturing and transportation of this hardware is particularly challenging to quantify.</p>
<p>To address this, we developed <a href="https://sustainability.atmeta.com/blog/2024/09/10/estimating-embodied-carbon-in-data-center-hardware-down-to-the-individual-screws/" target="_blank" rel="noopener">a methodology to estimate and track the carbon emissions of hundreds of millions of components in our data centers</a>. This approach involves a combination of cost-based estimates, modeled estimates, and component-specific product carbon footprints (PCFs) to provide a detailed understanding of embodied carbon emissions. These component-level estimates are ranked by the quality of data and aggregated at the server rack level.</p>
<p>By using this approach, we can analyze emissions at multiple levels of granularity, from individual screws to entire rack assemblies. This comprehensive framework allows us to identify high-impact areas for emissions reduction. </p>
<p>Our ultimate goal is to drive the industry to adopt more sustainable manufacturing practices and produce components with reduced emissions. This initiative underscores the importance of high-quality data and collaboration with suppliers to enhance the accuracy of carbon footprint calculations to drive more sustainable practices.</p>
<p>We leveraged AI to help us improve this database and understand our Scope 3 emissions associated with IT hardware by:</p>
<ul><li class="c1" aria-level="1"><strong>Identifying similar components</strong> and applying existing PCFs to similar components that lack these carbon estimates.</li>
<li class="c1" aria-level="1"><strong>Extracting data from heterogeneous data sources</strong> to be used in parameterized models.</li>
<li class="c1" aria-level="1">Understanding the carbon footprint of IT racks and <strong>applying generative AI (GenAI) as a categorization algorithm to create a new and standard taxonomy</strong>. This taxonomy helps us understand the hierarchy and hotspots in our fleet and allows us to provide insights to the data center design team in their language. We hope to iterate on this taxonomy with the data center industry and agree on an industry-wide standard that allows us to compare IT hardware carbon footprints for different types and generations of hardware.</li>
</ul><h2>Why We Are Leveraging AI </h2>
<p>For this work we used various AI methods to enhance the accuracy and coverage of Scope 3 emission estimates for our IT hardware. Our approach leverages the unique strengths of both natural language processing (NLP) and large language models (LLMs). </p>
<h3>NLP For Identifying Similar Components</h3>
<p>In our first use case (<em>Identifying similar components with AI</em>), we employed NLP techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity to identify patterns within a bounded, relatively small dataset. Specifically, we applied this method to determine the similarity between different components. This approach allowed us to develop a highly specialized model for this specific task.</p>
<h3>LLMs For Handling and Understanding Data</h3>
<p>LLMs are pre-trained on a large corpus of text data, enabling them to learn general patterns and relationships in language. They go through a post-training phase to adapt to specific use cases such as chatbots. We apply LLMs, specifically Llama 3.1, in the following three different scenarios:</p>
<ul><li class="c1" aria-level="1">To <strong>extract and process information</strong> from diverse data sources. The benefit of LLMs is that the model can recognize different representations of the same information, even if formatted or phrased differently. (see section: <a href="#extracting"><em>Extracting Data From Heterogeneous Data</em></a>)</li>
<li class="c1" aria-level="1">To <strong>understand potential groupings</strong> of components, aiding in the creation of a new taxonomy. (see section: <em><a href="#breakdown">A Component-Level Breakdown of IT Hardware Emissions Using AI</a>)</em></li>
<li class="c1" aria-level="1">Once categories are identified, we use an LLM to <strong>strictly classify components</strong> based on text strings. This method allows us to save significant training time compared to a traditional AI model, as LLMs can be quickly prompt-engineered to handle various tasks. (see section: <em><a href="#breakdown">A Component-Level Breakdown of IT Hardware Emissions Using AI</a>)</em></li>
</ul><p>Unlike the first use case, where we needed a highly specialized model to detect similarities, we opted for LLMs for these three use cases because they leverage general human-language rules. This includes handling different units for parameters, grouping synonyms into categories, and recognizing varied phrasing or terminology that conveys the same concept. This approach allows us to efficiently handle variability and complexity in language, which would have required significantly more time and effort using only traditional AI. </p>
<h2>Identifying Similar Components With AI</h2>
<p>When analyzing inventory components, it’s common for multiple identifiers to represent the same parts or slight variations of them. This can occur due to differences in lifecycle stages, minor compositional variations, or new iterations of the part.</p>
<p>PCFs following the <a href="https://ghgprotocol.org/" target="_blank" rel="noopener">GHG Protocol</a> are the highest quality input data we can reference for each component, as they typically account for the Scope 3 emissions estimates throughout the entire lifecycle of the component. However, conducting a PCF is a time-consuming process that typically takes months. Therefore, when we receive PCF information, it is crucial to ensure that we map all the components correctly.</p>
<p>PCFs are typically tied to a specific identifier, along with aggregated components. For instance, a PCF might be performed specifically for a particular board in a server, but there could be numerous variations of this specific component within an inventory. The complexity increases as the subcomponents of these items are often identical, meaning the potential impact of a PCF can be significantly multiplied across a fleet.</p>
<p>To maximize the utility of a PCF, it is essential not only to identify the primary component and its related subcomponents but also to identify all similar parts that a PCF could be applied to. If these similar components are not identified, their carbon footprint estimates will remain at a lower data quality. Therefore, identifying similar components is crucial to ensure that we:</p>
<ul><li class="c1" aria-level="1">Leverage PCF information to ensure the highest data quality for all components.</li>
<li class="c1" aria-level="1">Maintain consistency within the dataset, ensuring that similar components have the same or closely aligned estimates.</li>
<li class="c1" aria-level="1">Improve traceability of each component’s carbon footprint estimate for reporting.</li>
</ul><p>To achieve this, we employed an NLP algorithm, specifically tailored to the language of this dataset, to identify possible proxy components by analyzing textual descriptions and filtering results by component category to ensure relevance.</p>
<p>The algorithm identifies proxy components in two distinct ways:</p>
<ol><li class="c1" aria-level="1"><strong>Leveraging New PCFs</strong>: When a new PCF is received, the algorithm uses it as a reference point. It analyzes the description names of components within the same category to identify those with a high percentage of similarity. These similar components can be mapped to a representative proxy PCF, allowing us to use high-quality PCF data in similar components.</li>
<li class="c1" aria-level="1"><strong>Improving Low Data Quality Components</strong>: For components with low data quality scores, the algorithm operates in reverse with additional constraints. Starting with a list of low-data-quality components, the algorithm searches for estimates that have a data quality score greater than a certain threshold. These high-quality references can then be used to improve the data quality of the original low-scoring components.</li>
</ol><p>Meta’s Net Zero team reviews the proposed proxies and validates our ability to apply them in our estimates. This approach enhances the accuracy and consistency of component data, ensures that high-quality PCF data is effectively utilized across similar components, and enables us to design our systems to more effectively reduce emissions associated with server hardware.</p>
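<p>The similarity matching described above can be sketched with a minimal, pure-Python TF-IDF plus cosine-similarity implementation. This is a stand-in for a production library, and the component descriptions below are made up for illustration.</p>

```python
# Minimal TF-IDF + cosine-similarity sketch of the proxy-matching step:
# given a component with a new PCF, score other descriptions in the same
# category and surface close matches as proxy candidates.

import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for a list of description strings."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical descriptions within one category ("board")
descriptions = [
    "server board rev a 8 layer",  # component with a new PCF
    "server board rev b 8 layer",  # candidate proxy
    "fan tray 60mm dual rotor",    # unrelated part
]
vecs = tfidf_vectors(descriptions)
scores = [cosine(vecs[0], v) for v in vecs[1:]]
# The rev-b board scores far higher than the fan tray, so the PCF can be
# proposed as a proxy (still subject to the Net Zero team's review).
assert scores[0] > scores[1]
```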
<h2 id="extracting">Extracting Data From Heterogeneous Data Sources</h2>
<p>When PCFs are not available, we aim to avoid using spend-to-carbon methods because they tie sustainability too closely to spending on hardware and can be less accurate due to the influence of factors like supply chain disruptions. </p>
<p>Instead, we have developed a portfolio of methods to estimate the carbon footprint of these components, including through parameterized modeling. To adapt any model at scale, we require two essential elements: a deterministic model to scale the emissions, and a list of data input parameters. For example, we can scale the carbon footprint calculation for a component by knowing its constituent components’ carbon footprint.</p>
<p>However, applying this methodology can be challenging due to inconsistent description data or the locations where information is presented. For instance, information about cables may be stored in different tables, formats, or units, so we may be unable to apply models to some components due to difficulty in locating input data.</p>
<p>To overcome this challenge, we have utilized LLMs to extract information from heterogeneous sources and inject it into the parameterized model. This differs from how we apply NLP, as it focuses on extracting information from specific components. Scaling a common model ensures that the estimates provided for these parts are consistent with similar parts from the same family and can inform estimates for missing or misaligned parts.</p>
<p>We applied this approach to two specific categories: memory and cables. The LLM extracts relevant data (e.g., the capacity for memory estimates and length/type of cable for physics-based estimates) and scales the components’ emissions calculations according to the provided formulas. </p>
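<p>For the cable category, the parameterized-model step might look like the sketch below. The extraction function here is a deterministic stand-in for the LLM call, and the emission factors are made-up numbers, not Meta’s actual data.</p>

```python
# Sketch of the parameterized-model approach: a deterministic formula
# scales a reference footprint by extracted attributes (cable type and
# length). In production, an LLM performs the extraction and can handle
# whatever units and phrasing the source tables use.

# Hypothetical kgCO2e per meter for each cable type (illustrative only)
EMISSION_FACTOR_KG_PER_M = {"copper-dac": 0.9, "aom": 1.4}

def cable_footprint_kg(cable_type: str, length_m: float) -> float:
    """Deterministic model: footprint scales linearly with cable length."""
    return EMISSION_FACTOR_KG_PER_M[cable_type] * length_m

def extract_parameters(description: str) -> dict:
    """Stand-in for the LLM extraction step."""
    tokens = [t.strip(",") for t in description.lower().split()]
    length = next(float(t[:-1]) for t in tokens
                  if t.endswith("m") and t[:-1].replace(".", "", 1).isdigit())
    cable_type = "aom" if "optical" in tokens else "copper-dac"
    return {"cable_type": cable_type, "length_m": length}

params = extract_parameters("Active optical cable, 3m, QSFP-DD")
print(round(cable_footprint_kg(**params), 2))  # 4.2
```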
<h2 id="breakdown">A Component-Level Breakdown of IT Hardware Emissions Using AI</h2>
<p>We utilize our centralized component carbon footprint database not only for reporting emissions, but also to drive our ability to efficiently deploy emissions reduction interventions. Conducting a granular analysis of component-level emissions enables us to pinpoint specific areas for improvement and prioritize our efforts to achieve net zero emissions. For instance, if a particular component is found to have a disproportionately high carbon footprint, we can explore alternative materials or manufacturing processes to mitigate its environmental impact. We may also determine that we should reuse components and extend their useful life by testing or augmenting component reliability. By leveraging data-driven insights at the component level and driving proactive design interventions to reduce component emissions, we can more effectively prioritize sustainability when designing new servers.</p>
<p>We leverage a bill of materials (BOM) to list all of the components in a server rack in a tree structure, with “children” component nodes listed under “parent” nodes. However, each vendor can have a different BOM structure, so two identical racks may be represented differently. This, coupled with the heterogeneity of methods to estimate emissions, makes it challenging to easily identify actions to reduce component emissions.</p>
<p>To address this challenge, we have used AI to categorize the descriptive data of our racks into two hierarchical levels:</p>
<ul><li class="c1" aria-level="1"><strong>Domain-level</strong>: A high-level breakdown of a rack into main functional groupings (e.g., compute, network, power, mechanical, and storage)</li>
<li class="c1" aria-level="1"><strong>Component-level</strong>: A detailed breakdown that highlights the major components that are responsible for the bulk of Scope 3 emissions (e.g., CPU, GPU, DRAM, Flash, etc.)</li>
</ul><p>We have developed two classification models: one for “domain” mapping, and another for “component” mapping. The difference between these mappings lies in the training data and the additional set of examples provided to each model. We then combine the two classifications to generate a mutually exclusive hierarchy.</p>
<p>During the exploration phase of the new taxonomy generation, we allowed the GenAI model to operate freely to identify potential categories for grouping. After reviewing these potential groupings with our internal hardware experts, we established a fixed list of major components. Once this list was finalized, we switched to using a strict GenAI classifier model as follows:</p>
<ol><li class="c1" aria-level="1">For each rack, recursively identify the highest contributors, grouping smaller represented items together.</li>
<li class="c1" aria-level="1">Run a GenAI mutually exclusive classifier algorithm to group the components into the identified categories.</li>
</ol><figure id="attachment_23112" aria-describedby="caption-attachment-23112" class="wp-caption alignnone c2"><img class="wp-image-23112" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png" alt="" width="700" height="258" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=916,337 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=768,283 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=1024,377 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=1536,566 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=192,71 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23112" class="wp-caption-text">The emissions breakdown for a generic compute rack.</figcaption></figure><p>This methodology has been presented at the <a href="https://www.youtube.com/watch?v=rTfPdI31VIE" target="_blank" rel="noopener">2025 OCP regional EMEA summit</a> with the goal of driving the industry toward a common taxonomy for carbon footprint emissions and of open sourcing the methodology we used to create our taxonomy.</p>
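The two steps above can be sketched as follows, with a simple keyword lookup standing in for the GenAI mutually exclusive classifier. The BOM structure, category names, part descriptions, and kgCO2e figures are all hypothetical.

```python
from collections import defaultdict

# Stub for the GenAI classifier: map a free-text BOM description to one
# mutually exclusive component category (hypothetical keyword table).
KEYWORDS = {"dimm": "DRAM", "cpu": "CPU", "gpu": "GPU", "ssd": "Flash"}

def classify(description: str) -> str:
    for keyword, category in KEYWORDS.items():
        if keyword in description.lower():
            return category
    return "other"   # smaller contributors get grouped together

def breakdown(node: dict, totals=None) -> dict:
    """Recursively walk a vendor BOM tree ("children" under "parents"),
    summing leaf-component emissions per category."""
    if totals is None:
        totals = defaultdict(float)
    for child in node.get("children", []):
        breakdown(child, totals)
    if "kgco2e" in node:  # leaf component carrying an emissions estimate
        totals[classify(node["description"])] += node["kgco2e"]
    return dict(totals)

# A toy rack BOM; real vendor BOMs differ in structure, which is exactly
# why the classification layer is needed.
bom = {"description": "rack", "children": [
    {"description": "compute sled", "children": [
        {"description": "CPU package", "kgco2e": 25.0},
        {"description": "DIMM 64GB", "kgco2e": 40.0},
    ]},
    {"description": "NVMe SSD 4TB", "kgco2e": 60.0},
    {"description": "steel bracket", "kgco2e": 3.0},
]}
per_category = breakdown(bom)
```

Because the same classifier is applied regardless of each vendor's BOM layout, two identical racks represented differently end up with the same category-level totals.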
<p>These groupings are specifically created to aid carbon footprint analysis, rather than for other purposes such as cost analysis. However, the methodology can be tailored for other purposes as necessary.</p>
<h2>Coming Soon: Open Sourcing Our Taxonomies and Methodologies</h2>
<p>As we work toward achieving net zero emissions across our value chain in 2030, this component-level breakdown methodology is necessary to help understand our emissions at the server component level. By using a combination of high-quality PCFs, spend-to-carbon data, and a portfolio of methods that leverage AI, we can enhance our data quality and coverage to more effectively deploy emissions reduction interventions. </p>
<p>Our next steps include open sourcing:</p>
<ul><li class="c1" aria-level="1">The taxonomy and methodology for server rack emissions accounting.</li>
<li class="c1" aria-level="1">The taxonomy builder using GenAI classifiers.</li>
<li class="c1" aria-level="1">The aggregation methodology to improve facility reporting processes across the industry.</li>
</ul><p>We are committed to sharing our learnings with the industry as we evolve this methodology, now as part of a collaborative effort with the OCP PCR group.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/14/data-center-engineering/how-meta-is-leveraging-ai-to-improve-the-quality-of-scope-3-emission-estimates-for-it-hardware/</link>
      <guid>https://engineering.fb.com/2025/10/14/data-center-engineering/how-meta-is-leveraging-ai-to-improve-the-quality-of-scope-3-emission-estimates-for-it-hardware/</guid>
      <pubDate>Tue, 14 Oct 2025 22:40:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Design for Sustainability: New Design Principles for Reducing IT Hardware Emissions]]></title>
<description><![CDATA[<ul><li class="c1" aria-level="1">We’re presenting Design for Sustainability, a set of technical design principles for new IT hardware designs to reduce emissions and cost through reuse, extending useful life, and optimizing design.</li>
<li class="c1" aria-level="1">At Meta, we’ve been able to significantly reduce the carbon footprint of our data center infrastructure by integrating several design strategies such as modularity, reuse, retrofitting, dematerialization, greener materials, and extended hardware lifecycles. </li>
<li class="c1" aria-level="1">We’re inviting the wider industry to also adopt the strategies outlined here to help reach sustainability goals.</li>
</ul><p>The data centers, server hardware, and global network infrastructure that underpin Meta’s operations are a critical focus in addressing our environmental impact. As we develop and deploy the compute capacity and storage racks used in data centers, we are focused <a href="https://sustainability.atmeta.com/climate/" target="_blank" rel="noopener">on our goal to reach net zero emissions across our value chain in 2030</a>. To do this, we prioritize interventions to reduce emissions associated with this hardware, including collaborating with hardware suppliers to reduce upstream emissions.</p>
<h2>What Is Design for Sustainability? </h2>
<p>Design for Sustainability is a set of guidelines, developed and proposed by Meta, to aid hardware designers in reducing the environmental impact of IT racks. This considers various factors such as energy efficiency and the selection, reduction, circularity, and end-of-life disposal of materials used in hardware. Sustainable hardware design requires collaboration between hardware designers, engineers, and sustainability experts to create hardware that meets performance requirements while limiting environmental impact.</p>
<p>In this guide, we specifically focus on the design of racks that power our data centers and offer alternatives for various components (e.g., mechanicals, cooling, compute, storage and cabling) that can help rack designers make sustainable choices early in the product’s lifecycle. </p>
<h2>Our Focus on Scope 3 Emissions</h2>
<p>To reach our net zero goal, we are primarily focused on reducing our Scope 3 (or value chain) emissions from physical sources like data center construction and our IT hardware (compute, storage and cooling equipment) and network fiber infrastructure.</p>
<p>While the energy efficiency of the hardware itself deployed in our data centers helps reduce energy consumption, we have to also consider IT hardware emissions associated with the manufacturing and delivery of equipment to Meta, as well as the end-of-life disposal, recycling, or resale of this hardware.</p>
<p>Our methods for controlling and reducing Scope 3 emissions generally involve optimizing material selection, choosing and developing <a href="https://engineering.fb.com/2025/07/16/data-center-engineering/ai-make-lower-carbon-faster-curing-concrete/">lower carbon alternatives in design</a>, and helping to reduce the upstream emissions of our suppliers.</p>
<p>For internal teams focused on hardware, this involves:</p>
<ul><li class="c1" aria-level="1">Optimizing hardware design for the lowest possible emissions, extending the useful life of materials as much as possible with each system design, or using lower carbon materials.</li>
<li class="c1" aria-level="1">Being more efficient by extending the useful life of IT racks to potentially skip new generations of equipment.</li>
<li class="c1" aria-level="1">Harvesting server components that are no longer commercially available so they can be used as spares. When racks reach their end-of-life, some of the components still have service life left in them and can be harvested and reused in a variety of ways. Circularity programs harvest components such as dual in-line memory modules (DIMMs) from end-of-life racks and redeploy them in new builds.</li>
<li class="c1" aria-level="1">Knowing the emissions profiles of suppliers, components, and system designs. This in turn informs future roadmaps that will further reduce emissions.</li>
<li class="c1" aria-level="1">Collaborating with suppliers to electrify their manufacturing processes, to transition to renewable energy, and to leverage lower carbon materials and designs.</li>
</ul><p>These actions to reduce Scope 3 emissions from our IT hardware also have the additional benefit of reducing the amount of electronic waste (e-waste) generated from our data centers.</p>
<h2>An Overview of the Types of Racks We Deploy </h2>
<p>There are many different <a href="https://www.opencompute.org/contributions">rack designs</a> deployed within Meta’s data centers to support different workloads and infrastructure needs, mainly:</p>
<ol><li class="c1" aria-level="1"><strong>AI –</strong> AI training and inference workloads</li>
<li class="c1" aria-level="1"><strong>Compute –</strong> General compute needed for running Meta’s products and services</li>
<li class="c1" aria-level="1"><strong>Storage –</strong> Storing and maintaining data used by our products</li>
<li class="c1" aria-level="1"><strong>Network –</strong> Providing low-latency interconnections between servers</li>
</ol><p>While there are differences in architecture across these different rack types, most of these racks apply general hardware design principles and contain active and passive components from a similar group of suppliers. As such, <strong>the same design principles for sustainability apply across these varied rack types.</strong></p>
<p>Within each rack, there are five main categories of components that are targeted for emissions reductions: </p>
<ol><li class="c1" aria-level="1">Compute (i.e., memory, HDD/SSD)</li>
<li class="c1" aria-level="1">Storage</li>
<li class="c1" aria-level="1">Network</li>
<li class="c1" aria-level="1">Power</li>
<li class="c1" aria-level="1">Rack infrastructure (i.e., mechanical and thermals)</li>
</ol><p>The emissions breakdown for a generic compute rack is shown below.</p>
<p><img class="alignnone size-full wp-image-23112" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png" alt="" width="1999" height="736" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=916,337 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=768,283 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=1024,377 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=1536,566 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-1.png?resize=192,71 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>Our Techniques for Reducing Emissions</h2>
<p>We focus on four main categories to address emissions associated with these hardware components:</p>
<p><img class="alignnone size-full wp-image-23113" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-2.png" alt="" width="1398" height="546" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-2.png 1398w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-2.png?resize=916,358 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-2.png?resize=768,300 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-2.png?resize=1024,400 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-2.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-2.png?resize=192,75 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>We will cover a few of the levers listed above in detail below.</p>
<h3>Modular Rack Designs</h3>
<p><strong>Modular design allows older rack components to be re-used in newer racks.</strong> <a href="https://www.opencompute.org/wiki/Open_Rack/SpecsAndDesigns" target="_blank" rel="noopener">Open Rack designs</a> (ORv2 &amp; ORv3) form the bulk of high-volume racks that exist in our data centers. </p>
<p><strong><img class="wp-image-23116 alignnone" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Orv3-rack.png" alt="" width="450" height="600" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Orv3-rack.png 1500w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Orv3-rack.png?resize=687,916 687w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Orv3-rack.png?resize=768,1024 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Orv3-rack.png?resize=1152,1536 1152w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Orv3-rack.png?resize=96,128 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Orv3-rack.png?resize=192,256 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></strong></p>
<p>Here are some key aspects of the ORv3 modular rack design:</p>
<ul><li class="c1" aria-level="1"><strong>ORv3 separates Power Supply Units (PSUs) and Battery Backup Units (BBUs) into their own shelves</strong>. This allows for more reliable and flexible configurations, making repairs and replacements easier as each field replaceable unit (FRU) is toolless to replace.</li>
<li class="c1" aria-level="1"><strong>Power and flexibility</strong>: The ORv3 design includes a 48 V power output, which allows the power shelf to be placed anywhere in the rack. This is an improvement over the previous ORv2 design, which limited the power shelf to a specific power zone.</li>
<li class="c1" aria-level="1"><strong>Configurations</strong>: The rack can accommodate different configurations of PSU and BBU shelves to meet various platform and regional requirements. For example, North America uses a dual AC input per PSU shelf, while Europe and Asia use a single AC input.</li>
<li class="c1" aria-level="1"><strong>Commonization effort</strong>: There is an ongoing effort to design a “commonized” ORv3 rack frame that incorporates features from various rack variations into one standard frame. This aims to streamline the assembly process, reduce quality risks, and lower overall product costs.</li>
<li class="c1" aria-level="1"><strong>ORv3N</strong>: A derivative of ORv3, known as ORv3N, is designed for network-specific applications. It includes in-rack PSU and BBU, offering efficiency and cost improvements over traditional in-row UPS systems.</li>
</ul><p>These design principles should continue to be followed in successive generations of racks. With the <a href="https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/" target="_blank" rel="noopener">expansion of AI workloads</a>, new specialized racks for compute, storage, power and cooling are being developed that are challenging designers to adopt the most modular design principles. </p>
<h3>Re-Using/Retrofitting Existing Rack Designs</h3>
<p><a href="https://www.rittal.com/us-en_US/Company/Rittal-Stories/How-to-Retrofit-a-Colocation-Data-Center-For-High-Density" target="_blank" rel="noopener">Retrofitting existing rack designs</a> for new uses or higher density is a cost-effective and sustainable approach to meet evolving data center needs. This strategy can help reduce e-waste, lower costs, and accelerate deployment times. Benefits of re-use/retrofitting include:</p>
<ul><li class="c1" aria-level="1"><strong>Cost savings</strong>: Retrofitting existing racks can be significantly cheaper compared to purchasing new racks.</li>
<li class="c1" aria-level="1"><strong>Reduced e-waste</strong>: Reusing existing racks reduces the amount of e-waste generated by data centers.</li>
<li class="c1" aria-level="1"><strong>Faster deployment</strong>: Retrofitting existing racks can be completed faster than deploying new racks, as it eliminates the need for procurement and manufacturing lead times.</li>
<li class="c1" aria-level="1"><strong>Environmental benefits</strong>: Reducing e-waste and reusing existing materials helps minimize the environmental impact of data centers.</li>
</ul><p>There are several challenges when considering re-using or retrofitting racks:</p>
<ul><li class="c1" aria-level="1"><strong>Compatibility issues</strong>: Ensuring compatibility between old and new components can be challenging.</li>
<li class="c1" aria-level="1"><strong>Power and cooling requirements</strong>: Retrofitting existing racks may require upgrades to power and cooling systems to support new equipment.</li>
<li class="c1" aria-level="1"><strong>Scalability and flexibility</strong>: Retrofitting existing racks may limit scalability and flexibility in terms of future upgrades or changes.</li>
<li class="c1" aria-level="1"><strong>Testing and validation</strong>: Thorough testing and validation are required to ensure that retrofitted racks meet performance and reliability standards.</li>
</ul><p>Overall, the benefits of retrofitting existing racks are substantial and should be examined in every new rack design.</p>
<h3>Green Steel</h3>
<p>Steel is a significant portion of a rack and chassis and substituting traditional steel with green steel can reduce emissions. Green steel is typically produced using electric arc furnaces (EAF) instead of traditional basic oxygen furnaces (BOF), allowing for the use of clean and renewable electricity and a higher quantity of recycled content. This approach significantly reduces carbon emissions associated with steel production. Meta collaborates with suppliers who offer green steel produced with 100% clean and renewable energy.</p>
<h3>Recycled Steel, Aluminum, and Copper</h3>
<p>While steel is a significant component of rack and chassis, aluminum and copper are extensively used in heat sinks and wiring. Recycling steel, aluminum, and copper saves significant energy needed to produce hardware from raw materials. </p>
<p>As part of our <a href="https://sustainability.atmeta.com/wp-content/uploads/2025/08/Meta_2025-Sustainability-Report_.pdf" target="_blank" rel="noopener">commitment to sustainability</a>, we now require all racks/chassis to contain a minimum of 20% recycled steel. Additionally, all heat sinks must be manufactured entirely from recycled aluminum or copper. These mandates are an important step in our ongoing sustainability journey.</p>
<p>Several of our steel suppliers, such as <a href="https://www.tatasteelnederland.com/nieuws/en/tata-steel-maubeuge-accelerates-decarbonisation-with-innovative-coil-coating-technology" target="_blank" rel="noopener">Tata Steel</a>, provide recycled steel. Product design teams may ask their original design manufacturer (ODM) partners to make sure that recycled steel is included in the steel vendor(s) selected by Meta’s ODM partners. Similarly, there are many vendors that are providing recycled aluminum and copper products.</p>
<h3>Improving Reliability to Extend Useful Life</h3>
<p>Extending the useful life of racks, servers, memory, and SSDs helps Meta reduce the amount of hardware that needs to be ordered. This has helped achieve significant reductions in both emissions and costs. </p>
<p>A key requirement for extending the useful life of hardware is the reliability of the component or rack. Benchmarking reliability is an important element in determining whether hardware life extensions are feasible and for how long. Additional consideration needs to be given to the fact that spares and vendor support may have diminishing availability. Extending hardware life also comes with the risk of increased equipment failure, so a clear strategy to deal with the higher incidence of potential failure should be put in place.</p>
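One way to see the emissions benefit of life extension is that a rack's embodied carbon is amortized over its years of service, so a longer useful life lowers the annualized figure. The sketch below illustrates this arithmetic with purely hypothetical numbers.

```python
def annualized_embodied(embodied_kgco2e: float, useful_life_years: float) -> float:
    """Embodied (Scope 3) emissions spread evenly over the years of service."""
    return embodied_kgco2e / useful_life_years

# Hypothetical rack with 1,200 kgCO2e of embodied carbon:
baseline = annualized_embodied(1200.0, 4)  # replaced every 4 years
extended = annualized_embodied(1200.0, 6)  # useful life extended to 6 years
# Extending from 4 to 6 years cuts the annualized embodied figure by a third,
# which must then be weighed against rising failure rates and spares scarcity.
```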
<h3>Dematerialization</h3>
<p>Dematerialization and removal of unnecessary hardware components can lead to a significant reduction in the use of raw materials, water, and/or energy. This entails reducing the use of raw materials such as steel on racks or removing unnecessary components on server motherboards while maintaining the design constraints established for the rack and its components. </p>
<p>Dematerialization also involves consolidating multiple racks into fewer, more efficient ones, reducing their overall physical footprint. </p>
<p>Extra components on hardware boards are included for several reasons:</p>
<ol><li class="c1" aria-level="1"><strong>Future-proofing</strong>: Components might be added to a circuit board in anticipation of future upgrades or changes in the design. This allows manufacturers to easily modify the board without having to redesign it from scratch.</li>
<li class="c1" aria-level="1"><strong>Flexibility</strong>: Extra components can provide flexibility in terms of configuration options. For example, a board might have multiple connectors or interfaces that can be used depending on the specific application.</li>
<li class="c1" aria-level="1"><strong>Debugging and testing</strong>: Additional components can be used for debugging and testing purposes. These components might include test points, debug headers, or other features that help engineers diagnose issues during development.</li>
<li class="c1" aria-level="1"><strong>Redundancy</strong>: In some cases, extra components are included to provide redundancy in case one component fails. This is particularly important in high-reliability applications where system failure could have significant consequences.</li>
<li class="c1" aria-level="1"><strong>Modularity</strong>: Extra components can make a board more modular, allowing users to customize or upgrade their system by adding or removing modules.</li>
<li class="c1" aria-level="1"><strong>Regulatory compliance</strong>: Some components might be required for regulatory compliance, such as safety features or electromagnetic interference (EMI) filtering.</li>
</ol><p>In addition, changes in requirements over time can also lead to extra components. While it is very difficult to modify systems in production, it is important to make sure that each hardware design optimizes for components that will be populated. </p>
<p>Examples of extra components on hardware boards include:</p>
<ul><li class="c1" aria-level="1">Unpopulated integrated circuit (IC) sockets or footprints</li>
<li class="c1" aria-level="1">Unused connectors or headers</li>
<li class="c1" aria-level="1">Test points or debug headers</li>
<li class="c1" aria-level="1">Redundant power supplies or capacitors</li>
<li class="c1" aria-level="1">Optional memory or storage components</li>
<li class="c1" aria-level="1">Unconnected or reserved pins on ICs</li>
</ul><p>In addition to hardware boards, excess components may also be present in other parts of the rack. Removing excess components can lead to lowering the emissions footprint of a circuit board or rack. </p>
<h3>Productionizing New Technologies With Lower Emissions</h3>
<p>Productionizing new technologies can help Meta significantly reduce emissions. Memory and SSDs/HDDs are typically the largest sources of embodied carbon emissions in a server rack. New technologies can help Meta reduce emissions and costs while providing substantially higher power-normalized performance. </p>
<p>Examples of such technologies include:</p>
<ul><li class="c1" aria-level="1">Transitioning to SSD from HDD can reduce emissions by requiring fewer drives, servers, racks, BBUs, and PSUs, as well as help reduce overall energy usage. </li>
<li class="c1" aria-level="1">Depending on local environmental conditions, and the data center’s workload, using liquid cooling in server racks can be up to <a href="https://www.youtube.com/watch?v=3bpGyt12AoM" target="_blank" rel="noopener">17% more carbon-efficient</a> than traditional air cooling.</li>
</ul><figure id="attachment_23117" aria-describedby="caption-attachment-23117" class="wp-caption alignnone c2"><img class="size-full wp-image-23117" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png" alt="" width="1454" height="816" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png 1454w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png?resize=916,514 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png?resize=768,431 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png?resize=1024,575 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-Design-for-Sustainability-image-3.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23117" class="wp-caption-text">Source: <a href="https://www.youtube.com/watch?v=3bpGyt12AoM" target="_blank" rel="noopener">OCP Global Summit, Oct 15-17, 2024, San Jose, CA</a>.</figcaption></figure><p>Teams can explore additional approaches to reduce emissions associated with memory/SSD/HDD which include:</p>
<ul><li class="c1" aria-level="1">Adopting alternate technologies such as phase-change memory (PCM) or magnetoresistive random-access memory (MRAM) that offer comparable performance with a lower carbon footprint.</li>
<li class="c1" aria-level="1">Using Low-Power Double Data Rate (LPDDR) memory instead of DDR for lower power consumption and high bandwidth.</li>
<li class="c1" aria-level="1">Removing/reusing unused memory modules to reduce energy usage or down-clocking them during idle periods.</li>
<li class="c1" aria-level="1">Using fewer, higher-capacity memory modules to reduce power and cooling needs, or using High Bandwidth Memory (HBM), which consumes much less energy than DDR memory.</li>
</ul><h3>Choosing the Right Suppliers</h3>
<p>Meta engages with suppliers to reduce emissions through its <a href="https://sustainability.atmeta.com/net-zero-supplier-engagement/">net zero supplier engagement program</a>. This program is designed to set GHG reduction targets with selected suppliers to help achieve our net zero target. Key aspects of the program include:</p>
<ul><li class="c1" aria-level="1"><strong>Providing capacity building</strong>: Training suppliers on how to measure emissions, set science-aligned targets, build reduction roadmaps, procure renewable energy, and understand energy markets. </li>
<li class="c1" aria-level="1"><strong>Scaling up</strong>: In 2021 the program started with 39 key suppliers; by 2024 it expanded to include 183 suppliers, who together account for over half of Meta’s supplier-related emissions. </li>
<li class="c1" aria-level="1"><strong>Setting target goals</strong>: Meta aims to have two-thirds of its suppliers set science-aligned greenhouse gas reduction targets by 2026. As of end-2024, 48% (by emissions contribution) have done so. </li>
</ul><p>The Clean Energy Procurement Academy (CEPA), launched in 2023 (with Meta and other corporations), helps suppliers — especially in the Asia-Pacific region — learn how to procure renewable energy via region-specific curricula. </p>
<h2>The Road to Net Zero Emissions</h2>
<p>The Design for Sustainability principles outlined in this guide represent an important step forward in Meta’s goal to achieve net zero emissions in 2030. By integrating innovative design strategies such as modularity, reuse, retrofitting, and dematerialization, alongside the adoption of greener materials and extended hardware lifecycles, Meta can significantly reduce the carbon footprint of its data center infrastructure. These approaches not only lower emissions but also drive cost savings, e-waste reductions, and operational efficiency, reinforcing sustainability as a core business value.</p>
<p>Collaboration across hardware designers, engineers, suppliers, and sustainability experts is essential to realize these goals. The ongoing engagement with suppliers further amplifies the impact by addressing emissions across our entire value chain. As Meta continues to evolve its rack designs and operational frameworks, the focus on sustainability will remain paramount, ensuring that future infrastructure innovations support both environmental responsibility and business performance.</p>
<p>Ultimately, the success of these efforts will be measured by tangible emissions reductions, extended useful life of server hardware, and the widespread adoption of low carbon technologies and materials.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/14/data-center-engineering/design-for-sustainability-new-design-principles-for-reducing-it-hardware-emissions/</link>
      <guid>https://engineering.fb.com/2025/10/14/data-center-engineering/design-for-sustainability-new-design-principles-for-reducing-it-hardware-emissions/</guid>
      <pubDate>Tue, 14 Oct 2025 22:40:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[OCP Summit 2025: The Open Future of Networking Hardware for AI]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">At Open Compute Project Summit (OCP) 2025, we’re sharing details about the direction of next-generation network fabrics for our AI training clusters.</li>
<li class="c1" aria-level="1">We’ve expanded our network hardware portfolio and are contributing new disaggregated network platforms to OCP.</li>
<li class="c1" aria-level="1">We look forward to continued collaboration with OCP to open designs for racks, servers, storage boxes, and motherboards to benefit companies of all sizes across the industry.</li>
</ul><p>At Meta, we believe that open hardware is a catalyst for innovation — especially <a href="https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/" target="_blank" rel="noopener">as data center infrastructure increasingly supports new and emerging AI technologies</a>. Open hardware plays a crucial role in enabling disaggregation, allowing us to break down traditional data center technologies into their core components. This approach empowers us to build systems that are more flexible, scalable, and efficient.</p>
<p>Since co-founding the Open Compute Project (OCP) in 2011, Meta has shared data center and component designs, and open-sourced our network operating system, <a href="https://engineering.fb.com/2018/09/04/data-infrastructure/research-in-brief-building-switch-software-at-scale-and-in-the-open/">FBOSS</a>,  to inspire new ideas both within our own operations and across the industry. These efforts have played an important role in making Meta’s data centers sustainable and efficient. Today, through OCP, we continue to advance open network technologies for the next generation of AI applications.</p>
<p>We’re announcing several new milestones for our data center networking: </p>
<ul><li class="c1" aria-level="1">The evolution of <a href="https://engineering.fb.com/2024/10/15/data-infrastructure/open-future-networking-hardware-ai-ocp-2024-meta/" target="_blank" rel="noopener">Disaggregated Scheduled Fabric (DSF)</a> to support scale-out interconnect for large AI clusters that span entire data center buildings. </li>
<li class="c1" aria-level="1">A new Non-Scheduled Fabric (NSF) architecture based entirely on shallow-buffer, disaggregated Ethernet switches that will support our largest AI clusters like <a href="https://www.facebook.com/zuck/videos/for-our-superintelligence-effort-im-focused-on-building-the-most-elite-and-talen/2300161320399228/" target="_blank" rel="noopener">Prometheus</a>. </li>
<li class="c1" aria-level="1">The addition of Minipack3N, based on NVIDIA’s Ethernet Spectrum-4 ASIC, to our portfolio of 51 Tbps OCP switches that use OCP’s SAI and Meta’s FBOSS software stack.</li>
<li class="c1" aria-level="1">The launch of the Ethernet for Scale-Up Networking (ESUN) initiative, where Meta has worked with other large-scale operators and leading Ethernet vendors to advance using Ethernet for scale-up networking (specifically the high-performance interconnects required for next-generation AI accelerator architectures..</li>
</ul><h2>Dual-Stage DSF: Scaling Scheduled Fabrics for Larger AI Clusters</h2>
<p>At last year’s OCP Global Summit we shared <a href="https://engineering.fb.com/2024/10/15/data-infrastructure/open-future-networking-hardware-ai-ocp-2024-meta/" target="_blank" rel="noopener">Disaggregated Scheduled Fabric (DSF)</a>, a VOQ-based system powered by the open <a href="https://github.com/opencomputeproject/SAI" target="_blank" rel="noopener">OCP-SAI</a> standard and <a href="https://engineering.fb.com/2018/09/04/data-infrastructure/research-in-brief-building-switch-software-at-scale-and-in-the-open/" target="_blank" rel="noopener">FBOSS</a>. The DSF fabric supports an open and standard Ethernet-based RoCE interface to endpoints and accelerators across several xPUs and NICs, including Meta’s <a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/">MTIA</a> as well as from several vendors.</p>
<p>Over the last year, we have evolved DSF to a 2-stage architecture, scaling to support a non-blocking fabric that interconnects up to <a href="https://drive.google.com/file/d/1OC4xVJIdeEFHTdIun-yJ40HeHpX3C-fU/view" target="_blank" rel="noopener">18,432 XPUs</a>. These fabrics are a fundamental building block for constructing AI clusters that span an entire region (and even multiple regions) in order to meet the increased capacity and performance demands of Meta’s AI workloads.</p>
<figure id="attachment_23098" aria-describedby="caption-attachment-23098" class="wp-caption alignnone c2"><img class="size-full wp-image-23098" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png" alt="" width="1999" height="1182" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png?resize=916,542 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png?resize=768,454 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png?resize=1024,605 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png?resize=1536,908 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png?resize=96,57 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-DSF-architecture.png?resize=192,114 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23098" class="wp-caption-text">The new dual-stage DSF architecture supports non-blocking fabric, enabling interconnect between a larger number of GPUs in a cluster. At Meta, we’ve used it to build out clusters of 18k GPUs at the scale of entire data center buildings.</figcaption></figure><h2>Non-Scheduled Fabrics (NSF) for Large AI Clusters</h2>
<p>In parallel with the evolution of the DSF architecture, we have also devised a new architecture called the Non-Scheduled Fabric (NSF), with the following key features:</p>
<ul><li class="c1" aria-level="1">Based on shallow-buffer OCP Ethernet switches.</li>
<li class="c1" aria-level="1">Delivers low round-trip latency.</li>
<li class="c1" aria-level="1">Supports adaptive routing for effective load-balancing, ensuring optimal utilization and minimizing congestion.</li>
<li class="c1" aria-level="1">Serves as foundational building block for Gigawatt-scale AI clusters like Prometheus.</li>
</ul><figure id="attachment_23096" aria-describedby="caption-attachment-23096" class="wp-caption alignnone c2"><img class="size-full wp-image-23096" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png" alt="" width="1999" height="1080" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png?resize=916,495 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png?resize=768,415 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png?resize=1024,553 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png?resize=1536,830 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-NSF.png?resize=192,104 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23096" class="wp-caption-text">NSF — Three-tier Non-Scheduled Fabrics for building scale AI clusters.</figcaption></figure><h2>New OCP Switch Platforms for Next-Generation AI Fabrics</h2>
<p>Last year, Meta introduced two new <a href="https://engineering.fb.com/2024/10/15/data-infrastructure/open-future-networking-hardware-ai-ocp-2024-meta/" target="_blank" rel="noopener">51T Ethernet switches</a>: Minipack3 (based on Broadcom Tomahawk5) and Cisco 8501 (based on Cisco Silicon One G200). These OCP switches offer 51.2 Tbps (64x OSFP ports), are power-efficient without the need for retimers, and run our large-scale network operating system, FBOSS. These platforms have served as the foundation for building our next-generation frontend and backend data center fabrics. </p>
<p>This year, we are introducing Minipack3N, a new 51T Ethernet switch that is based on the NVIDIA Spectrum-4 switching ASIC and leverages the same system design as Minipack3.</p>
<figure id="attachment_23097" aria-describedby="caption-attachment-23097" class="wp-caption alignnone c3"><img class="wp-image-23097" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png" alt="" width="700" height="394" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-Minipack3N.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23097" class="wp-caption-text">The Minipack3N, a 51.2 Tbps switch (designed by Meta and manufactured by Accton) based on the NVIDIA Spectrum-4 Ethernet switching ASIC.</figcaption></figure><h2>Evolving FBOSS and SAI for DSF and NSF</h2>
<p><img class="alignnone size-full" src="https://engineering.fb.com/wp-content/uploads/2024/10/SAI-FBOSS-logo.png" width="456" height="168" alt="image" /></p>
<p>Meta continues to embrace OCP-SAI as the foundation for onboarding new network fabrics, switch hardware platforms, and optical transceivers into FBOSS. Through close collaboration with vendors and the OCP community, we have evolved SAI to support advanced features and concepts, including DSF, NSF, and other enhanced routing schemes tailored for modern data center and AI workloads.</p>
<p>This open approach empowers developers and engineers worldwide to engage with cutting-edge hardware, contribute innovative software, and leverage these solutions for their own needs. By sharing advancements and fostering collaboration, we help accelerate progress across the industry, ensuring that open hardware and software remain at the heart of scalable, efficient, and future-ready data center infrastructure.</p>
<h2>Optics: 2x400G FR4-LITE and 400G/2x400G DR4 Optics for 400G/800G Optical Interconnections</h2>
<p>Last year, Meta introduced 2x400G FR4 BASE (3-km) optics, the primary solution supporting next-generation 51T platforms across both backend and frontend networks and DSFs. These optics have now been widely deployed throughout Meta’s data centers. </p>
<p>This year, we are expanding our portfolio with the launch of 2x400G FR4 LITE (500-m) optics. Developed as part of an efficiency initiative, FR4 LITE is optimized for the majority of intra–data center use cases, supporting fiber links up to 500 meters. This new variant is designed to accelerate optics cost reduction while maintaining robust performance for shorter-reach applications.</p>
<p>In addition, we are introducing the 400G DR4 OSFP-RHS optics — our first-generation DR4 solution for AI host-side NIC connectivity. Complementing this, the new 2x400G DR4 OSFP optics are being deployed on the switch side, providing connectivity from host to switch.</p>
<figure id="attachment_23099" aria-describedby="caption-attachment-23099" class="wp-caption alignnone c3"><img class="wp-image-23099" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png" alt="" width="700" height="461" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png?resize=916,603 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png?resize=768,506 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png?resize=1024,675 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png?resize=1536,1012 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png?resize=96,63 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OCP-2025-Optics.png?resize=192,126 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23099" class="wp-caption-text">The 400G DR4 (left), 2x400G DR4 (center), and the 2x400G FR4 LITE (right).</figcaption></figure><h2>Ethernet for Scale-Up Networking in OCP: Meta’s Industry Leadership</h2>
<p>At Meta, we recognize that the future of AI and data center infrastructure depends on open, scalable, and interoperable networking solutions. As part of our ongoing commitment to open hardware and industry collaboration, Meta is a founding participant in the new Ethernet for Scale-Up Networking (ESUN) initiative, which launched within OCP at the 2025 OCP Global Summit.</p>
<h3>What Is ESUN?</h3>
<p>ESUN is a new workstream within the OCP Networking Project. It functions as an open technical forum where industry operators and leading vendors can collaborate to advance the use of Ethernet technology. The specific goal of ESUN is to leverage and adapt the mature Ethernet ecosystem to meet the unique, high-performance demands of the scale-up domain within modern AI systems.</p>
<p>ESUN is focused specifically on the <strong>network functionality</strong> aspect of scale-up systems. The workstream is designed to address the technical challenges related to how data traffic is managed and transmitted across network switches. This includes defining best practices and standards for:</p>
<ul><li class="c1" aria-level="1">Protocol headers</li>
<li class="c1" aria-level="1">Error handling mechanisms</li>
<li class="c1" aria-level="1">Achieving lossless data transfer across the network</li>
</ul><p>The initiative brings together operators, vendors, and standards bodies to:</p>
<ul><li class="c1" aria-level="1">Collaborate on Ethernet solutions tailored for scale-up networking.</li>
<li class="c1" aria-level="1">Focus on Ethernet framing and switching layers to ensure robust, lossless, and error-resilient multi-hop topologies.</li>
<li class="c1" aria-level="1">Align with open standards by working closely with organizations like UEC and IEEE.</li>
</ul><h3>Meta’s Contributions to ESUN</h3>
<p>Meta is proud to be among the initial group of OCP members driving ESUN, alongside industry leaders AMD, Arista, ARM, Broadcom, Cisco, HPE Networking, Marvell, Microsoft, NVIDIA, OpenAI, and Oracle.</p>
<p>Our contributions include:</p>
<ul><li class="c1" aria-level="1">Technical leadership in defining the requirements for ESUN in AI clusters.</li>
<li class="c1" aria-level="1">Open collaboration with vendors and standards bodies to ensure that solutions are interoperable and not tied to proprietary technologies.</li>
<li class="c1" aria-level="1">Sharing best practices and lessons learned from deploying advanced Ethernet fabrics in Meta’s own data centers., </li>
</ul><h2>An Industry Invitation: Join the Open Future</h2>
<p>Driving progress in AI requires data center infrastructure that delivers more than just scale — it must also be flexible, efficient, and sustainable. At Meta, we envision a future where AI hardware systems are not only highly scalable, but also open and collaborative, enabling rapid innovation and adaptation to evolving workloads.</p>
<p>We invite engineers, developers, and industry partners to join us and the OCP community in shaping the next generation of networking hardware for AI. By working together and sharing ideas, we can accelerate the development of open, future-ready AI infrastructure that benefits the entire industry and supports the demands of tomorrow’s technologies.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/13/data-infrastructure/ocp-summit-2025-the-open-future-of-networking-hardware-for-ai/</link>
      <guid>https://engineering.fb.com/2025/10/13/data-infrastructure/ocp-summit-2025-the-open-future-of-networking-hardware-for-ai/</guid>
      <pubDate>Tue, 14 Oct 2025 02:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing the React Foundation: The New Home for React & React Native]]></title>
      <description><![CDATA[<p>Meta open-sourced React over a decade ago to help developers build better user experiences. Since then, React has grown into one of the world’s most popular open source projects, <a href="https://trends.builtwith.com/websitelist/React">powering over 50 million websites</a> and products built by <a href="https://reactnative.dev/showcase">companies such as Microsoft, Shopify, Bloomberg, Discord, Coinbase, the NFL, and many others</a>. With React Native, React has expanded to support platforms beyond the web, including mobile, tablets, desktops, TVs, gaming consoles, and even mixed reality devices.</p>
<p>This incredible growth is thanks to the thousands of educators, companies, and projects that have contributed to the development of React. The community is the heart of React, and we’re proud to play a part in the cycle of open source innovation throughout the ecosystem that benefits everyone. We’re pleased to give a seat at the table to the people and companies that have made React what it is today.</p>
<p>Today, we are excited to announce the next step for React. Several projects within the React ecosystem, including React and React Native, as well as supporting projects such as JSX, will transition to the React Foundation. The React Foundation’s mission is to help the React community and its members. The React Foundation will maintain React’s infrastructure, organize <a href="https://conf.react.dev/">React Conf</a>, and create initiatives to support the React ecosystem. The React Foundation will be part of the Linux Foundation, which has long fostered a vendor-neutral environment for open source projects.</p>
<h2>Formalizing Governance</h2>
<p>The React Foundation’s governing board will consist of representatives from Amazon, Callstack, Expo, Meta, Microsoft, Software Mansion, and Vercel, with the intention to expand further over time.</p>
<p>There will be a clear separation between the business and technical governance of React. Releases, features, and technical direction will be governed by a new structure driven by the maintainers and contributors of React. This new technical governance structure will be independent of the React Foundation. The React team is actively working on this new technical governance structure and will share more details in a future post on <a href="https://react.dev/blog">the React blog</a>.</p>
<h2>Meta and the React Foundation</h2>
<p>Meta is committing to a five-year partnership with the React Foundation, including over $3 million in funding and dedicated engineering support. This investment will ensure React’s smooth transition to independent governance while maintaining the stability and innovation the community expects. Meta will continue to invest in React and use it as our primary tool for building UI on the web and across many of Meta’s apps. Meta will also continue to have a dedicated team of engineers working full-time on React and React Native.</p>
<p>We believe the best of React is yet to come. The React Foundation will unlock new opportunities for collaboration, innovation, and growth that will benefit the entire ecosystem. We’re excited to see what the community will build together under this new model. With strengthened governance, broader industry participation, and continued technical excellence, React is positioned to tackle the next generation of challenges in UI development.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/07/open-source/introducing-the-react-foundation-the-new-home-for-react-react-native/</link>
      <guid>https://engineering.fb.com/2025/10/07/open-source/introducing-the-react-foundation-the-new-home-for-react-react-native/</guid>
      <pubDate>Tue, 07 Oct 2025 20:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing OpenZL: An Open Source Format-Aware Compression Framework]]></title>
      <description><![CDATA[<ul><li><a href="https://openzl.org/" target="_blank" rel="noopener">OpenZL</a> is a new open source data compression framework that offers lossless compression for structured data.</li>
<li>OpenZL is designed to offer the performance of a format-specific compressor with the easy maintenance of a single executable binary.</li>
<li>You can get started with OpenZL today by visiting our <a href="https://facebook.github.io/openzl/getting-started/quick-start/" target="_blank" rel="noopener">Quick Start guide</a> and the <a href="https://github.com/facebook/openzl" target="_blank" rel="noopener">OpenZL GitHub repository</a>.</li>
</ul><p>Today, we are excited to announce the public release of <a href="https://openzl.org/" target="_blank" rel="noopener">OpenZL</a>, a new data compression framework. OpenZL offers lossless compression for structured data, with performance comparable to specialized compressors. It accomplishes this by applying a configurable sequence of transforms to the input, revealing hidden order in the data, which can then be more easily compressed. Despite applying distinct transformation permutations for every file type, all OpenZL files can be decompressed using the same universal OpenZL decompressor.</p>
<h2>A Decade of Lessons</h2>
<p>When <a href="https://engineering.fb.com/2016/08/31/core-infra/smaller-and-faster-data-compression-with-zstandard/" target="_blank" rel="noopener">Zstandard</a> was announced, it came with a simple pitch: the same or better compression ratio than the prior default, at the much higher speeds required by datacenter workloads. By pairing strong entropy coding with a design that fully utilized modern CPU capabilities, Zstandard offered a substantial improvement that justified its presence in datacenters.</p>
<p>However, while Zstandard has improved over time, remaining within its framework offers diminishing returns. So we started looking for the next great leap in data compression.</p>
<p>In this quest, one pattern kept repeating: Using generic methods on structured data leaves compression gains on the table. Data isn’t just byte soup. It can be columnar, encode enums, be restricted to specific ranges, or carry highly repetitive fields. More importantly, it has predictable shapes. A bespoke compressor that leans into that structure can beat general-purpose tools on both ratio and speed. But there’s a catch — every bespoke scheme means another compressor and decompressor to create, ship, audit, patch, and trust.</p>
<p>OpenZL is our answer to the tension between the performance of format-specific compressors and the maintenance simplicity of a single executable binary.</p>
<h2>Make the Structure Explicit</h2>
<p>General compressors rely on a one-size-fits-all processing strategy, or alternatively spend many of their cycles guessing which techniques to use. OpenZL saves those cycles by making the structure an explicit input parameter. Compression can then focus on a sequence of reversible steps that surface patterns before coding.</p>
<p>As a user, you provide OpenZL with the data shape (via a preset or a thin format description). Then the trainer, an offline optimization component, builds an effective compression config that can be re-employed for similar data. During encoding, that config resolves into a concrete decode recipe that’s embedded into the frame. The universal decoder directly executes that recipe, without any out-of-band information.</p>
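<p>The embedded-recipe idea can be illustrated with a toy conceptual sketch in Python. This is not OpenZL’s actual API or frame format — the transform names and header layout below are invented for illustration. The compressor applies the config’s transforms and writes the recipe into the frame, and a single generic decoder replays that recipe in reverse with no out-of-band information:</p>

```python
import json
import zlib

# Invertible transform pairs that the "universal decoder" knows how to undo.
# The transform set and frame layout here are illustrative only.
def delta_encode(vals):
    return [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]

def delta_decode(vals):
    out = [vals[0]]
    for d in vals[1:]:
        out.append(out[-1] + d)
    return out

TRANSFORMS = {"delta": (delta_encode, delta_decode)}

def compress(values, recipe):
    # Apply the config's transforms, then embed the recipe in the frame header.
    for name in recipe:
        values = TRANSFORMS[name][0](values)
    payload = zlib.compress(json.dumps(values).encode())
    header = json.dumps(recipe).encode()
    return len(header).to_bytes(4, "big") + header + payload

def decompress(frame):
    # One decoder for every config: read the recipe, undo each step in reverse.
    hlen = int.from_bytes(frame[:4], "big")
    recipe = json.loads(frame[4:4 + hlen])
    values = json.loads(zlib.decompress(frame[4 + hlen:]))
    for name in reversed(recipe):
        values = TRANSFORMS[name][1](values)
    return values

data = list(range(1000, 2000))           # mostly-sorted data: delta helps a lot
frame = compress(data, recipe=["delta"])
assert decompress(frame) == data         # round-trips with no side channel
```

<p>Because every frame carries its own recipe, a new compression config can ship without shipping a new decoder — which is the property the universal decoder section below relies on.</p>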
<h2>An Example Compression Using OpenZL</h2>
<p>As an example, let’s compress sao, which is part of the <a href="https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia" target="_blank" rel="noopener">Silesia Compression Corpus</a>. This file follows a <a href="http://tdc-www.harvard.edu/software/catalogs/catalogsb.html" target="_blank" rel="noopener">well-defined format</a> featuring an array of records, each one describing a star. Providing this information to OpenZL is enough to give it an edge over generic lossless compressors, which only see bytes.</p>
<p>Comparison on an M1 CPU, using clang-17:</p>
<table border="1"><tbody><tr><td>Compressor</td>
<td>zstd -3</td>
<td>xz -9</td>
<td>OpenZL</td>
</tr><tr><td>Compressed Size</td>
<td>5,531,935 B​</td>
<td>4,414,351​ B</td>
<td>3,516,649​ B</td>
</tr><tr><td>Compression Ratio</td>
<td>x1.31</td>
<td>x1.64</td>
<td>x2.06</td>
</tr><tr><td>Compression Speed</td>
<td>220 MB/s</td>
<td>3.5 MB/s</td>
<td>340 MB/s</td>
</tr><tr><td>Decompression Speed</td>
<td>850 MB/s</td>
<td>45 MB/s</td>
<td>1200 MB/s</td>
</tr></tbody></table><p>Crucially, OpenZL produces a higher compression ratio <em>while preserving or even improving speed</em>, which is critical for data center processing pipelines.</p>
<p>For illustration, this result is achieved using the following simple graph:</p>
<p><img class="alignnone wp-image-23054" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png" alt="" width="700" height="498" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png?resize=916,651 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png?resize=768,546 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png?resize=1024,728 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png?resize=1536,1092 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png?resize=96,68 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-graph-1.png?resize=192,136 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h3><em>A Brief Explanation</em></h3>
<p>So what is happening in this example?</p>
<p>We start by separating the header from the rest, a large table of structures. Then each field gets extracted into its own stream: the array of structures becomes a structure of arrays. After that point, we expect that each stream contains homogeneous data of the same type and semantic meaning. We can now focus on finding an optimal compression strategy for each one.</p>
<ul><li class="c1" aria-level="1">SRA0 is a position on the X axis. Due to the way the table is generated, the index is <em>mostly</em> sorted, inviting the use of delta to reduce the range of values represented. This mechanically makes the resulting stream easier to compress. </li>
<li class="c1" aria-level="1">SDEC0 is a position on the Y axis. It’s not as well sorted as the X axis, but we can at least exploit the fact that it’s bounded between a minimum and a maximum. This makes the higher bytes more predictable, which can be exploited for better compression with the transpose operation.</li>
<li class="c1" aria-level="1">The other fields (IS, MAG, XRPM, XDPM) share a common property: their cardinality is much lower than their quantities, and there is no relation between 2 consecutive values. This makes them a good target for tokenize, which will convert the stream into a dictionary and an index list.</li>
<li class="c1" aria-level="1">The resulting dictionaries and index lists are very different. They benefit from completely different compression strategies. So they are sent to dedicated processing graphs.</li>
</ul><p>The graph continues beyond these steps, but at some point we can stop making decisions. The main work is to group data into homogeneous streams; after that, one can count on OpenZL to take care of the rest. </p>
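<p>To make the three transforms above concrete, here is a minimal Python sketch of delta, byte transpose, and tokenize. These are illustrative toy implementations, not OpenZL’s internals:</p>

```python
import struct

def delta(vals):
    # Mostly-sorted values -> small residuals (the SRA0-style case).
    return [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]

def transpose(vals, width=4):
    # Regroup the i-th byte of every 32-bit value together; when values are
    # bounded, the high-byte lanes become near-constant (the SDEC0-style case).
    raw = b"".join(struct.pack("<I", v) for v in vals)
    return b"".join(raw[i::width] for i in range(width))

def tokenize(vals):
    # Low-cardinality streams become a small dictionary plus an index list
    # (the IS/MAG/XRPM/XDPM-style case).
    alphabet = sorted(set(vals))
    index = {v: i for i, v in enumerate(alphabet)}
    return alphabet, [index[v] for v in vals]

# Mostly sorted: delta collapses the stream to tiny residuals.
assert delta([1000, 1001, 1003, 1004, 1007]) == [1000, 1, 2, 1, 3]

# Bounded range: after transpose, the two high-byte lanes are all zeros.
assert transpose([300, 280, 310, 290])[-8:] == bytes(8)

# Low cardinality: a 3-entry dictionary plus indices into it.
dictionary, indices = tokenize([5, 9, 5, 5, 7, 9])
assert dictionary == [5, 7, 9] and indices == [0, 2, 0, 0, 1, 2]
```

<p>Each transform is trivially reversible, which is what lets them be chained into a graph whose inverse the universal decoder can execute mechanically.</p>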
<p>To go even further, we would like to generate compression strategies that are specifically fine-tuned for each stream. This is where the <strong>offline trainer stage</strong> comes into play. </p>
<h2>Generate a Compressor Automatically</h2>
<p>It’s possible to take full control of the compression process, but it’s also not required. A faster strategy is to just describe your data and let the system learn a <strong>compression config.</strong></p>
<p><strong>Describe the input:</strong> With the <a href="https://facebook.github.io/openzl/api/c/graphs/sddl/" target="_blank" rel="noopener">Simple Data Description Language (SDDL)</a>, you sketch how the bytes map to fields — rows, columns, enums, nested records. SDDL is for parsing only; it just tells OpenZL the shape of your data. Alternatively, you can write your own parser function directly using one of the supported languages, and register it with OpenZL to delegate the logic.</p>
<p><strong>Learn the config:</strong> Starting from a preset, a parser function or an SDDL description, the <strong>trainer</strong> runs a budgeted search over transform choices and parameters to produce a <strong>Plan</strong>. It can provide a full set of speed/ratio tradeoffs, or directly target the best configuration respecting some speed constraints. Internally it uses a cluster finder (to group fields that behave alike) and a graph explorer (to try candidate subgraphs and keep score).</p>
<p><strong>Resolve at encode-time:</strong> While compressing, the encoder turns the Plan into a concrete recipe — the <strong>Resolved Graph</strong>. If the Plan has control points, it picks the branch that fits the data and records that choice into the frame.</p>
<p><strong>Decode without coordination:</strong> Each frame chunk carries its own resolved graph. The single decoder checks it, enforces limits, and runs the steps in order. When a plan improves, you simply roll out the new plan; no new decompressor is needed. Old data keeps decoding; new data gets the improved gains. </p>
<p><em>In practice the loop is straightforward: describe (SDDL) → train (produce a plan) → compress (emit frames with resolved graphs) → decode anywhere with the same binary.</em></p>
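<p>The trainer’s budgeted search can be sketched conceptually: score each candidate plan by compressed size on sample data and keep the winner. The two-entry candidate set and the scoring function below are hypothetical stand-ins for OpenZL’s far richer search space:</p>

```python
import json
import zlib

# An illustrative candidate set; the real trainer explores transform graphs.
def store(vals):
    return vals

def delta(vals):
    return [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]

CANDIDATES = {"store": store, "delta": delta}

def compressed_size(vals):
    # Proxy cost function: final compressed size after a generic backend.
    return len(zlib.compress(json.dumps(vals).encode()))

def train(sample, budget=8):
    """Budgeted search: score each candidate plan on sample data and keep
    the cheapest; the winning plan name becomes the reusable config."""
    scored = [(compressed_size(fn(sample)), name)
              for name, fn in list(CANDIDATES.items())[:budget]]
    return min(scored)[1]

plan = train(list(range(500)))  # a sorted sample: delta should win
assert plan == "delta"
```

<p>Because training is offline, the search budget can be generous without affecting encode or decode speed; the output is just a config to roll out.</p>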
<h2 class="c2">Embracing Changes: Re-Training and In-Flight Control</h2>
<p>In the real world, data evolves constantly, in both structure and content. A compressor built for one version of a schema would have a short lifetime. </p>
<p>Thankfully, with the flexibility offered by compression plans, we can react swiftly to data changes. At Meta, this is the core mission of <strong>Managed Compression</strong>, originally created to automate dictionary compression with Zstandard, and presented in an earlier blog post <a href="https://engineering.fb.com/2018/12/19/core-infra/zstandard/" target="_blank" rel="noopener">on how we improved compression at Meta with Zstandard</a>. </p>
<p>OpenZL offers a training process that updates compression plans to maintain or improve compression performance, based on provided data samples. Now the synergy with Managed Compression is apparent: Each registered use case is monitored, sampled, periodically re-trained, and receives new configs when they prove beneficial. The decompression side continues to decode both old and new data without any change.</p>
<p><strong>Runtime Adaptation:</strong> A compression config can include <strong>control points</strong> that read lightweight statistics at compression time (e.g., string repetition stats, run-length, histogram skew, delta variance) and choose the best branch of the Plan to go to next. Many technologies can be used, and textbook classifiers qualify. Control points handle bursts, outliers, and seasonal shifts without brute-force exploration: exploration is bounded, in order to maintain speed expectations. Taken branches are then recorded into the frame, and the decoder just executes the recorded path.</p>
<p>This gives the best of both worlds: dynamic behavior at compression time to handle variations and exceptions — without turning compression into an unbounded search problem — and with zero complexity added to the decoder.</p>
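<p>A control point can be sketched as a cheap statistic plus a branch choice that gets recorded per chunk. The branch names and the run-length statistic here are illustrative, not OpenZL’s actual control-point machinery:</p>

```python
def run_fraction(vals):
    # Lightweight statistic: fraction of positions repeating the previous value.
    repeats = sum(1 for a, b in zip(vals, vals[1:]) if a == b)
    return repeats / max(len(vals) - 1, 1)

def control_point(chunk):
    """Pick a branch of the plan from cheap statistics at compression time;
    the choice is recorded in the frame so the decoder just replays it."""
    return "run_length" if run_fraction(chunk) > 0.5 else "delta"

frames = []
for chunk in ([7] * 100, list(range(100))):
    # The taken branch is stored alongside the payload -- no decode-time guessing.
    frames.append({"branch": control_point(chunk), "payload": chunk})

# Bursty, repetitive chunks route to run-length; smooth ramps route to delta.
assert [f["branch"] for f in frames] == ["run_length", "delta"]
```

<p>Exploration stays bounded (one statistic, one comparison per chunk here), so runtime adaptation never turns compression into an open-ended search.</p>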
<h2>The Advantages of the Universal Decoder</h2>
<p>OpenZL is capable of compressing a vast array of data formats, and they can all be decompressed with a single decompressor binary. Even when the compression configuration changes, the decoder does not. This may sound like operational minutiae, but it’s critical to OpenZL’s deployment success.</p>
<ul><li class="c1" aria-level="1"><strong>One audited surface:</strong> Security and correctness reviews focus on a single binary with consistent invariants, fuzzing, and hardening; there’s no myriad of per-format tools that can drift apart.</li>
<li class="c1" aria-level="1"><strong>Fleet-wide improvements:</strong> A decoder update (security or performance — SIMD kernels, memory bounds, scheduling) benefits every compressed file, even those that predate the change.</li>
<li class="c1" aria-level="1"><strong>Operational clarity:</strong> Same binary, same CLI, same metrics and dashboards across datasets; patching and rollout are uneventful by design.</li>
<li class="c1" aria-level="1"><strong>Continuous training:</strong> With one decoder and many compression plans, we can keep improving while the system is live. Train a plan offline, try it on a small slice, then roll it out like any other config change. Backward compatibility is built-in — old frames still decode while new frames get better.</li>
</ul><p>In other words, it’s possible to afford domain-specific compression without fragmenting the ecosystem.</p>
<h2>Results With OpenZL</h2>
<p>When OpenZL is able to understand and parse the file format, it is able to offer large improvements in compression ratio, while still providing fast compression and decompression speed. However, this is no magic bullet. When OpenZL doesn’t understand the input file format, it simply falls back to zstd.</p>
<p>OpenZL, through its offline training capabilities, is also able to offer a wide range of configurations in the tradeoff space of compression ratio, compression speed, and decompression speed. Unlike traditional compressors, which offer configuration by setting a compression level, OpenZL offers configuration by serializing the compressor graph. This allows an immense amount of flexibility to select diverse tradeoffs.</p>
<p>These results are based on datasets we’ve developed for our whitepaper. The datasets were chosen because they are highly structured and in a format that OpenZL supports. Every figure below is produced with <a href="https://github.com/facebook/openzl/blob/dev/contrib/reproducibility/figures/script.sh" target="_blank" rel="noopener">scripts in the OpenZL repository</a> so they can be reproduced, and the input data and logs from our runs have been uploaded <a href="https://github.com/facebook/openzl/releases/tag/openzl-sample-artifacts" target="_blank" rel="noopener">to GitHub</a>.</p>
<p>Note that data points connected by a line are Pareto-optimal: for each such point, no other point in the same dataset beats it on both metrics.</p>
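<p>To make that definition concrete, here is a small sketch (illustrative, not part of OpenZL) that computes the Pareto-optimal subset of (speed, ratio) points, where higher is better on both axes:</p>

```python
def pareto_frontier(points):
    """Return the Pareto-optimal subset of (speed, ratio) points.

    A point is kept iff no other point beats it on BOTH metrics
    (strictly higher speed and strictly higher ratio).
    """
    def dominated(p):
        return any(q[0] > p[0] and q[1] > p[1] for q in points)
    return [p for p in points if not dominated(p)]
```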
<figure id="attachment_23055" aria-describedby="caption-attachment-23055" class="wp-caption alignnone c3"><img class="wp-image-23055" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png" alt="" width="700" height="334" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png?resize=916,437 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png?resize=768,367 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png?resize=1024,489 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png?resize=1536,733 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png?resize=96,46 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-SAO-dataset-speeds-vs-ratio.png?resize=192,92 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23055" class="wp-caption-text"><strong>Figure 1 — SAO:</strong> These figures show compression speed and decompression speed vs. ratio for SAO comparing OpenZL with three general compression tools. 
As shown in the example, OpenZL destructures the star records into columns for each field, and then the trainer learns how to best compress each field to produce a set of OpenZL configurations offering a wide range of tradeoffs.</figcaption></figure><figure id="attachment_23056" aria-describedby="caption-attachment-23056" class="wp-caption alignnone c3"><img class="wp-image-23056" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png" alt="" width="700" height="324" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png?resize=916,424 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png?resize=768,355 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png?resize=1024,474 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png?resize=1536,711 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png?resize=96,44 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-ERA5-dataset.png?resize=192,89 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23056" class="wp-caption-text"><strong>Figure 2 — Columnar numeric data:</strong> These figures show compression speed and decompression speed vs. ratio for the ERA5 Flux dataset for OpenZL and three general compression tools. The data is presented to the compressor as a single array of 64-bit numeric data. For a given time budget, OpenZL achieves substantially higher compression ratios. 
Likewise, for a given compression ratio, OpenZL can complete the job with greater speed.</figcaption></figure><figure id="attachment_23057" aria-describedby="caption-attachment-23057" class="wp-caption alignnone c3"><img class="wp-image-23057" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png" alt="" width="700" height="318" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png?resize=916,417 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png?resize=768,349 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png?resize=1024,466 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png?resize=1536,698 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png?resize=96,44 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-Binance-dataset.png?resize=192,87 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23057" class="wp-caption-text"><strong>Figure 3 — Parquet:</strong> These two figures show compression speed vs. ratio for the Binance and TLC Green Trip dataset for OpenZL and three general compression tools, presented as uncompressed Parquet files. 
OpenZL parses the Parquet format and learns the schema in order to tune compression to each file.</figcaption></figure><figure id="attachment_23058" aria-describedby="caption-attachment-23058" class="wp-caption alignnone c3"><img class="wp-image-23058" src="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png" alt="" width="700" height="314" srcset="https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png?resize=916,411 916w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png?resize=768,345 768w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png?resize=1024,460 1024w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png?resize=1536,690 1536w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png?resize=96,43 96w, https://engineering.fb.com/wp-content/uploads/2025/10/Meta-OpenZL-PPMF-unit-dataset.png?resize=192,86 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-23058" class="wp-caption-text"><strong>Figure 4 — CSV:</strong> This figure shows the compression speed vs. ratio tradeoff for the PPMF Unit dataset for OpenZL and three general compression tools, presented as CSV files. OpenZL is able to offer excellent compression ratios, but the cost of parsing CSV caps the compression speed at about 64 MB/s. An improved parser will speed that up; however, this strategy will likely never approach zstd’s speeds of 1 GB/s. Nonetheless (and not pictured here), OpenZL always has the option to fall back to the zstd codec, so its performance is lower-bounded by zstd’s.</figcaption></figure><h3><em>When It’s Not Useful</em></h3>
<p>OpenZL relies on a description of some structure to leverage its set of transforms. When there is no structure, there is no advantage. This is typically the case in pure text documents, such as enwik or dickens. In these cases, OpenZL falls back to zstd, offering essentially the same level of performance.</p>
<h2>Getting Started With OpenZL</h2>
<p>OpenZL’s selection of codecs is well-suited to compressing vector, tabular, or tree-structured data, and can be expected to perform well with numeric, string, or binary data. Common examples include timeseries datasets, ML tensors, and database tables. Keep in mind that we are bound by the limits of information theory, so the input needs to have some order that can be uncovered. As time goes on, we plan to incorporate additional codecs, as described in the next section.</p>
<p>If your data fits one of the above categories, then give it a try! Visit the <a href="https://openzl.org/" target="_blank" rel="noopener">OpenZL site</a> and our <a href="https://facebook.github.io/openzl/getting-started/quick-start/" target="_blank" rel="noopener">Quick Start guide</a> to get started.</p>
<p>If you want to dive into the code, check out the <a href="https://github.com/facebook/openzl" target="_blank" rel="noopener">GitHub repository</a> for source, documentation, and examples. We welcome contributions and feedback from the community!</p>
<h2>Where We’re Going</h2>
<p>OpenZL’s general direction is set: make it easier to expose structure, and exploit it with automated compression plans for evolving data.</p>
<p><strong>Next up</strong>: We’re extending the transform library for time-series and grid-shaped data, improving the performance of codecs, and enabling the trainer to find better compression plans faster. We’re also actively working to extend SDDL to describe nested data formats more flexibly. Finally, the automated compressor explorer is getting better at proposing safe, testable changes to a compression plan within a specified budget.</p>
<p><strong>Where the community can help:</strong> If you have a format or a dataset with obvious structure, try compressing it with an OpenZL prebuilt Plan. If it’s promising, try generating a new plan with the trainer, or customize one by following our documentation. If it’s a format that the public might want, send it to us in a PR.</p>
<p>You can also contribute to the OpenZL core. If you have a knack for optimizing C/C++, help us speed up the engine or add transforms to cover new data formats. If your superpower is reliability, the project would surely benefit from more validation rules and resource caps. And if you care about benchmarks, add your dataset to the harness so others can reproduce your results.</p>
<p><strong>How to engage:</strong> Open an issue on the GitHub issue board. If you have a use case where you would expect OpenZL to do better, provide a few small samples so that we can analyze them together. You may also contribute codec optimizations, or propose new graphs, parsers, or control points. None of these topics impacts the universality of the decoder.</p>
<p>We believe OpenZL opens up a new universe of possibilities to the data compression field, and we’re excited to see what the open source community will do with it!</p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">website</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, <a href="https://bsky.app/profile/metaopensource.bsky.social" target="_blank" rel="noopener">Bluesky</a> and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/</link>
      <guid>https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/</guid>
      <pubDate>Mon, 06 Oct 2025 16:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing the Candle Subsea Cable, Updates to Our Asia-Pacific Connectivity Projects]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re introducing Candle, a new submarine cable connecting countries across East Asia and Southeast Asia.</li>
<li class="c1" aria-level="1">We’re also announcing several updates to our subsea cables across the Asia-Pacific, including the completion of the Bifrost cable system.</li>
</ul><p>The Asia-Pacific (APAC) region is home to over <a href="https://www.statista.com/topics/9080/internet-usage-in-the-asia-pacific-region/#:~:text=While%20boasting%20more%20than%203.3,excluded%20from%20the%20digital%20world." target="_blank" rel="noopener">58% of the world’s internet users</a><sup>1</sup> – many of whom rely on robust global infrastructure for online connectivity and access to innovative tech such as AI.</p>
<p>At Meta, we imagine a future where everyone has access to AI, <a href="https://www.meta.com/superintelligence/?srsltid=AfmBOop7E12CAYa5k3P2DFoE7spO0qv77Xkk6-A0oZvJsivrcWJgy6a3" target="_blank" rel="noopener">personal superintelligence</a>, and other emerging technologies to improve their lives and connect with each other. As such, we continue to build world-class network infrastructure with enough capacity and resilience to enable rich online experiences for people all over the world. Earlier this year, for example, we announced <a href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/" target="_blank" rel="noopener">Project Waterworth</a>, our most ambitious subsea cable project yet, which will land in five continents, including Asia, by the end of the decade.  </p>
<p>Today, we’re sharing updates on four of our other subsea cable investments in APAC with onward connections to the rest of the world. Once complete, these cables will help deliver Meta’s products, services, AI, and new levels of connectivity to billions of people in the region.</p>
<h2>Introducing Candle, APAC’s largest capacity subsea cable system </h2>
<p>Candle will be the largest capacity cable in APAC, bringing increased connectivity to Japan, Taiwan, the Philippines, Indonesia, Malaysia, and Singapore, in 2028. Spanning 8,000 kilometers, Candle will connect over <a href="https://data.worldbank.org/indicator/SP.POP.TOTL" target="_blank" rel="noopener">580 million people</a> with 570 terabits per second (Tbps) of capacity.</p>
<p>In collaboration with leading telecommunications companies in the region, Candle will leverage recently developed 24 fiber-pair cable technology to deliver bandwidth similar to our largest capacity cable today, <a href="https://about.fb.com/es/news/2024/10/anjana-el-mayor-cable-transatlantico-submarino-del-mundo-aterriza-en-santander-para-conectar-estados-unidos-y-europa/">Anjana</a>.</p>
<p><img class="alignnone size-full wp-image-23047" src="https://engineering.fb.com/wp-content/uploads/2025/10/APAC_Subsea_Cables_Blog_Post_Visual_2025_02.gif" alt="" width="1080" height="608" /></p>
<h2>The Bifrost subsea cable has arrived; updates to Echo and Apricot </h2>
<p>In 2021, Meta and our partners committed to increase transpacific capacity by 70% through two subsea cables, <a href="https://engineering.fb.com/2021/03/28/connectivity/echo-bifrost/" target="_blank" rel="noopener">Bifrost and Echo</a>. </p>
<p>Bifrost now connects Singapore, Indonesia, the Philippines, and the United States, with Mexico expected in 2026. Bifrost charts a different path from prior transpacific cables, adding over 260 Tbps of redundancy to this popular digital route.</p>
<p>Echo now delivers 260 Tbps of capacity between Guam and California, with options for onward connectivity into Asia in the future.</p>
<p>We also announced <a href="https://engineering.fb.com/2021/08/15/connectivity/apricot-subsea-cable/" target="_blank" rel="noopener">Apricot</a>, which is now available between Japan, Taiwan, and Guam. With future extensions to the Philippines, Indonesia and Singapore, this 12,000-kilometer system will complement the Bifrost and Echo systems with 290 Tbps of capacity.</p>
<p>Together, Candle, Echo, Bifrost, and Apricot will provide the Asia-Pacific region with intra-Asia connectivity and transpacific bridges to the Americas. Additionally, our investments in projects such as <a href="https://engineering.fb.com/2021/09/28/connectivity/2africa-pearls/" target="_blank" rel="noopener">2Africa</a> will chart a path to India, the Middle East, and Europe, while Project Waterworth will enable global connectivity.</p>
<p>Our digital infrastructure development in the Asia-Pacific is part of our commitment to bring people together wherever they are in the world. Together with our partners, these investments will enhance the scale and reliability of the global telecommunications network and ensure the delivery of Meta’s services quickly and efficiently for businesses and people across APAC and beyond.</p>
<footer class="blockquote-footer"><cite>[1] According to Statista, Asia-Pacific has <a href="https://www.statista.com/topics/9080/internet-usage-in-the-asia-pacific-region/#topicOverview">~3.3 billion</a> of the <a href="https://www.statista.com/statistics/325706/global-internet-user-penetration/#:~:text=Global%20internet%20user%20penetration%202014%2D2025&amp;text=As%20of%20July%202025%2C%2068.7,worldwide%20were%20around%205.65%20billion">5.65 billion</a> internet users worldwide, as of September 5, 2025.</cite></footer>]]></description>
      <link>https://engineering.fb.com/2025/10/05/connectivity/introducing-the-candle-subsea-cable-updates-to-our-asia-pacific-connectivity-projects/</link>
      <guid>https://engineering.fb.com/2025/10/05/connectivity/introducing-the-candle-subsea-cable-updates-to-our-asia-pacific-connectivity-projects/</guid>
      <pubDate>Mon, 06 Oct 2025 02:30:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Accelerating our Android apps with Baseline Profiles]]></title>
<description><![CDATA[<p>Key Takeaways:</p>
<ul><li>With billions of Android app users, we’re always looking to improve the Meta app experience, and in this post, we explore the ways we’ve leveraged Android’s Baseline Profiles to significantly improve their performance.</li>
<li>We discuss the performance challenges we’ve faced as Meta’s apps and the needs of their users have become more complex over time, and the infrastructure we’ve created to solve them.</li>
<li>We share our insights on creating Baseline Profiles with user data and the tuning we’ve used to make them even more effective. Altogether, Baseline Profiles have improved performance for various critical metrics by up to 40% across Meta’s apps.</li>
</ul><p>Application performance is critical for a good user experience. Slow startups, dropped frames and poor responsiveness are all key drivers of user frustration and, ultimately, attrition.</p>
<p>Performance consciousness during application development, and use of appropriate data structures, algorithms, caching strategies, and so on, are fundamental parts of mitigating these issues. However, it is equally important to understand the underlying representations of compiled application code, and the manner in which it is loaded and executed, such that build tools and runtimes can be configured and tuned optimally.</p>
<p>Over the past few years at Meta, we have developed infrastructure for profile-guided compiler and runtime optimizations targeting our Android applications. A major component of this infrastructure is the Android Runtime’s Baseline Profiles feature, which we have leveraged extensively to significantly improve the performance of our Android applications.</p>
<p>In this post, we’ll describe some performance considerations related to the Android Runtime (ART), explore some related performance challenges we have faced in our apps, and explain how we utilized Baseline Profiles to overcome them.</p>
<h2>ART Performance Considerations</h2>
<p>On Android, the preferred, and thus dominant, languages for user application development are Kotlin and Java. Kotlin/Java code is compiled to Dalvik bytecode (“Dex code”) and packaged into “.dex” files, which are organized into classes and methods reflecting their original sources.</p>
<p>Before any dex code associated with a method can be executed by the Android Runtime, its parent class must be loaded by the runtime. This happens when a class is first accessed during application execution, and involves locating the class’ metadata, registering it with ART, initializing static data, and anything else required to interact with the class.</p>
<p>Once its parent class is loaded, its methods may be executed. Dex code is, of course, not machine code that can be directly executed on hardware, and thus the Android Runtime must perform this translation. By default, at runtime, methods in dex code will simultaneously be executed via interpretation and profiled to determine if they are hot. Once a method is determined to be hot, it is compiled to machine code via ART’s just-in-time compiler, and the compiled version is executed thereafter. (Executing machine code is generally significantly faster than interpretation.)</p>
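<p>The interpret-profile-then-compile lifecycle described above can be sketched as a toy state machine (illustrative only: the threshold and names are ours; real ART uses internal hotness counters and a background JIT compiler):</p>

```python
class ToyMethod:
    """Toy model of ART's tiered execution: interpret + profile, then JIT.

    The hotness threshold and 'compiled' flag are illustrative; they are
    not ART's actual internals.
    """
    HOT_THRESHOLD = 3  # illustrative value only

    def __init__(self, name):
        self.name = name
        self.call_count = 0
        self.compiled = False

    def invoke(self):
        if self.compiled:
            return "machine_code"          # fast path after JIT compilation
        self.call_count += 1               # interpretation + profiling
        if self.call_count >= self.HOT_THRESHOLD:
            self.compiled = True           # method deemed hot: JIT-compile
        return "interpreted"
```

<p>Early invocations pay the interpretation/profiling cost; once the method is deemed hot, later invocations take the compiled fast path, which is the temporary degradation the next paragraph refers to.</p>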
<p>Both class loads and the interpretation/profiling stage of dex method execution have runtime costs, which often result in temporary, but user-perceptible, performance degradation. Furthermore, classes must be re-loaded following every app cold start. <a href="https://developer.android.com/topic/performance/vitals/launch-time#cold">Cold starts</a> happen when the system starts the app for the first time. After a cold start, the app is in memory, and subsequent starts are much faster. (Note that this is somewhat mitigated by “runtime app images” on Android 14+.) ART does have a means of persisting compiled methods across cold starts (this is simplified—they are not strictly “persisted,” and require a background dexopt run between cold starts), but they must be re-profiled and re-compiled following an app version update.</p>
<h2>Meta’s Mobile App Challenges</h2>
<p>Meta’s mobile applications are the primary point of access for most of our users, the majority of whom use Android. Our mobile apps face several challenges in balancing shipping velocity with our performance goals. Startup performance is especially important, as it can have a disproportionate impact on user experience.</p>
<p>Maintaining a minimal set of classes loaded on startup is a key focus for startup performance. As our apps continue to add new features, such as Instagram Reels or Messenger’s End-to-End Encryption, the startup classes set grows as well. Besides user-visible features, critical functionality such as crash reporting, login authentication, and performance logging are all involved in startup. Facebook and Instagram, for example, each load more than 20,000 classes on startup, and several thousand more for feed scrolling.</p>
<p>We also care about improving performance for user journeys after startup. These user journeys measure key parts of the user experience, such as scrolling the user’s feed, or the time it takes to fetch and render a photo. Additionally, these journeys typically specify both the user’s behavior, such as scrolling or navigating, as well as where it’s happening. For example, a user scrolling their feed is considered separately from scrolling their inbox, and navigating to a profile is considered separately from navigating to a feed. We prioritize optimizing different user journeys for each app, and regularly revisit whether new ones should be added, or existing ones removed.</p>
<p>Optimizing user journeys requires understanding exactly what classes get loaded. For this, we collect profiles of class load sequences from many different users, and look for what they have in common. We’ve found that these profiles can look dramatically different across different users, even for the same user journey. Moreover, the exact same user can still have a different profile of class loads on another day, with different code paths taken due to <a href="https://engineering.fb.com/2012/08/08/uncategorized/building-and-testing-at-facebook/">experimentation</a>, and can be different again the next week, as both the code and user behavior changes. Our monorepo sees thousands of commits each day. There is no easy one-size-fits-all solution here.</p>
<p>In total, we have a large, growing, and dynamic set of code to manage. We need a solution that can intelligently adapt to frequent code changes between each release, can quickly generate compiled code and profiles, and can benefit both startup and other user journeys.</p>
<h2>ART Install-Time Optimizations</h2>
<p>Since Android 9, ART has offered the following install-time optimizations:</p>
<ul><li>AOT (“Ahead of Time”) compilation of specified methods</li>
<li>Creation of an “app image” with specified classes</li>
</ul><p>AOT compilation means that specified methods will be compiled to machine code by ART before the app runs for the first time. This eliminates the overhead involved in interpreting and profiling the method’s initial execution.</p>
<p>An <a href="https://www.youtube.com/watch?v=fwMM6g7wpQ8&amp;t=2145s">app image</a> is a file containing a partial representation of the in-memory ART data structures, which would be created or populated by class loads for specified classes. When an app is started with an app image, it is mapped into the process’ heap, and any necessary fixups are applied. The end result is that many classes may be effectively loaded extremely quickly at startup, and any later runtime cost associated with loading these classes is eliminated.</p>
<p>These optimizations can be triggered by supplying a special profile to ART at app install time. There are two main mechanisms for this: Cloud Profiles and Baseline Profiles.</p>
<h3>Cloud Profiles</h3>
<p>Cloud Profiles are aggregations of profiling data from many different users collected by Google Play during the initial rollout of an app version. After the Cloud Profile has been created, all subsequent users installing that app version via Google Play will receive that Cloud Profile, which will be used by ART for AOT compilation and app image creation.</p>
<p>Cloud Profiles have <a href="https://developer.android.com/topic/performance/baselineprofiles/overview#cloud-profiles">several downsides</a>, however:</p>
<ul><li>Earlier users in the rollout do not benefit at all from Cloud Profiles, as they’re the ones providing the profiling data.</li>
<li>App developers do not have any way to observe or control the classes and methods in the profile.</li>
<li>They are generated in a way that is strongly skewed towards early startup improvement.</li>
<li>They are only available via Google Play—applications installed through other means such as different app stores or sideloading can’t use them.</li>
</ul><h3>Baseline Profiles</h3>
<p><a href="https://developer.android.com/topic/performance/baselineprofiles/overview">Baseline Profiles</a> are similar to Cloud Profiles, as they also trigger ART install-time optimizations, with a few key differences. Whereas Cloud Profiles are generated by Google Play through collecting and aggregating data from early users of an app version, Baseline Profiles are generated and provided by application developers. Developers can simply package their Baseline Profile inside their corresponding APK or AAB. When both Cloud Profiles and Baseline Profiles are available, they can be used in <a href="https://developer.android.com/topic/performance/baselineprofiles/overview#compilation-behaviors">tandem</a>.</p>
<p><img class="alignnone size-full wp-image-22895" src="https://engineering.fb.com/wp-content/uploads/2025/09/image1.png" alt="" width="1600" height="561" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/image1.png 1600w, https://engineering.fb.com/wp-content/uploads/2025/09/image1.png?resize=916,321 916w, https://engineering.fb.com/wp-content/uploads/2025/09/image1.png?resize=768,269 768w, https://engineering.fb.com/wp-content/uploads/2025/09/image1.png?resize=1024,359 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/image1.png?resize=1536,539 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/image1.png?resize=96,34 96w, https://engineering.fb.com/wp-content/uploads/2025/09/image1.png?resize=192,67 192w" sizes="(max-width: 992px) 100vw, 62vw" /><em>Diagram showing the flow for Baseline and Cloud Profiles in Google Play. “Improving App Performance with Baseline Profiles” by Kateryna Semenova, Rahul Ravikumar, and Chris Craik, 28 Jan. 2022. <a href="https://android-developers.googleblog.com/2022/01/improving-app-performance-with-baseline.html">Android Developers Blog.</a></em></p>
<p>Baseline Profiles give full control of install time optimizations to application developers, and are available to users immediately. This allows developers to control install-time optimizations in a way which is much more tuned to the specific needs of their app than Cloud Profiles, including the ability to optimize for scenarios beyond startup.</p>
<p>Google offers some mechanisms for generating Baseline Profiles from benchmarks (e.g., Macrobenchmark). However, they can also be generated by directly specifying classes and methods in a <a href="https://developer.android.com/topic/performance/baselineprofiles/manually-create-measure#rule_syntax">well-specified format</a> to a tool called <a href="https://android.googlesource.com/platform/tools/base/+/refs/heads/mirror-goog-studio-master-dev/profgen/">profgen</a>, which offers more flexibility.</p>
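<p>For reference, rules in the human-readable profile format consumed by profgen look like the following (the class and method names here are illustrative). A class rule is just a class descriptor, which marks the class for inclusion; a method rule prefixes a standard dex method descriptor with flags — H for hot, S for used at startup, P for used post-startup:</p>

```
Lcom/example/app/HomeActivity;
HSPLcom/example/app/HomeActivity;->onCreate(Landroid/os/Bundle;)V
```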
<p>Next, we will look at how Baseline Profiles have been a very beneficial technology for the performance of Meta’s apps, solving many of the challenges with ART.</p>
<h2>How We’ve Created Baseline Profiles at Meta</h2>
<p>Earlier, we described challenges we have faced with our Android applications’ performance. In particular, we mentioned how our apps’ startups can load tens of thousands of classes on each cold start, and how our weekly shipping wipes all compiled code on every update.</p>
<p>We have long been aware of and focused on these challenges, particularly those related to cold start. In the past, we have seen major performance gains by ordering classes within the underlying dex file according to their typical load position during startup, due to improved locality of reference. We call this “Interdex Ordering,” which is done via InterdexPass in <a href="https://engineering.fb.com/2016/04/12/android/open-sourcing-redex-making-android-apps-smaller-and-faster/">Redex</a>, our bytecode optimizer. (Google’s analog of this in R8 is called “startup profiles.”) ART’s install-time optimizations complement and improve upon this optimization by entirely eliminating the loading cost for some of these classes, and ensuring that their hot methods are compiled before the first run of the app version.</p>
<p>Previously, we mentioned how developers do not directly control the Cloud Profile’s contents. This particularly impacted Meta, as once a startup exceeds five seconds, the Android Runtime automatically considers the startup to be complete. This caused the Cloud Profile to insufficiently mark which classes were necessary for startup. While Cloud Profiles have undoubtedly helped here, the control and flexibility of Baseline Profiles have allowed us to fully realize the potential of these optimizations and measure large performance wins.</p>
<p>To create our Baseline Profiles, we use data from a variety of sources, which we process and aggregate together, based on configurations that are subject to continuous experimentation and tuning.</p>
<h3>Collecting Profile Data</h3>
<p>In our initial Baseline Profiles experiments, we simply used the static profiles for the AndroidX libraries that are shipped alongside them. Today, we have a sophisticated set of collection technologies we use together to produce profiles for our apps.</p>
<p>Benchmarks are one approach to collecting profile data. At Meta, we leverage some local benchmarks in Baseline Profile creation for some of our apps, using internal tooling we have written to collect class and method usage information. However, for apps like Facebook and Instagram, benchmarks are not sufficiently representative of production behavior. For more complex apps like these, we additionally collect class and method usage data from users to obtain a more complete picture.</p>
<p>To collect class usage data from users, we make use of a custom <a href="https://developer.android.com/reference/java/lang/ClassLoader">ClassLoader</a>, in which we insert code that logs which classes are being loaded, which is then periodically uploaded. As this collection has a performance cost, it is only conditionally enabled with a very low sample rate. The collected class load logs are then aggregated together to derive appearance frequencies, and classes exceeding a certain frequency threshold are included in the Baseline Profile for the next release.</p>
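<p>Conceptually, the logging hook can be as simple as a ClassLoader subclass that records each requested class before delegating. Below is a minimal sketch with hypothetical names; Meta’s real collector is not public and additionally samples, batches, and uploads its logs:</p>

```kotlin
// Minimal sketch of a class-load logger (hypothetical; not Meta's implementation).
class LoggingClassLoader(parent: ClassLoader) : ClassLoader(parent) {
    val loaded = mutableListOf<String>() // would be sampled and uploaded in production

    override fun loadClass(name: String, resolve: Boolean): Class<*> {
        loaded.add(name) // record every load request before delegating to the parent
        return super.loadClass(name, resolve)
    }
}
```

<p>Because every class load passes through this hook, the bookkeeping has a real cost, which is why such collection is only conditionally enabled at a low sample rate.</p>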
<p>There is no hook that allows us to log method usage as easily. However, we build specialized telemetry into our apps that allows us to granularly identify clusters of methods that users typically call. We then sample and aggregate this data similarly to class data.</p>
<p>All of this data is then combined into a “Human Readable Profile” and fed to profgen, which generates the final Baseline Profile. Below is an example Human Readable Profile:</p>
<p><img class="alignnone size-full wp-image-23013" src="https://engineering.fb.com/wp-content/uploads/2025/09/image2_81e361.png" alt="" width="1400" height="322" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/image2_81e361.png 1400w, https://engineering.fb.com/wp-content/uploads/2025/09/image2_81e361.png?resize=916,211 916w, https://engineering.fb.com/wp-content/uploads/2025/09/image2_81e361.png?resize=768,177 768w, https://engineering.fb.com/wp-content/uploads/2025/09/image2_81e361.png?resize=1024,236 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/image2_81e361.png?resize=96,22 96w, https://engineering.fb.com/wp-content/uploads/2025/09/image2_81e361.png?resize=192,44 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Breaking it down, we can see:</p>
<ul><li class="c1" aria-level="1">Lines beginning with “#” are comments.</li>
<li class="c1" aria-level="1">Classes can be directly specified by their descriptor.</li>
<li class="c1" aria-level="1">Methods can be directly specified with optional flags.</li>
<li class="c1" aria-level="1">Wildcards can be used to match all classes or methods matching a given prefix.</li>
</ul><h3>Tuning and Experimentation</h3>
<p>Cold start was the first scenario we wanted to optimize with Baseline Profiles. We started conservatively, with high frequency thresholds for including classes and methods in the profiles, requiring a class or method to appear in more than 80% to 90% of all collected user traces. Our concern was that shipping a Baseline Profile that was too large could actually negatively impact performance. Compiled machine code is generally 10 times larger than its original interpreted code. This size difference incurs an increased I/O cost, with more page faults or cache misses.</p>
<p>Over time, we have experimented with different inclusion thresholds, and have expanded beyond cold start to other user interactions. At present, we include classes and methods which appear in &gt;= 20% of cold start user traces for most apps. Interactions we have optimized with Baseline Profiles include newsfeed scrolling in Facebook and Instagram, navigation from thread lists to thread views in Messenger and Instagram’s direct messages inbox, and general latency when navigating between app surfaces.</p>
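<p>The frequency-threshold selection described above amounts to a simple aggregation. Here is a sketch (with a hypothetical function name, not Meta’s pipeline) under the assumption that each sampled trace is a set of profile entries:</p>

```kotlin
// Keep entries that appear in at least `threshold` (a fraction, e.g., 0.2)
// of all sampled traces. Hypothetical sketch, not Meta's actual pipeline.
fun selectProfileEntries(traces: List<Set<String>>, threshold: Double): Set<String> {
    val counts = mutableMapOf<String, Int>()
    for (trace in traces) {
        for (entry in trace) counts.merge(entry, 1, Int::plus)
    }
    val minCount = threshold * traces.size
    return counts.filterValues { it >= minCount }.keys
}
```

<p>Raising or lowering <code>threshold</code> is the tuning knob discussed here: a lower value grows the profile and compiles more code, at the risk of the size and memory costs described above.</p>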
<p>We have occasionally observed startup and other regressions when running experiments that  increase the baseline profile size, typically with indications that memory pressure has increased.  However, with targeted and carefully measured additions, we have managed to grow our profiles quite a bit larger than we expected to be possible. At present, we have several tens of thousands of entries in the Baseline Profiles for all of our apps.</p>
<h2>The Impact of Baseline Profiles at Meta</h2>
<p>Over the past few years, we have implemented Baseline Profiles across all of our major Android apps, and observed consistently positive results from doing so.  As we have integrated and improved upon our Baseline Profiles over time, we have measured high-percentage improvements to app start, scroll performance, navigation latency between surfaces, and several other critical performance metrics, ranging from 3% all the way up to 40%.</p>
<p>Baseline Profiles have provided a powerful lever for our teams to meaningfully improve our users’ experience year-on-year. Our continual investment and experimentation with Baseline Profiles have proven to be well worth it. For all Android developers, whether you already use Baseline Profiles or have yet to start, we encourage you to take some of our lessons here and apply them for yourself.</p>]]></description>
      <link>https://engineering.fb.com/2025/10/01/android/accelerating-our-android-apps-with-baseline-profiles/</link>
      <guid>https://engineering.fb.com/2025/10/01/android/accelerating-our-android-apps-with-baseline-profiles/</guid>
      <pubDate>Wed, 01 Oct 2025 16:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[LLMs Are the Key to Mutation Testing and Better Compliance]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Following our keynote presentations at FSE 2025 and Eurostar 2025, we’re delving further into the development of Meta’s <a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">Automated Compliance Hardening (ACH) tool</a>, an LLM-based tool for software testing that is automating aspects of compliance adherence at Meta, while accelerating developer and product velocity.</li>
<li class="c1" aria-level="1">By leveraging LLMs we’ve been able to overcome the barriers that have prevented mutation testing from being efficiently deployed at scale. This allows us to greatly simplify risk assessments, reduce cognitive load for developers, and, ultimately, create a safer online ecosystem by enabling continuous compliance.</li>
<li class="c1" aria-level="1">We’re also inviting the community to join us in exploring new challenges and opportunities for leveraging LLMs in software testing through efforts like our <a href="https://arxiv.org/pdf/2504.16472" target="_blank" rel="noopener">Catching Just-in-Time Test (JiTTest) Challenge</a>.</li>
</ul><p>Today, AI is accelerating the pace and complexity of technology development worldwide, requiring compliance systems to keep up. However, compliance has traditionally relied on manual processes, which can be error-prone and challenging to scale.</p>
<p>At Meta, we’ve been investing in advanced AI-enabled detection mechanisms to help us ensure we’re upholding our responsibility to keep our products and services safe for everyone while adhering to compliance obligations at scale. AI-powered solutions help our engineers, developers, and product teams meet global regulatory requirements more easily and efficiently so they can spend more time focusing on building new and innovative products and services.</p>
<p>Earlier this year, we released new research into <a href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/" target="_blank" rel="noopener">leveraging large language models (LLMs) for mutation-guided test generation</a> – where faults (mutants) are deliberately introduced into source code as a method of assessing how well a testing framework can detect those faults. </p>
<p>Meta’s <a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">Automated Compliance Hardening (ACH) tool</a> successfully combines automated test generation techniques with the capabilities of LLMs to generate highly-relevant mutants for testing as well as tests that are guaranteed to catch those mutants. Through simple, plain-text prompts where engineers describe the mutant to test, ACH makes this process intuitive and reliable. It’s one of our latest AI-powered detection mechanisms that helps us safeguard our operations and catch code that is out of compliance. With ACH, we can more easily and proactively identify bugs that would negatively impact our compliance, and prevent them from entering our systems in the future. This technology provides Meta engineers and our product teams with the consistency and confidence they need to ensure our codebase remains risk-resilient.</p>
<p>Since empowering ACH with our research findings, we’ve presented our work at keynote presentations at <a href="https://conf.researchr.org/info/fse-2025/keynotes" target="_blank" rel="noopener">FSE 2025</a> and <a href="https://conference.eurostarsoftwaretesting.com/event/2025/assured-llm-based-software-test-generation/" target="_blank" rel="noopener">EuroSTAR 2025</a>. Our presentations shared insights into how we’ve used LLMs to solve the major barriers that have prevented mutation testing at scale and highlighted new areas in automated software testing where LLMs can have a significant impact. </p>
<p>For a long time, people thought of mutation testing as a way of assessing test quality, but less as a way to generate tests. By leveraging generative AI, we’ve been able to make what studies have consistently shown to be the most powerful form of software testing even more efficient and scalable.</p>
<h2>The Challenge of Scaling Mutation Testing</h2>
<p>The idea behind mutation testing is to go beyond traditional structural coverage criteria like statement coverage or branch coverage (which only show whether lines of code are run) to a more robust form of testing. Statement or branch coverage can still miss a bug as long as the faulty line executes; mutation testing instead reveals whether a test fails after a mutation is inserted, indicating whether the tests effectively check the code’s behavior. As an example, ACH can simulate privacy faults that would introduce compliance risk (such as messages being shared with unintended audiences) to model a potential real-world issue. It then creates unit tests to catch these bugs, preventing them from reaching production, even if they’re reintroduced in future code changes.</p>
<p>Even though mutation testing cannot exist on its own (it requires a test to already exist), it helps engineers and developers identify weak assertions and encourages them to write tests that truly validate code behavior instead of just executing it. </p>
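<p>A toy example (ours, not ACH output) makes the difference concrete: both tests below execute every line of the function, so both achieve full statement coverage, yet only the boundary assertion distinguishes the original from the mutant:</p>

```kotlin
// Original logic and a hand-made mutant with the boundary operator flipped.
fun isAdult(age: Int): Boolean = age >= 18
fun isAdultMutant(age: Int): Boolean = age > 18

// Executes the line but asserts nothing near the boundary: the mutant survives.
fun coverageOnlyTestPasses(f: (Int) -> Boolean): Boolean = f(30)

// Asserts on the boundary value: the mutant is killed (this test fails on it).
fun boundaryTestPasses(f: (Int) -> Boolean): Boolean = f(18)
```

<p>A weak assertion lets the mutant survive; a test that truly validates behavior kills it.</p>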
<p>In practice, however, mutation testing has been notoriously difficult to deploy. Despite <a href="https://web.eecs.umich.edu/~weimerw/2022-481F/readings/mutation-testing.pdf" target="_blank" rel="noopener">over five decades of research</a>, mutation testing has traditionally faced five major barriers.</p>
<h3>1. Mutation Testing Isn’t Scalable</h3>
<p>Traditional mutation testing generates a very large number of mutants, making it computationally expensive and difficult to scale to large industrial codebases. The sheer volume of mutants can overwhelm testing infrastructure and slow down development cycles.</p>
<h3>2. Mutation Testing Can Create Unrealistic Mutants</h3>
<p>Mutants generated via traditional means can be unrealistic or irrelevant to real faults that developers are interested in.</p>
<p>This can happen for a few reasons:</p>
<ul><li class="c1" aria-level="1"><strong>Rule-based mutation operators</strong>: Traditional mutation testing relies on predefined, rule-based mutation operators that apply generic syntactic changes to code (e.g., flipping boolean conditions, changing arithmetic operators). These operators do not consider the specific context or domain of the code, leading to mutants that do not represent faults that developers would realistically introduce.</li>
<li class="c1" aria-level="1"><strong>Lack of specific focus</strong>: Mutants generated without targeting a specific class of faults or domain concerns often produce changes that are irrelevant to the actual risks or issues faced by the system.</li>
<li class="c1" aria-level="1"><strong>Semantic irrelevance</strong>: Some mutants may syntactically change the code but do not affect the program’s semantics in a meaningful way or do not simulate realistic fault conditions. These mutants do not help in improving test quality because they do not represent faults that tests should catch.</li>
<li class="c1" aria-level="1"><strong>Overgeneralization</strong>: Applying broad mutation rules uniformly across all code can generate mutants that aren’t useful in the context of the specific software, leading to wasted effort in trying to kill mutants that do not correspond to real-world bugs.</li>
</ul><h3>3. Equivalent Mutants Waste Time and Resources</h3>
<p>Equivalent mutants – mutants that are syntactically different but semantically equivalent to the original code – have been a persistent challenge for mutation testing that wastes developer time and computational resources. Determining whether a mutant is equivalent or not is known to be mathematically undecidable, adding to the technical challenge of the problem.</p>
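<p>A minimal illustration (ours, not from ACH) of why equivalent mutants waste effort: the two functions below differ syntactically, yet no test input can ever tell them apart, so no test can “kill” the mutant:</p>

```kotlin
// An equivalent mutant: syntactically different, semantically identical.
fun double(n: Int): Int = n * 2
fun doubleEquivalentMutant(n: Int): Int = n + n
```

<p>Any time spent trying to write a test that fails on this mutant is wasted by construction.</p>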
<h3>4. Mutation Testing Requires a Lot of Computational Resources</h3>
<p>Mutation testing is costly in terms of computational resources and developer effort. Running tests against many mutants and analyzing results requires a significant amount of infrastructure and time, which can be prohibitive in fast-paced industrial environments.</p>
<h3>5. Mutation Testing Can Overstretch Testing Efforts</h3>
<p>Mutation testing can overstretch testing efforts by focusing on killing mutants that may not correspond to meaningful or high-impact faults. This can lead to diminishing returns where additional testing effort does not translate into better fault detection or software quality.</p>
<h2>How LLMs Solve the Challenges of Mutation Testing</h2>
<p>While it has been challenging for large organizations like Meta to deploy mutation testing at scale, what they have been able to do is collect vast amounts of data on the bugs found in various stages of their software development. All of this data can be used to train an LLM to guide test generation.</p>
<p>When we construct mutants that are both highly-relevant and currently not caught (unkilled) by any existing testing framework, we can use these mutants as prompts for LLM-based test generation (hence, mutation-guided, LLM-based test generation). The end result is ACH – a system and workflow that can generate both problem-specific mutants <em>and</em> the tests that can catch them, using plain text instructions.</p>
<p>By leveraging LLMs, ACH solves for each of the barriers to mutation testing deployment: </p>
<h3>1. ACH Enables Scalable Mutant Testing</h3>
<p>Meta’s ACH system uses LLMs to generate fewer, more realistic, and highly specific mutants targeted at particular fault classes (e.g., privacy faults), increasing scalability and relevance. This mutation-guided approach focuses on faults relevant to the specific problem domain, which improves the relevance and quality of mutants and also resolves scalability issues by significantly lowering the number of mutants that need to be generated in order to be relevant and useful.</p>
<h3>2. ACH Creates Realistic Mutants</h3>
<p>With ACH, a security or privacy engineer can use textual descriptions of issues they are concerned about to generate very realistic problem-specific bugs that apply directly to an area of concern. </p>
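<p>Such a description might look something like the sketch below. This is an illustrative example we constructed, not an actual ACH prompt, and the class and method names are hypothetical:</p>

```
Fault class: privacy / unintended audience

Describe the mutant: "Introduce a bug in ThreadParticipants.addRecipient()
so that a message can also be delivered to a user who has already left
the thread."

Expected output: a mutant implementing that fault, plus unit tests that
fail on the mutant and pass on the original code.
```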
<h3>3. ACH Detects and Kills Equivalent Mutants With LLMs </h3>
<p>ACH features an LLM-based Equivalence Detector agent that is often capable of judging whether a mutant is equivalent to the original code. In our <a href="https://arxiv.org/pdf/2501.12862">own research and testing with ACH</a> we found that this approach achieves high precision (0.79) and recall (0.47) in detecting equivalent mutants – rising to 0.95 and 0.96 when combined with simple static analysis preprocessing (e.g., stripping comments) – efficiently filtering out unkillable mutants.</p>
<p>ACH also automatically generates unit tests that kill the mutants, so engineers only ever need to look at tests and, if they wish, mutants that are <strong>guaranteed</strong> to be non-equivalent.</p>
<h3>4. Tests Generated by ACH Are Computationally Efficient and Easier To Deploy</h3>
<p>From October to December 2024, we <a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">ran a trial</a> where ACH was deployed for privacy testing use cases on several platforms at Meta, including Facebook, Instagram, WhatsApp, and our wearables platforms (Quest and Ray-Ban Meta glasses). Over thousands of mutants and hundreds of generated tests, privacy engineers at Meta accepted 73% of the generated tests, with 36% judged as privacy relevant. Feedback showed engineers found tests useful even when they weren’t directly relevant to privacy. Our engineers appreciate the additional safety net AI can provide and the augmentation of their skillset at scale for handling edge cases. But importantly, they valued being able to focus on evaluating tests rather than having to construct them.</p>
<h3>5. ACH Helps Prevent Overstretching</h3>
<p>ACH generates mutants that are closely coupled to the issue of concern and produces tests that catch faults missed by existing tests. <a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">Our empirical results show</a> that many generated tests add coverage and catch faults that would otherwise go undetected, highlighting mutation testing’s superiority over structural coverage criteria alone.</p>
<h2>The Catching JiTTest Challenge: More Frontiers for LLMs in Software Testing</h2>
<p>LLMs have opened up exciting new challenges and areas of exploration in the domain of automated software testing, specifically around generating hardening tests and catching tests. Hardening tests protect against future regressions by ensuring that new changes do not break existing functionality. Catching tests detect faults in new or changed functionality. </p>
<p>Based on our work with ACH, we believe there is even more opportunity to leverage LLMs to improve test generation. Currently, we’re particularly interested in using LLMs to tackle the challenge of generating just-in-time (JiT) tests, where tests are generated for human review just in time for pull requests to catch faults before code ends up in production. What makes this particularly challenging is the <a href="https://ieeexplore.ieee.org/document/6963470">Test Oracle Problem</a> – the challenge of distinguishing correct, desired behavior from incorrect behavior for a given input.</p>
<p>To that end we’re proposing the <a href="https://arxiv.org/pdf/2504.16472" target="_blank" rel="noopener">Catching Just-in-Time Test (JiTTest) Challenge</a> to the wider community. We want to encourage engineers and developers to build systems capable of generating tests that reveal bugs in pull requests with high precision, while also keeping humans in the loop to ensure low false positives.  </p>
<h3>A “Just-In-Time” Call to Action</h3>
<p>Our paper, “<a href="https://arxiv.org/pdf/2504.16472" target="_blank" rel="noopener">Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges</a>,” which was recently presented as a keynote at <a href="https://conf.researchr.org/info/fse-2025/keynotes" target="_blank" rel="noopener">FSE 2025</a>, shares more about the JiTTest Challenge as well as the open problems around applying LLMs to automated software testing. </p>
<h2>LLMs and the Future of Software Testing</h2>
<p>AI has helped us streamline and optimize our compliance and overall risk management frameworks at Meta. Processes that have historically been time-consuming, error-prone, and unable to comprehensively identify potential risks are being transformed into systems that save engineer and developer time while also enhancing compliance.</p>
<p>However, there is still a lot of exciting work ahead to be done for ACH and in the larger area of applying LLMs to software testing to enable continuous compliance. </p>
<p>While our own testing with ACH explored its uses in privacy testing and focused on <a href="https://engineering.fb.com/2024/12/18/android/translating-java-to-kotlin-at-scale/">Kotlin</a> as the main language, we’re currently working to expand into other domains and more languages. We’re also investigating ways to leverage techniques like fine-tuning and prompt engineering to make mutant generation even more precise and relevant. </p>
<p>More broadly, our work with ACH, as well as the JiTTest Challenge, will focus on addressing the Test Oracle Problem – exploring ways to enable testing of existing faults with high precision while avoiding false positives. </p>
<p>We also cannot ignore the human element in all of this. In addition to examining ways to ensure that human reviewers are present to help prevent false positives, we should also investigate how developers are interacting with LLM-generated tests to improve their adoption and usability.  </p>
<p>We’ll be presenting more of our work in the near future, including at the upcoming <a href="https://atscaleconference.com/" target="_blank" rel="noopener">Product@Scale conference</a>. We hope you’ll join us on our journey to further explore AI’s potential to transform software testing and raise the bar for risk management across industries.</p>]]></description>
      <link>https://engineering.fb.com/2025/09/30/security/llms-are-the-key-to-mutation-testing-and-better-compliance/</link>
      <guid>https://engineering.fb.com/2025/09/30/security/llms-are-the-key-to-mutation-testing-and-better-compliance/</guid>
      <pubDate>Tue, 30 Sep 2025 16:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta 3D AssetGen: Generating 3D Worlds With AI]]></title>
      <description><![CDATA[<p>Imagine being able to use AI to create 3D virtual worlds using prompts as easily as you can generate images.</p>
<p>The intersection of AI and VR was one of the biggest topics at Meta Connect this year. In his <a href="https://youtu.be/D97ILdUbYww?t=3902" target="_blank" rel="noopener">keynote</a>, Mark Zuckerberg shared his vision of a <a href="https://youtu.be/D97ILdUbYww?t=3902" target="_blank" rel="noopener">future where anyone can create virtual worlds</a> using AI-powered tools like the ones available in the upcoming <a href="https://developers.meta.com/horizon-worlds" target="_blank" rel="noopener">Meta Horizon Studio</a>.</p>
<p>But AI is already making it easier than ever to create 3D assets.</p>
<p>On this episode of the Meta Tech Podcast, <a href="https://www.threads.com/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> is joined by Mahima and Rakesh from Meta’s XR Tech team to discuss <a href="https://developers.meta.com/horizon/blog/AssetGen2">AssetGen</a>, a new foundation model for 3D assets.</p>
<p>They talk about how they built and trained AssetGen, the important role LLMs have to play in the future of VR, and how they’re tackling the ambitious goal of generating entire 3D worlds from simple text prompts.</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/38282575/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/5VtjehlnMQIS2ArZ0iq11r?ref=engineeringatmeta" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/generating-3d-worlds-with-ai/id1370910331?i=1000727530767" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pocketcasts.com/podcast/meta-tech-podcast/c4ede3e0-1fbf-0136-c266-7d73a919276a/generating-3d-worlds-with-ai/b05b30a2-7b3f-4283-a08c-2e802aee1d92?ref=engineeringatmeta" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a>, brought to you by Meta, highlights the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/09/29/virtual-reality/assetgen-generating-3d-worlds-with-ai/</link>
      <guid>https://engineering.fb.com/2025/09/29/virtual-reality/assetgen-generating-3d-worlds-with-ai/</guid>
      <pubDate>Mon, 29 Sep 2025 14:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta’s Infrastructure Evolution and the Advent of AI]]></title>
      <description><![CDATA[<div class="wp-video c1"><a href="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-centers-AI-Infra-video_small.mp4">https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-centers-AI-Infra-video_small.mp4</a></div>
<p>Over the past 21 years, Meta has grown exponentially from a small social network connecting a few thousand people in a handful of universities in the U.S. into several apps and novel hardware products that serve over 3.4 billion people throughout the world.</p>
<p>Our infrastructure has evolved significantly over the years, growing from a handful of software systems on a small fleet of servers in a few co-location facilities to a massive, globally networked operation. We faced numerous challenges along the way and developed innovative solutions to overcome them.</p>
<p>The advent of AI has changed all of our assumptions on how to scale our infrastructure. Building infrastructure for AI requires innovation at every layer of the stack, from hardware and software, to our networks, to our data centers themselves.</p>
<p>Facebook was built on the open source Linux, Apache, MySQL, and PHP (LAMP) stack. True to our roots, much of our work has been openly shared with the engineering community in the form of research papers or open source hardware and software systems. We remain committed to this open source vision and to an open standards approach to silicon and hardware systems as we push the frontiers of computer science.</p>
<h2>Scaling Our Infrastructure Stack (2004 – 2010)</h2>
<p>In our earliest years, we focused our engineering work on scaling our software stack. As Facebook expanded from Harvard to other universities, each university got its own database. Students logging on to Facebook would connect to a set of common web servers that would in turn connect each student to their university’s database. We quickly realized that students wished to connect with their friends who attended other universities — this was the birth of our social graph that interconnected everyone on the social network. </p>
<p>As Facebook expanded beyond universities to high schools and then the general public, there was a dramatic increase in the number of people on our platform. We managed database load by scaling our <a href="https://research.facebook.com/publications/scaling-memcache-at-facebook/">Memcache</a> deployments and then building entirely new software systems such as the <a href="https://engineering.fb.com/2013/06/25/core-infra/tao-the-power-of-the-graph/">TAO social graph</a>, and a whole host of new caching and data management systems. We also developed a new ranking service for News Feed and a <a href="https://engineering.fb.com/2012/03/22/web/under-the-hood-improving-facebook-photos/">photo service for sharing photos</a> and videos.</p>
<p>Soon, we were expanding beyond the US to Europe. Scaling our software systems was critical, but no longer sufficient. We needed to find other ways to scale. So we moved one layer below software and started scaling our physical infrastructure. We expanded beyond small co-location facilities in the Bay Area to a co-lo in Ashburn, Va. In parallel, we built out our first data centers in <a href="https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/">Prineville, Ore</a>. and Forest City, N.C.</p>
<p>As our physical infrastructure scaled to multiple data centers, we ran into two new problems. First, we needed to connect our user base distributed across the US and Europe to our data centers. We tackled this problem by aggressively building out our edge infrastructure where we obtained some compute capacity beside every local internet service provider (ISP) and bought into the peering network that connected the ISP to our data centers. Second, we needed to replicate our entire software stack to each data center so that people would have the same experience irrespective of which actual physical location they connected to. This required us to build a high bandwidth, multipath backbone network that interconnected our data centers. Initially, this entailed building out our terrestrial fiber network to connect the various co-location facilities in California and Virginia to our new data centers in Oregon and North Carolina.</p>
<p>As our user base grew globally, we scaled beyond single data center buildings and into data center regions consisting of multiple buildings. We also aggressively built out our edge presence, where we now operate hundreds of <a href="https://engineering.fb.com/2017/08/21/networking-traffic/steering-oceans-of-content-to-the-world/">points-of-presence (POPs)</a> across the world.</p>
<figure id="attachment_22972" aria-describedby="caption-attachment-22972" class="wp-caption alignnone c2"><img class="size-full wp-image-22972" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png" alt="" width="1999" height="1166" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png?resize=916,534 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png?resize=768,448 768w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png?resize=1024,597 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png?resize=1536,896 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-points-of-presence-map.png?resize=192,112 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22972" class="wp-caption-text">Over the decades, our infrastructure has grown to a global footprint of interconnected data centers and edge points-of-presence.</figcaption></figure><h2>The Challenges of Scaling (2010 – 2020)</h2>
<p>Building out a global infrastructure also brought with it the classic corner cases of distributed computing.</p>
<h3>Cache Consistency </h3>
<p>First, we needed to solve for cache consistency. We saw issues where people would receive notifications about being tagged in a photo, but couldn’t see the photo. Or people in a chat thread would receive messages out-of-order. These problems manifested because we were serving a fraction of our user base out of each data center region. People served out of the same region would receive notifications and see the right data, while people in a different region would experience a lag as the data update was replicated across our distributed fleet. This lag directly led to an inconsistent user experience. We solved these problems by building <a href="https://research.facebook.com/publications/existential-consistency-measuring-and-understanding-consistency-at-facebook/">novel software systems</a> that delivered cache invalidations, eventually <a href="https://engineering.fb.com/2022/06/08/core-infra/cache-made-consistent/">building a consistency API for distributed systems</a>.</p>
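<p>The failure mode can be sketched with a toy model (all names here are hypothetical; this is not Meta’s actual implementation): a write in one region must invalidate the stale copies held in every other region’s cache, or readers in those regions keep seeing old data.</p>

```python
# Toy model of cross-region cache invalidation (illustrative only; the
# names are hypothetical and this is not Meta's actual implementation).

class RegionCache:
    def __init__(self):
        self._cache = {}

    def get(self, key, db):
        # On a miss, fill the regional cache from the authoritative store.
        if key not in self._cache:
            self._cache[key] = db[key]
        return self._cache[key]

    def invalidate(self, key):
        self._cache.pop(key, None)


def write(db, regions, key, value):
    db[key] = value                 # 1. update the source of truth
    for region in regions:
        region.invalidate(key)      # 2. invalidate every region's copy


db = {"photo:1": "v1"}
us, eu = RegionCache(), RegionCache()

us.get("photo:1", db)               # both regions now cache "v1"
eu.get("photo:1", db)

write(db, [us, eu], "photo:1", "v2")
# Without step 2, eu would keep serving the stale "v1" until replication
# caught up -- exactly the lag users experienced as missing photos.
print(eu.get("photo:1", db))        # -> v2
```

<p>The real systems linked above go much further, measuring consistency across the fleet and guaranteeing delivery of invalidations, but the toy captures why a missed invalidation surfaces as a user-visible inconsistency.</p>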
<h3>Fleet management</h3>
<p>As we added new data center regions and grew our machine fleet, we also had to develop new abstractions to manage them. This included systems and associated components like:</p>
<ul><li class="c3" aria-level="1"><a href="https://engineering.fb.com/2019/06/06/data-center-engineering/twine/">Twine</a>: a cluster management system that scales to manage millions of machines in a data center region.</li>
<li class="c3" aria-level="1"><a href="https://engineering.fb.com/2021/06/21/data-infrastructure/tectonic-file-system/">Tectonic</a>: a data center scale distributed file system.</li>
<li class="c3" aria-level="1"><a href="https://engineering.fb.com/2021/08/06/core-infra/zippydb/">ZippyDB</a>: a strongly consistent distributed key value store.</li>
<li class="c3" aria-level="1"><a href="https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/">Shard Manager</a>: a global system to manage tens of millions of shards of data, hosted on hundreds of thousands of servers for hundreds of applications.</li>
<li class="c3" aria-level="1"> <a href="https://engineering.fb.com/2019/06/06/data-center-engineering/delos/">Delos</a>: a new control plane for our global infrastructure.</li>
<li class="c3" aria-level="1"> <a href="https://www.usenix.org/conference/osdi23/presentation/saokar">Service Router</a>: to manage our global service mesh.</li>
</ul><p>We developed the above systems, and many others, so we could operate a global fleet of millions of machines, while also providing excellent performance.</p>
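<p>To give a flavor of the placement problem systems like Shard Manager solve, here is a minimal consistent-hashing sketch, a classic technique for mapping shards to servers. It is shown purely as an illustration, not as a description of Shard Manager’s internals.</p>

```python
# Minimal consistent-hashing ring for shard placement. This is a classic
# technique shown for illustration; it is not Shard Manager's actual design.
import bisect
import hashlib

def _point(s: str) -> int:
    # Stable hash so placement is identical across processes and restarts.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=100):
        # Many virtual points per server spread load evenly around the ring.
        self._points = sorted(
            (_point(f"{srv}#{i}"), srv) for srv in servers for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def owner(self, shard_id: str) -> str:
        # A shard belongs to the first server point at or after its hash.
        i = bisect.bisect(self._hashes, _point(shard_id)) % len(self._hashes)
        return self._points[i][1]

ring = Ring(["srv-a", "srv-b", "srv-c"])
assignment = {f"shard-{n}": ring.owner(f"shard-{n}") for n in range(6)}
# Adding or removing one server moves only ~1/N of the shards, which is
# what makes rebalancing tractable across a fleet of millions of machines.
```
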
<h3>Masking hardware failure</h3>
<p>More machines also implies a higher likelihood of failure. To address this, we worked to ensure that we could mask failures from users and provide a highly available and accessible service. We accomplished this by building new systems like:</p>
<ul><li class="c3" aria-level="1"><a href="https://research.facebook.com/publications/kraken-leveraging-live-trafc-tests-to-identify-and-resolve-resource-utilization-bottlenecks-in-large-scale-web-services/">Kraken</a>: which leverages live traffic load tests to identify and resolve resource utilization bottlenecks.</li>
<li class="c3" aria-level="1"><a href="https://research.facebook.com/publications/taiji-managing-global-user-traffic-for-large-scale-internet-services-at-the-edge/">Taiji</a>: to manage user traffic load balancing.</li>
<li class="c3" aria-level="1"><a href="https://research.facebook.com/publications/maelstrom-mitigating-datacenter-level-disasters-by-draining-interdependent-traffic-safely-and-efficiently/">Maelstrom</a>: which handled data center-scale disasters safely and efficiently while minimizing user impact.</li>
</ul><p>We continue to invest heavily in reliability and fault tolerance as stability is critical for all the people who use our services to connect with their friends, family, and the businesses that serve them.</p>
<h2>Enter AI Workloads (2020)</h2>
<p>While we were navigating the challenges of scaling, we were also seeing glimpses of how AI workloads would impact our infrastructure.</p>
<h3>The Emergence of GPUs</h3>
<p>Our first encounter with AI-induced infrastructure challenges came in the late 2010s, when short-form videos were becoming very popular. The people who consumed this type of content wanted personalized recommendations – a dramatic departure from how we had ranked content to date.</p>
<p>Meta’s apps were built on the premise that people are part of communities with shared interests. Thus, Facebook surfaced content based on what the community liked rather than having a direct understanding of the individual and their interests. In contrast, if you want to give people an entertaining stream of short form videos, you have to be able to understand all videos uploaded to the platform and pick videos that are interesting to every single person.</p>
<p>This is a significantly different problem. In the first case, all we’re ranking is content that someone’s friends (typically just a few hundred people) have interacted with. In this new model, we have to rank all content that has been uploaded, which is orders of magnitude larger than the number of friends each person has. And we need to produce this ranking not just once, but a custom ranking for each person for each piece of content.</p>
<p>This is where GPUs and other AI accelerators enter the picture. In contrast to a CPU which is primarily a load-store machine, a GPU is a vector and matrix processing machine which can perform orders of magnitude more computation than a CPU.</p>
<p>When given an extremely large corpus of data, for example, a video library, we can build an embedding, which is a mathematical representation of each video as a vector of numbers. This vector captures the context of the video in a lower-dimensional space so that semantically similar content is positioned close to each other. We can now build a model that tracks the sequence of clicks a user makes as they navigate through a library of videos and predict future videos that they might be interested in. Thus, AI combines the mathematical notion of similarity in content, with the computational power of a GPU to provide personalized recommendations.</p>
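<p>The idea can be sketched in a few lines: rank candidate videos by the cosine similarity of their embedding to a video the person just watched. The vectors below are hand-made toys; real systems learn embeddings with hundreds of dimensions and use approximate nearest-neighbor search over billions of items.</p>

```python
# Toy embedding-based recommendation: rank candidate videos by cosine
# similarity to the last video watched. Vectors are hand-made for
# illustration; real embeddings are learned by a model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

library = {
    "cooking-pasta":   [0.9, 0.1, 0.0],
    "cooking-bread":   [0.8, 0.2, 0.1],
    "cat-compilation": [0.0, 0.1, 0.9],
}

just_watched = library["cooking-pasta"]
ranked = sorted(
    (title for title in library if title != "cooking-pasta"),
    key=lambda t: cosine(library[t], just_watched),
    reverse=True,
)
print(ranked)   # semantically similar content ranks first
```

<p>The GPU’s role is that this similarity computation is a dot product, and dot products over millions of candidates batch naturally into the matrix operations GPUs excel at.</p>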
<p>Internet services scaled throughout the 2000s and 2010s by buying CPUs, memory, and hard drives that were extremely cost efficient but unreliable, and then building software systems to mask failures. In contrast, an AI cluster is a high-performance computational system: hundreds or even thousands of extremely powerful GPUs with ample memory, interconnected by a high-bandwidth, low-latency network, with a custom software stack optimized to squeeze maximum performance out of the system.</p>
<p>Our initial AI clusters interconnected 4k GPUs that were used to train our ranking and recommendation models.</p>
<figure id="attachment_22973" aria-describedby="caption-attachment-22973" class="wp-caption alignnone c4"><img class="wp-image-22973" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Infra-holistic-plan.gif" alt="" width="700" height="710" /><figcaption id="caption-attachment-22973" class="wp-caption-text">Right as we built our first 4k AI cluster, we realized that we needed to holistically plan our infrastructure across data center space, cooling, mechanical systems, hardware, network, storage, and software. And the challenges have only grown as our AI clusters have increased in scale and complexity.</figcaption></figure><h2>The Rise of Large Language Models (2022)</h2>
<p>This remained the case until large language models (LLMs) started to take off in 2022. At the time, while our AI clusters were 4k GPUs in size, each of our training jobs tended to run on just 128 GPUs.</p>
<p>When we started to train LLMs, this quickly changed. </p>
<p>LLMs required dramatically more compute capacity, and the more compute you were able to throw at the pretraining job, the better the model you were able to produce. In a few weeks, we had to scale our training job sizes from 128 GPUs to 2k and then 4k GPUs. </p>
<p>For the first time, we were regularly dealing with training jobs where we needed thousands of GPUs to run synchronously. Any single straggling GPU would hold up the performance of the entire cluster. </p>
<p>We quickly learned that scaling training jobs came with all kinds of challenges: GPUs can fail, memory can have errors, the network can experience jitter. And, as with traditional web workloads, the more machines you have, the more likely you are to experience failure. Except this time the failures were not so easy to avoid: unlike serving web requests, where you can simply retry the request on a different machine, an AI training cluster runs a single job, and any one failure can bring that job to a halt. If jobs fail too frequently, we stop making progress because of how long it takes to checkpoint and restart them. Through collaboration with the industry and our partners, we were able to <a href="https://arxiv.org/pdf/2407.21783">drive the interruption rate down by ~50x</a> (based on normalized interruption/reliability metrics).</p>
<p>As we built larger clusters, we also invested in fundamental research and development across our AI infrastructure. LLMs influenced how we developed our ranking and recommendation models. For instance, <a href="https://generative-rec.github.io/workshop/assets/slides/Meta-actions-speak-louder-than-words.pdf">Hierarchical Sequential Transduction Units (HSTU)</a> accelerated training and inference by 10-1000x for Generative Recommenders.</p>
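<p>Why the interruption rate matters so much can be seen with back-of-the-envelope arithmetic (the numbers below are illustrative assumptions, not measured figures): each failure wastes, on average, roughly half a checkpoint interval of recomputed work plus the restart time.</p>

```python
# Back-of-the-envelope goodput model for synchronous training jobs.
# All numbers are illustrative assumptions, not measured Meta data.

def goodput(mtbf_hours, checkpoint_interval_min, restart_min):
    """Fraction of wall-clock time that makes useful training progress.

    On average each failure wastes half a checkpoint interval of
    recomputed work, plus the time to restart the whole job.
    """
    lost_per_failure_hr = (checkpoint_interval_min / 2 + restart_min) / 60
    return max(0.0, 1.0 - lost_per_failure_hr / mtbf_hours)

# A job failing every 2 hours with 30-minute checkpoints wastes a quarter
# of its time, while a ~50x lower interruption rate makes losses negligible.
before = goodput(mtbf_hours=2, checkpoint_interval_min=30, restart_min=15)
after = goodput(mtbf_hours=100, checkpoint_interval_min=30, restart_min=15)
```
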
<h2>Accelerating Our GPU Scale and AI Infrastructure (2023)</h2>
<p>As we were working to get our 4k jobs to run well, we also realized we needed to figure out how to build even larger clusters. Taking advantage of what was available to us, we designed a cluster to use all the power available in a data center building, which is typically in the low tens of megawatts. This led us to build <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/">two clusters of 24k H100s</a> each in late 2023, one using InfiniBand and the other using <a href="https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/">RoCE</a>. This allowed us to explore different network technologies while providing our AI teams with the capacity they needed to train increasingly large LLMs such as Llama 3.</p>
<p>While our two 24k clusters were amongst the largest in the world in 2023, our AI researchers were finding that the more computational power we dedicated to pre-training, the higher quality and more performant the resulting LLMs became. Thus, our infrastructure engineers were tasked with scaling our AI clusters up by another order of magnitude.</p>
<p>To accomplish this, we did something we had never done in Meta’s history: As we mentioned, Meta’s data centers are usually deployed as regions of five or more identical buildings in a single location. By emptying out five production data centers we were able to build a single AI cluster with 129k H100 GPUs – all in a matter of months!</p>
<figure id="attachment_22974" aria-describedby="caption-attachment-22974" class="wp-caption alignnone c2"><img class="size-full wp-image-22974" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png" alt="" width="1999" height="1229" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png?resize=916,563 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png?resize=768,472 768w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png?resize=1024,630 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png?resize=1536,944 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-center-overhead-photo.png?resize=192,118 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22974" class="wp-caption-text">Our data centers are typically made up of multiple buildings in a single location.</figcaption></figure><p>The final challenge that we are tackling is one of efficiency: What hardware and software solutions can most efficiently support the workloads we care about and maximize utilization of our data center capacity?</p>
<p>Unfortunately, our AI workloads are not homogenous. The ranking and recommendation models that deliver personalized user experiences on our apps have different needs than LLMs. And LLMs themselves are rapidly evolving. We are quickly moving beyond the pre-training era to one where reinforcement learning, supervised fine tuning, test time inference, and reasoning are all increasing in importance and require custom hardware and software support.</p>
<p>Given the size of Meta’s AI ambitions, we need to work with different vendors to encourage market diversity. We believe that having multiple options leads to a healthier ecosystem and better solutions in the long run.</p>
<p>To build out our AI infrastructure, we’ve leveraged solutions from partners like AMD and NVIDIA as well as our own custom silicon. The image below shows a pod consisting of six racks. The middle two racks house 72 NVIDIA Blackwell GPUs that consume ~140kW of power! We do not have facility liquid cooling in our traditional data centers, so we had to deploy four air assisted liquid cooling (AALC) racks so the heat wouldn’t melt the machines!</p>
<figure id="attachment_22975" aria-describedby="caption-attachment-22975" class="wp-caption alignnone c5"><img class="size-full wp-image-22975" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Catalina-GB200-rack.jpg" alt="" width="1179" height="740" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Catalina-GB200-rack.jpg 1179w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Catalina-GB200-rack.jpg?resize=916,575 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Catalina-GB200-rack.jpg?resize=768,482 768w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Catalina-GB200-rack.jpg?resize=1024,643 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Catalina-GB200-rack.jpg?resize=96,60 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Catalina-GB200-rack.jpg?resize=192,121 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22975" class="wp-caption-text">Our GB200 rack, <a href="https://engineering.fb.com/2024/10/15/data-infrastructure/metas-open-ai-hardware-vision/" target="new">Catalina</a>, with AALC systems connected into single pod.</figcaption></figure><figure id="attachment_22988" aria-describedby="caption-attachment-22988" class="wp-caption alignnone c6"><img class="size-full wp-image-22988" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp" alt="" width="2048" height="1152" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp 2048w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp?resize=768,432 768w, 
https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-advanced-liquid-cooling-AALC-system-rear.webp?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22988" class="wp-caption-text">A look at the rear of one of our AALC systems.</figcaption></figure><p>Together, this pod produces 360 PFLOPS of FP16 compute capacity. To put things in perspective, the pod consumes more than 800x the power a typical CPU consumes, and produces hundreds of thousands of times the compute capacity! We are also starting to work with the next system, GB300, which is an improvement in many ways over GB200.</p>
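<p>The comparison above can be sanity-checked with simple arithmetic. The pod figures come from this post; the per-CPU power and throughput numbers below are rough assumptions for illustration.</p>

```python
# Sanity-checking the pod comparison with rough numbers. The pod figures
# (360 PFLOPS of FP16, ~140 kW for the GPU racks) come from the post;
# the per-CPU figures below are assumptions for illustration.

pod_flops = 360e15        # 360 PFLOPS of FP16 compute
pod_power_w = 140_000     # ~140 kW drawn by the two GPU racks

cpu_power_w = 175         # assumed power draw of a typical server CPU
cpu_flops = 2e12          # assumed ~2 TFLOPS for that CPU

power_ratio = pod_power_w / cpu_power_w      # ~800x the power...
compute_ratio = pod_flops / cpu_flops        # ...for ~180,000x the compute
```

<p>Under these assumptions the pod delivers on the order of two hundred times more compute per watt than the CPU, which is the economic argument for accelerators despite their enormous absolute power draw.</p>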
<p>We have invested in other AI accelerators such as AMD’s MI300, which serves a variety of workloads at Meta. We have also invested heavily in the software layer to abstract away hardware differences from our developers as much as possible. Here is where open source software stacks such as <a href="https://pytorch.org/">PyTorch</a> and <a href="https://github.com/triton-lang/triton">Triton</a> have really paid off for us.</p>
<h2>Meta Training &amp; Inference Accelerator (MTIA) </h2>
<p>We have also invested heavily in developing our own silicon. <a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/">The Meta Training and Inference Accelerator (MTIA)</a> is optimized for our ranking and recommendation inference workloads. This chip is now deployed at scale in our data centers, <a href="https://engineering.fb.com/2024/12/02/production-engineering/meta-andromeda-advantage-automation-next-gen-personalized-ads-retrieval-engine/">primarily serving our ads workloads</a>, and has given us massive benefits in efficiency over vendor silicon. </p>
<p>This is only the beginning of our silicon program. Our training chip for ranking and recommendations is also starting to ramp up production. And we have multiple chips in various stages of development that we expect to deploy in the next couple of years.</p>
<figure id="attachment_22976" aria-describedby="caption-attachment-22976" class="wp-caption alignnone c2"><img class="size-full wp-image-22976" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg" alt="" width="1999" height="1125" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg 1999w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-MTIA-v2.jpg?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22976" class="wp-caption-text">MTIA v2, which will power our ranking and recommendation ads models.</figcaption></figure><p>As we’ve been diving further into designing our own silicon we’ve encountered some scaling challenges.</p>
<h3>The Need for Advanced Packaging Techniques</h3>
<p>Transistors aren’t scaling at the same pace as the need for performance. Right now, reticle size is limited to 830 mm², which means that if anyone needs more performance than a single die can deliver, their only option is to invest in more dies.</p>
<p>Working with LLMs, we’ve found that the need to scale is so acute that it forces us into this exact scenario just to keep up with the performance needs of each new model generation. The challenge is only compounded by the fact that these dies can only be placed adjacently through advanced 2.5D and 3D packaging, which limits the size of the arrays we can build and creates concerns around energy efficiency and cooling as well.</p>
<p>We suspect that, along with advanced cooling solutions, advanced packaging techniques can help overcome these challenges by integrating multiple chiplets, or diverse capabilities (compute, memory, I/O).</p>
<h3>Investing in Solutions for Memory Disaggregation</h3>
<p>The rise of reasoning models, test-time inference, and reinforcement learning is adding further pressure on memory subsystems. We are starting to stack high-bandwidth memory (HBM) adjacent to the compute chiplets to maximize I/O bandwidth. But we only have so much silicon beachfront, so we have to make hard tradeoffs between the computational capability of the chip, versus memory size, versus network bandwidth. Not to mention that adding several HBM stacks creates further cooling concerns.</p>
<p>Investing in higher performance networks instead and locating high bandwidth memory off-chip, or even off machines, might mitigate these issues.</p>
<h3>The Case for Silicon Photonics</h3>
<p>As we have been planning our silicon roadmap, we’ve found that the minimum power budget for each rack has grown dramatically. We’re building larger and larger interconnected chips, and that comes with increasing power demands.</p>
<p>Silicon photonics, which offers benefits such as faster signaling over longer distances, could significantly reduce a rack’s overall power consumption.</p>
<p>Advanced optical solutions like these are also the only viable path to increasing shoreline beyond 3.2T and moving beyond the constraints of backplanes required to connect more endpoints.</p>
<p>These solutions would come with challenges of their own, such as higher power consumption and lower reliability compared to electrical signaling. Ultimately, future solutions will have to be interoperable across different technologies and vendors, more reliable than electrical signaling, and capable of being manufactured in high volume.</p>
<p>We are actively engaging in research to tackle these difficult hardware challenges and collaborating with the industry ecosystem to evolve the field and develop higher performance hardware. </p>
<h2>The Role of Open Standards in Scaling AI</h2>
<p>While the proliferation of hardware provides options and allows us to handle workload heterogeneity by matching each need with a customized solution, it also creates management challenges for hyperscalers, cloud operators, and hardware and software developers.</p>
<figure id="attachment_22977" aria-describedby="caption-attachment-22977" class="wp-caption alignnone c7"><img class="size-full wp-image-22977" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-accelerators-2025.png" alt="" width="1168" height="584" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-accelerators-2025.png 1168w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-accelerators-2025.png?resize=916,458 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-accelerators-2025.png?resize=768,384 768w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-accelerators-2025.png?resize=1024,512 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-accelerators-2025.png?resize=96,48 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-accelerators-2025.png?resize=192,96 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22977" class="wp-caption-text">A subset of the accelerators we’ve introduced into production in 2025.</figcaption></figure><p>From an operator point of view, it is difficult for Meta to deal with 5-6 different SKUs of hardware deployed every year. Heterogeneity of the fleet makes it difficult to move workloads around, leading to underutilized hardware. It is difficult for software engineers to think about building and optimizing workloads for different types of hardware. If new hardware necessitates the rewriting of libraries, kernels, and applications, then there will be strong resistance to adoption of new hardware. In fact, the current state of affairs is making it hard for hardware companies to design products because it is difficult to know what data center, rack, or power specifications to build for. </p>
<p>What is needed here are open standards, open weight models, and open source software.</p>
<p>Open source software like PyTorch and Triton can help by providing a consistent programming interface for machine learning developers and researchers. Open weight models give application developers cost efficient access to high quality LLMs, and at the same time, give infrastructure and hardware engineers a standard workload to optimize for. </p>
<p>From the very beginning, <a href="https://engineering.fb.com/2024/10/15/data-infrastructure/metas-open-ai-hardware-vision/">we’ve been strong supporters of open hardware for data center infrastructure</a>. We were a founding member of the <a href="https://www.opencompute.org/">Open Compute Project</a> and continue to be a leading contributor of technical content and IP into it. Since its inception, Meta has made 187 contributions (approximately 25% of all tech contributions) to OCP. Working with the OCP community has benefited us operationally by improving consistency in our fleet, financially through economies of scale, and technologically by enabling companies to come together and debate solutions. While we’ve seen this produce great results in our general purpose compute fleet, the benefits will only be amplified in the era of AI.  </p>
<p>Last year at the annual OCP Global Summit, for example, we unveiled <a href="https://engineering.fb.com/2024/10/15/data-infrastructure/metas-open-ai-hardware-vision/">Catalina</a>, our open-design, high-powered rack for AI workloads, and a new version of <a href="https://engineering.fb.com/2024/10/15/data-infrastructure/metas-open-ai-hardware-vision/">Grand Teton</a>, our AI hardware platform that features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces.</p>
<p>But we have a long way to go in continuing to push open standards. We need standardization of systems, racks and power as rack power density continues to increase. These common abstractions help us continue to innovate quickly and deploy at scale as we build out the next generation of data centers and power grids. An example of this standardization is the recent push to adapt the Open Compute rack standards to accommodate AI needs. </p>
<p>We need standardization of the scale-up and scale-out networks that these AI clusters use, so that customers can mix and match different GPUs and accelerators and always use the latest, most cost-effective hardware. We need software innovation and standards that allow us to run jobs across heterogeneous hardware types spread across different geographic locations. These open standards need to exist all the way through the stack, and there are massive opportunities to eliminate friction that is slowing down the build-out of AI infrastructure.</p>
<h2>The Next Stage (2026 and Beyond)</h2>
<p>No one can say for certain how the AI field will continue to evolve. Yet, what we do know is that computational capability is key to building higher quality models.</p>
<p>At Meta, our goal is to build models that will deliver the best, most engaging experiences, and act as personal assistants to each one of the billions of people that use our products every day.</p>
<p>Building the infrastructure for models this sophisticated means actively addressing challenges throughout our data centers – everything from advanced packaging, thermal management, and power delivery to memory disaggregation – while enabling scalable networks through optics.</p>
<p>Our next AI cluster, <a href="https://www.threads.com/@zuck/post/DMF6uUgx9f9?xmt=AQF0VyePoT5FvvT5T1DXLTKV5diUG2T8yUWrg49-m1Plfg">Prometheus</a>, will be a 1-gigawatt cluster spanning multiple data center buildings. Constructing Prometheus has been a monumental engineering feat, with infrastructure spanning five or more data center buildings in a single data center region. While a region is large, its power capacity is only a fraction of a gigawatt. Thus, we needed to find innovative ways to scale: We built this cluster across several of our traditional data center buildings, weatherproof tents, and adjacent colocation facilities. We are also evolving our software stack, including Twine and <a href="https://www.usenix.org/conference/osdi24/presentation/choudhury">MAST</a>, to support long-distance training across a geographically distributed set of data centers.</p>
<figure id="attachment_22978" aria-describedby="caption-attachment-22978" class="wp-caption alignnone c2"><img class="size-full wp-image-22978" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png" alt="" width="1999" height="1153" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png?resize=916,528 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png?resize=768,443 768w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png?resize=1024,591 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png?resize=1536,886 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png?resize=96,55 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Promethesus-tent-construction.png?resize=192,111 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22978" class="wp-caption-text">Prometheus, our 1-gigawatt cluster, is currently underway.</figcaption></figure><p>We also have an even larger cluster, Hyperion, expected to come online beginning in 2028. Once finished, the Hyperion cluster will have the ability to scale up to a capacity of 5 gigawatts.</p>
<figure id="attachment_22979" aria-describedby="caption-attachment-22979" class="wp-caption alignnone c2"><img class="size-full wp-image-22979" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg" alt="" width="1999" height="1000" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg 1999w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg?resize=916,458 916w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg?resize=768,384 768w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg?resize=1024,512 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg?resize=1536,768 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg?resize=96,48 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-Hyperion-construction-overhead.jpg?resize=192,96 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22979" class="wp-caption-text">The Hyperion cluster will have a 5-gigawatt capacity once complete.</figcaption></figure><p>We are still early in the evolution and adoption of AI workloads. The last few years have been busy, but the next few years are going to move at an even faster pace. The demands AI will push on hardware show no signs of slowing down.</p>]]></description>
      <link>https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/</link>
      <guid>https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/</guid>
      <pubDate>Mon, 29 Sep 2025 13:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Networking at the Heart of AI — @Scale: Networking 2025 Recap]]></title>
      <description><![CDATA[<p>AI is everywhere and, as network engineers, we are right in the thick of it: building the network infrastructure for AI. This year, at our largest <a href="https://www.youtube.com/playlist?list=PLBnLThDtSXOwnlfRbY4SNfP6oB7iJHmfH" target="_blank" rel="noopener">@Scale: Networking</a> ever, engineers from Meta, ByteDance, Google, Microsoft, Oracle, AMD, Broadcom, Cisco, and NVIDIA came together to share our latest experiences in architecting, designing, operating, and debugging our AI networks. The network has clearly played an important role in enabling our large-scale AI advances so far. Looking forward, our networking will help enable and define the future of AI.</p>
<div class="jetpack-video-wrapper"><iframe title="Keynote Welcome by Gaya Nagarajan - Live from SCC" width="1778" height="1000" src="https://www.youtube.com/embed/dTG97H7jrIs?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Setting Context: Rapid Changes and Evolution</h2>
<p>Given that AI continues to drive so much innovation in networking and infrastructure generally, we once again <a href="https://atscaleconference.com/events/networking-scale-2024/" target="_blank" rel="noopener">focused @Scale: Networking on AI networking</a>, sharing new insights and progress in the field. Over the past year, we’ve seen two important trends:</p>
<h3>AI Infra Takes Center Stage</h3>
<p>Across the industry, AI companies are planning hundreds of billions of dollars of infrastructure buildout over the next several years. At Meta, this has meant investing in building our gigawatt-scale clusters like <a href="https://www.facebook.com/watch/?v=2300161320399228" target="_blank" rel="noopener">Prometheus and Hyperion</a>, <a href="https://about.fb.com/news/2025/06/meta-constellation-partner-clean-energy-project/" target="_blank" rel="noopener">providing clean and renewable</a> power, and laying the <a href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/" target="_blank" rel="noopener">largest transoceanic fiber cable systems in the world</a> to ensure billions across the globe have access to all this AI innovation. In the short term, we’ve even expanded our construction portfolio with <a href="https://www.youtube.com/watch?v=qDDOy90V4Jo" target="_blank" rel="noopener">“sprung structures”</a> to bring capacity online as quickly as possible.</p>
<h3>The Models and the Primary AI Workloads Are Rapidly Evolving</h3>
<p>We’ve focused a lot over the last several years on the requirements of large-scale, foundational training. At Meta, we went from <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/" target="_blank" rel="noopener">4K to 24K to 129K-GPU clusters</a> based on Ethernet/<a href="https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/" target="_blank" rel="noopener">RoCE</a> in less than two years, tackling new challenges in <a href="https://ai.meta.com/research/publications/the-llama-3-herd-of-models/" target="_blank" rel="noopener">high performance and high reliability</a> with each leap. Now, in the last 9-12 months, we’ve seen a rapid expansion of workloads that include <a href="https://pytorch.org/blog/metashuffling-accelerating-llama-4-moe-inference/" target="_blank" rel="noopener">mixture-of-experts</a>, reasoning models, reinforcement learning, post-training, synthetic data generation, distributed inference, and more. All of these have different network requirements, and they are all now part of our challenge.</p>
<h2>The Role of the Network in AI</h2>
<p><img class="alignnone wp-image-22946 size-medium" src="https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png?w=916" alt="" width="916" height="611" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png?resize=916,611 916w, https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png?resize=768,512 768w, https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png?resize=1024,683 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png?resize=1536,1024 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png?resize=96,64 96w, https://engineering.fb.com/wp-content/uploads/2025/09/@ScaleNetworkingFinalSendsSmall947A1305-copy.png?resize=192,128 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>With this context, the network’s importance becomes even clearer.</p>
<h3>The Network Is the Computer</h3>
<p>Between rapidly changing AI workloads and massive physical infrastructure builds, the <strong>network serves as the interface that abstracts the underlying infrastructure</strong> from the workloads as much as possible. From the model’s perspective, the infrastructure should look like one gigantic GPU, and the network is key to this abstraction.</p>
<h3>Co-Designing the Network With the AI Stack</h3>
<p>Achieving this abstraction goal requires addressing challenges like varying distances and bandwidths (especially in the scale-up and scale-out domains), and hardware variety across different accelerators, NICs, and fabrics. It’s a full-stack/end-to-end problem for networking, bringing to bear all our experience in NICs, routing, and congestion control, and tuning all these closely with the GPU-based stack.</p>
<h3>Reliability Is Key</h3>
<p>Not only do we have to provide the performance and ease of use the models expect, but we must also operate this infrastructure with high reliability, detecting failures and reacting to them quickly and seamlessly.</p>
<h3>Innovation and Optionality</h3>
<p>Going forward, we need to continually innovate to stay ahead and provide <strong>optionality</strong>, as we expect constant change above us in the models and workloads and below us in the rest of the infrastructure. We want a network stack that blends the best of high-performance computing’s capabilities with open and scalable distributed-systems principles, ensuring we’re ready for whatever comes next.</p>
<h2>More from @Scale:Networking 2025 </h2>
<p><img class="alignnone wp-image-22945 size-medium" src="https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png?w=916" alt="" width="916" height="611" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png?resize=916,611 916w, https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png?resize=768,512 768w, https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png?resize=1024,683 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png?resize=1536,1024 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png?resize=96,64 96w, https://engineering.fb.com/wp-content/uploads/2025/09/@Scale-Networking-Quick-Sends947A1440-copy.png?resize=192,128 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Please visit <a href="https://www.youtube.com/playlist?list=PLBnLThDtSXOwnlfRbY4SNfP6oB7iJHmfH" target="_blank" rel="noopener">the @Scale YouTube channel</a> to check out all the talks from this year’s @Scale: Networking. Meta continually organizes all the @Scale events (<a href="https://atscaleconference.com/events/scale-systems-reliability/" target="_blank" rel="noopener">Systems &amp; Reliability</a>, <a href="https://atscaleconference.com/events/scale-data-ai-infra/" target="_blank" rel="noopener">AI &amp; Data</a>, and the upcoming <a href="https://atscaleconference.com/events/scale-product/" target="_blank" rel="noopener">Product</a> in October) so our communities can share the innovations and challenges we’re tackling and learn from each other.</p>
<p>We had a variety of talks with live Q&amp;As, organized around two major themes:</p>
<ol><li class="c1" aria-level="1"><strong>Underlying physical network infrastructure talks</strong>: switch topologies and control plane, NIC and host networking, and scalable operations/high reliability.</li>
<li class="c1" aria-level="1"><strong>Higher-layer, model-oriented talks</strong>: parallelism design, job-level debuggability, scaling for large pre-training, and handling new use cases in reinforcement learning, mixture of experts, and inference.</li>
</ol><p>Looking at what’s coming next for AI and networking, we also had keynotes from Meta and Microsoft and a vendor panel with key GPU and network-ASIC vendors.</p>
<p>Thanks again to everyone from Meta, ByteDance, Google, Microsoft, Oracle, AMD, Broadcom, Cisco, and NVIDIA who worked with us to share their latest learnings with the community. We look forward to what promises to be another rapid year of network and AI innovation that we’ll cover at the next @Scale: Networking in 2026!</p>]]></description>
      <link>https://engineering.fb.com/2025/09/26/networking-traffic/networking-at-the-heart-of-ai-scale-networking-2025-recap/</link>
      <guid>https://engineering.fb.com/2025/09/26/networking-traffic/networking-at-the-heart-of-ai-scale-networking-2025-recap/</guid>
      <pubDate>Fri, 26 Sep 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Video Streaming With the AV1 Video Codec in Mobile Devices]]></title>
      <description><![CDATA[<p>Today, Meta, Vodafone, and Google released a white paper, “<a href="https://go.fb.me/bb887d">Video Streaming with the AV1 Video Codec in Mobile Devices</a>,” detailing the benefits of the <a href="https://aomedia.org/specifications/av1/">AV1 codec</a>, an advanced video compression technique, to enhance the streaming video experience on mobile devices. </p>
<p>The <a href="https://go.fb.me/bb887d" target="_blank" rel="noopener">white paper</a> recommends that:</p>
<ul><li class="c1" aria-level="1">Vendors of core processors (SoCs) should evaluate adopting AV1 hardware decoding.</li>
<li class="c1" aria-level="1">Where that is not an option, vendors should consider a software-based AV1 decoder, which can help with the transition to AV1 in low- and mid-tier devices.</li>
</ul><p>Today, video content represents 70-80% of all mobile data traffic, and low- and mid-tier handsets account for around 75% of handset sales globally. AV1 decoding can be implemented in smartphones in both hardware and software. Our testing has shown that increasing the use of AV1 in low- to mid-tier smartphones can deliver video quality on par with premium handsets and free up network capacity while optimizing computing power and storage.</p>
<h2>Why We Recommend AV1 for Streaming Video on Mobile</h2>
<p>At Meta, we’ve already <a href="https://engineering.fb.com/2023/02/21/video-engineering/av1-codec-facebook-instagram-reels/">integrated AV1-compatible codecs into our technologies</a>. Along with other large technology companies, like Google, we’ve found that <strong>AV1 can enhance video compression by 30%</strong> compared to prior standards like H.264 and VP9, making it suitable for most video formats.</p>
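To put that 30% figure in concrete terms, here is a rough back-of-the-envelope sketch of what it can mean for mobile data use. The H.264 bitrates below are illustrative assumptions for this sketch, not values from the white paper:

```python
# Illustrative H.264 bitrates (kbps) for common mobile resolutions.
# These are assumed values for the sketch, not figures from the white paper.
h264_kbps = {"480p": 1500, "720p": 3000, "1080p": 5000}

AV1_SAVINGS = 0.30  # ~30% better compression at comparable quality


def av1_kbps(h264_rate: float) -> float:
    """Estimated AV1 bitrate for comparable perceptual quality."""
    return h264_rate * (1 - AV1_SAVINGS)


def gb_per_hour(kbps: float) -> float:
    """Data consumed by one hour of streaming at a given bitrate."""
    return kbps * 1000 * 3600 / 8 / 1e9  # bits/s over an hour -> bytes -> GB


for res, rate in h264_kbps.items():
    print(f"{res}: H.264 {gb_per_hour(rate):.2f} GB/h -> "
          f"AV1 {gb_per_hour(av1_kbps(rate)):.2f} GB/h")
```

Under these assumptions, an hour of 720p streaming drops from 1.35 GB with H.264 to about 0.95 GB with AV1, which is where the network-capacity savings described above come from.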
<figure id="attachment_22932" aria-describedby="caption-attachment-22932" class="wp-caption alignnone c2"><img class="wp-image-22932" src="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-AV1-White-Paper-Compression-Chart.png" alt="" width="600" height="360" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/Meta-AV1-White-Paper-Compression-Chart.png 750w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-AV1-White-Paper-Compression-Chart.png?resize=96,58 96w, https://engineering.fb.com/wp-content/uploads/2025/09/Meta-AV1-White-Paper-Compression-Chart.png?resize=192,115 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22932" class="wp-caption-text">AV1 offers up to 30% better compression over earlier codecs such as VP9 and H.264.</figcaption></figure><p>However, many of the mobile phones in use today are lower- and mid-tier handsets that lack the necessary codec support, particularly built-in hardware, to decode AV1 and deliver a buffer-free video experience.</p>
<p>Despite the improvement in user experience, smartphone chipset support for such codecs today remains limited to higher-tier products. Advanced video compression technology in mid- and low-tier smartphones would enhance the viewing experience for more people.</p>
<p>This is an opportunity for content providers and network operators to collaborate further with chipset manufacturers and device operating system developers to guarantee the best quality of experience for end users, while ensuring optimal utilization of network resources and reducing congestion.</p>
<h2>The Advantage of AV1</h2>
<p>AV1 has matured enough to help operators deal with the increasing amount of video traffic to mobile devices while also saving on compute, edge cache resources, and energy costs. Greater adoption of AV1 would help free up network capacity for mobile operators while also helping them meet increasing user demand.</p>
<p>AV1 can be implemented either as a smartphone software upgrade or embedded in mobile devices’ core processors (SoCs) for better battery efficiency and performance for end users. As highlighted in <a href="https://go.fb.me/bb887d" target="_blank" rel="noopener">the white paper</a>, AV1 hardware can give smartphone manufacturers superior energy-efficient compression gains compared to other techniques, without compromising connection speeds.</p>
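As a hypothetical sketch of how a streaming client might act on this hardware-vs.-software split, the selection logic below prefers hardware AV1 decoding, falls back to software decoding (e.g., via a decoder such as dav1d) on capable devices, and otherwise uses an older codec. The device fields and thresholds are illustrative assumptions, not recommendations from the white paper:

```python
# Hypothetical client-side codec selection: prefer hardware AV1, fall back
# to software AV1 on capable devices, otherwise use a universally supported
# codec.  All fields and thresholds here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Device:
    hw_av1_decode: bool   # SoC ships an AV1 hardware decoder
    cpu_cores: int        # crude proxy for software-decode capability
    battery_saver: bool   # software decoding costs more energy


def pick_codec(d: Device) -> str:
    if d.hw_av1_decode:
        return "av1-hw"
    # Software AV1 decoding is viable on reasonably capable CPUs, but we
    # avoid it when the user is conserving battery.
    if d.cpu_cores >= 4 and not d.battery_saver:
        return "av1-sw"
    return "h264"  # universally supported fallback


print(pick_codec(Device(hw_av1_decode=False, cpu_cores=8, battery_saver=False)))
# prints "av1-sw"
```

Real clients would of course probe the platform's media APIs for decoder availability rather than rely on a core count; the point is that the fallback order mirrors the white paper's recommendation.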
<h2>Download the White Paper</h2>
<p>“<a href="https://go.fb.me/bb887d" target="_blank" rel="noopener">Video Streaming with the AV1 Video Codec in Mobile Devices</a>”</p>]]></description>
      <link>https://engineering.fb.com/2025/09/24/video-engineering/video-streaming-with-av1-video-codec-mobile-devices-meta-white-paper/</link>
      <guid>https://engineering.fb.com/2025/09/24/video-engineering/video-streaming-with-av1-video-codec-mobile-devices-meta-white-paper/</guid>
      <pubDate>Wed, 24 Sep 2025 10:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Read Meta’s 2025 Sustainability Report]]></title>
      <description><![CDATA[<div><div class="container entry-content"><div class="block-focus-areas"><div class="block-focus-areas__areas"><div class="block-focus-areas__column"><div class="block-focus-areas-item block-focus-areas-item__inner"><figure class="wp-block-image size-full"><img data-recalc-dims="1" width="5184" height="2916" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?resize=5184%2C2916" alt="A nature scene overlooking treetops and mountains." class="wp-image-7471" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?w=5184 5184w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?w=1536 1536w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/2_Meta-2025-Sustainability-Landing-Page_Pillar_Climate.jpg?w=3000 3000w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure><p class="has-grey-700-color has-text-color has-link-color wp-elements-f919317aa504d67d959de4012698d2f4 c1">As climate change impacts become increasingly prevalent, decarbonizing our business is a critical step for Meta to do our part in connecting to a healthier planet, more resilient communities and a net zero reality. 
</p></div><div class="block-focus-areas-item block-focus-areas-item__inner"><figure class="wp-block-image size-large c2"><img data-recalc-dims="1" height="911" width="1620" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=1620&amp;resize=1620%2C911" alt="A person in safety glasses and a lab coat working on a mobile phone with small pliers." class="wp-image-7473" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=5120 5120w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=1536 1536w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/4_Meta-2025-Sustainability-Landing-Page_Pillar_RSC.jpg?w=3000 3000w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure><p class="has-grey-700-color has-text-color has-link-color wp-elements-c06fec86edd0c2511f481c3b9b3a5f9a">Meta is part of a complex value chain that impacts lives and communities around the globe and we strive to empower workers and protect the environment through open communication, initiatives that support safe working conditions and a deep understanding of core sustainability issues.<br /></p></div></div><div class="block-focus-areas__column"><div class="block-focus-areas-item block-focus-areas-item__inner"><figure class="wp-block-image size-large"><img data-recalc-dims="1" 
height="1078" width="1620" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=1620&amp;resize=1620%2C1078" alt="A view of the ocean, showing waves crashing." class="wp-image-7472" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=4928 4928w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=1536 1536w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/3_Meta-2025-Sustainability-Landing-Page_Pillar_Water.jpg?w=3000 3000w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure><p class="has-grey-700-color has-text-color has-link-color wp-elements-a8d74ab0427d17bae2388b9d7f9396fd">Water is a vital resource for life on earth, and we strive to connect its management to technical expertise and responsibility that help ensure healthy aquatic ecosystems.</p></div><div class="block-focus-areas-item block-focus-areas-item__inner"><figure class="wp-block-image size-full"><img data-recalc-dims="1" width="3200" height="2136" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?resize=3200%2C2136" alt="A nature scene of a meadow with wildflowers." 
class="wp-image-7474" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?w=3200 3200w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?w=1536 1536w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/5_Meta-2025-Sustainability-Landing-Page_Pillar_Biodiversity.jpg?w=3000 3000w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure><p class="has-grey-700-color has-text-color has-link-color wp-elements-2e625c83e33c340fb375d6e700ff9c30">Biodiversity supports ecosystem stability, ensuring that living things are able to thrive on our planet. 
We strive to support the biodiversity of the native ecosystems at our data center properties through actions that play a positive role and invest in the long term vitality of local communities.</p></div></div></div></div><div class="horizontal-slider"><div class="horizontal-slider--container"><div class="horizontal-slider-item wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center is-style-media-text-stretched"><figure class="wp-block-media-text__media"><img data-recalc-dims="1" width="8272" height="6200" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=1620&amp;resize=8272%2C6200" alt="A field with wind turbines at sunset." class="wp-image-7478 size-full" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=8272 8272w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=1536 1536w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/7_Meta-2025-Sustainability-Landing-Page_Highlight_Clean-and-Renewable-Energy.jpg?w=3000 3000w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure><div class="wp-block-media-text__content"><p>Meta-supported wind 
and solar projects are adding more than 15 gigawatts (GW) of clean and renewable energy to grids globally.</p><p><a class="wp-block-button__link wp-element-button" aria-label="Open Meta blog titled ‘Our approach to clean and renewable energy’" href="https://sustainability.atmeta.com/blog/2024/10/14/our-approach-to-clean-and-renewable-energy/" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div><div class="horizontal-slider-item wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center is-style-media-text-stretched"><figure class="wp-block-media-text__media"><img data-recalc-dims="1" width="3840" height="2160" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=1620&amp;resize=3840%2C2160" alt="A river with trees along the banks and plants moving with the current." class="wp-image-7554 size-full" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=3840 3840w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=1536 1536w, 
https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/8_Meta-2025-Sustainability-Landing-Page_Highlight_Water-Restoration_PLACEHOLDER-FOR-WATER-PROJECT-1.jpeg?w=3000 3000w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure><div class="wp-block-media-text__content"><p>Since 2017, we have funded more than 40 water restoration projects. In 2024 these projects restored over 1.6 billion gallons of water to high and medium water stress regions.</p><p><a class="wp-block-button__link wp-element-button" aria-label="Open the 2025 Sustainability Report section on water" href="https://sustainability.atmeta.com/asset/2025-sustainability-report#page=48" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div><div class="horizontal-slider-item wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center is-style-media-text-stretched"><figure class="wp-block-media-text__media"><img data-recalc-dims="1" width="2048" height="1536" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/9_Meta-2025-Sustainability-Landing-Page_Highlight_Data-Center-Campus-Biodiversity.jpeg?w=1620&amp;resize=2048%2C1536" alt="The Meta data center in Mesa, AZ." 
class="wp-image-7480 size-full" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/9_Meta-2025-Sustainability-Landing-Page_Highlight_Data-Center-Campus-Biodiversity.jpeg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/9_Meta-2025-Sustainability-Landing-Page_Highlight_Data-Center-Campus-Biodiversity.jpeg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/9_Meta-2025-Sustainability-Landing-Page_Highlight_Data-Center-Campus-Biodiversity.jpeg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/9_Meta-2025-Sustainability-Landing-Page_Highlight_Data-Center-Campus-Biodiversity.jpeg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/9_Meta-2025-Sustainability-Landing-Page_Highlight_Data-Center-Campus-Biodiversity.jpeg?w=1536 1536w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure><div class="wp-block-media-text__content"><p>More than 50% of our operational data center campus footprint, more than 4,000 acres, is planned, installed or preserved to intentionally support local habitats with native species.</p><p><a class="wp-block-button__link wp-element-button" aria-label="Open the 2025 Sustainability Report section on biodiversity" href="https://sustainability.atmeta.com/asset/2025-sustainability-report#page=61" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div><div class="horizontal-slider-item wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center is-style-media-text-stretched"><figure class="wp-block-media-text__media"><img data-recalc-dims="1" width="5616" height="3744" src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=1620&amp;resize=5616%2C3744" alt="Manufacturing workers in safety gear." 
class="wp-image-7481 size-full" srcset="https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=5616 5616w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=768 768w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=1620 1620w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=1536 1536w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=2048 2048w, https://sustainability.atmeta.com/wp-content/uploads/2025/08/10_Meta-2025-Sustainability-Landing-Page_Highlight_RSC.jpeg?w=3000 3000w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure><div class="wp-block-media-text__content"><p>We continued promoting safe process chemical management at supplier sites, led awareness raising and risk mitigation training and supported the substitution of hazardous chemicals for safer alternatives where feasible.</p><p><a class="wp-block-button__link wp-element-button" aria-label="Open Responsible Supply Chain landing page" href="https://sustainability.atmeta.com/responsible-supply-chain/" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div></div></div><div class="meta-stats-table meta-stats-table-single meta-stats-table-has-content"><div class="meta-stats-table-content-wrapper"><div class="meta-stats-table-content"><p>Building and delivering world-class AI capabilities is critical to our company’s near-term product and business success and long-term vision. We have invested in creating scalable infrastructure to support our needs today and for years to come. 
</p><p>Our vision blends high performance with a mix of custom solutions specific to our unique needs. This design requires fewer square feet to provide similar compute capacity to previous data center designs, improving delivery time and cost efficiency.</p></div></div></div><div class="wp-block-stepper-carousel stepper-carousel alignwide" role="region" aria-label="this is a title"><div class="stepper-carousel__wrapper"><div class="stepper-carousel__items stepper-carousel__items--desktop"><div class="wp-block-stepper-carousel-slide stepper-carousel-slide stepper-carousel-slide-container stepper-carousel-slide-content" role="tabpanel" aria-hidden="true" id="stepper-slide-1"><div class="stepper-carousel__media-item is-in-slide"><img src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/12_Meta-2025-Sustainability-Landing-Page_Data-Center-Feature_Energy-Innovation-1.jpg" alt="A nuclear energy facility in Illinois." /></div><p class="is-style-paragraph-overline">Energy innovation</p><div class="wp-block-group is-animating-group is-layout-constrained wp-block-group-is-layout-constrained"><p>With the goal of adding one to four GW of nuclear generation capacity in the US, Meta is working with developers that can accelerate the availability of new nuclear generators and scale sufficiently to reduce costs. 
</p><p><a class="wp-block-button__link wp-element-button" aria-label="Navigate to Meta’s blog titled ‘Accelerating the Next Wave of Nuclear to Power AI Innovation’" href="https://sustainability.atmeta.com/blog/2024/12/03/accelerating-the-next-wave-of-nuclear-to-power-ai-innovation/" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div><div class="wp-block-stepper-carousel-slide stepper-carousel-slide stepper-carousel-slide-container stepper-carousel-slide-content" role="tabpanel" aria-hidden="true" id="stepper-slide-2"><div class="stepper-carousel__media-item is-in-slide"><img src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/13_Meta-2025-Sustainability-Landing-Page_Program-Feature_Data-Center-Construction.jpg" alt="An image of engineered wood being used to construct a building." /></div><p class="is-style-paragraph-overline">Data center construction</p><div class="wp-block-group is-animating-group is-layout-constrained wp-block-group-is-layout-constrained"><p>To reduce the emissions associated with data center construction, we have begun piloting mass timber, a variety of wood-based products engineered for strength and durability, in the construction of buildings on our data center campuses. 
</p><p><a class="wp-block-button__link wp-element-button" aria-label="Navigate to Meta’s blog titled ‘Meta pilots mass timber for more sustainable data center construction’" href="https://sustainability.atmeta.com/blog/2025/07/31/meta-pilots-mass-timber-for-more-sustainable-data-center-construction/" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div><div class="wp-block-stepper-carousel-slide stepper-carousel-slide stepper-carousel-slide-container stepper-carousel-slide-content" role="tabpanel" aria-hidden="true" id="stepper-slide-3"><div class="stepper-carousel__media-item is-in-slide"><img src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/14_Meta-2025-Sustainability-Landing-Page_Program-Feature_Symbiosis-Coalition.jpg" alt="An image of evergreen treetops immersed in fog." /></div><p class="is-style-paragraph-overline">Symbiosis Coalition</p><div class="wp-block-group is-animating-group is-layout-constrained wp-block-group-is-layout-constrained"><p>As a member of the Symbiosis Coalition, Meta is helping to support the accelerated development of up to 20 million tons of nature-based carbon removal credits by 2030.</p><p><a class="wp-block-button__link wp-element-button" aria-label="Open the Symbiosis Coalition launch press release web page" href="https://www.symbiosiscoalition.org/perspectives/launch-press-release" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div></div><div class="stepper-carousel__items stepper-carousel__items--mobile"><div class="wp-block-stepper-carousel-slide stepper-carousel-slide stepper-carousel-slide-container stepper-carousel-slide-content" role="tabpanel" aria-hidden="true"><div class="stepper-carousel__media-item is-in-slide"><img src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/12_Meta-2025-Sustainability-Landing-Page_Data-Center-Feature_Energy-Innovation-1.jpg" alt="A nuclear energy facility in Illinois." 
/></div><p class="is-style-paragraph-overline">Energy innovation</p><div class="wp-block-group is-animating-group is-layout-constrained wp-block-group-is-layout-constrained"><p>With the goal of adding one to four GW of nuclear generation capacity in the US, Meta is working with developers that can accelerate the availability of new nuclear generators and scale sufficiently to reduce costs. </p><p><a class="wp-block-button__link wp-element-button" aria-label="Navigate to Meta’s blog titled ‘Accelerating the Next Wave of Nuclear to Power AI Innovation’" href="https://sustainability.atmeta.com/blog/2024/12/03/accelerating-the-next-wave-of-nuclear-to-power-ai-innovation/" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div><div class="wp-block-stepper-carousel-slide stepper-carousel-slide stepper-carousel-slide-container stepper-carousel-slide-content" role="tabpanel" aria-hidden="true"><div class="stepper-carousel__media-item is-in-slide"><img src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/13_Meta-2025-Sustainability-Landing-Page_Program-Feature_Data-Center-Construction.jpg" alt="An image of engineered wood being used to construct a building." /></div><p class="is-style-paragraph-overline">Data center construction</p><div class="wp-block-group is-animating-group is-layout-constrained wp-block-group-is-layout-constrained"><p>To reduce the emissions associated with data center construction, we have begun piloting mass timber, a variety of wood-based products engineered for strength and durability, in the construction of buildings on our data center campuses. 
</p><p><a class="wp-block-button__link wp-element-button" aria-label="Navigate to Meta’s blog titled ‘Meta pilots mass timber for more sustainable data center construction’" href="https://sustainability.atmeta.com/blog/2025/07/31/meta-pilots-mass-timber-for-more-sustainable-data-center-construction/" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div><div class="wp-block-stepper-carousel-slide stepper-carousel-slide stepper-carousel-slide-container stepper-carousel-slide-content" role="tabpanel" aria-hidden="true"><div class="stepper-carousel__media-item is-in-slide"><img src="https://sustainability.atmeta.com/wp-content/uploads/2025/08/14_Meta-2025-Sustainability-Landing-Page_Program-Feature_Symbiosis-Coalition.jpg" alt="An image of evergreen treetops immersed in fog." /></div><p class="is-style-paragraph-overline">Symbiosis Coalition</p><div class="wp-block-group is-animating-group is-layout-constrained wp-block-group-is-layout-constrained"><p>As a member of the Symbiosis Coalition, Meta is helping to support the accelerated development of up to 20 million tons of nature-based carbon removal credits by 2030.</p><p><a class="wp-block-button__link wp-element-button" aria-label="Open the Symbiosis Coalition launch press release web page" href="https://www.symbiosiscoalition.org/perspectives/launch-press-release" target="_blank" rel="noreferrer noopener">Learn more</a></p></div></div></div></div></div><div class="wp-block-group is-style-default is-layout-constrained wp-block-group-is-layout-constrained c3"><hr class="wp-block-separator has-alpha-channel-opacity" /><p class="has-text-align-left has-minus-1-font-size" id="citation"><sup>1</sup> Construction Waste is defined as waste materials generated during the construction, renovation and demolition of buildings and roads.</p></div></div></div><div class="gdprconsent-container gdprconsent-wrapper gdprconsent-content" id="GDPRConsentBar"><p>To help personalize content, tailor and measure ads and 
provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Meta through cookies. Learn more, including about available controls: <a href="https://sustainability.atmeta.com/policy">Cookie Policy</a></p></div>]]></description>
      <link>https://sustainability.atmeta.com/2025-sustainability-report/</link>
      <guid>https://sustainability.atmeta.com/2025-sustainability-report/</guid>
      <pubDate>Fri, 12 Sep 2025 18:36:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[A New Ranking Framework for Better Notification Quality on Instagram]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing how Meta is applying machine learning (ML) and diversity algorithms to improve notification quality and user experience. </li>
<li class="c1" aria-level="1">We’ve introduced a diversity-aware notification ranking framework to reduce uniformity and deliver a more varied and engaging mix of notifications.</li>
<li class="c1" aria-level="1">This new framework reduces the volume of notifications and drives higher engagement rates through more diverse outreach.</li>
</ul><p>Notifications are one of the most powerful tools for bringing people back to Instagram and enhancing engagement. Whether it’s a friend liking your photo, another close friend posting a story, or a suggestion for a reel you might enjoy, notifications help surface moments that matter in real time.</p>
<p>Instagram leverages machine learning (ML) models to decide who should get a notification, when to send it, and what content to include. These models are trained to optimize for positive user engagement, such as click-through rate (CTR) – the probability of a user clicking a notification – as well as other metrics like time spent.</p>
<p>However, while engagement-optimized models are effective at driving interactions, there’s a risk that they might overprioritize the product types and authors someone has previously engaged with. This can lead to overexposure to the same creators or the same product types while overlooking other valuable and diverse experiences. </p>
<p>This means people could miss out on content that would give them a more balanced, satisfying, and enriched experience. Over time, this can make notifications feel spammy and increase the likelihood that people will disable them altogether. </p>
<p>The real challenge lies in finding the right balance: How can we introduce meaningful diversity into the notification experience without sacrificing the personalization and relevance people on Instagram have come to expect?</p>
<p>To tackle this, we’ve introduced a diversity-aware notification ranking framework that helps deliver more diverse, better curated, and less repetitive notifications. This framework has significantly reduced daily notification volume while improving CTR. It also introduces several benefits:</p>
<ul><li class="c1" aria-level="1">The extensibility of incorporating customized soft penalty (demotion) logic for each dimension, enabling more adaptive and sophisticated diversity strategies.</li>
<li class="c1" aria-level="1">The flexibility of tuning demotion strength across dimensions like content, author, and product type via adjustable weights.</li>
<li class="c1" aria-level="1">The integration of balancing personalization and diversity, ensuring notifications remain both relevant and varied.</li>
</ul><h2>The Risks of Notifications without Diversity</h2>
<p>The issue of overexposure in notifications often shows up in two major ways:</p>
<p><strong>Overexposure to the same author:</strong> People might receive notifications that are mostly about the same friend. For example, if someone often interacts with content from a particular friend, the system may continue surfacing notifications from that person alone – ignoring other friends they also engage with. This can feel repetitive and one-dimensional, reducing the overall value of notifications.</p>
<p><strong>Overexposure to the same product surface:</strong> People might mostly receive notifications from the same product surface such as Stories, even when Feed or Reels could provide value. For example, someone may be interested in both reel and story notifications but has recently interacted more often with stories. Because the system heavily prioritizes past engagement, it sends only story notifications, overlooking the person’s broader interests. </p>
<h2>Introducing Instagram’s Diversity-Aware Notification Ranking Framework</h2>
<p>Instagram’s diversity-aware notification ranking framework is designed to enhance the notification experience by balancing the predicted potential for user engagement with the need for content diversity. This framework introduces a diversity layer on top of the existing engagement ML models, applying multiplicative penalties to the candidate scores generated by these models, as Figure 1, below, shows.</p>
<p>The diversity layer evaluates each notification candidate’s similarity to recently sent notifications across multiple dimensions such as content, author, notification type, and product surface. It then applies carefully calibrated penalties—expressed as multiplicative demotion factors—to downrank candidates that are too similar or repetitive. The adjusted scores are used to re-rank the candidates, enabling the system to select notifications that maintain high engagement potential while introducing meaningful diversity. In the end, the quality bar selects the top-ranked candidate that passes both the ranking and diversity criteria.</p>
<figure id="attachment_22893" aria-describedby="caption-attachment-22893" class="wp-caption alignnone c2"><img class="size-full wp-image-22893" src="https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png" alt="" width="4000" height="2250" srcset="https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png 5530w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=2048,1152 2048w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/09/A-new-ranking-framework-for-better-notification-quality-on-Instagram.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22893" class="wp-caption-text">Figure.1: Instagram’s diversity-aware ranking framework where the diversity layer sits on top of the existing modeling layer and penalizes notifications that are too similar to recently sent ones.</figcaption></figure><h3>Mathematical Formulation </h3>
<p>Within the diversity layer, we apply a multiplicative demotion factor to the base relevance score of each candidate. Given a notification candidate 𝑐, we compute its final score as the product of its base ranking score and a diversity demotion multiplier:</p>
<p class="c3"><em><img src="https://s0.wp.com/latex.php?latex=%5Ctext%7BScore%7D%28c%29+%3D+R%28c%29+%5Ctimes+D%28c%29+&amp;bg=ffffff&amp;fg=000&amp;s=0&amp;c=20201002" alt="\text{Score}(c) = R(c) \times D(c)" class="latex" /><br /></em></p>
<p>where <em>R(c)</em> represents the candidate’s base relevance score, and <em>D(c) ∈ [0,1]</em> is a penalty factor that reduces the score based on similarity to recently sent notifications. We define a set of semantic dimensions (e.g., author, product type) along which we want to promote diversity. For each dimension <em>i,</em> we compute a similarity signal <em>p</em><em><sub>i</sub></em><em>(c)</em> between candidate <em>c</em> and the set of historical notifications <em>H</em>, using a maximal marginal relevance (MMR) approach:</p>
<p class="c3">
</p><p><em>where sim<sub>i</sub>(·,·) is a predefined similarity function for dimension i. In our baseline implementation, p<sub>i</sub>(c) is binary: it equals 1 if the similarity exceeds a threshold 𝜏<sub>i</sub> and 0 otherwise. </em></p>
<p>The final demotion multiplier is defined as: </p>
<p class="c3">
</p><p><em>where each w<sub>i </sub> ∈ [0,1] controls the strength of demotion for its respective dimension. This formulation ensures that candidates similar to previously delivered notifications along one or more dimensions are proportionally down-weighted, reducing redundancy and promoting content variation. The use of a multiplicative penalty allows for flexible control across multiple dimensions, while still preserving high-relevance candidates.</em></p>
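<p>As a minimal illustration of the scoring described above – not Instagram’s production code – here is a Python sketch assuming binary per-dimension similarity signals and a toy exact-match similarity function:</p>

```python
# Hypothetical notification representation: a dict of dimension -> value,
# plus a base relevance score R(c) from the engagement models.
def sim(dim, a, b):
    # Toy similarity for illustration: 1.0 if the two notifications share
    # the same value along this dimension (e.g., same author), else 0.0.
    return 1.0 if a[dim] == b[dim] else 0.0

def final_score(candidate, base_score, history, weights, thresholds):
    """Score(c) = R(c) * D(c), where D(c) = prod_i (1 - w_i * p_i(c))."""
    d = 1.0
    for dim, w in weights.items():
        # p_i(c) is binary: 1 if similarity to any recently sent
        # notification along dimension i exceeds the threshold tau_i.
        max_sim = max((sim(dim, candidate, h) for h in history), default=0.0)
        p = 1.0 if max_sim > thresholds[dim] else 0.0
        d *= 1.0 - w * p
    return base_score * d

history = [{"author": "alice", "product": "stories"}]
weights = {"author": 0.5, "product": 0.3}
thresholds = {"author": 0.5, "product": 0.5}

# A candidate repeating both author and product surface is demoted twice:
repeat = final_score({"author": "alice", "product": "stories"}, 1.0,
                     history, weights, thresholds)  # 1.0 * 0.5 * 0.7 = 0.35
# A candidate from a new author on a new surface keeps its base score:
fresh = final_score({"author": "bob", "product": "reels"}, 0.8,
                    history, weights, thresholds)   # 0.8 * 1.0 = 0.8
```

<p>Note how the repeat candidate’s score drops below the fresh candidate’s despite a higher base relevance score, which is exactly the re-ranking effect the diversity layer produces.</p>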
<h2>The Future of Diversity-Aware Ranking</h2>
<p>As we continue evolving our notification diversity-aware ranking system, a next step is to introduce more adaptive, dynamic demotion strategies. Instead of relying on static rules, we plan to make demotion strength responsive to notification volume and delivery timing. For example, as a user receives more notifications—especially of similar type or in rapid succession—the system progressively applies stronger penalties to new notification candidates, effectively mitigating overwhelming experiences caused by high notification volume or tightly spaced deliveries.</p>
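<p>One way the volume-responsive demotion described above could work – a speculative sketch, since this is a planned direction rather than a shipped design – is to scale a dimension’s demotion weight by the number of similar notifications delivered in a recent window:</p>

```python
from datetime import datetime, timedelta

def adaptive_weight(base_weight, sent_times, now,
                    window=timedelta(hours=1), step=0.1, cap=1.0):
    """Hypothetical volume-responsive demotion: the more similar
    notifications delivered inside the recent window, the stronger
    the penalty weight, capped at `cap`."""
    recent = sum(1 for t in sent_times if now - t <= window)
    return min(cap, base_weight + step * recent)

now = datetime(2025, 9, 2, 12, 0)
# Three similar notifications in the last hour push the weight from 0.3 to 0.6:
w = adaptive_weight(0.3, [now - timedelta(minutes=m) for m in (5, 20, 45)], now)
```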
<p>Longer term, we see an opportunity to bring large language models (LLMs) into the diversity pipeline. LLMs can help us go beyond surface-level rules by understanding semantic similarity between messages and rephrasing content in more varied, user-friendly ways. This would allow us to personalize notification experiences with richer language and improved relevance while maintaining diversity across topics, tone, and timing.</p>]]></description>
      <link>https://engineering.fb.com/2025/09/02/ml-applications/a-new-ranking-framework-for-better-notification-quality-on-instagram/</link>
      <guid>https://engineering.fb.com/2025/09/02/ml-applications/a-new-ranking-framework-for-better-notification-quality-on-instagram/</guid>
      <pubDate>Tue, 02 Sep 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Enabling Kotlin Incremental Compilation on Buck2]]></title>
      <description><![CDATA[<p>The Kotlin incremental compiler has been a true gem for developers chasing faster compilation since its introduction in build tools. Now, we’re excited to bring its benefits to <a href="https://buck2.build/">Buck2</a> –  Meta’s build system – to unlock even more speed and efficiency for Kotlin developers.</p>
<p>Unlike a traditional compiler that recompiles an entire module every time, an incremental compiler focuses only on what was changed. This cuts down compilation time in a big way, especially when modules contain a large number of source files.</p>
<p>Buck2 promotes small modules as a key strategy for achieving fast build times. Our codebase followed that principle closely, and for a long time, it worked well. With only a handful of files in each module, and Buck2’s support for fast incremental builds and parallel execution, incremental compilation didn’t seem like something we needed.</p>
<p>But, let’s be real: Codebases grow, teams change, and reality sometimes drifts away from the original plan. Over time, some modules started getting bigger – either from legacy or just organic growth. And while big modules were still the exception, they started having quite an impact on build times.</p>
<p>So we gave the Kotlin incremental compiler a closer look – and we’re glad we did. The results? Some <strong>critical modules now build up to 3x faster</strong>. That’s a big win for developer productivity and overall build happiness. </p>
<p>Curious about how we made it all work in Buck2? Keep reading. We’ll <strong>walk you through the steps we took to bring the Kotlin incremental compiler to life</strong> in our Android toolchain.</p>
<h2>Step 1: Integrating Kotlin’s Build Tools API</h2>
<p>As of Kotlin 2.2.0, the only guaranteed public contract to use the compiler is through the command-line interface (CLI). But since the CLI doesn’t support incremental compilation (at least for now), it didn’t meet our needs. Alternatively, we could integrate the Kotlin incremental compiler directly via the compiler’s internal components – APIs that are technically accessible but not intended for public use. However, relying on them would’ve made our toolchain fragile and likely to break with every Kotlin update since there’s no guarantee of backward compatibility. That didn’t seem like the right path either.</p>
<p>Then we came across the Build Tools API (<a href="https://github.com/Kotlin/KEEP/issues/421">KEEP</a>), introduced in Kotlin 1.9.20 as the official integration point for the compiler – including support for incremental compilation. Although the API was still marked as experimental, we decided to give it a try. We knew it would eventually stabilize, and saw it as a great opportunity to get in early, provide feedback, and help shape its direction. Compared to using internal components, it offered a far more sustainable and future-proof approach to integration.</p>
<h3>⚠️ Depending on kotlin-compiler? Watch out!</h3>
<p>In the Java world, a <em>shaded</em> library is a modified version of a library in which the class and package names are changed. This process – called shading – is a handy way to avoid classpath conflicts, prevent version clashes between libraries, and keep internal details from leaking out.</p>
<p>Here’s a quick example:</p>
<ul><li class="c1" aria-level="1">Unshaded (original) class: com.intellij.util.io.DataExternalizer</li>
<li class="c1" aria-level="1">Shaded class: org.jetbrains.kotlin.com.intellij.util.io.DataExternalizer</li>
</ul><p>The Build Tools API depends on the <em>shaded</em> version of the Kotlin compiler (kotlin-compiler-embeddable). But our Android toolchain was historically built with the <em>unshaded</em> one (kotlin-compiler). That mismatch led to java.lang.NoClassDefFoundError crashes when testing the integration because the shaded classes simply weren’t on the classpath.</p>
<p>Replacing the unshaded compiler across the entire Android toolchain would’ve been a big effort. So to keep moving forward, we went with a quick workaround: We unshaded the Build Tools API instead. 🙈 Using the <a href="https://github.com/google/jarjar">jarjar</a> library, we stripped the org.jetbrains.kotlin prefix from class names and rebuilt the library.</p>
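<p>Conceptually, the renaming jarjar performed for us amounts to stripping the shading prefix from fully qualified class names. A toy Python illustration of that mapping (not the jarjar implementation itself):</p>

```python
SHADING_PREFIX = "org.jetbrains.kotlin."

def unshade(class_name: str) -> str:
    """Strip the shading prefix so a shaded class name resolves against
    the unshaded compiler classes on our classpath (illustrative only)."""
    if class_name.startswith(SHADING_PREFIX):
        return class_name[len(SHADING_PREFIX):]
    return class_name

unshaded = unshade("org.jetbrains.kotlin.com.intellij.util.io.DataExternalizer")
# -> "com.intellij.util.io.DataExternalizer"
```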
<p>Don’t worry, once we had a working prototype and confirmed everything behaved as expected, we circled back and did it right – fully migrating our toolchain to use the shaded Kotlin compiler. That brought us back in line with the API’s expectations and gave us a more stable setup for the future.</p>
<h2>Step 2: Keeping previous output around for the incremental compiler</h2>
<p>To compile incrementally, the Kotlin compiler needs access to the output from the previous build. Simple enough, but Buck2 deletes that output by default before rebuilding a module. </p>
<p>With <a href="https://buck2.build/docs/rule_authors/incremental_actions/">incremental actions</a>, you can configure Buck2 to skip the automatic cleanup of previous outputs. This gives your build actions access to everything from the last run. The tradeoff is that it’s now up to you to figure out what’s still useful and manually clean up the rest. It’s a bit more work, but it’s exactly what we needed to make incremental compilation possible.</p>
<h2>Step 3: Making the incremental compiler cache relocatable</h2>
<p>At first, this might not seem like a big deal. You’re not planning to move your codebase around, so why worry about making the cache relocatable, right?</p>
<p>Well… that’s until you realize you’re no longer in a tiny team, and you’re definitely not the only one building the project. Suddenly, it does matter.</p>
<p>Buck2 supports <a href="https://buck2.build/docs/users/remote_execution/">distributed builds</a>, which means your builds don’t have to run only on your local machine. They can be executed elsewhere, with the results sent back to you. And if your compiler cache isn’t relocatable, this setup can quickly lead to trouble – from conflicting overloads to strange ambiguity errors caused by mismatched paths in cached data.</p>
<p>So we made sure to configure the root project directory and the build directory explicitly in the incremental compilation settings. This keeps the compiler cache stable and reliable, no matter who runs the build or where it happens.</p>
<h2>Step 4: Configuring the incremental compiler</h2>
<p>In a nutshell, to decide what needs to be recompiled, the Kotlin incremental compiler looks for changes in two places:</p>
<ul><li class="c1" aria-level="1">Files within the module being rebuilt.</li>
<li class="c1" aria-level="1">The module’s dependencies.</li>
</ul><p>Once the changes are found, the compiler figures out which files in the module are affected – whether by direct edits or through updated dependencies – and recompiles only those.</p>
<p>To get this process rolling, the compiler needs just a little nudge to understand how much work it really has to do.</p>
<p>So let’s give it that nudge!</p>
<h3>Tracking changes inside the module</h3>
<p>When it comes to tracking changes, you’ve got two options: You can either let the compiler do its magic and detect changes automatically, or you can give it a hand by passing a list of modified files yourself. The first option is great if you don’t know which files have changed or if you just want to get something working quickly (like we did during prototyping). However, if you’re on a Kotlin version earlier than 2.1.20, you have to provide this information yourself. Automatic source change detection via the Build Tools API isn’t available prior to that. Even with newer versions, if the build tool already has the change list before compilation, it’s still worth using it to optimize the process.</p>
<p>This is where Buck’s incremental actions come in handy again! Not only can we preserve the output from the previous run, but we also get hash digests for every action input. By comparing those hashes with the ones from the last build, we can generate a list of changed files. From there, we pass that list to the compiler to kick off incremental compilation right away – no need for the compiler to do any change detection on its own.</p>
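<p>The digest comparison just described can be sketched as follows – a simplified Python model, with the hash maps standing in for the per-input digests Buck2’s incremental actions expose:</p>

```python
def changed_files(previous_digests, current_digests):
    """Given input-path -> hash maps from the last and current runs,
    return the sources to hand to the incremental compiler: modified
    and added files, plus the files removed since the last run."""
    modified = [p for p, h in current_digests.items()
                if p in previous_digests and previous_digests[p] != h]
    added = [p for p in current_digests if p not in previous_digests]
    removed = [p for p in previous_digests if p not in current_digests]
    return sorted(modified + added), sorted(removed)

prev = {"A.kt": "h1", "B.kt": "h2", "C.kt": "h3"}
curr = {"A.kt": "h1", "B.kt": "h2x", "D.kt": "h4"}
to_compile, deleted = changed_files(prev, curr)
# to_compile == ["B.kt", "D.kt"]; deleted == ["C.kt"]
```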
<h3>Tracking changes in dependencies</h3>
<p>Sometimes it’s not the module itself that changes, it’s something the module depends on. In these cases, the compiler relies on classpath snapshots. These snapshots capture the Application Binary Interface (ABI) of a library. By comparing the current snapshots to the previous ones, the compiler can detect changes in dependencies and figure out which files in your module are affected. This adds an extra layer of filtering on top of standard compilation avoidance.</p>
<p>In Buck2, we added a dedicated action to generate classpath snapshots from library outputs. This artifact is then passed as an input to the consuming module, right alongside the library’s compiled output. The best part? Since it’s a separate action, it can be run remotely or be pulled from cache, so your machine doesn’t have to do the heavy lifting of extracting ABI at this step.</p>
<p><img class="size-full wp-image-22835 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2025/08/classpath-snapshots-for-abi-4.png" alt="" width="506" height="552" srcset="https://engineering.fb.com/wp-content/uploads/2025/08/classpath-snapshots-for-abi-4.png 506w, https://engineering.fb.com/wp-content/uploads/2025/08/classpath-snapshots-for-abi-4.png?resize=96,105 96w, https://engineering.fb.com/wp-content/uploads/2025/08/classpath-snapshots-for-abi-4.png?resize=192,209 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>If, after all, only your module changes but your dependencies do not, the API also lets you skip the snapshot comparison entirely if your build tool handles the dependency analysis on its own. Since we already had the necessary data from Buck2’s incremental actions, adding this optimization was almost free.</p>
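<p>A classpath-snapshot comparison can be pictured as diffing per-class ABI hashes. This is a toy model for intuition – real snapshots are produced and compared by the Kotlin toolchain:</p>

```python
def abi_changes(previous_snapshot, current_snapshot):
    """Diff two classpath snapshots (class name -> ABI hash). Only classes
    whose ABI hash changed or that disappeared force recompilation of
    dependents; implementation-only changes leave the hash intact."""
    changed = {c for c, h in current_snapshot.items()
               if previous_snapshot.get(c) != h}
    removed = set(previous_snapshot) - set(current_snapshot)
    return changed | removed

prev = {"lib/Foo": "abi1", "lib/Bar": "abi2"}
curr = {"lib/Foo": "abi1", "lib/Bar": "abi3"}  # Bar's public API changed
dirty = abi_changes(prev, curr)  # only lib/Bar's dependents need recompiling
```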
<h2>Step 5: Making compiler plugins work with the incremental compiler</h2>
<p>One of the biggest challenges we faced when integrating the incremental compiler was making it play nicely with our custom compiler plugins, many of which are important to our build optimization strategy. This step was necessary for unlocking the full performance benefits of incremental compilation, but it came with two major issues we needed to solve.</p>
<h3>🚨 Problem 1: Incomplete results</h3>
<p>As we already know, the input to the incremental compiler does not have to include all Kotlin source files. Our plugins weren’t designed for this and ended up producing incomplete results when run on just a subset of files. We had to make them incremental as well so they could handle partial inputs correctly.</p>
<p><img class="size-full wp-image-22836 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2025/08/incremental-compiler.png" alt="" width="775" height="640" srcset="https://engineering.fb.com/wp-content/uploads/2025/08/incremental-compiler.png 775w, https://engineering.fb.com/wp-content/uploads/2025/08/incremental-compiler.png?resize=768,634 768w, https://engineering.fb.com/wp-content/uploads/2025/08/incremental-compiler.png?resize=96,79 96w, https://engineering.fb.com/wp-content/uploads/2025/08/incremental-compiler.png?resize=192,159 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h3>🚨 Problem 2: Multiple rounds of compilation</h3>
<p>The Kotlin incremental compiler doesn’t just recompile the files that changed in a module. It may also need to recompile other files in the same module that are affected by those changes. Figuring out the exact set of affected files is tricky, especially when circular dependencies come into play. To handle this, the incremental compiler approximates the affected set by compiling in multiple rounds within a single build.</p>
<p><em>💡Curious how that works under the hood? The</em> <a href="https://blog.jetbrains.com/kotlin/2020/09/the-dark-secrets-of-fast-compilation-for-kotlin/"><em>Kotlin blog on fast compilation</em></a> <em>has a great deep dive that’s worth checking out.</em></p>
<p>This behavior comes with a side effect, though. Since the compiler may run in multiple rounds with different sets of files, compiler plugins can also be triggered multiple times, each time with a different input. That can be problematic, as later plugin runs may override outputs produced by earlier ones. To avoid this, we updated our plugins to accumulate their results across rounds rather than replacing them.</p>
<p><img class="size-full wp-image-22837 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2025/08/multiple-rounds.png" alt="" width="1527" height="726" srcset="https://engineering.fb.com/wp-content/uploads/2025/08/multiple-rounds.png 1527w, https://engineering.fb.com/wp-content/uploads/2025/08/multiple-rounds.png?resize=916,436 916w, https://engineering.fb.com/wp-content/uploads/2025/08/multiple-rounds.png?resize=768,365 768w, https://engineering.fb.com/wp-content/uploads/2025/08/multiple-rounds.png?resize=1024,487 1024w, https://engineering.fb.com/wp-content/uploads/2025/08/multiple-rounds.png?resize=96,46 96w, https://engineering.fb.com/wp-content/uploads/2025/08/multiple-rounds.png?resize=192,91 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>Step 6: Verifying the functionality of annotation processors</h2>
<p>Most of our annotation processors use Kotlin Symbol Processing (KSP2), which made this step pretty smooth. KSP2 is designed as a standalone tool that uses the Kotlin Analysis API to analyze source code. Unlike compiler plugins, it runs independently from the standard compilation flow. Thanks to this setup, we were able to continue using KSP2 without any changes.</p>
<p><em>💡 Bonus: KSP2 comes with its own built-in incremental processing support. It’s fully self-contained and doesn’t depend on the incremental compiler at all. </em></p>
<p>Before we adopted KSP2 (or when we were using an older version of the Kotlin Annotation Processing Tool (KAPT), which operates as a plugin), our annotation processors ran in a separate step dedicated solely to annotation processing. That step ran before the main compilation and was always non-incremental.</p>
<h2>Step 7: Enabling compilation against ABI</h2>
<p>To maximize cache hits, Buck2 builds Android modules against the class ABI instead of the full JAR. For Kotlin targets, we use the jvm-abi-gen compiler plugin to generate class ABI during compilation.</p>
<p>But once we turned on incremental compilation, a couple of new challenges popped up:</p>
<ol><li class="c1" aria-level="1">The jvm-abi-gen plugin currently lacks direct support for incremental compilation, which ties back to the issues we mentioned earlier with compiler plugins.</li>
<li class="c1" aria-level="1">ABI extraction now happens twice – once during compilation via jvm-abi-gen, and again when the incremental compiler creates classpath snapshots.</li>
</ol><p>Both problems could in principle be solved by switching to full JAR compilation and relying on classpath snapshots to maintain cache hits. But that would mean giving up some of the build optimizations we’ve already got in place – a trade-off that needs careful evaluation before making any changes.</p>
<p>For now, we’ve implemented a custom (yet suboptimal) solution that merges the newly generated ABI with the previous result. It gets the job done, but we’re still actively exploring better long-term alternatives.</p>
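<p>The merge idea can be sketched as follows, modeling class ABIs as a map from class name to ABI digest (a hypothetical sketch of the approach, not the actual implementation):</p>

```python
def merge_abi(previous, current, removed):
    """Merge the ABI produced by an incremental compile (`current`,
    covering only recompiled classes) into the previous full ABI.
    Classes deleted since the last build are dropped; recompiled
    classes override their old entries; untouched classes survive."""
    merged = dict(previous)
    for cls in removed:
        merged.pop(cls, None)
    merged.update(current)
    return merged

prev = {"Foo": "abi1", "Bar": "abi2", "Old": "abi0"}
new = {"Foo": "abi1-v2", "Baz": "abi3"}
full = merge_abi(prev, new, removed={"Old"})
```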
<p>Ideally, we’d be able to reuse the information already collected for classpath snapshot or, even better, have this kind of support built directly into the Kotlin compiler. There’s an open ticket for that: <a href="https://youtrack.jetbrains.com/issue/KT-62881/Pass-to-the-compilation-only-ABI-snapshot-of-the-classpath">KT-62881</a>. Fingers crossed!</p>
<h2>Step 8: Testing</h2>
<p>Measuring the impact of build changes is not an easy task. Benchmarking is great for getting a sense of a feature’s potential, but it doesn’t always reflect how things perform in “the real world.” Pre/post testing can help with that, but it’s tough to isolate the impact of a single change, especially when you’re not the only one pushing code. </p>
<p>We set up A/B testing to overcome these obstacles and measure the true impact of the Kotlin incremental compiler on Meta’s codebase with high confidence. It took a bit of extra work to keep the cache healthy across variants, but it gave us a clean, isolated view of how much difference the incremental compiler really made at scale.</p>
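<p>The core of such an analysis is just a relative comparison between the two variants; a simplified sketch (real experiment analysis would also check statistical significance):</p>

```python
def relative_improvement(control_times, test_times):
    """Average per-build improvement of the test variant over control,
    as a fraction (0.30 means 30% faster)."""
    mean = lambda xs: sum(xs) / len(xs)
    return 1 - mean(test_times) / mean(control_times)

control = [100, 110, 90, 100]  # seconds, incremental builds, IC off
test = [70, 77, 63, 70]        # same builds with incremental compilation
improvement = relative_improvement(control, test)
```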
<p>We started with the largest modules – the ones we already knew were slowing builds the most. Given their size and known impact, we expected to see benefits quickly. And sure enough, we did.</p>
<h2>The impact of incremental compilation </h2>
<p>The graph below shows early results on how enabling incremental compilation for selected targets impacts their local build times during incremental builds over a 4-week period. This includes not just compilation but also annotation processing and a few other optimizations we’ve added along the way.</p>
<p>With incremental compilation, we’ve seen about a 30% improvement for the average developer. And for modules without annotation processing, the speed nearly doubled. That was more than enough to convince us that the incremental compiler is here to stay. </p>
<p><img class="size-full wp-image-22838 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2025/08/kotlin-modules.png" alt="" width="994" height="612" srcset="https://engineering.fb.com/wp-content/uploads/2025/08/kotlin-modules.png 994w, https://engineering.fb.com/wp-content/uploads/2025/08/kotlin-modules.png?resize=916,564 916w, https://engineering.fb.com/wp-content/uploads/2025/08/kotlin-modules.png?resize=768,473 768w, https://engineering.fb.com/wp-content/uploads/2025/08/kotlin-modules.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/08/kotlin-modules.png?resize=192,118 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>What’s next</h2>
<p>Kotlin incremental compilation is now supported in Buck2, and we’re actively rolling it out across our codebase! For now, it’s available for internal use only, but we’re working on bringing it to the recently introduced <a href="https://github.com/facebook/buck2">open source</a> toolchain as well.</p>
<p>But that’s not all! We’re also exploring ways to expand incrementality across the entire Android toolchain, including tools like Kosabi (the Kotlin counterpart to <a href="https://engineering.fb.com/2017/11/09/android/rethinking-android-app-compilation-with-buck/">Jasabi</a>), to deliver even faster build times and an even better developer experience.</p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/">open source site</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource">Facebook</a>, <a href="https://www.threads.net/@metaopensource">Threads</a>, <a href="https://x.com/MetaOpenSource">X</a> and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw">LinkedIn</a>.</p>]]></description>
      <link>https://engineering.fb.com/2025/08/26/open-source/enabling-kotlin-incremental-compilation-on-buck2/</link>
      <guid>https://engineering.fb.com/2025/08/26/open-source/enabling-kotlin-incremental-compilation-on-buck2/</guid>
      <pubDate>Tue, 26 Aug 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Creating AI agent solutions for warehouse data access and security]]></title>
<description><![CDATA[<ul><li class="c1" aria-level="1">In this post, we explore how we’re evolving Meta’s data warehouse to improve productivity and security for both human users and AI agents. </li>
<li class="c1" aria-level="1">We detail how we’re developing agents that help users making data-access requests get to the data they need, and that help data owners process requests and maintain security. </li>
<li class="c1" aria-level="1">We share how we’re using guardrails, including auditing and feedback systems, to ensure agents operate within set boundaries and to evaluate the overall process.</li>
</ul><p>As part of its offline data systems, Meta operates a data warehouse that supports use cases across analytics, ML, and AI. Given the sheer volume of data, the scale of our systems, and the diverse use cases built on top, data warehouse security is very important. Teams across Meta both manage access and use that data in their day-to-day work. As the scale continues to grow and the data access patterns become more complex, the complexity of managing access and the time spent to obtain access keep increasing. How do we minimize security risks and enable teams to operate efficiently? With the rise of GenAI and agents, we needed to rethink how we can enhance security and productivity with agents, making them an integral part of our internal data products, capable of both streamlining data access and minimizing risk. In this post, we will share our work. </p>
<div class="jetpack-video-wrapper"><iframe title="Agentic Solution for Data Warehouse Access by Can Lin &amp; Uday Ramesh Savagaonkar" width="1778" height="1000" src="https://www.youtube.com/embed/qT1Il-pzQGQ?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Background</h2>
<p>In the past, we scaled data access and management by organizing the data warehouse into a hierarchical structure, as shown below in Figure 1. At the leaves of this hierarchy are tables, with pipelines producing them and dashboards consuming them. On-calls manage these assets, followed by teams and organizational hierarchies. Using role-based access control, we further modeled business needs as roles, aligning with this hierarchical structure and other dimensions, such as data semantics. </p>
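<p>A role-based check against such a hierarchy can be sketched like this (hypothetical role and path names; a role granting a prefix of an asset’s path grants access to the whole subtree):</p>

```python
def has_access(user_roles, role_grants, asset_path):
    """Hierarchical RBAC sketch: access is allowed if any of the
    user's roles grants a path prefix covering the asset."""
    for role in user_roles:
        for prefix in role_grants.get(role, []):
            if asset_path == prefix or asset_path.startswith(prefix + "/"):
                return True
    return False

grants = {"growth_analyst": ["org/growth"], "ml_eng": ["org/ml/features"]}
ok = has_access(["growth_analyst"], grants, "org/growth/team_a/daily_metrics")
denied = has_access(["growth_analyst"], grants, "org/ml/features/user_embeddings")
```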
<p>However, with the growth in AI, timely access to data has become increasingly important. As the scale of data warehouses continues to grow and data access patterns become more complex, the complexity and amount of time spent on managing and obtaining access keep increasing. </p>
<figure id="attachment_22709" aria-describedby="caption-attachment-22709" class="wp-caption alignnone c2"><img class="size-full wp-image-22709" src="https://engineering.fb.com/wp-content/uploads/2025/07/image3.png" alt="" width="846" height="546" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image3.png 846w, https://engineering.fb.com/wp-content/uploads/2025/07/image3.png?resize=768,496 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image3.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image3.png?resize=192,124 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22709" class="wp-caption-text">Figure 1: Data warehouse in resource hierarchy</figcaption></figure><p>To understand how we have handled this traditionally, and why that is becoming increasingly challenging, it helps to visualize the data flow through Meta’s offline infrastructure as a graph, as shown in Figure 2 below. Each node in this graph is an asset, such as a table, a column, or a dashboard, and each edge in this graph is an activity. </p>
<p>Traditionally, when most of the data analytics was rules-driven, this graph was highly partitioned, and all data-access decisions were local. Engineers who wanted to use the data could discover the data by asking their teammates or looking at other people’s code. On the access-approval front, access could be granted to members of closely related teams. But with the ability of AI systems to process data across different portions of the data graph, such human-driven decisions are becoming challenging.</p>
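<p>Modeling the warehouse as a graph also lets us express what made those decisions “local”: the asset and everything it touches lived in the same partition. A toy sketch (hypothetical asset and partition names):</p>

```python
from collections import defaultdict

def build_graph(activities):
    """Assets as nodes, activities (reads/writes) as undirected edges."""
    graph = defaultdict(set)
    for src, dst in activities:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph

def is_local(graph, partition_of, requester_partition, asset):
    # Local decision: the asset and its immediate neighborhood all
    # belong to the requester's own partition of the data graph.
    neighborhood = {asset} | graph[asset]
    return all(partition_of[a] == requester_partition for a in neighborhood)

g = build_graph([("raw_events", "daily_agg"), ("daily_agg", "dashboard")])
parts = {"raw_events": "growth", "daily_agg": "growth", "dashboard": "ml"}
local = is_local(g, parts, "growth", "raw_events")  # stays within growth
cross = is_local(g, parts, "growth", "daily_agg")   # touches an ml asset
```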
<figure id="attachment_22710" aria-describedby="caption-attachment-22710" class="wp-caption alignnone c3"><img class="size-full wp-image-22710" src="https://engineering.fb.com/wp-content/uploads/2025/07/image5.png" alt="" width="766" height="478" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image5.png 766w, https://engineering.fb.com/wp-content/uploads/2025/07/image5.png?resize=96,60 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image5.png?resize=192,120 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22710" class="wp-caption-text"><em>Figure 2: Data warehouse as a data graph</em></figcaption></figure><p>Figure 3 below illustrates that as humans and agents work more frequently across domains, the system complexity increases, both in terms of data scale and the dynamics of access, with AI being a major driver of these complex access patterns. However, we believe AI can also offer solutions to these challenges. With recent advancements in AI, particularly with agents, we’ve needed to evolve our previous approach to an agentic solution for data access. Additionally, while the system was originally designed for humans to operate and to serve both humans and services, we needed to adapt it for agents and humans working together. The new agentic workflow must natively integrate into data products and create a streamlined experience. We also must create strict guardrails, such as analytical rule-based risk assessment, to safeguard the agents.</p>
<figure id="attachment_22711" aria-describedby="caption-attachment-22711" class="wp-caption alignnone c4"><img class="size-full wp-image-22711" src="https://engineering.fb.com/wp-content/uploads/2025/07/image4.png" alt="" width="1594" height="626" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image4.png 1594w, https://engineering.fb.com/wp-content/uploads/2025/07/image4.png?resize=916,360 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image4.png?resize=768,302 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image4.png?resize=1024,402 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image4.png?resize=1536,603 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/image4.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image4.png?resize=192,75 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22711" class="wp-caption-text"><em>Figure 3: Challenges to scale and streamline data access</em></figcaption></figure><h2>User and owner agents</h2>
<p>We modeled the solution as a multi-agent system, as shown in Figure 4 below. Data-user agents assist data users in obtaining access, while data-owner agents help data owners manage access. These two agents also collaborate to streamline the process when both parties are involved. We intentionally separate the two agents to decompose the problem, allowing each to have its own focus.</p>
<figure id="attachment_22712" aria-describedby="caption-attachment-22712" class="wp-caption alignnone c5"><img class="size-full wp-image-22712" src="https://engineering.fb.com/wp-content/uploads/2025/07/image7.png" alt="" width="1036" height="362" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image7.png 1036w, https://engineering.fb.com/wp-content/uploads/2025/07/image7.png?resize=916,320 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image7.png?resize=768,268 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image7.png?resize=1024,358 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image7.png?resize=96,34 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image7.png?resize=192,67 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22712" class="wp-caption-text"><em>Figure 4: How to model the problem for agents to solve</em></figcaption></figure><p>Looking closer at the data-user agent, illustrated below in Figure 5, we see it’s not a monolithic entity; instead, it’s composed of three specialized sub-agents, each focused on a specific task and coordinated by a triage agent.</p>
<figure id="attachment_22713" aria-describedby="caption-attachment-22713" class="wp-caption alignnone c6"><img class="size-full wp-image-22713" src="https://engineering.fb.com/wp-content/uploads/2025/07/image10.png" alt="" width="1670" height="892" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image10.png 1670w, https://engineering.fb.com/wp-content/uploads/2025/07/image10.png?resize=916,489 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image10.png?resize=768,410 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image10.png?resize=1024,547 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image10.png?resize=1536,820 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/image10.png?resize=96,51 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image10.png?resize=192,103 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22713" class="wp-caption-text"><em>Figure 5: Data-user agent</em></figcaption></figure><p>The first sub-agent suggests alternatives. For instance, when users encounter restricted tables, alternative options are often available, including unrestricted or less-restrictive tables. The agent also assists users in rewriting queries to use only non-restricted columns or in utilizing curated analyses. These insights are often hidden or considered tribal knowledge. Large language models and agents allow us to synthesize this information and guide users at scale.</p>
<p>The second sub-agent facilitates low-risk data exploration. Users often need access to only a small fraction of a table’s data, especially during the data-exploration phase of analysis workflows. This sub-agent provides context-aware, task-specific data access for low-risk exploration. We will dive deeper into this below.</p>
<p>The third sub-agent helps users obtain access by crafting permission requests and negotiating with data-owner agents. Currently, we maintain a human-in-the-loop for oversight, but over time, we expect these sub-agents will operate more autonomously.</p>
<p>The data-owner agent is also composed of several sub-agents, including one for handling security operations and another for assisting with access management, as shown below in Figure 6.</p>
<figure id="attachment_22714" aria-describedby="caption-attachment-22714" class="wp-caption alignnone c7"><img class="size-full wp-image-22714" src="https://engineering.fb.com/wp-content/uploads/2025/07/image8.png" alt="" width="1756" height="824" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image8.png 1756w, https://engineering.fb.com/wp-content/uploads/2025/07/image8.png?resize=916,430 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image8.png?resize=768,360 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image8.png?resize=1024,481 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image8.png?resize=1536,721 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/image8.png?resize=96,45 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image8.png?resize=192,90 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22714" class="wp-caption-text"><em>Figure 6: Data-owner agent</em></figcaption></figure><p>The first data-owner sub-agent, focused on security operations, acts like a junior engineer who assists the team with security tasks. It follows the SOP (standard operating procedure) provided by the data owner, typically derived from documented rules or guidelines, to handle incoming permission requests.</p>
<p>The second sub-agent proactively configures access rules for the team. This represents an evolution from the traditional role-mining process, where agents enable us to better utilize semantics and content. </p>
<h2>Data warehouse for agents</h2>
<p>As we mentioned at the outset, we organized the data warehouse in a hierarchical structure to scale out access. How do we evolve it with the agentic system?</p>
<p>LLMs communicate through text. The hierarchical structure of the data warehouse provides a convenient way to convert warehouse resources into text, much like a nested folder structure, as depicted in Figure 7 below. Here, organizing units are represented as folders, while leaf nodes such as tables, dashboards, policies, or other entities are represented as resources. This setup gives agents a read-only summarized view of the data warehouse. The SOP we discussed earlier, which documents data-access practices from rules, wikis, and past interactions, becomes a resource that can be represented in text format. It serves as input for both data-user agents to guide users and data-owner agents to manage access.</p>
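<p>Rendering that hierarchy as text for an LLM is straightforward; a minimal sketch (hypothetical resource names, and real resources would carry much richer metadata):</p>

```python
def render(tree, indent=0):
    """Render a warehouse hierarchy as indented text: organizing
    units become folders, leaf entries become resource summaries."""
    lines = []
    for name, child in sorted(tree.items()):
        if isinstance(child, dict):  # organizing unit -> folder
            lines.append("  " * indent + name + "/")
            lines.extend(render(child, indent + 1))
        else:                        # leaf -> resource summary
            lines.append("  " * indent + f"{name}: {child}")
    return lines

warehouse = {
    "growth": {
        "daily_metrics": "table: per-product engagement counts",
        "sop.md": "doc: access rules for growth tables",
    }
}
text = "\n".join(render(warehouse))
```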
<figure id="attachment_22715" aria-describedby="caption-attachment-22715" class="wp-caption alignnone c8"><img class="size-full wp-image-22715" src="https://engineering.fb.com/wp-content/uploads/2025/07/image9.png" alt="" width="1082" height="426" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image9.png 1082w, https://engineering.fb.com/wp-content/uploads/2025/07/image9.png?resize=916,361 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image9.png?resize=768,302 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image9.png?resize=1024,403 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image9.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image9.png?resize=192,76 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22715" class="wp-caption-text"><em>Figure 7: Data warehouse as resources</em></figcaption></figure><p>In addition to organizing inputs for agents to use, another crucial aspect is context management. Here, we differentiate among three scenarios, as shown below in Figure 8. The first scenario is called “automatic context.” Imagine data users discovering data they want to access, only to find their access blocked by controls. The system is aware of who is trying to access what, allowing the agent to fetch the exact context. The second scenario is “static context.” This occurs when users choose to focus on a specific scope explicitly or expand the resource from an automatic context to a broader one. The last scenario is “dynamic context.” It allows agents to further filter resources by metadata, such as data semantics, or via similarity search.</p>
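<p>The three context scenarios can be summed up as a small dispatcher (a hypothetical sketch; parameter names are illustrative, not a real API):</p>

```python
def resolve_context(scenario, blocked_access=None, scope=None,
                    query=None, search=None):
    """Pick the resource context handed to the agent under the three
    scenarios: automatic, static, and dynamic."""
    if scenario == "automatic":
        # The system already knows who tried to access what.
        return [blocked_access]
    if scenario == "static":
        # The user explicitly chose a scope, e.g. a team's folder.
        return list(scope)
    if scenario == "dynamic":
        # Filter resources by metadata or similarity search.
        return search(query)
    raise ValueError(f"unknown scenario: {scenario}")

catalog = ["growth/daily_metrics", "ml/features"]
auto = resolve_context("automatic", blocked_access="growth/daily_metrics")
dyn = resolve_context("dynamic", query="metrics",
                      search=lambda q: [r for r in catalog if q in r])
```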
<figure id="attachment_22716" aria-describedby="caption-attachment-22716" class="wp-caption alignnone c9"><img class="size-full wp-image-22716" src="https://engineering.fb.com/wp-content/uploads/2025/07/image2.png" alt="" width="1162" height="326" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image2.png 1162w, https://engineering.fb.com/wp-content/uploads/2025/07/image2.png?resize=916,257 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image2.png?resize=768,215 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image2.png?resize=1024,287 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image2.png?resize=96,27 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image2.png?resize=192,54 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22716" class="wp-caption-text"><em>Figure 8: Context management</em></figcaption></figure><p>Another key area is intention management. In the context of data access, we often refer to this as  “business needs.” What drives a data user to access certain resources? As shown in Figure 9 below, we model user intention in two ways: explicit and implicit.</p>
<figure id="attachment_22717" aria-describedby="caption-attachment-22717" class="wp-caption alignnone c10"><img class="size-full wp-image-22717" src="https://engineering.fb.com/wp-content/uploads/2025/07/image13.png" alt="" width="752" height="242" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image13.png 752w, https://engineering.fb.com/wp-content/uploads/2025/07/image13.png?resize=96,31 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image13.png?resize=192,62 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22717" class="wp-caption-text"><em>Figure 9: Intention management</em></figcaption></figure><p>In explicit intention management, users explicitly communicate their intentions to the system. For example, when using certain data tools to access data, they can inform the system of their current task by assuming an associated role, which carries the context of the business needs. This approach captures standard intentions.</p>
<p>The second is implicit intention. Not every intention can be modeled by roles. That’s where dynamic intention comes in. The system infers intention from a data user’s activities over a short period. For example, if a data user is woken up at midnight to address a pipeline failure, any subsequent data access is likely related to resolving that issue.</p>
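<p>The midnight-page example can be sketched as a simple inference heuristic (hypothetical activity types and field names):</p>

```python
def infer_intention(activities, now, window=3600):
    """Infer an implicit intention from recent activity: if the user
    was just paged for a pipeline failure, subsequent data access is
    likely related to resolving that incident."""
    recent = [a for a in activities if now - a["ts"] <= window]
    for a in recent:
        if a["type"] == "pipeline_failure_page":
            return {"intention": "incident_response",
                    "pipeline": a["pipeline"]}
    return {"intention": "unknown"}

acts = [
    {"type": "pipeline_failure_page", "ts": 990, "pipeline": "daily_agg"},
    {"type": "code_review", "ts": 100},
]
intent = infer_intention(acts, now=1000)
```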
<h2>Deep dive: Partial data preview</h2>
<p>Now, let’s dive into how all these elements come together to enable a complete end-to-end use case, which we refer to as partial data preview.</p>
<p>To quickly recap: In our data-access agentic solution, we have data-user agents that assist data users in obtaining access, and data-owner agents that help data owners manage and operate data access. Typically, a data user’s workflow begins with data discovery, followed by data exploration, before diving into full-fledged data analysis. The exploration phase usually involves only a small amount of data exposure. So, how do we enable more task-specific, context-aware access at this stage?</p>
<p>To make the system work seamlessly end to end, four key capabilities (illustrated below in Figure 10) are orchestrated by an agentic workflow: </p>
<ul><li class="c1" aria-level="1"><strong>Context.</strong> We analyze data-user activities and other information to understand the business needs driving data access and align them with data controls. This enables us to provide task-specific, context-aware control. </li>
<li class="c1" aria-level="1"><strong>Query-level access control at a granular level.</strong> We analyze the shape of queries, such as whether they involve aggregation or random sampling. </li>
<li class="c1" aria-level="1"><strong>Data-access budget.</strong> Employees are given a data-access budget based on the amount of data they typically access, and this budget, which refreshes daily, is our first line of defense. </li>
<li class="c1" aria-level="1"><strong>Rule-based risk management.</strong> We employ rule-based risk control to defend against attacks on, or malfunctions of, the AI agent.</li>
</ul><figure id="attachment_22718" aria-describedby="caption-attachment-22718" class="wp-caption alignnone c11"><img class="size-full wp-image-22718" src="https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png" alt="" width="1999" height="719" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png?resize=916,329 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png?resize=768,276 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png?resize=1024,368 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png?resize=1536,552 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image1_28b4f6.png?resize=192,69 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22718" class="wp-caption-text"><em>Figure 10: Partial data preview overview</em></figcaption></figure><p>Figure 11, below, illustrates how the system architecture works: The data-user agent taps into the user-activities tool to gather user activities across various platforms, including diffs, tasks, posts, SEVs, dashboards, and documents. It also uses the user-profile tool to fetch profile information. With this data, the data-user agent formulates the user’s intention based on their activities, profiles, and query shapes, and then calls upon the data-owner agent. The data-owner agent steps in to analyze the query, identifying the resources being accessed. It then fetches metadata related to these resources, such as table summaries, column descriptions, data semantics, and SOPs. The data-owner agent leverages an LLM model to generate the output decision and the reasoning behind it. The output guardrail ensures that the decision aligns with rule-based risk calculations. </p>
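<p>How the four capabilities combine into one guardrailed decision can be sketched as follows (a hypothetical orchestration with made-up names and thresholds, not the production logic):</p>

```python
def decide(query, budget_left, llm_decision, risk_score, risk_limit=0.7):
    """Guardrailed preview decision: budget first, query-shape
    analysis next, then a rule-based risk ceiling that can override
    whatever the LLM-backed owner agent decided."""
    if budget_left <= 0:
        return "deny: data-access budget exhausted"
    if not (query.get("aggregated") or query.get("sampled")):
        return "deny: raw row-level scan is not a preview"
    if risk_score > risk_limit:  # output guardrail wins over the LLM
        return "deny: rule-based risk too high"
    return "allow" if llm_decision == "allow" else "deny: owner-agent declined"

verdict = decide({"aggregated": True}, budget_left=5,
                 llm_decision="allow", risk_score=0.2)
```

<p>Note the ordering: the cheap analytical checks run before, and can veto, the LLM’s judgment, which matches the role of the output guardrail described above.</p>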
<figure id="attachment_22719" aria-describedby="caption-attachment-22719" class="wp-caption alignnone c12"><img class="size-full wp-image-22719" src="https://engineering.fb.com/wp-content/uploads/2025/07/image12.png" alt="" width="1748" height="882" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image12.png 1748w, https://engineering.fb.com/wp-content/uploads/2025/07/image12.png?resize=916,462 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image12.png?resize=768,388 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image12.png?resize=1024,517 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image12.png?resize=1536,775 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/image12.png?resize=96,48 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image12.png?resize=192,97 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22719" class="wp-caption-text"><em>Figure 11: Partial data preview architecture</em></figcaption></figure><p>All decisions and logs are securely stored for future reference and analysis. While many of these building blocks might be familiar to some of you—after all, we’ve been working with query analyzers for decades—this is the first time we’re harnessing the language and reasoning capabilities of LLMs to build a multi-agent system for data users and data owners. LLMs have shown potential in this area because business needs are often context- and task-specific, making them difficult to model analytically. LLMs give us the ability to delve into these nuances, while the agents help us construct a dynamic, end-to-end workflow. At the same time, we employ guardrails, such as analytical rule-based risk computation, to ensure that the agents operate within set boundaries. Throughout the decision-making process, we also place a strong emphasis on transparency and tracing.</p>
<p>Below, Figure 12 shows the evaluation process. Evaluation is one of the most crucial steps in developing an agentic solution. To assess the system’s accuracy, recall, and other performance metrics, we curated an evaluation dataset using real requests. This involved compiling historical requests, collecting user activities and profile information, associating them with query runs, and then using this data for evaluation. We run the evaluation process daily to catch any potential regressions.</p>
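<p>Scoring the agent against the curated set reduces to standard precision/recall over granted requests; a minimal sketch with hypothetical request IDs:</p>

```python
def precision_recall(predicted, actual):
    """Precision/recall of agent decisions against historical outcomes.
    `predicted` and `actual` are sets of request IDs that were granted."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(actual) if actual else 1.0
    return precision, recall

granted_by_agent = {"r1", "r2", "r3", "r4"}
granted_historically = {"r1", "r2", "r5"}
p, r = precision_recall(granted_by_agent, granted_historically)
```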
<figure id="attachment_22720" aria-describedby="caption-attachment-22720" class="wp-caption alignnone c13"><img class="size-full wp-image-22720" src="https://engineering.fb.com/wp-content/uploads/2025/07/image11.png" alt="" width="1512" height="328" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image11.png 1512w, https://engineering.fb.com/wp-content/uploads/2025/07/image11.png?resize=916,199 916w, https://engineering.fb.com/wp-content/uploads/2025/07/image11.png?resize=768,167 768w, https://engineering.fb.com/wp-content/uploads/2025/07/image11.png?resize=1024,222 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/image11.png?resize=96,21 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image11.png?resize=192,42 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22720" class="wp-caption-text"><em>Figure 12: Partial data preview evaluation</em></figcaption></figure><p>We’ve also developed a data flywheel for the process, as illustrated below in Figure 13. This means that data users’ queries, the agents’ processing traces, the context, and the final outputs are all securely stored for feedback and auditing purposes. Additionally, we’ve created a data tool for data owners, allowing them to view and review decisions and provide us with feedback. This feedback helps us update our evaluations and assess the overall process.</p>
<figure id="attachment_22721" aria-describedby="caption-attachment-22721" class="wp-caption alignnone c14"><img class="size-full wp-image-22721" src="https://engineering.fb.com/wp-content/uploads/2025/07/image6.png" alt="" width="594" height="832" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/image6.png 594w, https://engineering.fb.com/wp-content/uploads/2025/07/image6.png?resize=96,134 96w, https://engineering.fb.com/wp-content/uploads/2025/07/image6.png?resize=192,269 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22721" class="wp-caption-text"><em>Figure 13: Partial data preview feedback loop</em></figcaption></figure><h2>What’s ahead?</h2>
<p>There’s still plenty of work ahead of us to become agent-ready. Here are just a few examples. </p>
<ul><li class="c1" aria-level="1">First, agent collaboration. We’re seeing more and more scenarios where it’s not the users directly accessing data, but rather agents acting on their behalf. How can we support these use cases in the most efficient way?</li>
<li class="c1" aria-level="1">Second, our data warehouse and tools were originally built for employees or services, not agents. How do we continue evolving them to be effectively used by other agents?</li>
<li class="c1" aria-level="1">Lastly, evaluation and benchmarking are important, and we’ll need to keep developing these areas to ensure we stay on track.</li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/08/13/data-infrastructure/agentic-solution-for-warehouse-data-access/</link>
      <guid>https://engineering.fb.com/2025/08/13/data-infrastructure/agentic-solution-for-warehouse-data-access/</guid>
      <pubDate>Thu, 14 Aug 2025 00:05:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Federation Platform and Privacy Waves: How Meta distributes compliance-related tasks at scale]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re exploring Meta’s Federation Platform, a scalable set of tools for managing compliance-related tasks, along with Privacy Waves, our method for batching these tasks and ensuring accountability. </li>
<li class="c1" aria-level="1">Together, the Federation Platform and Privacy Waves create a structured, effective, and sustainable approach to operationalizing privacy work, enabling Meta to safeguard user data for the billions of people that use our products.</li>
<li class="c1" aria-level="1">Given its success in the privacy domain, we’re expanding this approach to other domains such as security and accessibility.</li>
</ul><p>At Meta, we take a systematic approach to privacy-related compliance. Experts translate complex obligations into actionable product requirements, ensuring coverage and consistency across all Meta products. We then deploy technical solutions that address these requirements at scale through our <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/">Privacy Aware Infrastructure (PAI) initiative</a>. Following that, our privacy teams centrally automate remediation of potential issues, and finally, when expert help is needed, they send tasks to product teams for distributed execution.</p>
<p>Operationalizing this work at Meta’s scale – across tens of thousands of engineers and numerous products – requires robust coordination. To facilitate this, we developed the Federation Platform and Privacy Waves program:</p>
<ul><li class="c1" aria-level="1">The <strong>Federation Platform</strong> breaks down large compliance-related initiatives into smaller, manageable workstreams. It distributes tasks to the appropriate teams and enables them to track progress through to completion.</li>
<li class="c1" aria-level="1">The <strong>Privacy Waves program</strong> organizes tasks for these initiatives into monthly batches, creating a predictable cadence that improves quality and accountability of task distribution and management. It helps teams plan and execute their compliance-related work systematically, rather than reactively. </li>
</ul><p>Together, the Federation Platform and Privacy Waves program play a critical role in safeguarding user data and ensuring consistent, effective operations of our systems and solutions, supporting Meta’s compliance posture (for both existing and future obligations) while balancing internal engineering efficiency and experience. </p>
<p>They are significant levers in Meta’s compliance-related efforts, managing over 100,000 tasks annually within established timelines. Internal surveys reveal significantly higher positive sentiment for Privacy Waves tasks compared to ad-hoc tasks. And we estimate that the program has saved hundreds of thousands of engineering hours by enhancing strategy, tooling, and task quality. The success of this approach in the privacy domain has encouraged its expansion into other domains such as security, accessibility and our broader compliance efforts.</p>
<h2>The need for a centralized work distribution and management system</h2>
<p>There are several reasons why large organizations like Meta benefit from a centralized system to distribute and manage compliance-related work:</p>
<ul><li aria-level="1"><strong>Meeting privacy obligations at scale is complex</strong> because it often requires thousands of engineers to each complete small, specialized tasks across hundreds of global pressures and thematic areas.</li>
</ul><ul><li aria-level="1"><strong>Scalability and internal accountability are crucial.</strong> Doing this ad hoc can lead to task fatigue, difficulty meeting completion expectations, and diminished developer sentiment. Without centralized management and oversight, it becomes challenging to effectively prioritize, track, and execute work across organizational boundaries, or to deduplicate tasks across teams.</li>
</ul><ul><li aria-level="1"><strong>Developer experience matters</strong> and can even increase output. A positive, well-managed task flow reduces operational burden, maintains morale, and sustains high productivity. </li>
</ul><ul><li aria-level="1"><strong>External accountability is essential to operations.</strong> Meta must demonstrate consistent and effective operations to regulators and auditors. The Federation Platform enables clear, standardized practices along with consistent documentation and validation to uphold Meta’s compliance posture in response to external requirements.</li>
</ul><h2>Managing privacy work with the Federation Platform</h2>
<h3>Workstream configuration: How engineers integrate with the platform</h3>
<p><img class="alignnone size-full wp-image-22805" src="https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Workstream-configuration_cropped.png" alt="" width="882" height="198" srcset="https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Workstream-configuration_cropped.png 882w, https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Workstream-configuration_cropped.png?resize=768,172 768w, https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Workstream-configuration_cropped.png?resize=96,22 96w, https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Workstream-configuration_cropped.png?resize=192,43 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Implementing a workstream on the Federation Platform requires defining in-code logic that mirrors the typical lifecycle of a potential privacy issue. This involves specifying how to detect, distribute, remediate, and verify resolution of these issues, ensuring their effective management. The resulting technical workstream configuration (code file) includes methods for:</p>
<ul><li class="c1" aria-level="1"><strong>Scraping flags:</strong> Scraping involves identifying the relevant set of privacy flags – indications of potential issues that require attention. These flags are ingested into the Federation Platform based on the workstream’s configuration, which often leverages Meta’s reusable detection and verification frameworks. The scraping process can be automated to run daily using in-code methods or run ad hoc via the platform’s intake APIs. Scraping defines the scope of the workstream, with additional filters and linters configured as needed.</li>
<li class="c1" aria-level="1"><strong>Ownership resolution:</strong> This involves implementing logic to determine the ownership of privacy flags. Typically, this requires referencing Meta’s central catalog to map relevant assets, such as code files and data tables, to their respective owners.</li>
</ul><ul><li aria-level="1"><strong>Grouping:</strong> Workstreams can optionally group related flags, such as those with a common owner or located in the same directory. This allows for efficient bulk remediation by bundling these flags into a single task or diff (code change).</li>
</ul><ul><li class="c1" aria-level="1"><strong>Actioning (Task/Diff):</strong> Workstreams decide how to address each privacy flag or group of flags. The most common approach is to file a task, which is then assigned to the asset owner. Alternatively, they can choose to send automated code changes to directly resolve issues, which must be reviewed by the asset owner.</li>
</ul><ul><li aria-level="1"><strong>Task content and distribution:</strong> Workstreams configure the content of tasks, providing context on why the task is necessary, its alignment with privacy initiatives, and instructions and workflows to fix the issue. Workstreams also configure how they want to distribute their tasks, which is most commonly done through the Privacy Waves program.</li>
</ul><ul><li aria-level="1"><strong>Resolution logic:</strong> Finally, workstreams define resolution logic to determine when a privacy flag is resolved. This allows the Federation Platform to automatically close tasks once the underlying issue is fixed or reopen tasks if they are prematurely closed.</li>
</ul><p>The general-purpose configuration described above is versatile and extends well beyond privacy use cases. For instance, security and accessibility workstreams have started using it to address potential vulnerabilities and product accessibility matters through task distribution. Similarly, engineering excellence initiatives operate workstreams to drive API migrations, code quality improvements, and the cleanup of obsolete experiments across numerous teams. This positions the Federation Platform as a powerful tool for driving diverse, large-scale initiatives across the organization.</p>
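<p>The lifecycle methods above can be sketched as a single configuration object. Everything below is illustrative: the class, method names, and schemas are hypothetical stand-ins, since the Federation Platform’s actual interface is internal to Meta.</p>

```python
from dataclasses import dataclass

@dataclass
class Flag:
    """A potential privacy issue attached to an asset (illustrative schema)."""
    asset: str
    kind: str

# Hypothetical workstream configuration mirroring the lifecycle described
# above: scrape -> resolve ownership -> group -> action -> resolve.
class ExampleWorkstream:
    def scrape_flags(self, assets):
        # Detect flags in scope; real workstreams often reuse shared
        # detection frameworks and run daily or via intake APIs.
        return [Flag(asset=a, kind="unannotated_field") for a in assets]

    def resolve_owner(self, flag, catalog):
        # Map the asset to its owner via a central catalog.
        return catalog.get(flag.asset, "unowned")

    def group(self, flags, catalog):
        # Bundle flags that share an owner so they land in one task.
        groups = {}
        for f in flags:
            groups.setdefault(self.resolve_owner(f, catalog), []).append(f)
        return groups

    def action(self, owner, flags):
        # File a task (the most common action) assigned to the asset owner.
        return {"assignee": owner, "flags": flags, "status": "open"}

    def is_resolved(self, flag, assets_fixed):
        # Resolution logic: the platform auto-closes tasks once this holds.
        return flag.asset in assets_fixed

catalog = {"table_a": "team_growth", "table_b": "team_growth"}
ws = ExampleWorkstream()
flags = ws.scrape_flags(["table_a", "table_b"])
tasks = [ws.action(owner, fs) for owner, fs in ws.group(flags, catalog).items()]
```

<p>Because both flags share an owner, grouping yields a single task here, which is the bulk-remediation behavior the grouping step exists to enable.</p>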
<p>In addition to the technical configuration steps outlined above, privacy workstreams strive to adhere to the comprehensive end-to-end federation process detailed below, ensuring a holistic approach to managing privacy issues.</p>
<h3>An overview of the end-to-end federation process</h3>
<h4>Step 1: High-level strategy and planning</h4>
<p>Before distributing work, a thorough review process evaluates the holistic strategy for a privacy area to ensure the plan efficiently meets applicable privacy-related obligations. This strategy often involves a combination of developing privacy-aware infrastructure and controls through traditional project work, privacy teams centralizing bulk remediation via scripts and mass code changes, and – when automated solutions are not feasible – distributing work across the company via Federation Platform workstreams and Privacy Waves.</p>
<p>Product organizations (e.g., Facebook, Instagram, WhatsApp) receive advance visibility into upcoming privacy work, allowing them to incorporate it into their roadmaps and commit to its delivery. While aligning work across organizational lines takes longer, it ultimately enables easier and more efficient completion of tasks.</p>
<h4>Step 2: Configuring efficient task experiences</h4>
<p>Tasks for Federation Platform workstreams that participate in Privacy Waves must clearly communicate the nature of the work and its due date, link to relevant context and documentation, and contain the necessary steps for resolution. Structured tasks guide users through a wizard-like workflow with multiple-choice questions, often culminating in automated remediations (e.g., code changes, click-to-fix tools) based on user decisions. These ‘wizards’ facilitate appropriate decision-making by product engineers and, in some cases, have been shown to reduce the effort required to complete tasks by around 50%.</p>
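<p>A structured task’s wizard can be pictured as a small decision tree whose leaves name an automated remediation. The questions and remediation names below are hypothetical; they only illustrate the shape of the flow.</p>

```python
# Illustrative "wizard" for a structured task: multiple-choice questions
# whose leaves name an automated remediation. All strings are invented.
WIZARD = {
    "question": "Does this asset still receive writes?",
    "choices": {
        "no": {"remediation": "queue_asset_for_deletion"},
        "yes": {
            "question": "Does it contain user data?",
            "choices": {
                "no": {"remediation": "mark_as_non_user_data"},
                "yes": {"remediation": "apply_annotation_codemod"},
            },
        },
    },
}

def run_wizard(node, answers):
    """Walk the tree with the engineer's answers; return the remediation."""
    while "question" in node:
        node = node["choices"][answers[node["question"]]]
    return node["remediation"]

answer_sheet = {
    "Does this asset still receive writes?": "yes",
    "Does it contain user data?": "yes",
}
chosen = run_wizard(WIZARD, answer_sheet)  # apply_annotation_codemod
```

<p>Encoding the workflow as data rather than code is one way such wizards can culminate in a click-to-fix remediation while keeping the engineer’s decisions auditable.</p>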
<p>Tasks are enriched with links to support forums and similar tasks where assistance can be sought, if needed. AI-powered support agents embedded within tasks help task owners search through relevant resources and write code quickly; any AI-generated code requires human review before landing.</p>
<h4>Step 3: Reviewing and improving task quality</h4>
<p>A review committee provides feedback on task quality and content for workstreams participating in Privacy Waves, identifying areas for improvement and opportunities for automation. Automated health signals for each workstream, such as completion rates, open tasks, deferral rates, and developer friction (e.g., broken tooling, inadequate support), are measured and tracked. Workstreams and their reviewers monitor these metrics monthly and are held accountable for improvements.</p>
<p>Engineering sentiment is captured for each workstream through task owner surveys, and AI is used to summarize their feedback, enabling workstream owners to learn from task owner input and enhance future tasks. These features contribute to improved work quality, developer sentiment, and completion rates.</p>
<h4>Step 4: Distributing the work</h4>
<p>Linting tools are employed to prevent the distribution of low-quality and low-risk work (e.g., for assets queued for deletion or lacking any data). Workstreams can configure the lints they wish to apply.</p>
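<p>Conceptually, these lints act as configurable filters applied before distribution. The sketch below is hypothetical (the lint names and the asset schema are invented) and only illustrates the idea of suppressing low-value tasks.</p>

```python
# Hypothetical pre-distribution lints: each lint suppresses a flag that
# would produce low-value work, matching the examples in the text
# (assets queued for deletion, assets lacking any data).
LINTS = {
    "skip_queued_for_deletion": lambda asset: asset.get("queued_for_deletion", False),
    "skip_empty_assets": lambda asset: asset.get("row_count", 0) == 0,
}

def should_distribute(asset, enabled_lints):
    """A flag is distributed only if no enabled lint suppresses it."""
    return not any(LINTS[name](asset) for name in enabled_lints)

assets = [
    {"name": "t1", "row_count": 0},
    {"name": "t2", "row_count": 10, "queued_for_deletion": True},
    {"name": "t3", "row_count": 10},
]
to_send = [
    a["name"] for a in assets
    if should_distribute(a, ["skip_queued_for_deletion", "skip_empty_assets"])
]
# to_send == ['t3']
```

<p>Workstreams choosing which lints to enable corresponds here to passing a different <code>enabled_lints</code> list.</p>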
<p>Tasks are sent in Privacy Waves, which are batches of privacy-related work distributed at a predefined, predictable cadence. Privacy Waves streamline execution, coordination, and reporting, since all tasks in a wave share the same deadline, allowing for timely reminders.</p>
<p>A sophisticated matching algorithm aligns tasks with teams based on competing priorities related to assets they own. Combined with predictable task distribution, this approach ensures timely work assignment and enables teams to effectively prioritize, allowing them to balance responsibilities and make consistent progress towards addressing their workloads.</p>
<h4>Step 5: Ensuring accountability of execution</h4>
<p>To ensure timely completion of tasks, deadlines are established with the aim of preventing deferral beyond these critical dates. Automated nudges and escalations are strategically used to remind individuals and teams to complete work on schedule, minimizing unnecessary noise and highlighting overdue tasks that require immediate attention.</p>
<p>Furthermore, completion rates for privacy work are rigorously measured and reported at all organizational levels, fostering a culture of accountability from frontline teams to leadership. This transparent approach ensures that everyone is held responsible for executing their tasks in a timely manner, promoting a sense of ownership and urgency across the organization.</p>
<h4>Step 6: Reporting and recognition</h4>
<p>The centralized distribution of tasks via the Federation Platform and Privacy Waves streamlines operations and verification. These systems document completed tasks in a standardized format that aligns with expectations, providing clear and consistent evidence that supports Meta’s compliance posture in response to external requirements.</p>
<p>At Meta, executing on compliance-related work is an integral part of internal engineering expectations. To ensure that individuals receive the recognition they deserve, centralized recognition tooling is utilized to credit their contributions in performance evaluations. This approach not only motivates engineers to prioritize these efforts, but also reinforces the importance of this critical work in maintaining user trust and our compliance posture.</p>
<h2>Expansions for the Federation Platform and Waves</h2>
<p>As Meta continues to evolve, the Federation Platform and Waves programs are actively being expanded into new domains like security, accessibility, and broader compliance-related efforts. This expansion presents unique challenges, including different types of tasks, complex multi-step remediation processes, varying deadlines, and more. However, our foundational principles of centralized task distribution, execution tracking, and accountability provide a robust framework to address these challenges effectively.</p>
<p><img class="alignnone wp-image-22806 size-medium" src="https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Expansion-beyond-privacy_cropped.png?w=556" alt="" width="556" height="466" srcset="https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Expansion-beyond-privacy_cropped.png 556w, https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Expansion-beyond-privacy_cropped.png?resize=96,80 96w, https://engineering.fb.com/wp-content/uploads/2025/08/Federation-Platform-Waves-Expansion-beyond-privacy_cropped.png?resize=192,161 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>To ensure seamless extension into new areas, we’ll refine our tooling and processes, developing solutions that cater to each domain’s specific needs while maintaining high standards of quality and efficiency. By doing so, we aim to exceed expectations, reinforcing our commitment to safeguarding user data and ensuring efficient and consistent operations across all areas. This forward-looking approach underscores Meta’s dedication to innovation in compliance standardization, setting a benchmark for other tech companies to follow.</p>
<h2>Acknowledgments</h2>
<p><em>The authors would like to express our gratitude to reviewers of this post, including (in last name alphabetical order): Chris Adams, Bob Baldwin, Denys Besedynskyy, Herb David, Dylan Drop, Katriel Cohn-Gordon, Xenia Habekoss, Mohit Jha, Ryan Pratt, Matt Pregozen, Jessica Retka, Thomas Richards, and Chris Wiltz, many of whom have made significant contributions to Federation Platform and Privacy Waves.</em></p>
<p><em>Additionally, we would like to acknowledge the contributions of many current and former Meta employees, who have played a crucial role in developing and maturing Federation Platform and Privacy Waves over the years. In particular, we would like to extend special thanks to (in last name alphabetical order): Quinn Armstrong, Cecilia Baek, Yashdeep Bindal, Chris Buckley, Adam Campbell, Katriel Cohn-Gordon, Ruo Ding, Jason Fennell, Andrew Fong, Riccardo Govoni, Abhishek Gulati, Aleksandar Ilic, AJ Jahansouz, Shruthi Katakam, Risa Kawai, Emile Litvak, Amira Malpass, Idan Michael, Jason Nawrocki, Anthony O’Sullivan, Yuval Oren, Disha Parekh, Uday Patireddy, Vimalkumar Patel, Riley Pinkerton, Matt Pregozen, Mateen Saifyan, Pallavi Saraswati, Jay Shah, Or Sperling, Sana Surani, Rajesh Vantipalli, Avi Varadarajulu, Michelle Xu, Robbin Xu, Rui Xue, Anna Zeng, and Hansen Zhang.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/08/11/security/federation-platform-privacy-waves-meta-distributes-compliance-tasks/</link>
      <guid>https://engineering.fb.com/2025/08/11/security/federation-platform-privacy-waves-meta-distributes-compliance-tasks/</guid>
      <pubDate>Mon, 11 Aug 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Diff Risk Score: AI-driven risk-aware software development]]></title>
      <description><![CDATA[<h2>The state of the research</h2>
<p><a href="https://arxiv.org/abs/2410.06351?ref=engineeringatmeta" target="_blank" rel="noopener">Diff Risk Score (DRS)</a> is an AI-powered technology built at Meta that predicts the likelihood of a code change causing a production incident, also known as a SEV. Built on a fine-tuned Llama LLM, DRS evaluates code changes and metadata to produce a risk score and highlight potentially risky code snippets. Today, DRS powers many risk-aware features that optimize product quality, developer productivity, and computational capacity efficiency. Notably, DRS has helped us eliminate major code freezes, letting developers ship code when they historically could not, with minimal impact to customer experience and the business.</p>
<h2>Why it matters</h2>
<p>Software development is fraught with risk, especially for intricate, rapidly evolving, and scaled products and technologies. Because Meta operates at a global scale, we need the best tools possible to mitigate risk and to protect both user experience and advertiser outcomes.</p>
<p>AI is transforming how we build products, so we committed ourselves to applying AI to improve every aspect of the software development process. Production risk was one of the areas we tackled first. We theorized that, if equipped with a model that could predict if a code change might cause a SEV, we could build features and workflows to improve almost every aspect of writing and pushing code.</p>
<p>Since DRS use cases are too numerous to cover in depth here, we’ll focus on one: <a href="https://dl.acm.org/doi/10.1145/3722216?ref=engineeringatmeta" target="_blank" rel="noopener">code unfreeze</a>. For Meta, production incidents can drive significant negative user experience and advertiser impact. For this reason, some teams have historically “frozen” major parts of the codebase for sensitive periods like Cyber 5 holiday shopping week, preventing engineers from shipping code to reduce incident risk.</p>
<p>While this had clear reliability benefits, the tradeoff was a substantial reduction in productivity. DRS enabled a more nuanced approach, letting developers land lower-risk changes during these periods while minimizing production incidents, thus protecting the user experience, the business, and productivity. In fact, DRS has driven meaningful productivity gains across many sensitive periods. During one such period, a major partner event in 2024, we landed 10,000+ code changes (that previously could not have landed during a freeze) with minimal production impact, enabling continued innovation and customer success. What’s more, by managing productivity and risk in this way, we benefit twice: through more code landed and through less engineering time spent detecting, understanding, and mitigating production incidents.</p>
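<p>The gating logic behind code unfreeze can be sketched as a simple threshold check. Note that the scoring function below is an invented heuristic stand-in: the real Diff Risk Score comes from a fine-tuned Llama model, and real thresholds are tuned per surface and per sensitive period.</p>

```python
# Sketch of risk-gated landing during a "freeze" window. Everything here is
# illustrative; only the threshold-gating idea comes from the text.
FREEZE_ACTIVE = True
RISK_THRESHOLD = 0.2  # illustrative value, not a production setting

def diff_risk_score(diff: dict) -> float:
    # Stand-in heuristic; the production model consumes the code change
    # and metadata and returns a SEV-likelihood score.
    size_risk = min(diff["lines_changed"] / 1000, 1.0)
    return 0.8 * size_risk + (0.2 if diff["touches_config"] else 0.0)

def may_land(diff: dict) -> bool:
    """During a freeze, only sufficiently low-risk diffs may land."""
    if not FREEZE_ACTIVE:
        return True
    return diff_risk_score(diff) < RISK_THRESHOLD

small_fix = {"lines_changed": 40, "touches_config": False}
big_refactor = {"lines_changed": 2500, "touches_config": True}
# may_land(small_fix) -> True; may_land(big_refactor) -> False
```

<p>Under this kind of gate, low-risk changes keep flowing during a sensitive period while high-risk ones wait, which is the productivity/reliability tradeoff the paragraph describes.</p>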
<p>Code unfreeze works well, but it’s just the start of what the technology can do. Understanding risk, even imperfectly and at a statistical level, has driven improvements for Meta in more ways than we anticipated – there are 19 use cases for risk tooling and growing!</p>
<h2>Where we’re headed next</h2>
<p>The success of DRS has spurred the creation of new risk-aware features across Meta that span the entire development lifecycle, from planning to post-release monitoring. The demand to build such features also led us to build the Risk Awareness Platform to provide risk analysis APIs and tool integrations.</p>
<p>We envision four major directions for risk awareness in the coming months and years.</p>
<p>First, while we’ve seen an explosion of DRS-powered features on the Risk Awareness Platform, from optimizing build and test selection to improving reliability, selecting code reviewers, and analyzing release risks, we believe this is only the beginning. A critical problem in software engineering is maximizing innovation rate subject to a reliability threshold, so the applications of risk understanding are virtually inexhaustible. We believe code risk can play a significant role in improving this tradeoff, so we will build more risk-aware features while improving their quality. As the risk model, feature data, and user experiences improve, we’ll see greater real-world benefits for people who use Meta’s products and businesses who advertise with Meta.</p>
<p>Second, we will expand beyond code change risk to configuration change risk. While code changes cause the plurality of SEVs at Meta, configuration changes are another large category. For this reason, we’ve expanded the Risk Awareness Platform to include models that predict the risk of various config changes. These efforts are state of the art, focused on an open research area, and earlier on the research-to-production continuum, but we believe they will soon power feature families of their own, much like DRS does today.</p>
<p>Third, we want to automate the risk mitigation step. Instead of flagging risky diffs and recommending appropriate reviewers or rollback mechanisms, we want to use AI agents to proactively generate risk-mitigating changes. This can be done for code in motion (i.e. diffs or pull requests) and for code at rest to lower baseline codebase risk. Additionally, once we are armed with a greater understanding of configuration risks, these agents will be able to operate flexibly across both code and config changes.</p>
<p>Fourth, we will increasingly use natural language outputs to show humans what these risk-aware technologies are doing and why. By helping engineers understand the rationale behind the risk score, we’ll empower them to either mitigate risks or give the model feedback to improve accuracy. This creates a learning loop for improving both our risk models and the end user experience. LLM explainability remains an open area of research, but our teams are actively working to offer answers to common questions.</p>
<p>We are excited for the future of risk-aware software development, and we look forward to learning from—and with—our colleagues in industry as we make progress in this valuable domain.</p>
<h2>Read the papers</h2>
<p>“<a href="https://arxiv.org/abs/2410.06351" target="_blank" rel="noopener">Moving Faster and Reducing Risk: Using LLMs in Release Deployment</a>”</p>
<p>“<a href="https://dl.acm.org/doi/10.1145/3722216?ref=engineeringatmeta" target="_blank" rel="noopener">Leveraging Risk Models to Improve Productivity for Effective Code Un-Freeze at Scale</a>”</p>
<h2>Acknowledgements</h2>
<p><em>We would like to thank all the team members and the leadership that contributed to making the DRS effort successful at Meta. Rui Abreu, David Amsallem, Parveen Bansal, Kaavya Chinniah, Brian Ellis, James Everingham, Peng Fan, Ford Garberson, Jun Ge, Kelly Hirano, Kosay Jabre, David Khavari, Sahil Kumar, Ajay Lingapuram, Yalin Liu, Audris Mockus, Megh Mehta, Vijayaraghavan Murali, Venus Montes, Aishwarya Girish Paraspatki, Akshay Patel, Brandon Reznicek, Peter C Rigby, Maher Saba, Babak Shakibi, Roy Shen, Gursharan Singh, Matt Steiner, Weiyan Sun, Ryan Tracy, Siri Uppalapati, and Nachiappan Nagappan.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/08/06/developer-tools/diff-risk-score-drs-ai-risk-aware-software-development-meta/</link>
      <guid>https://engineering.fb.com/2025/08/06/developer-tools/diff-risk-score-drs-ai-risk-aware-software-development-meta/</guid>
      <pubDate>Wed, 06 Aug 2025 19:50:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Building a human-computer interface for everyone]]></title>
      <description><![CDATA[<p>What if you could control any device using only subtle hand movements?</p>
<p><a href="https://www.meta.com/blog/reality-labs-surface-emg-research-nature-publication-ar-glasses-orion/" target="_blank" rel="noopener">New research from Meta’s Reality Labs</a> is pointing even more firmly toward wrist-worn devices using <a href="https://en.wikipedia.org/wiki/Electromyography" target="_blank" rel="noopener">surface electromyography (sEMG)</a> becoming the future of human-computer interaction.</p>
<p>But how do you develop a wrist-worn input device that works for everyone?</p>
<p>Generalization has been one of the most significant challenges in the field of human-computer interaction (HCI). The machine learning models that power a device can be trained to respond to an individual’s hand gestures, but they struggle to apply that same learning to someone else. Essentially, novel HCI devices are usually one-size-fits-one.</p>
<p>On the latest episode of the Meta Tech Podcast, <a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> sits down with Sean B., Lauren G., and Jesse M. — research scientists on Meta’s EMG engineering and research team — to discuss how their team is tackling the challenge of generalization and reimagining how we interact with technology. </p>
<p>They discuss the road to creating a first-of-its-kind, generic human-computer neuromotor interface, what happens when software and hardware engineering meet neuroscience, and more!</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/37610330/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe><br />
You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/2tkjPgcX6k3Dw8m6xfBwoI?ref=engineeringatmeta" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/meta-tech-podcast/id1370910331?ref=engineeringatmeta" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pocketcasts.com/podcasts/c4ede3e0-1fbf-0136-c266-7d73a919276a/01bfd518-45ea-45c8-8dd6-60e2b667e8eb?ref=engineeringatmeta" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/08/04/virtual-reality/building-a-human-computer-interface-for-everyone-meta-tech-podcast/</link>
      <guid>https://engineering.fb.com/2025/08/04/virtual-reality/building-a-human-computer-interface-for-everyone-meta-tech-podcast/</guid>
      <pubDate>Mon, 04 Aug 2025 16:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Accelerating on-device ML on Meta’s family of apps with ExecuTorch]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1"><a href="https://github.com/pytorch/executorch/" target="_blank" rel="noopener">ExecuTorch</a> is the PyTorch inference framework for edge devices developed by Meta with support from industry leaders like Arm, Apple, and Qualcomm. </li>
<li class="c1" aria-level="1">Running machine learning (ML) models on-device is increasingly important for Meta’s family of apps (FoA). These on-device models improve latency, maintain user privacy by keeping data on users’ devices, and enable offline functionality.</li>
<li class="c1" aria-level="1">We’re showcasing some of the on-device AI features, powered by ExecuTorch, that are serving billions of people on Instagram, WhatsApp, Messenger, and Facebook.</li>
<li class="c1" aria-level="1">These rollouts have significantly improved the performance and efficiency of on-device ML models in Meta’s FoA and eased the research-to-production path.</li>
</ul><p>Over the past year, we’ve rolled out <a href="https://github.com/pytorch/executorch/" target="_blank" rel="noopener">ExecuTorch</a>, an open-source solution for on-device inference on mobile and edge devices, across our family of apps (FoA) and seen significant improvements in model performance, privacy enhancement, and latency over our previous on-device machine learning (ML) stack.</p>
<p>ExecuTorch was <a href="https://pytorch.org/blog/pytorch-edge-enabling-on-device-inference-across-mobile-and-edge-devices-with-executorch/" target="_blank" rel="noopener">built in collaboration with industry leaders</a> and uses PyTorch 2.x technologies to convert models into a stable and compact representation for efficient on-device deployment. Its compact runtime, modularity, and extensibility make it easy for developers to choose and customize components – ensuring portability across platforms, compatibility with PyTorch, and high performance.</p>
<p>Adopting ExecuTorch has helped us enhance our user experiences in our products and services used by billions of people all over the world.</p>
<p>The following are just a few examples of the various ML models on our apps on Android and iOS devices that ExecuTorch supports.</p>
<h2>Enabling Cutouts on Instagram</h2>
<p><a href="https://ai.meta.com/blog/instagram-edits-cutouts-segment-anything/?ref=shareable">Cutouts</a> is one of Instagram’s latest features for creative expression and storytelling. It lets people transform photos and videos of their favorite moments into animated, personalized stickers that they can share via Reels or Stories. We migrated the Cutouts feature in Instagram to run with ExecuTorch by enabling <a href="https://arxiv.org/abs/2312.06736">SqueezeSAM</a>, a lightweight version of the <a href="https://ai.meta.com/blog/instagram-edits-cutouts-segment-anything/">Meta Segment Anything Model (SAM)</a>. For both Android and iOS, ExecuTorch was significantly faster compared to the older stack, translating into increases in Cutouts’ daily active users (DAU). </p>
<figure id="attachment_22760" aria-describedby="caption-attachment-22760" class="wp-caption alignnone c2"><img class="wp-image-22760" src="https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png" alt="" width="600" height="523" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png?resize=916,798 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png?resize=768,669 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png?resize=1024,892 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png?resize=1536,1338 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png?resize=96,84 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Instagram-Cutouts-ExecuTorch.png?resize=192,167 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22760" class="wp-caption-text">ExecuTorch enables Instagram’s Cutouts feature to run faster and more efficiently for both on-device sticker generation (left) and creating overlays on a photo (right).</figcaption></figure><h2>Improving video and call quality on WhatsApp</h2>
<p>WhatsApp needs to be usable and reliable regardless of your network connection bandwidth. To achieve this, we developed bandwidth estimation models, tailored for various platforms. These models help detect and utilize available network bandwidth, optimizing video streaming quality without compromising the smoothness of video calls.  </p>
<p>These models need to be highly accurate and run as efficiently as possible. By leveraging ExecuTorch, we have observed improvements for the bandwidth estimation models in performance, reliability, and efficiency metrics. Specifically, we reduced the model load time and average inference time substantially while reducing app not responsive (ANR) metrics. Along the way, we further strengthened  security guarantees compared to the older PyTorch mobile framework by adding <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing tests</a>, which involve supplying invalid or random inputs to a program and monitoring for exceptions. With the positive signal from these releases, we are now migrating several other key WhatsApp models, such as ones for on-device noise-canceling and video enhancement, to ExecuTorch as well. </p>
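<p>The fuzzing approach mentioned above can be sketched in miniature. This is a hedged, self-contained illustration, not WhatsApp’s actual harness: <code>estimate_bandwidth</code> is a made-up stand-in for a model entry point, and the property being checked is that invalid or random inputs only ever produce a controlled error, never a crash.</p>

```python
import random

def estimate_bandwidth(samples):
    # Toy stand-in for a model entry point: validate, then average.
    if not isinstance(samples, list) or not samples:
        raise ValueError("samples must be a non-empty list")
    if any(not isinstance(s, (int, float)) for s in samples):
        raise ValueError("samples must be numbers")
    return sum(samples) / len(samples)

def fuzz(iterations=500, seed=0):
    # Supply invalid or random inputs; anything other than the
    # controlled ValueError escapes and fails the fuzz run.
    rng = random.Random(seed)
    corpus = [None, [], "junk", {"a": 1}, [1.0, "x"], [1e9] * 8]
    for _ in range(iterations):
        case = rng.choice(
            corpus + [[rng.uniform(-1e6, 1e6) for _ in range(rng.randrange(5))]]
        )
        try:
            estimate_bandwidth(case)
        except ValueError:
            pass  # rejecting bad input cleanly is the expected path
    return True
```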
<figure id="attachment_22761" aria-describedby="caption-attachment-22761" class="wp-caption alignright c3"><img class="wp-image-22761" src="https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png" alt="" width="277" height="600" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png 923w, https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png?resize=423,916 423w, https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png?resize=768,1663 768w, https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png?resize=473,1024 473w, https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png?resize=709,1536 709w, https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png?resize=96,208 96w, https://engineering.fb.com/wp-content/uploads/2025/07/ExecuTorch-Messenger-Language-Identification-Model-LiD.png?resize=192,416 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22761" class="wp-caption-text">Here, Messenger’s language identification model (LID) restricts the prompt language to English for Meta AI’s Imagine feature.</figcaption></figure><h2>Shipping on-device ML for end-to-end encryption on Messenger</h2>
<p><a href="https://about.fb.com/news/2024/03/end-to-end-encryption-on-messenger-explained/" target="_blank" rel="noopener">End-to-end encryption (E2EE) on Messenger</a> ensures that no one except you and the people you’re talking to can see your messages, not even Meta. ExecuTorch has enabled E2EE on Messenger by moving server side models to run on-device, allowing data transfers to remain encrypted.</p>
<p>To enable E2EE, we migrated and deployed several models, including an on-device language identification (LID) model on Messenger. LID is a Messenger model that detects the language of given text and enables various downstream tasks, including translation, message summarization, and personalized content recommendations. With ExecuTorch, on-device LID is significantly faster and conserves server and network capacity. </p>
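<p>To make the LID task concrete, here is a deliberately tiny, hypothetical stand-in. Messenger’s real model is a learned on-device classifier; this toy merely scores text against small stopword sets, purely to illustrate the input/output shape of the task (text in, language code out).</p>

```python
# Hypothetical toy LID: not Messenger's model, only an illustration.
STOPWORDS = {
    "en": {"the", "and", "is", "to", "of", "you"},
    "es": {"el", "la", "y", "es", "de", "que"},
    "fr": {"le", "les", "et", "est", "dans", "que"},
}

def identify_language(text):
    tokens = set(text.lower().split())
    # Pick the language whose stopword set overlaps the text the most.
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)
```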
<p>To preserve Messenger’s E2EE environment, we have also leveraged ExecuTorch to move other Messenger models on-device, including one for optimizing video calling quality (similar to WhatsApp’s bandwidth estimation models) and another for image cutouts (similar to Cutouts on Instagram). These shifts resulted in improved infrastructure efficiency by freeing up capacity and enabling us to scale these features globally. </p>
<h2>Background music recommendations for Facebook</h2>
<p>Facebook employs a core AI model called SceneX that performs a variety of tasks, including image recognition/categorization, captioning, creating AI-generated backgrounds for images, and image safety checks. Shifting SceneX to ExecuTorch now allows it to enhance people’s Facebook Stories by suggesting background music based on images.</p>
<p>With the ExecuTorch rollout, we saw performance improvements in SceneX across the board from low- to high-end devices compared to the older stack. Several other models, including ones that enhance image quality and reduce background noise during calls, are now in various stages of A/B testing. </p>
<h2>Building the future of on-device AI with the ExecuTorch Community</h2>
<p>We hope the results we’ve seen leveraging ExecuTorch to help solve some of Meta’s on-device ML challenges at scale will be encouraging to the rest of the industry. <a href="https://github.com/pytorch/executorch/blob/main/CONTRIBUTING.md" target="_blank" rel="noopener">We invite you to contribute to ExecuTorch</a> and share feedback on our <a href="https://github.com/pytorch/executorch/blob/main/CONTRIBUTING.md" target="_blank" rel="noopener">GitHub page</a>. You can also join our growing community on the <a href="https://discord.gg/74dmqtAQQs" target="_blank" rel="noopener">ExecuTorch Discord server</a>.</p>
<p>We look forward to driving more innovation in on-device ML and shaping the future of on-device AI together with the community.</p>]]></description>
      <link>https://engineering.fb.com/2025/07/28/android/executorch-on-device-ml-meta-family-of-apps/</link>
      <guid>https://engineering.fb.com/2025/07/28/android/executorch-on-device-ml-meta-family-of-apps/</guid>
      <pubDate>Mon, 28 Jul 2025 22:30:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Policy Zones: How Meta enforces purpose limitation at scale in batch processing systems]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Meta has developed <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">Privacy Aware Infrastructure (PAI)</a> and <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/">Policy Zones</a> to enforce purpose limitations on data, especially in large-scale batch processing systems. </li>
<li class="c1" aria-level="1">Policy Zones integrates with Meta’s <strong><em>exabyte-scale</em></strong> data warehouse and processing systems, using runtime enforcement and SQL parsing to propagate and enforce privacy annotations across <strong><em>millions</em></strong> of data flows per day, performing <strong><em>trillions</em></strong> of user consent checks per hour, and through our stream processing systems, which transport multiple <strong><em>petabytes per hour</em></strong>.</li>
<li class="c1" aria-level="1">We’ve built tools to help engineers use Policy Zones, so that they can quickly respond to privacy requirements. As a testament to its usability, these tools have allowed us to <strong>deploy Policy Zones  across data assets and processors in our batch processing systems</strong>. </li>
<li class="c1" aria-level="1">Policy Zones technology is used at scale in batch processing systems to meet our privacy commitments to our users across Meta’s family of apps. As its use grows, we are continuing to invest in PAI to make it even easier for our engineers to adopt. </li>
</ul><p>Meta’s <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">Privacy Aware Infrastructure (PAI)</a> is designed to streamline data flows <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">while ensuring purpose limitation</a> and transparency, leveraging automation to reduce the overhead associated with privacy requirements. This enables our engineers to focus on building innovative products that people love, while always honoring their privacy. By making privacy a core part of our infrastructure, we’re empowering product teams to create new experiences that delight our community.</p>
<p>In our previous blogs, we introduced <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">PAI</a> and its key components, including <a href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/" target="_blank" rel="noopener">data lineage</a> and <a href="https://engineering.fb.com/2025/04/28/security/how-meta-understands-data-at-scale/" target="_blank" rel="noopener">data understanding</a>. These foundational elements have enabled us to effectively manage and track data flows at scale. As we moved forward to enforce purpose limitation, we recognized the need for a robust solution to control how data flows in complex systems, and remediate data flow at scale so that engineers can focus on production innovation with limited friction arising from privacy compliance. </p>
<p>In this blog, we will deep dive into our Policy Zones approach for batch processing systems and how we use it to protect users’ messaging data. These systems process data in batch (mainly via SQL), such as our exabyte data warehouse that powers Meta’s AI training and analytics workflows. <img class="alignnone size-full wp-image-22725" src="https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png" alt="" width="3812" height="1348" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png 3812w, https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png?resize=916,324 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png?resize=768,272 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png?resize=1024,362 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png?resize=1536,543 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png?resize=2048,724 2048w, https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png?resize=96,34 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Metas-AI-and-analytics-workflows_Final.png?resize=192,68 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Before Policy Zones, we relied on conventional access control mechanisms like access control lists (ACLs) to protect datasets (“assets”) when they were accessed. However, this approach requires physical, coarse-grained separation of data into distinct groupings of datasets to ensure each maintains a single purpose. While viable at a small scale, this approach leads to significant operational overhead, as it requires frequent and exhaustive audits of many individual assets to ensure that each privacy control remains continuously valid.</p>
<h2>Data flow control for batch processing systems via Policy Zones</h2>
<p>To mitigate the challenges associated with coarse-grained physical data separation, we have invested in Policy Zones as a key component of our PAI strategy. It leverages <a href="https://dl.acm.org/doi/10.1145/360051.360056" target="_blank" rel="noopener">Information Flow Control (IFC)</a> principles to offer a <a href="https://dl.acm.org/doi/10.1145/363516.363526" target="_blank" rel="noopener">more durable and sustainable approach</a> by controlling not only how data is accessed but also how data is processed and transferred in real time. We developed tools and APIs that let developers easily integrate Policy Zones into their code; Policy Zones then <strong>automatically</strong> tracks and protects data flows by <strong>enforcing flow restrictions at runtime</strong>. To maintain data integrity, Policy Zones enforces a fundamental principle: The restrictions on downstream data must be equal to or more restrictive than those of the upstream source from which the data originates. Once data is protected by Policy Zones, any future processing or usage of that data has to be compatible with the restrictions or it will be blocked.</p>
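<p>The fundamental principle above can be stated in a few lines of code. This is a minimal sketch under an assumed set-valued representation of annotations, not Meta’s implementation: a flow is allowed only when the destination carries at least every restriction found on the sources.</p>

```python
def flow_allowed(source_annotation_sets, dest_annotations):
    # The union of all upstream restrictions must be a subset of the
    # destination's restrictions, i.e., downstream is at least as
    # restrictive as upstream.
    required = set().union(*source_annotation_sets) if source_annotation_sets else set()
    return required <= set(dest_annotations)
```

<p>For example, a job reading a table annotated MESSAGING_DATA may write only to outputs that also carry MESSAGING_DATA; a write to an unannotated table would be blocked.</p>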
<p>Meta’s data warehouse is a critical component of our data processing infrastructure, supporting various workloads such as batch analytics, real-time processing, and machine learning. Engineers have developed numerous data processing systems to cater to different usage patterns, resulting in millions of jobs running daily to process and transform data. Policy Zones operates at tremendous scale, including:</p>
<ul><li class="c1" aria-level="1">Controlling access for millions of datasets,</li>
<li class="c1" aria-level="1">Analyzing the processing of tens of millions of data flows per day across hundreds of thousands of unique queries,</li>
<li class="c1" aria-level="1">Performing trillions of batch user consent checks per hour across datasets that span different purpose-use boundaries,</li>
<li class="c1" aria-level="1">Handling hundreds of distinct data policy requirements for any given flow.</li>
</ul><p>The intricate relationships between datasets are exemplified in the following diagram, which depicts a single deployment of Policy Zones enforcing a purpose-use limitation on a subset of data processing within the warehouse. This visual representation highlights the complexity of data dependencies and the need for robust policy enforcement mechanisms. Each dot represents a single dataset, and each line between them represents a data dependency. In other words, in order to compute a given dataset (represented by dots), you would need to use all of the datasets that are connected to it (represented by lines). </p>
<p><img class="alignnone wp-image-22727 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/A-Single-Policy-Zone_Final-B.png" alt="" width="800" height="798" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/A-Single-Policy-Zone_Final-B.png 800w, https://engineering.fb.com/wp-content/uploads/2025/07/A-Single-Policy-Zone_Final-B.png?resize=768,766 768w, https://engineering.fb.com/wp-content/uploads/2025/07/A-Single-Policy-Zone_Final-B.png?resize=96,96 96w, https://engineering.fb.com/wp-content/uploads/2025/07/A-Single-Policy-Zone_Final-B.png?resize=192,192 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
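<p>The dots-and-lines picture above is a dependency graph, and “computing a dataset requires everything connected upstream of it” is a transitive closure over that graph. A small sketch with invented dataset names:</p>

```python
from collections import deque

# Invented dataset names; each entry lists the datasets a given
# dataset is computed from (the "lines" in the diagram above).
DEPS = {
    "engagement_report": ["daily_messages_sent", "daily_active_users"],
    "daily_messages_sent": ["messages_metadata"],
    "messages_metadata": ["messages_log"],
    "daily_active_users": [],
    "messages_log": [],
}

def upstream_closure(dataset):
    """All datasets transitively required to compute `dataset`."""
    seen, queue = set(), deque([dataset])
    while queue:
        for dep in DEPS.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

<p>A purpose-use limitation has to hold across the entire closure of every protected dataset, which is why fine-grained, automatic flow tracking scales better than auditing each asset by hand.</p>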
<p>At first glance, the intricate web of data dependencies may seem daunting to manage. However, Policy Zones is designed to track and enforce policies across these complex relationships. While compartmentalizing such a complex system can be resource intensive, Policy Zones offers a more efficient and effective solution for managing data dependencies and ensuring privacy requirements. To address the challenges we’ve faced over the years, we’ve had to develop innovative solutions for our batch processing systems, which are essential for managing the vast amounts of data that flow through our systems. The table below describes the key challenges and our approaches to solving them.</p>
<table border="1"><tbody><tr><td class="c2"><strong>Challenge</strong></td>
<td class="c2"><strong>Approach</strong></td>
</tr><tr><td><strong>Coarse-grained data separation to compartmentalize purpose use:</strong> A common strategy for managing distinct purposes is to separate data and its processing entirely, a technique known as data compartmentalization. However, this approach can be difficult to implement due to the intricate web of data dependencies that exist within our systems.</td>
<td><strong>Fine-grained information flow tracking:</strong> We track how data flows to ensure that the restrictions are at least as restrictive as the sources used to populate the output datasets. As a result, engineers do not need to coarsely compartmentalize their data. Fine-grained tracking allows us to more efficiently profile risk without needing to separate data and its processing to specific purposes.</td>
</tr><tr><td><strong>Overly conservative labeling of data (label creep):</strong> By default, any incidental access of purpose-use limited  data results in all of the derived datasets needing to be purpose-use limited, even if the access is spurious. We need a way to stop propagation (called <em>reclassification</em>) of sensitive data labels when the data is transformed to no longer be sensitive.</td>
<td><strong>Policy Zone Manager (PZM):</strong> We built a suite of tools that aids in carefully propagating purpose-use limitations that will identify potential over-labeling situations; these are controlled through a reclassification system, which allow engineers to safely stop propagation. </td>
</tr><tr><td><strong>Lack of governance, extensible data model:</strong> There are numerous internal data policies and individual privacy controls active at any given time, with new policies being created regularly by various public commitment-oriented teams. These teams need to have strong controls over how their data policies are being enforced. It’s also critical that each policy operates independently from other policies due to the different stages of rollout each policy is in.</td>
<td><strong>Governable Data Annotations</strong> (GDAs) are precise, governed annotations on datasets that describe the kinds of data that are subject to purpose-use limitations. Their entire lifecycle is subject to precise controls; they limit who can create them, who can associate an annotation with a dataset, and who can remove an annotation, among other controls. The annotation labels are human readable, e.g., MESSAGING_DATA describes user data from a messaging context.</td>
</tr></tbody></table><p><br />
Below we will describe how we scaled out Policy Zones in batch processing systems via a walkthrough of one of the ways we protect messaging data across Meta’s family of apps.</p>
<p><img class="alignnone size-full wp-image-22728" src="https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png" alt="" width="1920" height="1237" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png?resize=916,590 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png?resize=768,495 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png?resize=1024,660 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png?resize=1536,990 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Walkthrough_-Protecting-Messaging-Data-_Final.png?resize=192,124 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>We’ll walk through how we protect users’ messaging data with Policy Zones across our batch processing systems. Users send messages to others through apps like Facebook Messenger. <a href="https://engineering.fb.com/2023/12/06/security/building-end-to-end-security-for-messenger/" target="_blank" rel="noopener">Messenger supports end-to-end encryption</a>. Additionally, to support platform integrity and reliability, we process certain non-content messaging data, such as delivery timestamps and status, to improve product performance, detect abuse, and protect users from harmful conduct.</p>
<p>Messaging data is collected from these apps and enters into our data warehouse and AI systems via <a href="https://engineering.fb.com/2022/11/09/developer-tools/tulip-schematizing-metas-data-platform/" target="_blank" rel="noopener">logging</a> or database scrapes from web systems. It is streamed through our message queue, <a href="https://engineering.fb.com/2019/10/07/core-infra/scribe/" target="_blank" rel="noopener">Scribe</a>, where it can be processed in real time or stored in a time-partitioned dataset for asynchronous batch processing.</p>
<p><img class="alignnone wp-image-22729 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png" alt="" width="1920" height="1629" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png?resize=916,777 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png?resize=768,652 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png?resize=1024,869 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png?resize=1536,1303 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png?resize=96,81 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Online-Systems_Final-1.png?resize=192,163 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>The logging libraries are configured in a fluent builder pattern. Below is a snippet that shows how a logger is configured to log messaging metadata. The key element is that the logger is associated with a Policy Zones annotation; in this blog post we call it MESSAGING_DATA. This annotation is called a <strong>Governable Data Annotation</strong> (GDA). GDAs are simple, human-readable labels that affect the behavior of access on the dataset. GDAs have controls on their lifecycle that ensure data policies are upheld. In the representative code snippet below, the annotation on the logger restricts where the data can flow: in particular, it can only flow to other datasets that carry this annotation, and access to it is restricted to allowed purposes defined in a separate central configuration.</p>
<p><img class="alignnone wp-image-22730 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-governable-data-annotation.png" alt="" width="1340" height="718" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-governable-data-annotation.png 1340w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-governable-data-annotation.png?resize=916,491 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-governable-data-annotation.png?resize=768,412 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-governable-data-annotation.png?resize=1024,549 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-governable-data-annotation.png?resize=96,51 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-governable-data-annotation.png?resize=192,103 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
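<p>Since the snippet is shown as an image, here is a hypothetical Python rendering of the same fluent-builder idea (the real configuration language is Meta-internal, and all names below are invented). The essential point is that the GDA is attached to the logger definition itself, so it travels with every dataset the logger produces.</p>

```python
class LoggerConfig:
    # Hypothetical fluent builder: each method returns self so calls chain.
    def __init__(self, name):
        self.name = name
        self.fields = []
        self.annotations = []

    def add_field(self, field_name):
        self.fields.append(field_name)
        return self

    def with_data_annotation(self, gda):
        self.annotations.append(gda)
        return self

messages_logger = (
    LoggerConfig("messages_metadata")
    .add_field("delivery_timestamp")
    .add_field("delivery_status")
    .with_data_annotation("MESSAGING_DATA")
)
```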
<p>The above annotation on the logger will trigger Policy Zones’ infrastructure to impose certain restrictions. One key requirement is that downstream data assets, which rely on this logger’s data, must also carry the same annotation. This is done by leveraging Policy Zones’ flow control mechanisms that reason about how data flows through our systems. A processor can access a dataset annotated with a GDA only if the Policy Zones infrastructure has checked the flow of data. </p>
<p>The logger config code snippet above generates code that writes data to a corresponding Scribe message queue category from our web servers. Policy Zones verifies that the messaging GDA is associated with this downstream Scribe category, ensuring compliant data flow (see the corresponding flow below).</p>
<p><img class="alignnone wp-image-22731 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png" alt="" width="3448" height="2182" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png 3448w, https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png?resize=916,580 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png?resize=768,486 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png?resize=1024,648 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png?resize=1536,972 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png?resize=2048,1296 2048w, https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png?resize=96,61 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Compliant-data-corresponding-flow_final.png?resize=192,122 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Scribe category data is processed by our stream processing systems and ingested into the warehouse in the form of time-partitioned datasets. These time-partitioned datasets are used by our batch processing systems to compute derived datasets that support product analytics, machine learning, and operational monitoring, among other uses. </p>
<h2>Enforcing zones as data flows through the warehouse</h2>
<p>In the next section, we’ll explore how Policy Zones enforces purpose-use limitation in the warehouse. Data processing within the data warehouse is typically represented using SQL, which defines how to store, transform, and retrieve relational data. SQL’s declarative nature and robust support for relational data processing enables users to write large-scale data processing jobs that can handle petabytes of data with minimal code. These attributes significantly enhance the efficiency and effectiveness of privacy-related tasks, allowing for scalable, policy-compliant data processing across our infrastructure. The most popular warehouse processors, like <a href="https://research.facebook.com/publications/presto-sql-on-everything/" target="_blank" rel="noopener">Presto</a>, are SQL-based. </p>
<p>Scheduling these queries is done through our distributed job scheduling framework, <a href="https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/">Dataswarm</a>. Users specify the frequency of runs and what data they depend on. Since our data is primarily time-partitioned, job schedules mirror the partitioning scheme of the data: jobs start in a waiting state and then run as soon as new time partitions become available. A representative example Dataswarm pipeline that calculates daily messages sent is shown below. It reads an input messages-metadata logger dataset (described above), transforms that data, and then writes it into an output messages_sent table.</p>
<p><img class="alignnone wp-image-22732 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-Dataswarm-query.png" alt="" width="1246" height="760" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-Dataswarm-query.png 1246w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-Dataswarm-query.png?resize=916,559 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-Dataswarm-query.png?resize=768,468 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-Dataswarm-query.png?resize=1024,625 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-Dataswarm-query.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-Dataswarm-query.png?resize=192,117 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>The example query in the Dataswarm pipeline above illustrates how to compute a derived dataset, calculating the daily number of messages sent per user. This concise, templatized SQL statement is processed by Dataswarm, which generates a fully expanded SQL statement that Presto then interprets to initiate a distributed job across thousands of machines. By abstracting away the execution details, engineers can focus on defining high-level transformations, simplifying the development process and improving productivity.</p>
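<p>To give a rough sense of the templatization step (with invented names and macro syntax; the real Dataswarm syntax differs): the pipeline author writes one parameterized statement, and the scheduler expands the date macro per partition before handing the SQL to Presto.</p>

```python
# Illustrative only: "<DATEID>" stands in for the scheduler's date macro.
PIPELINE_SQL = """
INSERT INTO messages_sent PARTITION (ds = '<DATEID>')
SELECT user_id, COUNT(*) AS num_messages_sent
FROM messages_metadata
WHERE ds = '<DATEID>'
GROUP BY user_id
"""

def expand(sql, dateid):
    # Stand-in for the macro expansion performed by the scheduler.
    return sql.replace("<DATEID>", dateid)
```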
<p><img class="alignnone wp-image-22733 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png" alt="" width="3786" height="1638" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png 3786w, https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png?resize=916,396 916w, https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png?resize=768,332 768w, https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png?resize=1024,443 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png?resize=1536,665 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png?resize=2048,886 2048w, https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png?resize=96,42 96w, https://engineering.fb.com/wp-content/uploads/2025/07/High-level-transformations_final.png?resize=192,83 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>We built <a href="https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/" target="_blank" rel="noopener">Unified Programming Model (UPM)</a>, a SQL parser that intercepts queries issued by various data processors and translates them into semantic trees. These trees capture the inputs, outputs, and transformations of each data movement step, providing the necessary signals for precise policy enforcement as data flows through the system.</p>
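<p>As a toy illustration of the signal UPM surfaces (UPM itself builds full semantic trees from real SQL dialects; a regex over one narrow query shape is only for intuition), the flow checker ultimately needs each step’s input and output tables:</p>

```python
import re

def extract_tables(sql):
    # Naive extraction for simple single-statement queries; a real
    # parser like UPM builds a semantic tree of the query instead.
    outputs = re.findall(r"INSERT\s+INTO\s+(\w+)", sql, re.IGNORECASE)
    inputs = re.findall(r"\bFROM\s+(\w+)", sql, re.IGNORECASE)
    return {"inputs": inputs, "outputs": outputs}
```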
<p>As shown in the diagram below, the process begins with data processors issuing SQL queries. UPM parses those queries and sends the information about the transformation to the first key piece of Policy Zones infrastructure: the <strong>Policy Evaluation Service (PES)</strong>. PES performs flow control checks, validating whether the data movement and transformation steps comply with privacy policies. The diagram below shows how Policy Zones integrates with the existing batch processing infrastructure.</p>
<p><img class="alignnone size-full wp-image-22744" src="https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png" alt="" width="1920" height="1109" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png?resize=916,529 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png?resize=768,444 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png?resize=1024,591 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png?resize=1536,887 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png?resize=96,55 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Policy-zones-architecture_Final.png?resize=192,111 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>If a flow is allowed, PES passes the decisions to the compute engines, which then perform the actual low-level data accesses. PES also forwards the checking results to the <strong>Warehouse Permission Service (WPS)</strong>, which performs a final validation, ensuring that access to warehouse data is reasoned about in a policy-aware manner. WPS was built to service traditional access control. We have since augmented its abilities to also ensure safe flows according to the GDAs annotated on the accessed datasets. It does this through propagation of a special token (depicted above with a key icon). PES issues the key to the client, which contains cryptographically signed contextual information. That key is then forwarded through the compute engines and passed in at time of access. This allows WPS to have enough context to reason about the access in the greater context of the overall flow. To illustrate this change, let’s look at how WPS has changed with the integration of Policy Zones. Historically, WPS only received individual access requests (e.g., “read table A by identity X,” or separately, “write table B by identity Y”). With our Policy Zones integration, WPS can now see additional information such as, “read table A by identity X <em>and</em> PES says the read satisfies the GDA flow safety requirements on the MESSAGING_DATA GDA.” </p>
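<p>The token hand-off can be sketched as follows. The HMAC-based signing, key handling, and context fields here are illustrative assumptions for the sketch, not Meta's actual scheme:</p>

```python
import hashlib
import hmac
import json

# Illustrative shared signing key; a real deployment would use proper
# key management rather than a hardcoded secret.
SECRET = b"pes-signing-key"

def pes_issue_token(context: dict):
    """PES side: serialize the flow-check context and sign it, so that a
    downstream validator can trust the flow was already vetted."""
    payload = json.dumps(context, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload, sig

def wps_validate(payload: bytes, sig: str, table: str, identity: str) -> bool:
    """WPS side: verify the token's signature, then check that the
    individual access matches the context PES reasoned about."""
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False
    ctx = json.loads(payload)
    return ctx["table"] == table and ctx["identity"] == identity and ctx["flow_safe"]

payload, sig = pes_issue_token(
    {"table": "message_metadata", "identity": "X",
     "gda": "MESSAGING_DATA", "flow_safe": True}
)
ok = wps_validate(payload, sig, "message_metadata", "X")
```

<p>The important property is that the validator no longer sees an access in isolation: the signed context travels with the request through the compute engines.</p>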
<p>Processors integrate against Policy Zones through a client-side library we refer to as <em>PrivacyLib</em>. PrivacyLib abstracts the coordination logic, operational monitoring, and service calls, separating data-processing business logic from privacy checking.</p>
<p><strong>Stream processing.</strong> Up until this point, we’ve described how data flows from web server frontends to time-partitioned datasets in the warehouse via an ingestion system that reads from Scribe categories. However, there are also real-time stream processing systems that operate directly against the Scribe categories, rather than working on the bulk time-partitioned datasets. These systems are often used in latency-sensitive applications where we need to compute results quickly from large datasets (e.g., categorizing newly created users originating from a spam bot based on their recent events to detect terms of service violations).</p>
<p>Policy Zones are also integrated into these systems in much the same way. <a href="https://research.facebook.com/publications/realtime-data-processing-at-facebook/" target="_blank" rel="noopener">XStream</a> is our next-generation stream processing system that provides a SQL-like interface for defining streaming data transformations. We use the same UPM parser to determine safe data flows. Key to scaling and reliability is that we analyze the streaming application statically before it starts processing events. This is made possible by XStream’s declarative data transformation programming model. Other real-time processing systems are handled by Policy Zones for function-based systems, which was alluded to in our <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">earlier blog post</a> about Policy Zones.</p>
<p>The critical component in allowing fine-grained data separation is the Flows-To evaluator in PES. After the source-sink data dependency information is extracted, PES determines if the flow of data is permissible by using information-flow control-theoretic checks. Some of these checks can include ensuring a consent check was performed by the data processor (e.g., to ensure the user permits their data to be used for a certain purpose), or that the GDA from the source tables are also on the destination table. These checks can be modeled as a lattice where the nodes represent different GDA labeling states and purpose use, and the edges between them represent allowed (safe) transitions. The code snippet below shows the logic of one of the functions that performs our purpose-use checks for GDAs, <a href="https://engineering.fb.com/2021/04/29/developer-tools/rust/">written in Rust</a>, with some modifications made for clarity.</p>
<figure id="attachment_22734" aria-describedby="caption-attachment-22734" class="wp-caption alignnone c3"><img class="wp-image-22734 size-large" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png?w=1024" alt="" width="1024" height="880" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png 1564w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png?resize=916,787 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png?resize=768,660 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png?resize=1024,880 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png?resize=1536,1320 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png?resize=96,82 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-purpose-use-check.png?resize=192,165 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22734" class="wp-caption-text"><strong>Code snippet</strong>: One of PES’s core functions for checking that the purpose of access is allowed by the GDA, checked by our lattice-theoretic Flows-To checker. The function first collects all of the purpose limitations from all of the source datasets being accessed. It then loops through each GDA’s requirements to see if the allowed purposes of the GDA satisfies our flows-to checker on the intended consumption purpose. If not, the flow is marked as unsafe.</figcaption></figure><p>Putting the pieces together, PES and WPS integrate into our batch processing systems to allow seamless flow safety checks against our purpose-use limitation requirements. 
Instead of traditional coarse-grained data separation, engineers can write batch processing queries that access datasets with heterogeneous purpose-use requirements all in the same warehouse. People do not necessarily need to request permission to special purpose-use limited silos of the warehouse as Policy Zones can ensure the data is protected despite being commingled with other non-purpose-use restricted datasets. This unique ability is particularly useful for machine learning workflows, which we discuss in the next section.</p>
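<p>The purpose-use check described in the figure caption above can be sketched in a few lines of Python (the real implementation is in Rust, and the GDA registry and names here are invented):</p>

```python
# Hypothetical registry: each GDA maps to the consumption purposes it permits.
GDA_ALLOWED_PURPOSES = {
    "MESSAGING_DATA": {"spam_filtering", "safety_integrity"},
    "LOCATION_DATA": {"maps_features"},
}

def purpose_use_check(source_gdas, intended_purpose):
    """Flows-To-style check, loosely following the snippet's description:
    gather every GDA on every source dataset, then require that each one
    allows the intended consumption purpose. Any GDA that does not allow
    the purpose makes the whole flow unsafe."""
    all_gdas = set().union(*source_gdas) if source_gdas else set()
    for gda in all_gdas:
        if intended_purpose not in GDA_ALLOWED_PURPOSES.get(gda, set()):
            return False  # this GDA's purpose limitation blocks the flow
    return True

safe = purpose_use_check([{"MESSAGING_DATA"}], "spam_filtering")
unsafe = purpose_use_check([{"MESSAGING_DATA"}, {"LOCATION_DATA"}], "spam_filtering")
```

<p>Modeling the states and allowed transitions this way is what lets heterogeneously annotated datasets safely share one warehouse.</p>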
<h2>Enforcing zones for AI training workflows</h2>
<p>Non-content messaging data is used to train models, such as spam filters, to identify and block unwanted or malicious messages, ensuring a safe user experience. PES is integrated directly into the APIs used by workflows for reading and writing data and models, enforcing strong data usage protections. The diagram below shows the component architecture of AI training at Meta, how Policy Zones integrates with it, and the main data flows through the ML stack. Key to our architecture is that PES integrates principally at the control plane of AI training. To build intuition from the earlier section on general-purpose warehouse processing, this is analogous to checking SQL statements rather than checking individual rows being accessed.</p>
<p><img class="alignnone size-full wp-image-22735" src="https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png" alt="" width="3840" height="2244" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png 3840w, https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png?resize=916,535 916w, https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png?resize=768,449 768w, https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png?resize=1024,598 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png?resize=1536,898 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png?resize=2048,1197 2048w, https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/07/ML-training-data-flows-_Final.png?resize=192,112 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Machine learning training workflows are defined by user-authored scripts, which can be created using internal authoring tools, <a href="https://engineering.fb.com/2024/02/12/developer-tools/meta-loves-python/" target="_blank" rel="noopener">all of which utilize Python</a>. Users also have the option to directly write custom Python code to interact with workflow scheduling tools like <a href="https://engineering.fb.com/2016/05/09/core-infra/introducing-fblearner-flow-facebook-s-ai-backbone/" target="_blank" rel="noopener">FBLearner</a>. During the training process, large-scale dataframes are loaded into the training workflow. These dataframes can be sourced from Data Warehouse or directly from real-time batch services.</p>
<p>In scenarios involving distributed training, intermediate storage like temporary tables are used to temporarily store data outputs between operators. The resulting models are stored in the model storage system. For tasks such as transfer learning or recurring/continuous training, these models can be retrieved from the model storage and reintroduced into the training workflow for incremental updates.</p>
<p>Workflows can be annotated with purpose-use requirements in the following ways.</p>
<ul><li class="c1" aria-level="1"><strong>Automatic inference:</strong> PES automatically infers annotations from upstream data dependencies and applies them to the current workflow and all downstream dependent models or assets, provided there are no conflicts. </li>
<li class="c1" aria-level="1"><strong>Manual override:</strong> Users can manually override the inferred annotations when authoring workflows or in the model type linked to the workflows. The “Model Type” is a widely used concept at Meta to describe the clearly delineated business purpose for the machine learning work.</li>
</ul><p>Below is a representative code example defining a training workflow:</p>
<p><img class="alignnone wp-image-22736 size-large" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-defining-training-workflow.png?w=1024" alt="" width="1024" height="825" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-defining-training-workflow.png 1058w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-defining-training-workflow.png?resize=916,738 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-defining-training-workflow.png?resize=768,618 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-defining-training-workflow.png?resize=1024,825 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-defining-training-workflow.png?resize=96,77 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-Policy-Zones-defining-training-workflow.png?resize=192,155 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
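<p>As a rough textual stand-in for the screenshot, a workflow definition of this kind might look like the following Python sketch; the decorator, registry, and training body are invented and do not reflect FBLearner’s actual API:</p>

```python
# Invented registry associating workflows with a model type and its GDAs.
WORKFLOW_REGISTRY = {}

def training_workflow(model_type, gdas):
    """Hypothetical decorator: register a workflow together with the model
    type it produces and the GDAs governing the data it may consume."""
    def wrap(fn):
        WORKFLOW_REGISTRY[fn.__name__] = {
            "model_type": model_type, "gdas": set(gdas), "fn": fn,
        }
        return fn
    return wrap

@training_workflow(model_type="messaging_spam_filter", gdas={"MESSAGING_DATA"})
def train_spam_filter(dataset_rows):
    # Stand-in for a real training loop: count messages per sender.
    counts = {}
    for row in dataset_rows:
        counts[row["sender"]] = counts.get(row["sender"], 0) + 1
    return counts

model = train_spam_filter([{"sender": "a"}, {"sender": "a"}, {"sender": "b"}])
```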
<p>We associate model types with a GDA. The following shows the configuration information for the messaging_spam_filter model type; note that it is annotated with the MESSAGING_DATA GDA.</p>
<p><img class="alignnone size-full wp-image-22737" src="https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png" alt="" width="1920" height="1421" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png?resize=916,678 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png?resize=768,568 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png?resize=1024,758 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png?resize=1536,1137 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png?resize=96,71 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-spam-filter_final.png?resize=192,142 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>At run time, we associate all accesses during the workflow with a model type and ensure that the assets being written to also carry the GDA. PES is integrated into various data reading and writing APIs within the AI training stack to accommodate this capability. When a workflow reads data, it retrieves the data’s annotations, and the first annotation is applied to the workflow. When the workflow outputs data, the output is annotated with the current workflow’s annotation, including any intermediate datasets.</p>
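<p>This run-time propagation of annotations through reads and writes can be sketched as follows; the class and method names are illustrative, not the real APIs:</p>

```python
class ZonedWorkflow:
    """Sketch of run-time annotation propagation: reading a dataset pulls
    its GDAs onto the workflow, and every output the workflow writes
    (including intermediate tables) inherits the workflow's GDAs."""

    def __init__(self):
        self.gdas = set()

    def read(self, dataset):
        # dataset is a (rows, gdas) pair in this sketch.
        rows, gdas = dataset
        self.gdas |= gdas  # annotations flow onto the workflow
        return rows

    def write(self, rows):
        # Outputs are tagged with the workflow's accumulated annotations.
        return (rows, set(self.gdas))

wf = ZonedWorkflow()
rows = wf.read(([1, 2, 3], {"MESSAGING_DATA"}))
out = wf.write([sum(rows)])
```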
<h2>How Policy Zones are applied reliably at scale</h2>
<p>Policy Zones Manager (PZM) enables engineers to reliably integrate Policy Zones into existing data processing and propagate them to new processing code. PZM supports two major workflows: applying zones to existing processing, and propagating zones from new processing. Although many components are shared between these two workflows, the overall experience is quite different for engineers.</p>
<p><strong>Applying zones to existing processing</strong>. PZM allows engineers to seed a proposed annotation on a dataset (e.g., the logger from the beginning of the blog post) to understand the downstream implications. Since Policy Zones is an enforcement mechanism, care must be taken in applying GDAs as it may break production workflows. PZM will guide an engineer trying to add a GDA through the right steps to avoid any production breakage. It does this by <em>simulating</em> the potential effects of enforcement that comes from the new GDA labeling, and then <em>suspending</em> flows that would break. These suspensions are then tracked and burned down by the engineer to ensure complete end-to-end compliance with the GDA’s purpose-use requirements.</p>
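<p>The simulate-then-suspend approach can be sketched as follows, with invented data structures standing in for PZM’s flow records:</p>

```python
def simulate_enforcement(flows, newly_annotated_tables):
    """Before enforcing a newly seeded GDA, replay known flows and flag
    (rather than break) the ones that would fail once enforcement turns
    on: flows that read a newly protected table but write to a sink that
    does not yet carry the annotation."""
    suspensions = []
    for flow in flows:
        reads_protected = any(s in newly_annotated_tables for s in flow["sources"])
        sink_annotated = flow["sink"] in newly_annotated_tables
        if reads_protected and not sink_annotated:
            suspensions.append(flow["name"])  # would break under enforcement
    return suspensions

flows = [
    {"name": "daily_counts", "sources": ["message_metadata"], "sink": "messages_sent"},
    {"name": "unrelated", "sources": ["page_views"], "sink": "view_stats"},
]
to_burn_down = simulate_enforcement(flows, {"message_metadata"})
```

<p>The returned suspensions are exactly the list an engineer would then track and burn down before enforcement goes live.</p>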
<p><strong>Propagating zones from new processing.</strong> As engineers build new processing pipelines, PZM validates the new flows and surfaces any issues detected with the data flow. As data flows through our warehouse, we need to ensure derived datasets continue to be properly annotated. When a user tries to derive new datasets from Policy Zones-protected data, the system may automatically repair the flow (e.g., by propagating the annotation when the intention is clear from context), or if the context is unclear, will present an interstitial to the user. <em>Dr. Policy Zone (Dr. PZ)</em> is a debugger tool that guides an engineer to resolve these kinds of Policy Zones errors.</p>
<p>To illustrate how Dr. PZ works, recall the example SQL statement above that computes message sends for each user. The query read from the message_metadata table and wrote to the messages_sent table. If the output table does not have the right set of GDAs, the user is presented with an error message and ways to fix their in-development pipeline. We use generative AI to simplify the explanation to the user and to provide some remediation guidance. The screenshot below shows an example of a dialog an engineer would interact with in Dr. PZ.</p>
<p><img class="alignnone wp-image-22752 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png" alt="" width="1920" height="1173" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png?resize=916,560 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png?resize=768,469 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png?resize=1024,626 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png?resize=1536,938 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Messaging-content_final-1.png?resize=192,117 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p><strong>Reclassification</strong> is critical to limiting over-annotation from spurious flows. In our messaging example, reclassification allows us to stop propagating the MESSAGING_DATA GDA on an output table even if the source table has it. Reclassifications are governed by a precise set of rules that ensure the high-level data policies are not broken, and in general are controlled by separate safeguards independent from Policy Zones. Allowed reclassifications are specific to each GDA and may include: different privacy systems that Policy Zones is not natively aware of, complex privacy-preserving transformations (e.g., differentially private mechanisms), or routine review by human subject matter experts.</p>
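<p>Per-GDA reclassification rules can be sketched like this; the rule registry and mechanism names are invented for illustration:</p>

```python
# Invented per-GDA rules: each GDA lists the mechanisms that may
# legitimately stop its propagation to an output dataset.
RECLASS_RULES = {
    "MESSAGING_DATA": {"differential_privacy", "human_expert_review"},
}

def output_gdas(source_gdas, applied_mechanism=None):
    """Propagate GDAs from sources to an output, dropping any GDA whose
    rules permit reclassification via the mechanism applied to this flow.
    With no mechanism applied, every GDA propagates unchanged."""
    if applied_mechanism is None:
        return set(source_gdas)
    return {g for g in source_gdas
            if applied_mechanism not in RECLASS_RULES.get(g, set())}

kept = output_gdas({"MESSAGING_DATA"})
dropped = output_gdas({"MESSAGING_DATA"}, "differential_privacy")
```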
<h2>Learnings and challenges</h2>
<p>In our blog post <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">that introduced Policy Zones</a>, we discussed some of the high level learnings and challenges of scaling out Policy Zones. In this section, we focus on the learnings and challenges from scaling Policy Zones for batch processing.</p>
<p><strong>Opaque operators.</strong> Not all processing in the warehouse is SQL-based. An example is small-scale intermediate processing in a general-purpose programming language: Dataswarm supports PhpMethodOperator, which allows one to write arbitrary Hack code transformations on (small) warehouse datasets. For these cases, we built processor-specific integration points to capture the context of the data flow. PrivacyLib makes these integrations relatively straightforward. The major challenge we had to overcome was finding the right place to integrate Policy Zones checking. We targeted low-level data access call sites, as PrivacyLib can help stitch together data dependency information (e.g., by logging reads to correlate against future writes by a data processor).</p>
<p><strong>Reclassification versus complex data policies.</strong> Our original instantiation of policy rules was quite expressive. It allowed the formulation of intricate data flow policies. An advantage of this approach is that we did not need to use reclassification as the policy captured most of the subtle intricacies. The major disadvantage of this approach was that it was very difficult for engineers to understand and debug blocked flows. We decided to simplify our policy language to a nominal typing system of flat, hierarchy-free human-readable labels. Safe transitions could only be described through transitions from one set of GDAs to another set. We found that nuances in a data policy were better tracked by our reclassification system so engineers could generally have a simple model of the policy that worked for most data processing. </p>
<h2>The future of Policy Zones for batch processing</h2>
<p>Policy Zones enables developers to quickly innovate in our data warehouse while respecting the various privacy requirements on the data they are using. Policy Zones has hit major milestones in the warehouse, but we still have exciting opportunities ahead of us. These include:</p>
<p><strong>Reducing friction through generative AI:</strong> Navigating Policy Zones errors can be quite tricky at times. We’ve built an expert system in Dr. PZ that attempts to help engineers navigate the right remediation plan. In addition to this deterministic system, we are also experimenting with using generative AI to help a user navigate the right path and better understand why they are being blocked.</p>
<p><strong>Closing the gap on opaque operators:</strong> as mentioned in the previous section, we’ve had challenges in tracking the data dependencies in some of our processing. For the time being, we’ve resorted to traditional coarse-grained data separation and siloing processing. However, we are continuing to close this gap through improved PrivacyLib integrations to further reduce friction for engineers so they can enjoy the benefits of fine-grained data tracking.</p>
<p><strong>Seamless hand-off to Policy Zones for function-based systems:</strong> in our original blog post we described two versions of Policy Zones. This post focuses on the first, Policy Zones for batch processing systems.  A future post will focus on the second, Policy Zones for function-based systems. </p>
<p>In day-to-day usage, the end-to-end flow of data and processing touches on both of these systems. Today, we have a process to ensure that the requirements from one Policy Zones system are eventually mirrored in the other as data moves between the two. We hope to make this experience more seamless so that engineers don’t have to think about two separate runtimes.</p>
<h2>Acknowledgements</h2>
<p><em>The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in developing purpose limitation for batch processing systems over the years. In particular, we would like to extend special thanks to (in alphabetical order) Aihua Liu, Alex Ponomarenko, Alvin Wen, Andy Modell, Anuja Jaiswal, Avi Heroor, Ben Sharma, CJ Bell, Chris Green, David Taieb, Dávid Koronthály, Dino Wernli, Dong Jia, Ganapathy (G2) Krishnamoorthy, Govind Chandak, Guilherme Kunigami, Gunjan Jha, Harsha Rastogi, Ian Carmichael, Iuliu Rus, James Gill, Jon Griffin, Jerry Pan, Jesse Zhang, Jiahua Ni, Jiang Wu, Joanna Jiang, John Ahlgren, John Myles White, Judy Nash, Jun Fan, Jun Fang, Justin Slepak, Kuen Ching, Lung-Yen Chen, Manos Karpathiotakis, Marc Celani, Matt Shaer, Michael Levin, Mike Lui, Nimish Shah, Perry Stoll, Pradeep Kalipatnapu, Prashant Dhamdhere, Prashanth Bandaru, Rajesh Nishtala, Ramnath Krishna Prasad, Ramy Wassef, Robert Rusch, Ruogu Hu, Sandy Yen, Saurav Sen, Scott Renfro, Seth Silverman, Shiven Dimri, Sihui Han, Sriguru Chakravarthi, Srikanth Sastry, Sundaram Narayanan, Sushil Dhaundiyal, Tariq Sharif, Tim Nguyen, Tiziano Carotti, Thomas Lento, Tony Harper, Uday Ramesh Savagaonkar, Vlad Fedorov, Vlad Gorelik, Wolfram Schulte, Xiaotian Guo, Xuelian Long, Yanbo Xu, Yi Huang, and Zhi Han. We would also like to express our gratitude to all reviewers of this post, including (in alphabetical order) Avtar Brar, Brianna O’Steen, Chloe Lu, Chris Wiltz, Jason Hendrickson, Jordan Coupe, Morgan Guegan,  Rituraj Kirti, Supriya Anand, and Xenia Habekoss. We would like to especially thank Jonathan Bergeron for overseeing the effort and providing all of the guidance and valuable feedback, and Ramnath Krishna Prasad for pulling required support together to make this blog post happen.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/07/23/security/policy-zones-meta-purpose-limitation-batch-processing-systems/</link>
      <guid>https://engineering.fb.com/2025/07/23/security/policy-zones-meta-purpose-limitation-batch-processing-systems/</guid>
      <pubDate>Thu, 24 Jul 2025 01:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How Meta keeps its AI hardware reliable]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Hardware faults can have a significant impact on AI training and inference.</li>
<li class="c1" aria-level="1">Silent data corruptions (SDCs), undetected data errors caused by hardware, can be particularly harmful for AI systems that rely on accurate data for training as well as providing useful outputs.</li>
<li class="c1" aria-level="1">We are sharing methodologies we deploy at various scales for detecting SDC across our AI and non-AI infrastructure to help ensure the reliability of AI training and inference workloads across Meta.</li>
</ul><div class="jetpack-video-wrapper"><iframe title="AI Hardware Reliability at Scale | Sriram Sankar &amp; Harish Dixit" width="1778" height="1000" src="https://www.youtube.com/embed/4EZhnYwcPwQ?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<p>Meta’s global AI infrastructure consists of a large number of hardware components and servers, connected via network fabric across globally distributed data centers. This setup integrates storage, compute, and network architectures with unique file systems and PyTorch applications tailored for training or inference workloads. This infrastructure supports training large-scale models as well as advanced AI applications such as text-to-image generation and <a href="https://ai.meta.com/blog/instagram-edits-cutouts-segment-anything/" target="_blank" rel="noopener">object segmentation</a>.</p>
<p>Since 2018, Meta’s <a href="https://ieeexplore.ieee.org/abstract/document/8416200/" target="_blank" rel="noopener">hardware reliability journey</a> has led to <a href="https://dl.acm.org/doi/abs/10.1145/3358960.3375793" target="_blank" rel="noopener">novel findings</a>, identifying unique failure types in disks, CPUs, memories, switches, GPUs, ASICs, and networks, often leading the industry in discovering failure modes. We have developed mitigation policies to ensure smooth infrastructure operation and availability for billions of users and thousands of internal use cases. <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/" target="_blank" rel="noopener">As we continue to build large AI clusters</a>, understanding hardware failures and mitigation strategies is crucial for the reliable training of large-scale AI models.</p>
<p>Training large-scale models involves thousands of accelerators in a synchronous environment, where any component failure can interrupt or halt the process. We focus on reducing hardware failures during training through detection and diagnostics, and quickly restarting training with healthy servers and accelerators. This involves optimizing fault categorization, device triage, node selection, cluster validation, and checkpoint restore.</p>
<p>From our experience running <a href="https://arxiv.org/abs/2407.21783" target="_blank" rel="noopener">the Llama 3 herd of models</a>, we find that hardware failures in components such as SRAMs, HBMs, processing grids, and network switch hardware significantly impact AI cluster reliability, with over 66% of training interruptions due to such failures. Some of the challenges for AI clusters include accelerators that might be less reliable than CPUs due to complexity and limited telemetry, network complexity that could result in misattributed failures, and errors within the GPU software stack that may require extensive configuration to correct. Hence, reducing hardware and configuration failures greatly enhances cluster efficiency.</p>
<p><img class="alignnone size-full wp-image-22664" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png" alt="" width="1655" height="926" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png 1655w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png?resize=916,513 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png?resize=768,430 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png?resize=1024,573 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png?resize=1536,859 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-1.png?resize=192,107 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>Types of hardware faults encountered at Meta</h2>
<p>The hardware faults or errors that we observe in our infrastructure can be classified broadly into three categories: </p>
<h3>Static errors </h3>
<p>Hardware failures often appear as binary states: A device either powers on or powers off. These static errors are straightforward to identify in large-scale fleets. If devices fail to power on or enumerate, simple health checks can verify their presence and configurations. As configurations and device scales grow in large training clusters, these faults occur more frequently but are easier to triage, root-cause, and repair, making them manageable at scale. </p>
<h3>Transient errors </h3>
<p>Transient errors, categorized by their reproducibility, include load-dependent or partially observable faults, such as device issues from thermal runaway or random crashes from uncorrectable errors. Mitigation involves understanding the conditions under which they manifest; our larger scale aids in triaging and pattern matching, letting us set traps for these conditions. When triggered, devices are marked for mitigation or repair. Advances in RAS telemetry in hyperscale infrastructure have greatly improved this process. Factors including workload sensitivity, temperature range, frequency, and manufacturing parameters contribute to these errors.</p>
<p>Mitigation can also involve inducing conditions with artificial workloads in non-production stages to make faults more repeatable. Additionally, capturing transient states as “sticky” status values provides telemetry indications for hardware failures. Though less frequent than static faults and harder to detect, Meta’s scale and our significant engineering efforts have made these scenarios detectable.</p>
<h3>Silent errors </h3>
<p>Silent errors or <a href="https://engineering.fb.com/2021/02/23/data-infrastructure/silent-data-corruption/" target="_blank" rel="noopener">silent data corruptions (SDCs)</a> occur when hardware miscomputes without leaving detectable traces, leading applications to consume incorrect results. These errors, often due to silicon defects, can remain unnoticed for long periods unless significant deviations are observed. Detecting them requires extensive engineering and costly telemetry to trace data corruption back to specific devices. These faults significantly impact large-scale services due to the lack of telemetry and the continued consumption of corrupted results.</p>
<p><a href="https://arxiv.org/abs/2102.11245" target="_blank" rel="noopener">Case studies</a>, including one where a single computation error led to missing rows in a Spark application, highlight the prevalence of silent errors in hyperscale infrastructures. Historically, soft-error-related bitflips were reduced to one fault per million devices, but with increased silicon density in accelerators, silent data corruptions now occur at about one fault per thousand devices, much higher than cosmic-ray-induced soft errors.</p>
<h2>Key challenges presented by SDCs </h2>
<p>SDCs present significant challenges in hyperscale infrastructure due to their data dependency, creating an impractical exponential test space for all possible data values. These faults also depend on device voltage, frequency, operating temperature, and life cycle. For instance, a device may fail computational checks only after months of use, indicating a state of “wear out.” Therefore, consistent, periodic, and frequent testing within a random state space is necessary throughout the device’s life cycle to identify these inaccuracies.</p>
<p><img class="alignnone size-full wp-image-22665" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png" alt="" width="1733" height="975" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png 1733w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-2.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
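The need for frequent testing over a random state space can be sketched in a few lines (plain Python purely for illustration; the function name is hypothetical, and Meta’s actual scanners use directed micro-benchmarks rather than this toy check). The same result is computed through two algebraically equivalent paths on random operands, so a data-dependent miscomputation surfaces only when a random draw happens to hit the faulty operand pattern — which is exactly why one-off tests are insufficient:

```python
import random

def duplicated_compute_check(trials=1000, seed=None):
    """Toy SDC probe: run the same computation two ways on random
    inputs and flag any mismatch. A defective unit that miscomputes
    only for certain operands is caught only if the random inputs
    hit the faulty data pattern -- hence consistent, periodic runs
    over the device's life cycle."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randint(1, 1 << 32), rng.randint(1, 1 << 32)
        # Two algebraically equivalent paths; on healthy hardware
        # they must agree for every input.
        left = (a + b) * (a - b)
        right = a * a - b * b
        if left != right:
            return False  # silent corruption surfaced
    return True
```

On healthy hardware the check passes for every trial; in a fleet, the interesting signal is the rare machine where it does not.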
<h2>Novel SDC detection mechanisms </h2>
<p>To protect applications from silent data corruption, Meta employs several detection mechanisms, as detailed in the papers, <a href="https://arxiv.org/abs/2203.08989" target="_blank" rel="noopener">“Detecting Silent Errors in the Wild”</a> and <a href="https://dl.acm.org/doi/abs/10.1145/3676641.3716258" target="_blank" rel="noopener">“Hardware Sentinel.”</a></p>
<ol><li class="c1" aria-level="1"><a href="https://engineering.fb.com/2022/03/17/production-engineering/silent-errors/" target="_blank" rel="noopener">Fleetscanner</a>: Fleetscanner captures performance outliers at scale with targeted micro-benchmarks for identifying hardware defects. These benchmarks’ signatures are integrated into telemetry for non-benchmark-based detection. This approach involves running directed tests during maintenance operations such as firmware upgrades and hardware repairs. Tests are scheduled periodically, covering the entire fleet every 45 to 60 days. While it provides dedicated testing on hosts, it may be too slow for some SDCs.</li>
<li class="c1" aria-level="1"><a href="https://arxiv.org/abs/2203.08989" target="_blank" rel="noopener">Ripple:</a> Ripple co-locates with workloads, executing tests in milliseconds to seconds, allowing fleet-wide coverage in days. It overlaps test instructions across cores and threads, providing faster detection than Fleetscanner.</li>
<li class="c1" aria-level="1"><a href="https://dl.acm.org/doi/abs/10.1145/3676641.3716258" target="_blank" rel="noopener">Hardware Sentinel:</a> This novel, test-and-architecture-agnostic approach evaluates application exceptions in kernel space. It identifies core-based anomalies as silent data corruption without requiring test allocations, operating solely in the analytical plane. Hardware Sentinel outperforms testing-based methods by 41% across architectures, applications, and data centers.</li>
</ol><p>Combined, these three mechanisms provide some of the best in-fleet coverage at scale for detecting SDCs and protecting our infrastructure against them.</p>
<h2>Silent errors in AI hardware </h2>
<p>The methodologies described above execute across the fleet and are fully productionized at scale, detecting SDCs across AI and non-AI infrastructure. However, AI applications such as training and inference have unique and more challenging implications for SDCs. </p>
<h3>SDCs in training workloads</h3>
<p>SDCs in training workloads lead to incorrect computations, affecting both forward and backward passes. This results in a divergence from the intended training path, impacting training efficacy. While AI training workloads are sometimes considered self-resilient to SDCs, this is true only for a limited subset of SDC manifestations. In most realistic scenarios, self-resilience is inadequate. SDCs persist across iterations, and the quantization of data values in AI training, which increases information per bit, exacerbates the impact of SDCs, continuously increasing divergence rates in training workloads.</p>
<p>Below we present the two most common cases of training divergence due to SDCs.</p>
<h4>Not-a-Number (NaN) propagation </h4>
<p>Not-a-Number (NaN) propagation occurs when an SDC pushes a representable value into an incorrect representation, generating a NaN during training computations. Once a NaN is created, it propagates through subsequent computations, affecting the training iteration, accelerator domain, host domain, and eventually the entire cluster. This widespread NaN contagion can lead to a cluster halt, as the source—often a few specific computations on a single accelerator—may be difficult to trace amidst the cluster’s scale. Identifying and quarantining the offending accelerator and nodes are necessary to resolve the issue.</p>
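A toy simulation (plain Python, not Meta’s training stack) makes the contagion concrete: a single NaN on one node poisons every node after one synchronous all-reduce, so the offender can only be pinpointed by trapping NaNs before synchronization:

```python
import math

def allreduce_mean(node_grads):
    """Synchronous all-reduce: every node ends up holding the mean
    of all nodes' gradients."""
    n = len(node_grads)
    mean = [sum(col) / n for col in zip(*node_grads)]
    return [list(mean) for _ in node_grads]

def find_nan_sources(node_grads):
    """A pre-reduce NaN trap: checking each node's local gradient
    BEFORE synchronization pinpoints the offender; afterwards every
    node looks equally corrupted."""
    return [i for i, g in enumerate(node_grads)
            if any(math.isnan(x) for x in g)]

# Four nodes; an SDC on node 2 flips one gradient element into a NaN.
grads = [[0.1, 0.2], [0.1, 0.2], [0.1, float("nan")], [0.1, 0.2]]
sources_before = find_nan_sources(grads)     # [2] -- traceable
after = allreduce_mean(grads)
sources_after = find_nan_sources(after)      # [0, 1, 2, 3] -- contagion
```

After the reduce, every node carries the NaN, which is why post-hoc triage at cluster scale is so difficult.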
<h4>Corrupted gradient variance </h4>
<p>Corrupted gradient variance occurs when an SDC affects gradient calculations, leading to gradient explosion, implosion, or local minima. This corruption, while within numeric bounds, is mistakenly treated as correct, affecting the entire cluster in synchronous training. The corrupted values are exchanged as true values, causing the training to appear to progress without actual improvement. Over time, SDCs aggregate, causing major divergences in gradients, potentially trapping the algorithm in local minima or causing gradient explosions or implosions.</p>
<p>Detecting these SDCs is challenging due to their subtlety and the time required to observe their effects, which can take weeks or months. Unlike NaN propagation, these corruptions are harder to trace and rectify, as they don’t trigger NaN traps. Consequently, SDCs can lead to significant unproductive use of computational resources and training iterations. Without detection, the root cause remains elusive, making subsequent training risky until the offending device is identified and isolated.</p>
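As an illustration of why this subtlety defeats simple monitors, consider a minimal drift detector (a toy sketch, not Meta’s production telemetry): an exponential moving average of gradient norms flags an abrupt jump immediately, yet slow SDC-driven divergence stays inside the tolerance and passes unnoticed:

```python
def detect_gradient_drift(norms, alpha=0.02, tol=0.5):
    """Track an exponential moving average (EMA) of gradient norms
    and flag the first steps whose norm deviates from the EMA by
    more than tol (relative). Coarse and illustrative: corruption
    that accumulates slowly stays inside tol, which is exactly what
    makes gradient-variance SDCs hard to catch."""
    ema, flagged = norms[0], []
    for t, g in enumerate(norms[1:], start=1):
        if abs(g - ema) > tol * max(abs(ema), 1e-12):
            flagged.append(t)
        ema += alpha * (g - ema)
    return flagged
```

An abrupt 10x explosion is flagged at once, while a steady 0.1%-per-step drift (which compounds to ~22% over 200 steps) never trips the same monitor.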
<h3>SDCs in inference workloads</h3>
<p>In inference applications, SDCs lead to incorrect results, which, due to the scale of operations, affect thousands of inference consumers. Persistent SDCs can directly impact decisions made by systems such as recommendation engines or LLM outputs. These corruptions can bypass policies related to privacy or integrity, as the corrupted values are not constrained by any policy boundary. Consequently, inference corruptions significantly reduce the efficacy of models trained with substantial computational resources, making seemingly benign inference use cases problematic at scale.</p>
<h2>Impact of SDCs</h2>
<p>SDCs in training and inference clusters create complex debugging scenarios across thousands of components. </p>
<p>In training, visible faults halt the cluster, but SDCs create an illusion of progress, obscuring the fault source. NaN propagation requires identifying the offending node; otherwise, restarts from checkpoints will eventually fail. Corrupted gradient variance prolongs this illusion until variances aggregate, making restarts ineffective. SDCs thus cause significant computational inefficiency, with a larger temporal impact than visible faults.</p>
<p>In inference, triage involves costly telemetry at each substage. Until the offending node is identified, inference clusters can’t be used, risking repeat corruption. Large deviations are easier to detect with anomaly detectors, but smaller ones require extensive debugging. This process can involve hundreds of engineers, halt production use cases, and reduce the capacity reliably available for serving production.</p>
<p><img class="alignnone size-full wp-image-22666" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png" alt="" width="1533" height="863" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png 1533w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-3.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>Detection of SDCs in AI hardware </h2>
<p>Mitigation strategies that we run in our infrastructure for dealing with SDCs in AI training workloads are classified into infrastructure strategies and stack strategies:</p>
<h3>Infrastructure strategies</h3>
<p>These are applied during operational triage at the cluster level. They focus on managing and mitigating SDCs through the physical and network infrastructure, ensuring that the hardware and system-level components are robust and capable of handling errors effectively. </p>
<h4>Reductive triage</h4>
<p>This strategy involves conducting a binary search with mini-training iterations on progressively smaller cluster sizes to isolate NaN propagation. The goal is to identify a small cluster that replicates the NaN issue, allowing the offending node to be quarantined for further investigation. A reconstituted cluster with new nodes can then resume training from a saved checkpoint. However, this method relies on the ability to reproduce SDCs, which is not always guaranteed due to their dependence on data, electrical, and temperature variations. For corrupted gradient variance, a similar divide-and-triage approach can be used, but the effectiveness varies with training data and cluster size, despite consistent hyperparameter settings.</p>
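Under the (optimistic) assumptions stated above — the fault reproduces on demand and lives on a single node — the binary search can be sketched as follows, where the hypothetical `repro` callback stands in for running a mini training iteration on a subset of nodes and reporting whether the NaN reappears:

```python
def reductive_triage(nodes, repro):
    """Binary-search a cluster for the node that reproduces a NaN:
    repeatedly halve the suspect set and re-run a mini training
    iteration (repro) on one half. Assumes the fault is reproducible
    and localized to a single node -- neither is guaranteed for real
    SDCs, which also depend on data, voltage, and temperature."""
    suspects = list(nodes)
    while len(suspects) > 1:
        half = len(suspects) // 2
        left, right = suspects[:half], suspects[half:]
        # If the left half reproduces the NaN, the offender is there;
        # otherwise it must be in the right half.
        suspects = left if repro(left) else right
    return suspects[0]
```

Isolating one offender among N nodes costs O(log N) mini-iterations — cheap relative to restarting full training from a corrupted checkpoint.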
<h4>Deterministic training</h4>
<p>This approach involves running a known effective model for a few training iterations to ensure there are no NaNs or gradient divergences. It helps verify computational failures that are not data-dependent, as it guarantees correctness for a specific set of values and training inputs.</p>
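A minimal sketch of the idea (hypothetical names; a real check replays actual model iterations on accelerators): seed everything, run a few fixed iterations, and compare bit-for-bit against golden losses recorded on trusted hardware:

```python
import random

def training_step(seed, steps=5):
    """Stand-in for a few deterministic mini-training iterations:
    seeded data, a fixed order of operations, scalar 'loss' outputs."""
    rng = random.Random(seed)
    w, losses = 0.5, []
    for _ in range(steps):
        x = rng.uniform(-1, 1)
        grad = 2 * (w * x - 0.3) * x   # d/dw of (w*x - 0.3)**2
        w -= 0.1 * grad
        losses.append(round((w * x - 0.3) ** 2, 12))
    return losses

def deterministic_check(golden, seed=1234):
    """Re-run the known-good model and compare bit-for-bit against
    golden losses recorded on trusted hardware. This verifies only
    the specific values and inputs exercised -- data-dependent SDCs
    outside that set still escape."""
    return training_step(seed) == golden
```

The golden losses are recorded once (e.g., `golden = training_step(1234)` on a trusted host) and any later mismatch indicates a non-data-dependent computational failure on the device under test.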
<h4>Hyper-checkpointing</h4>
<p>This method involves creating checkpoints at increasingly high frequencies to facilitate faster identification and isolation of the corrupting node. It helps maintain training throughput while containing NaN propagation to a specific accelerator or host, thereby speeding up the triage and quarantine process.</p>
<h3>Stack strategies</h3>
<p>These require coordination with the workload and involve adjustments and enhancements at the software-stack level. This includes implementing error detection and correction mechanisms within the application and software layers to handle SDCs more effectively during training processes.</p>
<h4>Gradient clipping</h4>
<p>This strategy involves enforcing gradient clipping within the training workload to limit values within a specified range, thereby mitigating NaN propagation. Computations exceeding this range are clipped, and NaNs can be detected during this step by setting them to a max or min value based on the operand sign. While effective for some NaNs depending on representation format, it may introduce partial errors in certain cases.</p>
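A scalar sketch of the mechanism (illustrative only; real training stacks clip whole gradient tensors, and the sign-bit rule here is one heuristic reading of “based on the operand sign”):

```python
import math

def clip_gradient(g, lo=-1.0, hi=1.0):
    """Clip a gradient into [lo, hi]. A NaN produced upstream by an
    SDC is caught here and pinned to a bound chosen from its sign
    bit -- this stops NaN propagation at the cost of a bounded
    partial error, which is the trade-off described above."""
    if math.isnan(g):
        return hi if math.copysign(1.0, g) > 0 else lo
    return max(lo, min(hi, g))
```

In-range values pass through unchanged, out-of-range values are clipped, and NaNs are converted into finite (if approximate) gradients instead of poisoning the all-reduce.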
<h4><a href="https://ieeexplore.ieee.org/abstract/document/10020972" target="_blank" rel="noopener">Algorithmic fault tolerance</a></h4>
<p>This robust approach integrates fault tolerance into training algorithms to handle a range of data corruptions, reducing the need for detection and triage. It enhances computational efficiency with minimal overhead, as demonstrated in CPU training. This method requires understanding common defect modes and investing in engineering across the stack, with modified guarantees to training workloads, albeit with some overhead to the overall training footprint.</p>
<h4>Tri-variate computational training architecture</h4>
<p>This approach uses shadow nodes in synchronous training to mitigate SDCs. Training steps are repeated across different nodes at random iterations, ensuring correct progress after verification. If shadow and live nodes differ, training halts, and only those nodes are investigated. The rest continue with new nodes. This method involves multiple shadow-node pools, a random training-node pool, and specified steps from the same checkpoint. It offers robust training but demands significant algorithmic changes and increased data movement and infrastructure overhead.</p>
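A minimal sketch of shadow-node verification (toy functions standing in for a real synchronous training step; names are hypothetical): with some sampling probability, the step is replayed on a shadow node and the outputs compared, so a mismatch halts training with only the live/shadow pair needing investigation:

```python
import random

def verified_step(state, data, live_fn, shadow_fn, rng, p=0.1):
    """With probability p, replay the training step on a shadow node
    and compare outputs. A mismatch halts training so only the
    live/shadow pair is quarantined; the rest of the cluster resumes
    on fresh nodes from the same checkpoint."""
    out = live_fn(state, data)
    if rng.random() < p and shadow_fn(state, data) != out:
        raise RuntimeError("live/shadow divergence: quarantine pair")
    return out

def healthy(state, data):
    """Deterministic stand-in for one synchronous training step."""
    return tuple(round(s + 0.01 * d, 9) for s, d in zip(state, data))

def corrupted(state, data):
    """The same step with a small, silent miscomputation injected."""
    out = list(healthy(state, data))
    out[0] += 1e-3
    return tuple(out)
```

Sampling (p &lt; 1) bounds the duplicated-compute overhead; the price is that a corrupting iteration may go unverified, which is why the text describes this as a trade-off against data movement and infrastructure cost.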
<h4><a href="https://arxiv.org/abs/2405.01741" target="_blank" rel="noopener">Parameter vulnerability factors</a></h4>
<p>This approach identifies vulnerable and resilient layers in machine-learning architectures, allowing vulnerable layers to be mapped to protected hardware and resilient layers to unprotected hardware. This dynamic evaluation must scale with architecture evolution. Resilience often incurs costs in area, power, or performance, so parameter vulnerability factors (PVF) enable targeted resilient design, especially for inference.</p>
<h4><a href="https://dl.acm.org/doi/10.1145/3620666.3651349" target="_blank" rel="noopener">Divergence detection</a></h4>
<p>This mechanism maintains a distribution map for each neuron to detect divergence from typical output distributions, identifying inference corruptions. Though costly, it can be applied at selected sampling rates for large-scale inference. By preserving each neuron’s behavior for specific workloads, divergence helps detect corruptions during execution.</p>
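One way such a per-neuron profile could be kept is a running mean/variance (Welford’s algorithm) with a z-score test on new activations — a simplified sketch, not the production mechanism:

```python
class NeuronProfile:
    """Per-neuron running mean/std of activations (Welford's
    algorithm); score() returns the z-score of a new activation
    against the recorded distribution. Applying such profiles to a
    sampled fraction of inference traffic is one (costly) way to
    surface corrupted outputs during execution."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def score(self, x):
        if self.n < 2:
            return 0.0  # not enough history to judge divergence
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(x - self.mean) / max(std, 1e-12)
```

An activation far outside the neuron’s typical distribution produces a large z-score and can be flagged as a candidate corruption; the per-neuron state explains why this approach is costly and is applied at selected sampling rates.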
<p>While we have optimized these different methodologies to run effectively in our infrastructure, it should be noted that they offer varying levels of resilience with distinct operating points and engineering/infrastructure overheads. Depending on the scale and intensity of training and inference workloads, orchestrating these strategies effectively can mitigate SDCs’ adverse effects in AI applications.</p>
<h3>Performance faults and unknown unknowns!</h3>
<p>While SDCs are a major challenge at hyperscale, Meta has been developing solutions to detect performance regressions. <a href="https://www.usenix.org/conference/osdi24/presentation/chow" target="_blank" rel="noopener">ServiceLab</a>, for example, is a large-scale performance testing platform that helps identify tiny performance regressions at scale. In addition, Fleetscanner has identified hundreds of performance outliers, an emergent fault mode alongside SDCs.</p>
<p>While current mechanisms detect and address static, transient, and silent faults, the full range of hardware fault variants remains partially uncovered. The unknown unknowns require agile solutions across the entire infrastructure and silicon lifecycle, as well as across the hardware-to-software and application stack, to achieve first-class reliability operations.</p>
<h2>A journey towards industry leadership and standardization</h2>
<p>Meta’s journey toward industry leadership in SDC detection began with identifying frequent fleet issues in 2016, scaling SDC detection in 2018, and implementing detection frameworks by 2019. By 2020, detection mechanisms were integrated into accelerators, and Meta published the paper, <a href="https://arxiv.org/abs/2102.11245">“Silent Data Corruptions at Scale.”</a> In 2022, Meta introduced <a href="https://arxiv.org/abs/2203.08989">“FleetScanner and Ripple”</a> and conducted an <a href="https://research.facebook.com/blog/2022/2/engineering-director-sriram-sankar-discusses-metas-first-research-award-opportunity-in-silent-data-corruptions-at-scale">RFP</a> for <a href="https://fburl.com/n5h6hlml">academic awards</a>, funding five winners. </p>
<p>In 2023, Meta collaborated with industry leaders (Google, Microsoft, ARM, AMD, NVIDIA, and Intel) to enhance server resilience, defining test architectures and metrics. A joint <a href="https://www.opencompute.org/blog/computings-hidden-menace-the-ocp-takes-action-against-silent-data-corruption-sdc">RFP</a> with partners from the <a href="https://www.opencompute.org/blog/ocps-server-resilience-initiative-sdc-academic-research-awards-announced">Open Compute Project selected six winners for cross-domain SDC research</a>. By 2024, Meta’s fleet had advanced <a href="https://dl.acm.org/doi/abs/10.1145/3676641.3716258">AI SDC detection methodologies</a> in production, contributing to research through publications, tutorials, and talks at major conferences and forums, addressing at-scale reliability challenges.</p>
<p><img class="alignnone size-full wp-image-22667" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png" alt="" width="1999" height="1126" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png?resize=768,433 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png?resize=1024,577 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png?resize=1536,865 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-4.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>The Meta Training and Inference Accelerator </h2>
<p>Meta is on an ambitious journey toward enabling training and inference accelerators under the <a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/">Meta Training and Inference Accelerator (MTIA)</a> family. On this journey, our goal is to utilize all the lessons learned from the fleet and move toward industry-leading, fleet-reliability practices in MTIA architecture and design practices. Using the factory-to-fleet approach, and consistently revisiting our reliability solutions across the stack, our goal is to deliver a best-in-class, reliable-and-performant solution to add to our infrastructure portfolio of AI hardware and to power AI applications at scale. </p>
<h3>Factory to fleet</h3>
<p>To uncover unknowns early, a comprehensive factory-to-fleet view of the silicon life cycle is key. Innovation is needed in all phases, from design to deployment. In design and architecture, revisiting RAS solutions for scale, life-cycle debug hooks, and telemetry architectures can support tools such as Hardware Sentinel, Fleetscanner, and Ripple. During validation and integration, novel yield analysis, manufacturing diagnostics, and fleet-signature-feedback-based detection can prevent faults before shipping. In AI silicon fleets, user-space diagnostics with periodic testing, coverage maps, and control parameters are beneficial. Large-scale analytics like Hardware Sentinel can detect early wear out and data corruption. Robust firmware hooks and debug architecture provide fast feedback to design and architecture amidst fleet-scale issues.</p>
<p><img class="alignnone size-full wp-image-22668" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png" alt="" width="1326" height="743" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png 1326w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png?resize=916,513 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png?resize=768,430 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png?resize=1024,574 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-5.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h3>Stack-level resilience</h3>
<p>Factory-to-fleet solutions offer life-cycle resilience for silicon, but resilience must extend beyond silicon to firmware, compilers, kernels, and operating systems. Investments in resilience architectures are needed for correctness-invariant-instruction heterogeneity and enhanced telemetry for exception tracing. Granular firmware-control mechanisms improve telemetry upon fault detection. At the software and application level, techniques like gradient clipping and algorithmic fault tolerance, which we called out in this blog, are crucial for protecting against corruptions. Experience with SDCs shows that in-line software resilience and test-agnostic analytical approaches effectively scale for many SDCs with minimal investment, while testing-based approaches are limited to specific instructions.</p>
<p><img class="alignnone size-full wp-image-22669" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png" alt="" width="1737" height="972" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png 1737w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png?resize=916,513 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png?resize=768,430 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png?resize=1024,573 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png?resize=1536,860 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-AI-hardware-reliability-at-scale-image-6.png?resize=192,107 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Hardware faults significantly impact AI training and inference production. As cluster sizes and semiconductor complexity grow, fault complexity will exponentially increase. Solutions must involve factory-to-fleet coordination and stack-level resiliency. For AI applications, treating reliability as a primary design consideration is essential.</p>
<h2>Acknowledgments</h2>
<p><em>The authors would like to thank all the cross-functional engineers and teams instrumental in landing these solutions over the years. This blog accompanies the</em> <a href="https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable/#talk"><em>@Scale conference talk</em></a><em>; please check out the talk for more details.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable/</link>
      <guid>https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable/</guid>
      <pubDate>Tue, 22 Jul 2025 20:45:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Using AI to make lower-carbon, faster-curing concrete]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Meta has <a href="https://github.com/facebookresearch/SustainableConcrete" target="_blank" rel="noopener">developed an open-source AI tool</a> to design concrete mixes that are stronger, more sustainable, and ready to build with faster—speeding up construction while reducing environmental impact.</li>
<li class="c1" aria-level="1">The AI tool leverages Bayesian optimization, powered by Meta’s <a href="https://botorch.org/" target="_blank" rel="noopener">BoTorch</a> and <a href="https://ax.dev/" target="_blank" rel="noopener">Ax</a> frameworks, and was developed with Amrize and the University of Illinois Urbana-Champaign (U of I) to accelerate the discovery of high-performance, low carbon concrete.</li>
<li class="c1" aria-level="1">Meta successfully deployed a concrete mix that was optimized with the AI tool at a data center construction site. Being open source and freely available, the AI tool could help increase the adoption and optimization of sustainable concrete mixes in the construction industry at large.</li>
</ul><p>Low carbon concrete solutions are essential for advancing our <a href="https://sustainability.fb.com/wp-content/uploads/2023/07/Meta-2023-Path-to-Net-Zero.pdf" target="_blank" rel="noopener">goal of net zero emissions in 2030</a>. Concrete production is a major contributor to the embodied carbon emissions in data center construction and <a href="https://www.weforum.org/stories/2024/09/cement-production-sustainable-concrete-co2-emissions/" target="_blank" rel="noopener">accounts for 8% of all global CO2 emissions</a>, according to the World Economic Forum. Conventionally, concrete is optimized for strength (28-day compressive strength) and cost. But modern constructions – including data centers – require concrete that is optimized for sustainability, curing speed, workability, and finishability as well. </p>
<p>Innovation in concrete formulations is difficult and slow. Compared to traditional concrete, current formulas for low carbon concrete face several challenges: slower curing speeds, issues with surface quality, and complications in supply chains when novel materials are involved.</p>
<p>But concrete suppliers can utilize AI to develop and scale innovative concrete mixes as drop-in replacements, accelerating the discovery and integration of sustainable materials for large-scale use.</p>
<p>By collaborating with Amrize — one of the world’s largest cement manufacturers and major concrete suppliers — and the University of Illinois Urbana-Champaign (U of I), we’ve developed an AI model and pipeline to accelerate the discovery of new concrete mixtures that meet traditional requirements alongside newer sustainability needs.</p>
<p>Our work with Amrize and U of I has already resulted in the successful design and deployment of AI-designed green concrete at our <a href="https://www.facebook.com/RosemountDataCenter/" target="_blank" rel="noopener">new data center in Rosemount, MN</a>. </p>
<h2>Meta’s AI model for green concrete</h2>
<p>Designing concrete formulas is a complex, multi-objective problem. The designer must choose between various types and proportions of cement, lower-carbon supplementary cementitious materials (SCMs), water-to-binder ratios, coarse and fine aggregate types, and admixtures. SCMs’ impact on concrete performance varies by source location and seasonality, requiring long-term tests for validation. Finally, time-consuming tests taking days or weeks are needed to fully validate the performance of new mixes. Thus, it is important for the design process to be as efficient as possible. </p>
<p>There are several key ingredients often used in a sustainable concrete mix: </p>
<ul><li class="c1" aria-level="1"><strong>Cement</strong> is the “glue” that holds concrete together. It’s made from calcining limestone, clay, and other minerals in a high-temperature rotary kiln – the process which contributes significantly to CO2 emissions. The cement is then mixed with water, SCMs, aggregates, and admixtures at a ready mix plant to create concrete. When the cement paste hydrates and stiffens over time, it forms a hard, binding gel that gives concrete its strength.</li>
<li class="c1" aria-level="1"><strong>Slag</strong> is a byproduct of steel production. It’s a molten waste material that’s cooled and ground into a fine powder. In concrete, slag helps reduce concrete’s embodied carbon by replacing cement, and improves long-term strength, durability, and resistance to external chemicals.</li>
<li class="c1" aria-level="1"><strong>Fly ash</strong> is a type of industrial by-product from coal-fired power plants. It’s collected from the air pollution control systems and can be used as a substitute for some of the cement in concrete. Fly ash helps reduce the embodied carbon in concrete by replacing cement, and also improves its long-term strength, durability, and workability.</li>
<li class="c1" aria-level="1"><strong>Fine aggregate</strong>, like sand, is smaller than coarse aggregate and fills in the gaps between the larger rocks or gravel. Sand helps to create a smooth, even surface, and improves the overall texture of the concrete.</li>
<li class="c1" aria-level="1"><strong>Coarse aggregate</strong> refers to crushed stone or gravel that are added to concrete to provide bulk volume and load-bearing capacity, helping the concrete resist cracking and shrinkage.</li>
</ul><p>Mixing these ingredients together in different proportions gives rise to concrete with varying strength and sustainability properties. The properties of each ingredient vary by origin and condition of manufacturing. Furthermore, some of the SCMs are declining in availability, necessitating the discovery and incorporation of novel materials for which little-to-no data is available. All of this adds to the challenges of concrete design. The goal of our approach is to optimize the trade-off between strength and sustainability.</p>
<figure id="attachment_22606" aria-describedby="caption-attachment-22606" class="wp-caption alignnone c2"><img class="size-full wp-image-22606" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png" alt="" width="1920" height="1080" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-1.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22606" class="wp-caption-text">Several key ingredients used to generate concrete mixes, clockwise from top left: fly ash, coarse aggregates, fine aggregate, and cement.</figcaption></figure><figure id="attachment_22607" aria-describedby="caption-attachment-22607" class="wp-caption alignnone c3"><img class="size-full wp-image-22607" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png" alt="" width="1999" height="1095" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png?resize=916,502 916w, 
https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png?resize=768,421 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png?resize=1024,561 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png?resize=1536,841 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png?resize=96,53 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-2.png?resize=192,105 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22607" class="wp-caption-text">An example of a low carbon concrete mix design, showing the relative amount of ingredients by weight.</figcaption></figure><p>To accelerate the concrete mix design process, Meta developed <a href="https://github.com/facebookresearch/SustainableConcrete" target="_blank" rel="noopener">an AI model for sustainable concrete</a> using <a href="https://botorch.org/" target="_blank" rel="noopener">BoTorch</a> and <a href="https://ax.dev/" target="_blank" rel="noopener">Ax</a>, Meta’s open-source software for <a href="https://arxiv.org/abs/1807.02811" target="_blank" rel="noopener">Bayesian optimization</a> and <a href="https://researchoutreach.org/articles/adaptive-experiments-machine-learning-help-scientific-discovery/" target="_blank" rel="noopener">adaptive experimentation</a>, respectively. This model uses multi-objective Bayesian optimization algorithms to learn and optimize concrete compositions. The approach predicts compressive strength curves for different mixtures, optimizing short- and long-term strength properties and sustainability.</p>
<p><em>(For technical details of the model and optimization algorithm, see our technical report, “<a href="https://arxiv.org/abs/2310.18288" target="_blank" rel="noopener">Sustainable Concrete via Bayesian Optimization”</a> and our open source <a href="https://github.com/facebookresearch/SustainableConcrete" target="_blank" rel="noopener">SustainableConcrete</a> repository with the associated data and code.)</em></p>
<p>The basis of the approach is a model that predicts the compressive strength curves associated with different concrete mixtures.</p>
<p>The figure below shows an example of two strength curve predictions, one associated with a concrete mix including pure portland cement (blue)—the most commonly used type of cement—and a second one where part of the cement was substituted with fly ash (green), whose carbon impact is lower than that of cement. We then leveraged the strength predictions at various times to jointly optimize the short- and long-term strength properties, as well as the sustainability of the associated mixes, to generate new formulas that can be validated through testing.</p>
<p>By using AI, we can accelerate discovery and make each round of experimentation more efficient.</p>
<figure id="attachment_22608" aria-describedby="caption-attachment-22608" class="wp-caption alignnone c3"><img class="size-full wp-image-22608" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png" alt="" width="1999" height="1088" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png?resize=916,499 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png?resize=768,418 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png?resize=1024,557 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png?resize=1536,836 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-3.png?resize=192,105 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22608" class="wp-caption-text">Two strength curve predictions carried out by our model during early development. The more sustainable mix (green) exhibits lower compressive strength early on but overtakes the traditional mix (blue) later on, a common trade-off of more sustainable concrete mixes.</figcaption></figure><p>To train this AI model with real data, we collaborated with Professor Nishant Garg and his research group at U of I. In each iteration, the AI suggests new promising concrete mixes based on performance predictions, which are updated with the latest data. We validated these predictions with lab testing and used the results to refine the AI for subsequent iterations. 
Our AI pipeline cycles through generating baseline data, training an AI model, using it to develop and validate new hypotheses, and then folding the results back into the baseline data and retraining.</p>
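<p>The workflow above (generate baseline data, propose mixes, test, retrain) can be sketched in miniature. The sketch below is a dependency-free illustration, not Meta’s implementation: random search stands in for the BoTorch/Ax Bayesian-optimization step, and the values returned by the hypothetical <code>lab_test</code> function are invented surrogates for measured strength and carbon footprint.</p>

```python
import random

def lab_test(mix):
    """Return (strength_score, gwp_score) for a mix.

    Both formulas are invented surrogates: the strength score rises with
    cement content but saturates, while the GWP proxy grows roughly
    linearly with the cement fraction.
    """
    c, f = mix["cement"], mix["fly_ash"]
    strength = 50 * c - 30 * c ** 2 + 5 * f   # maximize
    gwp = 870 * c + 30 * f                    # minimize
    return strength, gwp

def propose_mixes(n):
    """Random proposals stand in for the model-guided acquisition step."""
    proposals = []
    for _ in range(n):
        cement = random.uniform(0.3, 1.0)
        proposals.append({"cement": cement, "fly_ash": 1.0 - cement})
    return proposals

def pareto_front(results):
    """Keep mixes no other mix dominates (higher strength, lower GWP)."""
    return [
        (mix, scores)
        for mix, scores in results
        if not any(
            s >= scores[0] and g <= scores[1] and (s, g) != scores
            for _, (s, g) in results
        )
    ]

random.seed(0)
observed = []
for lab_round in range(3):          # each round = one batch of cylinder tests
    for mix in propose_mixes(8):
        observed.append((mix, lab_test(mix)))

front = pareto_front(observed)      # the strength/sustainability trade-off
```

<p>In the real pipeline, each round’s results update the model so the next batch of proposals is better informed; here the rounds are independent, which is exactly the inefficiency that Bayesian optimization removes.</p>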
<p>In implementing the first AI pipeline, we focused on several key metrics: compressive strength, curing speed, slump, and sustainability, which we quantify using a proxy for the carbon footprint of the concrete mix. The compressive strength of concrete is crucial for determining both its long-term structural integrity, typically specified as its 28-day compressive strength, and its short-term curing speed, specified as the time needed to achieve certain strength requirements such as strength one, three, and five days after the pour. When densely sampled x-day strength data is available, a strength curve can be generated.</p>
<p>These attributes can be tested on concrete cylinders in the lab, allowing for rapid and systematic data generation necessary for training the AI. Larger-scale tests can be conducted after the new formulas, discovered through iterative testing, have been reviewed by concrete experts. By conducting research and development in stages, we can focus the AI on critical metrics and expedite progress.</p>
<p>The resulting AI pipeline is illustrated below:</p>
<figure id="attachment_22610" aria-describedby="caption-attachment-22610" class="wp-caption alignnone c3"><img class="size-full wp-image-22610" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png" alt="" width="1999" height="843" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png?resize=916,386 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png?resize=768,324 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png?resize=1024,432 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png?resize=1536,648 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png?resize=96,40 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-4.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22610" class="wp-caption-text">Adaptive experimentation steps to implement an AI pipeline.</figcaption></figure><p>The figure below shows how AI is able to learn and further optimize x-day strength versus sustainability (quantified using a proxy for the carbon footprint of the concrete mix) over several iterations, exceeding the initial human-designed formulas.</p>
<p><img class="alignnone wp-image-22656 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier-of-concrete-strength-and-sustainability.gif" alt="" width="1920" height="1230" /></p>
<p>Over time, this AI pipeline has generated high-quality data for AI training and development, containing over a hundred unique concrete mixes, comprehensive x-day compressive strength data, and global warming potential (GWP, measured in terms of kilograms of CO2 per cubic meter). </p>
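<p>A GWP proxy of the kind mentioned above can be illustrated as a weighted sum of ingredient masses and per-ingredient emission factors. The factors and mix quantities below are hypothetical placeholders chosen for the example, not values from our dataset.</p>

```python
# Hypothetical cradle-to-gate emission factors in kg CO2e per kg of
# ingredient. Real factors would come from environmental product
# declarations for the specific materials used.
EMISSION_FACTORS = {
    "cement": 0.90,
    "fly_ash": 0.03,
    "water": 0.0003,
    "fine_aggregate": 0.005,
    "coarse_aggregate": 0.005,
}

def gwp_proxy(mix_kg_per_m3):
    """GWP proxy in kg CO2e per cubic meter of concrete."""
    return sum(EMISSION_FACTORS[name] * kg
               for name, kg in mix_kg_per_m3.items())

# Substituting part of the cement with fly ash lowers the proxy.
baseline = {"cement": 350, "fly_ash": 0, "water": 160,
            "fine_aggregate": 800, "coarse_aggregate": 1050}
low_carbon = {"cement": 250, "fly_ash": 100, "water": 160,
              "fine_aggregate": 800, "coarse_aggregate": 1050}
```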
<h2>Developing an AI pipeline for industrial green concrete</h2>
<p>In 2024, we started collaborating with <a href="http://amrize.com/" target="_blank" rel="noopener">Amrize</a> to explore how Meta’s AI can be used at scale in the concrete industry. </p>
<p>Amrize shared basic concrete performance data in support of Meta’s open source approach, and developed an AI pipeline at its batch plant near St. Paul, MN, extending the discovery and testing process.</p>
<p>Critical to data centers are the concrete slabs that serve as surfaces for deploying servers and their associated power and cooling equipment. Data center slabs need to be flat, level, smooth, and durable to enable reliable servicing of the equipment that resides on them. Their concrete formulations must therefore meet additional high-quality finish requirements. Our AI algorithms incorporate specific water-to-binder ratios and volumetric material constraints, and discover high-performing formulas with faster curing and lower GWP values that meet these stricter requirements.</p>
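<p>A constraint screen of the kind described above can be sketched as a simple feasibility check. The bounds below are hypothetical placeholders; the actual pipeline encodes such constraints directly in the optimizer’s search space rather than filtering candidates after the fact.</p>

```python
def is_feasible_slab_mix(mix_kg_per_m3, wb_range=(0.35, 0.45),
                         min_binder_kg=300.0):
    """Screen a candidate mix against slab-specific constraints.

    Checks a hypothetical water-to-binder (w/b) window and a minimum
    total binder content; real slab specifications add finish and
    durability requirements on top of these.
    """
    binder = (mix_kg_per_m3["cement"]
              + mix_kg_per_m3.get("fly_ash", 0.0))
    wb = mix_kg_per_m3["water"] / binder
    return binder >= min_binder_kg and wb_range[0] <= wb <= wb_range[1]

# A candidate with 350 kg of binder and w/b = 0.40 passes the screen.
candidate = {"cement": 250, "fly_ash": 100, "water": 140,
             "fine_aggregate": 800, "coarse_aggregate": 1050}
```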
<p>By distinguishing between formulas suitable for slabs and those better suited to other applications, we can compare their performance against industry-standard formulas (see below). Within two iterations, and with minor human adjustments, the AI pipeline discovered formulas that exceeded standard low carbon industry formulas in strength, curing speed, and sustainability.</p>
<figure id="attachment_22611" aria-describedby="caption-attachment-22611" class="wp-caption alignnone c3"><img class="size-full wp-image-22611" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png" alt="" width="1999" height="1250" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png?resize=916,573 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png?resize=768,480 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png?resize=1024,640 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png?resize=1536,960 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png?resize=96,60 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-5.png?resize=192,120 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22611" class="wp-caption-text">The strength curves of standard industry low carbon formulas compared to AI-optimized formulas. 
AI-optimized formulas are faster, stronger, and have lower carbon emissions.</figcaption></figure><p><img class="alignnone wp-image-22650 size-full" src="https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png" alt="" width="3850" height="2504" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png 3850w, https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png?resize=916,596 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png?resize=768,499 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png?resize=1024,666 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png?resize=1536,999 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png?resize=2048,1332 2048w, https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Pareto-Frontier_Final.png?resize=192,125 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<div class="jetpack-video-wrapper"><iframe title="Low Carbon Cement Formulas" width="1778" height="1000" src="https://www.youtube.com/embed/PHuAG_jZMD8?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Applying Amrize’s AI-designed concrete formulation at Meta’s Rosemount data center</h2>
<p>Further testing is needed to apply AI-generated formulas in real-world applications. We therefore extended the first AI pipeline to incorporate additional steps and further tests, as shown in the figure below.</p>
<p>Amrize collaborated with Mortensen, the general contractor responsible for the construction of our data center, to test the new formula’s workability and finishability. Successful slab tests led to at-scale application in a site support section of one of the data center building slabs at Meta’s Rosemount, MN data center project.</p>
<figure id="attachment_22613" aria-describedby="caption-attachment-22613" class="wp-caption alignnone c3"><img class="size-full wp-image-22613" src="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png" alt="" width="1999" height="1269" srcset="https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png?resize=916,581 916w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png?resize=768,488 768w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png?resize=1024,650 1024w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png?resize=1536,975 1536w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png?resize=96,61 96w, https://engineering.fb.com/wp-content/uploads/2025/07/Meta-green-AI-concrete-image-7.png?resize=192,122 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22613" class="wp-caption-text">The development and scaling-up process to test and validate AI-generated concrete formulas. Human experts assess the outputs of each stage and iteration, refine the AI to incorporate additional constraints, and/or adjust individual constraints such as total binder amount and water-to-binder ratio.</figcaption></figure><p>Formal tests show that the team exceeded all the technical requirements while achieving good workability and finish performance required for the application. </p>
<h2>Open source for more sustainable construction </h2>
<p>At Meta, we believe AI can generate high-performance, low carbon concrete mixes for major construction projects such as data centers. Open source AI can benefit every level of the construction industry, from construction companies and contractors to suppliers, providers, architects, and, of course, building owners.</p>
<p>We will continue our collaboration with Amrize to further scale the use of AI in the concrete industry. The basic AI solution will remain open source to enable further commercial productization, application, and R&amp;D. </p>
<p>Our aim is to scale the use of low carbon concrete in data centers and encourage the adoption of performance-based requirements at minimum risk. Meta will continue to engage with other hyperscalers to collaboratively test and prove low carbon concrete formulas to further decrease carbon emissions. Meta will also leverage organizations such as iMasons and the <a href="https://www.opencompute.org/" target="_blank" rel="noopener">Open Compute Project</a> to publish reference designs, AI-informed formulas, case studies, and best practices.  </p>
<h2>Learn more about Meta’s AI for sustainable concrete</h2>
<ul><li class="c1" aria-level="1"><a href="https://github.com/facebookresearch/SustainableConcrete" target="_blank" rel="noopener">Download the sustainable concrete AI model on GitHub.</a></li>
<li aria-level="1">Read our technical report, “<a href="https://arxiv.org/abs/2310.18288" target="_blank" rel="noopener">Sustainable Concrete via Bayesian Optimization</a>.”</li>
<li class="c1" aria-level="1">Learn more about <a href="https://botorch.org/" target="_blank" rel="noopener">BoTorch</a> and <a href="https://ax.dev/" target="_blank" rel="noopener">Ax. </a></li>
<li aria-level="1">Read more about <a href="https://sustainability.atmeta.com/blog/2024/12/19/advancing-low-carbon-concrete-in-our-data-centers/" target="_blank" rel="noopener">how Meta uses low carbon concrete in our data centers</a>.</li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/07/16/data-center-engineering/ai-make-lower-carbon-faster-curing-concrete/</link>
      <guid>https://engineering.fb.com/2025/07/16/data-center-engineering/ai-make-lower-carbon-faster-curing-concrete/</guid>
      <pubDate>Wed, 16 Jul 2025 14:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[An inside look at Meta’s transition from C to Rust on mobile]]></title>
<description><![CDATA[<p>Have you ever worked in legacy code? Are you curious what it takes to modernize systems at a massive scale?</p>
<p><a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> is joined on the latest Meta Tech Podcast by Elaine and Buping, two software engineers working on a bold project to rewrite the decades-old C code in one of Meta’s core messaging libraries in Rust. It’s an ambitious effort that will transform a central messaging library that is shared across Messenger, Facebook, Instagram, and Meta’s AR/VR platforms.</p>
<p>They discuss taking on a project of this scope (even without a background in Rust), how they’re approaching it, and what it means to optimize for ‘developer happiness.’</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/37177840/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/1Okh8hQXHgBB2MuW5XTjDx?utm_source=engineeringatmeta&amp;utm_medium=blog&amp;utm_campaign=metatechpodcast&amp;utm_id=meta+tech+podcast" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/meta-tech-podcast/id1370910331?utm_source=engineeringatmeta&amp;utm_medium=blog&amp;utm_campaign=metatechpodcast&amp;utm_id=meta+tech+podcast" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pocketcasts.com/podcasts/c4ede3e0-1fbf-0136-c266-7d73a919276a/5d3989cd-fb22-4452-9122-fa7c37fb9154?utm_source=engineeringatmeta&amp;utm_medium=blog&amp;utm_campaign=metatechpodcast&amp;utm_id=meta+tech+podcast" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/07/01/developer-tools/an-inside-look-at-metas-transition-from-c-to-rust-on-mobile/</link>
      <guid>https://engineering.fb.com/2025/07/01/developer-tools/an-inside-look-at-metas-transition-from-c-to-rust-on-mobile/</guid>
      <pubDate>Tue, 01 Jul 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta joins Kotlin Foundation]]></title>
      <description><![CDATA[<p>We are proud to announce that Meta has officially joined the <a href="https://kotlinfoundation.org/" target="_blank" rel="noopener">Kotlin Foundation</a> as a gold member, marking a significant milestone in our ongoing commitment to Kotlin and the broader Android development ecosystem.</p>
<p>Over the past several years, Meta engineers have been actively <a href="https://engineering.fb.com/2024/12/18/android/translating-java-to-kotlin-at-scale/" target="_blank" rel="noopener">migrating our extensive Android codebase</a>—comprising tens of millions of lines—from Java to Kotlin. To facilitate this massive transition, we developed an internal tool called <a href="https://www.infoq.com/news/2024/12/meta-java-kotlin-port/" target="_blank" rel="noopener">Kotlinator</a>, which automates much of the conversion process while ensuring the resulting Kotlin code is idiomatic and compatible with our internal frameworks. We have continued to share these efforts as a part of the <a href="https://youtu.be/POmlM7OshwA?si=15r6zufGnwrkTolG" target="_blank" rel="noopener">enterprise Java-to-Kotlin working group</a>.</p>
<p>In addition to these internal efforts, we at Meta have been sharing our work publicly through open source projects such as Kotlin and Android build toolchain in <a href="https://buck2.build/" target="_blank" rel="noopener">Buck2</a>. Initiatives like this toolchain aim to provide <a href="https://www.youtube.com/watch?v=bC_grxuSO08" target="_blank" rel="noopener">tooling and best practices</a> for enhancing build speeds and scalability, ultimately benefiting the broader developer community. </p>
<p>Meta’s involvement in the Kotlin Foundation aligns with our broader strategy to support and advance the Kotlin ecosystem. Meta will contribute to initiatives in the Kotlin Foundation’s grants program, which support open source library authors and encourage innovation among students and developers. Meta’s membership in the Kotlin Foundation underscores our dedication to fostering a robust, collaborative Kotlin community and advancing the language’s capabilities across platforms.</p>
<p>To learn more about Meta’s open source efforts, visit the <a href="https://opensource.fb.com/" target="_blank" rel="noopener">Meta Open Source site</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.instagram.com/metaopensource/" target="_blank" rel="noopener">Instagram</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>]]></description>
      <link>https://engineering.fb.com/2025/06/30/android/meta-joins-kotlin-foundation/</link>
      <guid>https://engineering.fb.com/2025/06/30/android/meta-joins-kotlin-foundation/</guid>
      <pubDate>Mon, 30 Jun 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Extending the Malbec subsea cable to Southern Brazil]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Meta is partnering with V.tal to extend the <a href="https://engineering.fb.com/2021/11/11/connectivity/malbec-subsea-cable/" target="_blank" rel="noopener">Malbec subsea cable</a> to Porto Alegre, Brazil by 2027.</li>
<li class="c1" aria-level="1">With this new extension, Malbec will become the first subsea cable to land in the state of Rio Grande do Sul, bringing more connectivity to millions of people in Southern Brazil and neighboring countries.</li>
<li class="c1" aria-level="1">Malbec will improve the scale and reliability of digital infrastructure in Porto Alegre, establishing it as a digital hub and improving online experiences across Southern Brazil, Argentina, Chile, Paraguay, and Uruguay.</li>
</ul><p>Today, we’re announcing the extension of the <a href="https://engineering.fb.com/2021/11/11/connectivity/malbec-subsea-cable/" target="_blank" rel="noopener">Malbec subsea cable</a> to the city of Porto Alegre, Brazil. Developed by Meta, in partnership with <a href="https://www.vtal.com/en/home/" target="_blank" rel="noopener">V.tal</a>, Malbec is a 2,500 km cable that entered service in 2021 to provide connectivity between the Southern Cone of South America and Brazil. The new extension will be operational in 2027 and will link Porto Alegre to the cities of Rio de Janeiro and São Paulo, Brazil and Buenos Aires, Argentina. </p>
<p>“The expansion of Malbec to Porto Alegre is a milestone for connectivity in South America, benefiting millions of people in Brazil and positioning the capital of Rio Grande do Sul as the first major international digital hub in the south of the country,” explained Meta’s Director of Connectivity Policy, Brazil, Ana Luiza Valadares. “It will contribute to attracting digital infrastructure companies, lowering costs for companies and improving consumer services.”</p>
<p>Felipe Campos, CEO of V.tal, added, “The impact of this project will be significant for the local digital economy, positioning Porto Alegre as a new connectivity hub. It will be a unique infrastructure that will attract the interest of operators and internet providers, as well as other submarine cable companies.</p>
<p>In addition, all the Southern Cone countries will benefit from this new ecosystem, not to mention the end users and companies who will have a better experience when using the internet and digital applications.”</p>
<p><img class="alignnone size-full wp-image-22567" src="https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png" alt="" width="1656" height="1238" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png 1656w, https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png?resize=916,685 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png?resize=768,574 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png?resize=1024,766 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png?resize=1536,1148 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png?resize=96,72 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Signal-repeater.png?resize=192,144 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>This extension is one of the latest in Meta’s digital infrastructure investments to support growing demand for digital capacity, resilience, and global reach. Earlier this year, Meta also activated a Point of Presence (PoP) in Porto Alegre. PoPs facilitate the efficient delivery of content locally, which reduces the network management costs for internet service providers while improving the quality of experience for their customers. With the advent of AI and increasing demand for online services, digital infrastructure deployments play an important role in ensuring that the benefits of AI and other emerging technologies are available to everyone, regardless of where they live or work.</p>
<p>“This investment in submarine connectivity, fully aligned with our Economic, Inclusive and Sustainable Development Plan, represents a strategic milestone for the state’s future,” says Rio Grande do Sul Governor Eduardo Leite. “Furthermore, it fosters artificial intelligence projects, technologies that are already transforming the present and will define the future of innovation, a sector in which Rio Grande do Sul is a leader in Brazil, <a href="https://www.gov.br/inpi/pt-br/inpi-data/indice-brasil-de-inovacao-e-desenvolvimento-ibid/IBID2024_ENfinal.pdf" target="_blank" rel="noopener">according to the ranking of state competitiveness</a>.”</p>
<p>Malbec will be the first international subsea cable to land in Rio Grande do Sul, bringing with it over 84 terabits of international capacity and direct connectivity to northern Brazil and Argentina. As with most subsea cables, local service providers will be able to acquire capacity on Malbec to serve additional bandwidth to millions of people in Brazil’s southern states. These providers will also extend Malbec’s reach by connecting with providers in the neighboring countries of Argentina, Chile, Paraguay, and Uruguay, further positioning Brazil as a South American connectivity hub.</p>]]></description>
      <link>https://engineering.fb.com/2025/05/22/connectivity/extending-malbec-subsea-cable-southern-brazil/</link>
      <guid>https://engineering.fb.com/2025/05/22/connectivity/extending-malbec-subsea-cable-southern-brazil/</guid>
      <pubDate>Thu, 22 May 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Journey to 1000 models: Scaling Instagram’s recommendation system]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">In this post, we explore how Instagram has successfully scaled its algorithm to include over 1000 ML models without sacrificing recommendation quality or reliability. </li>
<li class="c1" aria-level="1">We delve into the intricacies of managing such a vast array of models, each with its own performance characteristics and product goals. </li>
<li class="c1" aria-level="1">We share insights and lessons learned along the way—from the initial realization that our infrastructure maturity was lagging behind our ambitious scaling goals, to the innovative solutions we implemented to bridge these gaps.</li>
</ul><p>In the ever-evolving landscape of social media, Instagram serves as a hub for creative expression and connection, continually adapting to meet the dynamic needs of its global community. At the heart of this adaptability lies a web of machine learning (ML) models, each playing a crucial role in personalizing experiences. As Instagram’s reach and influence has grown, so too has the complexity of its algorithmic infrastructure. This growth, while exciting, presents a unique set of challenges, particularly in terms of reliability and scalability.</p>
<p>Join us as we uncover the strategies and tools that have enabled Instagram to maintain its position at the forefront of social media innovation, ensuring a seamless and engaging experience for billions of users worldwide.</p>
<div class="jetpack-video-wrapper"><iframe title="Journey to 1000 Models: Scaling Instagram's algorithm without the Reliability Nightmare" width="1778" height="1000" src="https://www.youtube.com/embed/Aojmc0R1Nmo?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Are there really that many ML models in Instagram?</h2>
<p>Though Feed, Stories, and Reels are the best-known personally ranked surfaces, ranking goes much deeper—to which comments surface in Feed, which notifications are “important,” or whom you might tag in a post. These are all driven by ML recommendations.</p>
<p>Within a given surface, we’ll have different layers of the ranking funnel: sourcing (retrieval), early-stage ranking (ESR), and late-stage ranking (LSR). We operate on fewer candidates as we progress through the funnel, as the underlying operations grow more expensive (see Figure 1 below):</p>
<figure id="attachment_22526" aria-describedby="caption-attachment-22526" class="wp-caption alignnone c2"><img class="wp-image-22526" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png?w=933" alt="" width="547" height="600" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png 1566w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png?resize=835,916 835w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png?resize=768,843 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png?resize=933,1024 933w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png?resize=1400,1536 1400w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png?resize=96,105 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-1.png?resize=192,211 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22526" class="wp-caption-text">Figure 1: The ranking funnel.</figcaption></figure><p>Within each surface and layer, there is constant experimentation, and these permutations create a severe infrastructure challenge. We need to allow room for our ML engineers to experiment with changes such as adjusting weights for a given prediction. The net result, depicted below in Figure 2, is a large number of models serving user traffic in production:</p>
<figure id="attachment_22527" aria-describedby="caption-attachment-22527" class="wp-caption alignnone c3"><img class="size-large wp-image-22527" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png?w=1024" alt="" width="1024" height="187" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png?resize=916,168 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png?resize=768,141 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png?resize=1024,187 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png?resize=1536,281 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png?resize=96,18 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-2.png?resize=192,35 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22527" class="wp-caption-text">Figure 2: An expression of the factors behind the fleet’s numerical growth.</figcaption></figure><h2>How did we realize infra maturity wasn’t going to catch up?</h2>
<h3>Identified risks</h3>
<p>We identified several risks associated with scaling our algorithm, rooted in complaints about ML productivity and repeating patterns of issues:</p>
<ul><li class="c1" aria-level="1"><strong>Discovery:</strong> Even as a team focused on one app — Instagram — we couldn’t stay on top of the growth, and product ML teams were maintaining separate sources of truth, if any, for their models in production.</li>
<li class="c1" aria-level="1"><strong>Release:</strong> We didn’t have a consistent way to launch new models safely, and the process was slow, impacting ML velocity and, therefore, product innovation.</li>
<li class="c1" aria-level="1"><strong>Health:</strong> We lacked a consistent definition of model prediction quality, and with the diversity of surfaces and subtlety of degraded ranking, quality issues went unnoticed.</li>
</ul><h3>Solution overview</h3>
<p>To address these risks, we implemented several solutions:</p>
<ul><li class="c1" aria-level="1"><strong>Model registry:</strong> We built a registry that serves, first and foremost, as a ledger of each production model’s importance and business function, among other metadata. This registry is our foundational source of truth, upon which we can build automation to uplevel system-wide observability, change management, and model health.</li>
<li class="c1" aria-level="1"><strong>Model launch tooling:</strong> We developed a streamlined flow for launching new models that includes estimation, approval, prep, scale-up, and finalization. This process is now automated, and we’ve reduced the time it takes to launch a new model from days to hours.</li>
<li class="c1" aria-level="1"><strong>Model stability:</strong> We defined and operationalized model stability, a pioneering metric that measures the accuracy of our model predictions. We’ve leveraged model stability to produce SLOs for all models in the model registry, which enables simple understanding of the entire product surface’s ML health.</li>
</ul><h2>Model registry</h2>
<h3>What did model investigations look like prior to the registry?</h3>
<p>Before we created the model registry, the investigation process was a time-consuming and error-prone experience for on-call engineers and model owners. An on-call engineer had to ask model owners multiple questions, as depicted in Figure 3 below, to gather context about what the model does in the stack and to clarify how important it is to the business.</p>
<figure id="attachment_22528" aria-describedby="caption-attachment-22528" class="wp-caption alignnone c3"><img class="size-large wp-image-22528" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png?w=1024" alt="" width="1024" height="536" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png 1782w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png?resize=916,479 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png?resize=768,402 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png?resize=1024,536 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png?resize=1536,803 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png?resize=96,50 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-3.png?resize=192,100 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22528" class="wp-caption-text">Figure 3: A fictional but typical non-productive investigation.</figcaption></figure><p>Understanding this context is extremely important to the operational response: Depending on the importance of the model and the criticality of the surface it’s supporting, the response is going to differ in kind. When a model is an experiment serving a small percentage of the traffic, an appropriate response can be to end the experiment and reroute the traffic back to the main model (the baseline). But if there’s a problem with the baseline model that needs to be handled with urgency, it’s not possible to “just turn it off.” The engineer on call has to loop in the model owner, defeating the purpose of having a dedicated on-call.</p>
<p>To avoid holding up an operational response on a single POC, we needed a central source of truth for model importance and business function. <em>What if the model owner is not available? What if 10 of these issues happen concurrently?</em> </p>
<p>With the development of the model registry, we standardized the collection of model importance and business function information, ensuring most of our operational resources were going towards the most important models.</p>
<h3>What problems did the model registry solve?</h3>
<p>The model registry is a system of record built on top of <a href="https://research.facebook.com/publications/holistic-configuration-management-at-facebook/" target="_blank" rel="noopener">Configerator</a>, Meta’s distributed configuration suite. This schematized ledger (see an example in Figure 4, detailed further below) provides read-and-write access to operational data based on the inventory of production models. It’s a flexible and extensible foundation for building automation and tools that solve problems specific to individual organizations within Meta and not covered by the general tooling. </p>
<figure id="attachment_22529" aria-describedby="caption-attachment-22529" class="wp-caption alignnone c3"><img class="size-large wp-image-22529" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png?w=1024" alt="" width="1024" height="375" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png?resize=916,335 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png?resize=768,281 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png?resize=1024,375 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png?resize=1536,562 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-4.png?resize=192,70 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22529" class="wp-caption-text">Figure 4: An abridged example of what a model registry entry looks like.</figcaption></figure><p>As Instagram scaled its investment in AI through <a href="https://engineering.fb.com/2023/08/09/ml-applications/scaling-instagram-explore-recommendations-system/" target="_blank" rel="noopener">rapid innovation in content recommendations</a>, the number of models and AI assets grew; as a result, it has been increasingly important — but also increasingly difficult — to maintain a minimum standard for all of our models, as we lacked an authoritative source for the business context as well as for a model’s importance. </p>
<p>In creating the model registry, we set out to provide a structured interface for collecting business context via model types, importance via criticality, and additional metadata that would enable model understanding. Below, we’ll get into the model types, criticality, and automation we’ve built for this purpose.</p>
<h4>Model types</h4>
<p>At a high level, a model type describes the purpose of an ML workload: it represents a category or class of models that share a common purpose or are used in similar contexts. For example, we have “ig_stories_tray_mtml”, a string attached to training flows, model checkpoints, inference services, and more. Put simply, a model type tells the reader this model’s purpose in the ranking funnel.</p>
<p>Let’s break it down: </p>
<p>“ig_stories_tray_mtml” → “ig” “stories” “tray” “mtml”</p>
<ul><li class="c1" aria-level="1"><strong>“ig”:</strong> This model is an “ig” model as opposed to “fb” or “whatsapp”.</li>
<li class="c1" aria-level="1"><strong>“stories”:</strong> This model serves IG Stories.</li>
<li class="c1" aria-level="1"><strong>“tray”:</strong> This model serves in the main IG Stories tray (as opposed to stories in some other surface).</li>
<li class="c1" aria-level="1"><strong>“mtml”:</strong> This model is a multi-task-multi-label model, commonly used in late-stage ranking.</li>
</ul><p>We can then use these model type strings to tag AI assets, and since they serve as proxies for business context, we can also use them for asset management, policy enforcement, analytics, and more.</p>
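<p>Decomposing a model type string this way is mechanical. A minimal sketch (the function and field names are our own illustration, not a real registry API):</p>

```python
def parse_model_type(model_type: str) -> dict:
    """Split a model-type string like "ig_stories_tray_mtml" into its
    business-context components (field names are illustrative)."""
    app, surface, subsurface, architecture = model_type.split("_", 3)
    return {
        "app": app,                   # "ig" as opposed to "fb" or "whatsapp"
        "surface": surface,           # product surface, e.g., "stories"
        "subsurface": subsurface,     # placement, e.g., the main "tray"
        "architecture": architecture, # model class, e.g., "mtml"
    }
```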
<p>The metadata entries in the model registry are anchored on two main types that describe model instances (ModelMetadata) as well as model types (ModelTypeMetadata). These types are made up of “core” attributes that are universally applicable, as well as “extended” attributes that allow different teams to encode their opinions about how these entries will inform operations. For example, in Instagram our extended attributes encode “baseline” and “holdout” model IDs, which are used in our ranking infrastructure to orchestrate ranking funnel execution. </p>
<h4>Criticality</h4>
<p>In addition to defining business function, we had to establish clear guidelines for model importance. Within Meta, SEVs and services have a unified-importance tier system where the Global Service Index (GSI) records a criticality from TIER0 to TIER4 based on the maximum incident severity level the service can cause, from SEV0 as the most critical to SEV4 as simply a “heads up.” Since GSI criticality had social proof at the company, and infra engineers were familiar with this system, we adopted these criticalities for models and now annotate them at the model type and model level.</p>
<p>No longer could each team unilaterally raise its own model services to TIER1, increasing the burden on every team that supports those models. To qualify for elevated monitoring, teams needed to provide an immediate, 24/7 on-call response and demonstrate that their models contributed meaningfully to critical business metrics.</p>
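<p>Putting the pieces together, a registry entry pairs business context (the model type) with importance (a GSI-style criticality tier), plus core and extended attributes. The sketch below is a hypothetical Python rendering for illustration; the real registry is a schematized Configerator config, and the field names here are assumptions:</p>

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Criticality(IntEnum):
    # GSI-style tiers: a TIER0 service can cause a SEV0 (most critical
    # incident); a TIER4 service at worst a SEV4 ("heads up").
    TIER0 = 0
    TIER1 = 1
    TIER2 = 2
    TIER3 = 3
    TIER4 = 4

@dataclass
class ModelMetadata:
    # "Core" attributes: universally applicable.
    model_id: str
    model_type: str          # e.g., "ig_stories_tray_mtml"
    criticality: Criticality
    # "Extended" attributes: team-specific; Instagram encodes baseline
    # and holdout model IDs for ranking-funnel orchestration.
    baseline_model_id: Optional[str] = None
    holdout_model_id: Optional[str] = None
```

<p>Because the tiers are ordered, automation can compare entries directly, e.g., alerting only on models at TIER1 or above.</p>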
<h4>Configuration structure as a foundation for automation </h4>
<p>Once we had onboarded a critical mass of Instagram models to the model registry, we could begin to fully integrate with our monitoring and observability suite using our Meta-wide configuration solution, Configerator. With this, we could now have model performance monitoring and alerts that are fully automated and integrated with our tooling for <a href="https://engineering.fb.com/2021/12/13/production-engineering/slick/" target="_blank" rel="noopener">SLIs called SLICK</a>, dashboards that allow us to monitor models across many time series dimensions, and a suite of alerting specific to the model that is driven from the entries in the model registry.</p>
<p>This provided all our teams confidence that our monitoring coverage was complete and automated.</p>
<h2>Launching</h2>
<p>While a point-in-time snapshot of models in production is great for static systems, Instagram’s ML landscape is constantly shifting. With the rapid increase of iteration on the recommendation system driving an increased number of launches, it became clear our infrastructure support to make this happen was not adequate. Time-to-launch was a bottleneck in ML velocity, and we needed to drive it down.</p>
<h3>What did the process look like?</h3>
<p>Conventionally, services were long-lived systems with engineers dedicated to supporting and tuning them. Even when changes introduced new capacity-regression risks, we could gate them behind change-safety mechanisms. </p>
<p>However, our modeling and experimentation structure was unique in that we were planning for more rapid iteration, and our options were insufficient. To safely test the extent of load a new service could support, we would clone the entire service, send shadow traffic (i.e., cloned traffic that isn’t processed by our clients), and run multiple overload tests until we found a consistent peak throughput. But this wasn’t a perfect science. Sometimes we didn’t send enough traffic, and sometimes we’d send too much, and the amount could change throughout the day due to variations in global user behavior. </p>
<p>This could easily take two days to get right, including actually debugging the performance itself when the results weren’t expected. Once we got the result, we’d then have to estimate the final cost. Below (in Figure 5) is the formula we landed on.</p>
<figure id="attachment_22530" aria-describedby="caption-attachment-22530" class="wp-caption alignnone c3"><img class="size-large wp-image-22530" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png?w=1024" alt="" width="1024" height="131" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png 1828w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png?resize=916,117 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png?resize=768,98 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png?resize=1024,131 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png?resize=1536,197 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png?resize=96,12 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-5.png?resize=192,25 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22530" class="wp-caption-text">Figure 5: A formula calculating capacity estimations for a new launch.</figcaption></figure><p>The actual traffic shifting portion was tedious as well. For example, when we managed to fully estimate that we needed 500 replicas to host the new service, we might not actually have 500 spares lying around to do a full replacement, so launching was a delicate process of partially sizing up by approximately 20%, sending 20% of traffic over, and then scaling down the old service by 20% to reclaim and recycle the capacity. Rinse, repeat. Inefficient!</p>
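<p>That partial size-up/shift/scale-down cycle can be sketched as a loop. The 20% step and 500-replica figure come from the example above; the <em>Service</em> class and its methods are hypothetical stand-ins for real deployment tooling:</p>

```python
class Service:
    """Minimal stand-in for a model-serving deployment (illustrative)."""
    def __init__(self, replicas=0):
        self.replicas = replicas
        self.traffic_share = 0.0

    def add_replicas(self, n):
        self.replicas += n

    def remove_replicas(self, n):
        self.replicas = max(0, self.replicas - n)

def shift_traffic(old, new, target_replicas, step=0.20):
    """Migrate traffic in increments when spare capacity is scarce:
    partially size up the new service, move that share of traffic,
    then scale down the old service to reclaim and recycle capacity."""
    shifted = 0.0
    while shifted < 1.0:
        frac = min(step, 1.0 - shifted)
        new.add_replicas(round(target_replicas * frac))     # partial size-up
        shifted += frac
        new.traffic_share = shifted                         # shift traffic over
        old.remove_replicas(round(target_replicas * frac))  # reclaim capacity
    return shifted
```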
<p>And by the time we got to the end of this arduous process, the ordeal still wasn’t over. Each team was responsible for correctly setting up new alerts for their baseline in a timely fashion, or else their old models could and did trigger false alarms. </p>
<h3>How does enforcing virtual pools aid product growth?</h3>
<p>One of the prerequisites for fixing competition for resources and unblocking productivity was to put up guardrails. Prior to this, it was “first come first served,” with no clear way to even “reserve” future freed capacity. It was also hard to reason about fairness from an infra perspective: Would it make sense to give each team equal pools, or give each individual person a maximum limit? </p>
<p>As it turned out, not all MLEs are experimenting at the same time, due to staggered progress on their work, so individual (per-engineer) limits were not ideal. One member might be in the experimentation stage and another might be training. So our solution was to provide bandwidth to each team. </p>
<p>Once each team — and therefore product — had quotas distributed, their launch policy became more clear-cut. Some teams allowed free launching as long as the team was within quota. Others required no regressions in capacity usage. But mostly this unlocked our ability to run launches in parallel, since each one required much less red tape, and prioritization was no longer done at the org level.</p>
<h3>What other tooling improved launching?</h3>
<p>As mentioned earlier, preplanning with capacity estimations was critical to understanding cost and ensuring reliability. We were often asked, <em>Why not let autoscaling take care of everything?</em> The problem was that each service could be configured slightly differently than a previously optimized service, or some architectural change could have affected the performance of the model. We didn’t have an infinite amount of supply to work with, so by the time we fully traffic-shifted everything over, we might find that we didn’t have enough supply. Reverting is costly, taking hours to get through each stage.</p>
<p>By doing capacity estimations in advance, this also allowed us and each team to accurately evaluate metric improvement versus cost. It might be worthwhile to double our costs if something would increase time spent on the app by 1%, but likely not for a 0.05% improvement where we could better spend that capacity funding another initiative.  </p>
<p>With partners in AI Infra, we developed two major solutions to this process: offline performance evaluation and an automated launching platform.</p>
<p>We simplified determining the performance of a new service using recorded traffic. Pre-recorded traffic was continuously collected into a data warehouse that the benchmarker could read from, and we’d spin up temporary jobs with this automation. One job would continuously replay different levels of traffic and send it to another job that was a clone of the existing experiment. By setting stop conditions on desired latency and error rates, the tooling would eventually output a converged, stable number that we could treat as the max load (see Figure 6).</p>
<figure id="attachment_22531" aria-describedby="caption-attachment-22531" class="wp-caption alignnone c4"><img class="wp-image-22531" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-6.png?w=1024" alt="" width="575" height="500" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-6.png 1174w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-6.png?resize=916,796 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-6.png?resize=768,667 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-6.png?resize=1024,890 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-6.png?resize=96,83 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-6.png?resize=192,167 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22531" class="wp-caption-text">Figure 6: Load tests converging on an accurate measure of load.</figcaption></figure><p>The launch platform itself would input the numbers we captured from these tests, automatically collect demand data as defined, and run that same formula to calculate a cost. The platform would then perform the upscaling/downscaling cycle for teams as we shifted traffic.</p>
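<p>One way to picture how such a benchmarker converges on a stable max-load number is a binary search over replay QPS that backs off whenever a latency or error budget is breached. This is our own simplified sketch, with illustrative budgets and a fake measurement function standing in for the real replay jobs:</p>

```python
def find_max_load(measure, low=0.0, high=10_000.0,
                  p99_budget_ms=200.0, error_budget=0.001, tol=10.0):
    """Binary-search replay QPS for the highest sustainable load:
    push harder while p99 latency and error rate stay within budget,
    back off once either threshold is breached (illustrative)."""
    best = low
    while high - low > tol:
        qps = (low + high) / 2.0
        p99_ms, error_rate = measure(qps)  # replay recorded traffic at `qps`
        if p99_ms <= p99_budget_ms and error_rate <= error_budget:
            best = low = qps               # healthy: raise the floor
        else:
            high = qps                     # overloaded: lower the ceiling
    return best

# A fake benchmark: the service degrades sharply above 3,000 QPS.
def fake_measure(qps):
    return (100.0, 0.0) if qps < 3000 else (500.0, 0.05)
```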
<p>And finally, by leveraging the model registry, we were able to land this model change in code (see example in Figure 7), to help us better maintain and understand the 1000+ models within our fleet. Likewise, this bolstered our trust in the model registry, which was now directly tied to the model launch lifecycle.</p>
<figure id="attachment_22532" aria-describedby="caption-attachment-22532" class="wp-caption alignnone c3"><img class="size-large wp-image-22532" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png?w=1024" alt="" width="1024" height="446" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png?resize=916,399 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png?resize=768,334 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png?resize=1024,446 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png?resize=1536,668 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png?resize=96,42 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-7.png?resize=192,84 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22532" class="wp-caption-text">Figure 7: A theoretical model registry change during launch.</figcaption></figure><p>This suite of launch automation has dramatically reduced the class of SEVs related to model launches, improved our pace of innovation from a few to more than 10 launches per week, and reduced the amount of time engineers spend conducting a launch by more than two days.</p>
<h2>Model stability</h2>
<p>As the number of models in production increased, our organization started to feel the effects of an inconsistent measure of model health. While ranking models are run like any other distributed backend system (receive a request, produce a response), one may think a universal SLO that measures request success rate can suffice to capture holistic health. This is not the case for ranking models, as the <em>accuracy</em> of recommendations received carries significant importance to the end-user experience. If we consider a user who is a huge fan of golf but does not enjoy cooking content (see the “available &amp; irrelevant” case in Figure 8 below), we see an example of this <em>inaccuracy</em> in practice. This is precisely what the model stability metric sought to capture.</p>
<figure id="attachment_22533" aria-describedby="caption-attachment-22533" class="wp-caption alignnone c3"><img class="size-large wp-image-22533" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png?w=1024" alt="" width="1024" height="586" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png?resize=916,524 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png?resize=768,439 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png?resize=1024,586 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png?resize=1536,878 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png?resize=96,55 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-8.png?resize=192,110 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22533" class="wp-caption-text">Figure 8: Different types of responses that can be provided to an end user.</figcaption></figure><h3>Why is measuring ranking model reliability unique?</h3>
<p>Ranking models, unlike traditional idempotent request/response backends, produce scores predicting user action given a set of candidates (PLIKE, PCOMMENT, PFOLLOW, etc.). These scores are then combined and used to determine which candidates are most relevant to an end user. It’s important that these scores accurately reflect user interest, as their accuracy is directly correlated with user engagement. If we recommend irrelevant content, user engagement suffers. The model stability metric was designed to make it easy to measure this <em>accuracy</em> and detect <em>inaccuracy</em> at our scale. </p>
<p>Let’s discuss how this works.</p>
<h3>Defining model stability</h3>
<p>Models are complex, and they produce multiple output predictions. Let’s take a simplified example (shown in Figure 9 below) of a multi-task-multi-label (MTML) model predicting three actions:</p>
<figure id="attachment_22534" aria-describedby="caption-attachment-22534" class="wp-caption alignnone c3"><img class="wp-image-22534" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png?w=1024" alt="" width="1024" height="625" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png 1926w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png?resize=916,559 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png?resize=768,469 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png?resize=1024,625 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png?resize=1536,938 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-9.png?resize=192,117 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22534" class="wp-caption-text">Figure 9: A simplified MTML model predicting three actions.</figcaption></figure><p>For us to claim this model is stable, we must also claim that each underlying <em>prediction</em> is stable.</p>
<p>When evaluating the accuracy of a ranking model’s predictions, we typically look at two metrics:</p>
<ul><li class="c1" aria-level="1">Model <strong>calibration</strong>, which is based on observed real-world outcomes and answers the question, <em>“Are we over- or under-predicting user action?”</em> It is calculated as the ratio of predicted click-through rate (CTR) to empirical CTR. A perfect predictor will have calibration centered at 1.</li>
<li class="c1" aria-level="1">Model <strong>normalized entropy</strong> (NE), which measures the discriminative power of a predictor, and answers the question, <em>“How well can this predictor separate action from inaction?”</em> It is calculated as a ratio of the average log-loss per impression to what the average log-loss per impression would be if we always predicted the empirical CTR. With NE, lower values are better, and an NE of 1 is equivalent to random predictions.</li>
</ul><p>(For more information regarding our choice of prediction evaluation metrics, please refer to the paper, “<a href="https://research.facebook.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/" target="_blank" rel="noopener">Practical Lessons from Predicting Clicks on Ads at Facebook.</a>”)</p>
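<p>Both metrics can be computed directly from a prediction head’s scores and the observed binary outcomes. A minimal sketch, assuming a list of predicted probabilities and a list of 0/1 labels:</p>

```python
import math

def calibration(preds, labels):
    """Predicted CTR over empirical CTR; 1.0 is perfect on average."""
    return (sum(preds) / len(preds)) / (sum(labels) / len(labels))

def normalized_entropy(preds, labels, eps=1e-12):
    """Average log-loss per impression, normalized by the log-loss of
    always predicting the empirical CTR. Lower is better; 1.0 is no
    better than a background-rate predictor."""
    n = len(preds)
    ll = -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
              for p, y in zip(preds, labels)) / n
    ctr = sum(labels) / n
    baseline = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return ll / baseline
```

<p>Note that a predictor that always outputs the empirical CTR is perfectly calibrated yet has an NE of exactly 1, which is why both metrics are needed.</p>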
<p>A model’s predictions are unstable when either <strong>calibration</strong> or <strong>NE</strong> are out of their expected healthy ranges. To determine what a healthy range is, we must look at each metric in real time, and Figure 10 below shows what these time series can look like:</p>
<figure id="attachment_22535" aria-describedby="caption-attachment-22535" class="wp-caption alignnone c3"><img class="size-large wp-image-22535" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png?w=1024" alt="" width="1024" height="511" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png?resize=916,457 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png?resize=768,383 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png?resize=1024,511 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png?resize=1536,766 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png?resize=96,48 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-10.png?resize=192,96 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22535" class="wp-caption-text">Figure 10: Example predictions of calibration and NE over a period of time.</figcaption></figure><p>By observing the trend of a healthy prediction, we can apply thresholds for our evaluation metrics. When these thresholds are breached, the underlying prediction is considered unstable.</p>
<p>From here, we can define <em>model stability</em> as a binary indicator across a model’s predictions. It is 1 if all underlying predictions are stable, and 0 if any prediction is unstable. This is an extremely powerful method of reacting to real-time prediction instability, as well as a tool for understanding trends in predictive health per model or across distinct products’ ranking funnels.</p>
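<p>The binary indicator follows directly from per-prediction thresholds. In this sketch, the healthy calibration band and NE ceiling are illustrative values, not our production thresholds:</p>

```python
def prediction_is_stable(calibration, ne, cal_range=(0.9, 1.1), ne_max=0.95):
    # A prediction is stable when calibration sits inside its healthy
    # band and NE stays below its ceiling (bounds are illustrative).
    return cal_range[0] <= calibration <= cal_range[1] and ne <= ne_max

def model_stability(predictions):
    """1 if every prediction (e.g., PLIKE, PCOMMENT, PFOLLOW) is
    stable, 0 if any single prediction is unstable."""
    return int(all(prediction_is_stable(p["calibration"], p["ne"])
                   for p in predictions.values()))
```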
<h3>Operationalizing model stability</h3>
<p>With a real-time view into model predictive health, we can apply this unified definition of model stability to all of our models in production, once again leveraging the model registry as the ledger that holds this important data. In Figure 11 below, we can see the addition of model stability metric metadata after we determined the expected thresholds.</p>
<figure id="attachment_22536" aria-describedby="caption-attachment-22536" class="wp-caption alignnone c3"><img class="size-large wp-image-22536" src="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png?w=1024" alt="" width="1024" height="446" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png?resize=916,399 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png?resize=768,334 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png?resize=1024,446 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png?resize=1536,668 1536w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png?resize=96,42 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Instagram-1000-models-image-11.png?resize=192,84 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22536" class="wp-caption-text">Figure 11: Model stability definitions stored in the model registry.</figcaption></figure><p>Given the large number of models in production, each producing many predictions, building a portable definition of model health applicable to all of our ranking models represented an important milestone toward upleveling Instagram’s ML infrastructure maturity. This has unlocked our ability to build generic alerting to guarantee detection of our most important models becoming unstable, thereby moving us closer to mitigation when our recommendation system is at risk. </p>
<p>Since the addition of these metrics and alerting, ML teams have discovered previously hidden issues within their models and addressed them faster than before, leading to higher-quality recommendations.</p>
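<p>To make the mechanics concrete, here is a minimal sketch (our illustration, not Instagram’s internal API; the registry contents, metric names, and thresholds are invented) of how stability definitions stored in a registry could drive generic alerting:</p>

```python
from dataclasses import dataclass

@dataclass
class StabilityDefinition:
    # Expected bounds for one model-health metric, as held in the registry.
    metric: str
    lower: float
    upper: float

# Hypothetical registry ledger: model name -> stability definitions.
REGISTRY = {
    "feed_ranker_v3": [
        StabilityDefinition("mean_prediction_score", 0.10, 0.35),
        StabilityDefinition("prediction_stddev", 0.01, 0.20),
    ],
}

def check_stability(model: str, observed: dict) -> list:
    """Return the metrics whose observed value falls outside the
    registered thresholds (candidates for alerting)."""
    violations = []
    for d in REGISTRY.get(model, []):
        value = observed.get(d.metric)
        if value is not None and not (d.lower <= value <= d.upper):
            violations.append(d.metric)
    return violations

# A drifting mean prediction score trips the alert; the stddev is healthy.
print(check_stability("feed_ranker_v3",
                      {"mean_prediction_score": 0.52,
                       "prediction_stddev": 0.05}))
# -> ['mean_prediction_score']
```

Because the definitions live in the registry rather than in per-team code, the same check can run against every production model.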
<h2>Key takeaways</h2>
<p>In our journey to scale Instagram’s algorithm to manage over 1000 models, we have learned several critical lessons that have shaped our approach and infrastructure. These takeaways not only highlight the challenges we faced but also underscore the strategies that led to our success.</p>
<h3>Infra understanding is the foundation to building the right tools</h3>
<p>A unified understanding of our infrastructure footprint was essential in developing the right tools to support our scaling efforts. By identifying the gaps and potential risks in our existing systems, we were able to implement solutions such as the model registry that significantly improved our operational efficiency and reliability posture.</p>
<h3>Helping colleagues move fast means we all move faster</h3>
<p>By addressing the model iteration bottleneck, we enabled our teams to innovate more rapidly. Our focus on creating a seamless, self-service process for model iteration empowered client teams to take ownership of their workflows. This not only accelerated their progress but also reduced the operational burden on our infrastructure team. As a result, the entire organization benefited from increased agility and productivity.</p>
<h3>Reliability must consider quality</h3>
<p>Ensuring the reliability of our models required us to redefine how we measure and maintain model quality. By operationalizing model stability and establishing clear metrics for model health, we were able to proactively manage the performance of our models. This approach enables us to maintain high standards of quality across our recommendation systems, ultimately enhancing user engagement and satisfaction.</p>
<p>Our experience in scaling Instagram’s recommendation system has reinforced the importance of infrastructure understanding, collaboration, and a focus on quality. By building robust tools and processes, we have not only improved our own operations but also empowered our colleagues to drive innovation and growth across the platform.</p>]]></description>
      <link>https://engineering.fb.com/2025/05/21/production-engineering/journey-to-1000-models-scaling-instagrams-recommendation-system/</link>
      <guid>https://engineering.fb.com/2025/05/21/production-engineering/journey-to-1000-models-scaling-instagrams-recommendation-system/</guid>
      <pubDate>Wed, 21 May 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta’s Full-stack HHVM optimizations for GenAI]]></title>
<description><![CDATA[<p>As Meta has launched new, innovative products leveraging generative AI (GenAI), we need to make sure the underlying infrastructure components evolve along with it. Applying infrastructure knowledge and optimizations has allowed us to adapt to changing product requirements, delivering a better product along the way. Ultimately, our infrastructure systems need to balance our need to ship high-quality experiences with a need to run systems sustainably.</p>
<p>Splitting GenAI inference traffic out into a dedicated WWW tenant, which allows specialized runtime and warm-up configuration, has enabled us to meet both of those goals while delivering a 30% improvement in latency. </p>
<div class="jetpack-video-wrapper"><iframe title="Splitting the Monolith | Phil Lopreiato &amp; Zach Zundel" width="1778" height="1000" src="https://www.youtube.com/embed/QBIqvBy3lqg?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Who we are</h2>
<p>As the Web Foundation team, we operate Meta’s monolithic web tier, running <a href="https://hacklang.org/">Hack</a>. The team is composed of cross-functional engineers who make sure the infrastructure behind the web tier is healthy and well designed. We jump into incident response, work on some of the most complex areas of the infrastructure, and help build whatever we need to keep the site happily up and running.</p>
<p>To accomplish this, we have established a series of best practices on being a “good citizen” of the shared tier. We need to ensure that all requests comply with these guidelines to prevent issues from spilling over and affecting other teams’ products. One core rule is the request runtime—limiting a request to 30 seconds of execution. This is a consequence of the <a href="https://docs.hhvm.com/hhvm/">HHVM (HipHop Virtual Machine) runtime</a>—each request has a corresponding worker thread, of which there is a finite number. To ensure there are always threads available to serve incoming requests, we need to balance the resources available on each host with its expected throughput. If requests are taking too long, there will be fewer available threads to process new requests, leading to user-visible unavailability. </p>
<h2>The changing landscape</h2>
<p>Classically, webservers at Meta are optimized for serving front-end requests—rendering webpages and serving GraphQL queries. These requests’ latency is typically measured in hundreds of milliseconds to seconds (substantially below the 30-second limit), which enables hosts to process approximately 500 queries per second.</p>
<p>Additionally, a web server will spend about two-thirds of its time doing input/output (I/O), and the remaining third doing CPU work. This fact has influenced the design of the Hack language, which supports async/await, a form of cooperative multitasking, and all the core libraries support these primitives to increase performance and decrease the amount of time the CPU sits idle, waiting for I/O.</p>
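<p>The benefit of cooperative multitasking for I/O-bound requests can be sketched with Python’s asyncio, which is analogous to Hack’s async primitives (the request shape and delays here are invented for illustration):</p>

```python
import asyncio, time

async def fetch(name: str, delay: float) -> str:
    # Stand-in for an I/O call (database, service RPC); the await yields
    # control so other pending I/O can proceed concurrently.
    await asyncio.sleep(delay)
    return name

async def handle_request() -> list:
    # Overlapping the three waits takes roughly the longest single wait
    # (~0.15s) instead of their sum (~0.30s) if run serially.
    return list(await asyncio.gather(
        fetch("config", 0.05), fetch("user", 0.10), fetch("feed", 0.15)))

start = time.monotonic()
results = asyncio.run(handle_request())
elapsed = time.monotonic() - start
print(results)
```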
<p>GenAI products, especially LLMs, have a different set of requirements. These are driven by the core inference flow: The model responds with a stream of tokens that can take seconds or minutes to complete. A user may see this as a chatbot “typing” a response. This isn’t an effect to make our products seem friendlier; it’s the speed at which our models think! After a user submits a query to the model, we need to start streaming these responses back to the user as fast as possible. On top of that, the total latency of the request is now substantially longer (measured in seconds). These properties have two effects on the infrastructure—minimal overhead on the critical path before calling the LLM, and a long duration for the rest of the request, most of which is spent waiting on I/O. (See Figures 1 and 2 below).</p>
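<p>A sketch of the streaming pattern described above: flushing each token to the client as it arrives rather than buffering the full completion, so time-to-first-token stays low even though total request latency is long (the token source and timings here are invented stand-ins):</p>

```python
import asyncio

async def model_tokens():
    # Stand-in for the inference backend: tokens arrive over time.
    for tok in ["The", " model", " is", " typing", "..."]:
        await asyncio.sleep(0.01)  # simulated inter-token latency
        yield tok

async def stream_response(write) -> int:
    # Write each token to the client immediately instead of waiting
    # for the whole completion.
    count = 0
    async for tok in model_tokens():
        write(tok)
        count += 1
    return count

chunks = []
n = asyncio.run(stream_response(chunks.append))
print(n, "".join(chunks))
```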
<figure id="attachment_22500" aria-describedby="caption-attachment-22500" class="wp-caption alignnone c1"><img class="wp-image-22500" src="https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-figure-1.png?w=1024" alt="" width="600" height="288" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-figure-1.png 1419w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-figure-1.png?resize=916,440 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-figure-1.png?resize=768,369 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-figure-1.png?resize=1024,491 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-figure-1.png?resize=96,46 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-figure-1.png?resize=192,92 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22500" class="wp-caption-text">Figure 1: Percent of time spent on I/O, typical requests (~70%) vs. 
GenAI (~90%).</figcaption></figure><figure id="attachment_22501" aria-describedby="caption-attachment-22501" class="wp-caption alignnone c1"><img class="wp-image-22501" src="https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-image3.png?w=1024" alt="" width="600" height="395" srcset="https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-image3.png 1314w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-image3.png?resize=916,602 916w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-image3.png?resize=768,505 768w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-image3.png?resize=1024,673 1024w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-image3.png?resize=96,63 96w, https://engineering.fb.com/wp-content/uploads/2025/05/Meta-GenAI-HHVM-image3.png?resize=192,126 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22501" class="wp-caption-text">Figure 2: Overall request latency CDF; typical requests vs. GenAI.</figcaption></figure><h2>A series of optimizations</h2>
<p>This shift in requirements allowed Web Foundation to reexamine the rules of running the monolithic web tier. We then launched a dedicated web tenant (a standalone deployment of WWW) that allowed custom configuration, which we could better tune to the needs of the workload.</p>
<h3>Request timeout</h3>
<p>First, running on an isolated web tier allowed us to increase the runtime limit for GenAI requests. This is a straightforward change, but it allowed us to isolate the longer-running traffic to avoid adverse impacts on the rest of the production tier. This way, we can avoid requests timing out if inference takes longer than 30 seconds.</p>
<h3>Thread-pool sizing</h3>
<p>Running requests for longer means there is reduced availability of worker threads (which, remember, map 1:1 with processed requests). Since webservers have a finite amount of memory, we can divide the total memory available by the per-request memory limit to get a peak number of active requests; this in turn tells us how many requests we can execute simultaneously. We ended up running with approximately 1000 threads on GenAI hosts, as compared to a couple of hundred on normal webservers.</p>
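<p>The sizing arithmetic can be sketched as follows (the numbers are illustrative, not Meta’s actual memory limits):</p>

```python
def peak_worker_threads(total_mem_mib: int, per_request_limit_mib: int,
                        reserved_mib: int = 0) -> int:
    # Worker threads map 1:1 to in-flight requests, so memory is the
    # binding constraint: divide what remains after fixed runtime
    # overhead by the per-request memory limit.
    usable = total_mem_mib - reserved_mib
    return usable // per_request_limit_mib

# Illustrative numbers: a 64 GiB host with 14 GiB reserved for the
# runtime and a 50 MiB per-request limit supports on the order of a
# thousand concurrent workers.
print(peak_worker_threads(64 * 1024, 50, reserved_mib=14 * 1024))
# -> 1024
```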
<h3>JIT cache and “jumpstart”</h3>
<p>HHVM uses just-in-time (JIT) compilation, which means the first time a given function executes, the machine needs to compile it to lower-level machine code for execution. Additionally, a technique called <a href="https://engineering.fb.com/2021/03/03/developer-tools/hhvm-jump-start/">Jump-Start</a> allows a webserver to seed its JIT cache with outputs from a previously warmed server. By allowing GenAI hosts to use Jump-Start profiles from the main web tier, we are able to greatly speed up execution, even if the code overlap is not identical. </p>
<h3>Request warm-up</h3>
<p>HHVM also supports the execution of dummy requests at server startup, executing them and discarding the results. The intent here is to warm non-code caches within the webserver. Configuration values and service discovery info are normally fetched inline the first time they are needed and then cached within the webserver. By fetching and caching this information in warm-up requests, we prevent our users from observing the latency of these initial fetches. </p>
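<p>A minimal sketch of the warm-up idea, assuming a slow first-time config fetch behind a cache (the function names and delays are our own invention):</p>

```python
import functools, time

@functools.lru_cache(maxsize=None)
def get_config(key: str) -> str:
    # Stand-in for a config/service-discovery fetch that is slow the
    # first time and cached afterwards.
    time.sleep(0.05)
    return f"value-for-{key}"

def warm_up():
    # Dummy request at server startup: results are discarded, but the
    # caches are hot before real traffic arrives.
    for key in ("feature_flags", "service_endpoints"):
        get_config(key)

warm_up()
start = time.monotonic()
get_config("feature_flags")  # real request: served from cache
print(round(time.monotonic() - start, 3))
```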
<h3>Shadow traffic</h3>
<p>Finally, Meta heavily uses real-time configuration to control feature rollouts, which means that Jump-Start profiles consumed at startup time might not cover all <em>future</em> code paths the server will execute. To maintain coverage in the steady state, we also added request shadowing, so we can ensure that gating changes are still covered in the JIT cache.</p>
      <link>https://engineering.fb.com/2025/05/20/web/metas-full-stack-hhvm-optimizations-for-genai/</link>
      <guid>https://engineering.fb.com/2025/05/20/web/metas-full-stack-hhvm-optimizations-for-genai/</guid>
      <pubDate>Tue, 20 May 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Open-sourcing Pyrefly: A faster Python type checker written in Rust]]></title>
<description><![CDATA[<p>Back in 2017, engineers at Meta sought to create a type checker for Instagram’s typed Python codebase. Years later, as the type system continued to evolve, that type checker eventually became <a href="https://engineering.fb.com/2025/05/15/developer-tools/introducing-pyrefly-a-new-type-checker-and-ide-experience-for-python">Pyrefly</a>.</p>
<p>Pyrefly is a new type checker and IDE experience for Python, written in Rust, and <a href="https://pyrefly.org">now available for the entire Python community to use</a>! It’s open source, supports both CLI usage and IDE integration, and is designed to help you catch errors before runtime in Python codebases of any size.</p>
<p>On this episode of the Meta Tech Podcast, <a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> sits down with Maggie, Rebecca, and Neil — some of the team behind Pyrefly — to discuss this latest release from Meta and how they built an incremental type checker that scales to monorepos.</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/36576150/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/4JhEid69dDIB2f82bkvjYu" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/gb/podcast/open-sourcing-pyrefly-a-faster-python-type-checker/id1370910331?i=1000708623648" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/v3dj86hr" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>
<h4 dir="ltr"><strong>Links</strong></h4>
<ul><li dir="ltr" role="presentation"><a href="https://pyrefly.org/">Pyrefly</a></li>
<li dir="ltr" role="presentation"><a href="https://pyre-check.org/">Pyre</a></li>
<li dir="ltr" role="presentation"><a href="https://github.com/astral-sh/ruff">Ruff</a></li>
<li dir="ltr" role="presentation"><a href="https://peps.python.org/pep-0484/">PEP 484</a></li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/05/15/developer-tools/open-sourcing-pyrefly-a-faster-python-type-checker-written-in-rust/</link>
      <guid>https://engineering.fb.com/2025/05/15/developer-tools/open-sourcing-pyrefly-a-faster-python-type-checker-written-in-rust/</guid>
      <pubDate>Thu, 15 May 2025 20:30:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing Pyrefly: A new type checker and IDE experience for Python]]></title>
      <description><![CDATA[<p>Today we are announcing an alpha version of <a href="https://pyrefly.org/" target="_blank" rel="noopener">Pyrefly</a>, an open source Python type checker and IDE extension crafted in <a href="https://engineering.fb.com/2021/04/29/developer-tools/rust/" target="_blank" rel="noopener">Rust</a>. Pyrefly is a static typechecker that analyzes Python code to ensure type consistency and help you catch errors throughout your codebase before your code runs. It also supports IDE integration and CLI usage to give you flexibility in how you incorporate it into your workflow. </p>
<p>The open source community is the backbone of the Python language. We are eager to collaborate on Pyrefly with the community and improve Python’s type system and the many libraries that we all rely on.  </p>
<h2>Get started</h2>
<p>Ready to dive in? <a href="https://pyrefly.org" target="_blank" rel="noopener">The official Pyrefly website</a> has all the details, but to quickly get started:</p>
<ul><li class="c1" aria-level="1"><a href="https://pyrefly.org/en/docs/installation/" target="_blank" rel="noopener">Install</a> Pyrefly on the command line: <code>pip install pyrefly</code>.</li>
<li class="c1" aria-level="1"><a href="https://pyrefly.org/en/docs/migrating-to-pyrefly/" target="_blank" rel="noopener">Migrate your existing type checker configuration to Pyrefly</a>.</li>
<li class="c1" aria-level="1">Enhance Your IDE: Download the <a href="https://marketplace.visualstudio.com/items?itemName=meta.pyrefly" target="_blank" rel="noopener">Pyrefly extension for VSCode</a> and enjoy a lightning fast IDE experience from starter projects to monorepos.</li>
<li class="c1" aria-level="1">Leave feedback for us on <a href="https://github.com/facebook/pyrefly/issues" target="_blank" rel="noopener">GitHub</a>.</li>
</ul><h2>Why we built Pyrefly</h2>
<p>Back in 2017, we embarked on a mission to create a type checker that could handle <a href="https://instagram-engineering.com/web-service-efficiency-at-instagram-with-python-4976d078e366" target="_blank" rel="noopener">Instagram’s massive codebase</a> of typed Python. This mission led to the birth of the <a href="https://github.com/facebook/pyre-check" target="_blank" rel="noopener">Pyre</a> type checker, inspired by the robust designs of <a href="https://hacklang.org/" target="_blank" rel="noopener">Hack</a> and <a href="https://flow.org/">Flow</a>, and written in OCaml to deliver scalable performance. </p>
<p>Over the years, Pyre served us well, but as the type system evolved and the need for typechecking to drive a responsive IDE experience emerged, it was clear that we needed to take a new approach. We explored alternate solutions and leveraged community tools like <a href="https://github.com/Microsoft/pyright" target="_blank" rel="noopener">Pyright</a> for code navigation. But the need for an extensible type checker that can bring code navigation, checking at scale, and exporting types to other services drove us to start over, creating Pyrefly. </p>
<h2>The principles behind Pyrefly</h2>
<p>Today, we’re excited to unveil Pyrefly, a project <a href="https://github.com/facebook/pyrefly" target="_blank" rel="noopener">we’ve been developing openly on</a> GitHub. We invite you to explore our work and try it out on your own project. While a project like Pyrefly is the sum of thousands of technical choices, a few notable principles we’ve followed are:</p>
<h3>Performance</h3>
<p>We want to shift checks that used to run later in CI to every single keystroke. That requires checking code at speed (on large codebases we can check 1.8 million lines of code per second!) and careful attention to incrementality and updates. Pyrefly is implemented in Rust and designed for high performance on codebases of all sizes.</p>
<h3>IDE first</h3>
<p>We want the IDE and command line to share a consistent view of the world, which means crafting abstractions that capture the differences without incurring unnecessary costs. Designing these abstractions from the beginning is much easier than retrofitting them, which we tried with Pyre.</p>
<h3>Inference</h3>
<p>Some <a href="https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/" target="_blank" rel="noopener">Python programs are typed</a>, but many aren’t. We want users to benefit from types even if they haven’t annotated their code – so we automatically infer types for returns and local variables and display them in the IDE. What’s more, in the IDE you can even double-click to insert these inferred types if you think that would make the program better.</p>
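<p>As a small illustration of the kind of inference described above (the function is our own example, not from Pyrefly’s documentation):</p>

```python
# No annotations on `parse_port`, but a checker like Pyrefly can infer
# the return type (here `int`) from the body and surface it in the IDE;
# an inconsistent caller is then flagged before the code ever runs.
def parse_port(raw):
    return int(raw.strip())

port = parse_port(" 8080 ")   # inferred type: int
print(port + 1)               # OK: int + int
# -> 8081

# A type checker would reject this line, since `port` is an int:
# port.upper()
```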
<h3>Open source</h3>
<p>Python is open source, and hugely popular. The <a href="https://typing.python.org/en/latest/spec/" target="_blank" rel="noopener">Python typing specification</a> is open source, which made Pyrefly vastly easier to develop. Many of the libraries Meta contributes to are open source (e.g., <a href="https://pytorch.org/" target="_blank" rel="noopener">PyTorch</a>).</p>
<p>Pyrefly is also open source, <a href="https://github.com/facebook/pyrefly/" target="_blank" rel="noopener">available on GitHub</a> under the <a href="https://github.com/facebook/pyrefly/blob/main/LICENSE" target="_blank" rel="noopener">MIT license</a>, and we encourage <a href="https://github.com/facebook/pyrefly/pulls" target="_blank" rel="noopener">pull requests</a> and <a href="https://github.com/facebook/pyrefly/issues" target="_blank" rel="noopener">issue reports</a>. We also have a <a href="https://discord.gg/Cf7mFQtW7W" target="_blank" rel="noopener">Discord channel</a> for more free flowing discussions. We would love to build a community around Pyrefly.</p>
<h2>The future of Pyrefly</h2>
<p>We will work with the Python community to drive the language forward and improve the developer experience. Since the beginning of Pyre, we have open-sourced our code and contributed a number of PEPs alongside the community of type checker maintainers. We feel we can do more with Pyrefly to bring the benefits of types to developers, library authors, and folks just learning the language. </p>
<p>Meta has leveraged types in dynamic languages from the beginning and knows the significant benefits they bring to developer productivity and security. We plan to share more of our learnings and tooling through <a href="https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/" target="_blank" rel="noopener">blogs</a>, better types in the ecosystem, and language enhancements. </p>
<p>Today we’re releasing Pyrefly as an alpha. At the same time, we’re busy burning down the long tail of bugs and features, aiming to remove the alpha label this summer. Your feedback is invaluable to get there, so please give it a try and <a href="https://github.com/facebook/pyrefly/issues" target="_blank" rel="noopener">report your bugs</a> or things you think can be improved. Even if Pyrefly isn’t right for your project, we would love to hear how you use types and what you would like to see improved in your editor.</p>
<p>Join us on the journey as we help illuminate your bugs with Pyrefly. Happy coding! 🐍✨</p>
<h2>Hear more about Pyrefly </h2>
<p>Check out the <a href="https://engineering.fb.com/2025/05/15/developer-tools/open-sourcing-pyrefly-a-faster-python-type-checker-written-in-rust" target="_blank" rel="noopener">episode of the Meta Tech Podcast</a> where several team members share their experience developing Pyrefly and technical details for how it works. We also just <a href="https://us.pycon.org/2025/schedule/presentation/118/" target="_blank" rel="noopener">talked at PyCon US</a> about high-performance Python through faster type checking and free threaded execution.</p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">open source site</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>
<h2>Acknowledgements </h2>
<p><em>Pyrefly was created by Meta’s Python Language Tooling Team: Jia Chen, Rebecca Chen, Sam Goldman, David Luo, Kyle Into, Zeina Migeed, Neil Mitchell, Maggie Moss, Conner Nilsen, Aaron Pollack, Teddy Sudol, Steven Troxler, Lucian Wischik, Danny Yang, and Sam Zhou.</em></p>
      <link>https://engineering.fb.com/2025/05/15/developer-tools/introducing-pyrefly-a-new-type-checker-and-ide-experience-for-python/</link>
      <guid>https://engineering.fb.com/2025/05/15/developer-tools/introducing-pyrefly-a-new-type-checker-and-ide-experience-for-python/</guid>
      <pubDate>Thu, 15 May 2025 20:30:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Accelerating GPU indexes in Faiss with NVIDIA cuVS]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Meta and NVIDIA collaborated to accelerate vector search on GPUs by integrating <a href="https://github.com/rapidsai/cuvs" target="_blank" rel="noopener">NVIDIA cuVS</a> into <a href="https://github.com/facebookresearch/faiss/releases/tag/v1.10.0" target="_blank" rel="noopener">Faiss v1.10</a>, Meta’s open source library for similarity search.</li>
<li class="c1" aria-level="1">This new implementation of cuVS will be more performant than classic GPU-accelerated search in some areas.</li>
<li class="c1" aria-level="1">For inverted file (IVF) indexing, NVIDIA cuVS outperforms classical GPU-accelerated IVF build times by up to 4.7x; and search latency is reduced by as much as 8.1x.</li>
<li class="c1" aria-level="1">For graph indexing, CUDA ANN Graph (CAGRA) outperforms CPU Hierarchical Navigable Small World graphs (HNSW) build times by up to 12.3x; and search latency is reduced by as much as 4.7x.</li>
</ul><h1>The Faiss library</h1>
<p>The <a href="https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/" target="_blank" rel="noopener">Faiss library</a> is an open source library, developed by Meta FAIR, for efficient vector search and clustering of dense vectors. Faiss pioneered vector search on GPUs, as well as the ability to seamlessly switch between GPUs and CPUs. It has made a lasting impact in both research and industry, being used as an integrated library in several databases (e.g., Milvus and OpenSearch), machine learning libraries, data processing libraries, and AI workflows. Faiss is also used heavily by researchers and data scientists as a standalone library, often <a href="https://github.com/facebookresearch/faiss/pull/1484" target="_blank" rel="noopener">paired with PyTorch</a>. </p>
<h1>Collaboration with NVIDIA</h1>
<p>Three years ago, Meta and NVIDIA worked together to enhance the capabilities of vector search technology and to accelerate vector search on GPUs. Previously, in 2016, Meta had incorporated high-performing vector search algorithms made for NVIDIA GPUs: GpuIndexFlat, GpuIndexIVFFlat, and GpuIndexIVFPQ. After the partnership, NVIDIA rapidly contributed <a href="https://arxiv.org/abs/2308.15136" target="_blank" rel="noopener">GpuIndexCagra</a>, a state-of-the-art graph-based index designed specifically for GPUs. In its latest release, <a href="https://github.com/facebookresearch/faiss/releases/tag/v1.10.0" target="_blank" rel="noopener">Faiss 1.10.0</a> officially includes these algorithms from the <a href="https://github.com/rapidsai/cuvs" target="_blank" rel="noopener">NVIDIA cuVS library</a>. </p>
<p>Faiss 1.10.0 also includes a <a href="https://anaconda.org/pytorch/faiss-gpu-cuvs" target="_blank" rel="noopener">new conda package</a> that unlocks the ability to choose between the classic Faiss GPU implementations and the newer <a href="https://github.com/facebookresearch/faiss/wiki/GPU-Faiss-with-cuVS-usage" target="_blank" rel="noopener">NVIDIA cuVS algorithms</a>, making it easy for users to switch between GPU and CPU.</p>
<h1>Benchmarking</h1>
<p>The following benchmarks were conducted using the <a href="https://docs.rapids.ai/api/cuvs/nightly/cuvs_bench/" target="_blank" rel="noopener">cuVS-bench</a> tool. </p>
<p>We measured:</p>
<ul><li class="c1" aria-level="1">A tall, slender image dataset: A subset of 100 million 96-dimensional vectors from the <a href="https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search" target="_blank" rel="noopener">Deep1B</a> dataset.</li>
<li class="c1" aria-level="1">A short, wide dataset of text embeddings: <a href="https://github.com/zilliztech/VectorDBBench?tab=readme-ov-file#benchmark-cases" target="_blank" rel="noopener">5 million vector embeddings,</a> curated using the <a href="https://openai.com/index/new-and-improved-embedding-model/" target="_blank" rel="noopener">OpenAI text-embedding-ada-002 model</a>.</li>
</ul><p>Tests for index build times and search latency were conducted on an <a href="https://www.nvidia.com/en-us/data-center/h100/" target="_blank" rel="noopener">NVIDIA H100 GPU</a> and compared to an Intel Xeon Platinum 8480CL system. Results are reported in the tables below at 95% recall along the <a href="https://docs.rapids.ai/api/cuvs/nightly/comparing_indexes/" target="_blank" rel="noopener">Pareto frontiers</a> for k=10 nearest neighbors.</p>
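<p>For readers unfamiliar with the metric, recall@k is the fraction of the true k nearest neighbors that the approximate index returns. A NumPy-only sketch of the computation (our illustration of the metric, not the cuVS-bench tool itself):</p>

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int) -> float:
    # Fraction of the true k nearest neighbors that the approximate
    # index returned, averaged over all queries.
    hits = sum(len(np.intersect1d(a[:k], e[:k]))
               for a, e in zip(approx_ids, exact_ids))
    return hits / (approx_ids.shape[0] * k)

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 16)).astype(np.float32)
queries = rng.standard_normal((10, 16)).astype(np.float32)

# Exact k-NN by brute force: the ground truth an ANN index is scored against.
d2 = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(-1)
exact = np.argsort(d2, axis=1)[:, :10]

# A perfect index scores 1.0 against itself.
print(recall_at_k(exact, exact, k=10))
# -> 1.0
```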
<h2>Build time (95% recall@10)</h2>
<table class="c5" border="1" style="width: 769px;"><tbody><tr><td class="c2" colspan="2"><strong>Index</strong></td>
<td class="c4" colspan="2">
<p class="c3"><strong>Embeddings</strong>100M x 96<strong>(seconds)</strong></p>
</td>
<td class="c4" colspan="2">
<p class="c3"><strong>Embeddings</strong>5M x 1536<strong>(seconds)</strong></p>
</td>
</tr><tr><td><strong>Faiss Classic</strong></td>
<td><strong>Faiss cuVS</strong></td>
<td><strong>Faiss Classic</strong></td>
<td><strong>  Faiss cuVS</strong></td>
<td><strong>Faiss Classic</strong></td>
<td><strong>Faiss cuVS</strong></td>
</tr><tr><td>IVF Flat</td>
<td>IVF Flat</td>
<td>101.4</td>
<td><strong>37.9</strong> (2.7x)</td>
<td>24.4</td>
<td><strong>15.2</strong> (1.6x)</td>
</tr><tr><td>IVF PQ</td>
<td>IVF PQ</td>
<td>168.2</td>
<td><strong>72.7</strong> (2.3x)</td>
<td>42.0</td>
<td><strong>9.0</strong> (4.7x)</td>
</tr><tr><td>HNSW (CPU)</td>
<td>CAGRA</td>
<td>3322.1</td>
<td><strong>518.5</strong> (6.4x)</td>
<td>1106.1</td>
<td><strong>89.7</strong> (12.3x)</td>
</tr></tbody></table><p><em>Table 1: Index build times for Faiss-classic and Faiss-cuVS in seconds (with NVIDIA cuVS speedups in parentheses).</em></p>
<h3>Search latency (95% recall@10)</h3>
<table class="c6" style="width: 763px;"><tbody><tr><td class="c2" colspan="2"><strong>Index</strong></td>
<td class="c4" colspan="2">
<p class="c3"><strong>Embeddings</strong>100M x 96<strong>(milliseconds)</strong></p>
</td>
<td class="c4" colspan="2">
<p class="c3"><strong>Embeddings</strong>5M x 1536<strong>(milliseconds)</strong></p>
</td>
</tr><tr><td><strong>Faiss Classic</strong></td>
<td><strong>Faiss cuVS</strong></td>
<td><strong>Faiss Classic</strong></td>
<td><strong>Faiss cuVS</strong></td>
<td><strong>Faiss Classic</strong></td>
<td><strong>Faiss cuVS</strong></td>
</tr><tr><td>IVF Flat</td>
<td>IVF Flat</td>
<td>0.75</td>
<td><strong>0.39</strong> (1.9x)</td>
<td>1.98</td>
<td><strong>1.14</strong> (1.7x)</td>
</tr><tr><td>IVF PQ</td>
<td>IVF PQ</td>
<td>0.49</td>
<td><strong>0.17</strong> (2.9x)</td>
<td>1.78</td>
<td><strong>0.22</strong> (8.1x)</td>
</tr><tr><td>HNSW (CPU)</td>
<td>CAGRA</td>
<td>0.56</td>
<td><strong>0.23</strong> (2.4x)</td>
<td>0.71</td>
<td><strong>0.15</strong> (4.7x)</td>
</tr></tbody></table><p><em>Table 2: Online (i.e., one at a time) search query latency for Faiss-classic and Faiss-cuVS in milliseconds (with NVIDIA cuVS speedups in parentheses).</em></p>
<h2>Looking forward</h2>
<p>The emergence of state-of-the-art NVIDIA GPUs has revolutionized the field of vector search, enabling high recall and lightning-fast search speeds. The integration of Faiss and cuVS will continue to incorporate state-of-the-art algorithms, and we look forward to unlocking new innovations in this partnership between Meta and NVIDIA. </p>
<p>Read here for <a href="https://developer.nvidia.com/cuvs" target="_blank" rel="noopener">more details about NVIDIA cuVS</a>.</p>]]></description>
      <link>https://engineering.fb.com/2025/05/08/data-infrastructure/accelerating-gpu-indexes-in-faiss-with-nvidia-cuvs/</link>
      <guid>https://engineering.fb.com/2025/05/08/data-infrastructure/accelerating-gpu-indexes-in-faiss-with-nvidia-cuvs/</guid>
      <pubDate>Thu, 08 May 2025 19:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Enhancing the Python ecosystem with type checking and free threading]]></title>
<description><![CDATA[<p><em>Meta and Quansight have improved key libraries in the Python ecosystem. There is plenty more to do, and we invite the community to help with our efforts. </em></p>
<p>We’ll look at two key efforts in Python’s packaging ecosystem to make packages faster and easier to use:</p>
<ul><li class="c1" aria-level="1">🚀 Unlock performance wins for developers through free-threaded Python – where we leverage Python 3.13’s support for concurrent programming (made possible by removing the Global Interpreter Lock (GIL)). </li>
<li class="c1" aria-level="1">✅ Increase developer velocity in the IDE with improved type annotations.</li>
</ul><h2>Enhancing typed Python in the Python scientific stack</h2>
<p>Type hints, introduced in Python 3.5 with <a href="https://peps.python.org/pep-0484/" target="_blank" rel="noopener">PEP-484</a>, allow developers to specify variable types, enhancing code understanding without affecting runtime behavior. Type-checkers validate these annotations, helping prevent bugs and improving IDE functions like autocomplete and jump-to-definition. Despite their benefits, adoption is inconsistent across the open source ecosystem, with varied approaches to specifying and maintaining type annotations.</p>
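<p>A minimal sketch of what those annotations look like in practice (the function and names are hypothetical):</p>
<pre class="line-numbers"><code class="language-python">from typing import Optional

def find_user(users: dict[str, int], name: str) -> Optional[int]:
    """Return the user's id, or None if the name is unknown."""
    return users.get(name)

ids = {"ada": 1, "alan": 2}
print(find_user(ids, "ada"))  # 1
# A type checker flags find_user(ids, 42) before the code ever runs,
# and an IDE can use the annotations for autocomplete and navigation;
# at runtime the annotations have no effect.
</code></pre>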
<p>The landscape of open source software is fractured with respect to how type annotations are specified, maintained, and distributed to end users. Some projects have in-line annotations (types declared directly in the source code), others keep types in stub files, and many projects have no types at all, relying on third-party repositories such as the <a href="https://github.com/python/typeshed" target="_blank" rel="noopener">typeshed</a> to provide community-maintained stubs. Each approach has its own pros and cons, but application and maintenance of them <a href="https://discuss.python.org/t/prevalence-staleness-of-stubs-packages-in-pypi/70457" target="_blank" rel="noopener">has been inconsistent</a>.</p>
<p>Meta and Quansight are addressing this inconsistency through:</p>
<ol><li class="c1" aria-level="1"><strong>Direct contributions:</strong> We have improved the type coverage for pandas-stubs and numpy, and are eager to expand the effort to more packages. </li>
<li class="c1" aria-level="1"><strong>Community engagement:</strong> Promoting type annotation efforts to encourage community involvement, listen to feedback and create actionable ways to improve the ecosystem. </li>
<li class="c1" aria-level="1"><strong>Tooling and automation:</strong> Developing tools to address common challenges adding types and keeping the types up-to-date with the source code.</li>
</ol><h2>Improved type annotations in pandas</h2>
<p>TL;DR: <em>Pandas is the second most downloaded package from the Python scientific stack. We improved</em> <a href="https://github.com/pandas-dev/pandas-stubs/" target="_blank" rel="noopener"><em>pandas-stubs</em></a> <em>package type annotation coverage from 36% to over 50%.</em></p>
<h3>Background</h3>
<p>The pandas community maintains its own stubs in a separate repository, which must be installed to obtain type annotations. Although these stubs are checked separately from the source code, they allow the community to use types with their own type checkers and IDEs. </p>
<h3>Improving type coverage</h3>
<p>When we began our work in pandas-stubs, coverage was around 36%, as measured by the percentage of parameters, returns, and attributes that had a complete type annotation (the annotation is present and all generics have type arguments). After several weeks of work and about 30 PRs, type completeness is now measured at over 50%. The majority of our contributions involved adding annotations to previously-untyped parameters, adding type arguments to raw generic types, and removing deprecated/undocumented interfaces. We also improved several inaccurate annotations and updated others to match the inline annotations in the pandas source code. </p>
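<p>The flavor of these changes can be sketched with a toy generic (a hypothetical stand-in, not actual pandas-stubs code). Here a “complete” annotation is one that is present and whose generics all carry type arguments:</p>
<pre class="line-numbers"><code class="language-python">from typing import Generic, TypeVar

T = TypeVar("T")

class Series(Generic[T]):
    """Toy stand-in for a pandas Series."""
    def __init__(self, values: list[T]) -> None:
        self.values = values

    def first(self) -> T:
        return self.values[0]

# Incomplete: the raw generic Series has no type argument, so checkers
# infer first() as returning an unknown type.
def head_raw(s: Series) -> object:
    return s.first()

# Complete: every generic carries its type argument, so checkers know
# head(s) is an int and can flag misuse at call sites.
def head(s: Series[int]) -> int:
    return s.first()

print(head(Series([10, 20])))  # 10
</code></pre>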
<h3>Key introductions</h3>
<p>Two key introductions significantly increased coverage:</p>
<ul><li class="c1" aria-level="1">Replacing raw Series types with UnknownSeries, a new type aliased to Series[Any]. When applied to return type annotations, this reduces the number of type checker false-positives when the function is called.</li>
<li class="c1" aria-level="1">Improving types of core Dataframe operations like insert, combine, replace, transpose, and assign, as well as many timestamp and time-zone related APIs.</li>
</ul><h3>Tooling development</h3>
<p>In addition to improving coverage directly, we developed tooling to catalog public interfaces missing annotations. We also augmented our tools for measuring type coverage to handle the situation where stubs are distributed independently, rather than being packaged into the core library wheel.</p>
<h2>What is free-threaded Python?</h2>
<p>Free-threaded Python (FTP) is an experimental build of CPython that allows multiple threads to interact with the VM in parallel. Previously, access to the VM required holding the global interpreter lock (GIL), thereby serializing execution of concurrently running threads. With the GIL becoming optional, developers will be able to take full advantage of multi-core processors and write truly parallel code.</p>
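<p>A minimal sketch of the kind of workload this unlocks: pure-Python, CPU-bound threads. On a standard build the GIL serializes them; on a free-threaded build they can run on separate cores.</p>
<pre class="line-numbers"><code class="language-python">import threading
import time

def busy_count(n: int) -> None:
    # CPU-bound loop that never waits on I/O, so under the GIL these
    # threads merely interleave instead of running in parallel.
    while n:
        n -= 1

threads = [threading.Thread(target=busy_count, args=(2_000_000,))
           for _ in range(4)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
# On a free-threaded build the elapsed time approaches a single
# thread's time; on a GIL build it is roughly four times that.
print(f"elapsed: {time.perf_counter() - start:.2f}s")
</code></pre>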
<h3>Benefits of free-threaded Python</h3>
<p>The benefits of free-threaded Python are numerous:</p>
<ul><li class="c1" aria-level="1"><strong>True parallelism in a single process</strong>: With the GIL removed, developers can write Python code that takes full advantage of multi-core processors without needing to use multiple processes. CPU-bound code can execute in parallel across multiple cores.</li>
<li class="c1" aria-level="1"><strong>Improved performance:</strong> By allowing multiple threads to execute Python code simultaneously, work can be effectively distributed across multiple threads inside a single process.</li>
<li class="c1" aria-level="1"><strong>Simplified concurrency:</strong> Free-threading provides developers with a more ergonomic way to write parallel programs in Python. Gone are the days of needing to use multiprocessing.Pool and/or resorting to custom shared memory data structures to efficiently share data between worker processes.</li>
</ul><h3>Getting Python’s ecosystem ready for FTP</h3>
<p>The ecosystem of Python packages must work well with free-threaded Python in order for it to be practically useful; application owners can’t use free-threading unless their dependencies work well with it. To that end, we have been taking a “bottom-up” approach to tackle the most difficult/popular packages in the ecosystem. <a href="https://py-free-threading.github.io/tracking/" target="_blank" rel="noopener">We’ve added free-threading support</a> to many of the most popular packages used for scientific computing (e.g. numpy, scipy, scikit-learn) and language bindings (e.g. Cython, nanobind, pybind, PyO3).</p>
<h2>Just getting started</h2>
<p>Together, we made substantial progress in improving type annotations and free-threading compatibility in Python libraries. We couldn’t have done it without the Python community and are asking others to join our efforts.  Whether it’s <a href="https://discuss.python.org/t/call-for-suggestions-nominate-python-packages-for-typing-improvements/80186" target="_blank" rel="noopener">further updates to the type annotations</a> or <a href="https://py-free-threading.github.io/porting/" target="_blank" rel="noopener">preparing your code for FTP</a>, we value your help moving the Python ecosystem forward!</p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">open source site</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a> and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>]]></description>
      <link>https://engineering.fb.com/2025/05/05/developer-tools/enhancing-the-python-ecosystem-with-type-checking-and-free-threading/</link>
      <guid>https://engineering.fb.com/2025/05/05/developer-tools/enhancing-the-python-ecosystem-with-type-checking-and-free-threading/</guid>
      <pubDate>Mon, 05 May 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Taking the plunge: Why Meta is laying the world’s longest subsea cable]]></title>
      <description><![CDATA[<p>Meta develops infrastructure all across the globe to transport information and content for the billions of people using our services around the world. At the core of this infrastructure are aggregation points – like data centers – and the digital cables that connect them. Subsea cables – the unseen digital highways of the internet – are critical for Meta to serve people wherever they are in the world. In fact, more than 95% of the world’s intercontinental traffic goes through subsea cables. </p>
<p>Meta’s engineering team prioritizes both innovation and quality when designing and deploying these cables. In the latest Meta Tech Podcast, Andy Palmer-Felgate and Pascal Pecci, both subsea cable systems engineers, join <a href="https://www.threads.net/@passy_">Pascal Hartig</a> to discuss the latest in subsea engineering technology. This episode dives deeper into the engineering nuances of large-scale subsea cable projects like the recently announced <a href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Project Waterworth</a>. </p>
<p>Learn more about Meta’s work on these engineering feats. Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/36358920/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>The <a href="https://insidefacebookmobile.libsyn.com/">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod">Instagram</a>, <a href="https://threads.net/@metatechpod">Threads</a>, or <a href="https://twitter.com/metatechpod">X</a>. And if you’re interested in learning more about career opportunities at Meta, visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/05/01/connectivity/taking-the-plunge-why-meta-is-laying-the-worlds-longest-subsea-cable/</link>
      <guid>https://engineering.fb.com/2025/05/01/connectivity/taking-the-plunge-why-meta-is-laying-the-worlds-longest-subsea-cable/</guid>
      <pubDate>Thu, 01 May 2025 20:48:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Taking the plunge: The engineering journey of building a subsea cable]]></title>
      <description><![CDATA[<p>Meta develops infrastructure all across the globe to transport information and content for the billions of people using our services around the world. At the core of this infrastructure are aggregation points – like data centers – and the digital cables that connect them. Subsea cables – the unseen digital highways of the internet – are critical for Meta to serve people wherever they are in the world. In fact, more than 95% of the world’s intercontinental traffic goes through subsea cables. </p>
<p>Meta’s engineering team prioritizes both innovation and quality when designing and deploying these cables. In the latest Meta Tech Podcast, Andy Palmer-Felgate and Pascal Pecci, both subsea cable systems engineers, join <a href="https://www.threads.net/@passy_">Pascal Hartig</a> to discuss the latest in subsea engineering technology. This episode dives deeper into the engineering nuances of large-scale subsea cable projects like the recently announced <a href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Project Waterworth</a>. </p>
<p>Learn more about Meta’s work on these engineering feats. Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/36358920/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>The <a href="https://insidefacebookmobile.libsyn.com/">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod">Instagram</a>, <a href="https://threads.net/@metatechpod">Threads</a>, or <a href="https://twitter.com/metatechpod">X</a>. And if you’re interested in learning more about career opportunities at Meta, visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/05/01/connectivity/taking-the-plunge-the-engineering-journey-of-building-a-subsea-cable/</link>
      <guid>https://engineering.fb.com/2025/05/01/connectivity/taking-the-plunge-the-engineering-journey-of-building-a-subsea-cable/</guid>
      <pubDate>Thu, 01 May 2025 20:48:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Introducing AutoPatchBench: A Benchmark for AI-Powered Security Fixes]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We are introducing AutoPatchBench, a benchmark for the automated repair of vulnerabilities identified through fuzzing.</li>
<li class="c1" aria-level="1">By providing a standardized benchmark, AutoPatchBench enables researchers and practitioners to objectively evaluate and compare the effectiveness of various AI program repair systems. </li>
<li class="c1" aria-level="1">This initiative facilitates the development of more robust security solutions, and also encourages collaboration within the community to address the critical challenge of software vulnerability repair.</li>
<li class="c1" aria-level="1">AutoPatchBench is available now on <a href="https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks" target="_blank" rel="noopener">GitHub.</a></li>
</ul><p>AI is increasingly being applied to solve security challenges, including repairing vulnerabilities identified through fuzzing. However, the lack of a standardized benchmark for objectively assessing AI-driven bug repair agents specific to fuzzing has impeded progress in academia and the broader community. Today, we are publicly releasing AutoPatchBench, a benchmark designed to evaluate AI program repair systems. AutoPatchBench sits within <a href="https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks" target="_blank" rel="noopener">CyberSecEval 4</a>, Meta’s new benchmark suite for evaluating AI capabilities to support defensive use cases. It features 136 fuzzing-identified C/C++ vulnerabilities in real-world code repos along with verified fixes sourced from the <a href="https://arxiv.org/abs/2408.02153" target="_blank" rel="noopener">ARVO dataset</a>. </p>
<p>AutoPatchBench provides a standardized evaluation framework for assessing the effectiveness of AI-assisted vulnerability repair tools. This benchmark aims to facilitate a comprehensive understanding of the capabilities and limitations of various AI-driven approaches to repairing fuzzing-found bugs. By offering a consistent set of evaluation criteria, AutoPatchBench fosters transparency and reproducibility in research, enabling both academic and industry professionals to identify best practices and areas for improvement.</p>
<h2>Fixing fuzzing-found vulnerabilities with AI</h2>
<p>Fuzzing is a cornerstone in automated testing, renowned for its effectiveness in uncovering security vulnerabilities. By bombarding a target program with vast amounts of pseudo-random input data, fuzz testing exposes critical security and reliability issues, such as memory corruption, invalid pointer dereference, integer overflow, and parsing errors. </p>
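<p>At its core the technique is a simple loop: generate pseudo-random inputs, run the target, and save any input that triggers a failure. A toy sketch (the target function is hypothetical):</p>
<pre class="line-numbers"><code class="language-python">import random

def parse_length_prefixed(data: bytes) -> bytes:
    """Toy target: the first byte declares the payload length."""
    n = data[0]
    payload = data[1:1 + n]
    if len(payload) != n:
        raise ValueError("truncated payload")   # the "crash"
    return payload

def fuzz(iterations: int = 10_000, seed: int = 0) -> list:
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, 8)))
        try:
            parse_length_prefixed(data)
        except ValueError:
            crashes.append(data)   # keep the crashing input for triage
    return crashes

print(f"found {len(fuzz())} crashing inputs")
</code></pre>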
<p>However, resolving a fuzzing crash is often a labor-intensive task, demanding intricate debugging and thorough code review to pinpoint and rectify the underlying cause. This process can be both time-consuming and resource-intensive. Unlike regular test failures, fuzzing bugs frequently reveal security vulnerabilities that pose severe threats to system integrity and user data. Given these stakes, automating the repair of fuzzing bugs with AI becomes not just advantageous but essential. AI’s ability to swiftly analyze patterns and propose solutions significantly reduces the time and effort required for repairs, making it an invaluable ally in safeguarding our digital environments.</p>
<p>Let’s explore the process of addressing bugs identified through fuzzing by examining a demonstrative example. Consider the following C function, which harbors a read/write buffer overflow vulnerability:</p>
<pre class="line-numbers"><code class="language-cpp">#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
void process_input(const char *input) {
    char buffer[8];
    strcpy(buffer, input); // Potential buffer overflow
    printf("Processed: %s\n", buffer);
}
</code></pre>
<p>In this scenario, a fuzzing harness might supply an input that surpasses the buffer’s capacity, leading to a crash due to buffer overflow. A typical stack trace from such a crash might appear as follows:</p>
<pre class="line-numbers"><code class="language-none">== Fuzzer Crash Report ==
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7af1223 in strcpy () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff7af1223 in strcpy ()
#1  0x0000555555555140 in process_input (input=0x7fffffffe695 "AAAAAA...")
#2  0x0000555555555162 in main (argc=2, argv=0x7fffffffe5f8)</code></pre>
<p>Here, the process_input function invokes strcpy on a string that exceeds the eight-character buffer, causing a segmentation fault. A straightforward patch involves ensuring the copy operation remains within the buffer’s limits. This can be achieved by using a bounded copy function like strncpy or implementing a length check before copying:</p>
<pre class="line-numbers"><code class="language-cpp">void process_input(const char *input) {
    char buffer[8];
    strncpy(buffer, input, sizeof(buffer) - 1);
    buffer[sizeof(buffer) - 1] = '\0';
    printf("Processed: %s\n", buffer);
}
</code></pre>
<p>This patch ensures that the string remains within the buffer’s limits, effectively preventing out-of-bounds writes. Its correctness can be confirmed by verifying that the fuzzing input, which previously caused the crash, no longer does so. Additional checks can be conducted to ensure the patch doesn’t introduce any unintended side effects.</p>
<p>As illustrated, fixing a fuzzing crash involves:</p>
<ol><li class="c1" aria-level="1">Analyzing the crash stack trace and the target code. </li>
<li class="c1" aria-level="1">Pinpointing the root cause. </li>
<li class="c1" aria-level="1">Patching the vulnerable code. </li>
<li class="c1" aria-level="1">Verifying the fix’s accuracy. </li>
</ol><p>An AI-based solution can automate these steps by utilizing an LLM’s capability to understand and generate code.</p>
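<p>Those four steps can be sketched as a loop. Everything below is a stand-in: the “program” is a string, the C buffer overflow is simulated with a fixed-size memoryview, and ask_llm returns a hard-coded fix where a real agent would call a model with the stack trace and source.</p>
<pre class="line-numbers"><code class="language-python">def reproduces_crash(source: str, data: bytes) -> bool:
    """Step 4: replay the crashing input against a candidate program."""
    buf = bytearray(8)                       # fixed-size buffer, as in C
    try:
        exec(source, {"buf": buf, "data": data})
        return False
    except ValueError:                       # overflow: simulated crash
        return True

# Steps 1-2: the "vulnerable program" is an unbounded copy into buf.
VULN = "mv = memoryview(buf); mv[:] = data"

def ask_llm(source: str) -> str:
    # Step 3: stand-in for the model call; returns a bounded copy.
    return ("mv = memoryview(buf); "
            "n = min(len(data), len(buf)); mv[:n] = data[:n]")

crashing_input = b"A" * 32                   # input found by the fuzzer
assert reproduces_crash(VULN, crashing_input)
patched = ask_llm(VULN)
print("crash fixed:", not reproduces_crash(patched, crashing_input))
# crash fixed: True
</code></pre>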
<h2>Why we developed AutoPatchBench</h2>
<p>AutoPatchBench is informed by key advancements in the field of AI-driven program repair, particularly those focusing on fuzzing-found vulnerabilities. Among the notable contributions is Google’s tech report on <a href="https://research.google/pubs/ai-powered-patching-the-future-of-automated-vulnerability-fixes/" target="_blank" rel="noopener">AI-powered patching</a>, which pioneered the use of LLMs for addressing fuzzing crashes, achieving a 15% fix rate with their proprietary dataset. Subsequently, <a href="https://arxiv.org/abs/2501.07531" target="_blank" rel="noopener">Google’s study on generic program repair agents</a> introduced the GITS-Eval benchmark, encompassing 178 bugs across various programming languages. </p>
<p>In the realm of AI software engineering agents, benchmarks like <a href="https://www.swebench.com/" target="_blank" rel="noopener">SWE-Bench</a> and <a href="https://openai.com/index/introducing-swe-bench-verified/" target="_blank" rel="noopener">SWE-Bench Verified</a> have gained widespread acceptance for evaluating generic AI SWE agents. However, these benchmarks do not specifically tackle the unique challenges posed by fuzzing-found vulnerabilities, which demand specialized approaches that utilize fuzzing-specific artifacts and address security concerns. </p>
<p>AutoPatchBench addresses this gap by offering a dedicated benchmark of fuzzing-found C/C++ vulnerabilities, spanning 11 crash types, with automated verification capability. Unlike the broader focus of GITS-Eval and SWE-Bench, AutoPatchBench is specifically designed to assess the effectiveness of AI-driven tools in repairing security-critical bugs typically uncovered by fuzzing. This targeted approach enables a more precise evaluation of AI capabilities in meeting the complex requirements of fuzzing-found vulnerabilities, thereby advancing the field of AI-assisted program repair in a focused manner.</p>
<h2>Inside AutoPatchBench</h2>
<p>We’re making AutoPatchBench <a href="https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks" target="_blank" rel="noopener">publicly available</a> as part of CyberSecEval 4 to encourage community collaboration in tackling the challenge of automating fuzzing crash repairs. This benchmark is specifically designed for AI program repair agents focusing on C/C++ bugs identified through fuzzing. It includes real-world C/C++ vulnerabilities with verified fixes sourced from the <a href="https://arxiv.org/abs/2408.02153" target="_blank" rel="noopener">ARVO dataset</a>, and incorporates additional verification of AI-generated patches through fuzzing and white-box differential testing.</p>
<h3>ARVO dataset</h3>
<p>The ARVO dataset serves as the foundation for AutoPatchBench, offering a comprehensive collection of real-world vulnerabilities that are essential for advancing AI-driven security research. Sourced from C/C++ projects identified by Google’s OSS-Fuzz, ARVO includes over 5,000 reproducible vulnerabilities across more than 250 projects. Each entry is meticulously documented with a triggering input, a canonical developer-written patch, and the capability to rebuild the project in both its vulnerable and patched states. </p>
<p>However, there are notable challenges when using the ARVO dataset as a benchmark for AI patch generation:</p>
<ol><li class="c1" aria-level="1">While reproducibility is vital for a reliable benchmark, the ARVO dataset includes samples where crashes are not consistently reproducible. Some samples lack crash stack traces, making it exceedingly difficult to address the crash.</li>
<li class="c1" aria-level="1">Although ARVO provides a ground-truth fix for each identified vulnerability, it lacks an automated mechanism to verify the correctness of a generated patch. Objective automated verification is essential for a benchmark focused on patch generation.</li>
</ol><p>AutoPatchBench addresses these challenges by creating a curated subset and by employing a comprehensive and automated verification process.</p>
<h3>Selection criteria</h3>
<p>To ensure the reliability and effectiveness of AutoPatchBench, we meticulously filtered the ARVO dataset samples based on the following criteria:</p>
<ul><li class="c1" aria-level="1"><strong>Valid C/C++ vulnerability:</strong> The ground-truth fix shall edit one or more C/C++ source files that are not fuzzing harnesses.</li>
<li class="c1" aria-level="1"><strong>Dual-container setup</strong>: Each vulnerability is accompanied by two containers—one that contains vulnerable code and another for the fixed code—that build without error.</li>
<li class="c1" aria-level="1"><strong>Reproducibility</strong>: The crash must be consistently reproducible within the vulnerable container.</li>
<li class="c1" aria-level="1"><strong>Valid stack trace</strong>: A valid stack trace must be present within the vulnerable container to facilitate accurate diagnosis and repair.</li>
<li class="c1" aria-level="1"><strong>Successful compilation</strong>: The vulnerable code must compile successfully within its designated container, ensuring that the environment is correctly set up for testing.</li>
<li class="c1" aria-level="1"><strong>Fixed code verification</strong>: The fixed code must also compile successfully within its respective container, confirming that the patch does not introduce new build issues.</li>
<li class="c1" aria-level="1"><strong>Crash resolution</strong>: The crash must be verified as resolved within the fixed container, demonstrating the effectiveness of the patch.</li>
<li class="c1" aria-level="1"><strong>Fuzzing pass</strong>: The fixed code must pass a comprehensive fuzzing test without finding new crashes, ensuring that the ground-truth patch maintains the integrity and functionality of the software.</li>
</ul><p>After applying these rigorous selection criteria, we retained 136 samples for AutoPatchBench that fulfill the necessary conditions for both patch generation and verification. From this refined set, we created a down-sampled subset of 113 AutoPatchBench-Lite samples to provide a focused benchmark for testing AI patch generation tools. These subsets preserve the diversity and complexity of real-world vulnerabilities, including 11 distinct crash types, offering a solid foundation for advancing AI-driven security solutions.</p>
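<p>The filtering pipeline amounts to a conjunction of the criteria above; a sketch with hypothetical field names:</p>
<pre class="line-numbers"><code class="language-python">from dataclasses import dataclass

@dataclass
class ArvoSample:
    """One ARVO entry, with a flag per selection criterion."""
    edits_non_harness_cpp: bool     # valid C/C++ vulnerability
    has_both_containers: bool       # dual-container setup
    crash_reproducible: bool
    has_valid_stack_trace: bool
    vulnerable_code_compiles: bool
    fixed_code_compiles: bool
    crash_resolved_in_fix: bool
    passes_fuzzing: bool

def eligible(sample: ArvoSample) -> bool:
    # A sample enters the benchmark only if every criterion holds.
    return all(vars(sample).values())

good = ArvoSample(*[True] * 8)
flaky = ArvoSample(*([True] * 2 + [False] + [True] * 5))
print(eligible(good), eligible(flaky))  # True False
</code></pre>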
<h3>Patch verification</h3>
<p>In the process of patch generation, the patch generator utilizes two automated methods to verify the viability of a generated patch before submitting it for evaluation. The first method involves attempting to build the patched program, which checks for syntactic correctness. The second method involves attempting to reproduce the crash by running the input that initially triggered it. If the crash no longer occurs, it suggests that the issue has been resolved. However, these steps alone are insufficient to guarantee the correctness of the patch, as a patch might not maintain the program’s intended functionality, rendering it incorrect despite resolving the crash.</p>
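<p>Those two checks can be sketched as a staged verdict (names are hypothetical; compiles and crashes_on_poc stand in for the real build and replay steps):</p>
<pre class="line-numbers"><code class="language-python">from enum import Enum
from typing import Callable

class Verdict(Enum):
    BUILD_FAILED = "patch does not compile"
    STILL_CRASHES = "original PoC input still crashes"
    PLAUSIBLE = "builds and no longer crashes"

def verify_patch(compiles: Callable[[], bool],
                 crashes_on_poc: Callable[[], bool]) -> Verdict:
    # Check 1: syntactic correctness. Does the patched program build?
    if not compiles():
        return Verdict.BUILD_FAILED
    # Check 2: replay the input that originally triggered the crash.
    if crashes_on_poc():
        return Verdict.STILL_CRASHES
    # "Plausible" is deliberately not "correct": the patch may still
    # break intended functionality, hence the extra fuzzing and
    # differential testing layered on top.
    return Verdict.PLAUSIBLE

print(verify_patch(lambda: True, lambda: False).value)
# builds and no longer crashes
</code></pre>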
<p>To address this issue, AutoPatchBench adopts a comprehensive approach to automating the evaluation of generated patches. This involves subjecting the patched code to further fuzz testing using the original fuzzing harness that initially detected the crash. Additionally, white-box differential testing compares the runtime behavior of the patched program against the ground-truth repaired program, confirming that the patch has effectively resolved the underlying bug without altering the program’s intended functionality. Since a patch can potentially be made in multiple places, we cannot assume that the LLM will patch the same function as the ground-truth patch does. Instead, we find all the call stacks for each call to a patched function. Then we find the lowest common ancestor (LCA) across all pairs of stack traces produced by the ground-truth patch and the LLM patch. By using debug information to inspect arguments, return values, and local variables at the first function above the LCA, differential testing offers a detailed view of the patch’s impact on the program state. </p>
<p>This process evaluates whether the generated patch produces a program state identical to the ground truth program after the patched function returns. By using a diverse set of inputs obtained from fuzzing, this gives higher confidence that the bug is fixed without changing the visible behavior of the patched functions. This differential testing is implemented using a Python script that leverages LLDB APIs to dump all visible states and identify differences between the ground truth and the patched program. </p>
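<p>The LCA step itself is simple once the stacks are in hand; a sketch over call stacks listed outermost-first (frame names are hypothetical):</p>
<pre class="line-numbers"><code class="language-python">def lowest_common_ancestor(stack_a: list, stack_b: list):
    """Deepest frame shared by both stacks, walking from the outermost
    frame inward; returns None if even the entry points differ."""
    lca = None
    for a, b in zip(stack_a, stack_b):
        if a != b:
            break
        lca = a
    return lca

# The ground-truth patch touched parse_header; the LLM patched a
# different function, copy_bytes, reached along another path.
gt_stack = ["main", "run", "parse_header"]
llm_stack = ["main", "run", "parse_field", "copy_bytes"]
print(lowest_common_ancestor(gt_stack, llm_stack))  # run
</code></pre>
<p>Arguments, return values, and locals are then compared at the first frame above that common ancestor.</p>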
<p>However, as with all attempts to solve provably undecidable problems (in this case: program equivalence), there are some failure modes for this verification step. For example, sometimes the analysis fails with timeouts, in which case we consider the semantics to be preserved if both the ground truth and the LLM patch timed out. Programs might also behave non-deterministically, and we run each input three times to identify nondeterministic struct fields and values. Such fields will not be compared to avoid false alarms from noisy, random values. Additionally, we strip any fields that contain the substring “build” or “time” as we’ve observed false positives from build-ids (that happen to be deterministic within a program, but not across different patches). </p>
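<p>The noise-filtering step can be sketched as: run each input several times, keep only fields whose values are stable across runs, and drop fields whose names contain “build” or “time” (the state dumps below are hypothetical):</p>
<pre class="line-numbers"><code class="language-python">def stable_fields(runs: list) -> dict:
    """Fields identical across all runs, excluding noisy name patterns."""
    first = runs[0]
    return {
        key: value
        for key, value in first.items()
        if all(r.get(key) == value for r in runs[1:])
        and "build" not in key and "time" not in key
    }

runs = [  # three executions of the same input
    {"len": 8, "ptr": 0x1000, "build_id": "abc", "timestamp": 1},
    {"len": 8, "ptr": 0x2000, "build_id": "abc", "timestamp": 2},
    {"len": 8, "ptr": 0x3000, "build_id": "abc", "timestamp": 3},
]
print(stable_fields(runs))  # {'len': 8}
</code></pre>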
<p>It should also be noted that, for a number of examples, the crashing PoC never actually triggered the breakpoints on the ground-truth patch, making comparison of the resulting states impossible. However, our case study showed that, despite this limitation, white-box differential testing is still effective in filtering out a majority of incorrect patches.</p>
<h3>AutoPatchBench and AutoPatchBench-Lite</h3>
<p>AutoPatchBench is a comprehensive benchmark dataset of 136 samples. It encompasses a wide range of real-world vulnerabilities, providing a robust framework for assessing the capabilities of automated patch generation systems. </p>
<p>Within this benchmark, we have also created a subset called AutoPatchBench-Lite that consists of 113 samples. AutoPatchBench-Lite focuses on a simpler subset of vulnerabilities where the root cause of the crash is confined to a single function. This version is designed to cater to scenarios where the complexity of the bug is relatively low, making it more accessible for tools that are in the early stages of development or for those that specialize in handling straightforward issues.</p>
<p>The rationale for creating AutoPatchBench-Lite stems from the observation that when root causes are distributed across multiple locations within the code, the difficulty of generating a correct patch increases significantly. Addressing such “hard” crashes requires a tool to possess advanced reasoning capabilities to analyze larger codebases and apply patches to multiple areas simultaneously. This complexity not only challenges the tool’s design but also demands a higher level of sophistication in its algorithms to ensure accurate and effective patching.</p>
<p>By offering both AutoPatchBench and AutoPatchBench-Lite, we provide a tiered approach to benchmarking, allowing developers to progressively test and refine their tools. This structure supports the development of more advanced solutions capable of tackling both simple and complex vulnerabilities, ultimately contributing to the enhancement of AI-assisted bug repair techniques.</p>
<h3>Expected use cases</h3>
<p>AutoPatchBench offers significant value to a diverse range of users. Developers of auto-patch tools can leverage our open-sourced patch generator to enhance their tools and assess their effectiveness using the benchmark. Software projects employing fuzzing can incorporate our open-sourced patch generator to streamline vulnerability repair. Additionally, model developers can integrate the benchmark into their development cycles to build more robust and specialized expert models for bug repair. The verification tooling around the patch generator can also supply a reward signal for reinforcement learning during training. This data helps train models to better understand the nuances of bug repair, enabling them to learn from past fixes and improve their ability to generate accurate patches.</p>
<h2>Reference implementation</h2>
<p>We developed a basic patch generator to establish a baseline performance using AutoPatchBench. This generator is specifically designed to address simple crashes that involve patching a single function. We have <a href="https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks" target="_blank" rel="noopener">open-sourced this reference implementation</a> to encourage the community to build and expand upon it.</p>
<p>Figure 1 shows a high-level overview of its design. The patch generator takes a crash stack trace and the target source code as input. It identifies the source locations from the stack trace and extracts every function that contains those locations. It then asks the LLM to identify the root cause and repair the crash by patching one of the functions. Upon receiving a response from the LLM, the patch generator extracts the revised code, applies the patch, compiles the program, and tests it against the original input that caused the crash. If the build or test fails, we re-engage the LLM with the error message from the build or test output, requesting it to attempt a solution again until the crash is resolved. If a fix trajectory fails to reach a valid solution that passes build and crash reproduction within a finite number of steps, we start a new trajectory to reset the context window, preventing prolonged entrapment in an incorrect path.</p>
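<p>The loop described above can be sketched as follows. This is a hedged outline rather than the open-sourced implementation: <code>query_llm</code>, <code>apply_patch</code>, <code>build</code>, and <code>reproduces_crash</code> are hypothetical stand-ins for the real components, and the caps mirror the limits used in the case study.</p>

```python
MAX_STEPS = 5          # cap on the length of each fix trajectory
MAX_TRAJECTORIES = 10  # cap on fresh restarts with a reset context window

def format_prompt(stack_trace, functions):
    return "Fix this crash:\n" + stack_trace + "\n\nFunctions:\n" + "\n".join(functions)

def generate_patch(stack_trace, functions, query_llm, apply_patch, build, reproduces_crash):
    for _ in range(MAX_TRAJECTORIES):
        # Start a new trajectory with a fresh context window.
        context = [("user", format_prompt(stack_trace, functions))]
        for _ in range(MAX_STEPS):
            patch = query_llm(context)
            ok, error_log = build(apply_patch(patch))
            if ok and not reproduces_crash():
                return patch  # builds cleanly and the crash no longer reproduces
            # Feed the build/test failure back and retry within this trajectory.
            context.append(("user", "Build or test failed:\n" + error_log))
    return None  # no valid patch found within the budget
```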
<figure id="attachment_22444" aria-describedby="caption-attachment-22444" class="wp-caption alignnone c2"><img class="wp-image-22444" src="https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png?w=1024" alt="" width="600" height="519" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png 1600w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png?resize=916,792 916w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png?resize=768,664 768w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png?resize=1024,886 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png?resize=1536,1329 1536w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png?resize=96,83 96w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-1.png?resize=192,166 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22444" class="wp-caption-text">Figure 1: Patch generation flowchart.</figcaption></figure><p>The example prompt below also captures the essence of our approach. Note that the actual prompt is more detailed and divided into multiple segments to promote chain-of-thought reasoning.</p>
<pre class="line-numbers"><code class="language-none">As an experienced Security Engineer at Meta, your task is to address the following security-critical fuzzing crash. Below is the stack trace of the crash:
== Fuzzer Crash Report ==
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7af1223 in strcpy () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff7af1223 in strcpy ()
#1  0x0000555555555140 in process_input (input=0x7fffffffe695 "AAAAAA...")
#2  0x0000555555555162 in main (argc=2, argv=0x7fffffffe5f8)
Here is the source code for the functions involved in the stack trace:
strcpy() {...}
void process_input(const char *input) {
    char buffer[8];
    strcpy(buffer, input); // Potential buffer overflow
    printf("Processed: %s\n", buffer);
}
int main() {...}
Assuming the root cause of the crash is within one of these functions, generate a patched version of the faulty function to resolve the fuzzing crash. Ensure that you provide a complete rewrite of the function so that the patch can be applied and the code compiled without errors.
</code></pre>
<h2>A case study with AutoPatchBench-Lite</h2>
<p>In the case study, we demonstrate the use of AutoPatchBench by evaluating our reference patch generator with several LLM models. Given that our reference implementation is limited to addressing simple issues, we conducted our evaluation with AutoPatchBench-Lite, which contains 113 samples. To prevent fix trajectories from becoming excessively prolonged, we capped the maximum length of each trajectory at five. Additionally, we set the maximum number of retries to 10. </p>
<p><em>Please note that the case study is not intended to provide a statistically rigorous comparison of model performance. Instead, it aims to present preliminary results to establish a baseline expectation. We encourage future research to build upon these findings.</em></p>
<h3>Effectiveness of patch generation and verification</h3>
<p>We evaluated the effectiveness of the patch generator and our automated verification processes using different LLM models as back ends. The figure below illustrates the effectiveness of patch generation and verification by presenting the percentage of samples that successfully passed each sequential verification step: (1) patch validity: build and crash-reproducibility check; (2) fuzzing pass: passes 10 minutes of fuzzing; and (3) testing pass: passes white-box differential testing. It is important to note that the patch generation process only utilizes step (1) to verify the build and crash reproducibility. The fuzzing and differential testing are conducted post-generation to assess correctness.</p>
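<p>The three sequential checks can be outlined as below; each step runs only if the previous one passed. The helper names are hypothetical placeholders for the real build, fuzzing, and differential-testing harnesses.</p>

```python
def verify_patch(patch, build_and_reproduce, fuzz_ten_minutes, differential_test):
    """Run the three verification steps in order; stop at the first failure."""
    results = {"patch_validity": False, "fuzzing_pass": False, "testing_pass": False}
    if not build_and_reproduce(patch):   # (1) builds, and the crash no longer reproduces
        return results
    results["patch_validity"] = True
    if not fuzz_ten_minutes(patch):      # (2) survives 10 minutes of fuzzing
        return results
    results["fuzzing_pass"] = True
    results["testing_pass"] = differential_test(patch)  # (3) white-box differential test
    return results
```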
<figure id="attachment_22443" aria-describedby="caption-attachment-22443" class="wp-caption alignnone c3"><img class="size-large wp-image-22443" src="https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-2.png?w=1024" alt="" width="1024" height="634" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-2.png 1496w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-2.png?resize=916,567 916w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-2.png?resize=768,475 768w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-2.png?resize=1024,634 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-2.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-2.png?resize=192,119 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22443" class="wp-caption-text">Figure 2: Patch generation and verification success rate.</figcaption></figure><p>Figure 2 shows that all models achieved similar generation success rates of around 60% and similar post-verification success rates of around 5-11% with overlapping confidence intervals, and therefore, we do not draw any conclusion about their relative performance. The graph does, however, reveal that a substantial portion of the generated patches are found to be incorrect when subjected to fuzzing and white-box differential testing. For instance, Gemini 1.5 Pro achieved a 61.1% patch generation success rate, yet fewer than 15% of these patches (5.3% out of total set) were found to be correct. This gap highlights that build and crash reproduction are not good enough signals to infer the correctness of generated patches, and that future patch generation approaches should scrutinize the semantic preservation of generated patches more thoroughly. 
This gap also underscores the vital role of a comprehensive verification process that checks semantic equivalence, a distinctive contribution of AutoPatchBench.</p>
<h3>Effect of inference-time computation</h3>
<p>To assess the impact of inference-time computation on improving the patch generation success rate, we present the distribution of retry counts among the 73 patches produced by Llama 4 Maverick.</p>
<figure id="attachment_22469" aria-describedby="caption-attachment-22469" class="wp-caption alignnone c3"><img class="size-large wp-image-22469" src="https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-3-1.png?w=1024" alt="" width="1024" height="548" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-3-1.png 1386w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-3-1.png?resize=916,490 916w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-3-1.png?resize=768,411 768w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-3-1.png?resize=1024,548 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-3-1.png?resize=96,51 96w, https://engineering.fb.com/wp-content/uploads/2025/04/AutoPatchBench-image-3-1.png?resize=192,103 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22469" class="wp-caption-text">Figure 3: Percentage of generated patches per number of iterations.</figcaption></figure><p>Figure 3 shows that 44 out of 73 patches, or 60.2%, were successfully generated on the first attempt. The remaining roughly 40% of the samples required two or more iterations, with no evident plateau until the 10th iteration. This outcome demonstrates that allocating more computational resources at inference time leads to a higher success rate and suggests that increasing the number of retries could yield better results.</p>
<h3>Manual validation</h3>
<p>In our investigation of the precision and recall of white-box differential testing, we conducted a manual validation of 44 patches that passed 10-minute fuzzing against human-written ground truth fixes with the help of security experts. These patches were selected from a pool of 73 generated by Llama 4 Maverick. The following table shows the confusion matrix.</p>
<p>Table 1: Confusion matrix between human judgement and differential testing</p>
<table class="c4" border="1" style="width: 573px;"><tbody><tr><td>
</td><td>Test pass</td>
<td>Test fail</td>
<td>Sum</td>
</tr><tr><td>Human pass</td>
<td>5</td>
<td>0</td>
<td>5</td>
</tr><tr><td>Human reject</td>
<td>7</td>
<td>32</td>
<td>39</td>
</tr><tr><td>Sum</td>
<td>12</td>
<td>32</td>
<td>44</td>
</tr></tbody></table><p>The results showed that the differential testing achieved an accuracy of 84.1% for this sample ((5 + 32) / 44), indicating high overall agreement with the human assessment. However, a closer examination of the confusion matrix revealed a notable discrepancy between precision and recall. Specifically, the testing method demonstrated 100.0% recall in this case study, correctly identifying all 5 instances that humans judged as correct. In contrast, precision was relatively low (41.7%), with 7 false positives out of 12 total positive predictions. This suggests that differential testing reported success on some incorrect patches as well, highlighting the need for manual validation of patch correctness. Despite this shortcoming, the result clearly shows the utility of differential testing in automatically rejecting a substantial number of incorrect patches, which substantially reduces the manual validation effort.</p>
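<p>For reference, the headline metrics can be recomputed directly from the confusion matrix in Table 1:</p>

```python
# Cells of Table 1: rows are human judgement, columns are the test verdict.
tp = 5   # human pass,   test pass
fp = 7   # human reject, test pass (false positives)
fn = 0   # human pass,   test fail (false negatives)
tn = 32  # human reject, test fail

accuracy = (tp + tn) / (tp + fp + fn + tn)  # (5 + 32) / 44
precision = tp / (tp + fp)                  # 5 / 12
recall = tp / (tp + fn)                     # 5 / 5

print(f"accuracy={accuracy:.1%} precision={precision:.1%} recall={recall:.1%}")
# → accuracy=84.1% precision=41.7% recall=100.0%
```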
<h3>Key insights</h3>
<p>Our case study revealed several limitations of the current patch generator.</p>
<h4>The root cause may not exist in the stack trace</h4>
<p>Frequently, crashes are the result of state contamination that occurs prior to the crash being triggered. Consequently, none of the functions within the stack frames may include the code responsible for the root cause. Since our current implementation requires the LLM to assume that the root cause is located within one of the functions in the stack trace, it is unable to generate an accurate patch in such cases. Solving this problem would require a more autonomous agent that can reason about the root cause on its own, using code-browsing capabilities.</p>
<h4>Cheating</h4>
<p>In some instances, the LLM resorted to “cheating” by producing patches that superficially resolved the issue without addressing the underlying problem. This can occur when the generator modifies or removes code in a way that prevents the crash from occurring, but does not actually fix the root cause of the issue. We observed that cheating happens more frequently when we request the LLM to retry within the same trajectory. A potential solution could be to empower the LLM to say “I cannot fix it,” though this may trade off against the success rate. However, note that most of the cheating was caught in the verification step, highlighting the utility of differential testing.</p>
<h4>Need for enhanced patch verification methods</h4>
<p>Fuzzing and white-box differential testing have shown that a large majority of generated patches are incorrect when compared to the ground-truth patches. This finding highlights the challenge of generating accurate patches without enhanced verification capabilities. To address this gap, several approaches can be considered:</p>
<ul><li class="c1" aria-level="1">A patch generator could provide additional code context when querying the LLM for a patch so that LLM can better understand the consequence of a code patch.</li>
<li class="c1" aria-level="1">A patch generator could make additional LLM queries to verify the preservation of existing functionality.</li>
<li class="c1" aria-level="1">A patch generator could attempt to generate multiple valid patches by exploring multiple trajectories in parallel, and let the LLM choose the option most likely to be correct.</li>
<li class="c1" aria-level="1">In a well-tested real-world codebase, a patch generator can utilize existing tests to validate the patches it creates. This process complements building the code and checking for crash reproduction, allowing the patch generator to retry if a patch fails the tests. The accuracy of the generated patches is largely dependent on the thoroughness of the existing tests.</li>
</ul><p>In conclusion, while our study has identified several challenges with the current patch generation process, it also opens up opportunities for improvement. By addressing these limitations with innovative solutions, we can enhance the accuracy and reliability of patch generation, paving the way for more robust and effective automated tools.</p>
<h2>Get started with AutoPatchBench</h2>
<p>AutoPatchBench is now available on <a href="https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks">GitHub</a>. We welcome pull requests that integrate new agent architectures into the framework, and look forward to seeing how well they perform on AutoPatchBench.</p>
      <link>https://engineering.fb.com/2025/04/29/ai-research/autopatchbench-benchmark-ai-powered-security-fixes/</link>
      <guid>https://engineering.fb.com/2025/04/29/ai-research/autopatchbench-benchmark-ai-powered-security-fixes/</guid>
      <pubDate>Tue, 29 Apr 2025 19:15:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Building Private Processing for AI tools on WhatsApp]]></title>
<description><![CDATA[<ul><li class="c1" aria-level="1">We are inspired by the possibilities of AI to help people be more creative and productive, and stay closely connected on WhatsApp, so we set out to build a new technology that allows our users around the world to use AI in a privacy-preserving way.</li>
<li class="c1" aria-level="1">We’re sharing an early look into Private Processing, an optional capability that enables users to initiate a request to a confidential and secure environment and use AI for processing messages where no one — including Meta and WhatsApp — can access them.</li>
<li class="c1" aria-level="1">To validate our implementation of these and other security principles, independent security researchers will be able to continuously verify our privacy and security architecture and its integrity.</li>
</ul><p>AI has revolutionized the way people interact with technology and information, making it possible for people to automate complex tasks and gain valuable insights from vast amounts of data. However, the current state of AI processing — which relies on large language models often running on servers, rather than mobile hardware — requires that users’ requests are visible to the provider. Although that works for many use cases, it presents challenges in enabling people to use AI to process private messages while preserving the level of privacy afforded by end-to-end encryption.</p>
<p>We set out to enable AI capabilities with the privacy that people have come to expect from WhatsApp, so that AI can deliver helpful capabilities, such as summarizing messages, without Meta or WhatsApp having access to them, and in a way that meets the following principles:</p>
<ul><li class="c1" aria-level="1"><strong>Optionality:</strong> Using Meta AI through WhatsApp, including features that use Private Processing, must be optional. </li>
<li class="c1" aria-level="1"><strong>Transparency:</strong> We must provide transparency when our features use Private Processing.</li>
<li class="c1" aria-level="1"><strong>User control:</strong> For people’s most sensitive chats that require extra assurance, they must be able to prevent messages from being used for AI features like mentioning Meta AI in chats, with the help of WhatsApp’s <a href="https://blog.whatsapp.com/introducing-advanced-chat-privacy" target="_blank" rel="noopener">Advanced Chat Privacy</a> feature.</li>
</ul><h2>Introducing Private Processing</h2>
<p>We’re excited to share an initial overview of Private Processing, a new technology we’ve built to support people’s needs and aspirations to leverage AI in a secure and privacy-preserving way. This confidential computing infrastructure, built on top of a Trusted Execution Environment (TEE), will make it possible for people to direct AI to process their requests — like summarizing unread WhatsApp threads or getting writing suggestions — in our secure and private cloud environment. In other words, Private Processing will allow users to leverage powerful AI features, while preserving WhatsApp’s core privacy promise, ensuring <strong>no one except you and the people you’re talking to can access or share your personal messages, not even Meta or WhatsApp. </strong></p>
<p>To uphold this level of privacy and security, we designed Private Processing with the following foundational requirements:</p>
<ul><li class="c1" aria-level="1"><strong>Confidential processing:</strong> Private Processing must be built in such a way that prevents any other system — including Meta, WhatsApp, or any third party — from accessing users’ data while it is being processed or in transit to Private Processing.</li>
<li class="c1" aria-level="1"><strong>Enforceable guarantees:</strong> Attempts to modify that confidential processing guarantee must cause the system to fail closed or become publicly discoverable via verifiable transparency.</li>
<li class="c1" aria-level="1"><strong>Verifiable transparency:</strong> Users and security researchers must be able to audit the behavior of Private Processing to independently verify our privacy and security guarantees.</li>
</ul><p>However, we know that technology platforms like ours operate in a highly adversarial environment where threat actors continuously adapt, and software and hardware systems keep evolving, generating unknown risks. As part of our <a href="https://engineering.fb.com/2022/07/28/security/five-security-principles-for-billions-of-messages-across-metas-apps/" target="_blank" rel="noopener">defense-in-depth approach</a> and best practices for any security-critical system, we’re treating the following additional layers of requirements as core to Private Processing on WhatsApp:</p>
<ul><li class="c1" aria-level="1"><strong>Non-targetability:</strong> An attacker should not be able to target a particular user for compromise without attempting to compromise the entire Private Processing system.</li>
<li class="c1" aria-level="1"><strong>Stateless processing and forward security:</strong> Private Processing must not retain access to user messages once the session is complete, ensuring that an attacker cannot gain access to historical requests or responses.</li>
</ul><h3>Threat modeling for Private Processing</h3>
<p>Because we set out to meet these high-security requirements, our work to build Private Processing began with developing a threat model to help us identify potential attack vectors and vulnerabilities that could compromise the confidentiality, integrity, or availability of user data. We’ve worked with our peers in the security community to audit the architecture and our implementation to help us continue to harden them. </p>
<h3>Building in the open</h3>
<p>To help inform our industry’s progress in building private AI processing, and to enable independent security research in this area, we will be publishing components of Private Processing, expanding the scope of our <a href="https://bugbounty.meta.com/" target="_blank" rel="noopener">Bug Bounty program</a> to include Private Processing, and releasing a detailed security engineering design paper, <strong>as we get closer to the launch of Private Processing in the coming weeks. </strong></p>
<p>While AI-enabled processing of personal messages for summarization and writing suggestions at users’ direction is the first use case where Meta applies Private Processing, we expect there will be others where the same or similar infrastructure might be beneficial in processing user requests. We will continue to share our learnings and progress transparently and responsibly.</p>
<h2>How Private Processing works</h2>
<p>Private Processing creates a secure cloud environment where AI models can analyze and process data without exposing it to unauthorized parties. </p>
<p>Here’s how it works:</p>
<ul><li class="c1" aria-level="1"><strong>Authentication:</strong> First, Private Processing obtains <a href="https://engineering.fb.com/2022/12/12/security/anonymous-credential-service-acs-open-source/" target="_blank" rel="noopener">anonymous credentials</a> to verify that future requests are coming from authentic WhatsApp clients.</li>
<li class="c1" aria-level="1"><strong>Third-party routing and load balancing:</strong> In addition to these credentials, Private Processing fetches HPKE encryption public keys from a third-party CDN in order to support Oblivious HTTP (OHTTP).</li>
<li class="c1" aria-level="1"><strong>Wire session establishment:</strong> Private Processing establishes an OHTTP connection from the user’s device to a Meta gateway via a third-party relay, which hides the requester’s IP address from Meta and WhatsApp.</li>
<li class="c1" aria-level="1"><strong>Application session establishment:</strong> Private Processing establishes a Remote Attestation + Transport Layer Security (RA-TLS) session between the user’s device and the TEE. The attestation verification step cross-checks the measurements against a third-party ledger to ensure that the client only connects to code which satisfies our verifiable transparency guarantee.</li>
<li class="c1" aria-level="1"><strong>Request to Private Processing:</strong> After the above session is established, the device makes a request to Private Processing (e.g., a message-summarization request) that is encrypted end-to-end between the device and Private Processing with an ephemeral key that Meta and WhatsApp cannot access. In other words, no one except the user’s device or the selected TEEs can decrypt the request.</li>
<li class="c1" aria-level="1"><strong>Private Processing:</strong> Our AI models process data in a confidential virtual machine (CVM), a type of TEE, without storing any messages, in order to generate a response. CVMs may communicate with other CVMs using the same RA-TLS connection clients use to complete processing. </li>
<li class="c1" aria-level="1"><strong>Response from Private Processing:</strong> The processed results are then returned to the user’s device, encrypted with a key that only the device and the pre-selected Private Processing server ever have access to. Private Processing does not retain access to messages after the session is completed.</li>
</ul><h2>The threat model</h2>
<p>In designing any security-critical system, it is important to develop a threat model to guide how we build its defenses. Our threat model for Private Processing includes three key components:</p>
<ul><li class="c1" aria-level="1"><strong>Assets</strong>: The sensitive data and systems that we need to protect.</li>
<li class="c1" aria-level="1"><strong>Threat actors</strong>: The individuals or groups that may attempt to compromise our assets.</li>
<li class="c1" aria-level="1"><strong>Threat scenarios</strong>: The ways in which our assets could be compromised, including the tactics, techniques, and procedures (TTPs) that threat actors might use.</li>
</ul><h3>Assets</h3>
<p>In the context of applying Private Processing to summarizing unread messages or providing writing suggestions at users’ direction, we will use Private Processing to protect message content, whether it has already been received by the user or is still in draft form. We use the term “messages” to refer to these primary assets in the context of this blog.</p>
<p>In addition to messages, we also include additional, secondary assets which help support the goal of Private Processing and may interact with or directly process assets: the Trusted Computing Base (TCB) of the Confidential Virtual Machine (CVM), the underlying hardware, and the cryptographic keys used to protect data in transit.</p>
<h3>Threat actors</h3>
<p>We have identified three threat actor types that could attack our system to attempt to recover assets.</p>
<ol><li class="c1" aria-level="1">Malicious or compromised insiders with access to our infrastructure.</li>
<li class="c1" aria-level="1">A third party or supply chain vendor with access to components of the infrastructure.</li>
<li class="c1" aria-level="1">Malicious end users targeting other users on the platform.</li>
</ol><h3>Threat scenarios</h3>
<p>When building Private Processing to be resilient against these threat actors, we consider relevant threat scenarios that may be pursued against our systems, including (but not limited to) the following:</p>
<h4>External actors directly exploit the exposed product attack surface or compromise the services running in Private Processing CVMs to extract messages.</h4>
<p>Anywhere the system processes untrusted data, there is potentially an attack surface for a threat actor to exploit. Examples of these kinds of attacks include exploitation of zero-day vulnerabilities or attacks unique to AI such as prompt injection. </p>
<p>Private Processing is designed to reduce such an attack surface through limiting the exposed entry points to a small set of thoroughly reviewed components which are subject to regular assurance testing. The service binaries are hardened and run in a containerized environment to mitigate the risks of code execution and limit a compromised binary’s ability to exfiltrate data from within the CVM to an external party.</p>
<h4>Internal or external attackers extract messages exposed through the CVM.</h4>
<p>Observability and debuggability remain a challenge in highly secure environments, as they can be at odds with the goals of confidential computing, potentially exposing side channels that identify data and, in the worst case, accidentally leaking messages themselves. However, deploying any service at scale requires some level of observability to identify failure modes, since even infrequent failures may negatively impact many users. We implement a log-filtering system that limits export to an allowlist of log lines, such as error logs.</p>
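<p>As a minimal illustration of an allowlist-based log filter, the sketch below drops any log record that does not match a pre-approved pattern, so free-form messages (which could embed user data) never cross the CVM boundary. The patterns and the filter class here are invented for illustration and are not the production implementation.</p>

```python
import logging
import re

# Only log lines matching these pre-approved templates may be exported.
ALLOWED_PATTERNS = [
    re.compile(r"ERROR: request failed with code \d+"),
    re.compile(r"ERROR: model load timeout"),
]

class AllowlistFilter(logging.Filter):
    def filter(self, record):
        message = record.getMessage()
        # Drop anything that is not explicitly allowed.
        return any(p.fullmatch(message) for p in ALLOWED_PATTERNS)
```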
<p>Like any complex system, Private Processing is built of components to form a complex supply chain of both hardware and software. Internally, our CVM build process occurs in restricted environments that maintain provenance and require multi-party review. Transparency of the CVM environment, which we’ll provide through publishing a third-party log of CVM binary digests and CVM binary images, will allow external researchers to analyze, replicate, and report instances where they believe logs could leak user data.</p>
<h4>Insiders with physical or remote access to Private Processing hosts interfere with the CVM at boot and runtime, potentially bypassing the protections in order to extract messages.</h4>
<p>TEE software exploitation is a growing area of security research, and vulnerability researchers have repeatedly demonstrated the ability to bypass TEE guarantees. Similarly, physical attacks on Private Processing hosts may be used to defeat TEE guarantees or present compromised hosts as legitimate to an end user.</p>
<p>To address these unknown risks, we built Private Processing on the principle of defense-in-depth by actively tracking novel vulnerabilities in this space, minimizing and sanitizing untrusted inputs to the TEE, minimizing attack surface through CVM hardening and enabling abuse detection through enhanced host monitoring.</p>
<p>Because we know that defending against physical access introduces significant complexity and attack surface even with industry-leading controls, we continuously pursue further attack surface hardening. In addition, we reduce these risks through measures like encrypted DRAM and standard physical security controls to protect our datacenters from bad actors.</p>
<p>To further address these unknown risks, we seek to eliminate the viability of targeted attacks by routing sessions through a third-party OHTTP relay, preventing an attacker from routing a specific user to a specific machine.</p>
<h2>Designing Private Processing</h2>
<p>Here is how we designed Private Processing to meet these foundational security and privacy requirements against the threat model we developed.</p>
<p><em>(Further technical documentation and security research engagements updates are coming soon).</em></p>
<h3>Confidential processing</h3>
<p>Data shared to Private Processing is processed in an environment which does not make it available to any other system. This protection is further upheld by encrypting data end-to-end between the client and the Private Processing application, so that only Private Processing, and no one in between – including Meta, WhatsApp, or any third-party relay – can access the data.</p>
<p>To prevent possible user data leakage, only limited service reliability logs are permitted to leave the boundaries of CVM.</p>
<h3>System software</h3>
<p>To prevent privileged runtime access to Private Processing, we prohibit remote shell access, including from the host machine, and implement security measures including code isolation. Code isolation ensures that only designated code in Private Processing has access to user data. Prohibited remote shell access ensures that neither the host nor a networked user can gain access to the CVM shell.</p>
<p>We defend against potential source control and supply chain attacks by implementing established industry best practices. This includes building software exclusively from checked-in source code and artifacts, where any change requires multiple engineers to modify the build artifacts or build pipeline.</p>
<p>As another layer of security, all code changes are auditable. This allows us to ensure that any potential issues are discovered — either through our continuous internal audits of code, or by external security researchers auditing our binaries.</p>
<h3>System hardware</h3>
<p>Private Processing utilizes CPU-based confidential virtualization technologies, along with Confidential Compute mode GPUs, which prevent certain classes of attacks from the host operating system, as well as certain physical attacks.</p>
<h3>Enforceable guarantees</h3>
<p>Private Processing utilizes CPU-based confidential virtualization technologies that allow attestation of software, rooted in a hardware root of trust, to guarantee the security of the system prior to each client-server connection. Before any data is transmitted, Private Processing checks these attestations and confirms them against a third-party log of acceptable binaries.</p>
<h3>Stateless and forward secure service</h3>
<p>We operate Private Processing as a stateless service, which neither stores nor retains access to messages after the session has been completed.</p>
<p>Additionally, Private Processing does not store messages to disk or external storage, and thus does not maintain durable access to this data.</p>
<p>As part of our data minimization efforts, requests to Private Processing only include data that is useful for processing the prompt — for example, message summarization will only include the messages the user directed AI to summarize.</p>
<h3>Non-targetability</h3>
<p>Private Processing implements the OHTTP protocol to establish a secure session with Meta routing layers. This ensures that Meta and WhatsApp do not know which user is connecting to which CVM. In other words, while a request is en route, Meta and WhatsApp do not know which user initiated it, so a specific user cannot be routed to any specific hardware.</p>
<p>Private Processing uses anonymous credentials to authenticate users over OHTTP. This way, Private Processing can authenticate users to the Private Processing system, but remains unable to identify them. Private Processing does not include any other identifiable information as part of the request during the establishment of a system session. We limit the impact of small-scale attacks by ensuring that they cannot be used to target the data of a specific user.</p>
<h3>Verifiable transparency</h3>
<p>To provide users visibility into the processing of their data and aid in validation of any client-side behaviors, we will provide capabilities to obtain an in-app log of requests made to Private Processing, data shared with it, and details of how that secure session was set up. </p>
<p>In order to provide verifiability, we will make available the CVM image binary powering Private Processing. We will make these components available to researchers to allow independent, external verification of our implementation.</p>
<p>In addition, to enable deeper bug bounty research in this area, we will publish source code for certain components of the system, including our attestation verification code and other load-bearing code.</p>
<p>We will also be expanding the scope of our existing <a href="https://bugbounty.meta.com/">Bug Bounty program</a> to cover Private Processing to enable further independent security research into Private Processing’s design and implementation. </p>
<p>Finally, we will be publishing a detailed technical white paper on the security engineering design of Private Processing to provide further transparency into our security practices, and aid others in the industry in building similar systems.</p>
<h2>Get Involved</h2>
<p>We’re deeply committed to providing our users with the best possible messaging experience while ensuring that only they and the people they’re talking to can access or share their personal messages. Private Processing is a critical component of this commitment, and we’re excited to make it available in the coming weeks.</p>
<p>We welcome feedback from our users, researchers, and the broader security community through our security research program:</p>
<ul><li class="c1" aria-level="1">More details: <a href="https://bugbounty.meta.com">Meta Bug Bounty</a></li>
<li class="c1" aria-level="1"><a href="mailto:bugbounty@meta.com">Contact us</a></li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/04/29/security/building-private-processing-for-ai-tools-on-whatsapp/</link>
      <guid>https://engineering.fb.com/2025/04/29/security/building-private-processing-for-ai-tools-on-whatsapp/</guid>
      <pubDate>Tue, 29 Apr 2025 19:15:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Building Private Processing for AI tools on WhatsApp]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We are inspired by the possibilities of AI to help people be more creative, productive, and stay closely connected on WhatsApp, so we set out to build a new technology that allows our users around the world to use AI in a privacy-preserving way.</li>
<li class="c1" aria-level="1">We’re sharing an early look into Private Processing, an optional capability that enables users to initiate a request to a confidential and secure environment and use AI for processing messages where no one — including Meta and WhatsApp — can access them.</li>
<li class="c1" aria-level="1">To validate our implementation of these and other security principles, independent security researchers will be able to continuously verify our privacy and security architecture and its integrity.</li>
</ul><p>AI has revolutionized the way people interact with technology and information, making it possible for people to automate complex tasks and gain valuable insights from vast amounts of data. However, the current state of AI processing — which relies on large language models often running on servers, rather than mobile hardware — requires that users’ requests are visible to the provider. Although that works for many use cases, it presents challenges in enabling people to use AI to process private messages while preserving the level of privacy afforded by end-to-end encryption.</p>
<p>We set out to enable AI capabilities with the privacy that people have come to expect from WhatsApp, so that AI can deliver helpful capabilities, such as summarizing messages, without Meta or WhatsApp having access to them, in a way that meets the following principles:</p>
<ul><li class="c1" aria-level="1"><strong>Optionality:</strong> Using Meta AI through WhatsApp, including features that use Private Processing, must be optional. </li>
<li class="c1" aria-level="1"><strong>Transparency:</strong> We must provide transparency when our features use Private Processing.</li>
<li class="c1" aria-level="1"><strong>User control:</strong> For people’s most sensitive chats that require extra assurance, they must be able to prevent their messages from being used by AI features, such as mentioning Meta AI in chats, with the help of WhatsApp’s <a href="https://blog.whatsapp.com/introducing-advanced-chat-privacy" target="_blank" rel="noopener">Advanced Chat Privacy</a> feature.</li>
</ul><h2>Introducing Private Processing</h2>
<p>We’re excited to share an initial overview of Private Processing, a new technology we’ve built to support people’s needs and aspirations to leverage AI in a secure and privacy-preserving way. This confidential computing infrastructure, built on top of a Trusted Execution Environment (TEE), will make it possible for people to direct AI to process their requests — like summarizing unread WhatsApp threads or getting writing suggestions — in our secure and private cloud environment. In other words, Private Processing will allow users to leverage powerful AI features, while preserving WhatsApp’s core privacy promise, ensuring <strong>no one except you and the people you’re talking to can access or share your personal messages, not even Meta or WhatsApp. </strong></p>
<p>To uphold this level of privacy and security, we designed Private Processing with the following foundational requirements:</p>
<ul><li class="c1" aria-level="1"><strong>Confidential processing:</strong> Private Processing must be built in such a way that it prevents any other system from accessing users’ data — including Meta, WhatsApp, or any third party — while it is being processed or in transit to Private Processing.</li>
<li class="c1" aria-level="1"><strong>Enforceable guarantees:</strong> Attempts to modify that confidential processing guarantee must cause the system to fail closed or become publicly discoverable via verifiable transparency.</li>
<li class="c1" aria-level="1"><strong>Verifiable transparency:</strong> Users and security researchers must be able to audit the behavior of Private Processing to independently verify our privacy and security guarantees.</li>
</ul><p>However, we know that technology platforms like ours operate in a highly adversarial environment where threat actors continuously adapt, and software and hardware systems keep evolving, generating unknown risks. As part of our <a href="https://engineering.fb.com/2022/07/28/security/five-security-principles-for-billions-of-messages-across-metas-apps/" target="_blank" rel="noopener">defense-in-depth approach</a> and best practices for any security-critical system, we’re treating the following additional layers of requirements as core to Private Processing on WhatsApp:</p>
<ul><li class="c1" aria-level="1"><strong>Non-targetability:</strong> An attacker should not be able to target a particular user for compromise without attempting to compromise the entire Private Processing system.</li>
<li class="c1" aria-level="1"><strong>Stateless processing and forward security:</strong> Private Processing must not retain access to user messages once the session is complete, ensuring that an attacker cannot gain access to historical requests or responses.</li>
</ul><h3>Threat modeling for Private Processing</h3>
<p>Because we set out to meet these high-security requirements, our work to build Private Processing began with developing a threat model to help us identify potential attack vectors and vulnerabilities that could compromise the confidentiality, integrity, or availability of user data. We’ve worked with our peers in the security community to audit the architecture and our implementation to help us continue to harden them. </p>
<h3>Building in the open</h3>
<p>To help inform our industry’s progress in building private AI processing, and to enable independent security research in this area, we will be publishing components of Private Processing, expanding the scope of our <a href="https://bugbounty.meta.com/" target="_blank" rel="noopener">Bug Bounty program</a> to include Private Processing, and releasing a detailed security engineering design paper, <strong>as we get closer to the launch of Private Processing in the coming weeks. </strong></p>
<p>While AI-enabled processing of personal messages for summarization and writing suggestions at users’ direction is the first use case where Meta applies Private Processing, we expect there will be others where the same or similar infrastructure might be beneficial in processing user requests. We will continue to share our learnings and progress transparently and responsibly.</p>
<h2>How Private Processing works</h2>
<p>Private Processing creates a secure cloud environment where AI models can analyze and process data without exposing it to unauthorized parties. </p>
<p>Here’s how it works:</p>
<ul><li class="c1" aria-level="1"><strong>Authentication:</strong> First, Private Processing obtains <a href="https://engineering.fb.com/2022/12/12/security/anonymous-credential-service-acs-open-source/" target="_blank" rel="noopener">anonymous credentials</a> to verify that the future requests are coming from authentic WhatsApp clients.</li>
<li class="c1" aria-level="1"><strong>Third-party routing and load balancing:</strong> In addition to these credentials, Private Processing fetches HPKE encryption public keys from a third-party CDN in order to support Oblivious HTTP (OHTTP).</li>
<li class="c1" aria-level="1"><strong>Wire session establishment:</strong> Private Processing establishes an OHTTP connection from the user’s device to a Meta gateway via a third-party relay which hides requester IP from Meta and WhatsApp.</li>
<li class="c1" aria-level="1"><strong>Application session establishment:</strong> Private Processing establishes a Remote Attestation + Transport Layer Security (RA-TLS) session between the user’s device and the TEE. The attestation verification step cross-checks the measurements against a third-party ledger to ensure that the client only connects to code which satisfies our verifiable transparency guarantee.</li>
<li class="c1" aria-level="1"><strong>Request to Private Processing:</strong> After the above session is established, the device makes a request to Private Processing (e.g., message summarization request), that is encrypted end-to-end between the device and Private Processing with an ephemeral key that Meta and WhatsApp cannot access. In other words, no one except the user’s device or the selected TEEs can decrypt the request.</li>
<li class="c1" aria-level="1"><strong>Private Processing:</strong> Our AI models process data in a confidential virtual machine (CVM), a type of TEE, without storing any messages, in order to generate a response. CVMs may communicate with other CVMs using the same RA-TLS connection clients use to complete processing. </li>
<li class="c1" aria-level="1"><strong>Response from Private Processing:</strong> The processed results are then returned to the user’s device, encrypted with a key that only the device and the pre-selected Private Processing server ever have access to. Private Processing does not retain access to messages after the session is completed.</li>
</ul><h2>The threat model</h2>
<p>In designing any security-critical system, it is important to develop a threat model to guide how we build its defenses. Our threat model for Private Processing includes three key components:</p>
<ul><li class="c1" aria-level="1"><strong>Assets</strong>: The sensitive data and systems that we need to protect.</li>
<li class="c1" aria-level="1"><strong>Threat actors</strong>: The individuals or groups that may attempt to compromise our assets.</li>
<li class="c1" aria-level="1"><strong>Threat scenarios</strong>: The ways in which our assets could be compromised, including the tactics, techniques, and procedures (TTPs) that threat actors might use.</li>
</ul><h3>Assets</h3>
<p>In the context of applying Private Processing to summarizing unread messages or providing writing suggestions at users’ direction, we will use Private Processing to protect message content, whether it has been received by the user or is still in draft form. We use the term “messages” to refer to these primary assets in the context of this blog.</p>
<p>In addition to messages, we also include additional, secondary assets which help support the goal of Private Processing and may interact with or directly process assets: the Trusted Computing Base (TCB) of the Confidential Virtual Machine (CVM), the underlying hardware, and the cryptographic keys used to protect data in transit.</p>
<h3>Threat actors</h3>
<p>We have identified three threat actor types that could attack our system to attempt to recover assets.</p>
<ol><li class="c1" aria-level="1">Malicious or compromised insiders with access to our infrastructure.</li>
<li class="c1" aria-level="1">A third party or supply chain vendor with access to components of the infrastructure.</li>
<li class="c1" aria-level="1">Malicious end users targeting other users on the platform.</li>
</ol><h3>Threat scenarios</h3>
<p>When building Private Processing to be resilient against these threat actors, we consider relevant threat scenarios that may be pursued against our systems, including (but not limited to) the following:</p>
<h4>External actors directly exploit the exposed product attack surface or compromise the services running in Private Processing CVMs to extract messages.</h4>
<p>Anywhere the system processes untrusted data, there is potentially an attack surface for a threat actor to exploit. Examples of these kinds of attacks include exploitation of zero-day vulnerabilities or attacks unique to AI such as prompt injection. </p>
<p>Private Processing is designed to reduce such an attack surface through limiting the exposed entry points to a small set of thoroughly reviewed components which are subject to regular assurance testing. The service binaries are hardened and run in a containerized environment to mitigate the risks of code execution and limit a compromised binary’s ability to exfiltrate data from within the CVM to an external party.</p>
<h4>Internal or external attackers extract messages exposed through the CVM.</h4>
<p>Observability and debuggability remain a challenge in highly secure environments, as they can be at odds with the goal of confidential computing, potentially exposing side channels that identify data and, in the worst case, accidentally leaking messages themselves. However, deploying any service at scale requires some level of observability to identify failure modes, since even infrequent failures may negatively impact many users. We implement a log-filtering system that limits export to only allowed log lines, such as error logs.</p>
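<p>A minimal sketch of such an export filter, assuming a hypothetical allowlist of reliability log patterns (the patterns and names below are illustrative, not Meta’s actual rules):</p>

```python
import re

# Illustrative allowlist: only pre-approved reliability patterns may leave
# the CVM boundary. These patterns are hypothetical examples.
ALLOWED_PATTERNS = [
    re.compile(r"ERROR inference_timeout request_age_ms=\d+"),
    re.compile(r"WARN gpu_memory_pressure pct=\d{1,3}"),
]

def filter_exportable(lines):
    """Return only log lines that exactly match an allowed pattern.

    Anything else, including lines that might embed user content,
    is dropped rather than exported.
    """
    return [l for l in lines if any(p.fullmatch(l) for p in ALLOWED_PATTERNS)]

logs = [
    "ERROR inference_timeout request_age_ms=5021",
    "DEBUG prompt='summarize my chat with Alice'",  # would leak user data
    "WARN gpu_memory_pressure pct=93",
]
exported = filter_exportable(logs)
```

<p>The design choice worth noting is the default: an allowlist fails closed, so a new log line leaks nothing until it has been explicitly reviewed and approved.</p>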
<p>Like any complex system, Private Processing is built of components to form a complex supply chain of both hardware and software. Internally, our CVM build process occurs in restricted environments that maintain provenance and require multi-party review. Transparency of the CVM environment, which we’ll provide through publishing a third-party log of CVM binary digests and CVM binary images, will allow external researchers to analyze, replicate, and report instances where they believe logs could leak user data.</p>
<h4>Insiders with physical or remote access to Private Processing hosts interfere with the CVM at boot and runtime, potentially bypassing the protections in order to extract messages.</h4>
<p>TEE software exploitation is a growing area of security research, and vulnerability researchers have repeatedly demonstrated the ability to bypass TEE guarantees. Similarly, physical attacks on Private Processing hosts may be used to defeat TEE guarantees or present compromised hosts as legitimate to an end user.</p>
<p>To address these unknown risks, we built Private Processing on the principle of defense-in-depth by actively tracking novel vulnerabilities in this space, minimizing and sanitizing untrusted inputs to the TEE, minimizing attack surface through CVM hardening and enabling abuse detection through enhanced host monitoring.</p>
<p>Because we know that defending against physical access introduces significant complexity and attack surface even with industry-leading controls, we continuously pursue further attack surface hardening. In addition, we reduce these risks through measures like encrypted DRAM and standard physical security controls to protect our datacenters from bad actors.</p>
<p>To further address these unknown risks, we eliminate the viability of targeted attacks by routing sessions through a third-party OHTTP relay, preventing an attacker from steering a specific user to a specific machine.</p>
<h2>Designing Private Processing</h2>
<p>Here is how we designed Private Processing to meet these foundational security and privacy requirements against the threat model we developed.</p>
<p><em>(Further technical documentation and security research engagement updates are coming soon.)</em></p>
<h3>Confidential processing</h3>
<p>Data shared with Private Processing is processed in an environment that does not make it available to any other system. This protection is further upheld by encrypting data end-to-end between the client and the Private Processing application, so that only Private Processing – and no one in between, including Meta, WhatsApp, or any third-party relay – can access the data.</p>
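<p>The ephemeral end-to-end channel can be illustrated with a toy key agreement plus a throwaway cipher. This is a concept sketch only: real deployments would use vetted constructions such as HPKE with authenticated encryption, and the group parameters and cipher below are far too weak for actual use.</p>

```python
import hashlib
import secrets

# Toy Diffie-Hellman group (the 127-bit Mersenne prime); illustrative only.
P = 2**127 - 1
G = 5

def keypair():
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)

def shared_key(priv, peer_pub):
    # Both ends derive the same ephemeral session key from the DH secret.
    secret = pow(peer_pub, priv, P)
    return hashlib.sha256(str(secret).encode()).digest()

def xor_stream(key, data):
    # Throwaway keystream cipher; encryption and decryption are the same op.
    stream = hashlib.sha256(key + b"stream").digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(data))

c_priv, c_pub = keypair()            # client (device)
s_priv, s_pub = keypair()            # Private Processing CVM
k_client = shared_key(c_priv, s_pub)
k_server = shared_key(s_priv, c_pub)
assert k_client == k_server          # same ephemeral key on both ends

ciphertext = xor_stream(k_client, b"summarize unread messages")
# Any relay forwarding `ciphertext` sees only opaque bytes; only the CVM
# holding the session key can recover the request.
plaintext = xor_stream(k_server, ciphertext)
```

<p>Because the key is ephemeral and derived per session, discarding it after the session also discards the ability to decrypt recorded traffic, which is the forward-security property described below.</p>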
<p>To prevent possible user data leakage, only limited service reliability logs are permitted to leave the boundaries of the CVM.</p>
<h3>System software</h3>
<p>To prevent privileged runtime access to Private Processing, we prohibit remote shell access, including from the host machine, and implement security measures including code isolation. Code isolation ensures that only designated code in Private Processing has access to user data. Prohibited remote shell access ensures that neither the host nor a networked user can gain access to the CVM shell.</p>
<p>We defend against potential source control and supply chain attacks by implementing established industry best practices. This includes building software exclusively from checked-in source code and artifacts, where any change requires multiple engineers to modify the build artifacts or build pipeline.</p>
<p>As another layer of security, all code changes are auditable. This allows us to ensure that any potential issues are discovered — either through our continuous internal audits of code, or by external security researchers auditing our binaries.</p>
<h3>System hardware</h3>
<p>Private Processing utilizes CPU-based confidential virtualization technologies, along with Confidential Compute mode GPUs, which prevent certain classes of attacks from the host operating system, as well as certain physical attacks.</p>
<h3>Enforceable guarantees</h3>
<p>Private Processing utilizes CPU-based confidential virtualization technologies that allow attestation of software, rooted in a hardware root of trust, to guarantee the security of the system prior to each client-server connection. Before any data is transmitted, Private Processing checks these attestations and confirms them against a third-party log of acceptable binaries.</p>
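<p>Conceptually, the client-side check is a set membership test: before any data is sent, the attested measurement of the CVM image must appear in a public log of acceptable binaries, and the connection fails closed otherwise. The log contents and digest scheme below are invented for illustration.</p>

```python
import hashlib

# Hypothetical third-party transparency log of acceptable CVM image digests.
transparency_log = {
    hashlib.sha256(b"cvm-image-v1").hexdigest(),
    hashlib.sha256(b"cvm-image-v2").hexdigest(),
}

def verify_attestation(attested_measurement: str) -> bool:
    """Fail closed: connect only if the measurement is in the public log."""
    return attested_measurement in transparency_log

good = hashlib.sha256(b"cvm-image-v2").hexdigest()   # known-good build
bad = hashlib.sha256(b"tampered-image").hexdigest()  # unlogged build
```

<p>Publishing the log to a third party is what turns this from a promise into an enforceable guarantee: a modified binary either fails the check or becomes publicly discoverable in the log.</p>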
<h3>Stateless and forward secure service</h3>
<p>We operate Private Processing as a stateless service, which neither stores nor retains access to messages after the session has been completed.</p>
<p>Additionally, Private Processing does not store messages to disk or external storage, and thus does not maintain durable access to this data.</p>
<p>As part of our data minimization efforts, requests to Private Processing only include data that is useful for processing the prompt — for example, message summarization will only include the messages the user directed AI to summarize.</p>
<h3>Non-targetability</h3>
<p>Private Processing implements the OHTTP protocol to establish a secure session with Meta routing layers. This ensures that Meta and WhatsApp do not know which user is connecting to which CVM. In other words, while a request is en route, Meta and WhatsApp do not know which user initiated it, so a specific user cannot be routed to any specific hardware.</p>
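<p>The split of knowledge that OHTTP provides can be modeled in a few lines: the third-party relay learns who is connecting (the source IP) but cannot open the request, while the gateway decrypts the request but sees only the relay’s address. The encryption here is a toy symmetric stand-in for OHTTP’s HPKE encapsulation, and all names are illustrative.</p>

```python
import hashlib
import secrets

gateway_key = secrets.token_bytes(32)  # public-key crypto in real OHTTP

def seal(key, data):
    # Toy stand-in for HPKE seal/open; applying it twice restores the data.
    stream = hashlib.sha256(key).digest()
    return bytes(b ^ stream[i % 32] for i, b in enumerate(data))

def client_send(ip, request):
    # The client encapsulates the request toward the gateway's key.
    return {"src_ip": ip, "blob": seal(gateway_key, request)}

def relay_forward(packet):
    # The relay replaces the client IP before forwarding; it holds no key
    # and cannot open the blob.
    return {"src_ip": "relay", "blob": packet["blob"]}

def gateway_receive(packet):
    # The gateway decrypts the request but only ever sees the relay's address.
    return packet["src_ip"], seal(gateway_key, packet["blob"])

pkt = client_send("203.0.113.7", b"summarize thread 42")
seen_by_gateway = gateway_receive(relay_forward(pkt))
```

<p>Neither party alone can link a user to a request, which is exactly the non-targetability property: compromising the routing layer yields IPs without content, and compromising the gateway yields content without IPs.</p>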
<p>Private Processing uses anonymous credentials to authenticate users over OHTTP. This way, Private Processing can authenticate users to the Private Processing system, but remains unable to identify them. Private Processing does not include any other identifiable information as part of the request during the establishment of a system session. We limit the impact of small-scale attacks by ensuring that they cannot be used to target the data of a specific user.</p>
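<p>A drastically simplified model of credential redemption, assuming a client-chosen random token tagged by the issuer: at redemption the service verifies the tag without any user identifier in the request. Real anonymous credential schemes (such as blinded tokens) additionally prevent the issuer from linking issuance to redemption, which this plain HMAC sketch does not attempt; it only illustrates identity-free redemption.</p>

```python
import hashlib
import hmac
import secrets

ISSUER_KEY = secrets.token_bytes(32)  # held by the credential issuer

def issue(token: bytes) -> bytes:
    # Issuance happens over an authenticated channel: the tag vouches that
    # a real client requested it, but binds only the token, not an identity.
    return hmac.new(ISSUER_KEY, token, hashlib.sha256).digest()

def redeem(token: bytes, tag: bytes) -> bool:
    # Redemption over OHTTP: the service checks the tag with no user
    # identifier anywhere in the request.
    return hmac.compare_digest(issue(token), tag)

token = secrets.token_bytes(16)  # client-chosen, random, identity-free
tag = issue(token)
```
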
<h3>Verifiable transparency</h3>
<p>To provide users visibility into the processing of their data and aid in validation of any client-side behaviors, we will provide capabilities to obtain an in-app log of requests made to Private Processing, data shared with it, and details of how that secure session was set up. </p>
<p>In order to provide verifiability, we will make available the CVM image binary powering Private Processing. We will make these components available to researchers to allow independent, external verification of our implementation.</p>
<p>In addition, to enable deeper bug bounty research in this area, we will publish source code for certain components of the system, including our attestation verification code and other load-bearing code.</p>
<p>We will also be expanding the scope of our existing <a href="https://bugbounty.meta.com/">Bug Bounty program</a> to cover Private Processing to enable further independent security research into Private Processing’s design and implementation. </p>
<p>Finally, we will be publishing a detailed technical white paper on the security engineering design of Private Processing to provide further transparency into our security practices, and aid others in the industry in building similar systems.</p>
<h2>Get Involved</h2>
<p>We’re deeply committed to providing our users with the best possible messaging experience while ensuring that only they and the people they’re talking to can access or share their personal messages. Private Processing is a critical component of this commitment, and we’re excited to make it available in the coming weeks.</p>
<p>We welcome feedback from our users, researchers, and the broader security community through our security research program:</p>
<ul><li class="c1" aria-level="1">More details: <a href="https://bugbounty.meta.com">Meta Bug Bounty</a></li>
<li class="c1" aria-level="1"><a href="mailto:bugbounty@meta.com">Contact us</a></li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/04/29/security/whatsapp-private-processing-ai-tools/</link>
      <guid>https://engineering.fb.com/2025/04/29/security/whatsapp-private-processing-ai-tools/</guid>
      <pubDate>Tue, 29 Apr 2025 19:15:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How Meta understands data at scale]]></title>
      <description><![CDATA[<p>We recently migrated the Code engineering blog. There are a number of additions and enhancements to the site, but this page no longer exists or has been moved to a new section.</p>
<p><a href="https://engineering.fb.com/">Return to the Code blog homepage</a></p>]]></description>
      <link>https://engineering.fb.com/2025/04/28/security/how-meta-understands-data-at-scale/</link>
      <guid>https://engineering.fb.com/2025/04/28/security/how-meta-understands-data-at-scale/</guid>
      <pubDate>Mon, 28 Apr 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta Open Source: 2024 by the numbers]]></title>
      <description><![CDATA[<p>Open source has played an essential role in the tech industry and beyond. Whether in the AI/ML, web, or mobile space, our open source community grew and evolved while connecting people worldwide. </p>
<p>At <a href="https://opensource.fb.com/" target="_blank" rel="noopener">Meta Open Source</a>, 2024 was a year of growth and transformation. Our open source initiatives addressed the evolving needs and challenges of developers—powering breakthroughs in AI and enabling the creation of innovative, user-focused applications and experiences. In close collaboration with the open source community, we shared knowledge, introduced new projects, and enhanced existing ones.</p>
<p>In this post, we look at our portfolio of open source projects through numbers to give a better view of the scale of the community we interact with daily. </p>
<p><img class="alignnone wp-image-22369 size-large" src="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png?w=1024" alt="" width="1024" height="665" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png?resize=916,595 916w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png?resize=768,499 768w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png?resize=1024,665 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png?resize=1536,998 1536w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-2.png?resize=192,125 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>At Meta, <a href="https://github.com/facebook" target="_blank" rel="noopener">we have several GitHub organizations</a> where we publish new open source projects, maintain existing ones, and hold already archived projects. They include various tools, frameworks, and platforms for web, mobile, AI/ML, and hardware industries.</p>
<p><img class="alignnone size-large wp-image-22370" src="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png?w=1024" alt="" width="1024" height="672" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png 1798w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png?resize=916,601 916w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png?resize=768,504 768w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png?resize=1024,672 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png?resize=1536,1008 1536w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png?resize=96,63 96w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-3.png?resize=192,126 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>By the end of last year, we had launched 256 brand-new repositories, bringing the number of active public projects to 944. This number excludes archived repositories and projects that we moved to foundations.</p>
<p><img class="alignnone size-large wp-image-22371" src="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png?w=1024" alt="" width="1024" height="664" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png 1810w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png?resize=916,594 916w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png?resize=768,498 768w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png?resize=1024,664 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png?resize=1536,996 1536w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-4.png?resize=192,125 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>In 2024, our open source codebases grew at an impressive pace, reaching 189,719 total commits in just one year. Community contributors accounted for 71,018, while Meta employees made the remaining 118,701.</p>
<p><img class="alignnone size-large wp-image-22372" src="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png?w=1024" alt="" width="1024" height="664" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png 1810w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png?resize=916,594 916w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png?resize=768,498 768w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png?resize=1024,664 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png?resize=1536,996 1536w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-5.png?resize=192,125 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Open source cannot exist without people collaborating, sharing, and innovating. A total of 4,274 external contributors helped bring our community to 7,144 strong. This remarkable community is what fuels the ongoing evolution of Meta Open Source.</p>
<p><img class="alignnone size-large wp-image-22373" src="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png?w=1024" alt="" width="1024" height="664" srcset="https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png 1810w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png?resize=916,594 916w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png?resize=768,498 768w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png?resize=1024,664 1024w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png?resize=1536,996 1536w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2025/04/Meta-Open-Source-by-the-numbers-image-6.png?resize=192,125 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Beyond individual contributions, our projects on GitHub accumulated an additional 151,380 stars, bringing the total to a staggering 1.8 million. This growth in engagement shows strong interest and excitement for Meta Open Source projects. </p>
<h2>Thank you to the open source community</h2>
<p>At Meta, we believe open source accelerates the pace of innovation in the world. By sharing our technologies, we aim to move the industry forward while allowing other companies and individuals to use our solutions to scale more quickly and build great products.</p>
<p>At the same time, Meta Open Source projects are made possible by contributions from developers like you. Pull requests, documentation updates, social media posts, and everything in between are what build connections in our communities. Thank you all for another great year for open source.</p>
<p>To learn more about Meta Open Source, visit our <a href="https://opensource.fb.com/" target="_blank" rel="noopener">open source site</a>, subscribe to our <a href="https://www.youtube.com/channel/UCCQY962PmHabTjaHv2wJzfQ" target="_blank" rel="noopener">YouTube channel</a>, or follow us on <a href="https://www.facebook.com/MetaOpenSource" target="_blank" rel="noopener">Facebook</a>, <a href="https://www.threads.net/@metaopensource" target="_blank" rel="noopener">Threads</a>, <a href="https://x.com/MetaOpenSource" target="_blank" rel="noopener">X</a>, and <a href="https://www.linkedin.com/showcase/meta-open-source?fbclid=IwZXh0bgNhZW0CMTEAAR2fEOJNb7zOi8rJeRvQry5sRxARpdL3OpS4sYLdC1_npkEy60gBS1ynXwQ_aem_mJUK6jEUApFTW75Emhtpqw" target="_blank" rel="noopener">LinkedIn</a>.</p>]]></description>
      <link>https://engineering.fb.com/2025/04/02/open-source/meta-open-source-by-the-numbers/</link>
      <guid>https://engineering.fb.com/2025/04/02/open-source/meta-open-source-by-the-numbers/</guid>
      <pubDate>Wed, 02 Apr 2025 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Mobile GraphQL at Meta in 2025]]></title>
      <description><![CDATA[<p>Mobile GraphQL is a framework used at Meta for fetching data in mobile applications using <a href="https://graphql.org/" target="_blank" rel="noopener">GraphQL</a>, a strongly-typed, declarative query language. At Meta it handles data fetching for apps like Facebook and Instagram.</p>
<p>Sabrina, a software engineer on Meta’s Mobile GraphQL Platform Team, joins <a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> on the Meta Tech podcast to discuss the evolution and future of GraphQL. Sabrina shares how GraphQL helps her team build better user experiences for everyone on Meta’s family of apps while also making developers’ lives easier with innovative APIs and other features.</p>
<p>She also shares her team’s insights and unexpected challenges for anyone interested in building a Mobile GraphQL platform.</p>
<p>Learn more about how Mobile GraphQL is transforming product development at Meta.</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/35903175/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/0EC0ZSRVZYhKDQ3HRIqYGE" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/mobile-graphql-at-meta-in-2025/id1370910331?i=1000701235632" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://play.pocketcasts.com/podcasts/c4ede3e0-1fbf-0136-c266-7d73a919276a" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/03/31/data-infrastructure/mobile-graphql-meta-2025/</link>
      <guid>https://engineering.fb.com/2025/03/31/data-infrastructure/mobile-graphql-meta-2025/</guid>
      <pubDate>Mon, 31 Mar 2025 18:16:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Building multimodal AI for Ray-Ban Meta glasses]]></title>
      <description><![CDATA[<p>Multimodal AI – models capable of processing multiple types of inputs, like speech, text, and images – has been <a href="https://www.wsj.com/tech/ai/metas-ai-powered-ray-bans-are-life-enhancing-for-the-blind-3ae38026">transforming user experiences in the wearables space</a>.</p>
<p>With our Ray-Ban Meta glasses, <a href="https://www.meta.com/blog/ray-ban-meta-smart-glasses-new-styles-multimodal-ai-ferrari/?srsltid=AfmBOoo5_UKTrC8M-l1bQ3rwyqZzG5AygmsYPeeXE6rLSTE-xAjsVeTo">multimodal AI</a> helps the glasses see what the wearer is seeing. This means anyone wearing Ray-Ban Meta glasses can ask them questions about what they’re looking at. The glasses can provide information about a landmark, translate text you’re looking at, and much more.</p>
<p>But what does it take to bring AI into a wearable device?</p>
<p>On this episode of the Meta Tech Podcast, meet Shane, a research scientist at Meta who has spent the last seven years focusing on computer vision and multimodal AI for wearables. Shane and his team have been behind cutting-edge AI research like <a href="https://arxiv.org/pdf/2309.16058">AnyMAL</a>, a unified language model that can reason over an array of input signals including text, audio, video, and even IMU motion sensor data.</p>
<p>Shane sits down with <a href="https://www.threads.net/@passy_">Pascal Hartig</a> to share how his team is building foundational models for the Ray-Ban Meta glasses. They talk about the unique challenges of AI glasses and pushing the boundaries of AI-driven wearable technology.</p>
<p>Whether you’re an engineer, a tech enthusiast, or simply curious, this episode has something for everyone!</p>
<p>Download or listen to the episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/35484470/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe><br />
You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/3KKHyDHl6LIgTCgtv5KuVJ">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/meta-tech-podcast/id1370910331">Apple Podcasts</a></li>
<li><a href="https://pca.st/fefw0wwy">Pocket Casts</a></li>
<li><a href="https://overcast.fm/login">Overcast</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod">Instagram</a>, <a href="https://threads.net/@metatechpod">Threads</a>, or <a href="https://twitter.com/metatechpod">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com">Meta Careers</a> page.</p>
<p><strong>Links</strong></p>
<ul><li><a href="https://arxiv.org/abs/2309.16058">AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model</a></li>
<li><a href="https://www.forbes.com/sites/stevenaquino/2024/10/11/inside-the-be-my-eyes-meta-collaboration-and-the-allure-to--impact-humanity/">Inside The Be My Eyes-Meta Collaboration</a></li>
<li class="dp_VC"><a href="https://engineering.fb.com/2021/09/02/core-infra/cachelib/">Cachelib</a></li>
<li class="dp_VC"><a href="https://www.threads.net/@metaopensource">Meta Open Source on Threads</a></li>
<li><a href="https://www.wsj.com/tech/ai/metas-ai-powered-ray-bans-are-life-enhancing-for-the-blind-3ae38026">Meta’s AI-Powered Ray-Bans Are Life-Enhancing for the Blind</a></li>
</ul><p><strong>Timestamps</strong></p>
<ul><li>Intro 0:06</li>
<li>OSS News 0:56</li>
<li>Introduction Shane 1:30</li>
<li>The role of research scientist over time 3:03</li>
<li>What’s Multi-Modal AI? 5:45</li>
<li>Applying Multi-Modal AI in Meta’s products 7:21</li>
<li>Acoustic modalities beyond speech 9:17</li>
<li>AnyMAL 12:23</li>
<li>Encoder zoos 13:53</li>
<li>0-shot performance 16:25</li>
<li>Iterating on models 17:28</li>
<li>LLM parameter size 19:29</li>
<li>How do we process a request from the glasses? 21:53</li>
<li>Processing moving images 23:44</li>
<li>Scaling to billions of users 26:01</li>
<li>Where lies the optimization potential? 28:12</li>
<li>Incorporating feedback 29:08</li>
<li>Open-source influence 31:30</li>
<li>Be My Eyes Program 33:57</li>
<li>Working with industry experts at Meta 36:18</li>
<li>Outro 38:55</li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/03/04/virtual-reality/building-multimodal-ai-for-ray-ban-meta-glasses/</link>
      <guid>https://engineering.fb.com/2025/03/04/virtual-reality/building-multimodal-ai-for-ray-ban-meta-glasses/</guid>
      <pubDate>Tue, 04 Mar 2025 22:24:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[A case for QLC SSDs in the data center]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">The growth of data and need for increased power efficiency are leading to innovative storage solutions.</li>
<li class="c1" aria-level="1">HDDs have been growing in density, but not performance, and TLC flash remains at a price point that is restrictive for scaling. </li>
<li class="c1" aria-level="1">QLC technology addresses these challenges by forming a middle tier between HDDs and TLC SSDs. QLC provides higher density, improved power efficiency, and lower cost than existing TLC SSDs. </li>
</ul><p>Today, HDDs are the go-to storage solution for most data centers because of their lower cost and power footprint compared to other solutions like TLC flash. But while HDDs are growing in size, they haven’t been growing in I/O performance. In other words, the bandwidth per TB for HDDs has been dropping. This has been forcing data center engineers to meet their storage performance needs by shifting hot (frequently accessed) data to a TLC flash tier or by overprovisioning storage.</p>
<p>QLC flash as a technology has been around since 2009. Adoption has been slow because it has historically operated at lower drive capacity points – less than 32TB. In addition, its high cost and limited write endurance made it an unattractive alternative to TLC in the data center. </p>
<p>In the meantime, HDD densities have been growing without any significant increase in throughput. As more data is stored on a given drive, the need for I/O goes up proportionally. The continued densification of HDD capacity has led to a consistent decline in bandwidth per terabyte (BW/TB). This has negatively affected a portion of hot workloads and forced bytes to become stranded on HDDs.</p>
<p><img class="alignnone size-large wp-image-22313" src="https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-capacity-graph.png?w=1024" alt="" width="1024" height="641" srcset="https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-capacity-graph.png 1346w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-capacity-graph.png?resize=916,573 916w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-capacity-graph.png?resize=768,480 768w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-capacity-graph.png?resize=1024,641 1024w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-capacity-graph.png?resize=96,60 96w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-capacity-graph.png?resize=192,120 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
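<p>The BW/TB decline can be sketched numerically. The (capacity, throughput) pairs below are made-up but directionally realistic figures for successive HDD generations, not Meta fleet data:</p>

```python
# Illustrative sketch of HDD bandwidth-per-terabyte decline:
# capacity grows much faster than sustained throughput, so BW/TB drops.
# The (capacity TB, sustained MB/s) pairs are hypothetical figures.
def bw_per_tb(throughput_mb_s: float, capacity_tb: float) -> float:
    """Bandwidth available per stored terabyte, in MB/s/TB."""
    return throughput_mb_s / capacity_tb

generations = [(4, 180), (10, 220), (20, 260), (30, 280)]
for capacity, throughput in generations:
    print(f"{capacity:>2} TB drive: {bw_per_tb(throughput, capacity):5.1f} MB/s/TB")
```

<p>Even though per-drive throughput improves slightly with each generation, the bandwidth available per stored terabyte falls monotonically, which is the stranding effect described above.</p>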
<p>QLC flash occupies a unique space in the performance spectrum between HDDs and TLC SSDs, servicing workloads that still depend on performance in the 10 MB/s/TB range – i.e., where 16-20TB HDDs sit today. Additionally, there are workloads issuing large batch I/Os that do not need very high performance but still fall in the 15-20 MB/s/TB range and use TLC flash today.</p>
<p>QLC flash introduced as a tier above HDDs can meet write performance requirements with sufficient headroom in endurance specifications. The workloads being targeted are read-bandwidth-intensive with infrequent as well as comparatively low write bandwidth requirements. Since the bulk of power consumption in any NAND flash media comes from writes, we expect our workloads to consume lower power with QLC SSDs. </p>
<p>The advent of the 2Tb QLC NAND die, along with the 32-die stack becoming mainstream, illustrates how rapidly QLC flash density is scaling at both the NAND package level and the drive level.</p>
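<p>As a back-of-envelope sketch of how these numbers compound: the die and stack figures come from the text above, while the 64 package placements per drive is our assumption for illustration only:</p>

```python
# Back-of-envelope capacity math for a high-density QLC drive.
# 2Tb die and 32-die stack are given; packages-per-drive is assumed.
TBIT_PER_DIE = 2         # 2Tb (terabit) QLC NAND die
DIES_PER_PACKAGE = 32    # 32-die stack
PACKAGES_PER_DRIVE = 64  # assumed placements on one drive

package_tb = TBIT_PER_DIE * DIES_PER_PACKAGE / 8  # terabits -> terabytes
drive_tb = package_tb * PACKAGES_PER_DRIVE
print(f"{package_tb:.0f} TB per package, {drive_tb:.0f} TB per drive")
```

<p>Under these assumptions a single package reaches 8TB, and a drive with 64 such packages lands in the hundreds-of-terabytes class.</p>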
<p>We expect QLC SSD density will scale much higher than TLC SSD density in the near-term and long-term. This will bring meaningful impact to server and rack level bytes densification as well as help lower per-TB acquisition and power costs at both the drive and server level. </p>
<p><img class="alignnone size-large wp-image-22328" src="https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-TLC-comparison-chart.png?w=999" alt="" width="999" height="289" srcset="https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-TLC-comparison-chart.png 999w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-TLC-comparison-chart.png?resize=916,265 916w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-TLC-comparison-chart.png?resize=768,222 768w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-TLC-comparison-chart.png?resize=96,28 96w, https://engineering.fb.com/wp-content/uploads/2025/03/Meta-QLC-HDD-TLC-comparison-chart.png?resize=192,56 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>QLC at Meta</h2>
<p>Meta’s storage teams have started working closely with partners like <a href="https://www.purestorage.com/" target="_blank" rel="noopener">Pure Storage</a>, utilizing their DirectFlash Module (DFM) and DirectFlash software solution to bring reliable QLC storage to Meta. We are also working with other NAND vendors to integrate standard NVMe QLC SSDs into our data centers. </p>
<p>While today QLC is lower in cost than TLC, it is not yet price competitive enough for broader deployment. Still, the gains in power efficiency are material, and the use cases mentioned above are expected to benefit greatly from them. Given that HDDs are continuing to get colder as their density increases (decreasing BW/TB), and that NAND cost structures are improving with technology advancements, we believe that adding a QLC tier is the right path forward.</p>
<h2>Hardware considerations for adopting QLC</h2>
<p>While E1.S as a form factor has been great for our TLC deployments, it’s not an ideal form factor to scale our QLC roadmap because its size limits the number of NAND packages per drive.</p>
<p>The industry-standard U.2 15mm is still a prevalent form factor across SSD suppliers, and it enables us to potentially scale to 512TB capacity. E3 doesn’t bring additional value over U.2 at the moment, and the market adoption split between <a href="https://www.snia.org/forums/cmsi/knowledge/formfactors">the four variants of E3</a> makes it less attractive. Pure Storage’s DFMs can allow scaling up to 600TB with the same NAND package technology. Designing a server to support DFMs allows the drive slot to also accept U.2 drives. This strategy enables us to reap the most benefits in cost competition, schedule acceleration, power efficiency, and vendor diversity. </p>
<p>The primary benefit of QLC drives is byte density at the drive and server level, and the associated power efficiency. Within Meta, the byte density target of the QLC-based server is 6x that of the densest TLC-based server we ship today. Even though the BW/TB expected of QLC is lower than TLC’s, the QLC server’s byte density requires a more performant CPU and faster memory and network subsystems to take advantage of the media’s capabilities. </p>
<h2>Adapting our storage software for QLC </h2>
<p>Adapting Meta’s existing storage software to QLC has presented some interesting challenges. As discussed above, our QLC systems are very high in density, and we are targeting QLC SSDs as a higher-performance media compared to HDDs. This raises throughput expectations beyond that of any single server we have ever had. </p>
<p>Scaling such high throughput across CPU cores and sockets requires careful placement of data and of the compute that processes the I/O. We need to make sure we minimize data touchpoints and can separate the I/O by type. The software stack in Pure Storage’s solutions uses the Linux userspace block device driver (ublk) over io_uring to expose the storage as a regular block device, enable zero copy to eliminate data copies, and talk to their userspace FTL (DirectFlash software) in the background. </p>
<p>For other vendors, the stack uses io_uring to directly interact with the NVMe block device.</p>
<p>Further, QLC SSDs have a significant delta between read and write throughput: QLC read throughput can be 4x or more that of write throughput. What’s more, the typical read use cases are latency sensitive, so we need to make sure that the I/O delivering this massive read bandwidth does not get serialized behind the writes. This requires building, and carefully tuning, rate controllers and I/O schedulers.</p>
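<p>As a simplified illustration of the rate-controller idea (a minimal sketch, not our actual scheduler), a token bucket can cap write bandwidth so that latency-sensitive reads are never queued behind an unbounded write burst:</p>

```python
import time

class TokenBucket:
    """Minimal token-bucket write rate controller (illustrative only):
    writes spend tokens that refill at a fixed rate; writes that exceed
    the budget are deferred, leaving the device free to service reads."""

    def __init__(self, rate_mb_s: float, burst_mb: float):
        self.rate = rate_mb_s       # steady-state write budget, MB/s
        self.capacity = burst_mb    # maximum burst allowance, MB
        self.tokens = burst_mb
        self.last = time.monotonic()

    def try_submit(self, size_mb: float) -> bool:
        # Refill tokens for the time elapsed since the last call.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_mb <= self.tokens:
            self.tokens -= size_mb
            return True
        return False  # caller defers the write; reads proceed unthrottled

writes = TokenBucket(rate_mb_s=500, burst_mb=64)
print(writes.try_submit(32))  # fits in the burst allowance
print(writes.try_submit(64))  # exceeds remaining tokens, deferred
```

<p>A production scheduler is far more involved (per-stream accounting, feedback from device latency), but the core mechanism of bounding writes to protect read latency is the same.</p>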
<h2>Looking forward</h2>
<p>Meta recognizes QLC flash’s potential as a viable and promising optimization opportunity for storage cost, performance, and power for data center workloads. As flash suppliers continue to invest in advanced fab processes and package designs and increase the QLC flash production output, we anticipate substantial cost improvements, making QLC flash progressively more attractive for a broader range of data center workloads. We are excited about driving innovation, fostering collaboration, and promoting ecosystem alignment in this evolving storage space.</p>]]></description>
      <link>https://engineering.fb.com/2025/03/04/data-center-engineering/a-case-for-qlc-ssds-in-the-data-center/</link>
      <guid>https://engineering.fb.com/2025/03/04/data-center-engineering/a-case-for-qlc-ssds-in-the-data-center/</guid>
      <pubDate>Tue, 04 Mar 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How Meta is translating its Java codebase to Kotlin]]></title>
      <description><![CDATA[<p>Meta has been working to <a href="https://engineering.fb.com/2024/12/18/android/translating-java-to-kotlin-at-scale/" target="_blank" rel="noopener">shift its Android codebase from Java to Kotlin</a>, a newer language for Android development that offers some key advantages over Java. We’ve even open sourced <a href="https://github.com/fbsamples/kotlin_ast_tools" target="_blank" rel="noopener">various examples and utilities</a> we used in our migration to manipulate Kotlin code.</p>
<p>So how do you <a href="https://engineering.fb.com/2022/10/24/android/android-java-kotlin-migration/" target="_blank" rel="noopener">translate tens of millions of lines of Java code to Kotlin</a>? On this episode of the Meta Tech Podcast, <a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> sits down with Eve and Jocelyn, two software engineers on Meta’s Mobile Infra Codebases Team, to talk about taking on this challenge. They share some of the unexpected difficulties along the way, how they avoid nullability issues, and how they’re generating idiomatic code for Meta’s internal frameworks.</p>
<p>Download or listen to the podcast episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/35085305/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe><br />
You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/5uzpWr4xXMgQHDEYc7pRCq" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/meta-tech-podcast/id1370910331" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/andgfqxa" target="_blank" rel="noopener">Pocket Casts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/02/25/android/how-meta-is-translating-its-java-codebase-to-kotlin/</link>
      <guid>https://engineering.fb.com/2025/02/25/android/how-meta-is-translating-its-java-codebase-to-kotlin/</guid>
      <pubDate>Tue, 25 Feb 2025 20:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Protecting user data through source code analysis at scale]]></title>
      <description><![CDATA[<p>Meta’s Anti Scraping team focuses on preventing <a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener">unauthorized scraping</a> as part of our ongoing work to combat data misuse. In order to protect Meta’s <a href="https://engineering.fb.com/2022/11/16/culture/meta-code-review-time-improving/" target="_blank" rel="noopener">changing codebase</a> from scraping attacks, we have introduced static analysis tools into our workflow. These tools allow us to <strong>detect potential scraping vectors</strong> at scale across our Facebook, Instagram, and even parts of our Reality Labs codebases. </p>
<h2>What is scraping? </h2>
<p><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener">Scraping</a> is the automated collection of data from a website or app and can be either authorized or unauthorized. Unauthorized scrapers commonly hide themselves by mimicking the ways users would normally use a product. As a result, unauthorized scraping can be difficult to detect. At Meta, we take a number of steps to <a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener">combat scraping</a> and have a number of methods to distinguish unauthorized automated activity from legitimate usage. </p>
<h2>Proactive detection</h2>
<p>Meta’s Anti-Scraping team learns about scrapers (entities attempting to scrape our systems) through many different sources. For example, we <a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener">investigate suspected unauthorized scraping activity</a> and take actions against such entities, including sending cease-and-desist letters and disabling accounts.</p>
<p>Part of our strategy is to further develop proactive measures to mitigate the risk of scraping over and above our reactive approaches. One way we do this is by <strong>turning our attack vector criteria into static analysis rules</strong> that run automatically on our entire code base. Those static analysis tools, which include <a href="https://engineering.fb.com/2019/08/15/security/zoncolan/" target="_blank" rel="noopener">Zoncolan</a> for Hack and <a href="https://engineering.fb.com/2020/08/07/security/pysa/" target="_blank" rel="noopener">Pysa</a> for Python, run automatically for their respective codebases and are built in-house, allowing us to customize them for Anti-Scraping purposes. This approach can identify potential issues early and ensure product development teams have an opportunity to remediate prior to launch.</p>
<p>Static analysis tools enable us to apply learnings across events to systematically prevent similar issues from existing in our codebase. They also help us create best practices when developing code to combat unauthorized scraping.</p>
<h2>Developing static analysis rules</h2>
<p>Our static analysis tools (like <a href="https://engineering.fb.com/2019/08/15/security/zoncolan/" target="_blank" rel="noopener">Zoncolan</a> and <a href="https://engineering.fb.com/2020/08/07/security/pysa/" target="_blank" rel="noopener">Pysa</a>) focus on tracking data flow through a program.</p>
<p>Engineers define classes of issues using the following:</p>
<ul><li class="c1" aria-level="1"><em>Sources</em> are where the data originates. For potential scraping issues, these are mostly user-controlled parameters, as these are the avenues in which scrapers control the data they could receive.</li>
<li class="c1" aria-level="1"><em>Sinks</em> are where the data flows to. For scraping, the sink is usually when the data flows back to the user.</li>
<li class="c1" aria-level="1">An <em>Issue</em> is found when our tools detect a possibility of data flow from a source to a sink.</li>
</ul><p>For example, assume the “source” to be the user-controlled “count” parameter that determines the number of results loaded, and the “sink” to be the data that is returned to the user. Here, the user-controlled “count” parameter is an entry point for a scraper, who can manipulate its value to extract more data than the application intended. When our tools suspect that there is a code flow between such sources and sinks, they alert the team for further triage.</p>
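<p>As a toy illustration of the source/sink model (a drastic simplification of what tools like Zoncolan and Pysa actually do, with hypothetical names throughout), one can tag user-controlled values and flag any flow that reaches a sink:</p>

```python
# Toy sketch of source-to-sink taint tracking, not a real analyzer:
# values read from the request are tagged as tainted, and an issue is
# flagged whenever a tainted value reaches a sink.

class Tainted(str):
    """A string value originating from user input (a 'source')."""

def source(request_params: dict, key: str) -> Tainted:
    # Source: user-controlled request parameter.
    return Tainted(request_params[key])

def sink(value, issues: list) -> None:
    # Sink: data flowing back to the user; tainted values raise an issue.
    if isinstance(value, Tainted):
        issues.append(f"user-controlled value {value!r} flows to response")

found: list = []
count = source({"count": "10000"}, "count")
sink(count, found)
print(found)
```

<p>Real static analyzers track taint through assignments, calls, and data structures across the whole codebase rather than at runtime, but the source/sink/issue vocabulary is the same.</p>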
<h2>An example of static analysis</h2>
<p>Building on the example above, see the below mock code excerpt loading the number of followers for a page:</p>
# views">
<pre class="line-numbers"><code class="language-none"># views/followers.py
async def get_followers(request: HttpRequest) -&gt; HttpResponse:
    viewer = request.GET['viewer_id']
    target = request.GET['target_id']
    count = request.GET['count']
    if can_see(viewer, target):
        followers = load_followers(target, count)
        return followers

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...</code></pre>
<p>In the example above, the mock endpoint backed by get_followers is a potential scraping attack vector, since the “target_id” and “count” parameters control whose information is loaded and how many followers are returned. Under usual circumstances, the endpoint would be called with parameters that match what the user is browsing on screen. However, scrapers can abuse such an endpoint by specifying arbitrary users and large counts, which can result in entire follower lists being returned in a single request. By doing so, scrapers can try to evade rate-limiting systems, which limit how many requests a user can send to our systems in a defined timeframe. These systems are set in place to stop any scraping attempts at a high level.</p>
<p>Since our static analysis systems run automatically on our codebase, the Anti-Scraping team can identify such scraping vectors proactively and make remediations before the code is introduced to our production systems. For example, the recommended fix for the code above is to cap the maximum number of results that can be returned at a time:</p>
# views">
<pre class="line-numbers"><code class="language-none"># views/followers.py
async def get_followers(request: HttpRequest) -&gt; HttpResponse:
    viewer = request.GET['viewer_id']
    target = request.GET['target_id']
    count = min(request.GET['count'], MAX_FOLLOWERS_RESULTS)
    if can_see(viewer, target):
        followers = load_followers(target, count)
        return followers

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...</code></pre>
<p>Following the fix, the maximum number of results retrieved by each request is limited to MAX_FOLLOWERS_RESULTS. Such a change would not affect regular users; it only interferes with scrapers, forcing them to send orders of magnitude more requests, which would then trigger our rate-limiting systems.</p>
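<p>The effect of the cap on a scraper’s request volume is simple arithmetic. The figures below are hypothetical, including the assumed cap of 100 results per request:</p>

```python
import math

def requests_needed(total_followers: int, per_request_cap: int) -> int:
    """Requests a scraper must issue to enumerate a follower list."""
    return math.ceil(total_followers / per_request_cap)

# Hypothetical: without an effective cap, one request could fetch a
# 1,000,000-follower list; capped at 100 results per request, the same
# scrape needs 10,000 requests, enough volume to trip rate limiting.
print(requests_needed(1_000_000, 1_000_000))  # 1
print(requests_needed(1_000_000, 100))        # 10000
```
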
<h2>The limitations of static analysis in combating unauthorized scraping </h2>
<p>Static analysis tools are not designed to catch all possible unauthorized scraping issues. Because unauthorized scrapers can mimic the legitimate ways that people use Meta’s products, we cannot fully prevent all unauthorized scraping without affecting people’s ability to use our apps and websites the way they enjoy. Since unauthorized scraping is both a common and complex challenge to solve, <a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener">we combat scraping by taking a more holistic approach</a> to staying ahead of scraping actors.</p>]]></description>
      <link>https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/</link>
      <guid>https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/</guid>
      <pubDate>Tue, 18 Feb 2025 21:30:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Unlocking global AI potential with next-generation subsea infrastructure]]></title>
      <description><![CDATA[<p>Today, we’re announcing our most ambitious subsea cable endeavor yet: Project Waterworth. Once complete, the project will reach five major continents and span over 50,000 km (longer than the Earth’s circumference), making it the world’s longest subsea cable project using the highest-capacity technology available. </p>
<p>Project Waterworth will bring industry-leading connectivity to the U.S., India, Brazil, South Africa, and other key regions. This project will enable greater economic cooperation, facilitate digital inclusion, and open opportunities for technological development in these regions. For example, in India, where we’ve already seen significant growth and investment in digital infrastructure, Waterworth will help accelerate this progress and support the country’s ambitious plans for its digital economy.</p>
<p>Subsea cable projects, such as Project Waterworth, are the backbone of global digital infrastructure, accounting for more than <a href="https://globaldigitalinclusion.org/wp-content/uploads/2024/01/GDIP-Good-Practices-for-Subsea-Cables-Policy-Investing-in-Digital-Inclusion.pdf">95% of intercontinental traffic</a> across the world’s oceans to seamlessly enable digital communication, video experiences, online transactions, and more. Project Waterworth will be a multi-billion dollar, multi-year investment to strengthen the scale and reliability of the world’s digital highways by opening three new oceanic corridors with the abundant, high-speed connectivity needed to drive AI innovation around the world. </p>
<p><img class="alignnone size-large wp-image-22288" src="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>We’ve driven infrastructure innovation with various partners over the past decade, <a href="https://engineering.fb.com/2021/03/28/connectivity/echo-bifrost/">developing</a> more than 20 <a href="https://engineering.fb.com/2021/09/28/connectivity/2africa-pearls/">subsea cables</a>. This includes multiple deployments of industry-leading subsea cables of 24 fiber pairs – compared to the typical 8 to 16 fiber pairs of other new systems. These investments enable unmatched connectivity for our world’s increasing digital needs. </p>
<p>With Project Waterworth, we continue to advance engineering design to maintain cable resilience, enabling us to build the longest 24 fiber pair cable project in the world and enhance overall speed of deployment. We are also deploying first-of-its-kind routing, maximizing the cable laid in deep water — at depths up to 7,000 meters — and using enhanced burial techniques in high-risk fault areas, such as shallow waters near the coast, to avoid damage from ship anchors and other hazards.</p>
<p>AI is revolutionizing every aspect of our lives, from how we interact with each other to how we think about infrastructure – and Meta is at the forefront of building these innovative technologies. As AI continues to transform industries and societies around the world, it’s clear that capacity, resilience, and global reach are more important than ever to support leading infrastructure. With Project Waterworth we can help ensure that the benefits of AI and other emerging technologies are available to everyone, regardless of where they live or work.</p>]]></description>
      <link>https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/</link>
      <guid>https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/</guid>
      <pubDate>Fri, 14 Feb 2025 17:28:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Looking back at our Bug Bounty program in 2024]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">In 2024, our bug bounty program awarded more than $2.3 million in bounties, bringing our total bounties since the creation of our program in 2011 to over $20 million. </li>
<li class="c1" aria-level="1">As part of our <a href="https://about.fb.com/news/2019/01/designing-security-for-billions/" target="_blank" rel="noopener">defense-in-depth strategy</a>, we continued to collaborate with the security research community in the areas of GenAI, AR/VR, ads tools, and more. </li>
<li class="c1" aria-level="1">We also celebrated the security research done by our bug bounty community as part of our annual bug bounty summit and many other industry events. </li>
</ul><p>As we embark on a new year, we’re sharing several updates on our work with external bug bounty security researchers to help protect our global community and platforms. This includes new payout stats, details on what’s in scope for GenAI-related bug reports, and a recap of some of our engagements throughout last year with bug bounty researchers. </p>
<h2>Highlights from Meta’s bug bounty program in 2024</h2>
<p>In 2024, we received nearly 10,000 bug reports and paid out more than $2.3 million in bounty awards to researchers around the world who helped make our platforms safer.</p>
<ul><li class="c1" aria-level="1">Since 2011, we have paid out more than $20 million in bug bounties. </li>
<li class="c1" aria-level="1">Last year, we received nearly 10,000 reports and paid out awards on nearly 600 valid reports.</li>
<li class="c1" aria-level="1">In 2024, we awarded more than $2.3 million to nearly 200 researchers from more than 45 countries. </li>
<li class="c1" aria-level="1">The top three countries based on bounties awarded last year are India, Nepal, and the United States.</li>
</ul><h2>Engaging researchers in bug hunting in GenAI </h2>
<p>After <a href="https://about.fb.com/news/2023/09/building-generative-ai-features-responsibly/" target="_blank" rel="noopener">making our generative AI features available to security researchers</a> through our long-running bug bounty program in 2023, Meta has continued to roll out new GenAI products and tools. In 2024, we provided more details to our research community on <a href="https://bugbounty.meta.com/scope/" target="_blank" rel="noopener">what’s in scope for bug bounty reports related to our large language models (LLMs)</a>. We now welcome reports that demonstrate integral privacy or security issues associated with Meta’s LLMs, including being able to extract training data through tactics like model inversion or extraction attacks. </p>
<p>We have already received several impactful reports focused on our GenAI tools, and we look forward to continuing this important work with our community of researchers to help ensure the security and integrity of our GenAI tools.</p>
<h2>Encouraging security research in ads audience and hardware products </h2>
<p>This year, we prioritized our efforts to steer security research by the bug bounty community towards a number of product surfaces, including:</p>
<p><strong>Ads audience tools designed to help people choose a target audience for their ads:</strong> <a href="https://bugbounty.meta.com/payout-guidelines/ads-audience/" target="_blank" rel="noopener">We introduced new payout guidelines</a> to provide transparency to our security researchers on how we assess the impact of the reports we receive about potential security bugs in Meta’s <a href="https://www.facebook.com/business/help/717368264947302" target="_blank" rel="noopener">ads audience tools</a>. We cap the maximum base payout for discovering PII (name, email, phone number, state, ZIP, gender) for an ads audience at $30,000 and then apply any applicable deductions based on the required user interaction, prerequisites, and any other mitigating factors to arrive at the final awarded bounty amount. More details <a href="https://bugbounty.meta.com/payout-guidelines/ads-audience/" target="_blank" rel="noopener">here</a>.</p>
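<p>As a purely illustrative sketch, a cap-then-deduct calculation might look as follows; the $30,000 cap comes from the guidelines above, while the deduction categories and percentages are invented:</p>

```python
# Toy sketch of a cap-then-deduct bounty calculation. The $30,000 cap
# for ads-audience PII reports is stated in the payout guidelines; the
# deduction values below are invented for illustration only.
ADS_AUDIENCE_PII_CAP = 30_000

def final_bounty(base_payout: int, deductions: list) -> int:
    """Cap the base payout, then apply each deduction multiplicatively."""
    amount = min(base_payout, ADS_AUDIENCE_PII_CAP)
    for d in deductions:
        amount *= (1 - d)
    return round(amount)

# A hypothetical report needing significant user interaction (25% deduction)
# and an uncommon prerequisite (10% deduction):
payout = final_bounty(base_payout=30_000, deductions=[0.25, 0.10])
```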
<p><strong>Mixed reality hardware products:</strong> As Meta continues to roll out <a href="https://engineering.fb.com/2023/09/12/security/meta-quest-2-defense-through-offense/" target="_blank" rel="noopener">mixed reality products</a>, we work to encourage security research into these hardware and AI-driven technologies to help us find and fix potential bugs as quickly as possible. In 2024, our bug bounty researchers contributed reports on potential issues in Quest that could have impacted safety settings or led to memory corruption. We also brought our Quest 3 and Ray-Ban Meta glasses to <a href="http://hardwear.io" target="_blank" rel="noopener">hardwear.io USA 2024</a>, a leading conference that brings together top hardware hackers to test new hardware products and help uncover potential vulnerabilities. </p>
<h2>Building and celebrating the global bug bounty community</h2>
<p>As part of our continuous commitment to security research – both inside and outside Meta – we  invested in enabling open collaboration with our bug bounty community by:</p>
<p><strong>Organizing community events and presenting joint research:</strong> We hosted our annual Meta Bug Bounty Researcher Conference (MBBRC) in Johannesburg, South Africa, bringing together 60 of our top researchers from all over the world. We received more than 100 bug reports and awarded over $320,000 in total. We also co-presented talks at EkoParty, DEF CON, Hardwear.io, Pwn2own, and other security research summits. This year, we’re pleased to share that the 2025 MBBRC will be hosted in Tokyo, Japan, May 12-15. Stay tuned for more details in 2025.</p>
<p><strong>Celebrating long-time researchers:</strong> One of our most long-standing and prolific researchers, <a href="https://philippeharewood.com/" target="_blank" rel="noopener">Philippe Harewood</a>, reached a 10-year milestone with over 500 valid reports paid out by our bug bounty program. Noteworthy contributions over the years include Philippe’s groundbreaking research on <a href="https://www.youtube.com/watch?v=vwUxRCmgwSw" target="_blank" rel="noopener">Instagram access token leak</a>, <a href="https://philippeharewood.com/bypass-video-capture-limit-on-ray-ban-stories/" target="_blank" rel="noopener">video capture limit bypass on Ray-Ban stories</a>, and more. </p>
<p><strong>Providing resources and timely updates for the research community:</strong> The <a href="http://bugbounty.meta.com" target="_blank" rel="noopener">Meta Bug Bounty website</a> serves as a centralized hub for all bug bounty news and updates. Researchers can also follow the program on <a href="http://www.instagram.com/metabugbounty" target="_blank" rel="noopener">Instagram</a>, <a href="http://www.facebook.com/BugBounty" target="_blank" rel="noopener">Facebook</a>, and <a href="http://www.x.com/metabugbounty" target="_blank" rel="noopener">X</a>, for quick updates.</p>
<h2>Looking ahead</h2>
<p>Meta’s bug bounty team looks forward to introducing new initiatives and continuing to engage with our existing community and new researchers who are just getting started. Additionally, we will continue to provide seasoned experts with unique opportunities to test unreleased features through our private bug bounty tracks.</p>
<p>For the past 14 years, our bug bounty program has fostered a collaborative relationship with external researchers that has helped keep our platforms safer and more secure. We would like to extend a heartfelt thanks to everyone who contributed to the growth of our program in 2024.</p>]]></description>
      <link>https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/</link>
      <guid>https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/</guid>
      <pubDate>Thu, 13 Feb 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Revolutionizing software testing: Introducing LLM-powered bug catchers]]></title>
      <description><![CDATA[<h2>WHAT IT IS</h2>
<p><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">Meta’s Automated Compliance Hardening (ACH) tool</a> is a system for mutation-guided, LLM-based test generation. ACH hardens platforms against regressions by generating undetected faults (mutants) in source code that are specific to a given area of concern and using those same mutants to generate tests. When applied to privacy, for example, ACH automates the process of searching for privacy-related faults and preventing them from entering our systems in the future, ultimately hardening our code bases to reduce risk of any privacy regression.</p>
<p>ACH automatically generates unit tests that target a particular kind of fault. We describe the faults we care about to ACH in plain text. The description can be incomplete, and even self-contradictory, yet ACH still generates tests that it proves will catch bugs of the kind described.</p>
<p>Traditionally, automated test generation techniques sought merely to increase code coverage. As every tester knows, this is only part of the solution, because increasing coverage doesn’t necessarily find faults. ACH is a radical departure from this tradition because it targets specific faults rather than uncovered code, although it often also increases coverage in the process of targeting faults. Furthermore, because ACH is founded on the principles of <a href="https://arxiv.org/abs/2402.04380" target="_blank" rel="noopener">Assured LLM-based Software Engineering</a>, it provides verifiable assurances that its tests do catch the kind of faults described.</p>
<p>Our new research paper, “<a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">Mutation-Guided LLM-based Test Generation at Meta</a>,” gives details of the underlying scientific foundations for ACH and how we apply ACH to privacy testing, but this approach can be applied to any sort of regression testing.</p>
<h2>HOW IT WORKS</h2>
<p>Mutation testing, where faults (mutants) are deliberately introduced into source code (using version control to keep them away from production) to assess how well an existing testing framework can detect these changes, has been <a href="https://web.eecs.umich.edu/~weimerw/2022-481F/readings/mutation-testing.pdf" target="_blank" rel="noopener">researched for decades</a>. But, despite this, mutation testing has remained difficult to deploy. </p>
<p><a href="http://crest.cs.ucl.ac.uk/fileadmin/crest/sebasepaper/JiaH10.pdf" target="_blank" rel="noopener">In earlier approaches</a>, mutants themselves would be automatically generated (most often using a rule-based approach). But this method would result in mutants that weren’t particularly realistic in terms of how much of a concern they actually represent.</p>
<p>On top of that, even with the mutants being automatically generated, humans would still have to manually write the tests that would kill the mutants (catch the faults).</p>
<p>Writing these tests is a painstaking and laborious process. So engineers faced a two-pronged problem: the automatically generated mutants were often unrealistic, and even after doing all of the work to write a test by hand, there was no guarantee the test would actually catch the mutant.</p>
<p>By leveraging LLMs, we can generate mutants that represent realistic concerns and also save on human labor by generating tests to catch the faults automatically as well. ACH marries automated test generation techniques with the capabilities of large language models (LLMs) to generate mutants that are highly relevant to an area of testing concern as well as tests that are guaranteed to catch bugs that really matter.</p>
<p>Broadly, ACH works in three steps:</p>
<ol><li class="c1" aria-level="1">An engineer describes the kind of bugs they’re concerned about.</li>
<li class="c1" aria-level="1">ACH uses that description to automatically generate lots of bugs.</li>
<li class="c1" aria-level="1">ACH uses the generated bugs to automatically generate lots of tests that catch them.</li>
</ol><p>At Meta we’ve <a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">applied ACH-assisted testing to several of our platforms</a>, including Facebook Feed, Instagram, Messenger, and WhatsApp. Based on our own testing, we’ve concluded that engineers found ACH useful for hardening code against specific concerns and found other benefits even when tests generated by ACH don’t directly tackle a specific concern.</p>
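<p>The three steps above can be sketched schematically as follows; the LLM calls are stubbed with trivial placeholders, since the actual prompts and equivalence checks are described in the paper rather than here:</p>

```python
# Schematic sketch of the three-step ACH loop. The LLM calls are stubbed
# with string-based placeholders invented for illustration; the real
# system's prompts, mutant filtering, and test verification are in the paper.

def generate_mutants(source: str, concern: str) -> list:
    # Stub: a real system would ask an LLM for faulty variants of `source`
    # matching the plain-text `concern`.
    return [source.replace("check_consent(user)", "True")]

def generate_killing_test(original: str, mutant: str) -> str:
    # Stub: a real system would ask an LLM for a unit test that passes on
    # `original` and fails on `mutant`, then verify that by running it.
    return f"assert behaves_differently({original!r}, {mutant!r})"

def ach_pipeline(source: str, concern: str) -> list:
    """1) describe the concern, 2) generate mutants, 3) generate killing tests."""
    tests = []
    for mutant in generate_mutants(source, concern):
        if mutant != source:  # discard mutants equivalent to the original
            tests.append(generate_killing_test(source, mutant))
    return tests

tests = ach_pipeline(
    source="if check_consent(user): share_data(user)",
    concern="data shared without user consent",
)
```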
<figure id="attachment_22243" aria-describedby="caption-attachment-22243" class="wp-caption alignnone c2"><img class="size-large wp-image-22243" src="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png 1534w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22243" class="wp-caption-text">A top-level overview of the architecture of the ACH system. The system leverages LLMs to generate faults, check them against possible equivalents, and then generate tests to catch those faults.</figcaption></figure><h2>WHY IT MATTERS</h2>
<p>Meta has a very large number of data systems and uses <a href="https://engineering.fb.com/2022/07/27/developer-tools/programming-languages-endorsed-for-server-side-use-at-meta/" target="_blank" rel="noopener">many different programming languages</a>, frameworks, and services to power our family of apps and products. But, how are our thousands of engineers across the world ensuring that their code is reliable and won’t generate bugs that would negatively impact application performance, leading to privacy risk? The answer lies with LLMs. </p>
<p>LLM-based test generation and LLM-based mutant generation are not new, but this is the first time they’ve been combined and deployed in large-scale industrial systems. Generating mutants and the tests to kill them has traditionally been difficult to scale. Since LLMs are probabilistic and don’t need to rely on rigidly defined rules to make decisions, they allow us to tackle both sides of this equation – generating mutations and tests to kill them – very efficiently and with a high level of accuracy. </p>
<p>This new approach significantly modernizes this form of automated test generation and helps software engineers take in concerns from a variety of sources (previous faults, colleagues, user requirements, regulatory requirements, etc.) and efficiently convert them from freeform text into actionable tests – with the guarantee that the test will catch the fault they’re looking for.</p>
<p>ACH can be applied to any class of faults and can have a significant impact on hardening against future regressions and optimizing testing itself.</p>
<h2>WHAT’S NEXT</h2>
<p>Our novel approach combines LLM-based test generation and mutant generation to help automate complex technical organizational workflows in this space. This innovation has the potential to simplify risk assessments, reduce cognitive load for developers, and ultimately create a safer online ecosystem. We’re committed to expanding deployment areas, developing methods to measure mutant relevance, and detecting existing faults to drive industry-wide adoption of automated test generation in compliance.</p>
<p>We will be sharing more developments and encourage you to watch this space.</p>
<h2>READ THE PAPER</h2>
<p><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener">Mutation-Guided LLM-based Test Generation at Meta</a></p>]]></description>
      <link>https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/</link>
      <guid>https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/</guid>
      <pubDate>Wed, 05 Feb 2025 19:30:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Data logs: The latest evolution in Meta’s access tools]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing how Meta built support for data logs, which provide people with additional data about how they use our products.</li>
<li class="c1" aria-level="1">Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand. </li>
</ul><p>Users have a variety of tools they can use to manage and access their information on Meta platforms. Meta is always looking for ways to enhance its access tools in line with technological advances, and in February 2024 we began including <a href="https://www.facebook.com/help/384437594328726#data-logs">data logs</a> in the <a href="https://www.facebook.com/help/212802592074644">Download Your Information</a> (DYI) tool. Data logs include things such as information about content you’ve viewed on Facebook. Some of this data can be unique, but it can also include additional details about information that we already make available elsewhere, such as through a user’s profile, products like <a href="https://www.facebook.com/help/1700142396915814">Access Your Information</a> or <a href="https://www.facebook.com/help/256333951065527">Activity Log</a>, or account downloads. This update is the result of significant investments over a number of years by a large cross-functional team at Meta, and consultations with experts on how to continue enhancing our access tools.</p>
<p>Data logs are just the most recent example of how Meta gives users the power to access their data on our platforms. We have a long history of giving users transparency and control over their data:</p>
<ul><li class="c1" aria-level="1">2010: Users can retrieve a copy of their information through DYI. </li>
<li class="c1" aria-level="1">2011: Users can easily review actions taken on Facebook through <a href="https://www.facebook.com/help/256333951065527">Activity Log</a>.</li>
<li class="c1" aria-level="1">2014: Users have more transparency and control over ads they see with the “<a href="https://www.facebook.com/help/794535777607370#advertiser-choices">Why Am I Seeing This Ad?</a>” feature on Facebook.</li>
<li class="c1" aria-level="1">2018: Users have a curated experience to find information about them through <a href="https://about.fb.com/news/2018/03/privacy-shortcuts/">Access Your Information</a>. Users can retrieve a copy of their information on Instagram through <a href="https://help.instagram.com/181231772500920">Download Your Data</a> and on WhatsApp through <a href="https://faq.whatsapp.com/526463418847093/">Request Account Information</a>.</li>
<li class="c1" aria-level="1">2019: Users can view their activity off Meta-technologies and clear their history. Meta joins the <a href="https://engineering.fb.com/2019/12/02/security/data-transfer-project/">Data Transfer Project</a> and has continuously led the development of shared technologies that enable users to port their data from one platform to another. </li>
<li class="c1" aria-level="1">2020: Users continue to <a href="https://about.fb.com/news/2020/03/data-access-tools/">receive more information</a> in DYI such as additional information about their interactions on Facebook and Instagram.</li>
<li class="c1" aria-level="1">2021: Users can more easily navigate categories of information in <a href="https://about.fb.com/news/2021/01/introducing-the-new-access-your-information/">Access Your Information</a>.</li>
<li class="c1" aria-level="1">2023: Users can more easily use our tools as access features are consolidated within <a href="https://www.facebook.com/help/943858526073065">Accounts Center</a>.</li>
<li class="c1" aria-level="1">2024: Users can access data logs in Download Your Information.</li>
</ul><h2>What are data logs?</h2>
<p>In contrast to our production systems, which can be queried billions of times per second thanks to techniques like caching, Meta’s data warehouse, powered by Hive, is designed to support low volumes of large queries for things like analytics and cannot scale to the query rates needed to power real-time data access.</p>
<p>We created data logs as a solution to provide users who want more granular information with access to data stored in Hive. In this context, an individual data log entry is a formatted version of a single row of data from Hive that has been processed to make the underlying data transparent and easy to understand.</p>
<p>Obtaining this data from Hive in a format that can be presented to users is not straightforward. Hive tables are partitioned, typically by date and time, so retrieving all the data for a specific user requires scanning through every row of every partition to check whether it corresponds to that user. Facebook has over 3 billion monthly active users, meaning that, assuming an even distribution of data, ~99.999999967% of the rows in a given Hive table might be processed for such a query even though they won’t be relevant. </p>
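<p>The quoted figure follows directly from the ratio of one user's rows to the total, assuming the even distribution stated above:</p>

```python
# Quick check of the figure quoted above: with ~3 billion users and an
# even distribution of rows, the fraction of scanned rows that are
# irrelevant to a single-user query is 1 - 1/3e9.
monthly_active_users = 3_000_000_000
irrelevant_fraction = 1 - 1 / monthly_active_users
percentage = irrelevant_fraction * 100  # ~99.999999967%
```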
<p>Overcoming this fundamental limitation was challenging, and adapting our infrastructure to enable it has taken multiple years of concerted effort. Data warehouses are commonly used in a range of industry sectors, so we hope that this solution should be of interest to other companies seeking to provide access to the data in their data warehouses.</p>
<h3>Initial designs</h3>
<p>When we started designing a solution to make data logs available, we first considered whether it would be feasible to simply run queries for each individual as they requested their data, despite the fact that these queries would spend almost all of their time processing irrelevant data. Unfortunately, as we highlighted above, the distribution of data at Meta’s scale makes this approach infeasible and incredibly wasteful: It would require scanning entire tables once per DYI request, scaling linearly with the number of individual users that initiate DYI requests. These performance characteristics were infeasible to work around. </p>
<p>We also considered caching data logs in an online system capable of supporting a range of indexed per-user queries. This would make the per-user queries relatively efficient. However, copying and storing data from the warehouse in these other systems presented material computational and storage costs that were not offset by the overall effectiveness of the cache, making this infeasible as well. </p>
<h3>Current design</h3>
<p>Finally, we considered whether it would be possible to build a system that amortizes the cost of expensive full table scans by batching individual users’ requests into a single scan. After significant engineering investigation and prototyping, we determined that this approach offers infrastructure teams sufficiently predictable performance characteristics to make it possible. Even with this batching over short periods of time, given the relatively small size of the batches of requests for this information compared to the overall user base, most of the rows considered in a given table are filtered and scoped out as they are not relevant to the users whose data has been requested. This is a necessary trade-off to enable this information to be made accessible to our users.</p>
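<p>The amortization argument can be put in back-of-envelope terms (all numbers invented for illustration): a full table scan costs roughly the same whether it serves one requester or a whole batch, so the per-request cost falls with the batch size:</p>

```python
# Back-of-envelope cost model for batching (numbers invented for
# illustration). One full table scan serves every request in the batch,
# so the per-request scan cost falls as 1/batch_size.
def per_request_scan_cost(full_scan_cost: float, batch_size: int) -> float:
    return full_scan_cost / batch_size

unbatched = per_request_scan_cost(full_scan_cost=1000.0, batch_size=1)
batched = per_request_scan_cost(full_scan_cost=1000.0, batch_size=500)
```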
<p>In more detail, following a pre-defined schedule, a job is triggered using Meta’s internal task-scheduling service to organize the most recent requests, over a short time period, for users’ data logs into a single batch. This batch is submitted to a system built on top of Meta’s <a href="https://atscaleconference.com/workflowsfacebook-powering-developer-productivity-and-automation-at-facebook-scale/">Core Workflow Service</a> (CWS). CWS provides a useful set of guarantees that enable long-running tasks to be executed with predictable performance characteristics and reliability guarantees that are critical for complex multi-step workflows.</p>
<p>Once the batch has been queued for processing, we copy the list of user IDs who have made requests in that batch into a new Hive table. For each data logs table, we initiate a new worker task that fetches the relevant metadata describing how to correctly query the data. Once we know what to query for a specific table, we create a task for each partition that executes a job in <a href="https://www.youtube.com/watch?v=4T-MCYWrrOw">Dataswarm</a> (our data pipeline system). This job performs an INNER JOIN between the table containing requesters’ IDs and the column in each table that identifies the owner of the data in that row. As tables in Hive may leverage security mechanisms like access control lists (ACLs) and privacy protections built on top of Meta’s <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/">Privacy Aware Infrastructure</a>, the jobs are configured with appropriate security and privacy policies that govern access to the data.</p>
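<p>As a toy in-memory analogue of this join-and-split step (the real pipeline runs on Hive, Dataswarm, and PySpark; the data below is invented):</p>

```python
from collections import defaultdict

# In-memory analogue of the batched INNER JOIN and per-user split
# described above. The real pipeline runs as Hive/Dataswarm jobs and a
# PySpark splitter; this toy version shows the data flow on plain lists.
partition = [  # one Hive partition: (owner_user_id, row_payload)
    (101, "viewed video A"),
    (999, "viewed video B"),
    (102, "clicked link C"),
    (101, "viewed photo D"),
]
requester_ids = {101, 102}  # the batch of users who requested their data

# INNER JOIN: keep only rows owned by someone in the requester batch.
joined = [(uid, payload) for uid, payload in partition if uid in requester_ids]

# Split step: one output per user per partition.
per_user_files = defaultdict(list)
for uid, payload in joined:
    per_user_files[uid].append(payload)
```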
<figure id="attachment_22235" aria-describedby="caption-attachment-22235" class="wp-caption alignnone c2"><img class="size-large wp-image-22235" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?w=1024" alt="" width="1024" height="417" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png 4833w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=916,373 916w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=768,312 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=1024,417 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=1536,625 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=2048,833 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=96,39 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=192,78 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22235" class="wp-caption-text">A diagram showing how a workflow gathers inputs and runs sub-workflows for each table and partition to reliably gather data logs. The data logs workflow will gather metadata, then prepare requester IDs, then run parallel processes for each table. Each table workflow prepares the table, then runs approximately sequential jobs for each partition. Some partitions across tables may be processed in parallel.</figcaption></figure><p>Once this job is completed, it outputs its results to an intermediate Hive table containing a combination of the data logs for all users in the current batch. This processing is expensive, as the INNER JOIN requires a full table scan across all relevant partitions of the Hive table, an operation which may consume significant computational resources. 
The output table is then processed using PySpark to identify the relevant data and split it into individual files for each user’s data in a given partition. </p>
<figure id="attachment_22234" aria-describedby="caption-attachment-22234" class="wp-caption alignnone c2"><img class="size-large wp-image-22234" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?w=1024" alt="" width="1024" height="430" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png 3166w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=916,385 916w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=768,322 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=1024,430 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=1536,645 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=2048,860 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=96,40 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=192,81 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22234" class="wp-caption-text">A diagram showing that the processing of each partition splits the data into one file per user per partition. Both users A and B will have a file representing Table 1 Partition 1, and so on.</figcaption></figure><p>The result of these batch operations in the data warehouse is a set of comma delimited text files containing the unfiltered raw data logs for each user. This raw data is not yet explained or made intelligible to users, so we run a post-processing step in Meta’s Hack language to apply privacy rules and filters and render the raw data into meaningful, well-explained HTML files. We do this by passing the raw data through various renderers, discussed in more detail in the next section. 
Finally, once all of the processing is completed, the results are aggregated into a ZIP file and made available to the requester through the DYI tool.</p>

<h2>Lessons learned from building data logs</h2>
<p>Throughout the development of this system we found it critical to develop robust checkpointing mechanisms that enable incremental progress and resilience in the face of errors and temporary failures. While processing everything in a single pass may reduce latency, the risk is that a single issue will cause all of the previous work to be wasted. For example, in addition to jobs timing out and failing to complete, we also experienced errors where full-table-scan queries would run out of memory and fail partway through processing. The capability to resume work piecemeal increases resiliency and optimizes the overall throughput of the system.</p>
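<p>The pattern is easy to sketch in Python: record a checkpoint after each partition succeeds, and skip already-completed partitions on retry. Names and the simulated out-of-memory failure below are illustrative, not Meta's actual API:</p>

```python
def run_with_checkpoints(partitions, process, completed):
    """Process partitions in order, checkpointing into `completed` after
    each success. In a real system the checkpoint set would be persisted,
    so a retried job resumes where the previous attempt left off."""
    for part in partitions:
        if part in completed:
            continue        # finished in an earlier attempt; skip
        process(part)       # may raise; earlier checkpoints survive
        completed.add(part)

# Simulate a query that runs out of memory on its first attempt only.
attempts = []
def flaky(part):
    attempts.append(part)
    if part == "part-2" and attempts.count("part-2") == 1:
        raise MemoryError("full-table scan ran out of memory")

parts = ["part-1", "part-2", "part-3"]
done = set()
try:
    run_with_checkpoints(parts, flaky, done)
except MemoryError:
    pass                    # done == {"part-1"} at this point
run_with_checkpoints(parts, flaky, done)  # retry resumes at part-2
```

Only the failed partition is recomputed; part-1's work is never repeated, which is the throughput win the diagram below illustrates.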
<figure id="attachment_22233" aria-describedby="caption-attachment-22233" class="wp-caption alignnone c2"><img class="size-large wp-image-22233" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?w=1024" alt="" width="1024" height="432" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png 1787w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=916,386 916w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=768,324 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=1024,432 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=1536,648 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=96,41 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22233" class="wp-caption-text">A diagram explaining that checkpointing after incremental task completions can optimize system throughput. Approach 1 shows one task failure on step 2, which is recomputed successfully. Approach 2 combines everything into one step, but when it fails, it winds up taking longer overall.</figcaption></figure><p>Ensuring data correctness is also very important. As we built the component that splits combined results into individual files for each user, we encountered an issue that affected this correctness guarantee and could have led to data being returned to the wrong user. The root cause of the issue was a Spark concurrency bug that partitioned data incorrectly across the parallel Spark workers. To prevent this issue, we built verification in the post-processing stage to ensure that the user ID column in the data matches the identifier for the user whose logs we are generating. 
This means that even if similar bugs were to occur in the core data processing infrastructure, we would prevent any incorrect data from being shown to users.</p>
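<p>A minimal sketch of such a verification gate, assuming each row carries the requesting user's ID in a <code>user_id</code> column (function and column names are invented, not Meta's actual code):</p>

```python
def verify_user_rows(rows, expected_user_id):
    """Defense-in-depth check run during post-processing: every row in a
    user's output must carry that user's ID, otherwise generation aborts
    rather than risk returning another user's data."""
    mismatched = [r for r in rows if r["user_id"] != expected_user_id]
    if mismatched:
        raise ValueError(
            f"{len(mismatched)} row(s) belong to a different user; aborting"
        )
    return rows

rows = [{"user_id": 42, "event": "login"}, {"user_id": 42, "event": "view"}]
verify_user_rows(rows, 42)       # passes: every row belongs to user 42
try:
    verify_user_rows(rows, 7)    # mis-routed batch: must not be emitted
    leaked = True
except ValueError:
    leaked = False
```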
<p>Finally, we learned that complex data workflows require advanced tools and the capability to iterate on code changes quickly without re-processing everything. To this end, we built an experimentation platform that enables running modified versions of the workflows to quickly test changes, with the ability to independently execute phases of the process to expedite our work. For example, we found that innocent-looking changes, such as altering which column is being fetched, can lead to complex failures in the data-fetching jobs. We now have the ability to run a test job under the new configuration whenever we change a table fetcher.</p>
<h2>Making data consistently understandable and explainable</h2>
<p>Meta cares about ensuring that the information we provide is meaningful to end-users. A key challenge in providing transparent access is presenting often highly technical information in a way that is approachable and easy to understand, even for those with little expertise in technology. </p>
<p>Providing people access to their data involves working across numerous products and surfaces that our users interact with every day. The way that information is stored on our back-end systems is not always directly intelligible to end-users, and it takes both understanding of the individual products and features as well as the needs of users to make it user-friendly. This is a large, collaborative undertaking, leveraging many forms of expertise. Our process entails working with product teams familiar with the data from their respective products, applying our historical expertise in access surfaces, using innovative tools we have developed, and consulting with experts.</p>
<p>In more detail, a cross-functional team of access experts works with specialist teams to review these tables, taking care to avoid exposing information that could adversely affect the rights and freedoms of other users. For example, if you block another user on Facebook, this information would not be provided to the person that you have blocked. Similarly, when you view another user’s profile, this information will be available to you, but not to the person whose profile you viewed. This is a key principle Meta upholds to respect the rights of everyone who engages with our platforms. It also means that we need a rigorous process to ensure that the data made available is never shared incorrectly. Many of the datasets that power a social network will reference more than one person, but that does not imply everyone referenced should always have equal access to that information. </p>
<p>Additionally, Meta must take care not to disclose information that may compromise our integrity or safety systems, or our intellectual property rights. For instance, Meta sends <a href="https://transparency.meta.com/en-gb/ncmec-q2-2023/">millions of NCMEC Cybertip reports per year</a> to help protect children on our platforms. Disclosing this information, or the data signals used to detect apparent violations of laws protecting children, may undermine the sophisticated techniques we have developed to proactively seek out and report these types of content and interactions. </p>
<p>One particularly time-consuming and challenging task is ensuring that Meta-internal text strings that describe our systems and products are translated into more easily human-readable terms. For instance, a <a href="https://docs.hhvm.com/hack/built-in-types/enum">Hack enum</a> could define a set of user interface element references. Exposing the jargon-heavy internal versions of these enums would not be meaningful to an end-user — they may not be meaningful at first glance to other employees without sufficient context! In this case, user-friendly labels are created to replace these internal-facing strings. The resulting content is reviewed for explainability, simplicity, and consistency, with product experts also helping to verify that the final version is accurate.</p>
<p>This process makes information more useful by reducing duplicative information. When engineers build and iterate on a product for our platforms, they may log slightly different versions of the same information with the goal of better understanding how people use the product. For example, when users select an option from a list of actions, each part of the system may use slightly different values that represent the same underlying option, such as an option to move content to trash as part of <a href="https://about.fb.com/news/2020/06/introducing-manage-activity/">Manage Activity</a>. As a concrete example, we found this action stored with different values: in the first instance it was entered as MOVE_TO_TRASH, in the second as StoryTrashPostMenuItem, and in the third as FBFeedMoveToTrashOption. These differences stemmed from the fact that the logging in question was coming from different parts of the system with different conventions. Through a series of cross-functional reviews with support from product experts, Meta determines an appropriate column header (e.g., “Which option you interacted with”), and the best label for the option (e.g., “Move to trash”).</p>
<p>Finally, once content has been reviewed, it can be implemented in code using the renderers we described above. These are responsible for reading raw values and transforming them into user-friendly representations, such as turning raw integer values into meaningful references to entities and raw enum values into readable text. An ID like 1786022095521328 might become “John Doe”; enums with integer values 0, 1, and 2 might be converted into text like “Disabled,” “Active,” or “Hidden;” and columns with string enums can have jargon removed and duplicate variants consolidated (as in our “Move to trash” example above).</p>
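<p>A toy Python sketch of what such a renderer does, reusing the examples above. The mapping tables and function names are invented for illustration; the real label mappings come out of the cross-functional review process, not code like this:</p>

```python
# Illustrative renderer lookup tables.
OPTION_LABELS = {  # de-duplicates the three raw spellings into one label
    "MOVE_TO_TRASH": "Move to trash",
    "StoryTrashPostMenuItem": "Move to trash",
    "FBFeedMoveToTrashOption": "Move to trash",
}
STATUS_LABELS = {0: "Disabled", 1: "Active", 2: "Hidden"}
USER_NAMES = {1786022095521328: "John Doe"}  # in reality resolved via lookup

def render_row(raw):
    """Turn one raw log row into its user-facing representation,
    falling back to the raw value when no mapping is known."""
    return {
        "Which option you interacted with":
            OPTION_LABELS.get(raw["option"], raw["option"]),
        "Status": STATUS_LABELS.get(raw["status"], str(raw["status"])),
        "Who": USER_NAMES.get(raw["actor_id"], "Unknown"),
    }

rendered = render_row(
    {"option": "StoryTrashPostMenuItem", "status": 1,
     "actor_id": 1786022095521328}
)
```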
<p>Together, these culminate in a much friendlier representation of the data that might look like this:</p>
<figure id="attachment_22228" aria-describedby="caption-attachment-22228" class="wp-caption alignnone c2"><img class="size-large wp-image-22228" src="https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?w=1024" alt="" width="1024" height="197" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=916,176 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=768,148 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=1024,197 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=1536,295 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=96,18 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=192,37 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22228" class="wp-caption-text">An example of what a data log might look like, demonstrating meaningful and intelligible output. It shows column headers with descriptions and a row of data.</figcaption></figure>]]></description>
      <link>https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/</link>
      <guid>https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/</guid>
      <pubDate>Tue, 04 Feb 2025 21:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How Precision Time Protocol handles leap seconds]]></title>
      <description><![CDATA[<p>We’ve previously described why we think <a href="https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/">it’s time to leave the leap second in the past</a>. In today’s rapidly evolving digital landscape, introducing new leap seconds to account for the long-term slowdown of the Earth’s rotation is a risky practice that, frankly, does more harm than good. This is particularly true in the data center space, where new protocols like <a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/">Precision Time Protocol (PTP)</a> are allowing systems to be synchronized down to nanosecond precision.  </p>
<p>With the ever-growing demand for higher precision time distribution, and <a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/" target="_blank" rel="noopener">the larger role of PTP for time synchronization</a> in data centers, we need to consider how to address leap seconds within systems that use PTP and are thus much more time sensitive.</p>
<h2>Leap second smearing – a solution past its time</h2>
<p>Leap second smearing, the practice of gradually adjusting clock speeds to spread out the one-second correction, has been a common method for handling leap seconds. At Meta, we’ve traditionally focused our smearing effort on <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener">NTP</a> since it has been the de facto standard for time synchronization in data centers.</p>
<p>In large NTP deployments, leap second smearing is generally performed at the <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener">Stratum 2 layer</a>, which consists of NTP servers that directly interact with NTP clients (Stratum 3), the downstream users of the NTP service.</p>
<p><img class="alignnone size-large" src="https://engineering.fb.com/wp-content/uploads/2020/03/Time-Infra-NTP-Service.jpg" width="2000" height="1125" alt="image" /></p>
<p>There are multiple approaches to smearing. In the case of NTP, linear or quadratic smearing formulas can be applied.</p>
<p>Quadratic smearing is often preferred due to the layered nature of the NTP protocol, where clients are encouraged to dynamically adjust their polling interval as the value of pending correction increases. This solution has its own tradeoffs, such as inconsistent adjustments, which can lead to different offset values across a large server fleet. </p>
<p>Linear smearing may be superior if an entire fleet is relying on the same time sources and performs smearing at the same time. In combination with more frequent sync cycles of typically once per second, this is a more predictable, precise and reliable approach.</p>
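<p>To make the two shapes concrete, here is a hypothetical sketch of linear smearing and one common form of quadratic smearing, expressed as the offset applied at time <code>t</code> within a smear window. Real NTP implementations differ in details:</p>

```python
def linear_smear(t, window, leap=1.0):
    """Offset (seconds) applied t seconds into a smear window of `window`
    seconds, spreading a `leap`-second correction at a constant rate."""
    t = min(max(t, 0.0), window)
    return leap * t / window

def quadratic_smear(t, window, leap=1.0):
    """One common quadratic shape: the clock frequency ramps up and then
    back down, so the offset follows a piecewise-quadratic S-curve with
    zero slope at both ends (implementations vary)."""
    t = min(max(t, 0.0), window)
    x = t / window
    return leap * (2 * x * x if x < 0.5 else 1.0 - 2.0 * (1.0 - x) ** 2)
```

Both reach the full correction at the end of the window; the linear form applies a constant rate throughout, while the quadratic form starts and ends gently, matching the tradeoffs described above.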
<p><img class="alignnone size-large wp-image-22216" src="https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png 2240w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=1536,865 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=2048,1153 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>Handling leap seconds in PTP</h2>
<p>In contrast to NTP, which synchronizes at the millisecond level, PTP provides a level of precision typically in the range of nanoseconds. At this level of precision even periodic linear smearing would create too much delta across the fleet and violate guarantees provided to the customers.</p>
<p>To handle leap seconds in a PTP environment, we take an algorithmic approach that shifts time automatically for systems that use PTP and combine this with an emphasis on using International Atomic Time (TAI) over <a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time" target="_blank" rel="noopener">Coordinated Universal Time (UTC)</a>.</p>
<h3>Self-smearing</h3>
<p>At Meta, users interact with the PTP service via the <a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/#client" target="_blank" rel="noopener">fbclock</a> library, which provides a tuple of values, <em>{earliest_ns, latest_ns}</em>, representing a time interval referred to as the Window of Uncertainty (WOU). Each time the library is called during the smearing period, we adjust the return values based on the smearing algorithm, which shifts the time values by 1 nanosecond every 62.5 microseconds.</p>
<p>This approach has a number of advantages, including being completely stateless and reproducible. The service continues to utilize TAI timestamps but can return UTC timestamps to clients via the API. And, as the start time is determined by <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener">tzdata</a> timestamps, the current smearing position can be determined even after a server is rebooted.</p>
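<p>Because the shift depends only on the current timestamp and the known smear start, the offset can be recomputed from scratch on every call. A hypothetical sketch of that calculation (names are illustrative, not fbclock's actual API):</p>

```python
STEP_NS = 62_500          # shift 1 ns every 62.5 µs (the period given above)
LEAP_NS = 1_000_000_000   # one full leap second, in nanoseconds

def smear_shift_ns(now_ns, smear_start_ns):
    """Stateless smear offset: derivable from the current timestamp and
    the (tzdata-known) smear start alone, so it survives reboots."""
    if now_ns < smear_start_ns:
        return 0
    return min((now_ns - smear_start_ns) // STEP_NS, LEAP_NS)
```

At 1 ns per 62.5 µs, smearing the full second takes 62,500 seconds, roughly 17.4 hours.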
<p>This approach does come with some tradeoffs. For example, as the leap smearing strategy differs between the NTP (quadratic) and PTP (linear) ecosystems, services may struggle to match timestamps acquired from different sources during the smearing period. </p>
<p>The difference between the two approaches can exceed 100 microseconds, creating challenges for services that consume time from both systems.</p>
<h3>TAI over UTC</h3>
<p>The smearing strategy we implemented in our fbclock library shows good performance. However, it still introduces significant time deltas between multiple hosts during the smearing period, despite being fully stateless and using small, fixed (1-nanosecond) step sizes.</p>
<p>Another significant drawback comes from periodically running jobs. Smearing time means our scheduling is off by close to 1 millisecond after 60 seconds for services that run at precise intervals. </p>
<p><img class="alignnone size-large wp-image-22215" src="https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?w=1024" alt="" width="1024" height="617" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png 2240w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=916,552 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=768,463 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=1024,617 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=1536,925 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=2048,1233 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=96,58 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=192,116 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>This is not ideal for a service that guarantees nanosecond-level accuracy and precision.</p>
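<p>The arithmetic behind that scheduling drift is straightforward:</p>

```python
# The smear rate is 1 ns per 62.5 µs, a fractional rate of 16 ppm.
rate = 1e-9 / 62.5e-6
# A job scheduled against smeared time drifts by this much over a minute:
drift_after_60s = 60 * rate   # 60 s * 16 ppm = 960 µs, close to 1 ms
```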
<p>As a result, we recommend that customers use TAI over UTC and thus avoid having to deal with leap seconds. Unfortunately, though, in most cases the conversion to UTC is still required and eventually has to be performed somewhere.</p>
<h2>PTP without leap seconds</h2>
<p>At Meta, we support the recent push to <a href="https://www.nytimes.com/2022/11/19/science/time-leap-second-bipm.html">freeze any new leap seconds after 2035</a>. If we can cease the introduction of new leap seconds, then the entire industry can rely on UTC instead of TAI for higher precision timekeeping. This will simplify infrastructure and remove the need for different smearing solutions.</p>
<p>Ultimately, a future without leap seconds is one where we can push systems to greater levels of timekeeping precision more easily and efficiently.</p>]]></description>
      <link>https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/</link>
      <guid>https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/</guid>
      <pubDate>Mon, 03 Feb 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Bringing Jetpack Compose to Instagram for Android]]></title>
      <description><![CDATA[<p>Introducing a new Android UI framework like Jetpack Compose into an existing app is more complicated than importing some AARS and coding away. What if your app has specific performance goals to meet? What about existing design components, integrations with navigation, and logging frameworks?</p>
<p>On this episode of the Meta Tech Podcast <a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> is joined by Summer, a software engineer whose team handles large-scale migrations for Instagram. Summer walks through the various thoughtful and intricate phases that Instagram goes through to ensure that developers have the best possible experience when working on our codebases. She also discusses balancing all of this with Meta’s infrastructure teams, who have to maintain multiple implementations at once.</p>
<p>Learn how Meta approaches the rollout of a new framework and more!</p>
<p>Download or listen to the podcast episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/34599290/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe><br />
You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/3tKuh8iNENKMugNdZ0jTqU" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/jetpack-compose-at-meta/id1370910331?i=1000681564997" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/as8y2yqo" target="_blank" rel="noopener">Pocket Casts</a></li>
<li><a href="https://overcast.fm/login" target="_blank" rel="noopener">Overcast</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/</link>
      <guid>https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/</guid>
      <pubDate>Fri, 24 Jan 2025 18:30:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How Meta discovers data flows via lineage at scale]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that our users’ everyday interactions are protected across our family of apps, such as their religious views in the Facebook Dating app, the example we’ll walk through in this post.</li>
<li class="c1" aria-level="1">In order to build high-quality data lineage, we developed different techniques to collect data flow signals across different technology stacks: static code analysis for different languages, runtime instrumentation, and input/output data matching. We then built an intuitive UX into our tooling that enables developers to effectively consume all of this lineage data in a systematic way, saving significant engineering time for building privacy controls. </li>
<li class="c1" aria-level="1">As we expanded PAI across Meta, <a href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/#learnings">we gained valuable insights</a> about the data lineage space. Our understanding of the privacy space evolved, revealing the need for early focus on data lineage, tooling, a cohesive ecosystem of libraries, and more. These initiatives have assisted in accelerating the development of data lineage and implementing purpose limitation controls more quickly and efficiently.</li>
</ul><p>At Meta, we believe that privacy enables product innovation. This belief has led us to develop <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">Privacy Aware Infrastructure (PAI)</a>, which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as <a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener">purpose limitation</a>, which restricts the purposes for which data can be processed and used. </p>
<p>In this blog, we will delve into an early stage in PAI implementation: <em>data lineage</em>. Data lineage refers to the process of tracing the journey of data as it moves through various systems, illustrating how data transitions from one data asset, such as a database table (the <em>source</em> asset), to another (the <em>sink</em> asset). We’ll also walk through how we track the lineage of users’ “religion” information in our Facebook Dating app.</p>
<p>Millions of data assets are vital for supporting our product ecosystem, ensuring the functionality our users anticipate, maintaining high product quality, and safeguarding user safety and integrity. Data lineage enables us to efficiently navigate these assets and protect user data. It enhances the traceability of data flows within systems, ultimately empowering developers to swiftly implement privacy controls and create innovative products.</p>
<p>Note that data lineage is dependent on having already completed important and complex preliminary steps to inventory, schematize, and annotate data assets into a unified asset catalog. This took Meta multiple years to complete across our millions of disparate data assets, and we’ll cover each of these more deeply in future blog posts:</p>
<ul><li><strong>Inventorying</strong> involves collecting various code and data assets (e.g., web endpoints, data tables, AI models) used across Meta.</li>
<li><strong>Schematization</strong> expresses data assets in structural detail (e.g., indicating that a data asset has a field called “religion”).</li>
<li><strong>Annotation</strong> labels data to describe its content (e.g., specifying that the identity column contains religion data).</li>
</ul><h2>Understanding data lineage at Meta</h2>
<p>To establish robust privacy controls, an essential part of our PAI initiative is to understand how data flows across different systems. Data lineage is part of this discovery step in the PAI workflow, as shown in the following diagram:</p>
<p><img class="alignnone size-large wp-image-22176" src="https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?w=1024" alt="" width="1024" height="217" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png 2710w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=916,194 916w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=768,163 768w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=1024,217 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=1536,326 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=2048,435 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=96,20 96w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=192,41 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Data lineage is a key precursor to implementing Policy Zones, our information flow control technology, because it answers the question, “Where does my data come from and where does it go?” – helping inform the right places to apply privacy controls. In conjunction with Policy Zones, data lineage provides the following key benefits to thousands of developers at Meta: </p>
<ul><li class="c1" aria-level="1"><strong>Scalable data flow discovery</strong>: Data lineage answers the question above by providing an end-to-end, scalable graph of relevant data flows. We can leverage the lineage graphs to visualize and explain the flow of relevant data from the point where it is collected to all the places where it is processed.</li>
<li class="c1" aria-level="1"><strong>Efficient rollout of privacy controls</strong>: By leveraging data lineage to track data flows, we can easily pinpoint the optimal integration points for privacy controls like Policy Zones within the codebase, streamlining the rollout process. Thus we have developed a powerful flow discovery tool as part of our PAI tool suite, Policy Zone Manager (PZM), based on data lineage. PZM enables developers to rapidly identify multiple downstream assets from a set of sources simultaneously, thereby accelerating the rollout process of privacy controls.</li>
<li class="c1" aria-level="1"><strong>Continuous compliance verification</strong>: Once the privacy requirement has been fully implemented, data lineage plays a vital role in monitoring and validating data flows continuously, in addition to the enforcement mechanisms such as Policy Zones.</li>
</ul><p>Traditionally, data lineage has been collected via code inspection using manually authored data flow diagrams and spreadsheets. However, this approach does not scale in large and dynamic environments, such as Meta, with billions of lines of continuously evolving code. To tackle this challenge, we’ve developed a robust and scalable lineage solution that uses static code analysis signals as well as runtime signals.</p>
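<p>At its core, flow discovery over a lineage graph is a reachability problem: starting from the assets where sensitive data is collected, walk source-to-sink edges to find everything downstream that needs protection. A simplified Python sketch, with invented asset names standing in for real web endpoints, warehouse tables, and AI models:</p>

```python
from collections import defaultdict, deque

def downstream_assets(edges, sources):
    """Breadth-first search over lineage edges (source asset -> sink
    asset), returning every asset reachable from the given sources --
    the set that flow controls such as Policy Zones would need to cover."""
    graph = defaultdict(list)
    for src, sink in edges:
        graph[src].append(sink)
    seen, queue = set(sources), deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - set(sources)

edges = [
    ("web:dating_profile_form", "table:dating_profiles"),
    ("table:dating_profiles", "table:daily_profile_snapshot"),
    ("table:daily_profile_snapshot", "ai:match_ranking_model"),
    ("table:clicks", "table:click_agg"),  # unrelated flow, not reached
]
religion_downstream = downstream_assets(edges, {"web:dating_profile_form"})
```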
<h2>Walkthrough: Implementing data lineage for religion data</h2>
<p>We’ll share how we have automated lineage tracking to identify religion data flows through our core systems, eventually creating a precise, end-to-end view of the downstream religion assets to be protected. This happens in two key stages:</p>
<ol><li class="c1" aria-level="1"><strong>Collecting data flow signals</strong>: a process to capture data flow signals from many processing activities across different systems, not only for religion, but for all other types of data, to create an end-to-end lineage graph. </li>
<li class="c1" aria-level="1"><strong>Identifying relevant data flows</strong>: a process to identify the specific subset of data flows (“subgraph”) within the lineage graph that pertains to religion. </li>
</ol><p>These stages cover various systems, including <em>function-based systems</em> such as web systems and backend services, which load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, and Python), and <em>batch-processing systems</em> such as the data warehouse and AI systems, which process data rows in batch (mainly via SQL).</p>
<p>For simplicity, we will demonstrate these for the web, the data warehouse, and AI, per the diagram below.</p>
<p><img class="alignnone size-large wp-image-22187" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?w=1024" alt="" width="1024" height="559" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png 2743w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=916,500 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=768,419 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=1024,559 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=1536,838 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=2048,1118 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=192,105 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h3>Collecting data flow signals for the web system</h3>
<p>When setting up a profile on the Facebook Dating app, people can populate their religious views. This information is then used to identify relevant matches with other people whose dating preferences specify matching values. On Dating, religious views are subject to purpose limitation requirements; for example, <a href="https://about.fb.com/news/2020/10/privacy-matters-facebook-dating/">they will not be used to personalize experiences on other Facebook Products</a>.</p>
<p><img class="alignnone size-large wp-image-22169" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?w=1024" alt="" width="1024" height="651" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png 1964w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=916,582 916w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=768,488 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=1024,651 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=1536,976 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=96,61 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=192,122 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>We start with someone entering their religion information on their dating profile using their mobile device; this information is then transmitted to a web endpoint. The web endpoint subsequently logs the data into a logging table and stores it in a database, as depicted in the following code snippet:</p>
<p><img class="alignnone size-large wp-image-22191" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?w=895" alt="" width="895" height="1024" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png 1504w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=801,916 801w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=768,878 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=895,1024 895w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=1343,1536 1343w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=96,110 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=192,220 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Now let’s see how we collect lineage signals. We employ both static and runtime analysis tools to discover data flows, focusing in particular on where religion is logged and stored. Combining the two enhances our ability to accurately track and manage data flows.</p>
<p><a href="https://engineering.fb.com/2021/10/20/security/static-analysis-award/">Static analysis tools</a> simulate code execution to map out data flows within our systems. They also emit quality signals that indicate how confident we can be that a data flow signal is a true positive. However, these tools lack access to runtime data, which can lead to false positives from unexecuted code.</p>
<p>To address this limitation, we utilize <strong>Privacy Probes</strong>, a key component of our PAI lineage technologies. Privacy Probes automate data flow discovery by collecting runtime signals. These signals are gathered in real time during the execution of requests, allowing us to trace the flow of data into loggers, databases, and other services. </p>
<p>We have instrumented Meta’s core data frameworks and libraries, such as the logging framework, at both the data origin points (sources) and their eventual outputs (sinks), which allows for comprehensive data flow tracking. This approach is exemplified in the following code snippet:</p>
<p><img class="alignnone size-large wp-image-22192" src="https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?w=1024" alt="" width="1024" height="791" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=916,707 916w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=768,593 768w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=1024,791 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=96,74 96w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=192,148 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><br />
During runtime execution, Privacy Probes does the following:</p>
<ol><li class="c1" aria-level="1"><strong>Capturing payloads</strong>: It captures source and sink payloads in memory on a sampled basis, along with supplementary metadata such as event timestamps, asset identifiers, and stack traces as evidence for the data flow. </li>
<li class="c1" aria-level="1"><strong>Comparing payloads</strong>: It then compares the source and sink payloads within a request to identify data matches, which helps in understanding how data flows through the system. </li>
<li class="c1" aria-level="1"><strong>Categorizing results</strong>: It categorizes results into two sets. The <em>match-set</em> includes pairs of source and sink assets where the data matches exactly or one value is contained in the other, providing high-confidence evidence of data flow between the assets. The <em>full-set</em> includes all source and sink pairs within a request, regardless of whether the sink is tainted by the source. The full-set is a superset of the match-set; it contains some noise, but it is still important to send to human reviewers because it may contain transformed data flows. </li>
</ol><p>The above procedure is depicted in the diagram below:</p>
<p><img class="alignnone size-large wp-image-22177" src="https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?w=1024" alt="" width="1024" height="378" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png 2216w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=916,338 916w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=768,283 768w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=1024,378 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=1536,566 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=2048,755 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=192,71 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Let’s look at the following examples, where various religions are received in an endpoint and various values (copied or transformed) are logged in three different loggers:</p>
<table border="1"><tbody><tr><td class="c2"><strong>Input Value (source)</strong></td>
<td class="c2"><strong>Output Value (sink)</strong></td>
<td class="c2"><strong>Data Operation</strong></td>
<td class="c2"><strong>Match Result</strong></td>
<td class="c2"><strong>Flow Confidence</strong></td>
</tr><tr><td>“Atheist”</td>
<td>“Atheist”</td>
<td>Data Copy</td>
<td>EXACT_MATCH</td>
<td>HIGH</td>
</tr><tr><td>“Buddhist”</td>
<td>{metadata: {religion: Buddhist}}</td>
<td>Substring</td>
<td>CONTAINS</td>
<td>HIGH</td>
</tr><tr><td>{religions:<br />
[“Catholic”, “Christian”]}</td>
<td>{count : 2}</td>
<td>Transformed</td>
<td>NO_MATCH</td>
<td>LOW</td>
</tr></tbody></table><p>In the examples above, the first two rows show a precise match of religion between the source and sink values, and thus belong to the high-confidence match-set. The third row depicts a transformed data flow, in which the input values are reduced to a count before being logged; this pair belongs to the full-set. </p>
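<p>The matching logic above can be approximated in a short sketch. The following is a simplified stand-in (the real Privacy Probes comparison handles sampling, stack traces, and richer payload shapes) that reproduces the three match results from the table:</p>

```python
import json

def classify_flow(source_value, sink_value):
    """Classify a sampled (source, sink) payload pair.

    EXACT_MATCH and CONTAINS pairs go to the high-confidence match-set;
    NO_MATCH pairs remain only in the full-set for human review.
    """
    # Serialize payloads so scalars and nested structures compare uniformly.
    src, snk = json.dumps(source_value), json.dumps(sink_value)
    if src == snk:
        return ("EXACT_MATCH", "HIGH")
    if src.strip('"') in snk or snk.strip('"') in src:
        return ("CONTAINS", "HIGH")
    return ("NO_MATCH", "LOW")
```

<p>For instance, <code>classify_flow("Buddhist", {"metadata": {"religion": "Buddhist"}})</code> yields a CONTAINS match, mirroring the second table row.</p>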
<p>These signals together are used to construct a lineage graph to understand the flow of data through our web system as shown in the following diagram:</p>
<p><img class="alignnone size-large wp-image-22174" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?w=1024" alt="" width="1024" height="312" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png 2437w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=916,279 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=768,234 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=1024,312 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=1536,468 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=2048,624 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=96,29 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=192,59 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h3>Collecting data flow signals for the data warehouse system</h3>
<p>With the user’s religion logged in our web system, it can propagate to the data warehouse for offline processing. To gather data flow signals, we employ a combination of both runtime instrumentation and static code analysis in a different way from the web system. The involved SQL queries are logged for data processing activities by the <a href="https://research.facebook.com/publications/presto-sql-on-everything/">Presto</a> and <a href="https://spark.apache.org/">Spark</a> compute engines (among others). Static analysis is then performed for the logged SQL queries and job configs in order to extract data flow signals.</p>
<p>Let’s examine a simple SQL query example that processes data for the data warehouse as the following:</p>
<p><img class="alignnone size-large wp-image-22190" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?w=1024" alt="" width="1024" height="292" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=916,261 916w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=768,219 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=1024,292 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=96,27 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=192,55 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><br />
We’ve developed a <a href="https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/" target="_blank" rel="noopener">SQL analyzer</a> to extract data flow signals between the input table, “safety_log_tbl” and the output table, “safety_training_tbl” as shown in the following diagram. In practice, we also collect more granular-level lineage such as at column-level (e.g., “user_id” -&gt; “target_user_id”, “religion” -&gt; “target_religion”).</p>
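<p>As a rough illustration of what the analyzer extracts, here is a toy table-level sketch using regular expressions; the real SQL analyzer parses full SQL, resolves nested queries, and emits column-level lineage:</p>

```python
import re

def table_lineage(sql):
    """Extract (input tables, output table) from a simple
    INSERT INTO ... SELECT query via regular expressions."""
    output = re.search(r"INSERT\s+INTO\s+(\w+)", sql, re.IGNORECASE)
    inputs = set(re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))
    return inputs, output.group(1) if output else None

# A query shape mirroring the example above (column names assumed).
sql = """
INSERT INTO safety_training_tbl
SELECT user_id AS target_user_id, religion AS target_religion
FROM safety_log_tbl
"""
```
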
<p>There are instances where data is not fully processed by SQL queries, resulting in logs that contain data flow signals for either reads or writes, but not both. To ensure complete lineage data, we leverage contextual information collected at runtime (such as execution environments and job or trace IDs) to connect these reads and writes together. </p>
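<p>A minimal sketch of this stitching step, assuming simplified event records keyed by a shared trace ID, might look like this:</p>

```python
from collections import defaultdict

def stitch_flows(events):
    """Join partially logged reads and writes into (input, output) edges
    using the runtime trace ID they share."""
    by_trace = defaultdict(lambda: {"read": [], "write": []})
    for event in events:
        by_trace[event["trace_id"]][event["op"]].append(event["asset"])
    # Every read feeds every write observed under the same trace.
    return [
        (src, dst)
        for ops in by_trace.values()
        for src in ops["read"]
        for dst in ops["write"]
    ]

# Hypothetical events logged separately but sharing one trace ID.
events = [
    {"trace_id": "job-1", "op": "read", "asset": "safety_log_tbl"},
    {"trace_id": "job-1", "op": "write", "asset": "safety_training_tbl"},
]
```
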
<p>The following diagram illustrates how the lineage graph has expanded:</p>
<p><img class="alignnone size-large wp-image-22173" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?w=1024" alt="" width="1024" height="593" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png 2436w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=916,530 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=768,445 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=1024,593 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=1536,889 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=2048,1185 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=192,111 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h3>Collecting data flow signals for the AI system</h3>
<p>For our AI systems, we collect lineage signals by tracking relationships between various assets, such as input datasets, features, models, workflows, and inferences. A common approach is to extract data flows from the job configurations used for different AI activities, such as model training. For instance, in order to improve the relevance of dating matches, we use an AI model to recommend potential matches based on users’ shared religious views. Let’s take a look at the following training config example for this model, which uses religion data:</p>
<p><img class="alignnone size-large wp-image-22193" src="https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?w=1024" alt="" width="1024" height="829" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=916,741 916w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=768,621 768w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=1024,829 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=96,78 96w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=192,155 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>By parsing this config obtained from the model training service, we can track the data flow from the input dataset (with asset ID asset://hive.table/dating_training_tbl) and feature (with asset ID asset://ai.feature/DATING_USER_RELIGION_SCORE) to the model (with asset ID asset://ai.model/dating_ranking_model).</p>
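<p>A simplified sketch of this config parsing, using an illustrative config shape rather than Meta’s actual schema, could look like the following:</p>

```python
def config_to_edges(config):
    """Derive (source asset -> model) lineage edges from a training
    config; the config keys here are assumptions for illustration."""
    model = config["model_id"]
    sources = config.get("input_datasets", []) + config.get("features", [])
    return [(source, model) for source in sources]

config = {
    "model_id": "asset://ai.model/dating_ranking_model",
    "input_datasets": ["asset://hive.table/dating_training_tbl"],
    "features": ["asset://ai.feature/DATING_USER_RELIGION_SCORE"],
}
```
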
<p>Our AI systems are also instrumented so that asset relationships and data flow signals are captured at various points at runtime, including data-loading layers (e.g., <a href="https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/">DPP</a>) and libraries (e.g., <a href="https://pytorch.org/">PyTorch</a>), workflow engines (e.g., <a href="https://engineering.fb.com/2016/05/09/core-infra/introducing-fblearner-flow-facebook-s-ai-backbone/">FBLearner Flow</a>), training frameworks, inference systems (as backend services), etc. Lineage collection for backend services utilizes the approach for function-based systems described above. By matching the source and sink assets for different data flow signals, we are able to capture a holistic lineage graph at the desired granularities:</p>
<p><img class="alignnone size-large wp-image-22172" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?w=1024" alt="" width="1024" height="600" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png 2416w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=916,537 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=768,450 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=1024,600 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=1536,900 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=2048,1200 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=192,113 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>Identifying relevant data flows from a lineage graph</h2>
<p>Now that we have the lineage graph at our disposal, how can we effectively distill the subset of data flows pertinent to a specific privacy requirement for religion data? To address this question, we have developed an iterative analysis tool that enables developers to pinpoint precise data flows and systematically filter out irrelevant ones. The tool runs an iterative discovery process, aided by the lineage graph and the privacy controls from Policy Zones, to narrow down the most relevant flows. This refined data allows developers to make a final determination about which flows to act on, producing an optimal path for traversing the lineage graph. The major steps, captured holistically in the diagram below, are:</p>
<ol><li class="c1" aria-level="1"><strong>Discover data flows:</strong> identify data flows from source assets and stop at downstream assets with low-confidence flows (yellow nodes). </li>
<li class="c1" aria-level="1"><strong>Exclude and include candidates:</strong> Developers or automated heuristics exclude candidates that don’t carry religion data (red nodes) and include the remaining ones (green nodes). Excluding a red node early cuts off its entire downstream in a cascaded manner, saving significant developer effort. As an additional safeguard, developers also implement privacy controls via Policy Zones so that all relevant data flows can be captured.</li>
<li class="c1" aria-level="1"><strong>Repeat discovery cycle:</strong> use the green nodes as new sources and repeat the cycle until no more green nodes are confirmed. </li>
</ol><p><img class="alignnone size-large wp-image-22170" src="https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?w=1024" alt="" width="1024" height="711" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png 2232w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=916,636 916w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=768,533 768w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=1024,711 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=1536,1066 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=2048,1421 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=96,67 96w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=192,133 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
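<p>The discovery loop above can be sketched as follows, with a hypothetical <code>decide</code> callback standing in for developer or heuristic include/exclude judgments:</p>

```python
def iterative_discovery(edges, sources, decide):
    """Expand from confirmed sources, ask `decide(asset)` whether each
    newly reached asset carries the relevant data (True = include),
    and repeat until no new assets are confirmed. Excluded assets cut
    off their entire downstream."""
    confirmed = set(sources)
    frontier = set(sources)
    while frontier:
        candidates = set()
        for node in frontier:
            candidates.update(edges.get(node, ()))
        frontier = {a for a in candidates - confirmed if decide(a)}
        confirmed |= frontier
    return confirmed

# Hypothetical graph: the ads logger is judged not to carry religion.
edges = {
    "religion_field": ["dating_logger", "ads_logger"],
    "dating_logger": ["dating_training_tbl"],
}
flows = iterative_discovery(edges, {"religion_field"},
                            lambda a: a != "ads_logger")
```

<p>Note that excluding <code>ads_logger</code> means anything downstream of it is never even visited, which is where the cascaded savings come from.</p>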
<p>With the collection and data flow identification steps complete, developers are able to locate granular data flows that contain religion across Meta’s complex systems, allowing them to move forward in the PAI workflow and apply the necessary privacy controls to safeguard the data. A once-intimidating task can now be completed efficiently. </p>
<p>Our data lineage technology has provided developers with an unprecedented ability to quickly understand and protect religion and similar sensitive data flows. It enables Meta to scalably and efficiently implement privacy controls via PAI to protect our users’ privacy and deliver products safely.</p>
<h2 id="learnings">Learnings and challenges</h2>
<p>As we’ve worked to develop and implement lineage as a core PAI technology, we’ve gained valuable insights and overcome significant challenges, yielding some important lessons:</p>
<ul><li class="c1" aria-level="1"><strong>Focus on lineage early and reap the rewards</strong>: As we developed privacy technologies like Policy Zones, it became clear that gaining a deep understanding of data flows across various systems is essential for scaling the implementation of privacy controls. By investing in lineage, we not only accelerated the adoption of Policy Zones but also uncovered new opportunities for applying the technology. Lineage can also be extended to other use cases such as security and integrity.</li>
<li class="c1" aria-level="1"><strong>Build lineage consumption tools to gain engineering efficiency</strong>: We initially focused on building a lineage solution but didn’t give sufficient attention to consumption tools for developers. As a result, owners had to use raw lineage signals to discover relevant data flows, which was overwhelmingly complex. We addressed this issue by developing the iterative tooling that guides engineers in discovering relevant data flows, reducing the engineering effort by orders of magnitude.</li>
<li class="c1" aria-level="1"><strong>Integrate lineage with systems to scale the coverage</strong>: Collecting lineage from diverse Meta systems was a significant challenge. Initially, we tried to ask every system to collect lineage signals to ingest into the centralized lineage service, but the progress was slow. We overcame this by developing reliable, computationally efficient, and widely applicable PAI libraries with built-in lineage collection logic in various programming languages (Hack, C++, Python, etc.). This enabled much smoother integration with a broad range of Meta’s systems.</li>
<li class="c1" aria-level="1"><strong>Measurement improves our outcomes</strong>: By incorporating the measurement of coverage, we’ve been able to evolve our data lineage so that we stay ahead of the ever-changing landscape of data and code at Meta. By enhancing our signals and adapting to new technologies, we can maintain a strong focus on privacy outcomes and drive ongoing improvements in lineage coverage across our tech stacks.</li>
</ul><h2>The future of data lineage</h2>
<p>Data lineage is a vital component of Meta’s PAI initiative, providing a comprehensive view of how data flows across different systems. While we’ve made significant progress in establishing a strong foundation, our journey is ongoing. We’re committed to:</p>
<ul><li class="c1" aria-level="1"><strong>Expanding coverage</strong>: continuously enhance the coverage of our data lineage capabilities to ensure a comprehensive understanding of data flows.</li>
<li class="c1" aria-level="1"><strong>Improving consumption experience</strong>: streamline the consumption experience to make it easier for developers and stakeholders to access and utilize data lineage information.</li>
<li class="c1" aria-level="1"><strong>Exploring new frontiers</strong>: investigate new applications and use cases for data lineage, driving innovation and collaboration across the industry.</li>
</ul><p>By advancing data lineage, we aim to foster a culture of privacy awareness and drive progress in the broader field. Together, we can create a more transparent and accountable data ecosystem.</p>
<h2>Acknowledgements</h2>
<p><em>The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in developing data lineage technologies over the years. In particular, we would like to extend special thanks to (in alphabetical order) Amit Jain, Aygun Aydin, Ben Zhang, Brian Romanko, Brian Spanton, Daniel Ramagem, David Molnar, Dzmitry Charnahalau, Gayathri Aiyer, George Stasa, Guoqiang Jerry Chen, Graham Bleaney, Haiyang Han, Howard Cheng, Ian Carmichael, Ibrahim Mohamed, Jerry Pan, Jiang Wu, Jonathan Bergeron, Joanna Jiang, Jun Fang, Kiran Badam, Komal Mangtani, Kyle Huang, Maharshi Jha, Manuel Fahndrich, Marc Celani, Lei Zhang, Mark Vismonte, Perry Stoll, Pritesh Shah, Qi Zhou, Rajesh Nishtala, Rituraj Kirti, Seth Silverman, Shelton Jiang, Sushaant Mujoo, Vlad Fedorov, Yi Huang, Xinbo Gao, and Zhaohui Zhang. We would also like to express our gratitude to all reviewers of this post, including (in alphabetical order) Aleksandar Ilic, Avtar Brar, Benjamin Renard, Bogdan Shubravyi, Brianna O’Steen, Chris Wiltz, Daniel Chamberlain, Hannes Roth, Imogen Barnes, Jason Hendrickson, Koosh Orandi, Rituraj Kirti, and Xenia Habekoss. We would like to especially thank Jonathan Bergeron for overseeing the effort and providing all of the guidance and valuable feedback, Supriya Anand for leading the editorial effort to shape the blog content, and Katherine Bates for pulling all required support together to make this blog post happen.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/</link>
      <guid>https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/</guid>
      <pubDate>Thu, 23 Jan 2025 06:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Strobelight: A profiling service built on open source technology]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing details about Strobelight, Meta’s profiling orchestrator.</li>
<li class="c1" aria-level="1">Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet.</li>
<li class="c1" aria-level="1">Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings.</li>
</ul><p>Strobelight, Meta’s profiling orchestrator, is not really one technology. It’s several (many of them open source) combined to make something that unlocks truly amazing efficiency wins. Strobelight is also not a single profiler but an orchestrator of many different profilers (even ad-hoc ones) that runs on all production hosts at Meta, collecting detailed information about CPU usage, memory allocations, and other performance metrics from running processes. Engineers and developers can use this information to identify performance and resource bottlenecks, optimize their code, and improve utilization.</p>
<p>When you combine talented engineers with rich performance data you can get efficiency wins by both creating tooling to identify issues before they reach production and finding opportunities in already running code. Let’s say an engineer makes a code change that introduces an unintended copy of some large object on a service’s hot path. Meta’s existing tools can identify the issue and query Strobelight data to estimate the impact on compute cost. Then Meta’s code review tool can notify the engineer that they’re about to waste, say, 20,000 servers.</p>
<p>Of course, static analysis tools can pick up on these sorts of issues, but they are unaware of global compute cost and oftentimes these inefficiencies aren’t a problem until they’re gradually serving millions of requests per minute. The frog can boil slowly.</p>
<h2>Why do we use profilers?</h2>
<p>Profilers operate by sampling data to perform statistical analysis. For example, a profiler takes a sample every N events (or every N milliseconds, in the case of time profilers) to understand where those events occur or what is happening at the moment of each one. With a CPU-cycles event, for example, the profile will show the CPU time spent in functions or function call stacks executing on the CPU. This can give an engineer a high-level understanding of the code execution of a service or binary.</p>
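<p>A toy sketch of how such samples become a profile, assuming stacks have already been collected by some sampling mechanism, is simply aggregation by call stack; the function names are hypothetical:</p>

```python
from collections import Counter

def aggregate_samples(samples):
    """Fold raw stack samples into per-stack counts. Each sample is the
    call stack observed when the sampled event (e.g., a CPU-cycles
    counter overflow) fired; the counts approximate where the event
    budget is being spent."""
    return Counter(tuple(stack) for stack in samples)

# Three hypothetical samples from a time profiler.
samples = [
    ["main", "serve_request", "json_encode"],
    ["main", "serve_request", "json_encode"],
    ["main", "gc"],
]
counts = aggregate_samples(samples)
```

<p>Rendering these counts as a flame graph is the usual last step, but the statistical core is just this tally.</p>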
<h2>Choosing your own adventure with Strobelight</h2>
<p>There are other daemons at Meta that collect observability metrics, but Strobelight’s wheelhouse is software profiling. It connects resource usage to source code (what developers understand best). Strobelight’s profilers are often, but not exclusively, built using <a href="https://docs.ebpf.io/" target="_blank" rel="noopener">eBPF</a>, which is a Linux kernel technology. eBPF allows the safe injection of custom code into the kernel, which enables very low overhead collection of different types of data and unlocks so many possibilities in the observability space that it’s hard to imagine how Strobelight would work without it.</p>
<p>As of the time of writing this, Strobelight has 42 different profilers, including:</p>
<ul><li class="c1" aria-level="1">Memory profilers powered by <a href="https://github.com/jemalloc/jemalloc" target="_blank" rel="noopener">jemalloc.</a></li>
<li class="c1" aria-level="1">Function call count profilers.</li>
<li class="c1" aria-level="1">Event-based profilers for both native and non-native languages (e.g., Python, Java, and Erlang).</li>
<li class="c1" aria-level="1">AI/GPU profilers.</li>
<li class="c1" aria-level="1">Profilers that track off-CPU time.</li>
<li class="c1" aria-level="1">Profilers that track service requests.</li>
</ul><p>Engineers can utilize any one of these to collect data from servers on demand via Strobelight’s command line tool or web UI.</p>
<figure id="attachment_22158" aria-describedby="caption-attachment-22158" class="wp-caption alignnone c2"><img class="size-large wp-image-22158" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?w=1024" alt="" width="1024" height="579" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=916,518 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1024,579 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1536,869 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=192,109 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22158" class="wp-caption-text">The Strobelight web UI.</figcaption></figure><p>Users also have the ability to set up continuous or “triggered” profiling for any of these profilers by updating a configuration file in Meta’s <a href="https://research.facebook.com/publications/holistic-configuration-management-at-facebook/" target="_blank" rel="noopener">Configerator</a>, allowing them to target their entire service or, for example, only hosts that run in certain regions. Users can specify how often these profilers should run, the run duration, the symbolization strategy, the process they want to target, and a lot more.</p>
<p>Here is an example of a simple configuration for one of these profilers:</p>
<pre class="line-numbers"><code class="language-none">add_continuous_override_for_offcpu_data(
    "my_awesome_team", // the team that owns this service
    Type.SERVICE_ID,
    "my_awesome_service",
    30_000, // desired samples per hour
)
</code></pre>
<p>Why does Strobelight have so many profilers? Because there are so many different things happening in these systems powered by so many different technologies.</p>
<p>This is also why Strobelight provides ad-hoc profilers. Since the kind of data that can be gathered from a binary is so varied, engineers often need something that Strobelight doesn’t provide out of the box. Adding a new profiler from scratch to Strobelight involves several code changes and could take several weeks to get reviewed and rolled out.</p>
<p>However, engineers can write a single <a href="https://github.com/bpftrace/bpftrace" target="_blank" rel="noopener"><em>bpftrace</em></a> script (a simple language/tool that makes it easy to write eBPF programs) and tell Strobelight to run it like it would any other profiler. An engineer who really cares about the latency of a particular C++ function, for example, could write up a little bpftrace script, commit it, and have Strobelight run it on any number of hosts throughout Meta’s fleet – all within a matter of hours, if needed.</p>
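<p>As a sketch of what such a script might look like (the binary path and mangled function name here are invented placeholders, not a real Meta service), a bpftrace uprobe/uretprobe pair can histogram one function’s latency:</p>

```bpftrace
// Hypothetical ad-hoc profiler: time a single C++ function with bpftrace.
// The binary path and symbol below are placeholders for illustration only.
uprobe:/usr/local/bin/my_service:_ZN4meta10doHardWorkEv
{
  @start[tid] = nsecs;  // record entry timestamp per thread
}

uretprobe:/usr/local/bin/my_service:_ZN4meta10doHardWorkEv
/@start[tid]/
{
  // histogram the elapsed wall time, in microseconds, per call
  @latency_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
```

<p>Run under Strobelight, the resulting histogram map would be collected from every targeted host rather than printed locally.</p>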
<p>If all of this sounds powerfully dangerous, that’s because it is. However, Strobelight has several safeguards in place to prevent users from causing performance degradation for the targeted workloads and retention issues for the databases Strobelight writes to. Strobelight also has enough awareness to ensure that different profilers don’t conflict with each other. For example, if a profiler is tracking CPU cycles, Strobelight ensures another profiler can’t use another PMU counter at the same time (as there are other services that also use them).</p>
<p>Strobelight also has concurrency rules and a profiler queuing system. Of course, service owners still have the flexibility to really hammer their machines if they want to extract a lot of data to debug.</p>
<h2>Default data for everyone</h2>
<p>Since its inception, one of Strobelight’s core principles has been to provide automatic, regularly-collected profiling data for all of Meta’s services. It’s like a flight recorder – something that doesn’t have to be thought about until it’s needed. What’s worse than waking up to an alert that a service is unhealthy and there is no data as to why?</p>
<p>For that reason, Strobelight has a handful of curated profilers that are configured to run automatically on every Meta host. They’re not running all the time; that would be “bad” and not really “profiling.” Instead, they have custom run intervals and sampling rates specific to the workloads running on the host. This provides just the right amount of data without impacting the profiled services or overburdening the systems that store Strobelight data.</p>
<p>Here is an example:</p>
<p>Say a service named Soft Server runs on 1,000 hosts, and we want profiler A to gather 40,000 CPU-cycles samples per hour for it (remember the config above). Strobelight knows how many hosts Soft Server runs on, but not how CPU intensive it is, so it starts with a conservative run probability – a sampling mechanism that prevents bias (profiling these hosts at noon every day, for example, would hide traffic patterns).</p>
<p>The next day Strobelight will look at how many samples it was able to gather for this service and then automatically tune the run probability (with some very simple math) to try to hit 40,000 samples per hour. We call this dynamic sampling and Strobelight does this readjustment every day for every service at Meta.</p>
<p>And if there is more than one service running on the host (excluding daemons like systemd or Strobelight), Strobelight will default to the configuration that yields the most samples for all of them.</p>
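<p>In code terms, the daily readjustment might look something like the following sketch (the function, its doubling heuristic, and its clamping bounds are illustrative assumptions, not Strobelight’s actual implementation):</p>

```cpp
#include <algorithm>
#include <cassert>

// Illustrative daily tuning step for dynamic sampling (not Meta's real code):
// scale yesterday's run probability by how far the observed sample rate
// landed from the target, clamped to a sane range.
double adjust_run_probability(double run_probability,
                              double observed_samples_per_hour,
                              double target_samples_per_hour) {
  if (observed_samples_per_hour <= 0.0) {
    // No samples gathered yesterday; probe more aggressively tomorrow.
    return std::min(1.0, run_probability * 2.0);
  }
  double scaled =
      run_probability * (target_samples_per_hour / observed_samples_per_hour);
  return std::clamp(scaled, 0.001, 1.0);
}
```

<p>A service that undershot its target gets a proportionally higher run probability the next day; one that overshot gets a lower one.</p>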
<p>Hang on, hang on. If the run probability or sampling rate is different depending on the host for a service, then how can the data be aggregated or compared across the hosts? And how can profiling data for multiple services be compared?</p>
<p>Since Strobelight is aware of all these different knobs for profile tuning, it adjusts the “weight” of a profile sample when it’s logged. A sample’s weight is used to normalize the data and prevent bias when analyzing or viewing this data in aggregate. So even if Strobelight is profiling Soft Server less often on one host than on another, the samples can be accurately compared and grouped. This also works for comparing two different services since Strobelight is used both by service owners looking at their specific service as well as efficiency experts who look for “horizontal” wins across the fleet in shared libraries.</p>
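<p>The weighting idea can be sketched in a few lines (the struct and field names here are assumptions for illustration, not Strobelight’s schema):</p>

```cpp
#include <cassert>

// Illustrative weight computation: a sample's weight undoes the sampling
// knobs so data gathered at different rates on different hosts can be
// aggregated and compared fairly.
struct SampleKnobs {
  double run_probability;  // fraction of the time this host was profiled
  long sample_period;      // hardware events per recorded sample
};

double sample_weight(const SampleKnobs& knobs) {
  // Each recorded sample stands in for `sample_period` events, and only
  // `run_probability` of the profiling windows were observed at all.
  return static_cast<double>(knobs.sample_period) / knobs.run_probability;
}
```

<p>A host profiled half as often contributes samples that each count for twice as much, so per-function totals line up across hosts.</p>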
<h2>How Strobelight saves capacity</h2>
<p>There are two default continuous profilers that should be called out because of how much they end up saving in capacity.</p>
<h3>The last branch record (LBR) profiler </h3>
<p>The LBR profiler, true to its name, is used to sample <a href="https://lwn.net/Articles/680985/" target="_blank" rel="noopener">last branch records</a> (a hardware feature that started on Intel). The data from this profiler doesn’t get visualized but instead is fed into Meta’s feedback directed optimization (FDO) pipeline. This data is used to create FDO profiles that are consumed at compile time (<a href="https://ieeexplore.ieee.org/document/10444807" target="_blank" rel="noopener">CSSPGO</a>) and post-compile time (<a href="https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/">BOLT</a>) to speed up binaries through the added knowledge of runtime behavior. Meta’s top 200 largest services all have FDO profiles from the LBR data gathered continuously across the fleet. Some of these services see up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta.</p>
<h3>The event profiler</h3>
<p>The second profiler is Strobelight’s event profiler. This is Strobelight’s version of the Linux perf tool. Its primary job is to collect user and kernel stack traces from multiple performance (perf) events, e.g., CPU-cycles, L3 cache misses, instructions, etc. Not only is this data looked at by individual engineers to understand what the hottest functions and call paths are, but it is also fed into monitoring and testing tools to identify regressions; ideally <em>before</em> they hit production.</p>
<h2>Did someone say Meta…data?</h2>
<p>Looking at function call stacks with <a href="https://www.brendangregg.com/flamegraphs.html" target="_blank" rel="noopener">flame graphs</a> is great, nothing against it. But a service owner looking at call stacks from their service, which imports many libraries and utilizes Meta’s software frameworks, will see a lot of “foreign” functions. Also, what about finding just the stacks for p99 latency requests? Or how about all the places where a service is making an unintended string copy?</p>
<h3>Stack schemas</h3>
<p>Strobelight has multiple mechanisms for enhancing the data it produces according to the needs of its users. One such mechanism is called Stack Schemas (inspired by <a href="https://learn.microsoft.com/en-us/windows-hardware/test/wpt/stack-tags" target="_blank" rel="noopener">Microsoft’s stack tags</a>), which is a small DSL that operates on call stacks and can be used to add tags (strings) to entire call stacks or individual frames/functions. These tags can then be utilized in our visualization tool. Stack Schemas can also remove functions users don’t care about with regex matching. Any number of schemas can be applied on a per-service or even per-profile basis to customize the data.</p>
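<p>The real Stack Schemas DSL is internal to Meta, but its two behaviors described above – dropping frames by regex and tagging whole stacks – can be mimicked in a small sketch (all names here are invented for illustration):</p>

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Sketch of a stack-schema pass: drop frames matching one regex, and tag the
// whole stack when any surviving frame matches another.
struct SchemaRule {
  std::regex drop_frame;  // frames the user doesn't care about
  std::regex tag_when;    // marker that tags the entire call stack
  std::string tag;
};

std::pair<std::vector<std::string>, std::vector<std::string>> apply_schema(
    const SchemaRule& rule, const std::vector<std::string>& stack) {
  std::vector<std::string> kept_frames;
  std::vector<std::string> tags;
  for (const std::string& frame : stack) {
    if (std::regex_search(frame, rule.drop_frame)) {
      continue;  // strip "foreign" frames from the view
    }
    if (tags.empty() && std::regex_search(frame, rule.tag_when)) {
      tags.push_back(rule.tag);  // the tag applies to the whole stack
    }
    kept_frames.push_back(frame);
  }
  return {kept_frames, tags};
}
```

<p>Any number of such rules could then be layered per service or per profile, as the text describes.</p>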
<p>There are even folks who create dashboards from this metadata to help other engineers identify expensive copying, use of inefficient or inappropriate C++ containers, overuse of smart pointers, and much more. Static analysis tools that can do this have been around for a long time, but they can’t pinpoint the really painful or computationally expensive instances of these issues across a large fleet of machines.</p>
<h3>Strobemeta</h3>
<p>Strobemeta is another mechanism, which utilizes thread local storage, to attach bits of dynamic metadata at runtime to call stacks that we gather in the event profiler (and others). This is one of the biggest advantages of building profilers using eBPF: complex and customized actions taken at sample time. Collected Strobemeta is used to attribute call stacks to specific service endpoints, or request latency metrics, or request identifiers. Again, this allows engineers and tools to do more complex filtering to focus the vast amounts of data that Strobelight profilers produce.</p>
<h2>Symbolization</h2>
<p>Now is a good time to talk about symbolization: taking the virtual address of an instruction, converting it into an actual symbol (function) name, and, depending on the symbolization strategy, also getting the function’s source file, line number, and type information.</p>
<p>Most of the time getting the whole enchilada means using a binary’s DWARF debug info. But this can be many megabytes (or even gigabytes) in size because DWARF debug data contains much more than the symbol information.</p>
<p>This data needs to be downloaded then parsed. But attempting this while profiling, or even afterwards on the same host where the profile is gathered, is far too computationally expensive. Even with optimal caching strategies it can cause memory issues for the host’s workloads.</p>
<p>Strobelight gets around this problem via a symbolization service that utilizes several open source technologies including DWARF, ELF, <a href="https://github.com/YtnbFirewings/gsym" target="_blank" rel="noopener">gsym</a>, and <a href="https://github.com/libbpf/blazesym" target="_blank" rel="noopener">blazesym</a>. At the end of a profile Strobelight sends stacks of binary addresses to a service that sends back symbolized stacks with file, line, type info, and even inline information.</p>
<p>It can do this because it has already done all the heavy lifting of downloading and parsing the DWARF data for each of Meta’s binaries (specifically, production binaries) and stores what it needs in a database. Then it can serve multiple symbolization requests coming from different instances of Strobelight running throughout the fleet.</p>
<p>To add to that enchilada (hungry yet?), Strobelight also delays symbolization until after profiling and stores raw data to disk to prevent memory thrash on the host. This has the added benefit of not letting the consumer impact the producer – meaning that if Strobelight’s user space code can’t keep up with the speed at which the eBPF kernel code is producing samples (because it’s spending time symbolizing or doing some other processing), samples are simply dropped rather than piling up in memory.</p>
<p>All of this is made possible with the inclusion of <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html" target="_blank" rel="noopener">frame pointers</a> in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient). </p>
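<p>The core lookup the symbolization service performs can be illustrated with a toy table (the real service parses DWARF/ELF and answers requests over RPC; the types and addresses below are invented, and this sketch ignores function end addresses):</p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Each function's start address maps to its symbol info.
struct Symbol {
  std::string function;
  std::string file;
  int line;
};

using SymbolTable = std::map<uint64_t, Symbol>;  // keyed by function start

// Resolve a raw instruction address to the nearest function starting at or
// below it, as a symbolizer would when walking a stack of addresses.
Symbol symbolize(const SymbolTable& table, uint64_t address) {
  auto it = table.upper_bound(address);  // first function starting past addr
  if (it == table.begin()) {
    return {"<unknown>", "", 0};  // address below every known function
  }
  --it;
  return it->second;
}
```

<p>Doing this once centrally, against pre-parsed debug data, is what lets profiled hosts ship only raw addresses.</p>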
<figure id="attachment_22159" aria-describedby="caption-attachment-22159" class="wp-caption alignnone c2"><img class="size-large wp-image-22159" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?w=1024" alt="" width="1024" height="633" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png 1607w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=916,566 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=768,475 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1024,633 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1536,949 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=192,119 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22159" class="wp-caption-text">A simplified Strobelight service graph.</figcaption></figure><h2>Show me the data (and make it nice)!</h2>
<p>The primary tool Strobelight customers use is <a href="https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/" target="_blank" rel="noopener">Scuba</a> – a query language (like SQL), database, and UI. The Scuba UI has a large suite of visualizations for the queries people construct (e.g., flame graphs, pie charts, time series graphs, distributions, etc).</p>
<p>Strobelight, for the most part, produces Scuba data and, generally, it’s a happy marriage. If someone runs an on-demand profile, it’s just a few seconds before they can visualize this data in the Scuba UI (and send people links to it). Even tools like <a href="https://perfetto.dev/" target="_blank" rel="noopener">Perfetto</a> expose the ability to query the underlying data because they know it’s impossible to try to come up with enough dropdowns and buttons that can express everything you want to do in a query language – though the Scuba UI comes close.</p>
<figure id="attachment_22160" aria-describedby="caption-attachment-22160" class="wp-caption alignnone c2"><img class="size-large wp-image-22160" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?w=1024" alt="" width="1024" height="554" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=916,495 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=768,415 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1024,554 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1536,831 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=192,104 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22160" class="wp-caption-text">An example flamegraph/icicle of function call stacks of the CPU cycles event for the symbol service for one hour.</figcaption></figure><p>The other tool is a trace visualization tool used at Meta named <a href="https://www.facebook.com/atscaleevents/videos/996197807391867/">Tracery</a>. We use this tool when we want to combine correlated but different streams of profile data on one screen. This data is also a natural fit for viewing on a timeline. Tracery allows users to make custom visualizations and curated workspaces to share with other engineers to pinpoint the important parts of that data. It’s also powered by a client-side columnar database (written in JavaScript!), which makes it very fast when it comes to zooming and filtering.</p>
<p>Strobelight’s Crochet profiler combines service request spans, CPU-cycles stacks, and off-CPU data to give users a detailed snapshot of their service.</p>
<figure id="attachment_22161" aria-describedby="caption-attachment-22161" class="wp-caption alignnone c3"><img class="size-large wp-image-22161" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?w=975" alt="" width="975" height="552" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png 975w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=916,519 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=192,109 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22161" class="wp-caption-text">An example trace in Tracery.</figcaption></figure><h2>The Biggest Ampersand</h2>
<p>Strobelight has helped engineers at Meta realize countless efficiency and latency wins, ranging from increases in the number of requests served, to large reductions in heap allocations, to regressions caught in pre-prod analysis tools.</p>
<p>But one of the most significant wins is one we call, “The Biggest Ampersand.”</p>
<p>A seasoned performance engineer was looking through Strobelight data and discovered that by filtering on a particular std::vector function call (using the symbolized file and line number) he could identify computationally expensive array copies that happen unintentionally with the ‘auto’ keyword in C++.</p>
<p>The engineer turned a few knobs, adjusted his Scuba query, and happened to notice one of these copies in a particularly hot call path in one of Meta’s largest ads services. He then cracked open his code editor to investigate whether this particular vector copy was intentional… it wasn’t.</p>
<p>It was a simple mistake that any engineer working in C++ has made a hundred times.</p>
<p>So, the engineer typed an “&amp;” in front of the auto keyword to indicate we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers in capacity savings per year!</p>
<p>Go back and re-read that sentence. One ampersand! </p>
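<p>For readers who want to see the shape of the bug, here is a self-contained toy (the type, sizes, and counts are invented for illustration; the real service and object were of course far larger):</p>

```cpp
#include <cassert>
#include <vector>

// Stand-in for the large object being copied; counts copy-constructions.
struct BigRow {
  static inline int copies = 0;
  std::vector<int> data = std::vector<int>(1024);
  BigRow() = default;
  BigRow(const BigRow& other) : data(other.data) { ++copies; }
};

// Returns how many copy-constructions iterating over 100 rows causes.
int count_copies(bool use_reference) {
  std::vector<BigRow> rows(100);
  BigRow::copies = 0;
  if (use_reference) {
    for (const auto& row : rows) {  // with '&': iterate by reference, no copies
      (void)row.data.size();
    }
  } else {
    for (auto row : rows) {  // missing '&': copies every element
      (void)row.data.size();
    }
  }
  return BigRow::copies;
}
```

<p>One character separates the two loops; only the scale of the service made the difference worth 15,000 servers.</p>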
<h2>An open ending</h2>
<p>This only scratches the surface of everything Strobelight can do. The Strobelight team works closely with Meta’s performance engineers on new features that can better analyze code to help pinpoint where things are slow, computationally expensive, and why.</p>
<p>We’re currently working on <a href="https://github.com/facebookincubator/strobelight" target="_blank" rel="noopener">open-sourcing</a> Strobelight’s profilers and libraries, which will no doubt make them more robust and useful. Most of the technologies Strobelight uses are already public or open source, so please use and contribute to them!</p>
<h2>Acknowledgements</h2>
<p><em>Special thanks to Wenlei He, Andrii Nakryiko, Giuseppe Ottaviano, Mark Santaniello, Nathan Slingerland, Anita Zhang, and the Profilers Team at Meta.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/01/21/uncategorized/strobelight-a-profiling-service-built-on-open-source-technology/</link>
      <guid>https://engineering.fb.com/2025/01/21/uncategorized/strobelight-a-profiling-service-built-on-open-source-technology/</guid>
      <pubDate>Tue, 21 Jan 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Strobelight: A profiling service built on open source technology]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing details about Strobelight, Meta’s profiling orchestrator.</li>
<li class="c1" aria-level="1">Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet.</li>
<li class="c1" aria-level="1">Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings.</li>
</ul><p>Strobelight, Meta’s profiling orchestrator, is not really one technology. It’s several technologies (many of them open source) combined to make something that unlocks truly amazing efficiency wins. Strobelight is also not a single profiler but an orchestrator of many different profilers (even ad-hoc ones) that runs on all production hosts at Meta, collecting detailed information about CPU usage, memory allocations, and other performance metrics from running processes. Engineers and developers can use this information to identify performance and resource bottlenecks, optimize their code, and improve utilization.</p>
<p>When you combine talented engineers with rich performance data you can get efficiency wins by both creating tooling to identify issues before they reach production and finding opportunities in already running code. Let’s say an engineer makes a code change that introduces an unintended copy of some large object on a service’s critical path. Meta’s existing tools can identify the issue and query Strobelight data to estimate the impact on compute cost. Then Meta’s code review tool can notify the engineer that they’re about to waste, say, 20,000 servers.</p>
<p>Of course, static analysis tools can pick up on these sorts of issues, but they are unaware of global compute cost, and oftentimes these inefficiencies aren’t a problem until the code path gradually grows to serving millions of requests per minute. The frog can boil slowly.</p>
<h2>Why do we use profilers?</h2>
<p>Profilers operate by sampling data to perform statistical analysis. For example, a profiler takes a sample every N events (or milliseconds in the case of time profilers) to understand where that event occurs or what is happening at the moment of that event. With a CPU-cycles event, for example, the profile captures the CPU time spent in functions or function call stacks executing on the CPU. This can give an engineer a high-level understanding of the code execution of a service or binary.</p>
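<p>As a toy illustration of the sampling idea (not Strobelight code: the event stream and period are invented), recording only every Nth event still recovers the relative time spent per function:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Record the "currently executing" function only every `period`-th event,
// then count samples per function. With enough events, each function's share
// of samples approximates its true share of events.
std::map<std::string, long> sample_profile(
    const std::vector<std::string>& event_stream, std::size_t period) {
  std::map<std::string, long> samples;
  for (std::size_t i = 0; i < event_stream.size(); ++i) {
    if (i % period == 0) {  // take a sample every `period` events
      ++samples[event_stream[i]];
    }
  }
  return samples;
}
```

<p>This is why sampling profilers stay cheap: they observe a small, statistically representative slice of the work instead of tracing every event.</p>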
<h2>Choosing your own adventure with Strobelight</h2>
<p>There are other daemons at Meta that collect observability metrics, but Strobelight’s wheelhouse is software profiling. It connects resource usage to source code (what developers understand best). Strobelight’s profilers are often, but not exclusively, built using <a href="https://docs.ebpf.io/" target="_blank" rel="noopener">eBPF</a>, which is a Linux kernel technology. eBPF allows the safe injection of custom code into the kernel, which enables very low overhead collection of different types of data and unlocks so many possibilities in the observability space that it’s hard to imagine how Strobelight would work without it.</p>
<p>As of the time of writing this, Strobelight has 42 different profilers, including:</p>
<ul><li class="c1" aria-level="1">Memory profilers powered by <a href="https://github.com/jemalloc/jemalloc" target="_blank" rel="noopener">jemalloc.</a></li>
<li class="c1" aria-level="1">Function call count profilers.</li>
<li class="c1" aria-level="1">Event-based profilers for both native and non-native languages (e.g., Python, Java, and Erlang).</li>
<li class="c1" aria-level="1">AI/GPU profilers.</li>
<li class="c1" aria-level="1">Profilers that track off-CPU time.</li>
<li class="c1" aria-level="1">Profilers that track service request latency.</li>
</ul><p>Engineers can utilize any one of these to collect data from servers on demand via Strobelight’s command line tool or web UI.</p>
<figure id="attachment_22158" aria-describedby="caption-attachment-22158" class="wp-caption alignnone c2"><img class="size-large wp-image-22158" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?w=1024" alt="" width="1024" height="579" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=916,518 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1024,579 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1536,869 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=192,109 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22158" class="wp-caption-text">The Strobelight web UI.</figcaption></figure><p>Users also have the ability to set up continuous or “triggered” profiling for any of these profilers by updating a configuration file in Meta’s <a href="https://research.facebook.com/publications/holistic-configuration-management-at-facebook/" target="_blank" rel="noopener">Configerator</a>, allowing them to target their entire service or, for example, only hosts that run in certain regions. Users can specify how often these profilers should run, the run duration, the symbolization strategy, the process they want to target, and a lot more.</p>
<p>Here is an example of a simple configuration for one of these profilers:</p>
<pre class="line-numbers"><code class="language-none">add_continuous_override_for_offcpu_data(
    "my_awesome_team", // the team that owns this service
    Type.SERVICE_ID,
    "my_awesome_service",
    30_000, // desired samples per hour
)
</code></pre>
<p>Why does Strobelight have so many profilers? Because there are so many different things happening in these systems powered by so many different technologies.</p>
<p>This is also why Strobelight provides ad-hoc profilers. Since the kind of data that can be gathered from a binary is so varied, engineers often need something that Strobelight doesn’t provide out of the box. Adding a new profiler from scratch to Strobelight involves several code changes and could take several weeks to get reviewed and rolled out.</p>
<p>However, engineers can write a single <a href="https://github.com/bpftrace/bpftrace" target="_blank" rel="noopener"><em>bpftrace</em></a> script (a simple language/tool that makes it easy to write eBPF programs) and tell Strobelight to run it like it would any other profiler. An engineer who really cares about the latency of a particular C++ function, for example, could write up a little bpftrace script, commit it, and have Strobelight run it on any number of hosts throughout Meta’s fleet – all within a matter of hours, if needed.</p>
<p>If all of this sounds powerfully dangerous, that’s because it is. However, Strobelight has several safeguards in place to prevent users from causing performance degradation for the targeted workloads and retention issues for the databases Strobelight writes to. Strobelight also has enough awareness to ensure that different profilers don’t conflict with each other. For example, if a profiler is tracking CPU cycles, Strobelight ensures another profiler can’t use another PMU counter at the same time (as there are other services that also use them).</p>
<p>Strobelight also has concurrency rules and a profiler queuing system. Of course, service owners still have the flexibility to really hammer their machines if they want to extract a lot of data to debug.</p>
<h2>Default data for everyone</h2>
<p>Since its inception, one of Strobelight’s core principles has been to provide automatic, regularly-collected profiling data for all of Meta’s services. It’s like a flight recorder – something that doesn’t have to be thought about until it’s needed. What’s worse than waking up to an alert that a service is unhealthy and there is no data as to why?</p>
<p>For that reason, Strobelight has a handful of curated profilers that are configured to run automatically on every Meta host. They’re not running all the time; that would be “bad” and not really “profiling.” Instead, they have custom run intervals and sampling rates specific to the workloads running on the host. This provides just the right amount of data without impacting the profiled services or overburdening the systems that store Strobelight data.</p>
<p>Here is an example:</p>
<p>Say a service named Soft Server runs on 1,000 hosts, and we want profiler A to gather 40,000 CPU-cycles samples per hour for it (remember the config above). Strobelight knows how many hosts Soft Server runs on, but not how CPU intensive it is, so it starts with a conservative run probability – a sampling mechanism that prevents bias (profiling these hosts at noon every day, for example, would hide traffic patterns).</p>
<p>The next day Strobelight will look at how many samples it was able to gather for this service and then automatically tune the run probability (with some very simple math) to try to hit 40,000 samples per hour. We call this dynamic sampling and Strobelight does this readjustment every day for every service at Meta.</p>
<p>And if there is more than one service running on the host (excluding daemons like systemd or Strobelight), Strobelight will default to the configuration that yields the most samples for all of them.</p>
<p>Hang on, hang on. If the run probability or sampling rate is different depending on the host for a service, then how can the data be aggregated or compared across the hosts? And how can profiling data for multiple services be compared?</p>
<p>Since Strobelight is aware of all these different knobs for profile tuning, it adjusts the “weight” of a profile sample when it’s logged. A sample’s weight is used to normalize the data and prevent bias when analyzing or viewing this data in aggregate. So even if Strobelight is profiling Soft Server less often on one host than on another, the samples can be accurately compared and grouped. This also works for comparing two different services since Strobelight is used both by service owners looking at their specific service as well as efficiency experts who look for “horizontal” wins across the fleet in shared libraries.</p>
<h2>How Strobelight saves capacity</h2>
<p>There are two default continuous profilers that should be called out because of how much they end up saving in capacity.</p>
<h3>The last branch record (LBR) profiler </h3>
<p>The LBR profiler, true to its name, is used to sample <a href="https://lwn.net/Articles/680985/" target="_blank" rel="noopener">last branch records</a> (a hardware feature that started on Intel). The data from this profiler doesn’t get visualized but instead is fed into Meta’s feedback directed optimization (FDO) pipeline. This data is used to create FDO profiles that are consumed at compile time (<a href="https://ieeexplore.ieee.org/document/10444807" target="_blank" rel="noopener">CSSPGO</a>) and post-compile time (<a href="https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/">BOLT</a>) to speed up binaries through the added knowledge of runtime behavior. Meta’s top 200 largest services all have FDO profiles from the LBR data gathered continuously across the fleet. Some of these services see up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta.</p>
<h3>The event profiler</h3>
<p>The second profiler is Strobelight’s event profiler. This is Strobelight’s version of the Linux perf tool. Its primary job is to collect user and kernel stack traces from multiple performance (perf) events, e.g., CPU-cycles, L3 cache misses, instructions, etc. Not only is this data looked at by individual engineers to understand what the hottest functions and call paths are, but it is also fed into monitoring and testing tools to identify regressions; ideally <em>before</em> they hit production.</p>
<h2>Did someone say Meta…data?</h2>
<p>Looking at function call stacks with <a href="https://www.brendangregg.com/flamegraphs.html" target="_blank" rel="noopener">flame graphs</a> is great, nothing against it. But a service owner looking at call stacks from their service, which imports many libraries and utilizes Meta’s software frameworks, will see a lot of “foreign” functions. Also, what about finding just the stacks for p99 latency requests? Or how about all the places where a service is making an unintended string copy?</p>
<h3>Stack schemas</h3>
<p>Strobelight has multiple mechanisms for enhancing the data it produces according to the needs of its users. One such mechanism is called Stack Schemas (inspired by <a href="https://learn.microsoft.com/en-us/windows-hardware/test/wpt/stack-tags" target="_blank" rel="noopener">Microsoft’s stack tags</a>), which is a small DSL that operates on call stacks and can be used to add tags (strings) to entire call stacks or individual frames/functions. These tags can then be utilized in our visualization tool. Stack Schemas can also remove functions users don’t care about with regex matching. Any number of schemas can be applied on a per-service or even per-profile basis to customize the data.</p>
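<p>The real Stack Schemas DSL is internal, but its two core operations – tagging frames and removing uninteresting ones via regex matching – can be sketched in a few lines of C++ (the rule shapes and type names here are hypothetical):</p>

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a stack-schema pass: each rule either tags
// frames matching a regex or (with an empty tag) drops them entirely.
struct Rule {
  std::regex pattern;
  std::string tag;  // empty tag means "remove the frame"
};

struct Frame {
  std::string name;
  std::vector<std::string> tags;
};

std::vector<Frame> applySchema(const std::vector<std::string>& stack,
                               const std::vector<Rule>& rules) {
  std::vector<Frame> out;
  for (const auto& name : stack) {
    Frame f{name, {}};
    bool removed = false;
    for (const auto& r : rules) {
      if (std::regex_search(name, r.pattern)) {
        if (r.tag.empty()) { removed = true; break; }
        f.tags.push_back(r.tag);  // tag survives into the visualization
      }
    }
    if (!removed) out.push_back(std::move(f));
  }
  return out;
}
```

<p>Applied per-service, rules along these lines let a service owner hide framework frames and tag, say, allocation-heavy paths for later filtering in the visualization tool.</p>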
<p>There are even folks who create dashboards from this metadata to help other engineers identify expensive copying, use of inefficient or inappropriate C++ containers, overuse of smart pointers, and much more. Static analysis tools that can do this have been around for a long time, but they can’t pinpoint the really painful or computationally expensive instances of these issues across a large fleet of machines.</p>
<h3>Strobemeta</h3>
<p>Strobemeta is another mechanism, which utilizes thread local storage, to attach bits of dynamic metadata at runtime to call stacks that we gather in the event profiler (and others). This is one of the biggest advantages of building profilers using eBPF: complex and customized actions taken at sample time. Collected Strobemeta is used to attribute call stacks to specific service endpoints, or request latency metrics, or request identifiers. Again, this allows engineers and tools to do more complex filtering to focus the vast amounts of data that Strobelight profilers produce.</p>
<h2>Symbolization</h2>
<p>Now is a good time to talk about symbolization: taking the virtual address of an instruction, converting it into an actual symbol (function) name, and, depending on the symbolization strategy, also getting the function’s source file, line number, and type information.</p>
<p>Most of the time getting the whole enchilada means using a binary’s DWARF debug info. But this can be many megabytes (or even gigabytes) in size because DWARF debug data contains much more than the symbol information.</p>
<p>This data needs to be downloaded and then parsed. But attempting this while profiling, or even afterwards on the same host where the profile is gathered, is far too computationally expensive. Even with optimal caching strategies, it can cause memory issues for the host’s workloads.</p>
<p>Strobelight gets around this problem via a symbolization service that utilizes several open source technologies including DWARF, ELF, <a href="https://github.com/YtnbFirewings/gsym" target="_blank" rel="noopener">gsym</a>, and <a href="https://github.com/libbpf/blazesym" target="_blank" rel="noopener">blazesym</a>. At the end of a profile Strobelight sends stacks of binary addresses to a service that sends back symbolized stacks with file, line, type info, and even inline information.</p>
<p>It can do this because it has already done all the heavy lifting of downloading and parsing the DWARF data for each of Meta’s binaries (specifically, production binaries) and stores what it needs in a database. Then it can serve multiple symbolization requests coming from different instances of Strobelight running throughout the fleet.</p>
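<p>At its core, the lookup the symbolization service performs for each frame is an interval search: given an instruction address, find the function whose address range contains it. A simplified sketch, with a hard-coded symbol table standing in for the parsed DWARF/ELF data (names and addresses are illustrative):</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One entry of a symbol table: a function covers [start, start + size).
struct Symbol {
  uint64_t start;
  uint64_t size;
  std::string name;
};

// 'syms' must be sorted by start address. Binary-search for the last
// symbol starting at or before 'addr', then check it actually covers it.
std::string symbolize(const std::vector<Symbol>& syms, uint64_t addr) {
  auto it = std::upper_bound(
      syms.begin(), syms.end(), addr,
      [](uint64_t a, const Symbol& s) { return a < s.start; });
  if (it == syms.begin()) return "<unknown>";
  --it;
  if (addr < it->start + it->size) return it->name;
  return "<unknown>";  // address falls in a gap between functions
}
```

<p>Strobelight’s service layers the file, line, type, and inline information described above on top of this basic address-to-name mapping.</p>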
<p>To add to that enchilada (hungry yet?), Strobelight also delays symbolization until after profiling and stores raw data to disk to prevent memory thrash on the host. This has the added benefit of not letting the consumer impact the producer: if Strobelight’s user space code can’t keep up with the speed at which the eBPF kernel code is producing samples (because it’s spending time symbolizing or doing some other processing), the result is dropped samples rather than a stalled producer.</p>
<p>All of this is made possible with the inclusion of <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html" target="_blank" rel="noopener">frame pointers</a> in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient). </p>
<figure id="attachment_22159" aria-describedby="caption-attachment-22159" class="wp-caption alignnone c2"><img class="size-large wp-image-22159" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?w=1024" alt="" width="1024" height="633" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png 1607w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=916,566 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=768,475 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1024,633 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1536,949 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=192,119 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22159" class="wp-caption-text">A simplified Strobelight service graph.</figcaption></figure><h2>Show me the data (and make it nice)!</h2>
<p>The primary tool Strobelight customers use is <a href="https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/" target="_blank" rel="noopener">Scuba</a> – a query language (like SQL), database, and UI. The Scuba UI has a large suite of visualizations for the queries people construct (flame graphs, pie charts, time series graphs, distributions, etc.).</p>
<p>Strobelight, for the most part, produces Scuba data and, generally, it’s a happy marriage. If someone runs an on-demand profile, it’s just a few seconds before they can visualize this data in the Scuba UI (and send people links to it). Even tools like <a href="https://perfetto.dev/" target="_blank" rel="noopener">Perfetto</a> expose the ability to query the underlying data, because they know it’s impossible to come up with enough dropdowns and buttons to express everything you want to do in a query language – though the Scuba UI comes close.</p>
<figure id="attachment_22160" aria-describedby="caption-attachment-22160" class="wp-caption alignnone c2"><img class="wp-image-22160 size-large" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?w=1024" alt="" width="1024" height="554" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=916,495 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=768,415 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1024,554 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1536,831 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=192,104 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22160" class="wp-caption-text">An example flamegraph/icicle of function call stacks of the CPU cycles event for the mononoke service for one hour.</figcaption></figure><p>The other tool is a trace visualization tool used at Meta named <a href="https://www.facebook.com/atscaleevents/videos/996197807391867/">Tracery</a>. We use this tool when we want to combine correlated but different streams of profile data on one screen. This data is also a natural fit for viewing on a timeline. Tracery allows users to make custom visualizations and curated workspaces to share with other engineers to pinpoint the important parts of that data. It’s also powered by a client-side columnar database (written in JavaScript!), which makes it very fast when it comes to zooming and filtering. 
Strobelight’s Crochet profiler combines service request spans, CPU-cycles stacks, and off-CPU data to give users a detailed snapshot of their service.</p>
<figure id="attachment_22161" aria-describedby="caption-attachment-22161" class="wp-caption alignnone c3"><img class="size-large wp-image-22161" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?w=975" alt="" width="975" height="552" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png 975w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=916,519 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=192,109 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22161" class="wp-caption-text">An example trace in Tracery.</figcaption></figure><h2>The Biggest Ampersand</h2>
<p>Strobelight has helped engineers at Meta realize countless efficiency and latency wins, ranging from increases in the number of requests served, to large reductions in heap allocations, to regressions caught in pre-prod analysis tools.</p>
<p>But one of the most significant wins is one we call, “The Biggest Ampersand.”</p>
<p>A seasoned performance engineer was looking through Strobelight data and discovered that by filtering on a particular std::vector function call (using the symbolized file and line number) he could identify computationally expensive vector copies that happen unintentionally when the C++ ‘auto’ keyword deduces a value type rather than a reference.</p>
<p>The engineer turned a few knobs, adjusted his Scuba query, and happened to notice one of these copies in a particularly hot call path in one of Meta’s largest ads services. He then cracked open his code editor to investigate whether this particular vector copy was intentional… it wasn’t.</p>
<p>It was a simple mistake that any engineer working in C++ has made a hundred times.</p>
<p>So, the engineer added an “&amp;” after the auto keyword to indicate that we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers in capacity savings per year!</p>
<p>Go back and re-read that sentence. One ampersand! </p>
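<p>The bug class is easy to reproduce. In this illustrative example (not the actual ads-service code), a copy-counting type stands in for the expensive vector: iterating with plain ‘auto’ copies every element, while ‘auto&amp;’ copies nothing.</p>

```cpp
#include <vector>

// A type that counts how often it is copied, standing in for a large
// container (e.g., a std::vector deep inside a hot code path).
struct Payload {
  static inline int copies = 0;
  Payload() = default;
  Payload(const Payload&) { ++copies; }
};

// Iterating by value: 'auto' deduces Payload, so every element is copied.
int copiesByValue(const std::vector<Payload>& items) {
  Payload::copies = 0;
  for (auto item : items) { (void)item; }
  return Payload::copies;
}

// One extra '&' makes the loop variable a reference: zero copies.
int copiesByReference(const std::vector<Payload>& items) {
  Payload::copies = 0;
  for (const auto& item : items) { (void)item; }
  return Payload::copies;
}
```

<p>Run over a thousand-element vector, the first loop performs a thousand copy constructions and the second performs none – the same one-character difference behind the production win.</p>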
<h2>An open ending</h2>
<p>This only scratches the surface of everything Strobelight can do. The Strobelight team works closely with Meta’s performance engineers on new features that can better analyze code to help pinpoint where things are slow, computationally expensive, and why.</p>
<p>We’re currently working on <a href="https://github.com/facebookincubator/strobelight" target="_blank" rel="noopener">open-sourcing</a> Strobelight’s profilers and libraries, which will no doubt make them more robust and useful. Most of the technologies Strobelight uses are already public or open source, so please use and contribute to them!</p>
<h2>Acknowledgements</h2>
<p><em>Special thanks to Wenlei He, Andrii Nakryiko, Giuseppe Ottaviano, Mark Santaniello, Nathan Slingerland, Anita Zhang, and the Profilers Team at Meta.</em></p>]]></description>
      <link>https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/</link>
      <guid>https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/</guid>
      <pubDate>Tue, 21 Jan 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Measuring productivity impact with Diff Authoring Time]]></title>
      <description><![CDATA[<p>Do types actually make developers more productive? Or is it just more typing on the keyboard? To answer that question we’re revisiting <a href="https://engineering.fb.com/2024/10/25/developer-tools/diff-authoring-time-dat-measuring-developer-productivity-meta/" target="_blank" rel="noopener">Diff Authoring Time (DAT)</a> – how Meta measures how long it takes to submit changes to a codebase.</p>
<p>DAT is just one of the ways we measure developer productivity, and this latest episode of the Meta Tech Podcast takes a look at two concrete use cases for DAT, including a type-safe mocking framework in Hack.</p>
<p>Tune in to learn how we leverage metrics to run experiments on productivity in our internal codebase at Meta.</p>
<p>Download or listen to the podcast episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/34195175/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe><br />
You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/1hQrnC2opzGA80MjX61MRa" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/to-type-or-not-to-type-measuring-productivity-impact/id1370910331?i=1000678671324" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/85xvrheb" target="_blank" rel="noopener">Pocket Casts</a></li>
<li><a href="https://overcast.fm/login" target="_blank" rel="noopener">Overcast</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2025/01/16/developer-tools/measuring-productivity-impact-with-diff-authoring-time/</link>
      <guid>https://engineering.fb.com/2025/01/16/developer-tools/measuring-productivity-impact-with-diff-authoring-time/</guid>
      <pubDate>Thu, 16 Jan 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[ILA Evo: Meta’s journey to reimagine fiber optic in-line amplifier sites]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Today’s rapidly evolving landscape of use cases that demand highly performant and efficient network infrastructure is placing new emphasis on how in-line amplifiers (ILAs) are designed and deployed.</li>
<li class="c1" aria-level="1">Meta’s ILA Evo effort seeks to reimagine how an ILA site could be deployed to improve speed and cost while making a step function improvement in power efficiency.</li>
</ul><p>Over the past year, Meta has been on a journey to reimagine fiber optic in-line amplifier (ILA) sites. An important piece of network infrastructure, ILAs serve to amplify optical signals and are often placed in remote locations between data centers. If one ILA fails, an entire intercity route fails, and if one ILA cannot grow, the entire fiber route is constrained. Meta is excited to introduce new ideas and concepts to help modernize the ILAs for tomorrow.</p>
<p>To that end, we’ve launched the ILA Evo effort to overcome the historic design constraints of today’s ILAs, namely:</p>
<ul><li class="c1" aria-level="1">The minimal skilled labor and raw material required at the deployment site;</li>
<li class="c1" aria-level="1">The requirement that buildings meet local snow, wind, and seismic loads, along with fire codes and health and safety regulations, plus a lifespan greater than 25 years.</li>
</ul><p>This new effort seeks to propel advancement through several new requirements: </p>
<ul><li class="c1" aria-level="1">Requiring that the building and inside plant (ISP) be deployed in three to four days.</li>
<li class="c1" aria-level="1">Reducing the need for specialized heavy equipment (avoiding the cost and time for heavy-lift cranes to travel to a remote site).</li>
<li class="c1" aria-level="1">Minimizing concrete (avoiding the cost and time to transport, form, tie rebar, pour, and cure concrete).</li>
<li class="c1" aria-level="1">Reducing the power usage effectiveness (PUE) to less than 1.5 – nowhere near <a href="https://sustainability.atmeta.com/data-centers/" target="_blank" rel="noopener">Meta’s operational data center PUE average of 1.09</a> (2023 average), but an achievable and significant improvement.</li>
</ul><h2>A short history of ILAs</h2>
<p>Fiber optic cable networks have seen exponential growth in both size and capacity since <a href="https://en.wikipedia.org/wiki/GTE" target="_blank" rel="noopener">GTE launched the first fiber optic network in 1977</a>. <a href="https://docs.fcc.gov/public/attachments/DOC-334523A1.pdf" target="_blank" rel="noopener">U.S. network operators would install 20,039 mi (32,250 km) of intercity fiber routes by 1985.</a> This would <a href="https://docs.fcc.gov/public/attachments/DOC-334526A1.pdf" target="_blank" rel="noopener">quadruple to 83,618 mi (134,570 km) by 1989</a> and <a href="https://transition.fcc.gov/Bureaus/Common_Carrier/Reports/FCC-State_Link/Fiber/fiber98.pdf" target="_blank" rel="noopener">double again to 159,779 mi (257,149 km) by 1998</a>, with MCI, Sprint, USTelecom, and WilTel being the major players in those early days.</p>
<p>As fiber was rolled out along roads, railways, and pipelines, real estate to house optical signal repeaters was developed in parallel. What later became known as ILA sites were spaced 18 to 25 miles (30 to 40 km) apart. With rapid improvements in both optical fiber purity and composition, plus advancements in optronics, spacing soon doubled into the 50 to 60 mile (80 to 100 km) range, where it has largely remained until today.</p>
<p>Early ILA building designs were roughly modeled on Bell Telephone central offices, albeit a shrunken down version: concrete shells (or stick framed construction on a steel I-beam base) placed atop concrete foundations; wall-mounted HVAC units; -48V power distribution; lead-acid batteries; diesel backup generators and so forth surrounded by chain link fences. </p>
<p>Buildings were constructed in a central location with ISP fitted into the shell before shipment to the site (via specialized motor carrier) and placed with a heavy-load crane. However, unlike the remarkable (and ongoing) advancements in fiber and optronics (e.g., CWDM to DWDM to Coherent DWDM), ILA sites themselves have received little attention.</p>
<figure id="attachment_22130" aria-describedby="caption-attachment-22130" class="wp-caption alignnone c2"><img class="size-large wp-image-22130" src="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?w=1024" alt="" width="1024" height="408" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png 3390w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?resize=916,365 916w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?resize=768,306 768w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?resize=1024,408 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?resize=1536,612 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?resize=2048,816 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?resize=96,38 96w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-2.png?resize=192,76 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22130" class="wp-caption-text">Current ILA site design.</figcaption></figure><p>Today’s ILA buildings are often larger, with more efficient HVAC systems. Components like security and building access systems have been modernized, but if you dropped a field technician from 1990 into one of today’s ILAs, they’d have little difficulty navigating. Historically, ILA sites haven’t required significant evolution; however, newfound capacity growth and innovation have warranted the development of new ILA approaches.</p>
<h2>The structure of ILA Evo</h2>
<p>Working with global engineering consultancy <a href="https://aecom.com/" target="_blank" rel="noopener">AECOM</a>, we’ve organized the problem and our engineering efforts into several categories: different building systems and foundations; a new ISP installation method; alternative ballistics protection; introducing more efficient cooling; and modernizing backup power systems.</p>
<h3>Building system</h3>
<p><strong>Identify lightweight building designs which can be flat packed for easy, quick shipment and unloaded at the deployment site using a lift gate.</strong> Our emphasis has been on buildings composed of fiberglass-reinforced polymer (FRP) aka glass-reinforced polymer (GRP) wall and roof panels light enough for two people to handle, but sturdy enough to meet our design needs. There is a robust ecosystem of companies offering solutions in this space. This approach also allows for slightly taller buildings than prefab to provide additional overhead space for HVAC system elements or other components. </p>
<figure id="attachment_22135" aria-describedby="caption-attachment-22135" class="wp-caption alignnone c2"><img class="size-large wp-image-22135" src="https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?w=1024" alt="" width="1024" height="550" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png 6596w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?resize=916,492 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?resize=768,412 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?resize=1024,550 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?resize=1536,824 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?resize=2048,1099 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-1.png?resize=192,103 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22135" class="wp-caption-text">Lightweight flat-pack buildings suitable for a wide range of local and climatic conditions.</figcaption></figure><h3>Building foundations</h3>
<p><strong>Where geology permits, utilize low- or no-concrete foundation designs for easier deployment into both greenfield and brownfield sites.</strong> Typically, ILA buildings sit atop slab-on-grade foundations with a perimeter edge return. This design works well in a variety of soil conditions and is well suited to the weights involved. However, as we pursue lighter buildings, other foundation designs become possible. In particular, the project has focused on steel or FRP I-beams over concrete pad footings or helical steel screw piles. Both options offer the potential for both lower cost and more rapid deployment.</p>
<figure id="attachment_22129" aria-describedby="caption-attachment-22129" class="wp-caption alignnone c2"><img class="size-large wp-image-22129" src="https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?w=1024" alt="" width="1024" height="541" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png 6596w, https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?resize=916,484 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?resize=768,406 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?resize=1024,541 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?resize=1536,812 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?resize=2048,1083 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?resize=96,51 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Helical-Screw-Pile-Mock-20241111_EDITED.png?resize=192,102 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22129" class="wp-caption-text">Lighter buildings can unlock alternative foundation designs like helical steel screw piles.</figcaption></figure><h3>Manufactured owner-furnished equipment ISP</h3>
<p><strong>Devise a process for rapidly installing ISP to minimize post-construction interior dust and debris clean-up.</strong> We envision employing a manufactured owner-furnished equipment (MOFE) process for ISP: six rack modules consisting of equipment racks and overhead tiers supported by an exoskeleton, and movable on casters, are preassembled in a centralized, clean factory-like environment. Like the building system, these can be delivered on regular trucks with lift gates and modules rolled into the building to be bolted down.</p>
<figure id="attachment_22136" aria-describedby="caption-attachment-22136" class="wp-caption alignnone c2"><img class="size-large wp-image-22136" src="https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?w=1024" alt="" width="1024" height="550" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png 6596w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?resize=916,492 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?resize=768,412 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?resize=1024,550 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?resize=1536,824 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?resize=2048,1099 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Sprint-4-Enclosure2-2.png?resize=192,103 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22136" class="wp-caption-text">MOFE ISP for rapid on-site installation.</figcaption></figure><h3>Ballistics</h3>
<p><strong>Where required, provide appropriate ballistics protection</strong>; e.g., NIJ 0101.06 Level IIA. However, we felt assigning this function to the building system wasn’t the only option, so we’ve left open the possibility of ballistic privacy fences and other approaches to meeting this requirement, which offer more flexibility for what “inside the fence” could look like.</p>
<h3>Cooling</h3>
<p><strong>The single biggest opportunity to improve PUE lies in introducing more efficient cooling tech combined with higher temperature set points.</strong> PUE hasn’t historically been a focus metric for small telecoms sites, but this needs to change. Traditional sites use self-contained, wall-mounted HVAC units that rely on large fans to force air through the building without a duct system. This approach works, but is not very efficient: North American ILA PUE is typically between 2.5 and 3.0. However, over the past few years, power requirements have moved from 600-800W per rack into the 2-4 kW range. With increasing power density, we began looking for other options which could do the job more effectively and efficiently (i.e., reduce PUE).</p>
<p>Main elements:</p>
<ul><li class="c1" aria-level="1">Radically improve ILA site HVAC efficiency with advanced passive (i.e., compressorless) and/or liquid-based cooling tech.</li>
<li class="c1" aria-level="1">Because ILAs are unmanned, allow the building to run hotter: e.g., increase temperature set points from typical 22°C (72°F) to &gt;35°C (95°F); most optical transport kit is GR-63-CORE NEBS-3 compliant and able to run continuously at 40°C (104°F).</li>
<li class="c1" aria-level="1">Thermally “leaky” building systems; i.e., not air leaks, but lower R-value walls.</li>
<li class="c1" aria-level="1">Move rectifiers and batteries into outdoor cabinets: more space for optical gear and less heat generation.</li>
<li class="c1" aria-level="1">Utilize <a href="https://maintainability.com.sg/defect-library/green-tech/facade-coatings/cool-paint/" target="_blank" rel="noopener">cool surface materials and coatings</a> (e.g., heat reflective paints) to further reduce solar heat load.</li>
</ul><p>The project investigated many hyper-efficient options such as chilled beams (ceiling-mounted, air-to-liquid heat exchangers) combined with ground-based heat exchange (water or glycol circulating in a closed, buried ground loop), but unfortunately, none could handle 24 racks at 2 kW each. However, a loop thermosyphon system coupled with a high-efficiency compressor (a helper used during the hottest part of the hottest days) could achieve our PUE &lt;1.5 goal. Similarly, scaled-down chiller and computer room air conditioning (CRAC) units supplying liquid cooling into a thoughtfully designed floor plan could get us to that same goal.</p>
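<p>For concreteness, PUE is simply total facility power divided by the power consumed by the equipment itself, so the targets above are easy to sanity-check. The numbers below are illustrative, assuming a 24-rack site at 2 kW per rack (48 kW of equipment load):</p>

```cpp
// PUE = total facility power / IT equipment power.
// At PUE 3.0, a site burns 2 W of overhead (mostly HVAC) per watt of
// optics; the ILA Evo target of <1.5 cuts that to under 0.5 W per watt.
double pue(double itPowerWatts, double overheadWatts) {
  return (itPowerWatts + overheadWatts) / itPowerWatts;
}
```

<p>At PUE 3.0, that 48 kW equipment load implies a 144 kW total site draw; hitting the &lt;1.5 target brings the total below 72 kW.</p>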
<p>The project hasn’t made a final selection yet as these technologies will make our goal of “no cranes” difficult to achieve, but we believe we have line-of-sight to a solution.</p>
<h3>Backup Power</h3>
<p><strong>Investigate modern alternatives to diesel generators and lead-acid batteries for standby power.</strong> We are considering substitutions for diesel generator backup that may include H2 fuel cells, capacitors, or other solutions based on location and commercial feasibility.</p>
<p>Additionally, we have investigated a range of battery technologies. At the moment, the economics aren’t attractive for this scale. In the meantime, we believe an H2 fuel cell combined with a small NaNiCl molten salt battery system (to handle site load for a few minutes while the fuel cell spins up and takes load) is an attractive, low maintenance solution.</p>
<figure id="attachment_22132" aria-describedby="caption-attachment-22132" class="wp-caption alignnone c2"><img class="size-large wp-image-22132" src="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?w=1024" alt="" width="1024" height="305" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png 3630w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?resize=916,273 916w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?resize=768,228 768w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?resize=1024,305 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?resize=1536,457 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?resize=2048,609 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?resize=96,29 96w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-2.png?resize=192,57 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22132" class="wp-caption-text">ILA Evo site design.</figcaption></figure><h2>Macro benefits of ILA Evo</h2>
<p>Beyond the technical aspects, ILA Evo brings potentially powerful commercial and risk-management benefits in two areas: <strong>the supply chain and the ability to stockpile.</strong></p>
<p>The chief advantage of the familiar, current ILA supply chain is that the majority of the work is completed at a single site. However, this creates challenges in cost, in the complexity of scaling production up or down, and in the difficulty and expense of transporting the necessary heavy equipment to remote sites. Additionally, the feasibility of stockpiling current ILA buildings is debatable: Storing them would consume valuable real estate at the production site, and staging them off-site simply amplifies the transportation problem by requiring multiple moves.</p>
<figure id="attachment_22131" aria-describedby="caption-attachment-22131" class="wp-caption alignnone c2"><img class="size-large wp-image-22131" src="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?w=1024" alt="" width="1024" height="566" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png 2220w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?resize=916,506 916w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?resize=768,424 768w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?resize=1024,566 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?resize=1536,848 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?resize=2048,1131 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?resize=96,53 96w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-Current-ILA-SC-2.png?resize=192,106 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22131" class="wp-caption-text">Typical supply chain for today’s ILAs.</figcaption></figure><p>Looking ahead to ILA Evo’s new, disaggregated supply chain: many ISP materials and other items, like HVAC units, would most likely come from the usual sources, which already tend to be scaled. Other elements, such as the building system and MOFE ISP, would be manufactured or assembled by companies that are not necessarily part of today’s telecom ecosystem.</p>
<p>Additionally, the process is very different. Concrete and stick framed buildings are created via a <em>construction process</em>, the nature of which is bespoke and relatively low volume. ILA Evo is predominantly <em>manufacturing</em> or <em>assembly processes</em>, which are inherently geared toward scale, including the possibility of 24×7 operations.</p>
<figure id="attachment_22133" aria-describedby="caption-attachment-22133" class="wp-caption alignnone c2"><img class="size-large wp-image-22133" src="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?w=1024" alt="" width="1024" height="623" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png 2220w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?resize=916,557 916w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?resize=768,467 768w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?resize=1024,623 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?resize=1536,934 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?resize=2048,1245 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?resize=96,58 96w, https://engineering.fb.com/wp-content/uploads/2025/01/ILA-Innovation-_-ILA-Evo-Blog-Post-20241118-ILA-Evo-SC-2.png?resize=192,117 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22133" class="wp-caption-text">Supply Chain envisioned for ILA Evo.</figcaption></figure><p>Next, stockpiling. It’s not hard to imagine manufacturing 500 building systems, packed efficiently and ready for next day shipment. One can also imagine 2,000 MOFE ISP modules (four for each 24 rack building) preassembled and ready to ship. Additionally, purchasing in bulk allows vendors to de-risk their own investments and achieve scales required for cost compression not possible with today’s designs.</p>
<h2>Crossover with current ILA design</h2>
<p>One final consideration is identifying ideas that could be retrofitted into existing sites. Putting aside a different building system and MOFE ISP, the HVAC system and the other “efficiency tweaks” could go into existing sites. Similarly, H2 fuel cell backup power could be applied. We also expect to explore substituting commodity ISP materials for their FRP analogs. If this works, FRP equipment racks and ladder racks (which have a lower carbon footprint than steel or aluminum) could also be an option for existing sites.</p>
<h2>What’s next?</h2>
<p>Following the completion of our research and design phase, there are a number of next steps planned for 2025:</p>
<ul><li class="c1" aria-level="1">We have already engaged fiber optic operators in North America and Europe to gather early feedback and gain their insights. In the coming months, we expect to expand this consultation to operators in Latin America, Africa, the Middle East, and Asia.</li>
<li class="c1" aria-level="1">We plan to build a prototype site showcasing some of the best ideas from this work.</li>
<li class="c1" aria-level="1">We plan to create and broadly share blueprints, bills of material, and analyses that show our path and help seed operators’ own research, engineering, and technical real estate product development.</li>
</ul>]]></description>
      <link>https://engineering.fb.com/2025/01/10/production-engineering/ila-evo-in-line-amplifier-sites-meta/</link>
      <guid>https://engineering.fb.com/2025/01/10/production-engineering/ila-evo-in-line-amplifier-sites-meta/</guid>
      <pubDate>Fri, 10 Jan 2025 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Indexing code at scale with Glean]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re sharing details about <a href="https://glean.software/" target="_blank" rel="noopener">Glean</a>, Meta’s open source system for collecting, deriving and working with facts about source code.</li>
<li class="c1" aria-level="1">In this blog post we’ll talk about why a system like Glean is important, explain the rationale for Glean’s design, and run through some of the ways we’re using Glean to supercharge our developer tooling at Meta.</li>
</ul><p>In August 2021 we open-sourced our code indexing system <a href="https://glean.software/" target="_blank" rel="noopener">Glean</a>. Glean collects information about source code and provides it to developer tools through an efficient and flexible query language. We use Glean widely within Meta to power a range of developer tools including code browsing, code search, and documentation generation.</p>
<h2>Code Indexing</h2>
<p>Many tools that developers use rely on information extracted from the code they’re working on. For example:</p>
<ul><li class="c1" aria-level="1">Code navigation (“Go to definition”) in an IDE or a code browser;</li>
<li class="c1" aria-level="1">Code search;</li>
<li class="c1" aria-level="1">Automatically-generated documentation;</li>
<li class="c1" aria-level="1">Code analysis tools, such as dead code detection or linting.</li>
</ul><p>The job of collecting information from code is often called <em>code indexing</em>. A code indexing system’s job is to efficiently answer the questions your tools need to ask, such as, “Where is the definition of MyClass?” or “Which functions are defined in myfile.cpp?”</p>
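<p>As a toy illustration of the kinds of questions above, here is a minimal index sketch in Python. The class and field names are our own invention, not Glean’s data model:</p>

```python
# Minimal sketch of a code index: facts mapping symbols to their
# definition sites, queryable by symbol name or by file.
# All names here are illustrative -- not Glean's actual schema.
from collections import defaultdict

class CodeIndex:
    def __init__(self):
        self.defs = {}                    # symbol -> (file, line)
        self.by_file = defaultdict(list)  # file -> [symbol, ...]

    def add_definition(self, symbol, path, line):
        self.defs[symbol] = (path, line)
        self.by_file[path].append(symbol)

    def where_defined(self, symbol):
        """Where is the definition of `symbol`?"""
        return self.defs.get(symbol)

    def symbols_in(self, path):
        """Which symbols are defined in `path`?"""
        return self.by_file[path]

index = CodeIndex()
index.add_definition("MyClass", "myfile.cpp", 10)
index.add_definition("helper", "myfile.cpp", 42)

print(index.where_defined("MyClass"))  # ('myfile.cpp', 10)
print(index.symbols_in("myfile.cpp"))  # ['MyClass', 'helper']
```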
<p>An IDE will typically do indexing as needed, when you load a new file or project, for example. But the larger your codebase, the more important it becomes to do code indexing ahead of time: For large projects it becomes impractical to have the IDE process all of your project’s code at startup. Depending on what language you’re using, that point may come earlier or later; C++ in particular is problematic due to its long compile times.</p>
<p>Moreover, with a larger codebase and many developers working on it, it makes sense to have a shared centralized indexing system so that we don’t repeat the work of indexing on every developer’s machine. And as the data produced by indexing can become large, we want to make it available over the network through a query interface rather than having to download it.</p>
<p>This leads to an architecture like this:<img class="alignnone wp-image-22098" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-1.png?w=768" alt="" width="636" height="450" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-1.png 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-1.png?resize=96,68 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-1.png?resize=192,136 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>In practice the real architecture is highly distributed:</p>
<ul><li class="c1" aria-level="1">Indexing can be heavily parallelized and we may have many indexing jobs running concurrently;</li>
<li class="c1" aria-level="1">The query service will be widely distributed to support load from many clients that are also distributed;</li>
<li class="c1" aria-level="1">The databases will be replicated across the query service machines and also backed up centrally.</li>
</ul><p>We’ve found that having a centralized indexing infrastructure enables a wide range of powerful developer tools. We’ll talk about some of the ways we’ve deployed Glean shortly, but first we’ll dive into the rationale for Glean’s design.</p>
<h2>How is Glean different?</h2>
<p>Code indexing systems have been around for a while. For example, there’s a well-established format called <a href="https://microsoft.github.io/language-server-protocol/" target="_blank" rel="noopener">LSIF</a> used by IDEs that caches information about code navigation.</p>
<p>When we designed Glean we wanted a system that wasn’t tied either to particular programming languages or to any particular use case. While we had some use cases in mind that we wanted to support—primarily code navigation of course—we didn’t want to design the system around one use case, in the hope that a more general system would support emerging requirements further into the future.</p>
<p>Therefore:</p>
<ul><li class="c1" aria-level="1"><strong>Glean doesn’t decide for you what data you can store</strong>. Indeed, most languages that Glean indexes have their own data schema and Glean can store arbitrary non-programming-language data too. The data is ultimately stored using <a href="https://rocksdb.org/" target="_blank" rel="noopener">RocksDB</a>, providing good scalability and efficient retrieval.</li>
<li class="c1" aria-level="1"><strong>Glean’s query language is very general</strong>. It’s a declarative logic-based query language that we call <em>Angle</em> (“Angle” is an anagram of “Glean”, and means “to fish”). Angle supports <em>deriving</em> information automatically, either on-the-fly at query time or ahead of time; this is a powerful mechanism that enables Glean to abstract over language-specific data and provide a language-neutral view of the data.</li>
</ul><p>Storing arbitrary language-specific data can be very powerful. For example, in C++ we use the detailed data to detect dead code such as unused #include or using statements. The latter in particular is rather tricky to do correctly and requires the data to include some C++-specific details, such as which using statement is used to resolve each symbol reference.</p>
<p>On the other hand, clients often don’t want the full language-specific data. They want to work at a higher level of abstraction. Imagine asking questions like, “Give me the names and locations of all the declarations in this file”, which should work for any language, and which you could use to implement a code outline feature in a code browser. Glean can provide this language-neutral view of the data by defining an abstraction layer in the schema itself – the mechanism is similar to SQL views if you’re familiar with those. This means that we don’t have to compromise between having detailed language-specific data or a lowest-common-denominator language-neutral view; we can have both.</p>
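<p>The view mechanism can be pictured with a small Python sketch: two language-specific fact shapes (invented here for illustration) are exposed through one derived, language-neutral query, loosely analogous to a SQL view:</p>

```python
# Language-specific facts -- illustrative shapes, not Glean's schemas.
cxx_decls = [{"name": "parseJson", "file": "json.cpp", "line": 120}]
py_decls = [{"func": "load", "path": "loader.py", "lineno": 7}]

def file_declarations(path):
    """A derived, language-neutral view: 'what declarations are in
    this file?' answered uniformly regardless of language."""
    out = []
    for d in cxx_decls:          # map C++-specific fields...
        if d["file"] == path:
            out.append((d["name"], d["line"]))
    for d in py_decls:           # ...and Python-specific fields
        if d["path"] == path:    # into one common shape.
            out.append((d["func"], d["lineno"]))
    return out

print(file_declarations("json.cpp"))   # [('parseJson', 120)]
print(file_declarations("loader.py"))  # [('load', 7)]
```

<p>The detailed, language-specific facts remain available underneath; the view is just a uniform way to ask a common question.</p>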
<p>This generality has allowed Glean to extend to a number of use cases beyond what we originally envisaged. We’ll cover some of those later in this post.</p>
<h2>A taste of Angle</h2>
<p>Glean has a unified language, Angle, for specifying both schemas and queries. As mentioned above, each language that we index has its own schema. To give you a flavor of this, here’s a fragment of the schema for C++ function declarations:</p>
<p><img class="size-large wp-image-22099 alignnone" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-2.png?w=380" alt="" width="380" height="206" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-2.png 380w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-2.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-2.png?resize=192,104 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Defining a schema for Glean is just like writing a set of type definitions. The braces surround a record definition, with a set of fields and their types. </p>
<ul><li class="c1" aria-level="1">A FunctionDeclaration is a <em>predicate</em> (roughly equivalent to a table in SQL). </li>
<li class="c1" aria-level="1">The instances of a predicate are called <em>facts</em> (roughly equivalent to rows in SQL). </li>
<li class="c1" aria-level="1">A predicate is a thing that you can query, and a query returns facts. </li>
</ul><p>To query efficiently you specify a prefix of the fields. So, for example, we can retrieve a particular FunctionDeclaration efficiently if we know its name.</p>
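<p>One way to picture the prefix rule is as a range scan over facts stored in sorted field order. This Python sketch is our own illustration, not Glean’s actual storage encoding:</p>

```python
import bisect

# Facts keyed by (name, namespace, file) and kept sorted, so any
# query that fixes a *prefix* of the fields becomes a binary search
# plus a contiguous scan. Illustrative only -- not Glean's encoding.
facts = sorted([
    ("parseJson", "folly", "json.h"),
    ("parseJson", "other", "x.h"),
    ("toJson", "folly", "json.h"),
])

def query(prefix):
    lo = bisect.bisect_left(facts, prefix)  # jump to the first match
    out = []
    for fact in facts[lo:]:
        if fact[:len(prefix)] != prefix:    # left the matching range
            break
        out.append(fact)
    return out

print(query(("parseJson", "folly")))  # just the folly fact
print(query(("parseJson",)))          # both parseJson facts
```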
<p>Let’s write a query to find the function folly::parseJson:</p>
<p><img class="size-large wp-image-22100 alignnone" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-3.png?w=562" alt="" width="562" height="83" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-3.png 562w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-3.png?resize=96,14 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-3.png?resize=192,28 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>Without going into all the details, at a high level this query specifies that we want to find FunctionDeclaration facts that have a particular name and namespace. Glean can return results for this query in about a millisecond.</p>
<p>Angle supports more complex queries too. For example, to find all classes that inherit from a class called exception and have a method called what that overrides a method in a base class:</p>
<p><img class="size-large wp-image-22101 alignnone" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-4.png?w=717" alt="" width="717" height="240" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-4.png 717w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-4.png?resize=96,32 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-4.png?resize=192,64 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>This query returns the first results in a few milliseconds, and because there might be a lot of results we can fetch the results incrementally from the query server.</p>
<h2>Incremental indexing</h2>
<p>An important innovation in Glean is the ability to index <em>incrementally</em>. As the codebase grows, and the rate of change of the codebase increases (a monorepo suffers from both of these problems) we find that we can’t provide up-to-date information about the latest code because indexing the entire repository can take a long time. The index is perpetually out of date, perhaps by many hours.</p>
<p>The solution to this scaling problem is to process <em>just the changes</em>. In terms of computer science big-O notation, we want the cost of indexing to be <em>O(changes)</em> rather than <em>O(repository)</em>.</p>
<p>But actually achieving this is not as straightforward as it might sound.</p>
<p>We don’t want to destructively modify the original data, because we would like to be able to provide data at multiple revisions of the repository, and to do that without storing multiple full-sized copies of the data. So we would like to store the changes in such a way that we can view the whole index at both revisions simultaneously.</p>
<p>Even if we figure out a way to represent the changes, in practice it isn’t possible to achieve <em>O(changes)</em> for many programming languages. For example, in C++ if a header file is modified, we have to reprocess every source file that depends on it (directly or indirectly). We call this the <em>fanout</em>. So in practice the best we can do is <em>O(fanout)</em>.</p>
<p>Glean solves the first problem with an ingenious method of <em>stacking</em> immutable databases on top of each other. A stack of databases behaves just like a single database from the client’s perspective, but each layer in the stack can non-destructively add information to, or hide information from, the layers below. </p>
<p><img class="alignnone size-large wp-image-22102" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-5.png?w=569" alt="" width="569" height="458" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-5.png 569w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-5.png?resize=96,77 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-5.png?resize=192,155 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>The full details are beyond the scope of this post; for more on how incrementality works, see <a href="https://glean.software/blog/incremental/" target="_blank" rel="noopener">Incremental indexing with Glean</a>.</p>
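<p>The stacking behavior can be sketched as an overlay of immutable key-value layers, where an upper layer can add facts or hide facts in the layers below. This simplified Python model is ours, not Glean’s implementation:</p>

```python
class Layer:
    """One immutable database in the stack: added facts plus a set
    of keys it hides from the layers beneath it."""
    def __init__(self, added=None, hidden=None):
        self.added = dict(added or {})
        self.hidden = set(hidden or ())

class Stack:
    """A stack of layers that behaves like a single database."""
    def __init__(self, layers):
        self.layers = layers  # bottom layer first

    def lookup(self, key):
        for layer in reversed(self.layers):  # newest layer wins
            if key in layer.hidden:
                return None                  # non-destructively hidden
            if key in layer.added:
                return layer.added[key]
        return None

base = Layer(added={"f": "defined in util.cpp"})
delta = Layer(added={"g": "defined in new.cpp"}, hidden={"f"})

# Both revisions stay queryable: [base] alone, or [base, delta].
print(Stack([base]).lookup("f"))        # visible at the old revision
print(Stack([base, delta]).lookup("f")) # hidden at the new revision
print(Stack([base, delta]).lookup("g")) # added at the new revision
```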
<p>Finding the fanout of a set of changes is different for each language. Interestingly, the fanout can often be obtained using Glean queries: For C++, for example, the fanout is calculated by finding all the files that #include one of the changed files, and then repeating that query until there are no more files to find.</p>
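<p>That repeat-until-fixed-point process amounts to a transitive closure over a reverse #include graph. Here is a sketch using an invented in-memory graph where the real system would issue Glean queries at each step:</p>

```python
from collections import deque

# Reverse dependency edges: header -> files that #include it directly.
# Hypothetical graph; in practice each expansion is a Glean query.
included_by = {
    "config.h": ["base.h", "net.cpp"],
    "base.h": ["app.cpp", "net.cpp"],
}

def fanout(changed):
    """All files needing re-indexing after `changed` files change."""
    seen = set(changed)
    queue = deque(changed)
    while queue:                      # repeat until no new files appear
        f = queue.popleft()
        for dep in included_by.get(f, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(fanout({"config.h"})))
# ['app.cpp', 'base.h', 'config.h', 'net.cpp']
```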
<h2>How we use Glean at Meta</h2>
<h3>Code navigation</h3>
<p>Code navigation at scale, on large monorepos containing millions of lines in diverse programming languages, is a challenging problem. But what makes it different from the code navigation support available in modern IDEs, other than scale? In our experience, code indexing à la Glean offers the following advantages over IDEs:</p>
<ol><li class="c1" aria-level="1">Instantly available: Just open the code browser web app (our internal tool uses Monaco) and navigate without waiting for the IDE, build system, and LSP server to initialize.</li>
<li class="c1" aria-level="1">More widely available: You can integrate code navigation in pretty much any app that shows code! One particularly useful integration is in your code review tool (ours is called Phabricator), but more on that later.</li>
<li class="c1" aria-level="1">Full repo visibility: Glean allows you to, for example, find all the references to a function, not just the ones visible to the IDE. This is particularly useful for finding dead code, or finding clients of an API that you want to change.</li>
<li class="c1" aria-level="1">Symbol search for all the languages across the whole repository.</li>
<li class="c1" aria-level="1">Cross language navigation: A common situation that comes up is a remote procedure call (RPC). When browsing the code you might want to jump to the service definition or, indeed, to the service implementation itself. Another case is languages with a foreign function interface (FFI), where you would like to browse from an FFI call to the corresponding definition in the target language.</li>
</ol><p>Our architecture for code navigation is based on <a href="https://github.com/facebookincubator/Glean/tree/main/glean/glass" target="_blank" rel="noopener">Glass</a>, a symbol server that abstracts all the complexities of Glean by implementing the usual code navigation logic in a simple but powerful API. The code browser needs only a single Glass API call, <em>documentSymbols(repo,path,revision),</em> to obtain a list of all the definitions and references in a source file, including source and target spans. The list of definitions is used to render an outline of the file, and the list of references to render underlines that can be hovered over or clicked to navigate. Finally, other code browser features like Find References or Call Hierarchy are also driven by API calls to Glass. </p>
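<p>To illustrate how a client might consume a documentSymbols result, here is a hypothetical sketch: the documentSymbols(repo, path, revision) call is described above, but the response field names below are our own invention, not Glass’s actual wire format:</p>

```python
from dataclasses import dataclass

# Hypothetical response shape for Glass's documentSymbols call.
@dataclass
class Span:
    start: int
    end: int

@dataclass
class Symbol:
    name: str
    kind: str   # "definition" or "reference" (illustrative)
    span: Span

def document_symbols(repo, path, revision):
    """Stand-in for the real Glass RPC, returning canned data."""
    return [
        Symbol("parseJson", "definition", Span(120, 129)),
        Symbol("folly::dynamic", "reference", Span(201, 215)),
    ]

symbols = document_symbols("myrepo", "folly/json.cpp", "abc123")
outline = [s.name for s in symbols if s.kind == "definition"]
underlined = [s.span for s in symbols if s.kind == "reference"]
print(outline)     # ['parseJson'] -> rendered as the file outline
print(underlined)  # spans to underline for click-to-navigate
```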
<p><img class="alignnone size-large wp-image-22103" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-6.png?w=844" alt="" width="844" height="186" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-6.png 844w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-6.png?resize=768,169 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-6.png?resize=96,21 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-6.png?resize=192,42 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>The code for Glass is also open source; you can find it in <a href="https://github.com/facebookincubator/Glean/tree/main/glean/glass" target="_blank" rel="noopener">glean/glass</a> on GitHub.</p>
<h3>Speeding up the IDE</h3>
<p>Using an IDE such as VS Code on a large project, or a project with a large set of dependencies, or in a large monorepo tends to lead to a degraded experience as the IDE isn’t able to analyze all the code that you might want to explore. At Meta we’re using Glean to plug this gap for C++ developers: Because Glean has already analyzed the whole repository, C++ developers have access to basic functionality such as go-to-definition, find-references, and doc comment hovercards for the whole repository immediately on startup. As the IDE loads the files the developer is working on, the C++ language service seamlessly blends the Glean-provided data with that provided by the native clangd backend.</p>
<p>Our target was C++ developers initially because that group typically has the worst IDE experience due to the long compile times, but the approach is not specific to C++ and we imagine other languages following the same path in the future.</p>
<h3>Documentation generation</h3>
<p>The data we store in Glean includes enough information to reconstruct the full details of an API: classes, methods, type signatures, inheritance, and so on. Glean also collects documentation from the source code when it uses the standard convention for the language, e.g., in C++ the convention is /// comment or /** comment */. With API data and documentation strings in Glean we can produce automatically-generated documentation on demand. </p>
<p>Here’s an example page for the folly::Singleton type:</p>
<p><img class="alignnone size-large wp-image-22104" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png?w=1024" alt="" width="1024" height="664" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png 1672w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png?resize=916,594 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png?resize=768,498 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png?resize=1024,664 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png?resize=1536,997 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png?resize=96,62 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Glean-image-7.png?resize=192,125 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<p>The data for these pages is produced by Glass and rendered by a client-side UI. The documentation is fully hyperlinked so the user can navigate around all the APIs throughout the repository easily. Meta engineers get consistent code documentation integrations across all the programming languages supported by Glean.</p>
<h3>Symbol IDs</h3>
<p>Glass assigns every symbol a <em>symbol ID</em>, a unique string that identifies the symbol. For example, the symbol ID for folly::Singleton would be something like REPOSITORY/cpp/folly/Singleton. The symbol ID can be used to link directly to the documentation page for the symbol, so there’s a URL for every symbol that doesn’t change even if the symbol’s definition moves around.</p>
<p>We can use the symbol ID to request information about a symbol from Glass, for example to find all the references to the symbol throughout the repository. All of this works for every language, although the exact format for a symbol ID varies per language.</p>
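<p>Using the example format above (REPOSITORY/cpp/folly/Singleton), building and unpacking such an ID is simple string handling. The helper names here are hypothetical, and real formats vary per language:</p>

```python
def make_symbol_id(repo, lang, qualified_name):
    """Build a stable ID like 'REPOSITORY/cpp/folly/Singleton'.
    Hypothetical helper -- the real format varies by language."""
    return "/".join([repo, lang] + qualified_name.split("::"))

def parse_symbol_id(symbol_id):
    """Recover (repo, lang, qualified name) from a symbol ID."""
    repo, lang, *parts = symbol_id.split("/")
    return repo, lang, "::".join(parts)

sid = make_symbol_id("REPOSITORY", "cpp", "folly::Singleton")
print(sid)                   # REPOSITORY/cpp/folly/Singleton
print(parse_symbol_id(sid))  # ('REPOSITORY', 'cpp', 'folly::Singleton')
```

<p>Because the ID is derived from the symbol’s qualified name rather than its location, it stays stable when the definition moves between files.</p>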
<h3>Analyzing code changes</h3>
<p>Glean indexing runs on diffs (think “pull requests”) to extract a mechanical summary of the changeset that we call a <em>diff sketch</em>. For example, a diff might introduce a new class, remove a method, add a field to a type, introduce a new call to a function, and so on. The diff sketch lists all of these changes in a machine-readable form.</p>
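<p>A diff sketch might be represented along these lines; the shape and change-kind names below are illustrative, not Meta’s actual format:</p>

```python
from dataclasses import dataclass

@dataclass
class Change:
    kind: str    # e.g. "add_class", "remove_method", "add_call"
    symbol: str

# Machine-readable summary of what one diff changed -- illustrative.
diff_sketch = [
    Change("add_class", "cache::LruCache"),
    Change("remove_method", "cache::OldCache::evict"),
    Change("add_call", "log::warn"),
]

# A simple static-analysis rule over the sketch: flag removed
# methods so a reviewer can check for remaining callers.
flagged = [c.symbol for c in diff_sketch if c.kind == "remove_method"]
print(flagged)  # ['cache::OldCache::evict']
```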
<p>Diff sketches are used to drive a simple static analysis that can identify potential issues that might require further review. They can also be used to drive non-trivial lint rules, rich notifications, and semantic search over commits. One example of the latter is connecting a production stack trace to recent commits that modified the affected function(s), to help root-cause performance issues or new failures.</p>
<p>Indexing diffs also powers code navigation in our code review tools, giving code reviewers access to accurate go-to-definition on the code changes being reviewed, along with other code insights such as type-on-hover and documentation. This is a powerful boost to the code review process, making it easier for reviewers to understand the changes and provide valuable review feedback. At Meta this is enabled for a <a href="https://engineering.fb.com/2022/07/27/developer-tools/programming-languages-endorsed-for-server-side-use-at-meta/" target="_blank" rel="noopener">variety of different languages</a>, including C++, Python, PHP, JavaScript, <a href="https://engineering.fb.com/2021/04/29/developer-tools/rust/" target="_blank" rel="noopener">Rust</a>, Erlang, Thrift, and even Haskell.</p>
<h2>More applications for Glean</h2>
<p>Aside from the primary applications described above, Glean is also used to:</p>
<ul><li class="c1" aria-level="1">Analyze build dependency graphs.</li>
<li class="c1" aria-level="1"><a href="https://engineering.fb.com/2023/10/24/data-infrastructure/automating-dead-code-cleanup/" target="_blank" rel="noopener">Detect and remove dead code</a>.</li>
<li class="c1" aria-level="1">Track the progress of API migrations.</li>
<li class="c1" aria-level="1">Measure various metrics that contribute to code complexity.</li>
<li class="c1" aria-level="1">Track test coverage and select tests to run.</li>
<li class="c1" aria-level="1"><a href="https://engineering.fb.com/2023/10/31/data-infrastructure/automating-data-removal/" target="_blank" rel="noopener">Automate data removal</a>.</li>
<li class="c1" aria-level="1">Provide context for retrieval-augmented generation (RAG) in AI coding assistants.</li>
</ul><p>Furthermore, an ever-growing number of ad-hoc queries are made by various people and systems to solve a variety of problems. Having a system like Glean means you can ask questions about your code: We don’t know all the questions we might want to ask, nor do we know all the data we might want to store, so Glean deliberately aims to be as general as possible on both fronts.</p>
<h2>Try Glean today</h2>
<p>Visit the <a href="https://glean.software/" target="_blank" rel="noopener">Glean site</a> for more details, technical documentation, and information on how to get started.</p>]]></description>
      <link>https://engineering.fb.com/2024/12/19/developer-tools/glean-open-source-code-indexing/</link>
      <guid>https://engineering.fb.com/2024/12/19/developer-tools/glean-open-source-code-indexing/</guid>
      <pubDate>Thu, 19 Dec 2024 15:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Translating Java to Kotlin at Scale]]></title>
      <description><![CDATA[<ul><li>Meta has been on a years-long undertaking to translate our entire Android codebase from Java to Kotlin.</li>
<li>Today, despite having one of the largest Android codebases in the world, we’re well past the halfway point and still going.</li>
<li>We’re sharing some of the tradeoffs we’ve made to support automating our transition to Kotlin, seemingly simple transformations that are surprisingly tricky, and how we’re collaborating with other companies to capture hundreds more corner cases.</li>
</ul><p>Android development at Meta has been Kotlin-first since 2020, and developers have been saying they prefer Kotlin as a language for even longer.</p>
<p>But adoption doesn’t necessarily entail translation. We could simply decide to write all new code in Kotlin and leave our existing Java code as is, just as many other companies have. Or we could take it a little further and translate just the most important files. Instead, we decided that the only way to leverage the full value of Kotlin was to go all in on conversion, even if it meant building our own infrastructure to automate translation at scale. So, a few years ago, engineers at Meta decided to take <a href="https://engineering.fb.com/2022/10/24/android/android-java-kotlin-migration/" target="_blank" rel="noopener">roughly ten million lines of perfectly good Java code and rewrite them in Kotlin</a>.</p>
<p>Of course, we had to solve problems beyond translation, such as slow build speeds and insufficient linters. To learn more about Meta’s broader adoption effort, see Omer Strulovich’s 2022 blog post on our <a href="https://engineering.fb.com/2022/10/24/android/android-java-kotlin-migration/" target="_blank" rel="noopener">migration from Java to Kotlin</a> or Lisa Watkin’s talk about <a href="https://atscaleconference.com/videos/kotlin-instagram/" target="_blank" rel="noopener">Kotlin adoption at Instagram</a>.</p>
<div class="jetpack-video-wrapper"><iframe title="Translating Java to Kotlin at Scale | Eve Matthaey" width="1778" height="1000" src="https://www.youtube.com/embed/zfnOjAYdWrc?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h1>How much translation is enough?</h1>
<p>To maximize our gains in developer productivity and null safety, we’re aiming to translate virtually all of our actively developed code, <em>plus</em> any code that’s central in the dependency graph. Not surprisingly, that’s most of our code, which adds up to tens of millions of lines, including some of the most complex files.</p>
<p>It’s pretty intuitive that if we want to maximize productivity gains, we should translate our actively developed code. It’s a little less obvious why translating beyond that provides incremental null-safety benefits. The short answer is that any remaining Java code can be an agent of nullability chaos, especially if it’s not null safe and even more so if it’s central to the dependency graph. (For a more detailed explanation, see the section below on null safety.)</p>
<p>We also want to minimize the drawbacks of a mixed codebase. As long as we have substantial amounts of Java, we need to continue supporting parallel tool chains. There’s also the much-lamented issue of slower build speeds: Compiling Kotlin is slower than compiling Java, but compiling both together is the slowest of all. </p>
<h1>How did we get here?</h1>
<p>Like most folks in the industry, we started migrating incrementally by repeatedly clicking a button in the IntelliJ IDE. This button would trigger <a href="https://github.com/JetBrains/intellij-community/tree/master/plugins/kotlin/j2k" target="_blank" rel="noopener">IntelliJ’s translation tool</a>, commonly known as J2K. It quickly became clear that this approach wasn’t going to scale for a codebase of our size: We would have to click that button—and then wait the couple of minutes it takes to run—almost 100,000 times to translate our Android codebase. </p>
<p>With this in mind, we set out to automate the conversion process and minimize interference with our developers’ daily work. The result was a tool we call the Kotlinator that we built around J2K. It now comprises six phases:</p>
<ol><li class="c1" aria-level="1"><strong>“Deep” build:</strong> Building the code we’re about to translate helps the IDE resolve all the symbols, especially when third-party dependencies or generated code are involved.</li>
<li class="c1" aria-level="1"><strong>Preprocessing:</strong> This phase is built on top of our custom tool, Editus. It contains about 50 steps for nullability, J2K workarounds, changes to support our custom DI framework, and more.</li>
<li class="c1" aria-level="1"><strong>Headless J2K:</strong> The J2K we know and love, but server-friendly!</li>
<li class="c1" aria-level="1"><strong>Postprocessing:</strong> This phase is similar in architecture to our preprocessing. It consists of about 150 steps for Android-specific changes, as well as more nullability changes, and tweaks to make the resulting Kotlin more idiomatic.</li>
<li class="c1" aria-level="1"><strong>Linters:</strong> Running our linters with autofixes allows us to implement perennial fixes in a way that benefits both conversion diffs and regular diffs going forward.</li>
<li class="c1" aria-level="1"><strong>Build error-based fixes:</strong> Finally, the Kotlinator makes even more fixes based on build errors. After a failed build of the just-translated code, we parse the errors and apply further fixes (e.g., adding a missing import or inserting a !!).</li>
</ol><p>We’ll dive into more detail on the most interesting phases below.</p>
<h2>Going headless with J2K</h2>
<p>The first step was creating a headless version of J2K that could run on a remote machine—not easy, given how tightly coupled J2K and the rest of the IntelliJ IDE are. We considered a few approaches, including running J2K using a setup similar to IntelliJ’s testing environment, but after talking to JetBrains’ J2K expert, Ilya Kirillov, we eventually settled on something more like a headless inspection. To implement this approach, we created an IntelliJ plugin that includes a class extending ApplicationStarter and calling directly into the JavaToKotlinConverter class that’s also referenced by the IDE’s conversion button.</p>
<p>On top of not blocking developers’ local IDEs, the headless approach allowed us to translate multiple files at once, and it unblocked all sorts of helpful but time-consuming steps, like the “build and fix errors” process detailed below. Overall conversion time grew longer (a typical remote conversion now takes about 30 minutes to run), but time spent by the developers decreased substantially.</p>
<p>Of course, going headless presents another conundrum: If developers aren’t clicking the button themselves, who decides what to translate, and how does it get reviewed and shipped? The answer turned out to be pretty easy: Meta has an internal system that allows developers to set up what is essentially a cron job that produces a daily batch of <a href="https://engineering.fb.com/2024/10/25/developer-tools/diff-authoring-time-dat-measuring-developer-productivity-meta/">diffs</a> (our version of pull requests) based on user-defined selection criteria. This system also helps choose relevant reviewers, ensures that tests and other validations pass, and ships the diff once it’s approved by a human. We also offer a web UI for developers to trigger a remote conversion of a specific file or module; behind the scenes, it runs the same process as the cron job.</p>
<p>As for choosing what and when to translate, we don’t enforce any particular order beyond prioritizing actively developed files. At this point, the Kotlinator is sophisticated enough to handle most compatibility changes required in external files (for example, changing Kotlin dependents’ references of foo.getName() to foo.name), so there’s no need to order our translations based on the dependency graph. </p>
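<p>As a toy illustration of that kind of dependent fix, the sketch below rewrites Kotlin call sites of the form <code>foo.getName()</code> to <code>foo.name</code>. This is only a string-level approximation with invented names; the real tooling resolves symbols before rewriting, rather than pattern-matching text:</p>
<pre class="line-numbers"><code class="language-java">import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetterCallRewriter {
  // Matches calls like `.getName()` and captures the pieces of the property name.
  private static final Pattern GETTER_CALL = Pattern.compile("\\.get([A-Z])(\\w*)\\(\\)");

  // Rewrites every `x.getFoo()` in the given source text to `x.foo`.
  public static String rewrite(String source) {
    Matcher m = GETTER_CALL.matcher(source);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      String property = Character.toLowerCase(m.group(1).charAt(0)) + m.group(2);
      m.appendReplacement(out, "." + property);
    }
    m.appendTail(out);
    return out.toString();
  }
}</code></pre>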
<h2>Adding custom pre- and post-conversion steps</h2>
<p>Due to the size of our codebase and the custom frameworks we use, the vast majority of conversion diffs produced by the vanilla J2K would not build. To address this problem, we added two custom phases to our conversion process, preprocessing and postprocessing. Both phases contain dozens of steps that take in the file being translated, analyze it (and sometimes its dependencies and dependents, too), and perform a Java-&gt;Java or Kotlin-&gt;Kotlin transformation if needed. <a href="https://github.com/fbsamples/kotlin_ast_tools" target="_blank" rel="noopener">A few of our postprocessing transformations have been open-sourced</a>.</p>
<p>These custom translation steps are built on top of an internal metaprogramming tool that leverages JetBrains’ PSI libraries for both Java and Kotlin. Unlike most metaprogramming tools, it is very much <em>not</em> a compiler plugin, so it can analyze broken code across both languages, and does so very quickly. This is especially helpful for postprocessing because it’s often running on code with compilation errors, doing analysis that requires type information. Some postprocessing steps that deal with dependents may need to resolve symbols across several thousand unbuildable Java and Kotlin files. For example, one of our postprocessing steps helps translate interfaces by examining their Kotlin implementers and updating overridden getter functions to instead be overridden properties, like in the example below.</p>
<pre class="line-numbers"><code class="language-kotlin">interface JustConverted {
  val name: String // I used to be a method called `getName`
}
</code></pre>
<pre class="line-numbers"><code class="language-kotlin">// A dependent converted earlier, now broken by the interface change:
class ConvertedAWhileAgo : JustConverted {
  override fun getName(): String = "JustConvertedImpl"
}</code></pre>
<pre class="line-numbers"><code class="language-kotlin">// The same dependent after the overridden getter is rewritten as a property:
class ConvertedAWhileAgo : JustConverted {
  override val name: String = "JustConvertedImpl"
}</code></pre>
<p>The downside to this tool’s speed and flexibility is that it can’t always provide answers about type information, especially when symbols are defined in third-party libraries. In those cases, it bails quickly and obviously, so we don’t execute a transformation with false confidence. The resulting Kotlin code might not build, but the appropriate fix is usually pretty obvious to a human (if a little tedious).</p>
<p>We originally added these custom phases to reduce developer effort, but over time we also leveraged them to reduce developer unreliability. Contrary to popular belief, we’ve found it’s often safer to leave the most delicate transformations to bots. There are certain fixes we’ve automated as part of postprocessing, even though they aren’t strictly necessary, because we want to minimize the temptation for human (i.e., error-prone) intervention. One example is condensing long chains of null checks: The resulting Kotlin code isn’t more correct, but it’s less susceptible to a well-meaning developer accidentally dropping a negation. </p>
<h2>Leveraging build errors</h2>
<p>In the course of doing our own conversions, we noticed that we spent a lot of time at the end repeatedly building and fixing our code based on the compiler’s error messages. In theory, we could fix many of these problems in our custom postprocessing, but doing so would require us to reimplement a lot of complex logic that’s baked into the Kotlin compiler. </p>
<p>Instead, we added a new, final step in the Kotlinator that leverages the compiler’s error messages the same way a human would. Like postprocessing, these fixes are performed with metaprogramming tooling that can analyze unbuildable code.</p>
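<p>A minimal sketch of one such error-driven fix, assuming a simplified error format: when the compiler reports a nullability mismatch on a line, append <code>!!</code> to the offending argument. Both the error-message pattern and the fix heuristic here are illustrative inventions, far cruder than the real implementation:</p>
<pre class="line-numbers"><code class="language-java">import java.util.regex.Pattern;

public class BuildErrorFixer {
  // Shape of a Kotlin nullability error, e.g.
  // "type mismatch: inferred type is String? but String was expected"
  private static final Pattern NULLABILITY_MISMATCH =
      Pattern.compile("type mismatch: inferred type is (\\S+)\\? but \\1 was expected");

  // If the reported error is a nullability mismatch, insert `!!` on the
  // final argument of the offending call; otherwise leave the line untouched.
  public static String fix(String sourceLine, String errorMessage) {
    if (NULLABILITY_MISMATCH.matcher(errorMessage).find()) {
      return sourceLine.replaceFirst("\\)\\s*$", "!!)");
    }
    return sourceLine;
  }
}</code></pre>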
<h2>The limitations of custom tooling</h2>
<p>Between the preprocessing, postprocessing, and post-build phases, the Kotlinator contains well over 200 custom steps. Unfortunately, some conversion issues simply can’t be solved by adding even more steps.</p>
<p>Originally we treated J2K as a black box—even though it was open sourced—because its code was complex and not actively developed; diving in and submitting PRs didn’t seem worth the effort. That changed early in 2024, however, when JetBrains began work to make J2K compatible with the new Kotlin compiler, K2. We took the opportunity to work with JetBrains to improve J2K and address problems that had been plaguing us for years, such as disappearing override keywords.</p>
<p>Collaborating with JetBrains also gave us the opportunity to insert hooks into J2K that would allow clients like Meta to run their own custom steps directly in the IDE before and after conversion. This may sound strange, given the number of custom processing steps we’ve already written, but there are a couple of major benefits:</p>
<ol><li class="c1" aria-level="1"><strong>Improved symbol resolution</strong>. Our custom symbol resolution is fast and flexible, but it’s less precise than J2K’s, especially when it comes to resolving symbols defined in third-party libraries. Porting some of our preprocessing and postprocessing steps over to leverage J2K’s extension points will make them more accurate, and allow us to use Intellij’s more sophisticated static-analysis tooling.</li>
<li class="c1" aria-level="1"><strong>Easier open sourcing and collaboration</strong>. Some of our custom steps are too Android-specific to be incorporated into J2K but might still be useful to other companies. Unfortunately, most of them depend on our custom symbol resolution. Porting these steps over to instead rely on J2K’s symbol resolution gives us the option to open-source them and benefit from the community’s pooled efforts.</li>
</ol><h1>But first, null safety!</h1>
<p>In order to translate our code without spewing null-pointer exceptions (NPEs) everywhere, it first needs to be null safe (by “null safe” we mean code checked by a static analyzer such as <a href="https://github.com/facebook/infer/blob/main/infer/annotations/src/main/java/com/facebook/infer/annotation/Nullsafe.java" target="_blank" rel="noopener">Nullsafe</a> or <a href="https://github.com/uber/NullAway" target="_blank" rel="noopener">NullAway</a>). Null safety still isn’t sufficient to eliminate the possibility of NPEs, but it’s an excellent start. Unfortunately, making code null safe is easier said than done.</p>
<h2>Even null-safe Java throws NPEs sometimes</h2>
<p>Anyone who has worked with null-safe Java code long enough knows that while it’s more reliable than vanilla Java code, it’s still prone to NPEs. Unfortunately <a href="https://engineering.fb.com/2022/11/22/developer-tools/meta-java-nullsafe/" target="_blank" rel="noopener">static analysis is only 100% effective for 100% code coverage</a>, which is simply not viable in any large mobile codebase that interacts with the server and third-party libraries.</p>
<p>Here’s a canonical example of a seemingly innocuous change that can introduce an NPE:</p>
<p><em>MyNullsafeClass.java</em></p>
<pre class="line-numbers"><code class="language-java">@Nullsafe
public class MyNullsafeClass {
  void doThing(String s) {
    // can we safely add this dereference?
    // s.length();
  }
}</code></pre>
<p>Say there are a dozen dependents that call MyNullsafeClass::doThing. A single non-null-safe dependent could pass in a null argument (for example, new MyNullsafeClass().doThing(null)), which would lead to an NPE if a dereference is inserted in the body of doThing. </p>
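<p>To make that failure mode concrete, here is a runnable version of the scenario (the @Nullsafe annotation is omitted so the snippet compiles standalone, and LegacyCaller is an invented stand-in for a dependent that no null-safety analyzer checks):</p>
<pre class="line-numbers"><code class="language-java">public class MyNullsafeClass {
  void doThing(String s) {
    // The dereference we wanted to add; safe only if every caller is null safe.
    System.out.println(s.length());
  }
}

// A dependent outside the null-safety-checked portion of the codebase.
class LegacyCaller {
  static void run() {
    // Compiles without complaint, then throws NullPointerException
    // at the dereference inside doThing.
    new MyNullsafeClass().doThing(null);
  }
}</code></pre>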
<p>Of course, while we can’t <em>eliminate</em> NPEs in Java via null-safety coverage, we can greatly reduce their frequency. In the example above, NPEs are possible but fairly rare when there’s only one non-null-safe dependent. If multiple transitive dependents lacked null safety, or if one of the more central dependent nodes did, the NPE risk would be much higher.</p>
<h2>What makes Kotlin different</h2>
<p>The biggest difference between null-safe Java and Kotlin is the presence of <a href="https://kotlinlang.org/docs/java-interop.html#null-safety-and-platform-types" target="_blank" rel="noopener">runtime validation in Kotlin bytecode</a> at the interlanguage boundary. This validation is invisible but powerful because it allows developers to trust the stated nullability annotations in any code they’re modifying or calling.</p>
<p>If we return to our earlier example, MyNullsafeClass.java, and translate it to Kotlin, we get something like:</p>
<p><em>MyNullsafeClass.kt</em></p>
<pre class="line-numbers"><code class="language-kotlin">class MyNullsafeClass {
  fun doThing(s: String) {
    // there's an invisible `checkNotNull(s)` here in the bytecode
    // so adding this dereference is now risk-free!
    // s.length
  }
}</code></pre>
<p>Now there’s an invisible checkNotNull(s) in the bytecode at the start of doThing’s body, so we can safely add a dereference to s, because if s <em>were</em> nullable, this code would already be crashing. As you can imagine, this certainty makes for much smoother, safer development.</p>
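<p>Decompiled back to Java, the translated method behaves roughly like the sketch below. The exact bytecode calls into kotlin.jvm.internal.Intrinsics; Objects.requireNonNull is used here as an approximate stand-in so the snippet runs on its own:</p>
<pre class="line-numbers"><code class="language-java">import java.util.Objects;

public class MyNullsafeClassDecompiled {
  // Approximately what the Kotlin compiler emits for `fun doThing(s: String)`:
  // a runtime check on the parameter before the body executes.
  public void doThing(String s) {
    Objects.requireNonNull(s, "s");
    // Because the check above would already have thrown for a null `s`,
    // this dereference can never be the first thing to fail.
    System.out.println(s.length());
  }
}</code></pre>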
<p>There are also some differences at the static analysis level: The Kotlin compiler enforces a slightly <a href="https://kotlinlang.org/docs/null-safety.html" target="_blank" rel="noopener">stricter set of null safety rules</a> than Nullsafe does when it comes to concurrency. More specifically, the Kotlin compiler throws an error for <a href="https://discuss.kotlinlang.org/t/smartcast-for-nullable-variable-properties/8976" target="_blank" rel="noopener">dereferences of class-level properties</a> that could have been set to null in another thread. This difference isn’t terribly important to us, but it does lead to more !! than one might expect when translating null-safe code.</p>
<h2>Great, let’s translate it all to Kotlin!</h2>
<p>Not so fast. As is always the case, going from more ambiguity to less ambiguity doesn’t come for free. For a case like MyNullsafeClass, development is much easier after Kotlin translation, but someone has to take that initial risk of effectively inserting a nonnull assertion for its hopefully-really-not-nullable parameter s. That “someone” is whichever developer or bot ends up shipping the Kotlin conversion.</p>
<p>We can take a number of steps to minimize the risk of introducing new NPEs during conversion, the simplest of which is erring on the side of “more nullable” when translating parameters and return types. In the case of MyNullsafeClass, the Kotlinator would have used context clues (in this case, the absence of any dereferences in the body of doThing) to infer that String s should be translated to s: String?.</p>
<p>One of the changes we ask developers to scrutinize most when reviewing conversion diffs is the addition of !! outside of preexisting dereferences. Funnily enough, we’re not worried about an expression like foo!!.name, because it’s not any more likely to crash in Kotlin than it was in Java. An expression such as someMethodDefinedInJava(foo!!) is much more concerning, however, because it’s possible that someMethodDefinedInJava is simply missing a @Nullable on its parameter, and so adding !! will introduce a very unnecessary NPE.</p>
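<p>A toy check for that review heuristic: flag <code>!!</code> in argument position while letting preexisting-style dereferences like <code>foo!!.name</code> through. The real review flow is diff tooling plus human judgment; this regex only illustrates the distinction drawn above:</p>
<pre class="line-numbers"><code class="language-java">import java.util.regex.Pattern;

public class BangBangReviewFlag {
  // `!!` immediately before `,` or `)` means it is applied to an argument,
  // e.g. someMethodDefinedInJava(foo!!) -- the risky case worth scrutiny.
  private static final Pattern RISKY_ARGUMENT_BANG = Pattern.compile("!!\\s*[,)]");

  public static boolean needsScrutiny(String kotlinLine) {
    return RISKY_ARGUMENT_BANG.matcher(kotlinLine).find();
  }
}</code></pre>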
<p>To avoid problems like adding unnecessary !! during conversion, we run over a dozen complementary codemods that comb through the codebase looking for parameters, return types, and member variables that might be missing @Nullable. More accurate nullability across the codebase—even in Java files that we may never translate—is not only safer, it’s also conducive to more successful conversions, especially as we approach the final stretch in this project.</p>
<p>Of course, the last remaining null safety issues in our Java code have usually stuck around because they’re very hard to solve. Previous attempts to resolve them relied mostly on static analysis, so we decided to borrow an idea from the Kotlin compiler and create a Java compiler plugin that helps us collect runtime nullability data. This plugin allows us to collect data on all return types and parameters that are receiving/returning a null value and are not annotated as such. Whether these are from Java/Kotlin interop or classes that were annotated incorrectly at a local level, we can determine ultimate sources of truth and use codemods to finally fix the annotations.</p>
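<p>A sketch of the runtime-data side, with invented names: instrumentation inserted by such a plugin could route values through a recorder keyed by the parameter or return site, tallying how often a supposedly non-null value actually arrives as null. Aggregated across production, those counts point at the annotations to fix:</p>
<pre class="line-numbers"><code class="language-java">import java.util.concurrent.atomic.AtomicLong;

public class NullObservationRecorder {
  private final String siteId; // e.g. "com/example/Foo.doThing:param s"
  private final AtomicLong nullCount = new AtomicLong();

  public NullObservationRecorder(String siteId) {
    this.siteId = siteId;
  }

  // Instrumented code wraps each un-annotated parameter or return value:
  // the value passes through unchanged, but null sightings are counted.
  public Object record(Object value) {
    if (value == null) {
      nullCount.incrementAndGet();
    }
    return value;
  }

  public long nullCount() {
    return nullCount.get();
  }

  public String report() {
    return siteId + " received null " + nullCount.get() + " time(s)";
  }
}</code></pre>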
<h1>Other ways to break your code</h1>
<p>On top of the risks of regressing null safety, there are dozens of other ways to break your code during conversion. In the course of shipping over 40,000 conversions, we’ve learned about many of these the hard way and now have several layers of validation to prevent them. Here are a couple of our favorites:</p>
<h3>Confusing initialization with getters</h3>
<pre class="line-numbers"><code class="language-kotlin">// Incorrect! The property is computed once, at initialization,
// freezing whichever user was current when the object was created.
val name: String = getCurrentUser().name
// Correct: the getter re-evaluates on every access, like the Java getter did.
val name: String
  get() = getCurrentUser().name</code></pre>
<h3>Nullable booleans</h3>
<pre class="line-numbers"><code class="language-kotlin">// Original
if (foo != null &amp;&amp; !foo.isEnabled) println("Foo is not null and disabled")
// Incorrect! `foo?.isEnabled != true` is also true when foo is null.
if (foo?.isEnabled != true) println("Foo is not null and disabled")
// Correct: `== false` holds only when foo is non-null and isEnabled is false.
if (foo?.isEnabled == false) println("Foo is not null and disabled")</code></pre>
<h1>The fun part</h1>
<p>At this point, more than half of Meta’s Android Java code has been translated to Kotlin (or, more rarely, deleted). But that was the easy half! The <em>really</em> fun part lies ahead of us, and it’s a doozy. There are still thousands of fully automated conversions we hope to unblock by adding and refining custom steps and by contributing to J2K. And there are thousands more semi-automated conversions we hope to ship smoothly and safely as a result of other Kotlinator improvements.</p>
<p>Many of the problems we face also affect other companies translating their Android codebases. If this sounds like you, we’d love for you to leverage our <a href="https://github.com/fbsamples/kotlin_ast_tools">fixes</a> and share some of your own. Come chat with us and others in the <a href="https://slack-chats.kotlinlang.org/c/j2k" target="_blank" rel="noopener">#j2k channel of the Kotlinlang Slack</a>.</p>]]></description>
      <link>https://engineering.fb.com/2024/12/18/android/translating-java-to-kotlin-at-scale/</link>
      <guid>https://engineering.fb.com/2024/12/18/android/translating-java-to-kotlin-at-scale/</guid>
      <pubDate>Wed, 18 Dec 2024 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How we think about Threads’ iOS performance]]></title>
      <description><![CDATA[<ul><li>How did the Threads iOS team maintain the app’s performance during its incredible growth?</li>
<li>Here’s how Meta’s Threads team thinks about performance, including the key metrics we monitor to keep the app healthy.</li>
<li>We’re also diving into some case studies that impact publish reliability and navigation latency.</li>
</ul><p>When Meta <a href="https://engineering.fb.com/2023/09/07/culture/threads-inside-story-metas-newest-social-app/">launched Threads</a> in 2023, it became the fastest-growing app in history, gaining 100 million users in only five days. The app now has grown to more than 300 million monthly international users, and its <a href="https://engineering.fb.com/2023/12/19/core-infra/how-meta-built-the-infrastructure-for-threads/">development team has expanded</a> from a small group of scrappy engineers to an organization with more than a hundred contributors.</p>
<p>Looking back on where the Threads iOS app was a year ago, so much has changed: We’ve expanded into Europe, integrated with the <a href="https://engineering.fb.com/2024/03/21/networking-traffic/threads-has-entered-the-fediverse/">Fediverse</a>, launched a public API, developed many new ways for people to share what’s going on in their world, and introduced new methods to find and read the best content being produced. We even celebrated our first birthday with party hats and scratch-off app icons! </p>
<p>To make sure the app is easy and delightful to use—and to scale with a quickly growing user base and development team—it has to be performant. Here’s how we think about performance in the Threads iOS app, what we’ve learned in our first year, and how we’ve tackled a few of our biggest performance challenges.</p>
<div class="jetpack-video-wrapper"><iframe title="Performance in Threads for iOS | Dave LaMacchia" width="1778" height="1000" src="https://www.youtube.com/embed/HrF5i1ZvTtk?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>How Threads measures performance at scale</h2>
<p>Having a fast and performant app is critical to providing the best user experience. We want Threads to be the best place for live, creative commentary about what’s happening now; that means Threads also needs to be the fastest and most responsive app in its class. If the app doesn’t feel lightning fast, or if it hangs or drains a phone’s battery, no one will want to use it. Our features have to work reliably and fail infrequently no matter what kind of phone someone is using, or how much memory their phone has, or whether they’re using Threads somewhere that has robust cellular coverage or a network that keeps dropping out.</p>
<p>Some performance issues are encountered only rarely but still can be frustrating. As the iOS app’s usage grew rapidly during our first year after release, we wanted to learn what the biggest pain points were for most people as well as the extreme performance issues experienced by a small percentage of users. We measured how quickly the app launches, how long it takes to post a photo or video, how often we would experience crashes, and how many bug reports were filed by people. </p>
<h3>%FIRE: Frustrating image-render experience</h3>
<p>In addition to all the text updates people share, we have a lot of photos shared on Threads. When images load slowly or not at all, that can cause someone to stop using the app. That’s why we monitor an important metric to alert when there’s a regression in how images are loading for our users. That metric, %FIRE, is the percentage of people who experience a <strong>frustrating image-render experience</strong>, and it’s calculated as shown in Figure 1, below.</p>
<figure id="attachment_22061" aria-describedby="caption-attachment-22061" class="wp-caption aligncenter c1"><img class="size-large wp-image-22061" src="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png?w=1024" alt="" width="1024" height="541" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png 1848w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png?resize=916,484 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png?resize=768,406 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png?resize=1024,541 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png?resize=1536,811 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png?resize=96,51 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-1.png?resize=192,101 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22061" class="wp-caption-text">Figure 1: %FIRE calculation.</figcaption></figure><p>All kinds of things can regress %FIRE, both on the client end and the backend, but not all image-rendering bugs are covered by this metric. For example, in Threads iOS, we had a bug earlier this year where user profile photos would flicker because of how we were comparing view models when reusing them. That triggered a frustrating user experience, but not one where users would contribute to %FIRE.</p>
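<p>The arithmetic shape of such a ratio metric is simple. The sketch below is an assumption about the form of the calculation (the authoritative definition is the one shown in Figure 1), with invented parameter names:</p>
<pre class="line-numbers"><code class="language-java">public class FireMetric {
  // %FIRE as a percentage: users who hit a frustrating image-render
  // experience, out of all users who viewed images in the same period.
  public static double percentFire(long usersWithFrustratingRenders, long usersViewingImages) {
    if (usersViewingImages == 0) {
      return 0.0; // no image viewers, nothing to regress
    }
    return 100.0 * usersWithFrustratingRenders / usersViewingImages;
  }
}</code></pre>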
<h3>Time-to-network content (TTNC)</h3>
<p>How fast the app starts and how fast we deliver a user’s feed to them is also important. We know if someone has to stare at an app launch screen, activity spinner, or loading shimmer for too long, they’ll just close the app. This is all measured in something we call TTNC, or time-to-network content<em>.</em> In addition to having the app start fast, people also want us to show them what’s happening now, so TTNC measures how fast we’re able to load a fresh, personalized feed, not just cached, locally stored posts.</p>
<p>The Threads iOS team has also improved the app launch time by keeping the app’s binary size small. Every time someone tries to commit code to Threads, they’re alerted if that code change would increase our app’s binary size above a configured threshold. Code that violates our binary size policy isn’t allowed to be merged. </p>
<p>We’re proactive, too: To help reduce TTNC, we have spent a lot of time since Threads launched removing unnecessary code and graphics assets from our app bundle, resulting in a binary one-quarter the size of Instagram. It doesn’t hurt that this also can reduce our iOS app’s build time, which makes the app more fun to develop! Threads compiles two times faster than Instagram for our non-incremental builds.</p>
<h3>Creation-publish success rate (cPSR)</h3>
<p>Where %FIRE and TTNC measure how content is presented to a user, we have one other important metric: cPSR, the creation-publish success rate<em>.</em> We measure this separately for text posts, photos, and video published to Threads. When someone tries to post a photo or video, many things can prevent it from succeeding. Photos and videos are locally transcoded into formats we want to upload, which happens asynchronously as part of the publishing process. They both use a lot more data and take longer than text to upload, so there’s more time for something to go wrong. A user might background the app after they tap “Post” without waiting for it to succeed, which on iOS might give us only a few seconds to complete the upload before we’re terminated by the operating system. </p>
<p>Later in this blog post, we’ll go into some of the strategies we’re using to improve cPSR.</p>
<h2>Deep dive: Navigation latency</h2>
<p>Navigation latency is important to the user experience because it’s tied to how fast the app starts and everything the user does once the app has launched. When we measure navigation latency, we want to know how long it takes to finish rendering content after a user navigates to part of the app. That could be after app start, either from launching Threads directly on your phone, or by tapping on a push notification from Threads, or by simply tapping on a post in your Feed and navigating to the conversation view. </p>
<p>Early in 2024, the Threads Performance team knew we wanted to focus on a few key areas, but which ones? Data from Instagram suggested navigation latency is important, but Threads is used differently than Instagram. Having been available to download for only six months at the time, we knew that to prioritize areas of improvement we would first have to spend some time learning.</p>
<h3>Learning from a boundary test</h3>
<p>We started by creating a <strong>boundary test</strong> to measure latency, focusing on a few key places that people visit when they launch Threads or use the app. A boundary test is one where we measure extreme ends of a boundary to learn what the effect is. In our case, we introduced a slight bit of latency when a small percentage of our users would navigate to a user profile, to the conversation view for a post, or to their activity feed. </p>
<table border="1"><tbody><tr><td class="c2"><strong>Latency injection</strong></td>
<td class="c2">
</td><td class="c2"><strong>Daily Active Users</strong></td>
<td class="c2"><strong>Foreground sessions</strong></td>
<td class="c2"><strong>Likes</strong></td>
<td class="c2"><strong>Conversation views</strong></td>
</tr><tr><td>Activity: 0.12s<br />Conversation: 0.29s<br />Profile: 0.28s</td>
<td class="c3" rowspan="3"><strong>In-app navigation</strong></td>
<td>
</td><td>
</td><td>
</td><td>
</td></tr><tr><td>Activity: 0.15s<br />Conversation: 0.36s<br />Profile: 0.35s</td>
<td>
</td><td>
</td><td><strong>-0.68%</strong></td>
<td>
</td></tr><tr><td>Activity: 0.19s<br />Conversation: 0.54s<br />Profile: 0.53s</td>
<td><strong>-0.54%</strong></td>
<td>
</td><td><strong>-0.81%</strong></td>
<td>
</td></tr><tr><td>Activity: 0.12s<br />Conversation: 0.29s<br />Profile: 0.28s</td>
<td class="c3" rowspan="3"><strong>App launch</strong></td>
<td><strong>-0.37%</strong></td>
<td><strong>-0.67%</strong></td>
<td>
</td><td><strong>-1.63%</strong></td>
</tr><tr><td>Activity: 0.15s<br />Conversation: 0.36s<br />Profile: 0.35s</td>
<td>
</td><td><strong>-0.67%</strong></td>
<td>
</td><td><strong>-2.55%</strong></td>
</tr><tr><td>Activity: 0.19s<br />Conversation: 0.54s<br />Profile: 0.53s</td>
<td><strong>-0.52%</strong></td>
<td><strong>-0.65%</strong></td>
<td>
</td><td>
</td></tr></tbody></table><p><em>Table 1: Navigation latency boundary test results.</em></p>
<p>This latency would allow us to extrapolate what the effect would be if we similarly <em>improved</em> how we delivered content to those views.</p>
<p>We already had robust analytics logging, but we didn’t have the ability to differentiate between navigation to these views from a cold app launch and from within the app. After adding that, we injected latency into three buckets, each with slight variability depending on surface. </p>
<p>We learned that iOS users don’t tolerate a lot of latency. The more we added, the less often they would launch the app and the less time they would stay in it. With the smallest latency injection, the impact was small or negligible for some views, but the largest injections had negative effects across the board. People would read fewer posts, post less often themselves, and in general interact less with the app. Remember, we weren’t injecting latency into the core feed, either; just into the profile, permalink, and activity.</p>
<h3>Measuring navigation latency with SLATE</h3>
<figure id="attachment_22085" aria-describedby="caption-attachment-22085" class="wp-caption alignright c4"><img class="wp-image-22085" src="https://engineering.fb.com/wp-content/uploads/2024/12/SLATE-debugger-Threads-iOS-performance.png?w=450" alt="" width="248" height="500" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/SLATE-debugger-Threads-iOS-performance.png 450w, https://engineering.fb.com/wp-content/uploads/2024/12/SLATE-debugger-Threads-iOS-performance.png?resize=96,193 96w, https://engineering.fb.com/wp-content/uploads/2024/12/SLATE-debugger-Threads-iOS-performance.png?resize=192,387 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22085" class="wp-caption-text">The SLATE debugger.</figcaption></figure><p>Navigation latency is difficult to measure consistently. If you have a big app that does many different things, you have to have a consistent way of “starting” your timer, measuring time to render a view across many different surfaces with different types of content and behavior, and finally “stopping” your timer. Also, you have to be aware of error states and empty views, which need to be considered terminal states. There can be many permutations and custom implementations across all of an app’s surfaces.</p>
<p>To solve this problem and measure navigation latency consistently, we developed a new tool we call SLATE: the “Systemic LATEncy” logger. It gives us the ability to observe the events that make up a new navigation: when the user interface (UI) is being built, when activity spinners or shimmers are displayed, when content is displayed from the network, and when a user sees an error condition. It’s implemented using a set of common components that are the foundation for a lot of our UI and a system that measures performance by setting “markers” in code for specific events. Typically these markers are created with a specific purpose in mind. The great thing about SLATE is that it automatically creates these markers for a developer, as long as they’re using common components. This makes the system highly scalable and maintainable in a very large code base such as Threads or Instagram.</p>
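To make the idea concrete, here is a highly simplified sketch of what such an automatic navigation marker might look like. SLATE’s real implementation is internal to Meta; every name and API below is hypothetical, and the real tool covers far more states than this.

```swift
import Foundation

// A toy SLATE-style navigation marker (all names hypothetical). A marker
// starts when a navigation event fires and ends at a terminal state:
// content rendered, an error shown, or an empty view displayed.
enum TerminalState {
    case contentDisplayed, error, empty
}

final class NavigationMarker {
    let surface: String
    private let start = Date()
    private(set) var duration: TimeInterval?

    init(surface: String) {
        self.surface = surface
    }

    // In the sketch this is called explicitly; in the real system, common
    // UI components would end markers automatically, so feature developers
    // don't have to instrument anything themselves.
    func end(in state: TerminalState) {
        let elapsed = Date().timeIntervalSince(start)
        duration = elapsed
        print("\(surface) reached \(state) after \(elapsed)s")
    }
}

// A navigation begins, the UI is built, content arrives, the marker ends.
let marker = NavigationMarker(surface: "profile")
marker.end(in: .contentDisplayed)
```

The key property is that the timer has exactly one start event and one terminal state per navigation, which is what makes measurements comparable across surfaces.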
<p>When our iOS developers are creating a new feature, it’s easy to see if it has an effect on navigation latency. Anyone can enable the SLATE debugger (shown above) right in the internal build of our app, and it’s easy to create a dashboard so they can get a report about how their code is running in production.</p>
<h3>Case study: Using SLATE to validate GraphQL adoption</h3>
<p>Over the last year, both Instagram and Threads have been adopting GraphQL for network requests. Even though Meta created GraphQL back in 2012, we built Instagram on a network stack based on REST, so Threads for iOS and Android originally inherited that technical legacy.</p>
<p>When <a href="https://engineering.fb.com/2024/05/14/web/threads-for-web-behind-the-scenes/">Threads for Web was developed</a>, it was a fresh code base built on the modern GraphQL standard instead of REST. While this was great for web, it meant that new features delivered to both web and iOS/Android had to be written twice: once for the GraphQL endpoints and once for REST. We wanted to move new development to GraphQL, but because the implementation was unproven for Threads, we first needed to measure it and make sure it was ready to be adopted. We expected GraphQL to reduce the amount of data moved over the network, but the infrastructure needed to parse and store that data might introduce additional latency.</p>
<p>We decided to run a test where we took one of our views and implemented its network delivery code using GraphQL. Then we could run the REST and GraphQL implementations side by side and compare the results. We opted to run the test for the “user list” views that power Followers and Following lists and determine if the new code that delivered and parsed GraphQL responses was at least as fast as the legacy REST code.</p>
<p>This was easy to do using Swift. We created an abstraction that extracted the existing API into a protocol that both the REST and GraphQL code could implement; then, at the call site, a factory method generated the appropriate provider.</p>
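The pattern described above can be sketched as follows. This is an illustrative reconstruction, not Threads’ actual code; all type and method names are hypothetical.

```swift
// One protocol abstracts the user-list API; both network stacks conform.
struct UserList {
    let usernames: [String]
}

protocol UserListProvider {
    func fetchFollowers(userID: String, completion: @escaping (UserList) -> Void)
}

struct RESTUserListProvider: UserListProvider {
    func fetchFollowers(userID: String, completion: @escaping (UserList) -> Void) {
        // The legacy REST request and JSON parsing would happen here.
        completion(UserList(usernames: []))
    }
}

struct GraphQLUserListProvider: UserListProvider {
    func fetchFollowers(userID: String, completion: @escaping (UserList) -> Void) {
        // The GraphQL query and response parsing would happen here.
        completion(UserList(usernames: []))
    }
}

enum UserListProviderFactory {
    // A flag (e.g., from an experiment framework) decides which
    // implementation callers receive; call sites are unchanged either way.
    static func make(useGraphQL: Bool) -> UserListProvider {
        if useGraphQL {
            return GraphQLUserListProvider()
        }
        return RESTUserListProvider()
    }
}
```

Because callers only see the protocol, the two implementations can run side by side in production and be compared directly.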
<p>Once the code was running, we needed to measure the impact on the end-to-end latency of fetching results from the network and rendering the content on screen. SLATE to the rescue! Using SLATE’s performance markers, we could easily compare latency data for each of the different user view network implementations. </p>
<p>Below is an example graph of the latency data (p95) for when a user views the list of their followers. The graph compares the REST and GraphQL latency data, which are very similar. We saw comparable results across all the different views, which gave the Threads iOS team confidence to adopt GraphQL for all new endpoints.</p>
<figure id="attachment_22063" aria-describedby="caption-attachment-22063" class="wp-caption aligncenter c1"><img class="size-large wp-image-22063" src="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-2.jpg?w=1024" alt="" width="1024" height="434" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-2.jpg 1464w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-2.jpg?resize=916,388 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-2.jpg?resize=768,325 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-2.jpg?resize=1024,434 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-2.jpg?resize=96,41 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-2.jpg?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22063" class="wp-caption-text">Figure 2: Latency (p95) loading Following and Followers lists via REST and GraphQL.</figcaption></figure><h2>Deep dive: Publish reliability and latency</h2>
<p>As mentioned previously, cPSR is one of the top metrics we’re trying to improve on Threads, because if people can’t reliably post what they want, they’ll have a terrible user experience. We also know from reading user-submitted bug reports that posting can be a source of frustration for people.</p>
<p>Let’s dive into two features added to Threads iOS that approach improving the posting experience in very different ways: Drafts, and reducing the perceived latency of text posts.</p>
<h3>Drafts</h3>
<p>In early 2024, Threads introduced basic saving of drafts on iOS and Android. In addition to being one of our most user-requested features, Drafts provides resiliency to unexpected failures such as bad network connectivity. Looking at user-filed bug reports, we had seen that the top concern was being unable to post. Often users didn’t know why they couldn’t post. We knew a draft feature would help with some of these concerns.</p>
<p>These user bug reports were used to measure the success of Drafts. Drafts doesn’t directly move cPSR, which measures the reliability of posting in a single session, but we theorized it might result in either more posts being created or less overall user frustration with posting. We released Drafts to a small group of people and compared the number of subsequent posting-related bug reports they submitted with reports from people who didn’t have Drafts. We discovered that 26 percent fewer people submitted bug reports about posting if they had Drafts. The feature was clearly making a difference.</p>
<p>We quickly followed up with a small but necessary improvement. Previously, if a user ran into a network issue while posting, they were asked whether they wanted to retry or discard their post, but were given no option to save it as a draft. This meant many people who couldn’t send a post were losing it entirely, which was frustrating. Unfortunately, measuring the impact of this resiliency feature was difficult because few people encountered it.</p>
<p>Then, a surprising thing happened: A serious bug took down all of Threads for a short period of time. Though this was bad, it had the side effect of testing some of our resiliency features, including Drafts. We saw a huge spike in usage during the short outage, which confirmed that people were benefiting from being able to save their posts if there was a serious problem.</p>
<p>You can see in Figure 3 below the spike in Drafts usage during the outage around noon on March 31.</p>
<figure id="attachment_22082" aria-describedby="caption-attachment-22082" class="wp-caption aligncenter c1"><img class="size-large wp-image-22082" src="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg?w=1024" alt="" width="1024" height="546" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg 1999w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg?resize=916,488 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg?resize=768,409 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg?resize=1024,546 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg?resize=1536,818 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg?resize=96,51 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-figure-3_crop.jpg?resize=192,102 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22082" class="wp-caption-text">Figure 3: A spike in Drafts usage during a brief outage.</figcaption></figure><h3>Minimizing Drafts’ local storage</h3>
<p>After Drafts was released to the public, we discovered an unfortunate bug: The average amount of storage Threads used was increasing dramatically. People on Threads noticed, too, and posted a lot of complaints about it. Some of these people reported that Threads was taking up many gigabytes of storage space. Maintaining a low disk footprint helps performance, and addressing this bug provided an opportunity to learn about the impact of excessive disk usage in Threads.</p>
<figure id="attachment_22066" aria-describedby="caption-attachment-22066" class="wp-caption aligncenter c1"><img class="size-large wp-image-22066" src="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png?w=1024" alt="" width="1024" height="699" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png 1726w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png?resize=916,625 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png?resize=768,524 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png?resize=1024,699 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png?resize=1536,1048 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png?resize=96,66 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-iOS-performance-figure-4.png?resize=192,131 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22066" class="wp-caption-text">Figure 4: Disk usage in Threads after Drafts launched.</figcaption></figure><p>The culprit was Drafts. In the iOS app, we use PHPickerViewController, introduced in iOS 14, to power the photo and video gallery presented in the Composer. </p>
<p>PHPickerViewController is a nice component that runs out of process and provides users with privacy and safety by allowing them to give an app access to exactly the media they want. When a photo is selected, an app receives a URL that points to the image asset on the device. We found, however, that access to this image is only temporary; between sessions, Threads would lose permission to read an image that had been attached to a draft. In addition, if a user deleted an image from the gallery, it would also disappear from a draft, which was not ideal.</p>
<p>The solution was to copy photos and videos to an area in the application container that was specific to Drafts. Unfortunately, copied media wasn’t being cleaned up entirely, leading disk usage to grow, sometimes dramatically, over time.</p>
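The approach can be sketched roughly as below: persist picked media into a Drafts-specific directory so the app keeps access across sessions, and purge any copies no longer referenced by a draft (the missing purge step was the source of the disk-usage bug). The directory layout and names here are illustrative, not Threads’ actual implementation.

```swift
import Foundation

// Hypothetical sketch of a Drafts media store.
enum DraftMediaStore {
    static let draftsDirectory = FileManager.default
        .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("Drafts", isDirectory: true)

    // Copy a picked asset into the app container. PHPickerViewController's
    // access to the original URL is temporary, so a draft must own its copy.
    static func persist(pickedMediaAt url: URL) throws -> URL {
        let fm = FileManager.default
        try fm.createDirectory(at: draftsDirectory,
                               withIntermediateDirectories: true)
        let destination = draftsDirectory.appendingPathComponent(url.lastPathComponent)
        if fm.fileExists(atPath: destination.path) {
            try fm.removeItem(at: destination)
        }
        try fm.copyItem(at: url, to: destination)
        return destination
    }

    // Delete every copied file that no live draft references. Skipping this
    // cleanup is how disk usage grows unboundedly over time.
    static func purge(keeping referenced: Set<URL>) throws {
        let contents = try FileManager.default.contentsOfDirectory(
            at: draftsDirectory, includingPropertiesForKeys: nil)
        for file in contents where !referenced.contains(file) {
            try FileManager.default.removeItem(at: file)
        }
    }
}
```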
<p>Cleaning up this excessive disk usage had dramatic results in areas we didn’t expect. App launch became faster (-0.35%), our daily active users grew (+0.21%), and people posted quite a lot more original content (+0.76%).</p>
<figure id="attachment_22086" aria-describedby="caption-attachment-22086" class="wp-caption alignright c5"><img class="wp-image-22086" src="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-Posted-Toast-Threads-iOS.png?w=450" alt="" width="250" height="500" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Threads-Posted-Toast-Threads-iOS.png 450w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-Posted-Toast-Threads-iOS.png?resize=96,192 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Threads-Posted-Toast-Threads-iOS.png?resize=192,384 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22086" class="wp-caption-text">Threads’ Posted toast.</figcaption></figure><h3>Blazing fast text posts</h3>
<p>As with the navigation-latency boundary test, the performance team had previously measured the impact of latency on text replies and knew we wanted to improve them. In addition to implementing improvements to reduce absolute latency, we decided to reduce <em>perceived latency</em>.</p>
<p>A new feature in Threads’ network stack allows the server to notify a client when a posting request has been fully received, but before it’s been processed and published. Most failures happen between the mobile client and Threads’ servers, so once a request is received, it’s very likely to succeed.</p>
<p>Using the new server-acknowledgement callback, the iOS client could now present the “Posted” toast when a publish request was received, but before it was fully created in the backend. It would appear as if text posts were publishing a little faster. The result is a better user experience that makes the app feel more conversational.</p>
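A rough sketch of this optimistic flow is below. The function and callback names are hypothetical (the real network stack is internal); the point is that the toast fires on the server’s “request received” acknowledgement, while the final publish result is still handled separately for the rare failure case.

```swift
// Hypothetical publish API with two callbacks: one for the server-receipt
// acknowledgement, one for the final publish result.
func publishTextPost(_ text: String,
                     onServerReceived: @escaping () -> Void,
                     onPublished: @escaping (Result<Void, Error>) -> Void) {
    // Stand-in for the real request: the server acks receipt first,
    // then finishes processing and publishing the post.
    onServerReceived()
    onPublished(.success(()))
}

publishTextPost("Hello, Threads!") {
    // Optimistic: most failures happen between the client and the server,
    // so once the request is received, success is very likely.
    print("show Posted toast")
} onPublished: { result in
    if case .failure = result {
        // Rare: reconcile the UI if publishing actually failed.
        print("show retry UI")
    }
}
```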
<h2>Adopting Swift Concurrency for more stable code</h2>
<p>Migrating the Threads iOS publishing code from a synchronous model to an asynchronous one also revealed the potential for race conditions. In addition to the asynchronous transcoding step mentioned previously, there were some new ones related to management of the upload tasks and media metadata. We noticed some mysterious malformed payloads that turned up only occasionally in our analytics and dashboards. Operating at massive scale tends to turn up some rare edge cases that can have negative consequences on performance metrics and give people a bad user experience.</p>
<p>One of the best things about working in the Threads code base is that it’s mostly in Swift. Some of the publishing code was written in Objective-C, though. While Objective-C has a lot of benefits, Swift’s strong data-race protections and type safety would be an improvement, so we decided to migrate Threads’ publishing code to Swift.</p>
<p>iOS teams throughout Meta are adopting Swift’s “complete concurrency” in preparation for moving to Swift 6. On the Threads team, we’ve been migrating older Swift code and using complete concurrency in new frameworks that we’re building. Moving to complete concurrency is probably the biggest change to iOS development since Automatic Reference Counting (ARC) was introduced way back in iOS 4. When you adopt complete concurrency, Swift does a great job of preventing pesky data races, such as some that were causing issues with our optimistic uploader. If you haven’t yet enabled complete concurrency checking in your code, consider doing so; you may find that your code becomes more stable and less prone to hard-to-debug problems caused by data races.</p>
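As a simplified illustration of the kind of race this catches, consider mutable upload state shared across concurrent tasks (the type below is hypothetical, not the actual uploader). Modeling the state as an actor serializes access; under strict concurrency checking, the equivalent unprotected class would fail to compile when touched from multiple tasks.

```swift
// Hypothetical registry of in-flight upload tasks. As an actor, all access
// to `inFlight` is serialized, so concurrent begin/update/finish calls
// cannot race or corrupt the dictionary.
actor UploadTaskRegistry {
    private var inFlight: [String: Double] = [:]  // task ID -> progress

    func begin(taskID: String) {
        inFlight[taskID] = 0
    }

    func update(taskID: String, progress: Double) {
        inFlight[taskID] = progress
    }

    // Returns true if the task was actually being tracked.
    func finish(taskID: String) -> Bool {
        inFlight.removeValue(forKey: taskID) != nil
    }
}
```

Callers interact with the actor via `await`, which is exactly the friction that makes previously invisible cross-task mutation explicit.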
<h2>The future of Threads iOS performance</h2>
<p>As Threads continues to scale in its second year and beyond, the iOS app will have to adapt to meet new challenges. As we add new product features, we will keep monitoring our tried-and-true metrics such as %FIRE, TTNC, and cPSR to make sure the user experience doesn’t degrade. We’re updating the code that delivers posts to you, so you see content faster and experience fewer loading indicators. We’ll continue to take advantage of the most modern language features in Swift, which will make the app more stable and faster to build and load into memory. Meanwhile, we’re going to iterate and evolve tools like SLATE that help us improve our testing and debug regressions.</p>
<p>As part of the Threads community, you can also contribute to making the app better. We mentioned earlier that user-submitted bug reports were used to identify areas for the development team to focus on and verify that features like Drafts were actually solving user frustrations. In both Threads and Instagram, you can long-press on the Home tab or shake your phone to submit a bug report. We really do read them.</p>]]></description>
      <link>https://engineering.fb.com/2024/12/18/ios/how-we-think-about-threads-ios-performance/</link>
      <guid>https://engineering.fb.com/2024/12/18/ios/how-we-think-about-threads-ios-performance/</guid>
      <pubDate>Wed, 18 Dec 2024 16:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How to build a mixed reality headset]]></title>
      <description><![CDATA[<p>How do you take a mixed reality (MR) headset from idea to finished product?</p>
<p>Alfred Jones, VP of hardware engineering at Meta Reality Labs, joins Pascal Hartig (<a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">@passy</a>) on the latest episode of the Meta Tech Podcast for a discussion on the realities (no pun intended) of building MR hardware.</p>
<p>Jones shares his strategy for avoiding choice paralysis. With so many options out there, how do you choose the right display technology, battery, and thermal budget (and do so at the right price point)?</p>
<p>He also discusses what makes passthrough such a challenge, gives an inside look into dogfooding MR hardware at Meta, and ponders what the future holds for mixed reality.</p>
<p>Download or listen to the podcast episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/33684347/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe><br />
You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/0gyzNHKwmEBWd8WCHpIzGx?si=qdpwMkq1R6K-aUSnYYBnyg" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/gb/podcast/how-to-build-a-mixed-reality-headset/id1370910331?i=1000675116001" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/jzahyxws" target="_blank" rel="noopener">PocketCasts</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/">Meta Tech Podcast</a> is a podcast brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2024/12/12/virtual-reality/how-to-build-a-mixed-reality-headset/</link>
      <guid>https://engineering.fb.com/2024/12/12/virtual-reality/how-to-build-a-mixed-reality-headset/</guid>
      <pubDate>Thu, 12 Dec 2024 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Inside Facebook’s video delivery system]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">We’re explaining the end-to-end systems the Facebook app leverages to deliver relevant content to people.</li>
<li class="c1" aria-level="1">Learn about our video-unification efforts that have simplified our product experience and infrastructure, in-depth details around mobile delivery, and new features we are working on in our video-content delivery stack.</li>
</ul><p>The end-to-end delivery of highly relevant, personalized, timely, and responsive content comes with complex challenges. At Facebook’s scale, the systems built to support and overcome these challenges require extensive trade-off analyses, focused optimizations, and an architecture that allows our engineers to push for the best user and business outcomes.</p>
<div class="jetpack-video-wrapper"><iframe title="Facebook Video Delivery | Colin Smith" width="1778" height="1000" src="https://www.youtube.com/embed/ycfSTzkcmrM?feature=oembed" frameborder="0" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="allowfullscreen">[embedded content]</iframe></div>
<h2>Video unification on Facebook</h2>
<p>It would be hard to talk about Facebook’s video delivery without mentioning our two years of video unification efforts. Many of the capabilities and technologies we will reference below would not have been possible without the efforts taken to streamline and simplify Facebook video products and technical stacks. </p>
<p>In its simplest form, we have three systems to support Facebook video delivery: ranking, server, and mobile:</p>
<h3>Ranking (RecSys)</h3>
<p>Recommends content that fulfills people’s interests (i.e., short-term, long-term, and real-time) while also allowing for novel content and discovery of content that is outside the person’s historical engagement. The architecture supports flexibility for various optimization functions and value modeling, and it builds on delivery systems that allow for tight latency budgets, rapid modeling deployments, and a bias towards fresh content. (We classify freshness as how long ago ranking generated this video candidate. We generally consider fresher content to be better, since it operates on more recent and relevant signals.)</p>
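The freshness bias described above can be illustrated with a toy value model. This is purely illustrative; the actual optimization functions and value models are far more sophisticated and are not described in detail here.

```swift
import Foundation

// Toy model: a candidate's value combines its predicted relevance with a
// freshness decay, where freshness is how long ago ranking generated it.
struct RankedVideo {
    let relevance: Double   // predicted interest score (hypothetical)
    let generatedAt: Date   // when ranking produced this candidate
}

func value(of video: RankedVideo, now: Date = Date()) -> Double {
    let ageInMinutes = now.timeIntervalSince(video.generatedAt) / 60
    // Exponential decay (half-life chosen arbitrarily for illustration):
    // fresher candidates operate on more recent signals, so they score higher.
    return video.relevance * exp(-ageInMinutes / 30)
}
```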
<h3>Server (WWW)</h3>
<p>Brokers between mobile/web and RecSys, serving as the central business logic that powers all of Facebook video’s feature sets and delivering the appropriate content recommendations from RecSys to the mobile clients. It controls key delivery characteristics such as content pagination, deduplication, and ranking-signal collection. It also manages key systems trade-offs, such as capacity, through caching and throttling.</p>
<h3>Mobile – Facebook for Android (FB4A) and Facebook for iOS (FBiOS)</h3>
<p>Facebook’s mobile apps are highly optimized for user experience beyond the pixels. Mobile is built with frameworks, such as client-side ranking (CSR), which allows for delivery of the content that’s most optimal to people at exactly the point of consumption, without needing a round trip to the server at times.</p>
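To illustrate the idea of client-side ranking, here is a minimal sketch: the client holds a buffer of server-ranked candidates and re-orders them at the point of consumption using on-device signals, without a server round trip. All names and the scoring formula are hypothetical.

```swift
import Foundation

// A server-ranked candidate buffered on the client.
struct Candidate {
    let id: String
    let serverScore: Double
    let fetchedAt: Date
}

// Re-rank the buffer on-device: drop items the user has just seen, then
// order by server score adjusted with a (made-up) staleness penalty.
func clientSideRank(_ buffer: [Candidate],
                    recentlySeen: Set<String>,
                    now: Date = Date()) -> [Candidate] {
    func adjusted(_ c: Candidate) -> Double {
        let ageHours = now.timeIntervalSince(c.fetchedAt) / 3600
        return c.serverScore - ageHours  // fresher content ranks higher
    }
    return buffer
        .filter { !recentlySeen.contains($0.id) }  // on-device deduplication
        .sorted { adjusted($0) > adjusted($1) }
}
```

The server still decides what enters the buffer; the client only decides what to show next, which is what avoids the round trip at the moment of consumption.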
<h2>Why unify?</h2>
<p>Let’s dive into our video unification efforts. Previously, we had various user experiences, mobile client layers, server layers, and ranking layers for Watch and Reels. In the past couple of years, we have been consolidating the app’s video experiences and infrastructure into a single entity. </p>
<p>The reason is simple. Maintaining multiple video products and services leads to a fragmented user and developer experience. Fragmentation leads to slower development, complicated and inconsistent user experiences, and fewer positive app recommendations. Facebook Watch and Reels on Facebook, two similar products, were functioning quite separately, which meant we couldn’t share improvements between them, leading to a worse experience across the board.</p>
<p>This separation also created a lot of overhead for creators. Previously, if creators wanted distribution to a certain surface, they would need to create two types of content, such as a Reel for immersive Reels surfaces and a VOD for Watch tab. For advertisers, this meant creating different ads for different ad formats.</p>
<h2>Mobile and server technical stack unification</h2>
<p>The first step in unification was unifying our two client and server data models and two architectures into one, with no changes to the user interface (UI). The complexity here was immense, and this technical stack unification took a year across the server, iOS, and Android, but was a necessary step in paving the way for further steps. </p>
<p><img class="aligncenter size-large wp-image-22038" src="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png 2500w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=2048,1152 2048w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-1.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Several variables added to the complexity, including:</p>
<ul><li class="c1" aria-level="1">Billions of users for the product. Any small, accidental shift in logging, UI, or performance would immediately be seen in top line metrics.</li>
<li class="c1" aria-level="1">Tens of thousands of lines of code per layer across Android, iOS, and the server.</li>
<li class="c1" aria-level="1">Merging Reels and Watch while keeping the best of both systems took a lot of auditing and debugging. (We needed to audit hundreds of features and thousands of lines of code to ensure that we preserved all key experiences).</li>
<li class="c1" aria-level="1">The interactions between layers also needed to be maintained while the code beneath them was shifting. Logging played a key role in ensuring this.</li>
<li class="c1" aria-level="1">Product engineers continued work on both the old Reels and Watch systems, improving the product experience for users and improving key video metrics for the Facebook app. This created a “moving goal post” effect for the new unified system, since we had to match these new launches. We had to move quickly and choose the right “cutoff” point to move all the video engineers to work on the new stack as early as possible.
<ul><li class="c1" aria-level="2">If we transferred the engineers too early, approximately 50 product engineers would not be able to hit their goals, while also causing churn in the core infrastructure.</li>
<li class="c1" aria-level="2">If we transferred them too late, even more work would be required on the new, unified infrastructure to port old features.</li>
</ul></li>
<li class="c1" aria-level="1">Maintenance of logging for core metrics for the new stack. Logging is extremely sensitive and implemented in different ways across surfaces. We had to sometimes re-implement logging in a new way to serve both products. We also had to ensure we maintained hundreds of key logging parameters.</li>
<li class="c1" aria-level="1">We had to do all of this while maintaining engagement and performance metrics to ensure the new architecture met our performance bar.</li>
</ul><h3><img class="aligncenter size-large wp-image-22039" src="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png 2500w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=2048,1152 2048w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-2.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></h3>
<h2>Migrating Watch users to a Reels surface</h2>
<p>The next step was moving all our VOD feed-chaining experiences to use the immersive Reels UI. Since the immersive Reels UI was optimized for viewing short-form video, whereas VOD feed-chaining UI was optimized for viewing long-form video, it took many product iterations to ensure that the unified surface could serve all our users’ needs without any compromises. We ran hundreds of tests to identify and polish the most optimal video feature set. This project took another year to complete.</p>
<h2>Unifying ranking across Reels and Watch</h2>
<p>The next step, which shipped in August of 2024, was unifying our ranking layers. This had to be done after the previous layers, because ranking relies on the signals derived from the UI across surfaces being the same. For example, the Like button sits at the top of the vertical sidebar and is quite prominent in the Reels UI. But in the Watch UI, it is at the bottom far left. They are the same signal but have different contexts, and if ranking treated them equally, you would see a degradation in video recommendations.</p>
<p>In addition to UI unification, another significant challenge is building a ranking system that can recommend a mixed inventory of Reels and VOD content while catering to both short-form video-heavy and long-form video-heavy users. The ranking team has made tremendous progress in this regard, starting with the unification of our data, infrastructure, and algorithmic foundations across the Reels and Watch stacks. They followed with the creation of a unified content pool that includes both short-form and long-form videos, enabling greater content liquidity. The team then optimized the recommendation machine learning (ML) models to surface the most relevant content without video length bias, ensuring a seamless transition for users with different product affinities (e.g., Reels heavy versus Watch heavy) to a unified content-recommendation experience.</p>
<h2>The unified video tab</h2>
<p>The last step was shipping the new video tab. This tab uses a Reels immersive UI, with unified ranking and product infrastructure across all layers to deliver recommendations ranging from Reels, long-form VOD, and live videos. It allows us to deliver the best of all worlds from a UI, performance, and recommendations perspective.</p>
<p>With video unification nearly completed, we are able to accomplish much deeper integrations and complex features end to end across the stack.</p>
<h2>Video unification precursor</h2>
<p>Before any formal video unification occurred across the Facebook app, we made a smaller effort within the video organization to modernize the Watch UI. Previously, when you tapped on a video, the Watch Tab would open a new video-feed modal screen. This led to surfaces within surfaces, which could be a confusing experience. The Watch Tab was also closer to the News Feed UI, which doesn’t match modern immersive video products in the industry.</p>
<p>This project worked to make the Watch Tab UI immersive and modern, while also flattening the feeds within feeds into a single feed. The issue was that we had not consolidated our infrastructure layers across mobile, server, and ranking. This led to slowdowns when trying to implement modern recommendation features. We also realized too late that ranking would play a key role in this project, and we made ranking changes late in the project life cycle.</p>
<p>These key learnings allowed the video organization to take the right steps and order of operations listed above. Without the learnings from this project, we might not have seen a successful video unification outcome.</p>
<h2>How Facebook’s video delivery system works</h2>
<p>When delivering content to people on Facebook, we operate from five key principles:</p>
<ol><li class="c1" aria-level="1">Prioritize fresh content.</li>
<li class="c1" aria-level="1">Let ranking decide the order of content.</li>
<li class="c1" aria-level="1">Only vend content (moving it from memory to the UI layer) when needed and vend as little as possible.</li>
<li class="c1" aria-level="1">Ensure fetching behavior is deterministic.</li>
<li class="c1" aria-level="1">Give people content when there is a clear signal they want it.</li>
</ol><h3>The lifecycle of a video feed network request</h3>
<figure id="attachment_22041" aria-describedby="caption-attachment-22041" class="wp-caption aligncenter c2"><img class="size-large wp-image-22041" src="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png?w=1024" alt="" width="1024" height="380" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png 1999w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png?resize=916,340 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png?resize=768,285 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png?resize=1024,380 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png?resize=1536,570 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png?resize=96,36 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-lifecycle.png?resize=192,71 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22041" class="wp-caption-text">Video Feed Network Request End to End.</figcaption></figure><h4>The mobile client sends a request</h4>
<p>The mobile client generally has a few mechanisms in place that can trigger a network request. Each request type has its own trigger. A prefetch request (one issued before the surface is visible) is triggered a short time after app startup. Prefetch is available only for our tab surface.</p>
<p>A head load (the initial network request) is triggered when the user navigates to the surface. A tail load request (which includes all the subsequent requests) is triggered every time the user scrolls, with some caveats.</p>
<p>A prefetch and a head load can both be in flight at once. We will vend content for whichever one returns first, or, if both take too long, we will vend cached content.</p>
<p>A tail load will be attempted every time the user scrolls, though we will issue a request only if the in-memory pool (the memory store for video stories that are ready for viewing) has three or fewer video stories from the network. So, if we have four stories from the network, we won’t issue the request. If we have four stories from the cache, we will issue the network request. Alongside the tail load request, we will send user signals to ranking to help generate good video candidates for that request.</p>
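<p>The pool-threshold rule above can be sketched in a few lines. This is a minimal sketch with an assumed pool structure and threshold constant, not Meta’s actual client code:</p>

```python
# Hypothetical sketch of the tail-load gating rule described above:
# issue a network request only when the in-memory pool holds three or
# fewer video stories that came from the network (cached stories don't
# count toward the threshold).

NETWORK_POOL_THRESHOLD = 3  # assumed constant matching the prose

def should_issue_tail_load(pool):
    """pool: list of dicts like {"id": ..., "source": "network" | "cache"}."""
    network_stories = [s for s in pool if s["source"] == "network"]
    return len(network_stories) <= NETWORK_POOL_THRESHOLD

# Four stories from the network: no request is issued.
# Four stories from the cache: the network request is issued.
```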
<h4>The server receives a request</h4>
<p>When a client request arrives at the server, it passes through several key layers. Since it’s a GraphQL request, the first of these layers is the GraphQL framework and our GraphQL schema definition of the video data model. Next is the video delivery stack, which is a generalized architecture capable of serving video for any product experience in Facebook. This architecture has flexibility in supporting various backing-data sources: feeds, databases such as <a href="https://engineering.fb.com/2013/06/25/core-infra/tao-the-power-of-the-graph/">TAO</a>, caches, or other backend systems. So, whether you’re looking at a profile’s video tab, browsing a list of videos that use a certain soundtrack or hashtag, or visiting the new video tab, the video delivery stack serves all of these. </p>
<p>For the video tab, the next step is the Feed stack. At Facebook, we have lots of feeds, so we’ve built common infrastructure to define, configure, and serve various types of feeds: the main News Feed, Marketplace, Groups—you name it. The video tab’s feed implementation then calls into the ranking backend service.</p>
<p>Across these layers, the server handles a significant amount of business logic. This includes throttling requests in response to tight data-center capacity or disaster-recovery scenarios, caching results from the ranking backend, gathering model-input data from various data stores to send to ranking, and latency tracing to log the performance of our system, as well as piping through client-input parameters that need to be passed to ranking.</p>
<h4>Ranking receives a request</h4>
<p>Essentially, our ranking stack is a graph-based execution service that orchestrates the entire serving workflow to generate a set of video stories and return them to the web server.</p>
<p>The recommendation-serving workflow typically includes multiple stages such as candidate retrieval from multiple types of retrieval ML models, various filtering, point-wise ranking, list-wise ranking, and heuristic-diversity control. A candidate will have to go through all these stages to survive and be delivered.</p>
<p>Beyond these stages, our ranking stack also provides more advanced features to maximize value for people. For example, we use the root video to contextualize the top video stories and deliver a pleasant user experience. We also have a framework called elastic ranking that allows dynamic variants of the ranking queries to run based on system load and capacity availability.</p>
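<p>The staged funnel described above can be sketched as follows. Every stage function and scoring rule here is invented for illustration; the production stages are ML models and heuristics:</p>

```python
# Toy sketch of a recommendation funnel: retrieval -> filtering ->
# point-wise ranking -> diversity control. A candidate must survive
# every stage to be delivered. All logic here is illustrative.

def retrieve(candidate_pool, k=100):
    return candidate_pool[:k]  # stand-in for multiple retrieval models

def filter_stage(candidates, viewer):
    # Drop candidates the viewer has already seen.
    return [c for c in candidates if c["id"] not in viewer["seen"]]

def pointwise_rank(candidates):
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

def diversity_control(candidates, max_per_creator=2):
    # Heuristic diversity: cap how many items one creator contributes.
    out, per_creator = [], {}
    for c in candidates:
        n = per_creator.get(c["creator"], 0)
        if n < max_per_creator:
            out.append(c)
            per_creator[c["creator"]] = n + 1
    return out

def serve(candidate_pool, viewer, page_size=10):
    candidates = retrieve(candidate_pool)
    candidates = filter_stage(candidates, viewer)
    candidates = pointwise_rank(candidates)
    candidates = diversity_control(candidates)
    return candidates[:page_size]
```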
<h4>The server receives a response from ranking</h4>
<p>At the most basic level, ranking gives the server a list of video IDs and ranking metadata for each one. The server then needs to load the video entities from our TAO database and execute privacy checks to ensure that the viewer can see these videos. The logic of these privacy checks is defined on the server, so ranking can’t execute these privacy checks, although ranking has some heuristics to reduce the prevalence of recommending videos that the viewer won’t be able to see anyway. Then the video is passed back to the GraphQL framework, which materializes the fields that the client’s query originally asked for. These two steps, privacy checking and materialization, together and in aggregate constitute a meaningful portion of global CPU usage in our data centers, so optimizing here is a significant focus area to alleviate our data center demand and power consumption.</p>
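<p>In simplified form, this post-ranking step might look like the sketch below. The store, privacy rule, and function names are all stand-ins, not Meta’s actual implementation:</p>

```python
# Hypothetical sketch: ranking returns video IDs plus metadata; the
# server loads entities (stand-in for TAO) and drops any video the
# viewer can't see, before GraphQL materializes the requested fields.

VIDEO_STORE = {  # stand-in for the TAO database
    1: {"id": 1, "owner": "alice", "visibility": "public"},
    2: {"id": 2, "owner": "bob", "visibility": "friends"},
}

def can_view(viewer, video):
    # Invented privacy rule for illustration only.
    if video["visibility"] == "public":
        return True
    return video["owner"] in viewer["friends"]

def hydrate(ranked, viewer):
    videos = (VIDEO_STORE.get(r["video_id"]) for r in ranked)
    return [v for v in videos if v is not None and can_view(viewer, v)]
```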
<h4>The mobile client receives a response from the server</h4>
<p>When the client receives the network response, the video stories are added to the in-memory pool. We prioritize the stories based on the server sort key, which is provided by the ranking layer, as well as on whether a story has already been viewed by the person. In this way, we defer content prioritization to the ranking layer, which has much more complex mechanisms for content recommendation than the client. </p>
<p>By deferring to the server sort key on the client, we accomplish our key principles of deferring to ranking for content prioritization as well as prioritizing fresh content.</p>
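<p>This client-side ordering can be pictured as a simple sort; the field names are assumptions for illustration:</p>

```python
# Sketch of the in-memory pool ordering described above: unviewed
# stories come first, and within each group the server sort key from
# ranking decides the order. Field names are hypothetical.

def prioritize(pool):
    # False sorts before True, so unviewed stories lead.
    return sorted(pool, key=lambda s: (s["viewed"], s["server_sort_key"]))
```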
<p>The balance for mobile clients is between content freshness and performance/efficiency. If we wait too long for the network request to complete, people will leave the surface. If we instantly vend cache every time someone comes to the surface, the first pieces of content they see may be stale and thus not relevant or interesting.</p>
<p>If we fetch too often, then our capacity costs will increase. If we don’t fetch enough, relevant network content won’t be available, and we will serve stale content.</p>
<p>When stories are added to the in-memory pool, we also perform media prefetching; this ensures that swiping from one video to another is a seamless experience.</p>
<p>This is the constant balancing act we have to play on behalf of our mobile clients in the content delivery space.</p>
<h3>Dynamic pagination, a new approach to video feed delivery</h3>
<p>In a typical delivery scenario, everyone on Facebook receives a page of videos with a fixed size from ranking to client. However, this approach can be limiting for a large user base where we need to optimize capacity costs on demand. User characteristics vary widely, ranging from those who never swipe down on a video to those who consume many videos in a single session. A person could be completely new to our video experience or could visit the Facebook app multiple times a day. To accommodate both ends of the user-consumption spectrum, we developed a new, dynamic pagination framework.</p>
<p>Under this approach, the ranking layer has full control over the video page size that should be ranked for a given person and served. The server’s role is to provide a guardrail for deterministic page size contracts between the server and client device. In summary, the contract between ranking and server is dynamic page size, while the contract between server and client is fixed page size, with the smallest possible value. This setup helps ensure that if the quantity of ranked videos is too large, the person’s device doesn’t end up receiving all of them. At the same time, it simplifies client-delivery infrastructure by ensuring there is deterministic page size behavior between the client and server.</p>
<p>With the above setup, ranking can provide personalization to varying degrees. If ranking is confident in its understanding of someone’s consumption needs, it can output a larger set of ranked content. Conversely, if ranking is less confident, it can output a smaller set of ranked content. By incorporating this level of personalization, we can carefully curate content for people who are relatively new to the platform while providing a larger recommendation batch for regular users. This approach allows us to conserve capacity and serve the best content to our extremely large user base.</p>
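<p>The two page-size contracts can be sketched as follows; the constants, page sizes, and confidence rule are invented for illustration:</p>

```python
# Sketch of dynamic pagination: ranking chooses a page size per person
# (dynamic ranking<->server contract), while the server enforces a
# fixed, small page size toward the client (deterministic contract).
# All numbers and names below are illustrative assumptions.

CLIENT_PAGE_SIZE = 5  # fixed server<->client contract (assumed value)

def rank(person, candidates):
    # Ranking outputs a larger batch when it is confident about the
    # person's consumption needs, and a smaller batch otherwise.
    n = 20 if person["is_regular"] else 8
    return candidates[:n]

def serve_page(person, candidates):
    ranked = rank(person, candidates)
    # Guardrail: the client never receives more than the fixed contract.
    return ranked[:CLIENT_PAGE_SIZE]
```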
<p><img class="aligncenter size-large wp-image-22040" src="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png 2500w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=2048,1152 2048w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Facebook-video-delivery-3.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h3>Real-time ranking</h3>
<p>Real-time ranking adjusts video content ranking based on user interactions and engagement signals, delivering more relevant content as people interact with the platform.</p>
<p>Real-time signals such as video view time, likes, and other interactions can be collected through asynchronous data pipelines, or through synchronous ones, such as piggybacking a batch of these signals onto the next tail load request. How well ranking can react depends on system latency and signal completeness: if the snapshot of real-time signals between two distinct ranking requests is similar, there is little to no adjustment that ranking can perform to react to the person’s current interest.</p>
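<p>Synchronous piggybacking might look like the toy sketch below; the buffer and request shapes are assumptions, not the actual client code:</p>

```python
# Toy sketch of synchronous signal piggybacking: engagement events are
# buffered on the client and flushed into the next tail-load request so
# ranking can react to the person's current interests.

class SignalBuffer:
    def __init__(self):
        self.pending = []

    def record(self, signal):
        # e.g. {"video": ..., "watch_ms": ...} or {"video": ..., "liked": True}
        self.pending.append(signal)

    def attach_to_request(self, request):
        # Flush the buffered batch into the outgoing request.
        request["signals"] = self.pending
        self.pending = []
        return request
```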
<p>Ranking videos in real time ensures prominent display of relevant and engaging content, while eliminating duplicates and supporting content diversification as well as new topic exploration. This approach enhances user engagement by providing a personalized and responsive viewing experience, adapting to people’s preferences and behaviors in real time during app sessions. Think of responsiveness as how well and consistently our end-to-end infrastructure delivers fresh content.</p>]]></description>
      <link>https://engineering.fb.com/2024/12/10/video-engineering/inside-facebooks-video-delivery-system/</link>
      <guid>https://engineering.fb.com/2024/12/10/video-engineering/inside-facebooks-video-delivery-system/</guid>
      <pubDate>Tue, 10 Dec 2024 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Typed Python in 2024: Well adopted, yet usability challenges persist]]></title>
      <description><![CDATA[<p>This summer, JetBrains, Meta, and Microsoft collaborated to conduct a comprehensive survey on the state of Python typing*. The survey aimed to understand how developers in the open source community are using type hints, the challenges they face, and the tools they rely on. Over 1,000 people took the survey and we are delighted to share the findings. Despite the positive typing sentiment, we received fantastic (even if a little biting at times) feedback about the type system. We’ll give a summary of the findings including usage statistics, overall sentiment and takeaways that can improve Python developer tooling. </p>
<h2>Overall findings</h2>
<ul><li class="c1" aria-level="1">88% of respondents “Always” or “Often” use types in their Python code.</li>
<li class="c1" aria-level="1">IDE tooling, documentation, and catching bugs are drivers for the high adoption of types in survey responses.</li>
<li class="c1" aria-level="1">The usability of types and the ability to express complex patterns are still challenges that leave some code unchecked.</li>
<li class="c1" aria-level="1">Latency in tooling and a lack of types in popular libraries are limiting the effectiveness of type checkers.</li>
<li class="c1" aria-level="1">Inconsistency across type checker implementations and poor discoverability of documentation create friction when onboarding types into a project and when seeking help using the tools. </li>
</ul><h2>Survey methodology</h2>
<p>A survey about types is likely to attract a lot of typing enthusiasts, so we don’t take this to be an unbiased or representative view of everyone in the community. We did our best to distribute it to as many developers as possible and aimed for easy-to-understand questions for all skill levels. We created questions that would give a picture of developer profiles, tools, and overall sentiment towards typed Python. Beyond metrics, we wanted to get a sense of the current mood and are thankful for the detailed and candid feedback. </p>
<h2>Developer cohorts</h2>
<p>Since Python is a general-purpose language, it was not surprising to see types used across many fields. Scripting/automation, web development, data analysis, AI/ML, DevOps, and teaching all had large representation. One surprising finding was the value Python types are demonstrating outside of collaborative environments. A significant portion of respondents use Python types in personal projects (66% of respondents who only use Python personally “Always” or “Often” use types, compared to 78% of exclusively professional developers), and many use them without CI (29.6% of respondents who don’t have type checking in CI still use types “Always” or “Often”).</p>
<p><img class="aligncenter size-large wp-image-22028" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Development_Cohorts_dark_V1.png?w=1024" alt="" width="1024" height="620" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Development_Cohorts_dark_V1.png 1114w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Development_Cohorts_dark_V1.png?resize=916,554 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Development_Cohorts_dark_V1.png?resize=768,465 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Development_Cohorts_dark_V1.png?resize=1024,620 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Development_Cohorts_dark_V1.png?resize=96,58 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Development_Cohorts_dark_V1.png?resize=192,116 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>IDEs and type checkers</h2>
<p>When it comes to development environments, Visual Studio (VS) Code emerged as the most popular choice. The most popular configuration of IDE plus type checker was VS Code with Mypy, followed by PyCharm with Mypy. Mypy remains the most popular type checker, with 67% of respondents using it and 38% using Pyright (24% use both). Emacs and Neovim also have a strong combined user base at 11%. The community’s preference for both IDE and type checker tooling is still quite varied. While Pydantic is not a static type checker, 62% of developers use it and 14% <em>only</em> use Pydantic, showing the use of the type system extending into runtime use cases.</p>
<p><img class="aligncenter size-large wp-image-22030" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-piechart_dark.png?w=1024" alt="" width="1024" height="649" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-piechart_dark.png 1038w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-piechart_dark.png?resize=916,581 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-piechart_dark.png?resize=768,487 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-piechart_dark.png?resize=1024,649 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-piechart_dark.png?resize=96,61 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-piechart_dark.png?resize=192,122 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>What people love</h2>
<p>Despite the challenges, developers appreciate the enhanced autocompletion and improved code clarity that type hints provide. “Better IDE Support” was the most-cited useful feature (59%), followed by “Preventing Bugs” (49.8%) and “Documentation” (49.2%). Developers value the ability to catch potential bugs early and the ease of refactoring typed code. The optional nature of typing allows for gradual adoption, which many find beneficial.</p>
<blockquote class="blockquote">
<p>“<strong>It finds real bugs.</strong> It often points to design flaws when typing is hard or impossible.”</p>
</blockquote>
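<p>As a small illustration of the kind of bug respondents say types catch (our own example, not taken from the survey):</p>

```python
# Our own minimal example: with a type hint on the parameter, a checker
# such as Mypy flags the bad call below statically; without the hint,
# the bug would only surface at runtime.

def total_duration(durations_ms: list[int]) -> int:
    """Sum a list of millisecond durations."""
    return sum(durations_ms)

total_duration([120, 340, 90])   # OK
# total_duration("120,340,90")   # Mypy: argument has incompatible type "str"
```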
<h2>Common issues with type system documentation and usability</h2>
<p>We gave developers the opportunity to provide freeform feedback, and several issues with the current type system came up repeatedly. The most common concerns are the complexity of expressing dynamic features in the type system (29 responses), the slow performance of type checkers like Mypy (22 responses), and the inconsistencies across different type checkers (21 responses). A lack of clarity in documentation, especially for advanced constructs, was also a pain point (10 responses). </p>
<blockquote class="blockquote">
<p>“Numerous libraries lack any type annotations, hindering code analysis and potentially leading to runtime errors.”</p>
</blockquote>
<blockquote class="blockquote">
<p>“The hoops you sometimes have to jump through to at least somewhat correctly express runtime dynamic features, and even then they are often not correctly covered.”</p>
</blockquote>
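<p>One example of the “hoops” respondents mention (our own illustration, not from the survey): a function whose return type depends on an argument’s value needs <code>@overload</code> declarations to be expressed precisely:</p>

```python
# Our own illustration of typing a runtime-dynamic pattern: the return
# type depends on an argument's value, which requires @overload to state.
from typing import Literal, Union, overload

@overload
def load_config(raw: Literal[True]) -> str: ...
@overload
def load_config(raw: Literal[False]) -> dict: ...
def load_config(raw: bool) -> Union[str, dict]:
    text = '{"debug": true}'  # stand-in for reading a config file
    if raw:
        return text
    return {"debug": True}
```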
<h2>Why developers don’t use types</h2>
<p>Among respondents, 321 (29%) cited at least one reason for not using types in their Python code. The primary reason, “Not required for my projects,” accounted for 11% of total survey responses. Interestingly, among the developers who cited this reason, a majority (60%) still reported using types “Always” or “Often.” This is 28 points below the overall survey average, yet it remains a substantial proportion.</p>
<p><img class="aligncenter size-large wp-image-22031" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Static-Types_Dark.png?w=1024" alt="" width="1024" height="538" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Static-Types_Dark.png 1126w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Static-Types_Dark.png?resize=916,482 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Static-Types_Dark.png?resize=768,404 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Static-Types_Dark.png?resize=1024,538 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Static-Types_Dark.png?resize=96,50 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Static-Types_Dark.png?resize=192,101 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>Recommendations for Python language maintainers and tooling authors</h2>
<p>Developers are asking for better standardization and consistency across tools. Improving support for dynamic and complex patterns, as well as enhancing runtime type checking, are all key areas for further thought. Better type checker performance was a common pain point cited by developers in all cohorts. Beyond features and performance, the accessibility and discoverability of Python documentation was mentioned numerous times. <a href="https://docs.python.org/3/library/typing.html">The Python 3 typing docs</a> were the most popular way for people to learn about types or get help with issues. There was consistent feedback asking for better documentation, particularly for advanced typing features that included examples. “Lack of familiarity” was the second highest reason (8% of all responses) people are not using types. There is an opportunity to improve discoverability and usability of documentation.</p>
<p><img class="aligncenter size-large wp-image-22029" src="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Help-dark.png?w=1024" alt="" width="1024" height="702" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Help-dark.png 1172w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Help-dark.png?resize=916,628 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Help-dark.png?resize=768,527 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Help-dark.png?resize=1024,702 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Help-dark.png?resize=96,66 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Meta-Python-typing-survey-2024-Help-dark.png?resize=192,132 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p>
<h2>Thank you! Let’s do this again!</h2>
<p>Thanks to everyone who helped create and share the survey, and an extra big thanks to everyone who filled it out and gave honest, detailed feedback. We had more responses than expected! It’s encouraging to see so much engagement from the community, and we look forward to incorporating the feedback into discussions around the future of Python type checking and tools. </p>
<p>We hope to run the survey again in summer 2025 to see how sentiment changes and the adoption of tooling grows. We have a few ideas for how to improve the survey for next year. We want to ensure that many opinions across the community are heard and that we can capture typing sentiment from folks of different ranges of experience and levels of enthusiasm for typing. </p>
<p>What would you like to see in the survey next year? How can the Python Type System evolve to meet your needs? Join the conversation on <a href="https://discuss.python.org/c/typing/32">discourse</a>. You can also <a href="https://lookerstudio.google.com/reporting/15599c5b-0e51-4423-8998-cf5c1bfeea00/page/8lQ9D/edit">explore the data yourself through this tool</a> and comment below with your insights. </p>
<p><em>*Based on an online survey conducted among 1,083 people, distributed through X, LinkedIn, Reddit, and other social media platforms targeting Python developers. The research was conducted by Meta, Microsoft, and JetBrains. Data was collected between 07/29/2024 and 10/08/2024.</em></p>]]></description>
      <link>https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/</link>
      <guid>https://engineering.fb.com/2024/12/09/developer-tools/typed-python-2024-survey-meta/</guid>
      <pubDate>Mon, 09 Dec 2024 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Powering AI innovation by accelerating the next wave of nuclear]]></title>
      <description><![CDATA[<div class="hero hero--news hero--regular-article"><div class="hero__thumbnail"><figure class="hero__background no-caption"><div class="hero__background-image"><img width="1200" height="673" src="https://sustainability.atmeta.com/wp-content/uploads/2024/12/AdobeStock_981369621.jpeg?fit=1200%2C673" class="attachment-large size-large" alt="A digital art rendering of the nuclear power plant with blue glowing accents, showcasing energy production and technology elements in a wireframe." srcset="https://sustainability.atmeta.com/wp-content/uploads/2024/12/AdobeStock_981369621.jpeg?w=1200 1200w, https://sustainability.atmeta.com/wp-content/uploads/2024/12/AdobeStock_981369621.jpeg?w=300 300w, https://sustainability.atmeta.com/wp-content/uploads/2024/12/AdobeStock_981369621.jpeg?w=768 768w" sizes="(max-width: 1200px) 100vw, 1200px" /></div>
</figure></div></div><div class="container entry-content"><ul class="wp-block-list c1"><li class="has-x-small-font-size">Today, Meta announced it will release a <a href="https://sustainability.atmeta.com/nuclear-energy-rfp-qualification-intake/">request for proposals (RFP)</a> to identify nuclear energy developers to help us meet our AI innovation and sustainability objectives — targeting 1-4 gigawatts (GW) of new nuclear generation capacity in the U.S.; qualified developers can fill out the intake form to receive further guidance on the RFP process.</li>
<li class="has-x-small-font-size">We are taking an open approach with this RFP so we can partner with others across the industry to bring new nuclear energy to the grid.  </li>
</ul><p class="c3">Advancing the technologies that will build the future of human connection — including the next wave of AI innovation — requires electric grids to expand and embrace new sources of reliable, clean and renewable energy. As new innovations bring impactful technological advancements across sectors and support economic growth, we believe that nuclear energy can help provide firm, baseload power to support the growth needs of the electric grids that power both our data centers (the physical infrastructure on which Meta’s platforms operate) as well as the communities around them.</p><p>Supporting the development of clean energy must continue to be a priority as electric grids expand to accommodate growing energy needs. At Meta, we believe nuclear energy will play a pivotal role in the transition to a cleaner, more reliable, and diversified electric grid. That is why today we announced that we will be releasing a <a href="https://sustainability.atmeta.com/nuclear-energy-rfp-qualification-intake/">request for proposals (RFP)</a> to identify nuclear energy developers to help us meet our AI and sustainability objectives.</p><p>Our aim is to add 1-4 GW of new nuclear generation capacity in the U.S. to be delivered starting in the early 2030s. We are looking to identify developers that can help accelerate the availability of new nuclear generators and create sufficient scale to achieve material cost reductions by deploying multiple units, both to provide for Meta’s future energy needs and to advance broader industry decarbonization. We believe working with partners who will ultimately permit, design, engineer, finance, construct, and operate these power plants will ensure the long-term thinking necessary to accelerate nuclear technology.  </p><p>When we began engaging with the renewable energy industry more than a decade ago, the industry was scaling. 
Our early engagement with developers of renewable energy allowed Meta to design contracts that enable both Meta and our developer partners to achieve our respective goals. We want to work creatively with developers to structure an agreement that will similarly enable development of nuclear technology.</p><p>Compared to renewable energy projects that we continue to invest in, such as solar and wind, nuclear energy projects are more capital intensive, take longer to develop, are subject to more regulatory requirements, and have a longer expected operational life. These differences mean we need to engage nuclear energy projects earlier in their development lifecycle and consider their operational requirements when designing a contract. And, as scaling deployments of nuclear technology offers the best chance of rapidly reducing cost, engaging with a partner across projects and locations will allow us to ensure that we can deploy strategically. An RFP process will allow us to approach these projects thoroughly and thoughtfully with these considerations in mind.</p><p>As we look ahead to our next decade of innovation and growth, we are planning for our data center energy needs while simultaneously contributing to a reliable grid and advancing our <a href="https://sustainability.atmeta.com/">sustainability commitments</a>. Building on our <a href="https://sustainability.atmeta.com/blog/2024/10/14/our-approach-to-clean-and-renewable-energy/">efforts to bring new clean and renewable energy to the grid</a> — including solar, wind, battery storage, and, most recently, <a href="https://about.fb.com/news/2024/08/new-geothermal-energy-project-to-support-our-data-centers/" target="_blank" rel="noreferrer noopener">geothermal</a> — we continue to look for innovative ways to enable additional clean energy resources. 
Since 2020, we have matched our global operations with 100% clean and renewable energy and focused on bringing new resources to the grid through <a href="https://sustainability.atmeta.com/blog/2024/10/14/our-approach-to-clean-and-renewable-energy/">innovative partnerships</a> – totaling over 12,000 MW of renewable energy contracts worldwide to date. Going forward, this commitment is more important than ever to support our vision of operating sustainably. As our sector continues to grow, we are committed to working across the industry to advance our sustainability commitments and transform the grid of the future.</p>]]></description>
      <link>https://sustainability.atmeta.com/blog/2024/12/03/accelerating-the-next-wave-of-nuclear-to-power-ai-innovation/</link>
      <guid>https://sustainability.atmeta.com/blog/2024/12/03/accelerating-the-next-wave-of-nuclear-to-power-ai-innovation/</guid>
      <pubDate>Tue, 03 Dec 2024 21:27:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta Andromeda: Supercharging Advantage+ automation with the next-gen personalized ads retrieval engine]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Andromeda is Meta’s proprietary machine learning (ML) system designed for retrieval in ads recommendation, focused on delivering a step-function improvement in value to our advertisers and people.</li>
<li class="c1" aria-level="1">This system pushes the boundary of cutting-edge AI for retrieval with the NVIDIA Grace Hopper Superchip and <a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/" target="_blank" rel="noopener">Meta Training and Inference Accelerator (MTIA)</a> hardware through innovations in ML model architecture, feature representation, learning algorithms, indexing, and inference paradigm.</li>
<li class="c1" aria-level="1">We’re sharing how Andromeda establishes an efficient scaling law for retrieval by harnessing the power of state-of-the-art deep neural networks, benefitting from the co-design of ML, system, and hardware (NVIDIA and MTIA chips) that improves performance and return on investment.</li>
</ul><p>AI plays an important role in Meta’s advertising <a href="https://www.facebook.com/business/news/good-questions-real-answers-how-does-facebook-use-machine-learning-to-deliver-ads" target="_blank" rel="noopener">system</a> by leveraging the power of machine learning (ML) to predict which ads a person will find most interesting. This helps people learn about a business or product they are interested in while helping an advertiser meet their objectives such as increasing brand awareness, acquiring new customers, and driving sales.</p>
<p>Retrieval is the first step in our multi-stage ads recommendation system. This stage is tasked with narrowing tens of millions of ad candidates down to a few thousand relevant candidates. In the following stage, larger and more sophisticated ranking models predict value for people and advertisers to determine the final set of ads to be shown to the person.</p>
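<p>To make the funnel concrete, here is a minimal, hypothetical sketch in plain Python (the scoring functions and sizes are invented for illustration, not Meta’s implementation): a cheap retrieval score narrows a large candidate pool, and a heavier ranking score is applied only to the survivors.</p>

```python
import heapq
import random

def cheap_score(user, ad):
    # Stand-in for a lightweight retrieval model, e.g., a dot product
    # between user and ad embeddings. Hypothetical scoring function.
    return sum(u * a for u, a in zip(user, ad))

def expensive_score(user, ad):
    # Stand-in for a heavier ranking model applied only to survivors.
    return cheap_score(user, ad) + 0.1 * max(ad)

def recommend(user, candidates, retrieve_k=1000, final_k=10):
    # Stage 1 (retrieval): narrow the full pool to retrieve_k candidates.
    retrieved = heapq.nlargest(retrieve_k, candidates,
                               key=lambda ad: cheap_score(user, ad))
    # Stage 2 (ranking): rescore only the retrieved set.
    return heapq.nlargest(final_k, retrieved,
                          key=lambda ad: expensive_score(user, ad))

random.seed(0)
user = [random.random() for _ in range(8)]
pool = [[random.random() for _ in range(8)] for _ in range(5000)]
top = recommend(user, pool, retrieve_k=500, final_k=5)
print(len(top))  # 5
```

<p>The point of the split is that the expensive model runs on 500 candidates rather than 5,000 (or, in production, a few thousand rather than tens of millions).</p>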
<h2>Challenges and opportunities in this new era of advertiser automation with generative AI</h2>
<p>The retrieval stage is challenging primarily because of scalability constraints along two axes: the volume of ad candidates and tight latency constraints.</p>
<p><strong>Volume of ad candidates:</strong> Retrieval processes three orders of magnitude more ads than subsequent stages. Features like predictive targeting, which dramatically improve advertiser outcomes, are computationally expensive. The continued positive momentum of Meta’s <a href="https://www.facebook.com/business/help/397103717129942?id=1913105122334058&amp;content_id=YAvwvDmp3OKQagl&amp;ref=sem_smb&amp;utm_term=dsa-1720753164846&amp;gclid=Cj0KCQjw1Yy5BhD-ARIsAI0RbXYPq9-cjHq5qekJJW6O98EqjXQu8GZjHIsq-1VEFWFwXD37coGHgE4aAu_KEALw_wcB&amp;gad_source=1" target="_blank" rel="noopener">Advantage+</a> suite further increases the number of eligible ads through <a href="https://www.facebook.com/business/ads/automation" target="_blank" rel="noopener">automation</a> of audience creation, optimal budget allocation, dynamic placement across Meta surfaces, and creative generation. Finally, with the adoption of powerful new tools based on generative AI for creating and optimizing ad creative content, the number of ad creatives in Meta’s recommendation systems is expected to grow significantly.</p>
<p><strong>Tight latency constraints:</strong> Selecting ads rapidly is essential for delivering timely and relevant ads, as any delay can disrupt the viewer’s experience by not providing the most current content. As advertising becomes increasingly dynamic, frequent updates to both ad delivery and each person’s interests demand increased model complexity in near real-time.</p>
<p>Processing such a vast number of ads in so little time is capacity intensive, which requires substantial optimization and innovation to scale up model complexity for better personalization while maintaining a high return on investment (ROI) on the required infrastructure investments. </p>
<h2>Unlocking advertiser value through industry-leading ML innovation</h2>
<p>Meta Andromeda is a personalized ads retrieval engine that leverages the NVIDIA Grace Hopper Superchip to enable cutting-edge ML innovation in the ads retrieval stage, driving efficiency and advertiser performance. Key AI advancements include: </p>
<h3>Deep neural networks custom-designed for the NVIDIA Grace Hopper Superchip to deliver superior performance</h3>
<p>Andromeda improves the performance of Meta’s ads system by delivering more personalized ads to viewers and maximizing return on ad spend for advertisers. Meta’s Ads team has created a deep neural network with increased compute complexity and massive parallelism on the NVIDIA Grace Hopper Superchip to better learn higher-order interactions from people and ads data. Its deployment across Instagram and Facebook applications has achieved a +6% recall improvement in the retrieval system, delivering a +8% <a href="https://www.facebook.com/business/help/1767120243598011" target="_blank" rel="noopener">ads quality</a> improvement on selected segments.</p>
<h3>Hierarchical indexing to support exponential ad creatives growth from Advantage+ creative </h3>
<p>Advantage+ automates budget allocation, audience targeting, and bid adjustments – streamlining campaign management and boosting performance through more ads in the system for different audiences. </p>
<p>For example, when advertisers who did not previously use Advantage+ creative turned on its AI-driven targeting features, they experienced a 22% increase in ROAS from our ads. We estimate that businesses using image generation are seeing a +7% increase in conversions. Even at this early stage, more than a million advertisers used our generative AI (GenAI) tools to create more than 15 million ads in a month. Andromeda is designed to maximize ads performance by utilizing the exponential growth in the volume of eligible ads available to the retrieval stage. It introduces an efficient hierarchical index to scale up to a large volume of ad creatives, empowering the adoption of GenAI technologies by advertisers.</p>
<h3>AI development efficiency</h3>
<p>Andromeda reduces system complexity by minimizing components and rule-based logic, allowing for end-to-end performance optimization. This streamlined system enhances the pace of adoption for future AI innovation in the retrieval space.</p>
<h2>Meta’s new personalized ads retrieval paradigm</h2>
<p>Before Andromeda, Meta’s retrieval systems were only able to apply limited personalization, relying on a process with isolated model stages and numerous rule-based heuristics to manage the vast number of ads. This approach hindered end-to-end optimization and efficient global resource allocation to maximize performance. Handling such a massive volume of ads per request was complex, memory bandwidth-intensive, and difficult to scale, resulting in low hardware-level parallelism in conventional retrieval models. This often led to suboptimal performance and slower adoption of AI innovations.</p>
<p><img class="aligncenter wp-image-22014 size-large" src="https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png?w=1024" alt="" width="1024" height="593" srcset="https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png 1862w, https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png?resize=916,530 916w, https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png?resize=768,445 768w, https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png?resize=1024,593 1024w, https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png?resize=1536,889 1536w, https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2024/12/Personalized-Ads-Retrieval-Paradigm-Final.png?resize=192,111 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Andromeda represents a significant technological leap in retrieval – addressing the above challenges with key ML and system innovations.</p>
<h3>A state-of-the-art deep neural network for retrieval</h3>
<p>Andromeda efficiently scales retrieval models through a highly customized deep neural network with sublinear inference cost, enabling a meaningful (10,000x) increase in model capacity for enhanced personalization. Complex latent relationships between people’s interests, products, and services offered through ads are captured through advanced interaction features and new algorithms, further enhancing recommendation relevance and accuracy.</p>
<p>The design is optimized for AI hardware, minimizing memory bandwidth bottlenecks and enabling highly parallel, computation-intensive retrieval models with high performance. GPU preprocessing is used for feature extraction, and all precomputed ad embeddings and features are stored in the local memory of the Grace Hopper Superchip. This approach addresses the traditional scaling constraints of limited CPU-to-GPU interconnect bandwidth, heavy memory IO overhead, and low GPU utilization, and enables efficient handling of a larger set of diverse feature inputs.</p>
<h3>Hierarchical indexing for efficiency and scalable retrieval</h3>
<p>Andromeda organizes ads into a hierarchical index with multiple layers, reducing the number of inference steps by focusing only on the most relevant nodes. The hierarchical index and retrieval models are jointly trained, which aligns the index representations with the neural networks; this improves both precision and recall compared to commonly used two-tower neural networks or approximate nearest neighbor search.</p>
<p>The hierarchically structured neural network provides sub-linear inference cost, enabling retrieval models to scale up to much higher capacity and to efficiently handle a larger volume of ads without sacrificing retrieval accuracy.</p>
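<p>The sub-linear cost of a hierarchical index can be illustrated with a beam search over a tree of node embeddings. This is an illustrative toy (random embeddings, dot-product scoring, and a uniform tree are all assumptions, not Meta’s implementation): only the children of the best-scoring nodes are expanded at each level, so inference cost grows with tree depth rather than with the total number of ads.</p>

```python
import random

random.seed(1)
DIM, BRANCH, DEPTH = 4, 8, 3  # 8^3 = 512 leaf "ads"

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_tree(depth):
    # Each node holds a (here: random) embedding; leaves represent ads.
    node = {"emb": [random.gauss(0, 1) for _ in range(DIM)]}
    if depth > 0:
        node["children"] = [make_tree(depth - 1) for _ in range(BRANCH)]
    return node

def beam_retrieve(root, query, beam=2):
    # Descend level by level, keeping only the `beam` best-scoring nodes,
    # so cost is O(beam * branch * depth) instead of O(number of leaves).
    frontier = [root]
    while "children" in frontier[0]:
        children = [c for n in frontier for c in n["children"]]
        children.sort(key=lambda c: dot(query, c["emb"]), reverse=True)
        frontier = children[:beam]
    return frontier

root = make_tree(DEPTH)
query = [random.gauss(0, 1) for _ in range(DIM)]
leaves = beam_retrieve(root, query, beam=2)
print(len(leaves))  # 2
```

<p>In the jointly trained setting the post describes, the internal node embeddings would be learned alongside the retrieval model so that this greedy descent aligns with what the model considers relevant.</p>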
<h3>Model elasticity</h3>
<p>Andromeda enhances overall system ROI by enabling agile and efficient resource allocation. A segment-aware design leverages higher-complexity models to serve high-value ad segments to maximize ROI. It automatically adjusts model complexity and inference steps in real time based on available resources, thereby allowing a more scalable retrieval system. Together with the hierarchically structured neural network, model elasticity further boosts model inference efficiency by 10x.</p>
<h3>An optimized retrieval model</h3>
<p>Andromeda significantly enhances the retrieval model’s instruction and thread-level parallelism through innovations in model architecture, features, learning algorithms, and the inference paradigm. This model is built with low-latency, high-throughput, and memory-IO aware GPU operators, utilizing deep kernel fusion and advanced software pipelining techniques. This minimizes kernel dispatching overhead, avoids bottlenecks on repeated HBM-SRAM memory IO, and reduces dependency on low arithmetic intensity modules. </p>
<p>Unlike conventional retrieval models that rely on expert-engineered features, Andromeda leverages the NVIDIA Hopper GPU’s massive parallel computing capabilities to dynamically reconstruct latent user-ad interaction signals on the fly, achieving over a 100x improvement in both feature extraction latency and throughput over previous CPU-based components. In addition, the chip’s high-bandwidth CPU-GPU interconnect supercharges ads retrieval inference to process an enormous number of ads per request, enabling faster and more efficient delivery of relevant and personalized ads. The effort has enhanced end-to-end model inference queries per second (QPS) by over 3x.</p>
<h2>Advancing the state of the art in ads retrieval</h2>
<p>Andromeda significantly enhances Meta’s ads system by enabling the integration of AI that optimizes and improves personalization capabilities at the retrieval stage and improves return on ad spend. A hierarchical indexing solution leveraging deep neural networks co-designed with the NVIDIA Grace Hopper Superchip helps address the scalability challenges presented by the exponential growth of creatives while delivering the best experience within strict latency and capacity ROI budgets. Andromeda capitalizes on the fast industry adoption of Advantage+ automation and GenAI to deliver value for our advertisers, people who use our suite of products, and Meta.</p>
<p>Looking forward, the Andromeda model architecture is expected to transition to support an autoregressive loss function, leading to a more efficient and faster inferencing solution that delivers a more diverse set of ad candidates. Increased ad diversity can improve people’s experience with ads and drive better advertiser outcomes. </p>
<p>Integrating Andromeda with MTIA and future generations of commercially-available GPUs will continue to push the boundaries of scaling retrieval – further improving advertiser performance and achieving what we estimate will be another 1,000x increase in model complexity. </p>
<h3>Acknowledgements</h3>
<p><em>We would like to thank Habiya Beg, Zain Brohi, Wenlin Chen, Chunli Fu, Golnaz Ghasemiesfeh, Xingfeng He, Akshay Hegde, Liquan Huang, Liuhan Huang, Kamran Izadi, Santosh Janardhan, Karthik Jayaraman, Changkyu Kim, Santanu Kolay, Ilia Lewis, Wenqian Li, Xiaotian Li, Rocky Liu, Paolo Massimi, Kexin Nie, Sandeep Pandey, Uladzimir Pashkevich, Varna Puvvada, Hang Qu, Melanie Roe, Yan Shi, Matt Steiner, Alisha Swinteck, Bangsheng Tang, Jim Tao, Sunay Vaishnav, Arunprasad Venkatraman, Vidhoon Viswanathan, Sasha Vorontsov, Minghui Wanghan, Fangzhou Xu, Nathan Yan, Tak Yan, Yang Yang, Qing Zhang, Fangyu Zou, and everyone who contributed to the success of Meta Andromeda.</em></p>]]></description>
      <link>https://engineering.fb.com/2024/12/02/production-engineering/meta-andromeda-advantage-automation-next-gen-personalized-ads-retrieval-engine/</link>
      <guid>https://engineering.fb.com/2024/12/02/production-engineering/meta-andromeda-advantage-automation-next-gen-personalized-ads-retrieval-engine/</guid>
      <pubDate>Mon, 02 Dec 2024 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Sequence learning: A paradigm shift for personalized ads recommendations]]></title>
      <description><![CDATA[<p>AI plays a fundamental role in creating valuable connections between people and advertisers within Meta’s family of apps. Meta’s ad recommendation engine, powered by <a href="https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/" target="_blank" rel="noopener">deep learning recommendation models (DLRMs)</a>, has been instrumental in delivering personalized ads to people. Key to this success was incorporating thousands of human-engineered signals or features in the DLRM-based recommendation system.</p>
<p>Despite training on vast amounts of data, there are limitations to current DLRM-based ads recommendations with manual feature engineering due to the inability of DLRMs to leverage sequential information from people’s experience data. To better capture the experiential behavior, the ads recommendation models have undergone foundational transformations along two dimensions:</p>
<ol><li>Event-based learning: learning representations directly from a person’s engagement and conversion events rather than traditional human-engineered features.</li>
<li>Learning from sequences: developing new sequence learning architectures to replace traditional DLRM neural network architectures.</li>
</ol><p>By incorporating these advancements from the fields of natural language understanding and computer vision, Meta’s next-generation ads recommendation engine addresses the limitations of traditional DLRMs, resulting in more relevant ads for people, higher value for advertisers, and better infrastructure efficiency.</p>
<p>These innovations have enabled our ads system to develop a deeper understanding of people’s behavior before and after converting on an ad, enabling us to infer the next set of relevant ads. Since launch, the new ads recommendation system has improved ads prediction accuracy – leading to higher value for advertisers and 2-4% more conversions on select segments.</p>
<h2>The limits of DLRMs for ads recommendations</h2>
<p>Meta’s DLRMs for personalized ads rely on a wide array of signals to understand people’s purchase intent and preferences. DLRMs have revolutionized learning from <a href="https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/" target="_blank" rel="noopener">sparse features</a>, which capture a person’s interactions with entities, like Facebook pages, that have massive cardinalities, often in the billions. The success of DLRMs is founded on their ability to learn generalizable, high-dimensional representations, i.e., embeddings, from sparse features.</p>
<p>To leverage tens of thousands of such features, various strategies are employed to combine features, transform intermediate representations, and compose the final outputs. Further, sparse features are built by aggregating attributes across a person’s actions over various time windows with different data sources and aggregation schemes. </p>
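<p>As a simplified illustration of this legacy style of feature engineering (the field names and window are invented for the example), a time-windowed aggregation might look like the following; note that the order of events and any per-event detail are discarded by the aggregation.</p>

```python
from collections import Counter

DAY = 86400  # seconds per day

def aggregate_page_visits(events, now, window_days):
    # Legacy-style sparse feature: count page visits within a time window.
    # Sequential and fine-grained information is lost in the aggregation.
    counts = Counter(e["page_id"] for e in events
                     if now - e["timestamp"] <= window_days * DAY)
    return counts.most_common()

events = [
    {"page_id": "Page-id1", "timestamp": 9 * DAY},
    {"page_id": "Page-id2", "timestamp": 8 * DAY},
    {"page_id": "Page-id1", "timestamp": 2 * DAY},  # outside the 7-day window
]
print(aggregate_page_visits(events, now=10 * DAY, window_days=7))
# [('Page-id1', 1), ('Page-id2', 1)]
```

<p>Varying the window, data source, or aggregation scheme multiplies such features, which is exactly the redundancy the post describes below.</p>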
<p>Some examples of legacy sparse features thus engineered would be:</p>
<ul><li class="c1" aria-level="1">Ads that a person clicked in the last N days → [Ad-id1, Ad-id2, Ad-id3, …, Ad-idN]</li>
<li class="c1" aria-level="1">Facebook pages a person visited in the past M days with a score of how many visits on each page  → [(Page-id1, 45), (Page-id2, 30), (Page-id3, 8), …]</li>
</ul><p>Human-engineered sparse features, as described above, have been a cornerstone for personalized recommendations with DLRMs for several years. But this approach has limitations:</p>
<ul><li class="c1" aria-level="1">Loss of sequential information: Sequence information, i.e., the order of a person’s events, can provide valuable insights for better ads recommendations relevant to a person’s behavior. Sparse feature aggregations lose the sequential information in a person’s journeys.</li>
<li class="c1" aria-level="1">Loss of granular information: Fine-grained information like collocation of attributes in the same event is lost as features are aggregated across events.</li>
<li class="c1" aria-level="1">Reliance on human intuition: Human intuition is unlikely to recognize non-intuitive, complex interactions and patterns from vast quantities of data.</li>
<li class="c1" aria-level="1">Redundant feature space: Multiple variants of features get created with different aggregation schemes. Though providing incremental value, overlapping aggregations increase compute and storage costs and make feature management cumbersome.</li>
</ul><p>People’s interests evolve over time with continuously evolving and dynamic intents. Such complexities are hard to model with handcrafted features. Modeling these inter-dynamics helps achieve a deeper understanding of a person’s behavior over time for better ad recommendations. </p>
<h2>A paradigm shift with learning from sequences for recommendation systems</h2>
<p>Meta’s new system for ads recommendations uses sequence learning at its core. This necessitated a complete redesign of the ads recommendations system across data storage, feature input formats, and model architecture. The redesign required building a new people-centric infrastructure, training and serving optimization for state-of-the-art sequence learning architectures, and model/system codesign for efficient scaling.</p>
<h3>Event-based features</h3>
<p>Event-based features (EBFs) are the building blocks for the new sequence learning models. EBFs – an upgrade to traditional features – standardize heterogeneous inputs to sequence learning models along three dimensions:</p>
<ol><li>Event streams: the data stream an EBF draws from, e.g., the sequence of recent ads a person engaged with or the sequence of pages they liked.</li>
<li>Sequence length: how many recent events are incorporated from each stream, determined by the importance of that stream.</li>
<li>Event information: semantic and contextual information about each event, such as the ad category a person engaged with and the timestamp of the event.</li>
</ol><p>Each EBF is a single coherent object that captures all key information about an event. EBFs allow us to incorporate rich information and scale inputs systematically. EBF sequences replace legacy sparse features as the main inputs to the recommendation models. When combined with event models described below, EBFs have ushered in a departure from human-engineered feature aggregations.</p>
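<p>Conceptually, an EBF bundles an event’s type, entity, attributes, and timestamp into one object, with each stream capped at a per-stream sequence length. The sketch below is hypothetical (the class and field names are not Meta’s schema) but shows the three dimensions described above.</p>

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    # One event in a stream; field names are illustrative only.
    event_type: str        # e.g., "ad_click", "page_like"
    entity_id: int         # the ad / page the event refers to
    attributes: dict = field(default_factory=dict)  # semantic context
    timestamp: int = 0     # used later for recency encoding

@dataclass
class EventStream:
    name: str              # e.g., "recent_ad_engagements"
    max_length: int        # sequence-length budget for this stream
    events: list = field(default_factory=list)

    def append(self, event: Event):
        # Keep only the most recent max_length events, in time order.
        self.events.append(event)
        self.events.sort(key=lambda e: e.timestamp)
        del self.events[:-self.max_length]

stream = EventStream(name="recent_ad_engagements", max_length=3)
for t in range(5):
    stream.append(Event("ad_click", entity_id=100 + t, timestamp=t))
print([e.entity_id for e in stream.events])  # [102, 103, 104]
```

<p>Unlike the aggregated sparse features shown earlier, nothing about the event is collapsed: order, co-occurring attributes, and timestamps all survive as model input.</p>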
<h3>Sequence modeling with EBFs</h3>
<p>An event model synthesizes event embeddings from event attributes. It learns embeddings for each attribute and uses linear compression to summarize them into a single event attribute-based embedding. Events are timestamp-encoded to capture their recency and temporal order. The event model combines the timestamp encoding with the synthesized event attribute-based embedding to produce the final event-level representation – thus translating an EBF sequence into an event embedding sequence.</p>
<p>This is akin to how language models use embeddings to represent words. The difference is that EBFs have a vocabulary that is many orders of magnitude larger than a natural language because they come from heterogeneous event streams and encompass millions of entities.</p>
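<p>A toy version of the event model (pure Python; the random embedding tables, scalar compression weights, and transformer-style sinusoidal timestamp encoding are all assumptions for illustration) shows the three steps: per-attribute embedding lookup, linear compression into one attribute-based vector, and addition of a timestamp encoding.</p>

```python
import math
import random

random.seed(2)
EMB_DIM = 8

# Hypothetical per-attribute embedding tables (hash bucket -> vector).
tables = {attr: [[random.gauss(0, 1) for _ in range(EMB_DIM)] for _ in range(16)]
          for attr in ("entity_id", "event_type", "category")}
# "Learned" linear compression, reduced to one scalar weight per attribute.
compress_w = {attr: random.random() for attr in tables}

def timestamp_encoding(ts):
    # Sinusoidal encoding of event recency, transformer-style.
    return [math.sin(ts / 10000 ** (2 * i / EMB_DIM)) if i % 2 == 0
            else math.cos(ts / 10000 ** (2 * (i - 1) / EMB_DIM))
            for i in range(EMB_DIM)]

def event_embedding(event, ts):
    # Look up each attribute's embedding, compress them into one vector,
    # then add the timestamp encoding to mark recency and order.
    summed = [0.0] * EMB_DIM
    for attr, value in event.items():
        vec = tables[attr][hash(value) % 16]
        w = compress_w[attr]
        summed = [s + w * v for s, v in zip(summed, vec)]
    return [s + t for s, t in zip(summed, timestamp_encoding(ts))]

emb = event_embedding({"entity_id": 42, "event_type": "ad_click",
                       "category": "travel"}, ts=3)
print(len(emb))  # 8
```

<p>Running this per event turns an EBF sequence into the event embedding sequence consumed by the sequence model described next.</p>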
<p>The event embeddings from the event model are then fed into the sequence model at the center of the next-generation ads recommendation system. The event sequence model is a person-level event summarization model that consumes sequential event embeddings. It utilizes state-of-the-art attention mechanisms to synthesize the event embeddings into a predefined number of embeddings that are keyed by the ad to be ranked. With techniques like multi-headed attention pooling, the complexity of the self-attention module is reduced from <em>O</em>(N*N) to <em>O</em>(M*N), where M is a tunable parameter and N is the maximum event sequence length.</p>
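<p>The pooling idea can be sketched with M query vectors, each attending over all N event embeddings: an O(M*N) computation that stands in for O(N*N) full self-attention. This toy uses plain Python and random vectors (a stand-in for learned queries, not the production module).</p>

```python
import math
import random

random.seed(3)
DIM = 4

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(events, queries):
    # Each of the M queries attends over all N event embeddings, so the
    # cost is O(M*N) rather than the O(N*N) of full self-attention.
    pooled = []
    for q in queries:
        scores = softmax([sum(a * b for a, b in zip(q, e)) for e in events])
        pooled.append([sum(w * e[i] for w, e in zip(scores, events))
                       for i in range(DIM)])
    return pooled

N, M = 100, 4  # N events summarized into M embeddings
events = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(M)]
summary = attention_pool(events, queries)
print(len(summary), len(summary[0]))  # 4 4
```

<p>Because M is a small fixed number, the downstream model sees a constant-size summary no matter how long a person’s event history is.</p>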
<p>The following figure illustrates the differences between DLRMs with a human-engineered features paradigm (left) and the sequence modeling paradigm with EBFs (right) from a person’s event flow perspective.</p>
<p><img class="aligncenter size-large wp-image-21985" src="https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png?w=1024" alt="" width="1024" height="899" srcset="https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png 1999w, https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png?resize=916,804 916w, https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png?resize=768,674 768w, https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png?resize=1024,899 1024w, https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png?resize=1536,1349 1536w, https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png?resize=96,84 96w, https://engineering.fb.com/wp-content/uploads/2024/11/Event-Sequence-Learning-Meta.png?resize=192,169 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h2>Scaling the new sequence learning paradigm</h2>
<p>Following the redesign to shift from sparse feature learning to event-based sequence learning, the next focus was scaling across two domains — scaling the sequence learning architecture and scaling event sequences to be longer and richer.</p>
<h3>Scaling sequence learning architectures</h3>
<p>To enable faster exploration and adoption of state-of-the-art techniques for recommendation systems, we developed a custom transformer architecture that incorporates complex feature-encoding schemes to fully model sequential information. The main challenge with this architectural approach is achieving the performance and efficiency requirements for production: a request to Meta’s ads recommendation system has to rank thousands of ads in a few hundred milliseconds.</p>
<p>To scale representation learning for higher fidelity, the existing sum-pooling approach was replaced with a new architecture that learns feature interactions from unpooled embeddings. Whereas the prior system based on aggregated features was highly optimized for fixed-length embeddings pooled by simple methods like averaging, sequence learning introduces new challenges because different people have different event-sequence lengths. Longer, variable-length event sequences, represented by jagged embedding tensors and unpooled embeddings, result in larger compute and communication costs with higher variance. This challenge of growing costs is addressed by adopting hardware codesign innovations for supporting jagged tensors, namely:</p>
<ul><li class="c1" aria-level="1">Native PyTorch capabilities to support jagged tensors.</li>
<li class="c1" aria-level="1">Kernel-level optimization for processing jagged tensors on GPUs.</li>
<li class="c1" aria-level="1">A <a href="https://dl.acm.org/doi/10.1145/3640457.3688040" target="_blank" rel="noopener">Jagged Flash Attention</a> module to support Flash Attention on jagged tensors.</li>
</ul><h3>Scaling with longer, richer sequences</h3>
<p>Meta’s next-generation recommendation system’s ability to learn directly from event sequences to better understand people’s preferences is further enhanced with longer sequences and richer event attributes.</p>
<p>Sequence scaling entailed:</p>
<ul><li class="c1" aria-level="1"><strong>Scaling with longer sequences:</strong> Increasing sequence lengths gives deeper insights and context about a person’s interests. Techniques like multi-precision quantization and value-based sampling are used to efficiently scale sequence length.</li>
<li class="c1" aria-level="1"><strong>Scaling with richer semantics</strong>: EBFs enable us to capture richer semantic signals about each event, e.g., through multimodal content embeddings. Customized vector quantization techniques are used to efficiently encode the embedding attributes of each event. This yields a more informative representation of the final event embedding.</li>
</ul><h2>The impact and future of sequence learning</h2>
<p>The event sequence learning paradigm has been widely adopted across Meta’s ads systems, resulting in gains in ad relevance and performance, more efficient infrastructure, and accelerated research velocity. Coupled with our focus on advanced <a href="https://arxiv.org/pdf/2406.05898" target="_blank" rel="noopener">transformer architectures</a>, event sequence learning has reshaped Meta’s approach to ads recommendation systems. </p>
<p>Going forward, the focus will be on further scaling event sequences by 100X, developing more efficient sequence modeling architectures like linear attention and state space models, key-value (KV) cache optimization, and multimodal enrichment of event sequences.</p>
<h2>Acknowledgements</h2>
<p><em>We would like to thank Neeraj Bhatia, Zhirong Chen, Parshva Doshi, Jonathan Herbach, Yuxi Hu, Kun Jiang, Santanu Kolay, Boyang Li, Hong Li, Paolo Massimi, Sandeep Pandey, Dinesh Ramasamy, Ketan Singh, Doris Wang, Rengan Xu, Junjie Yang, and the entire event sequence learning team involved in the development and productionization of the next-generation sequence learning-based ads recommendation system.</em></p>]]></description>
      <link>https://engineering.fb.com/2024/11/19/data-infrastructure/sequence-learning-personalized-ads-recommendations/</link>
      <guid>https://engineering.fb.com/2024/11/19/data-infrastructure/sequence-learning-personalized-ads-recommendations/</guid>
      <pubDate>Tue, 19 Nov 2024 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[How Meta built large-scale cryptographic monitoring]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Cryptographic monitoring at scale has been instrumental in helping our engineers understand how cryptography is used at Meta.</li>
<li class="c1" aria-level="1">Monitoring has given us a distinct advantage in our efforts to proactively detect and remove weak cryptographic algorithms and has assisted with our general change safety and reliability efforts.</li>
<li class="c1" aria-level="1">We’re sharing insights into our own cryptographic monitoring system, including challenges faced in its implementation, with the hope of assisting others in the industry aiming to deploy cryptographic monitoring at a similar scale.</li>
</ul><p>Meta’s managed cryptographic library, FBCrypto, plays an important role within Meta’s infrastructure and is used by the majority of our core infrastructure services. Given this, having a robust monitoring system in place for FBCrypto has been instrumental in ensuring its reliability as well as in helping our engineers understand how cryptography is used at Meta so they can make informed development decisions.</p>
<p>Monitoring the health of our library allows us to detect and revert bugs before they reach production services. The data from our monitoring service provides insight into the usage of FBCrypto, allowing us to make data-driven decisions when deciding what improvements to make to the library. For example, it helps us identify components that need more attention either because they are on a hot path or are less stable.</p>
<p>Understanding exactly how clients use a widely distributed library is a common pain point, but the improved understanding of FBCrypto provided by our monitoring helps us maintain a high bar for security posture. Since there is a limit to how much data a symmetric cryptographic key can protect, logging allows us to detect key overuse and rotate keys proactively. It also helps us build an inventory of cryptography usage, making it easy to identify the callsites of weakened algorithms that need to be migrated – an important task because cryptographic strength decays over time, so we need to proactively switch from weakened algorithms to newer, more robust ones.</p>
<p>More generally, improved understanding helps us to make emergency algorithm migrations when a vulnerability of a primitive is discovered.</p>
<p>More recently, this is aiding our efforts to ensure <a href="https://engineering.fb.com/2024/05/22/security/post-quantum-readiness-tls-pqr-meta/" target="_blank" rel="noopener">post-quantum readiness</a> in our asymmetric use cases. The available data improves our decision-making process while prioritizing quantum-vulnerable use cases.</p>
<h2>How cryptographic monitoring works at Meta</h2>
<p>Effective cryptographic monitoring requires storing persisted logs of cryptographic events, upon which diagnostic and analytic tools can be used to gather further insights. Supporting logging at the scale of FBCrypto requires an implementation with unique performance considerations in mind. Given that FBCrypto is used along many high-volume and critical code paths, a naive logging implementation could easily overwhelm a standard logging infrastructure or cause significant performance regressions. This is true for most widely distributed libraries and is especially true in the field of cryptography, where the sheer volume of usage can come as a complete surprise to those unfamiliar with the space. For example, we recently disclosed that roughly 0.05% of CPU cycles at Meta are spent on X25519 key exchange. </p>
<p>Most of Meta’s logs are constructed and written via <a href="https://engineering.fb.com/2019/10/07/core-infra/scribe/" target="_blank" rel="noopener">Scribe</a>, Meta’s standard logging framework. From there, data persists in <a href="https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/" target="_blank" rel="noopener">Scuba</a> and <a href="https://research.facebook.com/publications/hive-a-warehousing-solution-over-a-map-reduce-framework/" target="_blank" rel="noopener">Hive</a>, Meta’s short-term and long-term data stores, respectively.</p>
<p>Typically, the Scribe API is called directly to construct a log for every “event” that needs to be logged. For FBCrypto, this would mean constructing a log for nearly every cryptographic operation that our library is used for. Unfortunately, given the sheer frequency of such operations, a solution like this would consume an unreasonable amount of write throughput and storage capacity. A common solution to this problem would be to introduce sampling (i.e., only log 1/X cryptographic operations, and increase X until we no longer have capacity concerns). However, we felt strongly about not introducing any sampling since doing so would result in most logs being omitted, giving us a less clear picture of the library’s usage.</p>
<p>Instead, the logging uses a “buffering and flushing” strategy, in which cryptographic events are aggregated across time and flushed to a data store at a preconfigured interval.</p>
<p>During the aggregation, a “count” is maintained for every unique event. When it comes time to flush, this count is exported along with the log to convey how often that particular event took place.</p>
<p>Below is a rough illustration of what this looks like:</p>
<p><img class="aligncenter wp-image-21936" src="https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-1-e1731001505528.png?w=859" alt="" width="600" height="450" /></p>
<p>In the above example, the key named “myKeyName” is used to perform encryption using the AES-GCM-SIV encryption algorithm (in practice we log more fields than just key name, method, and algorithm). The operation happens five times and is assigned a count of five. Since machines often compute millions of cryptographic operations per day, this strategy can lead to significant compute savings in production. </p>
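<p>The aggregate-then-flush strategy above can be sketched as follows. The class and method names are illustrative, not Meta's actual API, and a plain list stands in for the Scribe write path:</p>

```python
import threading
from collections import Counter

class BufferedCryptoLogger:
    """Sketch of buffer-and-flush logging: each unique
    (key, method, algorithm) event is counted in memory and exported
    as a single row, with its count, per flush."""

    def __init__(self, flush_fn):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn  # in production this would write to Scribe

    def log(self, key_name, method, algorithm):
        # Called on every cryptographic operation; no I/O happens here.
        with self._lock:
            self._counts[(key_name, method, algorithm)] += 1

    def flush(self):
        # Swap the buffer out under the lock, then export outside it.
        with self._lock:
            snapshot, self._counts = self._counts, Counter()
        for (key_name, method, algorithm), count in snapshot.items():
            self._flush_fn({"key": key_name, "method": method,
                            "algorithm": algorithm, "count": count})

rows = []
logger = BufferedCryptoLogger(flush_fn=rows.append)
for _ in range(5):
    logger.log("myKeyName", "encrypt", "AES-GCM-SIV")
logger.flush()  # exports a single row carrying count == 5
```

<p>Five operations produce one exported row instead of five, which is where the write-throughput and storage savings come from without any sampling.</p>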
<h3>A client-side view</h3>
<p>The aggregation and flushing is implemented within FBCrypto, so the logging and flushing code sits on the client hosts. When clients call a given cryptographic operation (e.g., “encrypt()”), the operation is performed and the log is added to our aggregated buffer. We refer to the object that holds the buffer as the “buffered logger.”</p>
<p>Note that the logging does not change the interface of FBCrypto, so all of this is transparent to the clients of the library. </p>
<p><img class="aligncenter wp-image-21937" src="https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-2-e1731001623468.png?w=939" alt="" width="600" height="338" srcset="https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-2-e1731001623468.png 939w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-2-e1731001623468.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-2-e1731001623468.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-2-e1731001623468.png?resize=768,433 768w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-2-e1731001623468.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-2-e1731001623468.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>In multithreaded environments all threads will log to the same buffer. For this to be performant, we need to choose the right underlying data structure (see the section below on <em>“Additional optimizations”</em> for more details).</p>
<p>While the aggregation works to reduce space and time overhead, the logs need to eventually be written to storage for further use. To do this, a background thread runs on the client host to periodically call the Scribe API to export the logs and flush the map’s contents. </p>
<p>Below is an overview of the overall flow: </p>
<p><img class="aligncenter wp-image-21941" src="https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-3-cropped.png?w=1024" alt="" width="600" height="520" srcset="https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-3-cropped.png 1478w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-3-cropped.png?resize=916,793 916w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-3-cropped.png?resize=768,665 768w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-3-cropped.png?resize=1024,887 1024w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-3-cropped.png?resize=96,83 96w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-3-cropped.png?resize=192,166 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h3>Additional optimizations</h3>
<p>We had to make some additional optimizations to support cryptographic monitoring on Meta’s major products (Facebook, WhatsApp, Instagram, etc.).</p>
<p>With careful design choices around the logging logic and data structures used, our cryptographic logging operates with <strong>no sampling</strong> and has had a negligible impact on compute performance across Meta’s fleet.</p>
<h4>Partially randomized flushing</h4>
<p>Due to the nature of our buffering and flushing strategy, certain clients who were running jobs that restarted large sets of machines at around the same time would have those machines’ logs get flushed at about the same time. This would result in “spiky” writes to the logging platform, followed by longer periods of underutilization between flushes. To normalize our write throughput, we distribute these spikes across time by applying a randomized delay on a per-host basis before logs are flushed for the first time. This leads to a more uniform flushing cadence, allowing for a more consistent load on Scribe. </p>
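<p>The de-spiking idea reduces to a randomized offset before the first flush, with the fixed cadence kept afterward. A minimal sketch, with an illustrative interval (the post does not state the real one):</p>

```python
import random

FLUSH_INTERVAL_S = 60.0  # illustrative; the real interval is preconfigured

def first_flush_delay(interval_s=FLUSH_INTERVAL_S):
    """Random per-host offset before the first flush, so machines that
    restarted at the same moment spread their writes uniformly across
    one interval instead of flushing in lockstep."""
    return random.uniform(0.0, interval_s)

def flush_times(n_flushes, interval_s=FLUSH_INTERVAL_S):
    """All flush times for one host: randomized start, fixed cadence after."""
    start = first_flush_delay(interval_s)
    return [start + i * interval_s for i in range(n_flushes)]
```

<p>Only the first flush is randomized; after that, each host keeps its own steady cadence, so aggregate load on the logging platform stays roughly constant.</p>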
<p>The figure below demonstrates how this works:</p>
<p><img class="aligncenter size-large wp-image-21939" src="https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png?w=1024" alt="" width="1024" height="388" srcset="https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png?resize=916,347 916w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png?resize=768,291 768w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png?resize=1024,388 1024w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png?resize=1536,582 1536w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png?resize=96,36 96w, https://engineering.fb.com/wp-content/uploads/2024/11/Cryptographic-monitoring_Meta-4.png?resize=192,73 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<h4>Derived crypto</h4>
<p>FBCrypto supports a feature called derived crypto, which allows “child” keysets to be derived from “parent” keysets by applying a key derivation function (KDF) to all the keys in the keyset with some salt. This feature is used by a few large-scale use cases that need to generate millions of keys.</p>
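<p>A key derivation of this shape can be sketched with HMAC-SHA256, a common KDF building block (as in HKDF-extract). FBCrypto's actual KDF and parameters are not described in the post, so this is purely illustrative:</p>

```python
import hashlib
import hmac

def derive_child_key(parent_key: bytes, salt: bytes) -> bytes:
    """Derive a child key from a parent key and a salt.
    HMAC-SHA256 stands in for whatever KDF FBCrypto actually uses."""
    return hmac.new(salt, parent_key, hashlib.sha256).digest()
```

<p>Derivation is deterministic, so a use case can regenerate millions of child keys on demand from a single stored parent, which is also why aggregating their logs under the parent key's name loses so little information.</p>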
<p>Our logging initially created a unique row in the buffered logger for every derived keyset, which used a lot of space and put increased load on backend data stores. To address this, we now aggregate the cryptographic operations of derived keys under the name of the parent key. This reduces our overall capacity needs without harming our ability to detect key overuse since, in the worst case, the aggregations would be a pessimistic counter for any given child key. </p>
<p>Thanks to this aggregation, we eliminated the vast majority of our logging volume compared to the space that would have been used with no aggregation. </p>
<h4>The Folly library </h4>
<p>Internally, our buffering makes use of the <a href="https://github.com/facebook/folly/blob/main/folly/concurrency/ConcurrentHashMap.h" target="_blank" rel="noopener">folly::ConcurrentHashMap</a>, which is built to be performant under heavy writes in multithreaded environments, while still guaranteeing atomic accesses.  </p>
<h3>Unified offerings</h3>
<p>Meta’s existing infrastructure and its emphasis on unified offerings are key to supporting this at scale (see the <a href="https://engineering.fb.com/2019/10/07/core-infra/scribe/">Scribe</a> logging framework and the FBCrypto library). These properties often mean that solutions only have to be implemented once in order for the entire company to benefit.</p>
<p>This is especially true here. Most machines in Meta’s fleet can log to Scribe, giving us easy log ingestion support. Furthermore, the wide adoption of FBCrypto gives us insights into cryptographic operations without needing clients to migrate to a new library/API. </p>
<p>From an engineering perspective, this helps us overcome many hurdles that others in the industry might face. For example, it helps us avoid fragmentation that might require multiple custom solutions to be implemented, which would increase our engineering workload.</p>
<h2>The impact of cryptographic monitoring</h2>
<p>The insights from our cryptographic monitoring efforts have served multiple use cases across our security and infrastructure reliability efforts.</p>
<h3>Preemptively mitigating security vulnerabilities</h3>
<p>Thanks to our long retention window, we can monitor trends over time and use them for more predictive modeling and analysis. We can present our findings to cryptography experts, who can do further analysis and predict whether vulnerabilities may emerge. This allows us to preemptively identify clients using cryptography in risky ways and work with them to mitigate these issues before they become real security vulnerabilities. </p>
<p>This is particularly beneficial in preparation for the world of <a href="https://en.wikipedia.org/wiki/Post-quantum_cryptography">post-quantum cryptography</a> (PQC), which requires us to find clients using vulnerable algorithms and ensure they are migrated off in a timely fashion. </p>
<p>We have also found that being able to preemptively detect these vulnerabilities well in advance has led to stronger support during cross-team collaborations. Thanks to the ample notice, teams can seamlessly integrate any necessary migration efforts into their roadmap with minimal interruption to their ongoing projects.</p>
<h3>Promoting infrastructure reliability</h3>
<p>Our root dataset has also served as a useful proxy for client health. This is partially thanks to the lack of sampling, as we can see the exact number of calls taking place, along with their respective success rates. This has been particularly important during large-scale migrations, where anomalous drops in success rate, call volume, etc., may indicate a bug in a new code path. Indeed, numerous detectors and alarms have been built off our dataset to help us perform big migrations safely.</p>
<p>The dataset also contains library versioning information, so we can monitor what versions of our library are running across the fleet in real time. This has been especially useful for rolling out new features, as we can see exactly which clients have picked up the latest changes. This allows us to move faster and more confidently, even when running large-scale migrations across the fleet. </p>
<h2>Challenges to cryptographic monitoring</h2>
<p>Supporting cryptographic logging at Meta’s scale has had its own unique set of challenges.</p>
<h3>Capacity constraints</h3>
<p>Despite our optimizations, we have occasionally found ourselves putting increased load on Scribe (see point above about underestimating cryptographic usage) and have worked with the Scribe team to manage the unexpected increase in write throughput. Doing so has been relatively easy for the company, considering the design optimizations mentioned above.</p>
<p>We also occasionally put an increased load on <a href="https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/" target="_blank" rel="noopener">Scuba</a>, which is optimized to be performant for real-time data (i.e., warm storage) and can be inefficient if used for larger datasets. To minimize compute costs, we also rely on <a href="https://research.facebook.com/publications/hive-a-warehousing-solution-over-a-map-reduce-framework/">Hive</a> tables for longer-term storage (i.e., cold storage). </p>
<h3>Flushing on shutdown</h3>
<p>Besides flushing the logs in the shared singleton map at a preconfigured time interval, client machines will also do one final flush to log all remaining contents of their log buffer to Scribe when a job is being shut down. We have found that operating in a “shutdown environment” can lead to a number of interesting scenarios, particularly when attempting to access Scribe and its dependencies. Many of these scenarios boil down to the nuances of <a href="https://github.com/facebook/folly/blob/main/folly/Singleton.h">folly::Singleton</a>, which is Meta’s go-to library for managing singletons. Likewise, running something “on shutdown” in Java requires using only synchronous I/O code and operating quickly.</p>
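<p>The shutdown hook can be sketched in Python with <code>atexit</code> (the real implementation is C++/Java and must navigate singleton teardown; the names here are illustrative):</p>

```python
import atexit
from collections import Counter

buffered = Counter()   # in-memory aggregation, as during normal operation
exported = []          # stand-in for the Scribe write path

def final_flush():
    """Drain whatever is still buffered when the job shuts down. The real
    path must stay synchronous and fast, since dependencies (logging
    clients, singletons) may already be tearing down at this point."""
    for event, count in buffered.items():
        exported.append((event, count))
    buffered.clear()

atexit.register(final_flush)
buffered[("myKeyName", "encrypt", "AES-GCM-SIV")] += 2
```

<p>Without this final flush, any events logged since the last periodic flush would be lost on every job restart.</p>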
<h2>Our next initiatives for cryptographic monitoring</h2>
<p>While our work thus far has been largely a success, there are many exciting avenues for improvement – for example, further optimizing Scribe throughput and Scuba storage utilization to make more efficient use of Meta’s infrastructure.</p>
<p>We will also continue to leverage the logging data to further develop monitoring and data analytics that promote security and reliability. On the security side, this means continuing to take an inventory of use cases that would be vulnerable in a PQC world and migrating them to more resilient algorithms and configurations. In terms of reliability, it means gaining a better understanding of the end-to-end latency of cryptography use cases.</p>
<p>Within all of this it’s also important that we continue driving the unification of cryptographic offerings and monitoring tooling. While FBCrypto provides a unified set of offerings, other cryptographic use cases across Meta use a different set of tools for telemetry and data collection. Further non-trivial work is needed to achieve full unification across all use cases.</p>
<h2>Acknowledgments</h2>
<p><em>This work could not have been accomplished without the critical efforts of numerous folks, particularly Grace Wu, Ilya Maykov, Isaac Elbaz, and the rest of the CryptoEng team at Meta.</em></p>]]></description>
      <link>https://engineering.fb.com/2024/11/12/security/how-meta-built-large-scale-cryptographic-monitoring/</link>
      <guid>https://engineering.fb.com/2024/11/12/security/how-meta-built-large-scale-cryptographic-monitoring/</guid>
      <pubDate>Tue, 12 Nov 2024 18:00:00 +0100</pubDate>
    </item>
    <item>
      <title><![CDATA[Diff Authoring Time: Measuring developer productivity at Meta]]></title>
      <description><![CDATA[<p>At Meta, we’re always looking for ways to enhance the productivity of our engineers and developers. But how exactly do you measure developer productivity?</p>
<p>On this episode of the Meta Tech Podcast, Pascal Hartig (<a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">@passy</a>) sits down with Sarita and <a href="https://x.com/Inventitech" target="_blank" rel="noopener">Moritz</a>, two engineers at Meta who have been working on Diff Authoring Time (DAT) – a method for measuring how long it takes to submit changes to a codebase.</p>
<p>They talk about the challenges of measuring productivity, how DAT is implemented, and the new abilities it unlocks for developers.</p>
<p>Download or listen to the podcast episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/33265257/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe><br />
You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/4D7HJeNs40U2C6uMQoPcMc?si=tfM2ZSC7REGIAGq693fSaA" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/gb/podcast/measuring-developer-productivity-with-diff-authoring/id1370910331?i=1000671324538" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/7vbp2djc" target="_blank" rel="noopener">Pocket Casts</a></li>
<li><a href="https://overcast.fm/itunes1370910331" target="_blank" rel="noopener">Overcast</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2024/10/25/developer-tools/diff-authoring-time-dat-measuring-developer-productivity-meta/</link>
      <guid>https://engineering.fb.com/2024/10/25/developer-tools/diff-authoring-time-dat-measuring-developer-productivity-meta/</guid>
      <pubDate>Fri, 25 Oct 2024 18:32:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[IPLS: Privacy-preserving storage for your WhatsApp contacts]]></title>
      <description><![CDATA[<p>Your contact list is fundamental to the experiences you love and enjoy on WhatsApp. With contacts, you know which of your friends and family are on WhatsApp, you can easily message or call them, and it helps give you context on who is in your groups. But losing your phone could mean losing your contact list as well. Traditionally, WhatsApp has lacked the ability to store your contact list in a way that can be easily and automatically restored in the event you lose it. What’s more, the only place you were able to add contacts was from your mobile device, by either typing in a phone number or scanning a QR code.</p>
<p>As part of WhatsApp’s new feature to privately add and manage your contacts on WhatsApp across linked devices, we’re announcing a novel encrypted storage system we’ve designed called Identity Proof Linked Storage (IPLS). IPLS allows you to save your contacts and automatically restore them directly through WhatsApp. With IPLS in place, you can now create contacts directly within WhatsApp and choose to sync them to your phone or securely save them only to WhatsApp – giving you the ability to create contacts that are specific to your account. If you use linked devices, this also allows you to add and manage contacts seamlessly regardless of which device you’re on.</p>
<p>Additionally, if you have multiple accounts on the same phone, such as a work and personal account, you can now customize your contact list for each account. If you lose your phone, your contact list can be restored on a newly registered device. </p>
<p>Contact names are stored encrypted within WhatsApp, and we’ve built additional, robust protections by using IPLS to deter access to contacts by anyone except the user.</p>
<p>IPLS incorporates new privacy technology that protects your contact lists in a privacy-preserving fashion. To further ensure the safety and security of this system, we’ve <a href="https://www.cloudflare.com/press-releases/2024/cloudflare-helps-secure-the-worlds-most-popular-messaging-applications/">partnered with Cloudflare</a> to provide <a href="https://blog.cloudflare.com/key-transparency/">independent third-party auditing</a> of its cryptographic properties. The new technology stack was reviewed by external researchers and NCC Group, an independent cybersecurity consultant. </p>
<h2>What is Identity Proof Linked Storage?</h2>
<p>IPLS is a novel system at WhatsApp that allows users to store their contact names in an encrypted way. IPLS allows the client device to save the contact information using a strong encryption key generated on the client device. Its retrieval is based on the client authenticating its primary device identity.</p>
<p>IPLS is based on two existing pieces of technology that are already used at scale by WhatsApp: <a href="https://engineering.fb.com/2023/04/13/security/whatsapp-key-transparency/" target="_blank" rel="noopener">key transparency</a> and our <a href="https://engineering.fb.com/2021/09/10/security/whatsapp-e2ee-backups/" target="_blank" rel="noopener">hardware security module (HSM)</a>. </p>
<p>Certain events associated with your phone’s WhatsApp application (such as installing or reinstalling) trigger the creation of a new cryptographic keypair that is associated with your phone number. WhatsApp’s key transparency system publishes records of these primary device identity key changes to an append-only, cryptographic <a href="https://github.com/facebook/akd/" target="_blank" rel="noopener">Auditable Key Directory (AKD)</a> that allows WhatsApp clients to automatically verify a user’s encryption key. </p>
<p>Key transparency allows WhatsApp, and the public at large, to cryptographically verify if a given phone number used for a WhatsApp account is tied to a given identity key.</p>
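<p>The append-only property of such a directory can be conveyed with a simple hash chain: each epoch's root commits to the previous root plus the new record, so earlier history cannot be rewritten without changing every later root. The real AKD uses a Merkle-tree construction with efficient lookup proofs; this sketch (with illustrative encodings) only shows the chaining idea:</p>

```python
import hashlib

def leaf_hash(phone_number: bytes, identity_key: bytes) -> bytes:
    """Commitment to one directory record (encoding is illustrative)."""
    return hashlib.sha256(b"leaf|" + phone_number + b"|" + identity_key).digest()

def next_root(prev_root: bytes, leaf: bytes) -> bytes:
    """Advance the directory by one record: the new root commits to the
    previous root and the appended leaf, making the log append-only."""
    return hashlib.sha256(b"node|" + prev_root + leaf).digest()
```

<p>Anyone holding two consecutive roots and the appended record can recompute and check the transition, which is the basis of the auditable consistency proofs described below.</p>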
<p>The HSMs are employed by <a href="https://www.whatsapp.com/security/WhatsApp_Security_Encrypted_Backups_Whitepaper.pdf" target="_blank" rel="noopener">WhatsApp end-to-end encrypted backups</a> and allow for private, tamper-resistant execution of application logic within WhatsApp data centers in a privacy-preserving way. Data processing within the HSM’s security boundary remains opaque even to WhatsApp insiders with the highest privilege and physical access to the hardware. </p>
<h2>The components of IPLS</h2>
<h3>The AKD and Cloudflare integration</h3>
<p>As mentioned, the first building block of IPLS is WhatsApp’s AKD, which maps a client phone number to a client identity key. Primary device identity is used to authenticate the client to ensure that only the owner of the contact encryption key is allowed to restore the contacts.</p>
<p>To strengthen the single-instance nature of the AKD, <a href="https://blog.cloudflare.com/key-transparency/" target="_blank" rel="noopener">WhatsApp has engaged Cloudflare</a> to act as an additional witness of additions to the AKD. Cloudflare digitally signs each epoch and its associated root hash, and returns a digital signature confirming that the directory was not tampered with. The HSM-based Key Vault validates the Cloudflare signature using Cloudflare’s public key.</p>
<p>WhatsApp relies on the availability of the Cloudflare signing service and cannot proceed with the updates to AKD in the absence of the digital signature of each update.</p>
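<p>The gating logic amounts to: a new epoch is accepted only if the witness signature over the epoch and its root hash verifies. In this sketch HMAC with a shared demo key stands in for Cloudflare's asymmetric signature, which the vault would verify with Cloudflare's public key; all names are illustrative:</p>

```python
import hashlib
import hmac

WITNESS_KEY = b"demo-witness-key"  # stand-in for a real signing keypair

def witness_sign(epoch: int, root_hash: bytes) -> bytes:
    """Witness's attestation over one directory update."""
    msg = epoch.to_bytes(8, "big") + root_hash
    return hmac.new(WITNESS_KEY, msg, hashlib.sha256).digest()

def accept_epoch(epoch: int, root_hash: bytes, signature: bytes) -> bool:
    """The directory may only advance when the witness signature over
    (epoch, root hash) verifies; otherwise the update is rejected."""
    expected = witness_sign(epoch, root_hash)
    return hmac.compare_digest(expected, signature)
```

<p>Tying progress to an external witness means a compromised or forked directory cannot silently advance, since it cannot produce valid signatures for its divergent roots.</p>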
<p><img class="aligncenter size-large wp-image-21822" src="https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png?w=1024" alt="" width="1024" height="320" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png 1920w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png?resize=916,286 916w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png?resize=768,240 768w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png?resize=1024,320 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png?resize=1536,480 1536w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png?resize=96,30 96w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-1_crop-Copy.png?resize=192,60 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>In addition, WhatsApp provides auditable proofs of consistency for the transitions between epochs. The auditable proofs are published to a write-once, read-many enabled Amazon S3 instance, which has a public interface for any entity to retrieve the proofs.</p>
<p>Using the AKD and partnering with Cloudflare ensures that there is only a single instance of the directory, validated by a third party.</p>
<h3>HSM-based key storage</h3>
<p>To ensure privacy for user contacts registered on WhatsApp, contact names are first encrypted using a symmetric encryption key generated by the user’s device, and then stored in the HSM-based Key Vault. Storage and retrieval of the contact encryption key occurs via an end-to-end encrypted channel between the client and the HSM-based Key Vault, ensuring that the data in transit remains opaque to WhatsApp.  </p>
<p><img class="aligncenter size-large wp-image-21823" src="https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png?w=1024" alt="" width="1024" height="320" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png 1920w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png?resize=916,286 916w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png?resize=768,240 768w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png?resize=1024,320 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png?resize=1536,480 1536w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png?resize=96,30 96w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-2_crop.png?resize=192,60 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Storing the contact key in the HSM-based Key Vault ensures its availability even when the user loses their phone. If a user loses their client device and wants to restore their contacts, the new client device can retrieve the contact key by establishing a secure session with the HSM-based Key Vault. The Key Vault verifies the client identity key by accessing AKD via a secure cryptographic protocol and verifying that the client has the corresponding private key.</p>
<p><img class="aligncenter size-large wp-image-21824" src="https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png?w=1024" alt="" width="1024" height="320" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png 1920w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png?resize=916,286 916w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png?resize=768,240 768w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png?resize=1024,320 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png?resize=1536,480 1536w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png?resize=96,30 96w, https://engineering.fb.com/wp-content/uploads/2024/10/WhatsApp-IPLS-image-3_crop.png?resize=192,60 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Once the client is verified, the new client is allowed to access the contact key in the HSM-based Key Vault using the secure channel established with the client identity key and the HSM key.</p>
<h2>Privacy-preserving contacts storage at WhatsApp scale</h2>
<p>IPLS is a new system that deters unauthorized access to sensitive data by coupling any data access to publicly auditable identity key changes published to WhatsApp’s key transparency infrastructure. This approach is similar to how QR code scanning can be used to detect a public key compromise in an <a href="https://faq.whatsapp.com/820124435853543" target="_blank" rel="noopener">end-to-end encrypted messaging</a> system.</p>
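<p>The deterrent works because any device attempting access must first publish its identity key to the auditable directory, where the legitimate owner (or an auditor) can spot a key they never registered. A minimal self-audit sketch, with an invented in-memory directory standing in for the real transparency log:</p>

```python
def audit_own_entry(directory: dict, phone: str, known_keys: set) -> list:
    # Return any directory keys published for our own number that this
    # user never registered -- evidence of an unauthorized access attempt.
    return [k for k in directory.get(phone, []) if k not in known_keys]

directory = {"+15551234": ["device-key-A"]}
known = {"device-key-A"}
assert audit_own_entry(directory, "+15551234", known) == []

# An attacker's new key cannot stay hidden: registering it makes it
# publicly visible in the directory.
directory["+15551234"].append("device-key-EVIL")
assert audit_own_entry(directory, "+15551234", known) == ["device-key-EVIL"]
```

<p>The real system layers cryptographic proofs on top of this idea, but the core property is the same: access requires a key change, and key changes cannot happen silently.</p>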
<p>WhatsApp’s new approach to contacts will give users more ways to easily manage contacts across devices and accounts and to store them securely without losing them if they change phones or reinstall WhatsApp. We’re excited about how IPLS has enabled this new feature and will help ensure that WhatsApp contacts stay encrypted and can move with users when they get a new phone.</p>]]></description>
      <link>https://engineering.fb.com/2024/10/22/security/ipls-privacy-preserving-storage-for-your-whatsapp-contacts/</link>
      <guid>https://engineering.fb.com/2024/10/22/security/ipls-privacy-preserving-storage-for-your-whatsapp-contacts/</guid>
      <pubDate>Tue, 22 Oct 2024 14:59:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[OCP Summit 2024: The open future of networking hardware for AI]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">At Open Compute Project Summit (OCP) 2024, we’re sharing details about our next-generation network fabric for our AI training clusters.</li>
<li class="c1" aria-level="1">We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP.</li>
<li class="c1" aria-level="1">We look forward to continued collaboration with OCP to open designs for racks, servers, storage boxes, and motherboards to benefit companies of all sizes across the industry.</li>
</ul><p>At Meta, we believe that open hardware drives innovation. In today’s world, where more and more data center infrastructure is being devoted to supporting new and emerging AI technologies, open hardware takes on an important role in assisting with disaggregation. By breaking down traditional data center technologies into their core components, we can build new systems that are more flexible, scalable, and efficient.</p>
<p>Since helping found OCP in 2011, we’ve shared our data center and component designs, and open-sourced our network orchestration software to spark new ideas both in our own data centers and across the industry. Those ideas have made Meta’s data centers <a href="https://sustainability.atmeta.com/2024-sustainability-report/" target="_blank" rel="noopener">among the most sustainable and efficient in the world</a>. Now, through OCP, we’re bringing new open advanced network technologies to our data centers, and the wider industry, for advanced AI applications.</p>
<p>We’re announcing two new milestones for our data centers: Our next-generation network fabric for AI, and a new portfolio of network hardware that we’ve developed in close partnership with multiple vendors.</p>
<figure id="attachment_21877" aria-describedby="caption-attachment-21877" class="wp-caption aligncenter c2"><img class="size-large wp-image-21877" src="https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta-1.png?w=960" alt="" width="960" height="540" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta-1.png 960w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta-1.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta-1.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta-1.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta-1.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-21877" class="wp-caption-text">Disaggregated network fabrics offer significant advantages in scalability over modular-chassis fabric switches.</figcaption></figure><h2>DSF: Scheduled fabric that is disaggregated and open </h2>
<p>Network performance and availability play an important role in extracting the best performance out of our <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/" target="_blank" rel="noopener">AI training clusters</a>. It’s for that reason that we’ve continued to push for disaggregation in the backend network fabrics for our AI clusters. Over the past year we have developed a Disaggregated Scheduled Fabric (DSF) for our next-generation AI clusters to help us develop open, vendor-agnostic systems with interchangeable building blocks from vendors across the industry. DSF-based fabrics allow us to build large, non-blocking fabrics to support high-bandwidth AI clusters.</p>
<p>DSF extends our disaggregated network systems to VoQ-based switch systems powered by the open <a href="https://github.com/opencomputeproject/SAI" target="_blank" rel="noopener">OCP-SAI</a> standard and <a href="https://engineering.fb.com/2018/09/04/data-infrastructure/research-in-brief-building-switch-software-at-scale-and-in-the-open/" target="_blank" rel="noopener">FBOSS</a>, Meta’s own network operating system for controlling network switches. VoQ-based traffic scheduling provides proactive congestion avoidance in the fabric rather than reactive congestion signaling and response.</p>
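<p>The value of virtual output queues (VoQs) can be shown with a toy simulation: each ingress keeps a separate queue per egress port, so traffic headed to a congested output never blocks traffic headed elsewhere, and packets move only when the scheduler grants credit. This is an illustration of the general VoQ technique, not FBOSS or SAI code.</p>

```python
from collections import deque

class VoQIngress:
    """Toy ingress pipeline: one queue per egress port instead of one FIFO."""

    def __init__(self, num_egress: int):
        self.voqs = [deque() for _ in range(num_egress)]

    def enqueue(self, egress: int, pkt: str):
        self.voqs[egress].append(pkt)

    def schedule(self, credits: list) -> list:
        # Transmit only toward egress ports that granted credit. Credits come
        # from the fabric scheduler, so congestion is avoided proactively
        # rather than signalled after queues overflow.
        sent = []
        for egress, q in enumerate(self.voqs):
            if credits[egress] and q:
                sent.append((egress, q.popleft()))
        return sent

ingress = VoQIngress(num_egress=2)
ingress.enqueue(0, "to-congested-port")
ingress.enqueue(1, "to-idle-port")
# Egress 0 grants no credit (congested); egress 1 still drains immediately,
# with no head-of-line blocking.
assert ingress.schedule(credits=[0, 1]) == [(1, "to-idle-port")]
```

<p>With a single shared FIFO, the packet for the congested port would have sat at the head of the queue and stalled the packet behind it; per-egress queues plus credit-based scheduling remove that coupling.</p>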
<p>The DSF fabric supports an open, standard Ethernet-based RoCE interface to endpoints and accelerators across several xPUs and NICs, including Meta’s <a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/">MTIA</a> as well as accelerators from several vendors.</p>
<h2>DSF platforms for next-generation AI fabrics </h2>
<h3>Arista 7700R4 series</h3>
<p>The Arista 7700R4 series DSF platforms consist of dedicated leaf and spine systems that combine to form a single large, distributed switch. As a distributed system, DSF is designed to support high-scale AI clusters.</p>
<p><img class="size-large wp-image-21878 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2024/10/7700R4C-38PE-e1729011213805.png?w=476" alt="" width="476" height="267" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/7700R4C-38PE-e1729011213805.png 476w, https://engineering.fb.com/wp-content/uploads/2024/10/7700R4C-38PE-e1729011213805.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/10/7700R4C-38PE-e1729011213805.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>7700R4C-38PE: DSF Leaf Switch</p>
<ul><li class="c1" aria-level="1">DSF Distributed Leaf Switch (Broadcom Jericho3-AI based)</li>
<li class="c1" aria-level="1">18 x 800GE (36 x 400GE) OSFP800 host ports</li>
<li class="c1" aria-level="1">20 x 800Gbps (40 x 400Gbps) fabric ports</li>
<li class="c1" aria-level="1">14.4Tbps of wirespeed performance with 16GB of buffers</li>
</ul><p><img class="size-large wp-image-21879 aligncenter" src="https://engineering.fb.com/wp-content/uploads/2024/10/7720R4-128PE-e1729011256820.png?w=597" alt="" width="597" height="335" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/7720R4-128PE-e1729011256820.png 597w, https://engineering.fb.com/wp-content/uploads/2024/10/7720R4-128PE-e1729011256820.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2024/10/7720R4-128PE-e1729011256820.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/10/7720R4-128PE-e1729011256820.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>7720R4-128PE: DSF Spine Switch</p>
<ul><li class="c1" aria-level="1">DSF Distributed Spine Switch (Broadcom Ramon3 based)</li>
<li class="c1" aria-level="1">Accelerated compute optimized pipeline</li>
<li class="c1" aria-level="1">128 x 800Gbps (256 x 400Gbps) fabric ports</li>
<li class="c1" aria-level="1">102.4Tbps of wirespeed performance</li>
</ul><h2>51T switches for next-generation 400G/800G fabrics</h2>
<figure id="attachment_21880" aria-describedby="caption-attachment-21880" class="wp-caption aligncenter c3"><img class="size-large wp-image-21880" src="https://engineering.fb.com/wp-content/uploads/2024/10/Minipack3-e1729010564784.png?w=600" alt="" width="600" height="401" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/Minipack3-e1729010564784.png 600w, https://engineering.fb.com/wp-content/uploads/2024/10/Minipack3-e1729010564784.png?resize=96,64 96w, https://engineering.fb.com/wp-content/uploads/2024/10/Minipack3-e1729010564784.png?resize=192,128 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-21880" class="wp-caption-text">Minipack3 (Broadcom Tomahawk5 based, designed by Meta and manufactured by Celestica) 51.2T switch.</figcaption></figure><p>Meta will deploy two next-generation 400G fabric switches, the Minipack3 (the latest version of <a href="https://engineering.fb.com/2019/03/14/data-center-engineering/f16-minipack/" target="_blank" rel="noopener">Minipack</a>, Meta’s own fabric network switch) and the Cisco 8501, both of which are also backward compatible with previous 200G and 400G switches and will support upgrades to 400G and 800G.</p>
<p>The Minipack3 utilizes Broadcom’s latest Tomahawk5 ASIC, while the Cisco 8501 is based on Cisco’s Silicon One G200 ASIC. These high-performance switches transmit up to 51.2 Tbps across 64 OSFP ports, and their designs omit retimers to maximize power efficiency. They also consume significantly less power per bit than their predecessor models.</p>
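<p>The headline bandwidth figures quoted for these platforms follow directly from port count × per-port speed, which makes a quick sanity check easy:</p>

```python
def wirespeed_tbps(ports: int, gbps_per_port: int) -> float:
    # Aggregate wirespeed = number of ports * speed per port.
    return ports * gbps_per_port / 1000

assert wirespeed_tbps(18, 800) == 14.4    # 7700R4C-38PE leaf: 18 x 800GE host ports
assert wirespeed_tbps(128, 800) == 102.4  # 7720R4-128PE spine: 128 x 800G fabric ports
assert wirespeed_tbps(64, 800) == 51.2    # Minipack3 / Cisco 8501: 64 OSFP ports at 800G
```

<p>The same arithmetic explains the 400G/800G framing: each OSFP800 port can run as one 800G or two 400G interfaces, so the aggregate is unchanged either way.</p>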
<p>Meta will run both the Minipack3 and Cisco 8501 on FBOSS.</p>
<figure id="attachment_21881" aria-describedby="caption-attachment-21881" class="wp-caption aligncenter c3"><img class="wp-image-21881 size-large" src="https://engineering.fb.com/wp-content/uploads/2024/10/Cisco-8501-e1729010680692.png?w=600" alt="" width="600" height="230" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/Cisco-8501-e1729010680692.png 600w, https://engineering.fb.com/wp-content/uploads/2024/10/Cisco-8501-e1729010680692.png?resize=96,37 96w, https://engineering.fb.com/wp-content/uploads/2024/10/Cisco-8501-e1729010680692.png?resize=192,74 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-21881" class="wp-caption-text">Cisco 8501 (Cisco Silicon One G200 based, designed and manufactured by Cisco) 51.2T switch.</figcaption></figure><h2>Optics: 2x400G FR4 optics for 400G/800G optical interconnection </h2>
<p><img class="aligncenter size-large wp-image-21882" src="https://engineering.fb.com/wp-content/uploads/2024/10/400G-FR4--e1729010852824.png?w=372" alt="" width="372" height="209" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/400G-FR4--e1729010852824.png 372w, https://engineering.fb.com/wp-content/uploads/2024/10/400G-FR4--e1729010852824.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/10/400G-FR4--e1729010852824.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Meta’s data center fabrics have evolved from 200 Gbps/400 Gbps to 400 Gbps/800 Gbps and we’ve already deployed 2x400G optics in our data centers.</p>
<h2>Evolving FBOSS and SAI for DSF</h2>
<p><img class="aligncenter size-large wp-image-21883" src="https://engineering.fb.com/wp-content/uploads/2024/10/SAI-FBOSS-logo.png?w=456" alt="" width="456" height="168" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/SAI-FBOSS-logo.png 456w, https://engineering.fb.com/wp-content/uploads/2024/10/SAI-FBOSS-logo.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2024/10/SAI-FBOSS-logo.png?resize=192,71 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>We continue to embrace OCP-SAI to onboard the new network fabrics, switch hardware platforms, and optical transceivers to FBOSS. We have collaborated with vendors and the OCP community to evolve SAI, which now supports new features and concepts like DSF as well as other enhanced routing schemes.</p>
<p>Developers and engineers from all over the world can work with this open hardware and contribute their own software that they, in turn, can use themselves and share with the wider industry.</p>
<h2>FBNIC: A multi-host foundational NIC designed by Meta</h2>
<p><img class="aligncenter size-large wp-image-21884" src="https://engineering.fb.com/wp-content/uploads/2024/10/FBNIC-e1729010986979.png?w=600" alt="" width="600" height="280" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/FBNIC-e1729010986979.png 600w, https://engineering.fb.com/wp-content/uploads/2024/10/FBNIC-e1729010986979.png?resize=96,45 96w, https://engineering.fb.com/wp-content/uploads/2024/10/FBNIC-e1729010986979.png?resize=192,90 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>We are continuing to design more ASICs, including the ASIC for FBNIC. FBNIC is a true multi-host foundational NIC and contains the first of our Meta-designed network ASICs for our server fleet and <a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/">MTIA</a> solutions. It can support up to four hosts with complete datapath isolation for each host. The FBNIC driver has been upstreamed (available from the v6.11 kernel). The NIC module was designed by Marvell and has been contributed to OCP.</p>
<p>FBNIC’s key features include:</p>
<ul><li class="c1" aria-level="1">Network interfaces for up to 4×100/4×50/4×25 GE with SerDes support for up to 56G PAM4 per lane.</li>
<li class="c1" aria-level="1">Up to 4 independent PCIe Gen5 slices</li>
<li class="c1" aria-level="1">HW offloads including LSO, Checksum</li>
<li class="c1" aria-level="1">Line rate timestamping (for each host all the way from PHY) for PTP</li>
<li class="c1" aria-level="1">Header-Data split to assist Zero-Copy</li>
<li class="c1" aria-level="1">Compliant with OCP NIC 3.0, version 1.2.0, design specification</li>
</ul><h2>The future is open</h2>
<p>Advancing AI means building data center infrastructure that goes beyond scale. It also has to allow for flexibility and perform efficiently and sustainably. At Meta, we envision a future of AI hardware systems that are not only scalable, but also open and collaborative.</p>
<p>We encourage anyone who wants to help advance networking hardware for AI to engage with OCP and Meta to help shape the future of AI infrastructure.</p>]]></description>
      <link>https://engineering.fb.com/2024/10/15/data-infrastructure/open-future-networking-hardware-ai-ocp-2024-meta/</link>
      <guid>https://engineering.fb.com/2024/10/15/data-infrastructure/open-future-networking-hardware-ai-ocp-2024-meta/</guid>
      <pubDate>Tue, 15 Oct 2024 19:06:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Meta’s open AI hardware vision]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">At the Open Compute Project (OCP) Global Summit 2024, we’re showcasing our latest open AI hardware designs with the OCP community.</li>
<li class="c1" aria-level="1">These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components. </li>
<li class="c1" aria-level="1">By sharing our designs, we hope to inspire collaboration and foster innovation. If you’re passionate about building the future of AI, we invite you to engage with us and OCP to help shape the next generation of open hardware for AI.</li>
</ul><p>AI has been at the core of the experiences Meta has been delivering to people and businesses for years, including AI modeling innovations to optimize and improve on features like <a href="https://ai.meta.com/blog/facebook-feed-improvements-ai-show-more-less/" target="_blank" rel="noopener">Feed</a> and our <a href="https://engineering.fb.com/2024/07/10/data-infrastructure/machine-learning-ml-prediction-robustness-meta/" target="_blank" rel="noopener">ads system</a>. As we develop and release new, advanced AI models, we are also driven to advance our infrastructure to support our new and emerging AI workloads.</p>
<p>For example, <a href="https://ai.meta.com/blog/meta-llama-3-1/" target="_blank" rel="noopener">Llama 3.1 405B</a>, Meta’s largest model, is a dense transformer with 405B parameters and a context window of up to 128k tokens. To train a large language model (LLM) of this magnitude, with over 15 trillion tokens, we had to make substantial optimizations to our entire training stack. This effort pushed our infrastructure to operate across more than 16,000 NVIDIA H100 GPUs, making Llama 3.1 405B the first model in the Llama series to be trained at such a massive scale. </p>
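<p>To get a feel for the scale involved, a back-of-envelope estimate using the common ≈6 × parameters × tokens approximation for dense-transformer training compute is useful. The approximation and the sustained-throughput figure below are outside assumptions for illustration, not numbers from this post:</p>

```python
params = 405e9   # Llama 3.1 405B parameters
tokens = 15e12   # "over 15 trillion tokens"

# Widely used dense-transformer training estimate: ~6 FLOPs per
# parameter per token (forward + backward).
flops = 6 * params * tokens          # on the order of 3.6e25 FLOPs

# Rough wall-clock across 16,000 H100s, assuming ~400 TFLOPS sustained
# per GPU (a utilization assumption, not a reported figure):
cluster_flops = 16000 * 400e12
days = flops / cluster_flops / 86400  # roughly two months
print(f"{flops:.2e} FLOPs, ~{days:.0f} days")
```

<p>Even under optimistic utilization assumptions, the arithmetic lands at months of wall-clock time on 16K GPUs, which is why the training-stack optimizations described above were necessary.</p>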
<p>Prior to Llama, our largest AI jobs ran on 128 NVIDIA A100 GPUs. But things have accelerated quickly. Over the course of 2023, we scaled up our training clusters from 1K to 2K, 4K, and eventually 16K GPUs to support our AI workloads. Today, we’re training our models on two <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/" target="_blank" rel="noopener">24K-GPU clusters</a>.</p>
<p>We don’t expect this upward trajectory for AI clusters to slow down any time soon. In fact, we expect the amount of compute needed for AI training will grow significantly from where we are today.</p>
<p>Building AI clusters requires more than just GPUs. Networking and bandwidth play an important role in ensuring the clusters’ performance. Our systems pair a tightly integrated HPC compute system with an isolated high-bandwidth compute network that connects all our GPUs and domain-specific accelerators. This design is necessary to meet our injection-bandwidth needs and to address the challenge of providing sufficient bisection bandwidth.</p>
<p>In the next few years, we anticipate greater injection bandwidth on the order of a terabyte per second, per accelerator, with equal normalized bisection bandwidth. This represents a growth of more than an order of magnitude compared to today’s networks!</p>
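<p>What “equal normalized bisection bandwidth” implies can be made concrete with simple arithmetic: in a non-blocking fabric, half the endpoints must be able to send across any bisection of the network at full injection rate. The cluster size below is a hypothetical chosen for illustration:</p>

```python
def bisection_bw_tb_s(num_accelerators: int, injection_tb_s: float) -> float:
    # Full (non-blocking) bisection bandwidth: half the endpoints sending
    # across the cut at their full injection rate.
    return (num_accelerators / 2) * injection_tb_s

# e.g., a hypothetical 16K-accelerator cluster at ~1 TB/s injection each:
assert bisection_bw_tb_s(16000, 1.0) == 8000.0  # 8 PB/s across the bisection
```

<p>Multi-petabyte-per-second bisections are what drive the need for the multi-tier, non-blocking fabric with modern congestion control described next.</p>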
<p>To support this growth, we need a high-performance, multi-tier, non-blocking network fabric that can utilize modern congestion control to behave predictably under heavy load. This will enable us to fully leverage the power of our AI clusters and ensure they continue to perform optimally as we push the boundaries of what is possible with AI.</p>
<p>Scaling AI at this speed requires open hardware solutions. Developing new architectures, network fabrics, and system designs is most efficient and impactful when built on principles of openness. By investing in open hardware, we unlock AI’s full potential and propel ongoing innovation in the field.</p>
<h2>Introducing Catalina: Open Architecture for AI Infra</h2>
<figure id="attachment_21841" aria-describedby="caption-attachment-21841" class="wp-caption alignleft c2"><img class="wp-image-21841" src="https://engineering.fb.com/wp-content/uploads/2050/05/Catalina-Front-Back-2.png?w=683" alt="" width="456" height="683" srcset="https://engineering.fb.com/wp-content/uploads/2050/05/Catalina-Front-Back-2.png 720w, https://engineering.fb.com/wp-content/uploads/2050/05/Catalina-Front-Back-2.png?resize=611,916 611w, https://engineering.fb.com/wp-content/uploads/2050/05/Catalina-Front-Back-2.png?resize=683,1024 683w, https://engineering.fb.com/wp-content/uploads/2050/05/Catalina-Front-Back-2.png?resize=96,144 96w, https://engineering.fb.com/wp-content/uploads/2050/05/Catalina-Front-Back-2.png?resize=192,288 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-21841" class="wp-caption-text">Catalina front view (left) and rear view (right).</figcaption></figure><p>Today, we announced the upcoming release of Catalina, our new high-powered rack designed for AI workloads, to the OCP community. Catalina is based on the <a href="https://nvidianews.nvidia.com/news/nvidia-contributes-blackwell-platform-design-to-open-hardware-ecosystem-accelerating-ai-infrastructure-innovation" target="_blank" rel="noopener">NVIDIA Blackwell platform full rack-scale solution</a>, with a focus on modularity and flexibility. It is built to support the latest NVIDIA GB200 Grace Blackwell Superchip, ensuring it meets the growing demands of modern AI infrastructure. </p>
<p>The growing power demands of GPUs mean open rack solutions need to support higher power capability. With Catalina, we’re introducing the Orv3, a high-power rack (HPR) capable of supporting up to 140kW.</p>
<p>The full solution is liquid cooled and consists of a power shelf that supports a compute tray, switch tray, the Orv3 HPR, the <a href="https://engineering.fb.com/2021/11/09/data-center-engineering/ocp-summit-2021/" target="_blank" rel="noopener">Wedge 400</a> fabric switch, a management switch, battery backup unit, and a rack management controller.</p>
<p>We aim for Catalina’s modular design to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards.</p>
<h2>The Grand Teton Platform now supports AMD accelerators</h2>
<p><img class="aligncenter wp-image-21858" src="https://engineering.fb.com/wp-content/uploads/2024/10/Grand-Teton-AMD-MI300X-Open-small.png?w=916" alt="" width="600" height="436" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/Grand-Teton-AMD-MI300X-Open-small.png 1109w, https://engineering.fb.com/wp-content/uploads/2024/10/Grand-Teton-AMD-MI300X-Open-small.png?resize=916,665 916w, https://engineering.fb.com/wp-content/uploads/2024/10/Grand-Teton-AMD-MI300X-Open-small.png?resize=768,557 768w, https://engineering.fb.com/wp-content/uploads/2024/10/Grand-Teton-AMD-MI300X-Open-small.png?resize=1024,743 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/Grand-Teton-AMD-MI300X-Open-small.png?resize=96,70 96w, https://engineering.fb.com/wp-content/uploads/2024/10/Grand-Teton-AMD-MI300X-Open-small.png?resize=192,139 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>In 2022, we announced <a href="https://engineering.fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/" target="_blank" rel="noopener">Grand Teton</a>, our next-generation AI platform (the follow-up to our Zion-EX platform). Grand Teton is designed with compute capacity to support the demands of memory-bandwidth-bound workloads, such as Meta’s <a href="https://ai.facebook.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/" target="_blank" rel="noopener">deep learning recommendation models</a> (DLRMs), as well as compute-bound workloads like content understanding.</p>
<p>Now, we have expanded the Grand Teton platform to support the AMD Instinct MI300X and will be contributing this new version to OCP. Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads.</p>
<p>In addition to supporting a range of accelerator designs, now including the AMD Instinct MI300X, Grand Teton offers significantly greater compute capacity, allowing faster convergence on a larger set of weights. This is complemented by expanded memory to store and run larger models locally, along with increased network bandwidth to scale up training cluster sizes efficiently.</p>
<h2>Open Disaggregated Scheduled Fabric</h2>
<p><img class="aligncenter size-large wp-image-21860" src="https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png?w=1024" alt="" width="1024" height="508" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png 1871w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png?resize=916,454 916w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png?resize=768,381 768w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png?resize=1024,508 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png?resize=1536,762 1536w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png?resize=96,48 96w, https://engineering.fb.com/wp-content/uploads/2024/10/OCP-2024-DSF-Meta.png?resize=192,95 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>Developing an open, vendor-agnostic networking backend will play an important role going forward as we continue to push the performance of our AI training clusters. Disaggregating our network allows us to work with vendors from across the industry to design systems that are innovative as well as scalable, flexible, and efficient.</p>
<p>Our new Disaggregated Scheduled Fabric (DSF) for our next-generation AI clusters offers several advantages over our existing switches. By opening up our network fabric we can overcome limitations in scale, component supply options, and power density. DSF is powered by the open <a href="https://github.com/opencomputeproject/SAI" target="_blank" rel="noopener">OCP-SAI</a> standard and <a href="https://engineering.fb.com/2018/09/04/data-infrastructure/research-in-brief-building-switch-software-at-scale-and-in-the-open/" target="_blank" rel="noopener">FBOSS</a>, Meta’s own network operating system for controlling network switches. It also supports an open, standard Ethernet-based RoCE interface to endpoints and accelerators across several GPUs and NICs from several different vendors, including our partners at NVIDIA, Broadcom, and AMD.</p>
<p>In addition to DSF, we have also developed and built new 51T fabric switches based on Broadcom and Cisco ASICs. Finally, we are sharing our new FBNIC, a NIC module that contains our first Meta-designed network ASIC.</p>
<h2>Meta and Microsoft: Driving Open Innovation Together</h2>
<p>Meta and Microsoft have a long-standing partnership within OCP, beginning with the development of the <a href="https://www.opencompute.org/documents/switch-abstraction-interface-ocp-specification-v0-2-pdf" target="_blank" rel="noopener">Switch Abstraction Interface (SAI)</a> for data centers in 2018. Over the years together, we’ve contributed to key initiatives such as the <a href="https://www.opencompute.org/blog/new-open-accelerator-infrastructure-oai-sub-project-to-launch-within-the-ocp-server-project" target="_blank" rel="noopener">Open Accelerator Module (OAM)</a> standard and SSD standardization, showcasing our shared commitment to advancing open innovation.</p>
<p>Our current <a href="https://azure.microsoft.com/en-us/blog/accelerating-industry-wide-innovations-in-datacenter-infrastructure-and-security/" target="_blank" rel="noopener">collaboration focuses on Mount Diablo</a>, a new disaggregated power rack. It’s a cutting-edge solution featuring a scalable 400 VDC unit that enhances efficiency and scalability. This innovative design allows more AI accelerators per IT rack, significantly advancing AI infrastructure. We’re excited to continue our collaboration through this contribution.</p>
<h2>The open future of AI infra</h2>
<p><a href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/" target="_blank" rel="noopener">Meta is committed to open source AI</a>. We believe that open source will put the benefits and opportunities of AI into the hands of people all over the world.</p>
<p>AI won’t realize its full potential without collaboration. We need open software frameworks to drive model innovation, ensure portability, and promote transparency in AI development. We must also prioritize open and standardized models so we can leverage collective expertise, make AI more accessible, and work towards minimizing biases in our systems.</p>
<p>Just as important, we also need open AI hardware systems. These systems are necessary for delivering the kind of high-performance, cost-effective, and adaptable infrastructure needed for AI advancement.</p>
<p>We encourage anyone who wants to help advance the future of AI hardware systems to engage with the OCP community. By addressing AI’s infrastructure needs together, we can unlock the true promise of open AI for everyone.</p>]]></description>
      <link>https://engineering.fb.com/2024/10/15/data-infrastructure/metas-open-ai-hardware-vision/</link>
      <guid>https://engineering.fb.com/2024/10/15/data-infrastructure/metas-open-ai-hardware-vision/</guid>
      <pubDate>Tue, 15 Oct 2024 19:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[How open source AI can improve population estimates, sustainable energy, and the delivery of climate change interventions]]></title>
      <description><![CDATA[<ul><li class="c1" aria-level="1">Data for Good at Meta is open-sourcing the data used to train our AI-powered population maps.</li>
<li class="c1" aria-level="1"> We’re hoping that researchers and other organizations around the world will be able to leverage these tools to assist with a wide range of projects including those on climate adaptation, public health and disaster response.</li>
<li class="c1" aria-level="1">The dataset and code are available now on <a href="https://github.com/facebookresearch/HighResolutionSettlementLayer" target="_blank" rel="noopener">GitHub</a>.</li>
</ul><p>To support the ongoing work of researchers, governments, nonprofits, and humanitarians around the world, the Data for Good at Meta program is open-sourcing the first set of training data and sample code used to construct <a href="https://dataforgood.facebook.com/dfg/tools/high-resolution-population-density-maps" target="_blank" rel="noopener">Meta’s AI-powered population maps.</a></p>
<p>As the world looks towards the increasing threat of climate change, Meta’s AI-powered population maps, and the data behind them, offer significant opportunities to direct investments in disaster preparedness through improved estimation of <a href="https://www.nature.com/articles/s41467-019-09282-y" target="_blank" rel="noopener">global flood exposure</a> and in <a href="https://www.cambridge.org/core/journals/global-sustainability/article/upscaling-urban-data-science-for-global-climate-solutions/D2D622B43CD50A9B2FD5DF855BCC0F18?fbclid=IwY2xjawEnQjVleHRuA2FlbQIxMAABHbTiWUPUhcbX0JBxfPLVwtg9fd6wyYO98jy1N0MatP_Fse1Sv7078P2pYg_aem_Y5QcbSZqolPCKpdKynnlfQ" target="_blank" rel="noopener">climate adaptation planning</a>.</p>
<p>By open sourcing these tools, we hope that other researchers can generate new insights that speed the delivery of sustainable energy and climate-resilient infrastructure around the world.</p>
<h2>Why we need better population maps</h2>
<p>Accurate estimates of population are taken for granted in many countries. Governments in advanced economies can rely on a variety of sources, including tax records or census datasets, to better estimate their population and make informed decisions on the delivery of services. However, in other parts of the world, accurate population data is hard to come by. In certain low- and middle-income countries, the most recent census may have been conducted decades ago or lack accurate representation of vulnerable populations. Furthermore, estimates between censuses are often fraught with inaccuracies, and remote populations may be entirely missing from official sources. As a result, uncounted communities may live outside the reach of critical programs.</p>
<p>To combat this challenge, Meta began <a href="https://ai.meta.com/research/publications/mapping-the-world-population-one-building-at-a-time/" target="_blank" rel="noopener">the process of mapping the world’s population using artificial intelligence and satellite imagery</a> in 2017. Alongside other leading population mapping institutions like <a href="https://people.climate.columbia.edu/units/view/5" target="_blank" rel="noopener">Columbia University’s Center for Earth Science Information Network</a> (CIESIN) and <a href="https://www.worldpop.org/" target="_blank" rel="noopener">WorldPop at the University of Southampton</a>, we have <a href="https://data.humdata.org/organization/meta" target="_blank" rel="noopener">openly published hundreds of high resolution population maps and datasets</a>. These have been used around the world by governments and nonprofits for social programs ranging from the <a href="https://openknowledge.worldbank.org/server/api/core/bitstreams/a155c5ae-cd99-5635-a9de-4b86905f402f/content" target="_blank" rel="noopener">targeting of COVID-19 interventions</a> to the delivery of clean water. As the world’s natural resource and energy demands scale, accurate population estimates also offer significant opportunities to improve sustainability efforts.</p>
<figure id="attachment_21795" aria-describedby="caption-attachment-21795" class="wp-caption aligncenter c2"><img class="size-large wp-image-21795" src="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-1.png?w=1024" alt="" width="1024" height="791" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-1.png 1118w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-1.png?resize=916,708 916w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-1.png?resize=768,594 768w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-1.png?resize=1024,791 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-1.png?resize=96,74 96w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-1.png?resize=192,148 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-21795" class="wp-caption-text">The World Bank leveraged Meta’s AI-powered population maps to identify potential COVID-19 hotspots in Kinshasa, DRC.</figcaption></figure><h2>Background on Meta’s AI-powered population maps</h2>
<p>Data for Good’s AI-powered population maps estimate the number of people living within 30-meter grid tiles in nearly every country around the world. These maps leverage computer vision techniques – similar to those used to <a href="https://about.fb.com/news/2021/01/using-ai-to-improve-photo-descriptions-for-blind-and-visually-impaired-people/" target="_blank" rel="noopener">identify objects in photos for the visually impaired</a> – to identify human-made structures in satellite imagery. The outputs of Meta’s AI model are then combined with population stock estimates from CIESIN to approximate the number of people living in each tile.</p>
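<p>The disaggregation step can be sketched in miniature (the field names below are hypothetical, and Meta’s actual pipeline is far more sophisticated): a regional census total is split across grid tiles in proportion to the settlement the vision model detected in each tile, so tiles with no detected buildings receive no population.</p>

```typescript
// A minimal sketch of proportional population disaggregation. "settlementPixels"
// stands in for whatever per-tile building signal the model produces.
type Tile = { id: string; settlementPixels: number };

function disaggregate(tiles: Tile[], censusTotal: number): Map<string, number> {
  const totalPixels = tiles.reduce((sum, t) => sum + t.settlementPixels, 0);
  const populations = new Map<string, number>();
  for (const t of tiles) {
    // Each tile gets a share of the census total proportional to its detected
    // settlement; a tile with no detected structures gets zero.
    populations.set(
      t.id,
      totalPixels === 0 ? 0 : (censusTotal * t.settlementPixels) / totalPixels,
    );
  }
  return populations;
}
```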
<p>In addition to total population counts, Meta’s population maps also include demographic breakdowns for groups such as the number of children under five, women of reproductive age, youth, and the elderly. </p>
<p>AI-powered population estimates have been scientifically evaluated to be among the most accurate in the world for mapping population distribution across a variety of geographies and use cases. For example, <a href="https://www.nature.com/articles/s41598-022-07720-4" target="_blank" rel="noopener">this 2022 paper by researchers at the University of Southampton and University of Ghana in <em>Nature – Scientific Reports</em></a> compares various population density estimates for use in mapping flooding risk in West Africa. Other studies have investigated use cases such as mapping <a href="https://link.springer.com/article/10.1007/s11069-023-06283-5" target="_blank" rel="noopener">landslide risk</a> and <a href="https://www.biorxiv.org/content/10.1101/2020.06.18.160101v1.full" target="_blank" rel="noopener">malaria eradication</a> across a range of countries, including <a href="https://www.mdpi.com/2306-5729/3/3/33" target="_blank" rel="noopener">Haiti, Malawi, Madagascar, Nepal, Rwanda, and Thailand</a>.</p>
<h2>Open-sourcing training data for our AI population maps</h2>
<p>This initial set of training data consists of almost 10 million human labels over 126 gigabytes of satellite imagery patches, each label indicating whether a building is present. <a href="https://resources.maxar.com/data-sheets/imagery-basemaps-data-sheet" target="_blank" rel="noopener">These labels were created on satellite imagery dating from 2011–2020</a>; however, even labels made on older imagery are useful for training the next generation of machine vision models (like <a href="https://ai.meta.com/sam2/" target="_blank" rel="noopener">Meta’s Segment Anything</a>) to more accurately identify buildings in a range of land-cover environments. In addition to this first batch, we plan to release additional data and code for computer vision training in the future.</p>
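<p>As a rough illustration, one released training example might be modeled as below; the field names are hypothetical (the real schema is documented in the GitHub repository and may differ), but the shape follows the description above: an imagery patch plus a binary human label.</p>

```typescript
// Hypothetical shape for one training record from the released dataset.
type PatchLabel = {
  patchId: string;      // identifier of the imagery patch
  imageryYear: number;  // the labels span imagery from 2011-2020
  hasBuilding: boolean; // the human-assigned label
};

// A trivial consumer: the positive-class rate, a number worth checking for
// class imbalance before training a building-detection model on the labels.
function positiveRate(labels: PatchLabel[]): number {
  if (labels.length === 0) return 0;
  return labels.filter((l) => l.hasBuilding).length / labels.length;
}
```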
<p>Open sourcing Meta’s training data and code allows population mapping partners like CIESIN and WorldPop to continue the progress made in the last decade. These tools reduce the cost for research units to generate even more accurate population estimates, and they allow researchers working on building detection to improve their methods, especially when combined with more recent satellite imagery. Future data released from CIESIN and data collaborations like GRID3 will continue to push the boundaries of spatial resolution and accuracy, the result of their work with many African countries to generate, validate, and use core spatial datasets in support of sustainable development.</p>
<blockquote class="blockquote">
<p><em>To better visualize village settlement locations and calculate service coverage, World Vision turned to an innovative dataset developed by Meta’s Data for Good (D4G) and Columbia University’s Center for International Earth Science Information Network (CIESIN). The resulting High Resolution Settlement Layer (HRSL) has been a game-changer for visualizing the geography of clean water.</em><br /> <em>– Allen Hollenbach, Technical Director for World Vision Water and Sanitation</em></p>
</blockquote>
<h2>Applications in sustainable electrification, clean water, and climate change adaptation</h2>
<p>Nonprofit organizations and governments around the world have already leveraged Meta’s AI-powered population maps for a range of social impact programs, including <a href="https://dataforgood.facebook.com/dfg/resources/world-bank-global-electrification-platform-case-study" target="_blank" rel="noopener">the World Bank’s</a> rural electrification efforts in Somalia and Benin and similar efforts in Uganda by the <a href="https://www.wri.org/update/using-metas-relative-wealth-index-and-high-resolution-population-density-data-help-expand" target="_blank" rel="noopener">World Resources Institute</a>.  </p>
<p><a href="https://storymaps.arcgis.com/stories/a73563c0d11b433fa35e0bd10a546087" target="_blank" rel="noopener">World Vision</a> has also used these datasets to accelerate progress on five-year plans for water and sanitation in places like Rwanda and Zambia, and recently announced <a href="https://storymaps.arcgis.com/stories/50e5063b79374c3d924d662ba6f2e863" target="_blank" rel="noopener">having reached one million additional Rwandans with clean water</a>, using insights from these maps to track progress toward universal water coverage.</p>
<figure id="attachment_21796" aria-describedby="caption-attachment-21796" class="wp-caption aligncenter c2"><img class="size-large wp-image-21796" src="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png?w=1024" alt="" width="1024" height="683" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png 1999w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png?resize=916,611 916w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png?resize=768,512 768w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png?resize=1024,683 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png?resize=1536,1024 1536w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png?resize=96,64 96w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-open-source-population-maps-2.png?resize=192,128 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-21796" class="wp-caption-text">World Vision used Meta’s high resolution population maps to identify the population and associated settlements closest to existing water points and target areas where new water points were needed.</figcaption></figure><p>Innovation in global population mapping is only possible through the type of collaboration Meta continues to have with Columbia University and WorldPop. A shared commitment to open source enables researchers and governments around the world to participate in this process.</p>
<p>Please visit the <a href="https://dataforgood.facebook.com/" target="_blank" rel="noopener">Data for Good</a> website for more information about Meta’s Data for Good program. And please visit this blog for more <a href="https://about.fb.com/news/2020/06/privacy-matters-data-for-good/" target="_blank" rel="noopener">information about how we protect user privacy in our tools.</a></p>]]></description>
      <link>https://engineering.fb.com/2024/10/03/ml-applications/open-source-ai-population-maps-meta/</link>
      <guid>https://engineering.fb.com/2024/10/03/ml-applications/open-source-ai-population-maps-meta/</guid>
      <pubDate>Thu, 03 Oct 2024 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[React at Meta Connect 2024]]></title>
<description><![CDATA[<p>At Meta, <a href="https://react.dev/">React</a> and <a href="https://reactnative.dev/">React Native</a> are more than just tools; they are integral to our product development and innovation. With over five thousand people at Meta building products and experiences with React every month, these technologies are fundamental to our engineering culture and our ability to quickly build and ship high-quality products. In this post, we will dive into the development experiences of some of the product teams who leveraged React and React Native to deliver exciting projects showcased at Meta Connect 2024.</p>
<h2>Instagram and Facebook For Meta Quest</h2>
<div class="wp-video c1"><a href="https://engineering.fb.com/wp-content/uploads/2024/10/RNBlogDemo-compressed.mp4">https://engineering.fb.com/wp-content/uploads/2024/10/RNBlogDemo-compressed.mp4</a></div>
<p>At Connect, Mark Zuckerberg shared that we have re-built Instagram and Facebook for mixed reality (MR) on Meta Quest. Our goal was to bring our flagship social experiences to the Meta Quest headset, letting people catch up with their friends and watch Stories and Reels, all while showcasing new possibilities enabled only through MR. </p>
<p>Building Meta’s social apps from scratch in MR required our teams to thoughtfully leverage the platform capabilities offered by Meta Quest while keeping a tremendously high bar for quality. The teams first had to decide how to build them: reusing the existing Android apps, writing a new native Android app, or using React Native to build from scratch. We wanted to offer a hero experience that looked and felt at home on Meta Quest, taking advantage of the additional input types, gestures, and larger visual surface area. Instead of simply porting our mobile social apps, we chose React Native as it enabled our teams to iterate and build quickly with robust animation capabilities, great performance, and a shared platform that powers most of the 2D Meta Quest system apps.</p>
<p>On Instagram, React Native enabled our teams to build rich animations and novel interactions that embody the brand’s deep focus on quality and delight. For this new app, we introduced seamless transitions of video posts from feed into a full-screen view side by side with comments, without dropping a single frame. We enabled the ability to swipe through stacks of photos with the controller joystick or by pinching your hands. We also introduced a unique hover animation over interactive elements that smoothly follows your controller movements.</p>
<p>When building Facebook for Meta Quest, our teams took advantage of the mature code and infrastructure that supports our <a href="http://facebook.com">Facebook.com desktop experience</a>. We leveraged code sharing technologies to reuse some of the most complex and robust features from Facebook.com like Newsfeed and commenting. Some of these code sharing technologies include our Meta open source projects like <a href="https://stylexjs.com/">StyleX</a> and <a href="https://github.com/facebook/react-strict-dom">React Strict DOM</a>. By sharing code, our teams could spend less time on repetitive business logic and focus more on adding Meta Quest specific interactions and experiences.</p>
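<p>The code-sharing idea can be sketched in miniature (the names below are illustrative, not Meta’s internal APIs): platform-neutral business logic lives in one shared module, and each surface, whether Facebook.com or the Meta Quest app, supplies only its own rendering layer on top.</p>

```typescript
// Illustrative sketch of sharing business logic across surfaces. These
// functions are platform neutral: web and Quest UIs could both import them
// and layer only their own rendering on top.
type FeedComment = { author: string; text: string; likes: number };

// Rank comments the same way on every platform.
function rankComments(comments: FeedComment[]): FeedComment[] {
  return [...comments].sort((a, b) => b.likes - a.likes);
}

// Format like counts identically everywhere ("1.5K" instead of "1500").
function formatLikeCount(likes: number): string {
  return likes >= 1000 ? `${(likes / 1000).toFixed(1)}K` : String(likes);
}
```

Keeping logic like this in one module means a fix or experiment lands on every surface at once, which is the "less time on repetitive business logic" payoff described above.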
<h2>Meta Horizon mobile app</h2>
<p><img class="aligncenter size-large wp-image-21782" src="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?w=1024" alt="" width="1024" height="578" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg 1999w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?resize=916,517 916w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?resize=768,434 768w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?resize=1024,578 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?resize=1536,868 1536w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-React-Connect-2024-compressed.jpg?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>This year, we also <a href="https://www.meta.com/blog/quest/horizon-mobile-app/">rolled out the new Meta Horizon mobile app</a> – a new look and a new name. We expanded the app to make it easier to socialize and express yourself both in and out of the headset. We added a dedicated tab to easily customize your avatar and express your mood, right from your phone. People can also visit Horizon Worlds and complete quests from the app to unlock exclusive avatar styles, items, and emotes.</p>
<p>We’ve also continued to improve app performance. At Meta, our teams typically look to Facebook Marketplace as a React Native performance benchmark. However, the Meta Horizon app is a standalone app with React Native in the initialization path of the app’s cold start, unlike the Facebook app, which initializes React Native when you visit your first React Native surface rather than on app start. The performance results our teams delivered with React Native exceeded our original expectations and are on par with Meta’s mobile social apps.</p>
<p>Our Meta Horizon team worked closely with our React team to profile our application and find opportunities for improvement using Android Systrace, React DevTools, and the new <a href="https://www.youtube.com/live/b48Lax2-jOQ?si=OgqKzyw-AAnIUefZ&amp;t=4290">React Native DevTools</a>. The most impactful improvement that our teams made was initiating network queries earlier. Instead of initiating network requests when a component of the product surface was rendered, our teams moved that network fetch to start when the navigation button from the previous surface was clicked.</p>
<h2>Meta Horizon Store</h2>
<p><img class="aligncenter size-large wp-image-21783" src="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-Store-Quest-React-Connect-2024-crop-compressed.jpg?w=1024" alt="" width="1024" height="688" srcset="https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-Store-Quest-React-Connect-2024-crop-compressed.jpg 1536w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-Store-Quest-React-Connect-2024-crop-compressed.jpg?resize=916,615 916w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-Store-Quest-React-Connect-2024-crop-compressed.jpg?resize=768,516 768w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-Store-Quest-React-Connect-2024-crop-compressed.jpg?resize=1024,688 1024w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-Store-Quest-React-Connect-2024-crop-compressed.jpg?resize=96,65 96w, https://engineering.fb.com/wp-content/uploads/2024/10/Meta-Horizon-Store-Quest-React-Connect-2024-crop-compressed.jpg?resize=192,129 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p>
<p>We also announced that the Meta Horizon Store is now open for all developers to publish apps, <a href="https://developers.meta.com/horizon/blog/building-2d-apps-on-the-meta-horizon-store" target="_blank" rel="noopener">including 2D apps</a>. To support this change, we made major changes to the Horizon Store: changes to our navigation to support significantly more categories, better ranking and categorization of apps, and a new “Early Access” section.</p>
<p>The Meta Horizon Store includes the surfaces that let you discover and acquire applications and games for Meta Quest, as well as explore Worlds you can travel to in Horizon. Since we have a centralized team that maintains the Store across four platforms (Android, iOS, Horizon OS, Web) and we need feature parity across these interfaces, the team has benefited tremendously from being able to use React and React Native, even though these are primarily separate implementations today. These technologies have enabled a relatively small team to roll out new features and experiments much faster.</p>
<p>Just like the new Instagram and Facebook apps, and everything else using React at Meta, our teams use the bleeding edge of React infra like the React Compiler and the New React Native Architecture. The React team partnered with multiple teams over the last few years to build out infrastructure and capabilities to enable cross-platform code sharing, which the Meta Horizon Store team has started to take advantage of. For example, the Meta Horizon Store’s navigation and routing infrastructure was originally quite different between platforms. The team is now reusing Meta’s internal router for React apps that was <a href="https://www.youtube.com/watch?v=KT3XKDBZW7M" target="_blank" rel="noopener">originally built for Facebook.com</a>, which now also works with React Native. We also converted the Meta Horizon Store on the web from using pure CSS to using <a href="https://stylexjs.com/" target="_blank" rel="noopener">StyleX</a>, which, in combination with <a href="https://github.com/facebook/react-strict-dom" target="_blank" rel="noopener">React Strict DOM</a>, has enabled them to reuse the Spotlight section of the Meta Horizon Store across web and mixed reality. This enabled us to more quickly support internationalized text rendering and light/dark mode for banners, and accelerated future enhancements for our merchandising team.</p>
<h2>Meta Spatial Editor</h2>
<div class="wp-video c1"><a href="https://engineering.fb.com/wp-content/uploads/2024/10/spatial-compressed.mp4">https://engineering.fb.com/wp-content/uploads/2024/10/spatial-compressed.mp4</a></div>
<p>We announced the <a href="https://developers.meta.com/horizon/develop/spatial-sdk">Meta Spatial SDK</a> and Meta Spatial Editor to enable mobile developers to create immersive experiences for Meta Horizon OS using familiar Android languages, libraries, and tools, along with unique Meta Quest capabilities, such as physics, MR, and 3D. Creating great 3D experiences always requires being able to visualize and edit your scenes directly. The Meta Spatial Editor is a new desktop app that lets you import, organize, and transform your assets into visual compositions and export them, using the glTF standard, into Meta Spatial SDK.</p>
<p>Our teams built the app with <a href="https://microsoft.github.io/react-native-windows/">React Native for Desktop</a>, providing users with native Windows and macOS apps and providing our teams with the incredible developer experience of React. One of the key factors in the teams’ decision to use React Native for Desktop instead of other web-based desktop solutions is that React Native enables the team to utilize native integrations when needed. The main 3D scene in the app is powered by a custom 3D rendering engine, requiring a custom React Native Native Component integration. The React Native panels on the scene let users modify all sorts of properties, which the panels communicate to the 3D renderer via C++, enabling us to update the UI at 60fps.</p>
<p>The Meta Spatial Editor team had many engineers who primarily had a C++ background and were used to building with Qt. These team members were initially skeptical of JavaScript but ended up loving the developer experience provided by React Native, such as Fast Refresh. Web developers take for granted that code changes can be seen on file-save, but it is still extremely uncommon for native engineers. This developer experience enabled our teams to build much more quickly with React Native.</p>
<h2>This is how Meta builds React</h2>
<p>Over a decade ago, Meta introduced React to the industry through open source. Our React team at Meta is so proud of these experiences that were announced at Meta Connect 2024. These products showcase the power, expressivity, and flexibility of what’s possible with React: delightful interactions, deeply complex integrations, and incredibly responsive interfaces. And of course, they all render natively on their respective platforms to match user expectations.</p>
<p>Over the past decade, the React team has partnered deeply with both teams at Meta and members of the open source community to enable these types of product and developer experiences. Engineers at Meta use React on every platform where we ship user interfaces: web, mobile, desktop, and new platforms such as MR. Each time the React team has added support for a new platform, the team has invested in deeply understanding the idioms and expectations for user experiences on that platform, then adapting and optimizing React accordingly. We’ve consistently found that improving React for one platform benefits others as well — an approach the React teams described in their <a href="https://reactnative.dev/blog/2021/08/26/many-platform-vision">Many Platform Vision</a>.</p>
<p>This pattern has continued as the teams expanded support to the constraints and opportunities of mixed reality devices. Our teams have improved startup and application responsiveness, improved efficiency to reduce battery drain, and taken major steps to enable code sharing across web and native platforms — with platform-specific customizations. These wins have consistently benefited our apps on other platforms, with user experience improvements in products such as Facebook.com and Facebook Marketplace. </p>
<p>Our engineers invest in these improvements knowing that they will benefit not only products created by Meta, but all React products in the world. Meta continues to share these improvements with the open source community once we are confident that they are stable enough for broader adoption. We’ve previously shared some of these improvements with the open source community, including <a href="https://youtu.be/lyEKhv8-3n0?si=sg-gbtEMtUCxFqOs&amp;t=2269">React Compiler</a>, <a href="https://react.dev/blog/2024/04/25/react-19">React 19</a>, React Native’s <a href="https://youtu.be/Q5SMmKb7qVI?si=i5K0pUmYCYOeBbKu&amp;t=766">New Architecture</a>, <a href="https://stylexjs.com/">StyleX</a>, <a href="https://github.com/facebook/react-strict-dom">React Strict DOM</a>, and performance improvements <a href="https://www.youtube.com/watch?v=rElD4RaR3gk" target="_blank" rel="noopener">to Hermes</a>. These innovations and more are currently under development, and our teams look forward to sharing them with the open source community in the future!</p>
<p><small><em>Stranger Things ™/© Netflix. Used with permission.</em></small></p>]]></description>
      <link>https://engineering.fb.com/2024/10/02/android/react-at-meta-connect-2024/</link>
      <guid>https://engineering.fb.com/2024/10/02/android/react-at-meta-connect-2024/</guid>
      <pubDate>Wed, 02 Oct 2024 18:00:00 +0200</pubDate>
    </item>
    <item>
      <title><![CDATA[Inside Bento: Jupyter Notebooks at Meta]]></title>
      <description><![CDATA[<p>This episode of the Meta Tech Podcast is all about <a href="https://developers.facebook.com/blog/post/2021/09/20/eli5-bento-interactive-notebook-empowers-development-collaboration-best-practices/" target="_blank" rel="noopener">Bento</a>, Meta’s internal distribution of Jupyter Notebooks, an open-source web-based computing platform. Bento allows our engineers to mix code, text, and multimedia in a single document and serves a wide range of use cases at Meta from prototyping to complex machine learning workflows.</p>
<p>Pascal Hartig (<a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">@passy</a>) is joined by Steve, whose team has built several features on top of Jupyter, including <a href="https://engineering.fb.com/2023/08/29/security/scheduling-jupyter-notebooks-meta/" target="_blank" rel="noopener">scheduled notebooks</a>, sharing with colleagues, and <a href="https://engineering.fb.com/2024/06/10/data-infrastructure/serverless-jupyter-notebooks-bento-meta/" target="_blank" rel="noopener">running notebooks without a remote server component</a> by leveraging WebAssembly in the browser.</p>
<p>Download or listen to the podcast episode below:</p>
<p><iframe class="c1" title="Libsyn Player" src="https://html5-player.libsyn.com/embed/episode/id/32811392/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen">[embedded content]</iframe></p>
<p>You can also find the episode wherever you get your podcasts, including:</p>
<ul><li><a href="https://open.spotify.com/episode/0RvTSFzjAlqJzW9tuJwokl" target="_blank" rel="noopener">Spotify</a></li>
<li><a href="https://podcasts.apple.com/us/podcast/inside-bento-serverless-jupyter-notebooks-at-meta/id1370910331?i=1000667487405" target="_blank" rel="noopener">Apple Podcasts</a></li>
<li><a href="https://pca.st/7vbp2djc" target="_blank" rel="noopener">PocketCasts</a></li>
<li><a href="https://overcast.fm/itunes1370910331" target="_blank" rel="noopener">Overcast</a></li>
</ul><p>The <a href="https://insidefacebookmobile.libsyn.com/">Meta Tech Podcast</a> is a podcast brought to you by Meta where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p>
<p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p>
<p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p>]]></description>
      <link>https://engineering.fb.com/2024/09/17/data-infrastructure/inside-bento-jupyter-notebooks-at-meta/</link>
      <guid>https://engineering.fb.com/2024/09/17/data-infrastructure/inside-bento-jupyter-notebooks-at-meta/</guid>
      <pubDate>Tue, 17 Sep 2024 19:53:00 +0200</pubDate>
    </item>
  </channel>
</rss>
